# Data Preprocessing
- basically a short pandas tutorial

# Reading the Dataset

Comma-separated values (CSV) files are a common format for storing tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several comma-separated fields, e.g., “Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics”.
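
To make the record/field structure concrete, here is a minimal sketch that splits the example record above into its fields using Python's standard `csv` module. The variable names (`record`, `fields`) are only illustrative and are not used elsewhere in this notebook.

```python
import csv
import io

# the example record from the text above, as a single CSV line
record = "Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics"

# csv.reader yields one list of fields per line of input
fields = next(csv.reader(io.StringIO(record)))

for i, field in enumerate(fields):
    print(i, field)
```

Each comma ends one field and starts the next, so this single line yields five fields.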

In [10]:
# we create a little CSV dataset
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [11]:
# now we can load the data into a pandas dataframe
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


In [12]:
# here we split data into inputs and targets
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

# we can use get_dummies with dummy_na=True to treat a missing value as its own category
# this will add a new column for each possible value of the categorical column
# we had only Slate and NaN for RoofType, so we will have two new columns
# with only two categories, RoofType_nan is redundant, so we can probably drop it

inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)


   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True
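
To see what `dummy_na=True` changes, the following sketch encodes a small `RoofType` column with and without the flag. The one-column frame `roof` is a hypothetical stand-in built only for this comparison; it mirrors the values in `house_tiny.csv` but is not read from the file.

```python
import pandas as pd

# hypothetical stand-in for the RoofType column (NaN marks a missing value)
roof = pd.DataFrame({"RoofType": [float("nan"), float("nan"), "Slate", float("nan")]})

# without dummy_na, missing values get no column of their own
without_na = pd.get_dummies(roof)

# with dummy_na=True, an extra indicator column marks the missing entries
with_na = pd.get_dummies(roof, dummy_na=True)

print(list(without_na.columns))
print(list(with_na.columns))
```

Without the flag, a row with a missing roof type is encoded as all zeros; with it, the missingness is recorded explicitly in one additional column.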


In [13]:
# now we can fill the missing values with the mean of the column
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True
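
The value 3.0 comes from averaging only the observed entries: by default `mean()` skips missing values, so the mean of `NumRooms` is (2 + 4) / 2 = 3. A minimal sketch on a standalone series (`rooms` is an illustrative name, not a variable from the cells above):

```python
import pandas as pd

# same values as the NumRooms column, rebuilt here for illustration
rooms = pd.Series([float("nan"), 2.0, 4.0, float("nan")], name="NumRooms")

# mean() ignores NaN by default (skipna=True), so only 2.0 and 4.0 are averaged
col_mean = rooms.mean()

# replace every missing entry with that mean
filled = rooms.fillna(col_mean)

print(col_mean)
print(filled.tolist())
```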


In [14]:
# now we can convert the inputs and targets to tensors 
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
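
Note that the tensors above are `float64`, because that is the dtype of the NumPy array they were built from. Many PyTorch layers use `float32` parameters by default, so a conversion may be needed later. The sketch below shows one way to request `float32` directly; the small array `arr` is made up for illustration, and whether you need this depends on the model you feed the data to.

```python
import numpy as np
import torch

# a small float64 array standing in for inputs.to_numpy(dtype=float)
arr = np.array([[3.0, 0.0, 1.0],
                [2.0, 0.0, 1.0]])

# torch.tensor keeps the NumPy dtype unless told otherwise
X64 = torch.tensor(arr)

# passing dtype explicitly gives a float32 tensor instead
X32 = torch.tensor(arr, dtype=torch.float32)

print(X64.dtype, X32.dtype)
```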

In [15]:
# let's drop the column RoofType_nan
inputs = inputs.drop(columns='RoofType_nan')
X_modded = torch.tensor(inputs.to_numpy(dtype=float))
X_modded

tensor([[3., 0.],
        [2., 0.],
        [4., 1.],
        [3., 0.]], dtype=torch.float64)