# Data Preprocessing
To apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the **pandas** library can do much of the heavy lifting.

## Reading the Dataset
A **.csv** file stores comma-separated values. We first write a tiny dataset to such a file, then load it back with pandas.

In [5]:
import torch
import os


os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [6]:
import pandas


data = pandas.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
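
The NaN entries above appear because pandas treats the string NA as a missing value by default when parsing a CSV. As a small sketch (reading the same text from an in-memory string rather than from a file, which is an assumption made here to keep the example self-contained), the missing entries can be counted per column:

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, kept in memory for illustration
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing cells; sum() counts them column by column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

A count like this is a quick way to decide which columns need imputation or deletion later on.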


## Data Preparation
In supervised learning, we train models to predict a designated target value, given some set of input values. Our first step in processing the dataset is to **separate out columns corresponding to input versus target values**. We can select columns either by name or via integer-location based indexing (iloc).

In [9]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs, targets

(   NumRooms RoofType
 0       NaN      NaN
 1       2.0      NaN
 2       4.0    Slate
 3       NaN      NaN,
 0    127500
 1    106000
 2    178100
 3    140000
 Name: Price, dtype: int64)
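
The split above uses integer positions. A minimal sketch of the by-name alternative follows; the frame is rebuilt by hand here so the snippet runs on its own, and the variable names are illustrative only:

```python
import pandas as pd

# Toy frame mirroring the house_tiny.csv example
data = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
    "Price": [127500, 106000, 178100, 140000],
})

# Select by column name
inputs_by_name = data[["NumRooms", "RoofType"]]
targets_by_name = data["Price"]

# Select by integer position, as in the cell above
inputs_by_pos = data.iloc[:, 0:2]
targets_by_pos = data.iloc[:, 2]

# equals() treats NaNs in the same location as equal
same_inputs = inputs_by_name.equals(inputs_by_pos)
same_targets = targets_by_name.equals(targets_by_pos)
print(same_inputs, same_targets)
```

Selecting by name tends to be more robust when the column order of the file may change, while positional selection is convenient when the target is simply the last column.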

Depending upon the context, missing values might be handled either via **imputation or deletion**. Imputation replaces missing values with estimates of their values while deletion simply discards either those rows or those columns that contain missing values.
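
To make the trade-off concrete, here is a rough sketch of both strategies on a hand-built copy of the input columns (the frame and variable names are assumptions for illustration):

```python
import pandas as pd

inputs = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "RoofType": [None, None, "Slate", None],
})

# Deletion by row: keep only rows without any missing entry
rows_kept = inputs.dropna(axis=0)

# Deletion by column: keep only columns without any missing entry
cols_kept = inputs.dropna(axis=1)

# Imputation: fill the numeric gaps with the column mean instead
imputed = inputs.fillna({"NumRooms": inputs["NumRooms"].mean()})

print(rows_kept.shape, cols_kept.shape)
print(imputed)
```

On a dataset this small, deletion discards almost everything: only one row has no missing entry, and no column is complete. That is one reason imputation is often preferred when data is scarce.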

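For categorical input fields, we can treat NaN as a category of its own. Since RoofType takes only the values Slate and NaN, pandas can convert it into two indicator columns, RoofType_Slate and RoofType_nan.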

In [10]:
inputs = pandas.get_dummies(inputs, dummy_na=True)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


For missing numerical values, one common heuristic is to replace the NaN entries with the mean value of the corresponding column.

In [11]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0           False          True
1       2.0           False          True
2       4.0            True         False
3       3.0           False          True


## Conversion to Tensor Format

In [12]:
X = torch.tensor(inputs.to_numpy(dtype=float))
Y = torch.tensor(targets.to_numpy(dtype=float))
X, Y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
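
The tensors above are float64 because to_numpy(dtype=float) produces 64-bit floats, whereas PyTorch's default floating-point type is float32. If the tensors are meant to feed a model with float32 parameters, an explicit cast may be needed. A minimal sketch, using a made-up array rather than the notebook's variables:

```python
import numpy as np
import torch

# NumPy arrays of Python floats are float64 by default
values = np.array([[3.0, 0.0, 1.0], [2.0, 0.0, 1.0]])

X64 = torch.tensor(values)                       # keeps NumPy's float64
X32 = torch.tensor(values, dtype=torch.float32)  # explicit cast to float32

print(X64.dtype, X32.dtype)
```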

## Exercises

In [14]:
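from ucimlrepo import fetch_ucirepo

# Fetch the Abalone dataset (UCI repository id 1)
abalone = fetch_ucirepo(id=1)

# Features and targets as pandas DataFrames
X = abalone.data.features
y = abalone.data.targets

# Dataset-level metadata, then per-variable information
print(abalone.metadata)
print(abalone.variables)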

{'uci_id': 1, 'name': 'Abalone', 'repository_url': 'https://archive.ics.uci.edu/dataset/1/abalone', 'data_url': 'https://archive.ics.uci.edu/static/public/1/data.csv', 'abstract': 'Predict the age of abalone from physical measurements', 'area': 'Biology', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Tabular'], 'num_instances': 4177, 'num_features': 8, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['Rings'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C55C7W', 'creators': ['Warwick Nash', 'Tracy Sellers', 'Simon Talbot', 'Andrew Cawthorn', 'Wes Ford'], 'intro_paper': None, 'additional_info': {'summary': 'Predicting the age of abalone from physical measurements.  The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- 

In [15]:
X, y

(     Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  \
 0      M   0.455     0.365   0.095        0.5140          0.2245   
 1      M   0.350     0.265   0.090        0.2255          0.0995   
 2      F   0.530     0.420   0.135        0.6770          0.2565   
 3      M   0.440     0.365   0.125        0.5160          0.2155   
 4      I   0.330     0.255   0.080        0.2050          0.0895   
 ...   ..     ...       ...     ...           ...             ...   
 4172   F   0.565     0.450   0.165        0.8870          0.3700   
 4173   M   0.590     0.440   0.135        0.9660          0.4390   
 4174   M   0.600     0.475   0.205        1.1760          0.5255   
 4175   F   0.625     0.485   0.150        1.0945          0.5310   
 4176   M   0.710     0.555   0.195        1.9485          0.9455   
 
       Viscera_weight  Shell_weight  
 0             0.1010        0.1500  
 1             0.0485        0.0700  
 2             0.1415        0.2100  
 3             0.1

In [25]:
X.iloc[5:8]['Sex']  # .iloc slices exclude the end position; .loc slices include the end label

5    I
6    F
7    F
Name: Sex, dtype: object
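
The difference between the two indexers can be checked on a small stand-in column. The values below are made up for illustration and are not the Abalone data:

```python
import pandas as pd

# Hypothetical column with the default integer index 0..9
df = pd.DataFrame({"Sex": list("MMFMIIFFMF")})

by_position = df.iloc[5:8]["Sex"]  # positions 5, 6, 7 (end excluded)
by_label = df.loc[5:8, "Sex"]      # labels 5, 6, 7, 8 (end included)

print(len(by_position), len(by_label))
```

With the default integer index, positions and labels coincide, so the only visible difference is the extra row that .loc returns at the end of the slice.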