# Getting Data Ready for ML

There are three main things we have to do:
  1. Split the data into train and test sets
  2. Convert non-numerical values to numerical values (also called feature encoding)
  3. Fill (also called imputing) or discard missing values

<b>Clean data --> Transform data --> Reduce data</b>

In [207]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [208]:
car_sales = pd.read_csv('data/car-sales-extended-missing-data.csv')
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


## Removing rows that don't have labels (Target Variable)

In [209]:
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [210]:
# Removing all the rows where the Price is missing
# Training on those rows doesn't make sense when we don't have the target variable value
car_sales.dropna(subset=['Price'], inplace=True)

In [211]:
car_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

## Splitting Data

<b style="color: orange;">Always split your data first (into train/test)</b> so that the training and test sets stay separate. If you fill missing data before splitting, information from the test set can leak into the training set (data leakage).
- For example, if you fill a numeric column with its mean before splitting, test-set values contribute to the mean used to fill the training set, and training-set values contribute to the mean used to fill the test set.
- Similarly, if you fill a categorical column with its most frequent value before splitting, the test set helps decide which value that is, which is again data leakage.

NOTE: Any step that does not depend on statistics computed from the data can safely be done before splitting. For example, above I removed the rows with a missing Price; this won't lead to data leakage. However, removing rows whose Odometer value is greater than the mean before splitting is data leakage, because the test data would be used to calculate the mean that decides which rows are removed.
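To make the leakage concrete, here is a minimal sketch on made-up odometer readings. The values and the names `odometer`, `leaky_mean`, and `train_mean` are illustrative assumptions, not part of the car sales dataset. It compares the mean computed over all rows with the mean computed over the training rows only.

```python
import numpy as np
import pandas as pd

# Toy odometer readings (made-up values, not from the car sales dataset)
odometer = pd.Series(
    [10_000.0, 20_000.0, np.nan, 40_000.0, 250_000.0], name="Odometer (KM)"
)

# Pretend the last row is the test set and the first four rows are the training set
train, test = odometer.iloc[:4], odometer.iloc[4:]

# Leaky: the mean is computed over every row, including the test row
leaky_mean = odometer.mean()

# Safe: the mean is computed from the training rows only
train_mean = train.mean()

train_filled_leaky = train.fillna(leaky_mean)
train_filled_safe = train.fillna(train_mean)

print(f"{leaky_mean:.2f}")  # 80000.00
print(f"{train_mean:.2f}")  # 23333.33
```

In this toy case the single test row drags the fill value from about 23,333 to 80,000, so the training set would carry information about a row the model is never supposed to see.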

In [212]:
X = car_sales.drop(['Price'], axis=1)
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431.0,4.0
1,BMW,Blue,192714.0,5.0
2,Honda,White,84714.0,4.0
3,Toyota,White,154365.0,4.0
4,Nissan,Blue,181577.0,3.0


In [213]:
y = car_sales['Price']
y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

X_train and y_train make up the training data.

X_test and y_test make up the test data.

In [214]:
# Splitting into train and test set
from sklearn.model_selection import train_test_split

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
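Seeding NumPy works here because `train_test_split` falls back on NumPy's global random state when `random_state` is not given. A hedged alternative is to pass `random_state` directly, which keeps the split reproducible even if other code consumes random numbers first. The sketch below uses a made-up frame (`X_demo`, `y_demo` are hypothetical names), so its rows are not those of the split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for X and y (values are made up)
X_demo = pd.DataFrame({"feature": np.arange(10)})
y_demo = pd.Series(np.arange(10) * 2, name="target")

# The same random_state gives the same split on every call
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

same_split = X_test_a.index.equals(X_test_b.index)
print(len(X_train_a), len(X_test_a), same_split)  # 8 2 True
```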

## Dealing with Missing Values
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

NOTE: Remember to fill the train and test data separately, so that values from one set are never used to fill the other.
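One common variant, sketched here as an assumption rather than as the approach this notebook takes, is to learn the fill values from the training set only and reuse them on the test set with scikit-learn's `SimpleImputer`. The frames `train_demo` and `test_demo` are made up for illustration; the fill choices ('missing', the mean, and 4 doors) mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy train/test frames with the same columns as the car sales data (values made up)
train_demo = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW"],
    "Colour": ["White", "Blue", np.nan],
    "Odometer (KM)": [10_000.0, np.nan, 30_000.0],
    "Doors": [4.0, np.nan, 5.0],
})
test_demo = pd.DataFrame({
    "Make": ["Toyota", np.nan],
    "Colour": ["Red", "White"],
    "Odometer (KM)": [np.nan, 50_000.0],
    "Doors": [np.nan, 4.0],
})

imputer = ColumnTransformer([
    ("cat", SimpleImputer(strategy="constant", fill_value="missing"), ["Make", "Colour"]),
    ("odo", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
    ("door", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
])

# Output columns follow the order of the transformers listed above
columns = ["Make", "Colour", "Odometer (KM)", "Doors"]

# fit_transform learns the fill values from the training rows only
train_filled = pd.DataFrame(imputer.fit_transform(train_demo), columns=columns)

# transform reuses those training fill values on the test rows
test_filled = pd.DataFrame(imputer.transform(test_demo), columns=columns)
```

With this setup the missing test odometer is filled with 20,000 (the training mean), not with anything computed from the test rows.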

### Training Set

In [215]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 760 entries, 986 to 102
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           725 non-null    object 
 1   Colour         722 non-null    object 
 2   Odometer (KM)  724 non-null    float64
 3   Doors          722 non-null    float64
dtypes: float64(2), object(2)
memory usage: 29.7+ KB


In [216]:
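# Assign the result back instead of calling fillna(..., inplace=True) on a column:
# newer pandas versions warn about that pattern and may not update the DataFrame

# Filling the "Make" Column
X_train['Make'] = X_train['Make'].fillna('missing')

# Filling the "Colour" Column
X_train['Colour'] = X_train['Colour'].fillna('missing')

# Filling the "Odometer (KM)" Column with the training mean
X_train['Odometer (KM)'] = X_train['Odometer (KM)'].fillna(X_train['Odometer (KM)'].mean())

# Filling the "Doors" Column
X_train['Doors'] = X_train['Doors'].fillna(4)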

In [217]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 760 entries, 986 to 102
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           760 non-null    object 
 1   Colour         760 non-null    object 
 2   Odometer (KM)  760 non-null    float64
 3   Doors          760 non-null    float64
dtypes: float64(2), object(2)
memory usage: 29.7+ KB


### Test Set

In [218]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 203 to 305
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           178 non-null    object 
 1   Colour         182 non-null    object 
 2   Odometer (KM)  178 non-null    float64
 3   Doors          181 non-null    float64
dtypes: float64(2), object(2)
memory usage: 7.4+ KB


In [219]:
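# Assign the result back instead of calling fillna(..., inplace=True) on a column:
# newer pandas versions warn about that pattern and may not update the DataFrame

# Filling the "Make" Column
X_test['Make'] = X_test['Make'].fillna('missing')

# Filling the "Colour" Column
X_test['Colour'] = X_test['Colour'].fillna('missing')

# Filling the "Odometer (KM)" Column with the test mean
X_test['Odometer (KM)'] = X_test['Odometer (KM)'].fillna(X_test['Odometer (KM)'].mean())

# Filling the "Doors" Column
X_test['Doors'] = X_test['Doors'].fillna(4)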

In [220]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 203 to 305
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           190 non-null    object 
 1   Colour         190 non-null    object 
 2   Odometer (KM)  190 non-null    float64
 3   Doors          190 non-null    float64
dtypes: float64(2), object(2)
memory usage: 7.4+ KB


## Converting Categorical Data to Numerical

### Training Dataset

In [221]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Doors holds numeric data, but it is still a categorical feature
# The 3, 4, and 5 door values are effectively categories of cars
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot', one_hot, categorical_features)], remainder='passthrough')

transformed_X = transformer.fit_transform(X_train)
transformed_X

<760x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3040 stored elements in Compressed Sparse Row format>

### Test Dataset

In [223]:
# Reuse the transformer fitted on the training data instead of fitting a new one,
# so the test columns match the columns the model is trained on
transformed_X2 = transformer.transform(X_test)
transformed_X2

<190x15 sparse matrix of type '<class 'numpy.float64'>'
	with 760 stored elements in Compressed Sparse Row format>
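The encoded training and test matrices are only comparable if they have the same columns in the same order, which is why an encoder is normally fitted on the training data and then only applied to the test data. A minimal sketch of what that looks like when a category appears only at test time, using made-up colours (`train_demo` and `test_demo` are hypothetical) and `handle_unknown="ignore"`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: "Green" appears only in the test set (values are made up)
train_demo = pd.DataFrame({"Colour": ["White", "Blue", "White"]})
test_demo = pd.DataFrame({"Colour": ["Blue", "Green"]})

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")

# Categories are learned from the training rows only
train_encoded = encoder.fit_transform(train_demo).toarray()

# The test rows are mapped onto the columns learned from training
test_encoded = encoder.transform(test_demo).toarray()

print(encoder.categories_[0].tolist())  # ['Blue', 'White']
print(test_encoded.tolist())            # [[1.0, 0.0], [0.0, 0.0]]
```

The unseen "Green" row becomes all zeros, and the test matrix keeps the two columns learned from the training data.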

# Fitting Model

In [225]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(transformed_X, y_train)

In [226]:
model.score(transformed_X2, y_test)

0.2580656673601974
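For a regressor, `score` returns the coefficient of determination (R²) on the data passed in. The filling, encoding, and fitting steps can also be chained in a scikit-learn `Pipeline`, so that every statistic is learned from the training rows and merely applied to the test rows. The sketch below runs on synthetic data generated in the block itself (the `demo` frame and its price formula are assumptions for illustration), so its score is unrelated to the one above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales data (random values, not the real CSV)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n),
    "Colour": rng.choice(["White", "Blue", "Red"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 250_000, size=n),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
})
demo["Price"] = 30_000 - 0.05 * demo["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out some feature values to mimic missing data
demo.loc[demo.index[::10], "Odometer (KM)"] = np.nan
demo.loc[demo.index[5::10], "Make"] = np.nan

X_demo = demo.drop(columns=["Price"])
y_demo = demo["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fill, then one-hot encode, the text columns
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Doors is numeric but treated as a category, as in the notebook
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["Make", "Colour"]),
    ("door", doors, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

model_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=42)),
])

# Fill values and categories are learned from X_tr only
model_pipeline.fit(X_tr, y_tr)

# R^2 on the held-out rows, preprocessed with the training statistics
r2 = model_pipeline.score(X_te, y_te)
print(r2)
```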