# Scikit-Learn course 2 
## Dealing with `NaN` and `None` values

# I. Getting data ready

## Reminder

- Get data ready : 
    - `df = pd.read_csv()`
    - `y = df.data_y`
    - `X = df.drop("data_y",axis=1)`
    - `dummies_X = pd.get_dummies(X, columns=list_of_the_columns_you_want)` or the scikit-learn equivalent (`OneHotEncoder`)
    - `X_train, X_test, y_train, y_test = train_test_split(dummies_X,y,test_size=0.2)`
        
- Choose model, train and evaluate :
    - `clf = RandomForestClassifier()`
    - `clf.fit(X_train, y_train)`
    - `y_preds = clf.predict(X_test)`
    - `clf.score(X_test, y_test)` or `accuracy_score(y_test, y_preds)`
       
- Improve model : modify hyperparameters
    - `clf.get_params()`
    - e.g. choose the best `n_estimators` by trying several values

- Save and load model :
    - `pickle.dump(clf,open(path,"wb"))`
    - `loaded_model = pickle.load(open(path,"rb"))`
    - `loaded_model.score(X_test, y_test)`
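
The steps in this reminder can be strung together as follows. This is a minimal sketch, not the course's own code: the toy DataFrame, its column names (`colour`, `size`, `data_y`), the fixed `random_state` values and the temporary file path are all assumptions made so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset standing in for pd.read_csv(...)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "colour": rng.choice(["red", "blue", "white"], size=200),
    "size": rng.normal(size=200),
})
df["data_y"] = (df["size"] > 0).astype(int)

# Get data ready
y = df["data_y"]
X = df.drop("data_y", axis=1)
dummies_X = pd.get_dummies(X, columns=["colour"])
X_train, X_test, y_train, y_test = train_test_split(
    dummies_X, y, test_size=0.2, random_state=0
)

# Choose model, train and evaluate
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)
score = clf.score(X_test, y_test)

# Save and load the model (context managers close the files)
path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    loaded_model = pickle.load(f)
loaded_score = loaded_model.score(X_test, y_test)
```

For a classifier, `clf.score` and `accuracy_score` report the same number, and the reloaded model scores the same as the original one.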

![.](images/sklearn-workflow.png)

![.](images/Features-and-Labels-in-a-Dataset-i2tutorials.png)

## Features = Data
## Label = Target

## 0. Standard imports

In [2]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline 

In [3]:
# ---> y (axis=1)
# |
# |
# x (axis=0)
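
The axis convention sketched above can be checked on a tiny DataFrame. This is an illustrative example only; the frame and its column names `a` and `b` are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 runs down the rows: the result has one value per column
col_sums = df.sum(axis=0)

# axis=1 runs across the columns: the result has one value per row
row_sums = df.sum(axis=1)

# With drop, axis=1 means the label "b" is looked up among the columns
without_b = df.drop("b", axis=1)
```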

## 1. What if there were missing values ?
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data
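
The two options can be compared side by side on a small example. This is a sketch on a made-up frame (`toy`); the fill values mirror the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one incomplete row
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW"],
    "Doors": [4.0, np.nan, 5.0],
})

# Option 1: imputation, every row is kept
filled = toy.fillna({"Make": "missing", "Doors": 4})

# Option 2: removal, rows with any missing value are dropped
dropped = toy.dropna()
```

Imputation keeps all three rows, while removal keeps only the two complete ones.
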
### 1.1 Import car sales missing data

In [4]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head(10)

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| 5 | Honda | Red | 42652.0 | 4.0 | 23883.0 |
| 6 | Toyota | Blue | 163453.0 | 4.0 | 8473.0 |
| 7 | Honda | White | NaN | 4.0 | 20306.0 |
| 8 | NaN | White | 130538.0 | 4.0 | 9374.0 |
| 9 | Honda | Blue | 51029.0 | 4.0 | 26683.0 |


In [5]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

### 1.2 Build machine learning model

In [6]:
X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing.Price

In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ("Make","Colour","Doors")
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],
                               remainder="passthrough")

X = transformer.fit_transform(X)
pd.DataFrame(X)

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
1,"(0, 0)\t1.0\n (0, 6)\t1.0\n (0, 13)\t1.0\n..."
2,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
3,"(0, 3)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
4,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 11)\t1.0\n..."
...,...
995,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 12)\t1.0\n..."
996,"(0, 4)\t1.0\n (0, 9)\t1.0\n (0, 11)\t1.0\n..."
997,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 12)\t1.0\n..."
998,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."


In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

from sklearn.ensemble import RandomForestRegressor
try:
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
except ValueError:
    print("error")

# Fails with: Input contains NaN, infinity or a value too large for dtype('float32')

error


In [9]:
try:
    # Same attempt with pandas dummies; the missing values are still there
    dummies_X = pd.get_dummies(car_sales_missing[["Make", "Colour", "Doors", "Odometer (KM)"]])
    X_train, X_test, y_train, y_test = train_test_split(dummies_X, y, test_size=0.2)
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    model.score(X_test, y_test)
except ValueError:
    print("error")

# Fails with: Input contains NaN, infinity or a value too large for dtype('float32')

error


## 2. Fill missing data (also called imputation)

### 2.1 Fill missing data with Pandas

<strong>
<font color='red'>
Warning: this section fills and transforms the entire dataset car_sales_missing (and therefore X) before splitting. The code runs, but it is better to fill and transform the training and test sets separately, so that no information from the test set leaks into training.
</font>
</strong>
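
Filling the two sets separately with pandas could look like the sketch below: split first, compute the mean on the training set only, then reuse that value for both sets. The toy frame and the fixed `random_state` are assumptions for illustration, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a few missing odometer readings
toy = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0, np.nan,
                      70000.0, 90000.0, 20000.0, np.nan, 60000.0],
    "Price": [20000.0, 15000.0, 18000.0, 14000.0, 16000.0,
              11000.0, 9000.0, 19000.0, 13000.0, 12000.0],
})

X = toy.drop("Price", axis=1)
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The fill value is learned from the training set only...
train_mean = X_train["Odometer (KM)"].mean()

# ...and then applied to both sets
X_train = X_train.fillna({"Odometer (KM)": train_mean})
X_test = X_test.fillna({"Odometer (KM)": train_mean})
```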

In [10]:
# Fill missing values with pandas
# Categorical columns get "missing"; numerical columns get the mean (or 4 for "Doors")
# Assigning the result back avoids chained in-place fills, which recent pandas versions warn about

# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column
odometer_mean = car_sales_missing["Odometer (KM)"].mean()
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(odometer_mean)

# Fill the "Doors" column
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

# Remove rows missing Price values
car_sales_missing.dropna(inplace=True)

car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [11]:
len(car_sales_missing)

950

In [12]:
car_sales_missing.head(10)

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| 5 | Honda | Red | 42652.0 | 4.0 | 23883.0 |
| 6 | Toyota | Blue | 163453.0 | 4.0 | 8473.0 |
| 7 | Honda | White | 131253.237895 | 4.0 | 20306.0 |
| 8 | missing | White | 130538.0 | 4.0 | 9374.0 |
| 9 | Honda | Blue | 51029.0 | 4.0 | 26683.0 |


#### 2.1.1 Build machine learning model

In [13]:
X = car_sales_missing.drop("Price",axis=1)

categorical_features = ("Make","Colour","Doors")
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],
                               remainder="passthrough")

X = transformer.fit_transform(X)

y = car_sales_missing.Price
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

clf = RandomForestRegressor()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.18887428768896075

### 2.2 Fill missing values with sklearn

**`SimpleImputer`**

In [4]:
from sklearn.impute import SimpleImputer

X_train = np.array([
    [10,3],
    [0,4],
    [5,3],
    [np.nan,3]
])

imputer = SimpleImputer(missing_values=np.nan,
                       strategy="mean") 
# these two parameters are the defaults

imputer.fit_transform(X_train)  

array([[10.,  3.],
       [ 0.,  4.],
       [ 5.,  3.],
       [ 5.,  3.]])

In [5]:
X_test = np.array([
    [10,3],
    [np.nan,2]
])

imputer.transform(X_test)  # uses the mean of the training set (the imputer is already fitted)

array([[10.,  3.],
       [ 5.,  2.]])

**Warning** : We must fit the imputer on the training set only, after using `train_test_split`.<br>
Otherwise some data from the test set will be used to train our ML model.

**`KNNImputer`** : fills each missing value from the nearest neighbors, i.e. the samples that are most similar on the features that are present

In [19]:
from sklearn.impute import KNNImputer

X_train = np.array([
    [1,100],
    [2,75],
    [3,50],
    [np.nan,55],
    [np.nan,65],
    [np.nan,100]
])

imputer = KNNImputer(n_neighbors=1)
imputer.fit_transform(X_train)

array([[  1., 100.],
       [  2.,  75.],
       [  3.,  50.],
       [  3.,  55.],
       [  2.,  65.],
       [  1., 100.]])
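
With more than one neighbor, the imputed value is an average over the neighbors (uniformly weighted by default). The sketch below reuses the first rows of the example above with `n_neighbors=2`; the variable names `X_toy`, `imputer_2nn` and `filled_toy` are introduced here only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_toy = np.array([
    [1, 100],
    [2, 75],
    [3, 50],
    [np.nan, 55],
])

# The two rows closest to 55 in the second column are 50 and 75,
# so the missing value becomes the mean of their first column: (3 + 2) / 2
imputer_2nn = KNNImputer(n_neighbors=2)
filled_toy = imputer_2nn.fit_transform(X_toy)
```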

## Application

In [14]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [15]:
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [16]:
X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing.Price

In [17]:
# Fill missing values with sklearn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean (or 4 for doors)
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data (training and testing)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test) # fit already done

# Get our transformed data arrays back into DataFrames
car_sales_filled_train = pd.DataFrame(filled_X_train, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

car_sales_filled_test = pd.DataFrame(filled_X_test, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

**Note** : We use `fit_transform()` on the training data and `transform()` on the testing data.
    In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform).
    Then we take those same patterns and fill the test set (transform only).

Equivalently, we could have written :
- `imputer.fit(X_train)`
- `filled_X_train = imputer.transform(X_train)`
- `filled_X_test = imputer.transform(X_test)`
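
What `fit` learns can be inspected directly: a fitted `SimpleImputer` stores its fill values in the `statistics_` attribute. The sketch below uses small made-up arrays (`X_train_toy`, `X_test_toy`) to show that the test set is filled with the training mean, not with its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_toy = np.array([[10.0], [0.0], [5.0], [np.nan]])
X_test_toy = np.array([[100.0], [np.nan]])

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(X_train_toy)

# Mean of the observed training values: (10 + 0 + 5) / 3
learned_mean = toy_imputer.statistics_[0]

# The missing test value is replaced by the training mean,
# even though the only observed test value is 100
filled_test_toy = toy_imputer.transform(X_test_toy)
```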

In [18]:
car_sales_filled_train.isna().sum(), car_sales_filled_test.isna().sum()

(Make             0
 Colour           0
 Doors            0
 Odometer (KM)    0
 dtype: int64,
 Make             0
 Colour           0
 Doors            0
 Odometer (KM)    0
 dtype: int64)

#### 2.2.1 Build machine learning model

In [19]:
categorical_features = ("Make","Colour","Doors")
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],
                               remainder="passthrough")

transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)

clf = RandomForestRegressor() # our model
clf.fit(transformed_X_train,y_train)
clf.score(transformed_X_test,y_test)

0.3602678796130744
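
The imputation and encoding steps above can also be bundled into a single scikit-learn `Pipeline`, so that everything is fitted on the training set with one `fit` call. This is a sketch under assumptions, not the course's method: the toy data is generated on the spot with the same column names, and `handle_unknown="ignore"` and the fixed seeds are additions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same column names as the car sales file
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n),
    "Colour": rng.choice(["White", "Blue", "Red"], size=n),
    "Odometer (KM)": rng.uniform(10_000, 250_000, size=n),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
})
toy["Price"] = 30_000 - 0.08 * toy["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out some feature values to simulate missing data
toy.loc[rng.choice(n, size=10, replace=False), "Make"] = np.nan
toy.loc[rng.choice(n, size=10, replace=False), "Odometer (KM)"] = np.nan
toy.loc[rng.choice(n, size=10, replace=False), "Doors"] = np.nan

# Impute, then one-hot encode, the categorical columns
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("cat", cat_pipe, ["Make", "Colour"]),
    ("door", door_pipe, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X = toy.drop("Price", axis=1)
y = toy["Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model.fit(X_train, y_train)       # imputers and encoders are fitted on X_train only
r2 = model.score(X_test, y_test)  # X_test is only transformed, never fitted
predictions = model.predict(X_test)
```

Because the preprocessing lives inside the pipeline, calling `model.score` or `model.predict` on new data applies the fill values and categories learned from the training set.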