# Data Preprocessing Tools

## Importing the libraries

In [50]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Importing the dataset

Class notes:
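* Whether the file name passed to ```read_csv``` is case sensitive depends on the operating system's file system;
* ```iloc``` stands for integer-location based indexing;
* ```.values``` converts the selected part of the DataFrame into a NumPy ndarray.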
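
To make the ```iloc``` and ```.values``` notes concrete, here is a minimal sketch on a hypothetical three-row frame (the ```demo``` frame and its values are made up for illustration and are not read from ```Data.csv```):

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: two feature columns and one target column
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Purchased": ["No", "Yes", "No"],
})

features = demo.iloc[:, :-1]  # all rows, every column except the last one
target = demo.iloc[:, -1]     # all rows, last column only

X_demo = features.values      # 2-D array holding the feature columns
y_demo = target.values        # 1-D array holding the target column

print(X_demo.shape)  # (3, 2)
print(y_demo.shape)  # (3,)
```

The pandas documentation recommends ```.to_numpy()``` over ```.values``` in new code; both should give the same arrays for this kind of selection.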

In [51]:
dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [52]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [53]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Class notes:
* If you have a massive dataset with less than
1% of the data missing, you can consider simply
removing the affected rows.
* Instead of replacing missing values with the
mean of the column, you could also use the
median or the most frequent value (the usual
choice for categorical data).
* ```X[:, a:b]``` includes column a but excludes
column b.
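
The alternatives listed in these notes can be sketched as follows. The toy arrays ```ages``` and ```countries``` are hypothetical and only serve to show each option; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing entry
ages = np.array([[44.0], [27.0], [np.nan], [38.0], [27.0]])
countries = np.array(
    [["France"], ["Spain"], ["Spain"], [np.nan], ["Germany"]], dtype=object
)

# Option 1: drop every row that contains a missing value
frame = pd.DataFrame({"Age": ages.ravel(), "Country": countries.ravel()})
complete_rows = frame.dropna()

# Option 2: replace missing numbers with the column median
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
ages_median = median_imputer.fit_transform(ages)

# Option 3: replace missing categories with the most frequent value
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
countries_mode = mode_imputer.fit_transform(countries)

# Slicing reminder: 1:3 selects columns 1 and 2, not column 3
table = np.arange(12).reshape(3, 4)
middle = table[:, 1:3]

print(len(complete_rows))    # 3
print(ages_median[2, 0])     # 32.5
print(countries_mode[3, 0])  # Spain
print(middle.shape)          # (3, 2)
```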

In [54]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [55]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

Class notes:
* Because countries have no natural order (one
is not greater than another), we can't use a
plain encoder that maps categories to 1, 2, 3,
etc.
* One-hot encoding turns each category of the
column into its own column. Our example has
3 categories, so the one-hot encoder creates
3 columns.
* The ```remainder``` option specifies whether to
keep (```'passthrough'```) or drop the columns
that are not encoded.
* ```transformers``` expects a list of tuples, each
composed this way:
    * a ```string``` naming the transformation.
    * a transformer instance, like ```OneHotEncoder()```
    here for categorical data.
    * the ```index``` (or list of indices) of the columns to transform.
In [56]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [57]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [58]:
le = LabelEncoder()
y = le.fit_transform(y)

In [59]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
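
To see which label each integer stands for, you can inspect the fitted encoder. This is a small sketch on a hypothetical label array, not on the notebook's ```y```:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels
labels = np.array(["No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; the position of a class is its integer code
print(list(encoder.classes_))  # ['No', 'Yes']
print(encoded.tolist())        # [0, 1, 0, 1]

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())        # ['No', 'Yes', 'No', 'Yes']
```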


## Splitting the dataset into the Training set and Test set

Class notes:
* We have to apply feature scaling after
splitting the dataset because the test set
must be treated as completely new data that
you are not supposed to have seen. Feature
scaling uses the mean and the standard
deviation of your data, so computing them on
the full dataset would leak information from
the test set. Instead, you compute them on the
training set only and reuse them to scale the
test set.
* The ```random_state``` parameter fixes the
seed of the random number generator, so that we
always get the same split.
* ```train_test_split``` shuffles X and y
together, so each row of X stays paired with
its y value.
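
The last two notes can be checked on a hypothetical toy array in which each target value identifies its feature row. This is only a sketch; the names ```features``` and ```targets``` are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i holds [2*i, 2*i + 1] and its target is i
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Two calls with the same random_state should produce the same split
first = train_test_split(features, targets, test_size=0.2, random_state=1)
second = train_test_split(features, targets, test_size=0.2, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# Each row should still sit next to its own target after shuffling
X_tr, X_te, y_tr, y_te = first
rows_aligned = (
    np.array_equal(X_tr[:, 0], 2 * y_tr)
    and np.array_equal(X_te[:, 0], 2 * y_te)
)

print(same_split)            # True
print(rows_aligned)          # True
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```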

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [61]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [62]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [63]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [64]:
print(y_test)

[0 1]


## Feature scaling

Class notes:
* In some datasets, a feature can take values so
large that it dominates the other features, to
the point that the dominated features are
barely taken into account by some machine
learning algorithms.
* Not all machine learning algorithms need
feature scaling.
* The two main feature scaling techniques:
    * Standardisation: \\[x_{stand} = \frac{x - mean(x)}{standard\ deviation(x)}\\]

    Result (for most values): \\[-3 \leq x_{stand} \leq 3\\]
    * Normalisation: \\[x_{norm} = \frac{x - min(x)}{max(x) - min(x)}\\]

    Result: \\[0 \leq x_{norm} \leq 1\\]
* Normalisation is recommended when you have a
normal distribution in most of your features.
* Standardisation works well in most cases.
* You don't need to apply feature scaling to dummy
variables.
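
The two formulas above can be written out by hand with NumPy and compared with scikit-learn's ```StandardScaler``` and ```MinMaxScaler```. This is a minimal sketch on a hypothetical salary column; ```MinMaxScaler``` is not used elsewhere in this notebook and appears here only for the comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature column
salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0], [83000.0]])

# Standardisation: subtract the mean, divide by the standard deviation
manual_standard = (salaries - salaries.mean(axis=0)) / salaries.std(axis=0)

# Normalisation: subtract the minimum, divide by the range
value_range = salaries.max(axis=0) - salaries.min(axis=0)
manual_normal = (salaries - salaries.min(axis=0)) / value_range

# The scikit-learn scalers should agree with the hand-written versions
sk_standard = StandardScaler().fit_transform(salaries)
sk_normal = MinMaxScaler().fit_transform(salaries)

print(np.allclose(manual_standard, sk_standard))  # True
print(np.allclose(manual_normal, sk_normal))      # True
print(manual_normal.min(), manual_normal.max())   # 0.0 1.0
```

Note that ```StandardScaler``` divides by the population standard deviation, which is also what ```np.std``` computes by default.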

In [65]:
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
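
The cell above calls ```fit_transform``` on the training set but only ```transform``` on the test set. As a sketch of what that implies, the hypothetical example below checks that the test values are scaled with the mean and standard deviation learned from the training values (the ```train``` and ```test``` arrays are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test columns
train = np.array([[27.0], [35.0], [38.0], [44.0], [50.0]])
test = np.array([[30.0], [37.0]])

scaler = StandardScaler()
scaler.fit(train)                     # statistics come from the training set only
test_scaled = scaler.transform(test)  # the test set reuses those statistics

# Same computation by hand, using the fitted attributes
expected = (test - scaler.mean_) / scaler.scale_

print(np.isclose(scaler.mean_[0], train.mean()))  # True
print(np.allclose(test_scaled, expected))         # True
```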

In [66]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [67]:
print(X_test)


[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
