# Data Preprocessing Template

As you become a machine learning practitioner, you will learn that preprocessing your data is key to building powerful models. The good news is that preprocessing very often follows the same steps. That is why we built a template that you can download and follow. 

## What you will learn in this course 🧐🧐

* Split your data into a train set and a test set 
* Replace missing values in a dataset 
* Normalize your data 

## Step 1 - Import libraries 🎒

Usually, when preprocessing data, you will need: 

* `pandas`
* `sklearn` 

As you will see in the code, with large libraries like `sklearn` we usually import only the parts we need. For example, we will import only `train_test_split`, which lives in the `model_selection` module of `sklearn`. To do that, follow this syntax: 

`from lib.module import function_or_class` 

In [16]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer

## Step 2 - Import dataset 💾

Now let's import our dataset using `pandas`. As you already know, this library can read many sources, including: 

* CSV files (`pd.read_csv`)
* SQL databases (`pd.read_sql`)
* Excel files (`pd.read_excel`)
* ... 

And many other file formats and databases. 
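
As a small illustration of `pd.read_csv`, the sketch below reads CSV text from an in-memory buffer instead of a file, so it runs without `Data.csv`. The sample rows and the names `csv_text` and `df_demo` are made up for the example.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the structure of Data.csv (not the real file)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
"""

# read_csv accepts a path or any file-like object; the empty field becomes NaN
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
print(df_demo.dtypes)
```

Because one `Salary` value is missing, that column is read as floating point rather than integer.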

In [17]:
# Import & visualize dataset
df = pd.read_csv("assets/ML/Data.csv")
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


* We want to know whether people made a purchase
* This is a classification problem

In [18]:
# Print the shape of dataset in the form of (#rows, #columns)
print(df.shape)

# Describe dataset's main statistics
print(df.describe(include="all"))

(10, 4)
       Country        Age        Salary Purchased
count       10   9.000000      9.000000        10
unique       3        NaN           NaN         2
top     France        NaN           NaN        No
freq         4        NaN           NaN         5
mean       NaN  38.777778  63777.777778       NaN
std        NaN   7.693793  12265.579662       NaN
min        NaN  27.000000  48000.000000       NaN
25%        NaN  35.000000  54000.000000       NaN
50%        NaN  38.000000  61000.000000       NaN
75%        NaN  44.000000  72000.000000       NaN
max        NaN  50.000000  83000.000000       NaN


* There are missing values, but not in every column
* `count` is not the same everywhere: 9 for `Age` and `Salary`, 10 for `Country` and `Purchased`
* There are 3 unique country categories
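
A more direct way to spot missing values than comparing `count` rows is to count them per column. The sketch below uses a hypothetical toy DataFrame (`toy`), not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age and one missing Salary
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# isna() flags missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```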

## Step 3 - Separate Target from feature variables 🎯

In this step, you want to **separate your target variable from the variables that will be used to train your model**. Usually, we call the feature variables `X` and the target variable `y`. 

In [19]:
# Separate target variable y from features X
print("Separating labels from features...")
features_list = ["Country", "Age", "Salary"]
X = df.loc[:,features_list]

y = df.loc[:,"Purchased"]


# print(X.head())
# print(y.head())
print("...Done.")
print()

Separating labels from features...
...Done.



👋 You could have used `iloc` as well. Whatever is handy for you. 

In [20]:
# X_prime = df.iloc[:,[0,1,2]]
# y_prime = df.iloc[:,[3]]
# 
# print(X_prime.head())
# print(y_prime.head())
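
To make the `loc` / `iloc` comparison concrete, here is a minimal sketch on a hypothetical two-row frame (`toy`) with the same column layout. It also shows one detail worth knowing: selecting with a list of positions, as in `df.iloc[:,[3]]`, returns a DataFrame, while a single position returns a Series.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as the dataset
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

X_by_label = toy.loc[:, ["Country", "Age", "Salary"]]   # select by column names
X_by_position = toy.iloc[:, [0, 1, 2]]                   # select by column positions

y_series = toy.iloc[:, 3]      # a single position returns a Series
y_frame = toy.iloc[:, [3]]     # a list of positions returns a DataFrame

print(X_by_label.equals(X_by_position))
print(type(y_series).__name__, type(y_frame).__name__)
```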


## Step 4 - Train / Test split 🖖

As you might have already guessed, once you have trained your model, you will need some data to test it on. That's why you won't use your whole dataset for training: you will keep a small part for testing. This is where `train_test_split` from scikit-learn comes in handy. 

Usually we split into 80% training data and 20% testing data, but this can vary depending on how much data you have. 

In [21]:
# Divide dataset into train set & test set 
## train_test_split was already imported in Step 1


print("Splitting dataset into train set and test set...")
## Then we use train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0,       # fix the seed so the split is the same on every run
                                                    stratify=y)           # Allows you to stratify your sample. 
                                                                          # Meaning, you will have the same
                                                                          # proportion of categories in test 
                                                                          # and train set

print("...Done.")                      

Splitting dataset into train set and test set...
...Done.
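
To see what `stratify` does, the sketch below splits a hypothetical balanced target (`y_demo`, 5 "No" and 5 "Yes") and counts the classes on each side. The data and variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: 5 "No" and 5 "Yes"
y_demo = pd.Series(["No", "Yes"] * 5)
X_demo = pd.DataFrame({"feature": range(10)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify, both subsets keep the 50/50 class balance
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without `stratify`, a test set this small could easily end up with two rows of the same class.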


## Step 5 - Cleaning ➿

Now we start preprocessing the training set. We will need to perform the following actions: 

1. Replace missing values
2. Encode categorical variables 
3. Standardize numerical features

Let's tackle it! 

In [22]:
# ### Training pipeline ###
# print("--- Training pipeline ---")
# print()

First, let's deal with missing values. A common way to deal with them is to replace them with the mean of the corresponding column. We can do that with the `SimpleImputer` class from `sklearn`. 

If a country name had been missing, we could have used the `most_frequent` strategy (the mode) to replace it with the most frequent country.

In [23]:
# Missing values
print("Imputing missing values...")
print(X_train)
print()
imputer = SimpleImputer(strategy="mean") # Instantiate SimpleImputer with the mean strategy
X_train = X_train.copy()                 # Copy the dataset to avoid the SettingWithCopyWarning raised when assigning to a slice of a DataFrame
                                         # More info here https://towardsdatascience.com/explaining-the-settingwithcopywarning-in-pandas-ebc19d799d25

X_train.iloc[:,[1,2]] = imputer.fit_transform(X_train.iloc[:,[1,2]]) # Fit and transform columns where there are missing values
print("...Done!\n")
print(X_train) 
print()   

Imputing missing values...
   Country   Age   Salary
0   France  44.0  72000.0
4  Germany  40.0      NaN
6    Spain   NaN  52000.0
9   France  37.0  67000.0
3    Spain  38.0  61000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
5   France  35.0  58000.0

...Done!

   Country        Age        Salary
0   France  44.000000  72000.000000
4  Germany  40.000000  58857.142857
6    Spain  35.857143  52000.000000
9   France  37.000000  67000.000000
3    Spain  38.000000  61000.000000
1    Spain  27.000000  48000.000000
2  Germany  30.000000  54000.000000
5   France  35.000000  58000.000000



👋 NB: There are other statistics you can replace missing values with like *median*, *mode* or anything else as long as it is coherent with your business and your goals.  
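
As an illustration of these alternatives, the sketch below applies the `median` and `most_frequent` strategies of `SimpleImputer` to two hypothetical one-column arrays (`ages` and `countries`). The values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns with one missing value each, shaped (n_samples, 1)
ages = np.array([[27.0], [30.0], [np.nan], [44.0]])
countries = np.array([["France"], ["Spain"], [np.nan], ["France"]], dtype=object)

median_imputer = SimpleImputer(strategy="median")
ages_filled = median_imputer.fit_transform(ages)           # NaN -> median of 27, 30, 44

mode_imputer = SimpleImputer(strategy="most_frequent")
countries_filled = mode_imputer.fit_transform(countries)   # NaN -> most frequent country

print(ages_filled.ravel())
print(countries_filled.ravel())
```

The `most_frequent` strategy is the one that also works on text columns, which is why it suits a missing country name.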

Finally, we need to do three things:

1. Standardize data
2. **One hot encode** categorical variables 
3. Encode labels of `y`

The first step is needed because we want all numerical features on the same scale. To do so, we *standardize* the data: for each data point, we subtract the mean of its column and divide by the column's standard deviation: 

$$z_i = \frac{x_i - \mu}{\sigma}$$ 

Where: 

* $x_i$ is a given observation 
* $\mu$ is the mean of the feature in the sample 
* $\sigma$ is the standard deviation of the feature in the sample 

In other words, the standard scaler replaces each value with its z-score.

The second step is about encoding categorical variables. Simply replacing categories with numbers is not enough, because we need to make sure that each category weighs the same. For example, if you replace "cat", "dog", "rabbit" with 1, 2, 3, then mathematically "rabbit" weighs three times as much as "cat". Therefore we create one new column per category that can contain only `0` and `1`. This is called *one hot encoding*. 

```
Country France Germany UK
France  1      0       0
Germany 0      1       0
UK      0      0       1
France  1      0       0
```

The third column is redundant and can be dropped: if the country is neither France nor Germany, it has to be the UK.
```
Country France Germany 
France  1      0 
Germany 0      1 
UK      0      0 
France  1      0 
```

Use one hot encoding when the categories have no natural order (France, Germany, Spain).
Label encoding (0, 1, 2...) can be used when the categories are ordered, for example: bad, average, good.
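
The sketch below reproduces this idea with `OneHotEncoder(drop="first")` on a hypothetical country column. Note that scikit-learn sorts the categories and drops the first one, so here it is France (not the last category) that ends up encoded as all zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) as scikit-learn expects
countries = np.array([["France"], ["Germany"], ["Spain"], ["France"]])

encoder = OneHotEncoder(drop="first")                   # drops the first category in sorted order
encoded = encoder.fit_transform(countries).toarray()    # the default output is a sparse matrix

print(encoder.categories_)   # categories found, in sorted order
print(encoded)               # columns are Germany and Spain; France rows are all zeros
```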


Finally, for `y` we simply need to encode labels. This means we will replace "No" / "Yes" with `0` and `1`, which a computer can work with. 
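
A minimal sketch of label encoding with `LabelEncoder`, using a hypothetical list of labels. The encoder sorts the classes, so "No" maps to `0` and "Yes" to `1`, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes", "Yes"]   # hypothetical target values

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoder.classes_))                # classes are sorted: index 0 is "No", index 1 is "Yes"
print(encoded)
print(encoder.inverse_transform(encoded))    # back to the original strings
```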


Note that the code uses a **ColumnTransformer()**

In [24]:
# X_train.iloc[:, 0].values.reshape(-1, 1)
# country_transformed = ohe.fit_transform(X_train.iloc[:,0].values.reshape(-1, 1))


# Standardizing numeric features and encoding categorical features
print("Standardizing numeric features and encoding categorical features...")
print()

numeric_features = [1, 2]                             # Positions of the columns that hold numeric values
                                                      # Age and Salary are in columns 1 and 2
numeric_transformer = StandardScaler()                # Transformer to apply to the numeric columns

categorical_features = [0]                            # Positions of the columns that hold categorical values
                                                      # Country is in column 0 (the first one)
categorical_transformer = OneHotEncoder(drop='first') # To drop the UK as in the example above we would want drop="last", but that option does not exist

featureencoder = ColumnTransformer(                   # ColumnTransformer comes from the compose module
    transformers=[                                    # List of transformers, each given a name (cat, num...)
        ('cat', categorical_transformer, categorical_features),   
        ('num', numeric_transformer, numeric_features)
        ]
    )

# featureencoder is a ColumnTransformer object
# It holds the "recipe" for transforming each column:
# apply StandardScaler to columns 1 and 2, which are numeric
# apply OneHotEncoder to column 0, which is categorical
# ... 
# The big advantage of this approach is that
#     to try or add transformers on some columns, we only change the code in one place
#     we are sure to apply the same "recipe" later to our test data (X_test)

X_train = featureencoder.fit_transform(X_train)
print("...Done.")
print(X_train[:5])  # print first 5 rows (not using iloc since X_train is now a numpy array)
print()

# The output has 4 columns: 2 for the countries, plus Age and Salary
# France => 0 in both country columns
# Germany => 1 and 0
# Spain => 0 and 1




# Encoding labels
print("Encoding labels...")
print(y_train)
print()
labelencoder = LabelEncoder()                       # LabelEncoder comes from sklearn.preprocessing
                                                    # It turns No / Yes into 0 / 1
                                                    # With three different labels such as Riri, Fifi, Loulou
                                                    # it would have encoded them as 0, 1 and 2
y_train = labelencoder.fit_transform(y_train)
print("...Done.")
print(y_train[:5])                                  # print first 5 values (not using iloc since y_train is now a numpy array)
print()

Standardizing numeric features and encoding categorical features...

...Done.
[[ 0.00000000e+00  0.00000000e+00  1.61706195e+00  1.78674463e+00]
 [ 1.00000000e+00  0.00000000e+00  8.22715727e-01  0.00000000e+00]
 [ 0.00000000e+00  1.00000000e+00 -1.41104234e-15 -9.32214592e-01]
 [ 0.00000000e+00  0.00000000e+00  2.26956063e-01  1.10700483e+00]
 [ 0.00000000e+00  1.00000000e+00  4.25542617e-01  2.91317060e-01]]

Encoding labels...
0     No
4    Yes
6     No
9    Yes
3     No
1    Yes
2     No
5    Yes
Name: Purchased, dtype: object

...Done.
[0 1 0 1 0]
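
As a sanity check on the standardized values printed above, the z-scores of the `Age` column can be recomputed by hand. This sketch copies the imputed train ages from the earlier output (the imputed value 35.857143 is written as `251 / 7`) and compares a manual z-score with `StandardScaler`. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of the train set after mean imputation (values from the output above)
ages = np.array([44.0, 40.0, 251.0 / 7.0, 37.0, 38.0, 27.0, 30.0, 35.0])

# z-score by hand; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses
z_manual = (ages - ages.mean()) / ages.std()

z_sklearn = StandardScaler().fit_transform(ages.reshape(-1, 1)).ravel()

print(np.round(z_manual[:5], 4))
print(np.allclose(z_manual, z_sklearn))
```

The imputed age sits exactly at the column mean, which is why its standardized value in the output is a rounding-error number such as `-1.41e-15` rather than a clean `0`.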



Now you are done and you should be able to train your model! 🤗 As this is only the data preprocessing template, we won't cover the algorithm here. Just remember that this is the point where you would train your model. 

In [25]:
print("*** HERE WILL BE THE TRAINING STEP (NOT IN THE SCOPE AT THIS STAGE OF" +
                                           " THE LECTURE) ***")
print()

*** HERE WILL BE THE TRAINING STEP (NOT IN THE SCOPE AT THIS STAGE OF THE LECTURE) ***



## Step 6 - Testing 🧪

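Finally, whatever preprocessing you applied to your train set must also be applied to your test set. Note that we only call `transform` here, never `fit`: the test set is processed with the statistics learned on the train set. Here is how you can apply it: 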

In [26]:
### Test pipeline ###
print("--- Test pipeline ---")

# Missing values
print("Imputing missing values...")
print(X_test)
print()
X_test = X_test.copy() # Copy the dataset to avoid the SettingWithCopyWarning raised when assigning to a slice of a DataFrame
                        # More info here https://towardsdatascience.com/explaining-the-settingwithcopywarning-in-pandas-ebc19d799d25

X_test.iloc[:,[1,2]] = imputer.transform(X_test.iloc[:,[1,2]])
print("...Done.")
print(X_test) 
print()   

# Encoding categorical features and standardizing numeric features
print("Encoding categorical features and standardizing numerical features...")
print()

X_test = featureencoder.transform(X_test)       # Note that we reuse the "recipe" stored in the featureencoder object
                                                # This guarantees the test data is processed exactly like the training data
print("...Done.")
print(X_test)
print()

# Encoding labels
print("Encoding labels...")
print(y_test)
print()
y_test = labelencoder.transform(y_test)
print("...Done.")
print(y_test)
print()

--- Test pipeline ---
Imputing missing values...
   Country   Age   Salary
8  Germany  50.0  83000.0
7   France  48.0  79000.0

...Done.
   Country   Age   Salary
8  Germany  50.0  83000.0
7   France  48.0  79000.0

Encoding categorical features and standardizing numerical features...

...Done.
[[1.         0.         2.80858127 3.28217221]
 [0.         0.         2.41140816 2.73838036]]

Encoding labels...
8     No
7    Yes
Name: Purchased, dtype: object

...Done.
[0 1]
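
One possible variation, not used in this template, is to bundle imputation and scaling into a scikit-learn `Pipeline` inside the `ColumnTransformer`, so a single object is fitted on the train set and reused on the test set. The sketch below assumes hypothetical toy frames (`train`, `test`) and selects columns by name rather than by position.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the course dataset
train = pd.DataFrame({
    "Country": ["France", "Germany", "Spain", "France"],
    "Age": [44.0, 40.0, np.nan, 37.0],
    "Salary": [72000.0, np.nan, 52000.0, 67000.0],
})
test = pd.DataFrame({
    "Country": ["Germany", "France"],
    "Age": [50.0, np.nan],
    "Salary": [83000.0, 79000.0],
})

# Numeric columns: impute with the mean, then standardize
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="first"), ["Country"]),
    ("num", numeric_pipeline, ["Age", "Salary"]),
])

train_prepared = preprocessor.fit_transform(train)   # learn statistics on the train set
test_prepared = preprocessor.transform(test)         # reuse them on the test set

print(train_prepared.shape, test_prepared.shape)
print(test_prepared)
```

The missing test `Age` is filled with the train mean, so after scaling it comes out as (approximately) zero, which shows that no statistic was recomputed on the test set.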



## Step 7 - Predict and evaluate 🔮

Now that your model is ready and trained, you can test its performance on your test set and interpret results! 

In [27]:
print("*** HERE WILL BE THE PREDICTION STEP ***")
print()
print("*** HERE WILL BE THE ASSESSMENT OF PERFORMANCES ***")
print()

*** HERE WILL BE THE PREDICTION STEP ***

*** HERE WILL BE THE ASSESSMENT OF PERFORMANCES ***



## Resources 📚📚

* Standardization, or mean removal and variance scaling - [https://bit.ly/301Sx](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)

* Imputation of missing values - [https://bit.ly/02s0C](https://scikit-learn.org/stable/modules/impute.html)

* Label encoding - [https://bit.ly/0Zasc](https://scikit-learn.org/stable/modules/preprocessing_targets.html#label-encoding)

* Pipelines and composite estimators - [https://bit.ly/0csas2](https://scikit-learn.org/stable/modules/compose.html)