# Data preprocessing for supervised machine learning

## What you'll learn in this course 🧐🧐

The example below implements all the necessary steps to train a machine learning model and make predictions on a dataset.

* Use pandas and the preprocessing module of the scikit-learn library to prepare your data.
* Use scikit-learn to train a supervised machine learning model and assess its performance.

 We have a sample dataset of conversions (whether or not someone purchased a product). The objective is to predict whether a person has made a purchase, based on information about that person: nationality, age, and income level.

 - The column "Purchased" is the variable to predict, also called the "variable to explain" or "target", and denoted Y
 - The other columns of the dataset, called "explanatory variables" and denoted X, are used to try to predict the value of Y

 ## Import useful modules ⬇️⬇️

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) # to avoid deprecation warnings

 ## File reading and basic exploration 📰📰

In [2]:
# Import dataset
print("Loading dataset...")
dataset = pd.read_csv("src/Dataset_preprocessing.csv")
print("...Done.")
print()

Loading dataset...
...Done.



In [3]:
# Basic stats
print("Number of rows : {}".format(dataset.shape[0]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

Number of rows : 12

Display of dataset: 


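| | id | Country | Age | Salary | Purchased | useless_col | almost_empty |
|---|---|---|---|---|---|---|---|
| 0 | 0 | France | 44.0 | 72000 | No | useless | NaN |
| 1 | 1 | Spain | 27.0 | 48000 | Yes | useless | 40.0 |
| 2 | 2 | Germany | 30.0 | 54000 | No | useless | NaN |
| 3 | 3 | Spain | 38.0 | 61000 | No | useless | 20.0 |
| 4 | 4 | Germany | 40.0 | 69000 | Yes | useless | NaN |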



Basics statistics: 


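| | id | Country | Age | Salary | Purchased | useless_col | almost_empty |
|---|---|---|---|---|---|---|---|
| count | 12.0 | 12 | 11.0 | 12.0 | 12 | 12 | 2.0 |
| unique | NaN | 3 | NaN | NaN | 2 | 1 | NaN |
| top | NaN | France | NaN | NaN | Yes | useless | NaN |
| freq | NaN | 5 | NaN | NaN | 7 | 12 | NaN |
| mean | 5.5 | NaN | 36.909091 | 83389580.0 | NaN | NaN | 30.0 |
| std | 3.605551 | NaN | 19.002392 | 288657400.0 | NaN | NaN | 14.142136 |
| min | 0.0 | NaN | -10.0 | 32000.0 | NaN | NaN | 20.0 |
| 25% | 2.75 | NaN | 32.5 | 53500.0 | NaN | NaN | 25.0 |
| 50% | 5.5 | NaN | 38.0 | 64000.0 | NaN | NaN | 30.0 |
| 75% | 8.25 | NaN | 46.0 | 73750.0 | NaN | NaN | 35.0 |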



Percentage of missing values: 


id               0.000000
Country          0.000000
Age              8.333333
Salary           0.000000
Purchased        0.000000
useless_col      0.000000
almost_empty    83.333333
dtype: float64
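
The missing-value percentage above is the number of nulls per column divided by the number of rows. As a minimal sketch, the same figure can be obtained with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` values. The `toy` DataFrame below is made up for illustration and is not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000, 48000, 54000, 61000],
})

# Same formula as in the cell above: null count / number of rows
pct_missing = 100 * toy.isnull().sum() / toy.shape[0]

# Shortcut: the mean of a boolean column is the fraction of True values
pct_missing_short = 100 * toy.isnull().mean()

print(pct_missing)
print(pct_missing_short)
```

Both expressions give one percentage per column, so either can be used to spot columns such as _almost_empty_.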

 Exploring the data above tells us which preprocessing steps are needed:

 **1. Preprocessing to be done with pandas**

 **Columns that are useless for prediction, to be dropped**:
 - _id_ is an identifier: it carries no information about the target and should never be used for prediction
 - _useless_col_ always contains the same value, so it cannot help tell one row from another

 **Columns with too many missing values, to be dropped**: _almost_empty_

 **Rows containing outliers, to be dropped**:
 
 - Rows for which _Age_ is negative
 - Rows for which _Salary_ is more than 2 standard deviations (std) above the mean (this is not a general rule, but here it discards the value of 1 billion, which looks aberrant)
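
To see why the 2-standard-deviation rule catches the aberrant salary, the threshold can be computed from the summary statistics shown earlier. This is only a sketch: the numbers are copied from the `describe()` output, and the variable names are made up for illustration.

```python
# Values copied from the describe() output above
salary_mean = 83389580.0
salary_std = 288657400.0
salary_q3 = 73750.0  # 75% quantile of Salary

# Upper bound used by the rule: mean + 2 * std
threshold = salary_mean + 2 * salary_std
print(threshold)

# The value of 1 billion lies above the threshold, a typical salary does not
is_outlier = 1_000_000_000 > threshold
is_typical_kept = salary_q3 < threshold
print(is_outlier, is_typical_kept)
```

Note that the outlier itself inflates both the mean and the std, which is one reason this rule should not be applied blindly to other datasets.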

 **Target variable (Y) that we will try to predict, to be separated from the other columns**: Purchased

 **------------**

 **2. Preprocessing to be done with scikit-learn**

 **Explanatory variables (X)**

 We need to identify which columns are categorical and which are numerical, because they are treated differently.

 - Categorical variables: Country
 - Numerical variables: Age, Salary

 This dataset has both types of variables, so we will create a numeric_transformer (which uses the StandardScaler class) and a categorical_transformer (which uses the OneHotEncoder class). In addition, since the dataset contains missing values, we will use the SimpleImputer class to fill them in.

 **Target variable Y**

 Finally, the variable Y to be predicted is categorical, so it will be encoded with the LabelEncoder class.
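
As a quick illustration of what LabelEncoder does to a categorical target, here is a minimal sketch on the first five values of _Purchased_ shown in the preview. The variable names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# First five values of the Purchased column, as shown in the preview
labels = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the labels in sorted order; the position of a label is its code
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Because the classes are sorted, "No" is encoded as 0 and "Yes" as 1.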

 ## Preprocessing - Pandas 🐼🐼

In [4]:
# Drop useless columns / columns with too many missing values
useless_cols = ['id', 'useless_col', 'almost_empty']

print("Dropping useless columns...")
dataset = dataset.drop(useless_cols, axis=1) # axis = 1 indicates that we are dropping along the column axis
# never hesitate to look at a function's documentation using the command name_of_the_function?
print("...Done.")
print(dataset.head())

Dropping useless columns...
...Done.
   Country   Age  Salary Purchased
0   France  44.0   72000        No
1    Spain  27.0   48000       Yes
2  Germany  30.0   54000        No
3    Spain  38.0   61000        No
4  Germany  40.0   69000       Yes


In [5]:
# Drop lines containing outliers (using masks)

print('Dropping outliers in Age...')
to_keep = (dataset['Age'] > 0) | (dataset['Age'].isnull()) # Keep rows where Age is positive or missing
dataset = dataset.loc[to_keep,:] 
print('Done. Number of lines remaining : ', dataset.shape[0])
print()

print('Dropping outliers in Salary...')
to_keep = dataset['Salary'] < dataset['Salary'].mean() + 2*dataset['Salary'].std()
dataset = dataset.loc[to_keep,:]
print('Done. Number of lines remaining : ', dataset.shape[0])
print()

dataset.head()

Dropping outliers in Age...
Done. Number of lines remaining :  11

Dropping outliers in Salary...
Done. Number of lines remaining :  10



| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000 | No |
| 1 | Spain | 27.0 | 48000 | Yes |
| 2 | Germany | 30.0 | 54000 | No |
| 3 | Spain | 38.0 | 61000 | No |
| 4 | Germany | 40.0 | 69000 | Yes |
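
The `| isnull()` part of the Age mask matters because any comparison with a missing value evaluates to `False`. A minimal sketch on a hypothetical `toy` DataFrame (not the course dataset) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one negative age and one missing age
toy = pd.DataFrame({"Age": [44.0, -10.0, np.nan, 30.0]})

# NaN > 0 evaluates to False, so this mask alone would also drop the missing row
positive = toy["Age"] > 0

# Adding the isnull() condition keeps the missing row for later imputation
to_keep = positive | toy["Age"].isnull()

print(positive.tolist())
print(to_keep.tolist())

kept = toy.loc[to_keep, :]
print(kept)
```

Only the row with the negative age is removed. The missing age is kept so that it can be imputed later.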


In [6]:
# Separate target variable Y from features X
target_name = 'Purchased'

print("Separating labels from features...")
Y = dataset.loc[:,target_name]
X = dataset.loc[:,[c for c in dataset.columns if c!=target_name]] # All columns are kept, except the target
print("...Done.")
print(Y.head())
print()
print(X.head())
print()

Separating labels from features...
...Done.
0     No
1    Yes
2     No
3     No
4    Yes
Name: Purchased, dtype: object

   Country   Age  Salary
0   France  44.0   72000
1    Spain  27.0   48000
2  Germany  30.0   54000
3    Spain  38.0   61000
4  Germany  40.0   69000



In [7]:
# Convert the pandas objects to a numpy array (X) and a Python list (Y) before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X = X.values
Y = Y.tolist()
print("...Done")
print(X[0:5,:])
print()
print(Y[0:5])

Convert pandas DataFrames to numpy arrays...
...Done
[['France' 44.0 72000]
 ['Spain' 27.0 48000]
 ['Germany' 30.0 54000]
 ['Spain' 38.0 61000]
 ['Germany' 40.0 69000]]

['No', 'Yes', 'No', 'No', 'Yes']


 ## Preprocessing - Scikit-Learn 🔬🔬

In [8]:
# First: always split the dataset into a train set and a test set!
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# test_size is the proportion of rows from X and Y that go into the test set;
# the correspondence between the rows of X and Y is preserved

# random_state is an argument found in functions that have pseudo-random behaviour.
# If random_state is not set, the function gives a different random result every time the cell
# runs. If random_state is given a value, the result is the same every time the cell runs,
# and each value of random_state gives its own specific result
print("...Done.")
print()

Dividing into train and test sets...
...Done.
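
The effect of `test_size` and `random_state` can be checked on a small example. This is a sketch on made-up data (`X_toy` and `y_toy` are hypothetical), with 10 rows like the cleaned dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of X_toy is [2*i, 2*i + 1] and its label is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = list(range(10))

# Two calls with the same random_state
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same random_state, same split
same_split = np.array_equal(X_te_a, X_te_b) and y_te_a == y_te_b
print(same_split)

# With 10 rows and test_size=0.2, 8 rows go to train and 2 to test
print(X_tr_a.shape, X_te_a.shape)

# Rows of X and y stay aligned after shuffling
aligned = [int(row[0]) // 2 for row in X_te_a] == y_te_a
print(aligned)
```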



In [9]:
# Create pipeline for numeric features
numeric_features = [1,2] # Positions of numeric columns in X_train/X_test
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # missing values will be replaced by columns' median
    ('scaler', StandardScaler())
])
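
To make the two steps of the numeric pipeline concrete, here is a minimal sketch on a single hypothetical column of ages (the values are made up for illustration). The imputer learns the median of the observed values, then the scaler centers and rescales the completed column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column of ages with one missing value
ages = np.array([[27.0], [np.nan], [40.0], [48.0]])

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
scaled = pipe.fit_transform(ages)

# The median learned by the imputer (computed on the non-missing values)
learned_median = pipe.named_steps['imputer'].statistics_[0]
print(learned_median)

# After scaling, the column has mean 0 and standard deviation 1
print(scaled.mean(), scaled.std())
```

The missing age is replaced by the median before scaling, so it ends up with the same scaled value as the row that already held the median.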

In [10]:
# Create pipeline for categorical features
categorical_features = [0] # Positions of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # the first category is dropped to avoid redundant (perfectly correlated) columns
    ])
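
The effect of `drop='first'` is easier to see on a few values. A minimal sketch, using country names from the dataset preview (the `countries` array is built by hand for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three distinct categories
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder(drop="first")
# The encoder returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(countries).toarray()

# Categories are sorted; the first one (France) is the dropped reference
print(encoder.categories_[0])
print(encoded)
```

With three categories, only two columns remain (Germany and Spain). A row of zeros means France, so no information is lost.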

In [11]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
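
The order of the transformers determines the column order of the output: the numeric columns come first, then the one-hot columns. The sketch below uses three rows from the dataset preview. For brevity it plugs StandardScaler and OneHotEncoder in directly rather than the two pipelines defined above, and `X_toy` and `toy_preprocessor` are names made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Three rows from the dataset preview, stored as an object array like X
X_toy = np.array([
    ["France", 44.0, 72000],
    ["Spain", 27.0, 48000],
    ["Germany", 30.0, 54000],
], dtype=object)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), [1, 2]),          # Age, Salary
        ("cat", OneHotEncoder(drop="first"), [0]),  # Country
    ])

out = toy_preprocessor.fit_transform(X_toy)
if hasattr(out, "toarray"):  # the result can be sparse depending on its density
    out = out.toarray()

# 2 scaled numeric columns followed by 2 one-hot columns (Germany, Spain)
print(out.shape)
print(out[:, 2:])
```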

In [12]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train[0:5,:])
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test[0:5,:])
X_test = preprocessor.transform(X_test) # Don't fit again!! The test set is used to validate decisions
# made on the training set, so we may only apply transformations whose parameters were learned from the training set.
# Otherwise information leaks from the test set, which biases all your results.
print('...Done.')
print(X_test[0:5,:])
print()

Performing preprocessings on train set...
[['Germany' 40.0 69000]
 ['France' 37.0 67000]
 ['Spain' 27.0 48000]
 ['Spain' nan 52000]
 ['France' 48.0 79000]]
...Done.
[[ 0.27978024  0.58858382  1.          0.        ]
 [-0.23673712  0.38385901  0.          0.        ]
 [-1.95846165 -1.56102665  0.          1.        ]
 [-0.06456467 -1.15157703  0.          1.        ]
 [ 1.65715986  1.61220785  0.          0.        ]]

Performing preprocessings on test set...
[['Germany' 30.0 54000]
 ['Germany' 50.0 83000]]
...Done.
[[-1.44194429 -0.94685223  1.          0.        ]
 [ 2.00150476  2.02165746  1.          0.        ]]
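
The "fit on train, transform on test" rule can be checked by hand with a scaler alone. This is a minimal sketch on hypothetical age values (not the exact split above): the test rows are scaled with the mean and standard deviation learned from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test columns
train = np.array([[27.0], [37.0], [40.0], [48.0]])
test = np.array([[30.0], [50.0]])

scaler = StandardScaler()
scaler.fit(train)                     # parameters are learned from the train set only
test_scaled = scaler.transform(test)  # and reused as-is on the test set

# Manual check: the test rows use the train mean and the train std
manual = (test - train.mean()) / train.std()
print(test_scaled.ravel())
print(manual.ravel())
```

This is why scaled test values do not necessarily have mean 0 and standard deviation 1: they are expressed relative to the training data.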



In [13]:
# Encode target variable Y
labelencoder = LabelEncoder()

print("Encoding labels on train set...")
print(Y_train[0:5])
print()
Y_train = labelencoder.fit_transform(Y_train)
print("...Done.")
print(Y_train[0:5])
print()

print("Encoding labels on test set...")
print(Y_test[0:5])
print()
Y_test = labelencoder.transform(Y_test) # Don't fit again !!
print("...Done.")
print(Y_test[0:5])
print()

Encoding labels on train set...
['Yes', 'Yes', 'Yes', 'No', 'Yes']

...Done.
[1 1 1 0 1]

Encoding labels on test set...
['No', 'No']

...Done.
[0 0]



 ### Model training 🏃

In [14]:
# Train model
model = LogisticRegression()

print("Training model...")
model.fit(X_train, Y_train) # Training is always done on train set !!
print("...Done.")

Training model...
...Done.




 ### Predictions 🔮

In [15]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = model.predict(X_train)
print("...Done.")
print(Y_train_pred[0:5])
print()

Predictions on training set...
...Done.
[1 1 1 0 1]



In [16]:
# Predictions on test set
print("Predictions on test set...")
Y_test_pred = model.predict(X_test)
print("...Done.")
print(Y_test_pred[0:5])
print()

Predictions on test set...
...Done.
[1 1]



 ### Performance Evaluation 💯

 #### **Here the accuracy on the test set is poor because there is not enough data: the test set contains only 2 rows.**

In [17]:
# Print scores
print("Accuracy on training set : ", accuracy_score(Y_train, Y_train_pred))
print("Accuracy on test set : ", accuracy_score(Y_test, Y_test_pred))

Accuracy on training set :  0.875
Accuracy on test set :  0.0
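
Accuracy is the fraction of predictions that match the true labels. The sketch below recomputes the test score by hand from the encoded labels and predictions printed above. It also lists the only values accuracy can take with 2 test rows, which is why the test score here is so coarse. The variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0])  # encoded Y_test printed above
y_pred = np.array([1, 1])  # Y_test_pred printed above

# Fraction of positions where prediction and truth agree
manual_accuracy = (y_true == y_pred).mean()
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)

# With n test rows, accuracy can only be k / n for k = 0..n
possible_scores = [k / len(y_true) for k in range(len(y_true) + 1)]
print(possible_scores)
```

By the same logic, the training accuracy of 0.875 corresponds to 7 correct predictions out of 8 training rows.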


## Resources 📚📚

Introduction Pratique à Python - Antoine Krajnc-Rosenthal & Anais Armandy

Sklearn - [Official scikit-learn documentation](https://scikit-learn.org/stable/)