# <font color=#023F7C> **Machine Learning, Explainability and Deep Learning** </font>

<font color=#023F7C>**Hi! PARIS DataBootcamp 2023 🚀**</font> <br>


<img src = https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png width = "300" height = "200" >



**Before you start working on this notebook ⚠️**: <br>
Please download/copy this notebook from `hfactory_magic_folders\course` and drop it into your own directory `my_work` on HFactory. <br>
If you don't, you won't be able to save the modifications you make to this notebook.

**How to work with this notebook? 📝** <br>
Here are some guidelines on how you should work on this notebook during the week. <br>
*You don't need to finish the whole notebook before sending it to us on Friday*
- Wednesday: Work on sections 1 and 2 (Import dataset and Machine Learning)
- Thursday: Finish section 2, then work on sections 3 and 4 (ML, Explainability, Deep Learning)
- Friday morning: Finish the notebook as best you can for the final deliverable at 12:00pm



**Bootcamp deliverables** 💯: <br>
Send us the completed notebook before 12:00pm (noon) on Friday to `data-event@hi-paris.fr`<br>
*Don't forget to also send us the PowerPoint deliverable on Friday*
- Send a single notebook per group. 
- Add the names of the members of your group in your email submission 


**Need help? 🙏** <br>
You can find code examples in this morning's <b>Machine Learning</b> and <b>Optimisation and model evaluation</b> theoretical courses. <br>
If you are really struggling with this section, you can also visit the `Data_Science_crash_course.ipynb` notebook in the Pre-bootcamp folder on HFactory.

## **1. Import libraries and clean dataset**

**Let's start by importing the libraries we used in the two previous notebooks.**
 

In [1]:
import pandas as pd
#import time
import numpy as np
pd.set_option('display.max_columns', None) #Show all columns

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

**Now, let's import `scikit-learn` functions for classification models, data preprocessing and performance metrics.** 

In [2]:
# Classification models
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import ExtraTreesClassifier

# Multiclass classification 
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Preprocessing tools
from sklearn.preprocessing import OneHotEncoder, LabelEncoder,StandardScaler
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split

# Performance metrics
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve, classification_report, explained_variance_score
from sklearn.metrics import make_scorer, mean_absolute_error, mean_absolute_percentage_error

# Improve your model
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV

**Finally, import the `dataset_train_clean.csv` dataset you cleaned/worked on in `Data_Clean.ipynb` and `Dataviz.ipynb`** <br>
*Make sure you name the loaded dataframe `dataset` in the notebook !*


**If you weren't able to save/create `dataset_train_clean.csv` then run the following code.**


In [4]:
path=r'~/hfactory_magic_folders/course/Dataset/dataset_train.csv'
#path=r"dataset_train.csv"

# Import the csv file
dataset = pd.read_csv(path,encoding='latin-1',sep=';')

# Clean the dataframe 
to_drop=['Unnamed: 0','Customer Email','Customer Fname','Customer Lname','Customer Password','Customer Street','Order Zipcode','Product Description', "Sales per customer","Product Card Id"]
dataset=dataset.drop(to_drop,axis=1)
dataset=dataset.dropna().reset_index(drop=True)

**You can drop more columns at this step if you don't think they will be useful in the Machine Learning model.**

## **2. Machine Learning**

**You can choose either `Late_delivery_risk` or `Delivery Status` as the variable to predict.**
- Predict `Late_delivery_risk` to try binary classification (Beginner level)
- Predict `Delivery Status` to try multi-class classification (Intermediate/Advanced level)



**Important information ⚠️**: <br>
If you pick `Late_delivery_risk` then `Delivery Status` should be deleted from the dataset (and vice-versa). <br>
If you need help doing this, the next cell will do it for you.
- Keep `binary_classification=True` if you pick binary classification.
- Change it to `binary_classification=False` if you pick multi-class classification

In [6]:
# Choose your fighter
binary_classification=True

if binary_classification is True:
    dataset=dataset.drop(columns=['Delivery Status'])
    name_label=['Late_delivery_risk']
else:
    dataset=dataset.drop(columns=['Late_delivery_risk'])
    name_label=['Delivery Status_Advance shipping', 'Delivery Status_Late delivery','Delivery Status_Shipping canceled', 'Delivery Status_Shipping on time']

### **2.1 Data Preprocessing**


**Question 1:** <br>**Transform the categorical variables (with fewer than 15 unique values) into numerical variables with OneHotEncoding (OHE).** <br>
*Make sure you don't include `Late_delivery_risk` in the variables to transform with OneHotEncoding (if you are going to do binary classification)* <br>
*You can go to this [page](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for more info on how to use scikit-learn's `OneHotEncoder` class.*




You can run the next cell if you are struggling to select the correct variables to Onehotencode.



In [7]:
## Run the following code if you need help selecting variables to onehotencode

df_continuous=dataset.select_dtypes(include=["float64"])
df_categorical=dataset.select_dtypes(include=["object","int64"])
nb_unique_value_max=15

# List of columns to transform with OneHotEncoding
to_OHE=[key for key in df_categorical.keys() if len(df_categorical[key].drop_duplicates())<nb_unique_value_max] 

# Remove Late_delivery_risk or Delivery Status from list of columns to transform with OHE
if "Late_delivery_risk" in to_OHE:
    to_OHE.remove("Late_delivery_risk")
    
if "Delivery Status" in to_OHE:
    to_OHE.remove("Delivery Status")

**Concatenate the OneHotencoded variables/dataframe with the original dataframe using `pd.concat([...],axis=1)`.<br>**
*Make sure to drop the columns you transformed with OneHotEncoding in the concatenated dataframe*

**Question 2:** <br>
**Transform the remaining categorical variables (those with more than 15 unique values) using `LabelEncoder`**. <br>
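If you need help selecting these variables, you can adapt the selection code from Question 1 by reversing the comparison (keep the columns whose number of unique values is *not* below `nb_unique_value_max`).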



*Note: `LabelEncoder` can only transform one column at a time, whereas `OneHotEncoder` can directly transform multiple columns.* <br>
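Because `LabelEncoder` works on one column at a time, a loop over the selected columns is one possible pattern. The sketch below uses a toy dataframe with hypothetical columns (`city`, `product`, `price`) and a hypothetical threshold of 2 unique values so that the example stays small; the notebook itself uses 15.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy dataframe with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice"],
    "product": ["bike", "tent", "bike", "shoes"],
    "price": [250.0, 99.9, 250.0, 60.0],
})

max_unique = 2  # hypothetical threshold for this toy example

# Select the non-numeric columns that have more unique values than the threshold
to_label_encode = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) and toy[col].nunique() > max_unique
]

encoders = {}
for col in to_label_encode:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])  # replace the strings with integer codes
    encoders[col] = le                     # keep the encoder to decode values later

print(toy)
```

Storing each fitted encoder in a dictionary makes it possible to map codes back to the original labels with `encoders[col].inverse_transform(...)`.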

**Question 3 (Bonus)**: <br>
**Try other data preprocessing methods on the data (StandardScaler, MinMaxScaler...).**<br>
*Scale continuous feature variables (with a float type), not categorical*
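As a minimal sketch of scaling only the float columns, assuming a toy dataframe with hypothetical columns (`price`, `discount`, `category_id`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy dataframe with hypothetical columns (NOT the bootcamp dataset)
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "discount": [0.0, 0.1, 0.2, 0.3],
    "category_id": [1, 2, 1, 3],  # integer-coded category, left unscaled
})

# Only the float columns are treated as continuous features here
float_cols = toy.select_dtypes(include=["float64"]).columns

scaler = StandardScaler()
# Each float column is centered to mean 0 and rescaled to unit variance
toy[float_cols] = scaler.fit_transform(toy[float_cols])
print(toy)
```

`MinMaxScaler` can be swapped in with the same `fit_transform` call. A stricter workflow would fit the scaler on the training split only and reuse it on the test split, which avoids leaking test-set statistics into training.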

**Question 4**: <br>
**Create a `y` variable with the target variable in the dataset you've chosen to predict.** <br>

**Create a `X` variable with the remaining feature variables of your dataset.** <br>


**Question 5**: <br>
**Split X and y into training and validation/test sets using scikit-learn's `train_test_split()` function. <br>**

*Note: Add `stratify=y` so that both splits keep the same class proportions as `y`, and `random_state=42` so that the split is reproducible*
- *The training set will be used to train/fit our data to the model*
- *The test/validation set will be used to quantify the performance of the model on new data*
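The effect of `stratify` can be checked on a small synthetic example. The arrays and the `test_size=0.2` value below are assumptions made for illustration; the notebook does not prescribe a test size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 75% of class 0 and 25% of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # hypothetical choice: 20% of the rows go to the test set
    stratify=y_toy,    # keep the 75/25 class ratio in both splits
    random_state=42,   # make the split reproducible
)

print("train positives:", y_tr.mean(), "| test positives:", y_te.mean())
```

Both splits end up with 25% positives, which is what stratification is meant to guarantee when classes are imbalanced.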

### **2.2 Train/fit a Machine Learning model**
Now that our dataset is clean and has the right format, we can use it to train Machine Learning models.





**Question 6**: <br>
**Train Logistic Regression, DecisionTree and Random forest models using scikit-learn's `.fit()` method. <br>**
- Logistic Regression in scikit-learn: `LogisticRegression()` 
- Decision Tree in scikit-learn: `DecisionTreeClassifier()`
- Random Forest in scikit-learn: `RandomForestClassifier()`



**Question 7**: <br>
**Compute the predicted class probabilities of the trained models with `.predict_proba()`**. 
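The fit and predict-probability pattern is the same for all three models, so it can be written as a loop. The sketch below runs on synthetic data from `make_classification`; the hyperparameters (`max_iter`, `max_depth`, `n_estimators`) are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for X and y)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Hyperparameters below are illustrative assumptions
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

probas = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # predict_proba returns one column per class; column 1 is P(class == 1)
    probas[name] = model.predict_proba(X_te)[:, 1]
    print(name, probas[name][:3].round(3))
```

For binary classification, keeping only column 1 gives the score expected by most metrics. For multi-class targets the full probability matrix is usually kept instead.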


**Question 8 (Bonus)**: <br>
**Now do the same (train and predict proba) with two other classification models.** <br>

### **2.3 Evaluate/test the performance of the models**

**Question 9**: <br>
**Compute the AUC score of the trained models with `roc_auc_score()`**.
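A tiny hand-checkable example shows what `roc_auc_score()` expects: the true labels and the predicted scores (probabilities of the positive class), not the hard predictions. The values below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs ranked in the right order:
# here 3 of the 4 pairs are correctly ordered
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

For a multi-class target, `roc_auc_score` accepts the full probability matrix together with the `multi_class="ovr"` (or `"ovo"`) parameter.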

**If the AUC scores of your models are low, here are some ideas for improving them:**
- **Retrain the models from Questions 6 and 7 with different parameters.** <br>
- **Drop columns from the dataset or include more of them.**
- **Scale the dataset with `StandardScaler` or `MinMaxScaler`.**
- **Search for a model's optimal parameters with `GridSearchCV`, `RandomizedSearchCV` or `cross_val_score` (Section 2.4).**

**Question 10**: <br>
**Display the ROC curve of the models you've trained with `roc_curve()` and add their AUC score on the plot.** <br>
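One possible way to draw a ROC curve with its AUC in the legend is sketched below for a single model on synthetic data. The model choice and plot styling are assumptions; for several models, the `ax.plot(...)` line can simply be repeated inside a loop.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for X and y)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

# roc_curve returns the false/true positive rates for each decision threshold
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend(loc="lower right")
```

The dashed diagonal is the curve of a random classifier (AUC of 0.5), which gives a visual baseline for the trained models.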


**Question 11**: <br>
**Create a confusion matrix for the two models with the best AUC score.** <br>
**Then, use `sns.heatmap()` to plot the confusion matrix**. <br>

*You need to compute the prediction of the model with `.predict()` to compute the confusion matrix*
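The layout of the matrix returned by `confusion_matrix()` is easy to verify on made-up predictions: rows are the true classes and columns are the predicted classes. The labels below are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true labels and hard predictions (as returned by .predict())
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Label rows (true classes) and columns (predicted classes) for readability
cm_df = pd.DataFrame(
    cm,
    index=["true 0", "true 1"],
    columns=["predicted 0", "predicted 1"],
)
print(cm_df)
```

Passing `cm_df` to `sns.heatmap(cm_df, annot=True, fmt="d")` then draws the matrix with the counts written in each cell.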

### **2.4 Upgrade your model! (Bonus)**

**Use Grid Search, Cross validation and/or Random Search to find the optimal parameters for your model(s).** <br>
*Don't use these optimization methods on every model you've tested, just try them on the best model you've tested.* <br>
*For `GridSearchCV`, use the parameter `scoring='roc_auc'` to use the AUC score as the performance metric.*
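A minimal grid search with AUC as the scoring metric could look like the sketch below. The model, the parameter grid and `cv=5` are assumptions chosen to keep the example fast; they are not tuned for the bootcamp dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for X_train and y_train)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grid: 3 x 2 = 6 parameter combinations
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",  # rank the combinations by cross-validated AUC
    cv=5,               # 5-fold cross validation
)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on all the data with the best combination, so it can be used directly with `.predict()` or `.predict_proba()`. `RandomizedSearchCV` follows the same pattern but samples a fixed number of combinations, which scales better to large grids.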



## **3. Explainability with shap**

The `shap` library (SHapley Additive exPlanations) is a Python library used for explaining the output of machine learning models. <br> It provides a unified framework for interpreting complex models and understanding the contributions of individual features to model predictions. <br> 

Shap is particularly useful for understanding black-box models like boosting, random forests, and deep neural networks, among others. <br>
It can also be used with any classification model.
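To make the idea of a feature "contribution" concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written model. It is only an illustration of the underlying definition: the `shap` library relies on much faster model-specific algorithms, and the model, the instance `x` and the all-zero `baseline` used for "missing" features are assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def model(a, b, c):
    """Hypothetical model: a simple linear function of three features."""
    return 2 * a + 3 * b - c


x = (1.0, 2.0, 3.0)         # instance to explain
baseline = (0.0, 0.0, 0.0)  # values used for features that are "absent"


def value_fn(subset):
    """Model output when only the features in `subset` take their real value."""
    inputs = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return model(*inputs)


def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over every subset of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


phi = shapley_values(value_fn, len(x))
print(phi)
```

For this linear model each value is simply the coefficient times the feature's distance from the baseline, and the values add up to the difference between the prediction for `x` and the prediction for the baseline. That additivity is the property the "Additive" in SHapley Additive exPlanations refers to.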

**Let's install and import the shap library.**

In [28]:
!pip install shap



In [29]:
import shap
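# Compatibility shim: some shap releases still reference the np.bool alias, which NumPy 1.24 removed
np.bool = bool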

Computing shap values is expensive and can take a long time. <br> 
To reduce computing time, we'll work on the **first 1000 rows** of X_train. 

In [30]:
dataset_shap = X_train[:1000]

**Question 10**: <br>
**Create an object that can compute the shap values using `shap.Explainer`** <br>


**Now, compute the shap values of a trained model with `.shap_values()`.** <br>
Save the computed shap values in a `shap_values` variable.

**Question 11**: <br>
**Create multiple shap summary plots using `shap.summary_plot()`**:
- Create the first plot with `shap_values` 
- Create the second plot with `shap_values[0]` and `plot_type="dot"` 

**Question 12 (Bonus)**: <br>
**Display other shap plots**

## **4. Deep Learning (Bonus)**

**We will start by importing one of Python's Deep Learning libraries `tensorflow`/`keras`.**

In [66]:
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

**Now, run the following cells to prepare the data to train a Deep Learning model.** <br>
*`dataset` should be the dataframe you transformed with data pre-processing (Onehotencoded, LabelEncoder,...).* 

In [67]:
dataset_DL = dataset.copy()

In [68]:
col_names = list(dataset_DL.keys())
col_names_replace = [sub.replace(" ", "_") for sub in col_names]
col_names_replace = [sub.replace("(", "_") for sub in col_names_replace]
col_names_replace = [sub.replace(")", "") for sub in col_names_replace]
col_names_replace[-5:]

['Order_Item_Profit_Ratio',
 'Sales',
 'Order_Item_Total',
 'Order_Profit_Per_Order',
 'Product_Price']

In [69]:
# Map each original column name to its cleaned version
dict_map = dict(zip(col_names, col_names_replace))

In [70]:
dataset_DL.columns = dataset_DL.columns.map(dict_map)

In [71]:
name_label='Late_delivery_risk'
train, test = train_test_split(dataset_DL, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
X_dataset=dataset_DL.drop([name_label],axis=1)

86653 train examples
21664 validation examples
27080 test examples


In [72]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop(name_label)
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [73]:
feature_columns = []
# numeric cols
for header in X_dataset.keys():
  feature_columns.append(feature_column.numeric_column(header))

In [74]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [75]:
train, test = train_test_split(dataset_DL, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)

In [76]:
batch_size = 32 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

**Question 14** <br>
**Make a small neural network model using `tensorflow`/`keras`, and print the accuracy**

*Note: You can use the following elements to build, train and evaluate the neural network* <br>
- *`tf.keras.Sequential`*
- *`layers.Dense(INTEGER, activation='relu')`*
- *`tf.keras.losses.BinaryCrossentropy`*
- *`model.compile(optimizer='adam', ...)`* 
- *`model.fit`* with `epochs` ~= 10
- *`model.evaluate`*
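The elements listed above fit together roughly as in the sketch below. It is a minimal example on synthetic NumPy arrays, not the notebook's solution: the layer sizes, the synthetic data and the sigmoid output layer (paired with the default `BinaryCrossentropy()`, which expects probabilities) are all assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Synthetic binary classification data (stand-in for the prepared datasets)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(1000, 8)).astype("float32")
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 0).astype("float32")

X_train_syn, y_train_syn = X_syn[:800], y_syn[:800]
X_test_syn, y_test_syn = X_syn[800:], y_syn[800:]

# Small fully connected network; layer sizes are illustrative assumptions
model = tf.keras.Sequential([
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # outputs P(class == 1)
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(),  # expects probabilities
    metrics=["accuracy"],
)

# The number of epochs is an argument of fit, not of evaluate
history = model.fit(
    X_train_syn, y_train_syn,
    validation_split=0.2,
    epochs=10,
    batch_size=32,
    verbose=0,
)

loss, accuracy = model.evaluate(X_test_syn, y_test_syn, verbose=0)
print(f"Test accuracy: {accuracy:.3f}")
```

With the objects prepared in this notebook, `feature_layer` would presumably be placed as the first layer of the `Sequential` model, and `train_ds`, `val_ds` and `test_ds` would replace the arrays (`model.fit(train_ds, validation_data=val_ds, epochs=10)` and `model.evaluate(test_ds)`).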