### RATE CHALLENGE

### Objectifs : 

Le projet peut être découpé en quatre étapes :

Partie 1 : faire une EDA (exploration et analyse data) et les prétraitements et entraîner un modèle de base avec le fichier data_train.csv

Partie 2 : améliorez le score f1 de votre modèle sur votre ensemble de test (vous pouvez essayer l'ingénierie de fonctionnalités, la sélection de fonctionnalités, la régularisation, les modèles non linéaires, l'optimisation d'hyperparamètres par recherche de grille, etc...)

Partie 3 : Une fois que vous êtes satisfait du score de votre modèle, vous pouvez l'utiliser pour faire des prédictions avec le fichier data_test.csv . Vous devrez vider les prédictions dans un fichier .csv qui sera envoyé à Kaggle (en fait, à votre professeur/TA 🤓). Vous pouvez faire autant de soumissions que vous le souhaitez, n'hésitez pas à essayer différents modèles !

Partie 4 : Prenez le temps d'analyser les paramètres de votre meilleur modèle. Existe-t-il des leviers d'action permettant d'améliorer le taux de conversion de la newsletter ? Quelles recommandations feriez-vous à l'équipe ?

#### I) Téléchargement de la librairy Plotly

In [1]:
!pip install plotly



#### II) Import des librairies

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
pio.renderers.default = "iframe_connected" # to be replaced by "iframe" if working on JULIE
from IPython.display import display

#### III) Lire le fichier data_train : 

In [3]:
data = pd.read_csv('conversion_data_train.csv')
print('Set with labels (our train+test) :', data.shape)

Set with labels (our train+test) : (284580, 6)


In [4]:
# Basic stats
print("Number of rows : {}".format(data.shape[0]))
print()

print("Display of dataset: ")
display(data.head())
print()

print("Basics statistics: ")
data_desc = data.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*data.isnull().sum()/data.shape[0])

Number of rows : 284580

Display of dataset: 


Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,China,22,1,Direct,2,0
1,UK,21,1,Ads,3,0
2,Germany,20,0,Seo,14,1
3,US,23,1,Seo,3,0
4,US,28,1,Direct,3,0



Basics statistics: 


Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
count,284580,284580.0,284580.0,284580,284580.0,284580.0
unique,4,,,3,,
top,US,,,Seo,,
freq,160124,,,139477,,
mean,,30.564203,0.685452,,4.873252,0.032258
std,,8.266789,0.464336,,3.341995,0.176685
min,,17.0,0.0,,1.0,0.0
25%,,24.0,0.0,,2.0,0.0
50%,,30.0,1.0,,4.0,0.0
75%,,36.0,1.0,,7.0,0.0



Percentage of missing values: 


country                0.0
age                    0.0
new_user               0.0
source                 0.0
total_pages_visited    0.0
converted              0.0
dtype: float64

##### Analyse du Dataset : 
- Aucunes valeurs manquantes 
- La target (y) converted est une donnée qualitative déjà encodée en 0 et 1
- les features (X) :
    - qualitative : country, source et new_user (déjà encodée en 0 et 1)
    - quantitative : age, total_pages_visited

#### IV) Selectionner une partie du dataset :

In [5]:
# The dataset is quite big : you must create a sample of the dataset before making any visualizations !
data_sample = data.sample(10000)

#### V) Créer un modèle

Choisissez les variables à utiliser dans le modèle et créez les ensembles de formation et de test.
D'après l'EDA, nous savons que la caractéristique la plus utile est : total_pages_visited. 
Créons un modèle de base, en utilisant d'abord seulement cette caractéristique : dans les prochaines cellules, nous ferons des prétraitements et entraînerons une régression logistique simple (univariée).

In [6]:
features_list = ['country','age','new_user','source','total_pages_visited']
target_variable = 'converted'

In [7]:
X = data_sample.loc[:, features_list]
Y = data_sample.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['country', 'age', 'new_user', 'source', 'total_pages_visited'], dtype='object')



In [8]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, stratify = Y)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [9]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.values
Y_test = Y_test.values
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])

Convert pandas DataFrames to numpy arrays...
...Done
[['China' 51 1 'Ads' 6]
 ['China' 29 1 'Direct' 4]
 ['UK' 33 0 'Ads' 4]
 ['UK' 37 0 'Seo' 9]
 ['US' 26 1 'Direct' 4]]
[['US' 29 1 'Ads' 5]
 ['Germany' 24 1 'Seo' 4]]

[0 0 0 0 0]
[0 0]


### Training et Test pipeline

In [10]:
# Put here all the preprocessings
#print("Encoding categorical features and standardizing numerical features...")

#Features qualitative (country, source) et :
categorical_features = [0,3]                 # Positions of categorical columns in X_train/X_test
categorical_transformer = OneHotEncoder(drop='first') 


# Features quantitatives (age, new_user, total_page_visited)
numeric_features = [1,4,2]                     # Positions of numeric columns in X_train/X_test
numeric_transformer = StandardScaler()


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

### Appliquer le pipeline

In [11]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train[0:5,:])
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

Performing preprocessings on train set...
[['China' 51 1 'Ads' 6]
 ['China' 29 1 'Direct' 4]
 ['UK' 33 0 'Ads' 4]
 ['UK' 37 0 'Seo' 9]
 ['US' 26 1 'Direct' 4]]
...Done.
[[ 2.44967081  0.32227374  0.68363106  0.          0.          0.
   0.          0.        ]
 [-0.1873966  -0.26993711  0.68363106  0.          0.          0.
   1.          0.        ]
 [ 0.2920702  -0.26993711 -1.4627773   0.          1.          0.
   0.          0.        ]
 [ 0.771537    1.21059001 -1.4627773   0.          1.          0.
   0.          1.        ]
 [-0.5469967  -0.26993711  0.68363106  0.          0.          1.
   1.          0.        ]]



In [12]:
# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test[0:5,:])
X_test = preprocessor.transform(X_test) 
print('...Done.')
print(X_test[0:5,:])
print()

Performing preprocessings on test set...
[['US' 29 1 'Ads' 5]
 ['Germany' 24 1 'Seo' 4]
 ['US' 31 1 'Seo' 2]
 ['US' 19 0 'Seo' 2]
 ['US' 20 1 'Ads' 4]]
...Done.
[[-0.1873966   0.02616832  0.68363106  0.          0.          1.
   0.          0.        ]
 [-0.78673011 -0.26993711  0.68363106  1.          0.          0.
   0.          1.        ]
 [ 0.0523368  -0.86214795  0.68363106  0.          0.          1.
   0.          1.        ]
 [-1.38606361 -0.86214795 -1.4627773   0.          0.          1.
   0.          1.        ]
 [-1.26619691 -0.26993711  0.68363106  0.          0.          1.
   0.          0.        ]]



### Train modèle

In [13]:
# Train model
print("Train model...")
classifier = LogisticRegression()  
classifier.fit(X_train, Y_train)
print("...Done.")

Train model...
...Done.


### Prediction sur le train set 

In [14]:
# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



### Prediction sur le test set 

In [15]:
# Predictions on training set
print("Predictions on training set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



### Performance assessment

In [16]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

f1-score on train set :  0.7685774946921445
f1-score on test set :  0.7678571428571428


In [17]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

Confusion matrix on train set : 
[[7710   32]
 [  77  181]]

Confusion matrix on test set : 
[[1931    4]
 [  22   43]]



### Essayer d'utiliser le modele sur le dataset au complet

Avant de faire des prédictions sur le fichier conversion_data_test.csv, entraînons notre modèle sur TOUTES les données qui se trouvaient dans conversion_data_train.csv. 

Parfois, cela permet d'apporter de petites améliorations au score car nous utilisons plus d'exemples pour entraîner le modèle.

In [27]:
features_list = ['country','age','new_user','source','total_pages_visited']
target_variable = 'converted'

In [28]:
#Faire l'entrainement sur toutes les données au complet du train
X = data.loc[:, features_list]
Y = data.loc[:, target_variable]

print('Explanatory variables : ', X.columns)
print()

Explanatory variables :  Index(['country', 'age', 'new_user', 'source', 'total_pages_visited'], dtype='object')



In [29]:
# Divide dataset Train set & Test set 
print("Dividing into train and test sets...")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, stratify = Y)
print("...Done.")
print()

Dividing into train and test sets...
...Done.



In [30]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_train = X_train.values
X_test = X_test.values
Y_train = Y_train.values
Y_test = Y_test.values
print("...Done")

print(X_train[0:5,:])
print(X_test[0:2,:])
print()
print(Y_train[0:5])
print(Y_test[0:2])

Convert pandas DataFrames to numpy arrays...
...Done
[['China' 19 1 'Seo' 1]
 ['US' 33 1 'Direct' 5]
 ['US' 51 1 'Ads' 2]
 ['China' 17 0 'Seo' 1]
 ['China' 28 1 'Seo' 5]]
[['UK' 34 1 'Ads' 1]
 ['UK' 32 0 'Ads' 5]]

[0 0 0 0 0]
[0 0]


In [31]:
# Put here all the preprocessings
#print("Encoding categorical features and standardizing numerical features...")

#Features qualitative (country, source) et :
categorical_features = [0,3]                 # Positions of categorical columns in X_train/X_test
categorical_transformer = OneHotEncoder(drop='first') 


# Features quantitatives (age, new_user, total_page_visited)
numeric_features = [1,4,2]                     # Positions of numeric columns in X_train/X_test
numeric_transformer = StandardScaler()


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [32]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train[0:5,:])
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train[0:5,:])
print()

Performing preprocessings on train set...
[['China' 19 1 'Seo' 1]
 ['US' 33 1 'Direct' 5]
 ['US' 51 1 'Ads' 2]
 ['China' 17 0 'Seo' 1]
 ['China' 28 1 'Seo' 5]]
...Done.
[[-1.3990984  -1.15935344  0.67651656  0.          0.          0.
   0.          1.        ]
 [ 0.29299544  0.03743241  0.67651656  0.          0.          1.
   1.          0.        ]
 [ 2.46854467 -0.86015697  0.67651656  0.          0.          1.
   0.          0.        ]
 [-1.64082609 -1.15935344 -1.47816042  0.          0.          0.
   0.          1.        ]
 [-0.31132378  0.03743241  0.67651656  0.          0.          0.
   0.          1.        ]]



In [33]:
# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test[0:5,:])
X_test = preprocessor.transform(X_test) 
print('...Done.')
print(X_test[0:5,:])
print()

Performing preprocessings on test set...
[['UK' 34 1 'Ads' 1]
 ['UK' 32 0 'Ads' 5]
 ['US' 44 1 'Ads' 1]
 ['US' 35 1 'Direct' 1]
 ['US' 29 1 'Direct' 3]]
...Done.
[[ 0.41385929 -1.15935344  0.67651656  0.          1.          0.
   0.          0.        ]
 [ 0.1721316   0.03743241 -1.47816042  0.          1.          0.
   0.          0.        ]
 [ 1.62249775 -1.15935344  0.67651656  0.          0.          1.
   0.          0.        ]
 [ 0.53472314 -1.15935344  0.67651656  0.          0.          1.
   1.          0.        ]
 [-0.19045994 -0.56096051  0.67651656  0.          0.          1.
   1.          0.        ]]



In [34]:
#Attention le modèle est déjà entrainer donc on peut passer à l'étape de la prédiction

# Predictions on training set
print("Predictions on training set...")
Y_train_pred = classifier.predict(X_train)
print("...Done.")
print(Y_train_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



In [35]:
# Predictions on training set
print("Predictions on training set...")
Y_test_pred = classifier.predict(X_test)
print("...Done.")
print(Y_test_pred)
print()

Predictions on training set...
...Done.
[0 0 0 ... 0 0 0]



In [36]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(Y_train, Y_train_pred))
print("f1-score on test set : ", f1_score(Y_test, Y_test_pred))

f1-score on train set :  0.7601641178664678
f1-score on test set :  0.751269035532995


In [37]:
# You can also check more performance metrics to better understand what your model is doing
print("Confusion matrix on train set : ")
print(confusion_matrix(Y_train, Y_train_pred))
print()
print("Confusion matrix on test set : ")
print(confusion_matrix(Y_test, Y_test_pred))
print()

Confusion matrix on train set : 
[[219354    966]
 [  2249   5095]]

Confusion matrix on test set : 
[[54825   255]
 [  578  1258]]



NB : Dans notre cas, le F1 score a légérement baisser mais reste néanmoins stable 

#### III) Lire le fichier data_test : 

In [19]:
# Read data without labels
data_without_labels = pd.read_csv('conversion_data_test.csv')
print('Prediction set (without labels) :', data_without_labels.shape)

Prediction set (without labels) : (31620, 5)


In [20]:
# Basic stats
print("Number of rows : {}".format(data_without_labels.shape[0]))
print()

print("Display of dataset: ")
display(data_without_labels.head())
print()

print("Basics statistics: ")
data_desc1 = data_without_labels.describe(include='all')
display(data_desc1)
print()

print("Percentage of missing values: ")
display(100*data_without_labels.isnull().sum()/data_without_labels.shape[0])

Number of rows : 31620

Display of dataset: 


Unnamed: 0,country,age,new_user,source,total_pages_visited
0,UK,28,0,Seo,16
1,UK,22,1,Direct,5
2,China,32,1,Seo,1
3,US,32,1,Ads,6
4,China,25,0,Seo,3



Basics statistics: 


Unnamed: 0,country,age,new_user,source,total_pages_visited
count,31620,31620.0,31620.0,31620,31620.0
unique,4,,,3,
top,US,,,Seo,
freq,17968,,,15563,
mean,,30.620746,0.685579,,4.870398
std,,8.316736,0.464292,,3.333128
min,,17.0,0.0,,1.0
25%,,24.0,0.0,,2.0
50%,,30.0,1.0,,4.0
75%,,36.0,1.0,,7.0



Percentage of missing values: 


country                0.0
age                    0.0
new_user               0.0
source                 0.0
total_pages_visited    0.0
dtype: float64

##### Analyse du Dataset : 
- Aucunes valeurs manquantes 
- La target (y) converted est à rajouter dans le tableau + c'est une donnée qualitative qui sera à mettre en 1 ou 0
- les features (X) :
    - qualitative : country, source et new_user (déjà encodée en 0 et 1)
    - quantitative : age, total_pages_visited

#### IV) Selectionner une partie du dataset :

In [21]:
# The dataset is quite big : you must create a sample of the dataset before making any visualizations !
data_sample1 = data_without_labels.sample(10000)

#### V) définir les features

In [22]:
# Warning : check consistency of features_list (must be the same than the features 
# used by your best classifier)
features_list = ['country','age','new_user','source','total_pages_visited']

X_without_labels = data_without_labels.loc[:, features_list]

In [23]:
# Convert pandas DataFrames to numpy arrays before using scikit-learn
print("Convert pandas DataFrames to numpy arrays...")
X_without_labels= X_without_labels.values
print("...Done")

print(X_without_labels[0:5,:])

Convert pandas DataFrames to numpy arrays...
...Done
[['UK' 28 0 'Seo' 16]
 ['UK' 22 1 'Direct' 5]
 ['China' 32 1 'Seo' 1]
 ['US' 32 1 'Ads' 6]
 ['China' 25 0 'Seo' 3]]


### Training et Test pipeline

In [24]:
# Put here all the preprocessings
#print("Encoding categorical features and standardizing numerical features...")

#Features qualitative (country, source) :
categorical_features = [0,3]                 # Positions of categorical columns in X_train/X_test
categorical_transformer = OneHotEncoder(drop='first') 


# Features quantitatives (age, total_page_visited, new_user)
numeric_features = [1,4,2]                     # Positions of numeric columns in X_train/X_test
numeric_transformer = StandardScaler()


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

### Appliquer le pipeline

In [25]:
# Preprocessings ON set
print("Performing preprocessings on train set...")
print(X_without_labels[0:5,:])
X_without_labels = preprocessor.fit_transform(X_without_labels)
print('...Done.')
print(X_without_labels[0:5,:])
print()

Performing preprocessings on train set...
[['UK' 28 0 'Seo' 16]
 ['UK' 22 1 'Direct' 5]
 ['China' 32 1 'Seo' 1]
 ['US' 32 1 'Ads' 6]
 ['China' 25 0 'Seo' 3]]
...Done.
[[-0.31512217  3.33913917 -1.47663353  0.          1.          0.
   0.          1.        ]
 [-1.03657046  0.03888347  0.6772161   0.          1.          0.
   1.          0.        ]
 [ 0.16584336 -1.16120951  0.6772161   0.          0.          0.
   0.          1.        ]
 [ 0.16584336  0.33890671  0.6772161   0.          0.          1.
   0.          0.        ]
 [-0.67584631 -0.56116302 -1.47663353  0.          0.          0.
   0.          1.        ]]



### Prediction sur le train et test set 

In [26]:
print("Predictions on training set...")
X_without_labels_train = classifier.predict(X_without_labels_train)
print("...Done.")
print(X_without_labels_train)
print()

Predictions on training set...


NameError: name 'X_without_labels_train' is not defined

In [None]:
### Predictions on test set
print("Predictions on training set...")
X_without_labels_test = classifier.predict(X_without_labels_test)
print("...Done.")
print(X_without_labels_test)
print()

In [None]:
data = {
    'converted': classifier.predict(X_without_labels)
}

Y_predictions = pd.DataFrame(columns=['converted'],data=data)

Y_predictions.to_csv('conversion_data_test_predictions_Margaux_Levesque.csv', index=False)

In [None]:
# WARNING : Use the same score as the one that will be used by Kaggle !
# Here, the f1-score will be used to assess the performances on the leaderboard
print("f1-score on train set : ", f1_score(X_without_labels))

Conclusion : notre modèle n'overfit pas et le score est supérieur à celui de Kaggle (69% vs 75%). 
On a pu améliorer le score en utilisant tous les features du dataset. 