# Conversion Prediction : Classification
>In this second phase of the project, we will try to:  

>> find out which ML technique best suits the data and the prediction goal, and   
make predictions using that technique. 

### Table of Contents

* [1. Data Preparation](#section1)
    * [1.1. Load Data](#section21)
    * [1.2. Predictors and Target](#section21)
    * [1.3. Training and Validation sets](#section22)
    * [1.4. Preprocessing pipeline](#section23)
* [2. Classification](#section2)
    * [2.1. Preliminary Analysis](#section21)
        * [2.1.1. Statsmodels logit](#section21)
        * [2.1.2. Preliminary model selection](#section22)
    * [2.2. Logistic Regression](#section23)
        * [2.2.1. Model Evaluation](#section24)
    * [2.3. XGBoost Classifier](#section25)
        * [2.3.1. Model Selection](#section26)
        * [2.3.2. Model Evaluation](#section27)
    * [2.4. Train the chosen models on the whole dataset](#section25)
* [3. Predict the target of the test set](#section2)

#### Import useful modules ⬇️⬇️ and global parameters

In [4]:
# generic libs
import os
import pandas as pd
from numpy import append
from time import time

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# ML tools
# pre_training tools
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# training tools
import statsmodels.api as sm 

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# predefined modules
from modules import MyFunctions as MyFunct

# Global parameters 
train_filepath = 'data/conversion_data_train.csv'
test_filepath = 'data/conversion_data_test.csv'
results_path = 'results/'
output_path = 'output/'

# create the results and output folders if they do not exist yet
for path in (results_path, output_path):
    os.makedirs(path, exist_ok=True)

seed = 0
cv = 100

# Data Preparation

## Load data

In [5]:
print("Loading dataset...")
dataset = pd.read_csv(train_filepath)
print("...Done.")
print()

Loading dataset...
...Done.



## Predictors and Target

In [6]:
# Separate target variable y from features X
y = dataset['converted']
X = dataset.drop('converted', axis = 'columns')

## Training and Validation sets

🗒 **_Stratify_**: with **stratify = y** (that is, `dataset['converted']`), `train_test_split` draws observations so that each class appears in the training and validation sets in the same proportion as in the full dataset.

In [7]:
# Divide dataset 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify = y)

# Convert pandas DataFrames to numpy arrays before using scikit-learn
X_train = X_train.values
X_test = X_test.values
y_train = y_train.tolist()
y_test = y_test.tolist()
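
As a quick check of what `stratify` does, the sketch below splits a synthetic, imbalanced label vector and compares the share of positives in each part. The data is invented for illustration (the 3% positive rate is an assumption chosen to resemble the imbalance of this dataset), and the `_demo` names are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumption: 3% positives), not the project data
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify=y_demo, both parts keep (almost exactly) the original share of positives
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("valid:", y_va.mean())
```

Without `stratify`, the share of positives in a small validation set can drift noticeably from one random split to another, which makes metrics such as the f1 score harder to compare.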

## Preprocessing pipeline

>🗒 The dataset mixes quantitative and qualitative predictors. Hence, we must define a different preprocessing pipeline for each category.
>> 1. we will **standardize** the numerical data before training so that differences in scale do not distort the learning phase.
>> 2. we will **encode** categorical predictors using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.

In [8]:
# Create pipeline for numeric features 
# num_X = ['age', 'total_pages_visited'] (column positions are used because X_train is a numpy array)
num_X = [1,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Create pipeline for categorical features
# cat_X = ['country', 'new_user', 'source']
cat_X = [0,2,3]
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first'))
])

# Use ColumnTransformer to make a preprocessor object 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_X),
        ('cat', categorical_transformer, cat_X)
    ])

# Preprocessing on the train set (8 cols = 2 for numeric columns + 1 for new_user + 3 for country + 2 for source)
X_train = preprocessor.fit_transform(X_train)
X_test  = preprocessor.transform(X_test)
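
To make the column count concrete, here is a minimal sketch of the same preprocessing applied to a toy DataFrame. The rows are invented, and the baseline category names (`China`, `Ads`) are assumptions: the notebook only shows the categories that remain after `drop='first'`. Column names are used here instead of positions because the toy data stays a DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (invented values) with the same column layout as the dataset
toy = pd.DataFrame({
    "country": ["China", "Germany", "UK", "US"],
    "age": [22, 35, 28, 41],
    "new_user": [1, 0, 1, 0],
    "source": ["Ads", "Direct", "Seo", "Ads"],
    "total_pages_visited": [2, 14, 5, 8],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "total_pages_visited"]),
    ("cat", OneHotEncoder(drop="first"), ["country", "new_user", "source"]),
])

toy_X = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 3 country + 1 new_user + 2 source dummies = 8
print(toy_X.shape)
```

Dropping the first level of each categorical variable avoids perfectly collinear dummy columns, which matters for the unregularized statsmodels logit fitted later.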

# Classification

## Preliminary Analysis

### Statsmodels logit

🗒 **_Statsmodels_**: we start with a preliminary analysis using the Statsmodels logit function, which gives a detailed summary of a regression model, to confirm what we noticed in the EDA part.

In [9]:
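# names of the one-hot encoded columns
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on recent versions)
cols = preprocessor.transformers_[1][1].named_steps['encoder'].get_feature_names().tolist()
columns = ['const', 'age', 'total_pages_visited'] + cols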

In [10]:
X2 = sm.add_constant(X_train)

logit = sm.Logit(y_train,X2)

logit_fit = logit.fit()

logit_fit.summary(xname=columns)

Optimization terminated successfully.
         Current function value: 0.040482
         Iterations 11


| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 227664 |
| Model: | Logit | Df Residuals: | 227655 |
| Method: | MLE | Df Model: | 8 |
| Date: | Thu, 14 Apr 2022 | Pseudo R-squ.: | 0.7159 |
| Time: | 06:56:11 | Log-Likelihood: | -9216.3 |
| converged: | True | LL-Null: | -32443.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -8.9262 | 0.149 | -59.801 | 0.000 | -9.219 | -8.634 |
| age | -0.5954 | 0.023 | -25.768 | 0.000 | -0.641 | -0.550 |
| total_pages_visited | 2.5540 | 0.025 | 103.431 | 0.000 | 2.506 | 2.602 |
| x0_Germany | 3.7701 | 0.155 | 24.288 | 0.000 | 3.466 | 4.074 |
| x0_UK | 3.6049 | 0.141 | 25.575 | 0.000 | 3.329 | 3.881 |
| x0_US | 3.2286 | 0.137 | 23.625 | 0.000 | 2.961 | 3.496 |
| x1_1 | -1.6935 | 0.042 | -40.495 | 0.000 | -1.775 | -1.612 |
| x2_Direct | -0.1985 | 0.057 | -3.462 | 0.001 | -0.311 | -0.086 |
| x2_Seo | -0.0450 | 0.047 | -0.960 | 0.337 | -0.137 | 0.047 |


****************************************************************  
> 🗒 **_Notation_**: **x0_**: country, **x1_1**: new_user, **x2_**: source
                  
> 🗒 **Statistical Significance (P>|z|)**: all variables are significant at the 5% level except x2_Seo (p = 0.337). This is expected: the different values of the original `source` variable influence user conversion in much the same way, so they carry little distinguishing information.

> 🗒 **Predictors Importance (coef)**: ranking the predictors by the absolute value of their regression coefficients gives the following order of importance:     
**Country, total_pages_visited, new_user, age, Source**

> 🗒 **Pseudo R-squared**: 0.7159
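
One way to read the coefficients is to convert them to odds ratios with `exp(coef)`. The sketch below only reuses the values reported in the summary; it is an interpretation aid, not part of the original analysis. Since the numeric predictors were standardized, their ratios correspond to a change of one standard deviation, while the ratios of the dummy variables are relative to the dropped baseline category.

```python
import math

# Coefficients copied from the statsmodels summary above
coefs = {
    "age": -0.5954,
    "total_pages_visited": 2.5540,
    "x0_Germany": 3.7701,
    "x0_UK": 3.6049,
    "x0_US": 3.2286,
    "x1_1": -1.6935,
    "x2_Direct": -0.1985,
    "x2_Seo": -0.0450,
}

# exp(coef) is the multiplicative change in the odds of conversion
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

# Print from the largest to the smallest absolute coefficient
for name in sorted(coefs, key=lambda n: abs(coefs[n]), reverse=True):
    print(f"{name:>20}: odds x {odds_ratios[name]:.2f}")
```

A ratio above 1 raises the odds of conversion and a ratio below 1 lowers them, so `x1_1` (new users) and `age` both push the odds down.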

### Preliminary Model Selection

In this part, we look for the most suitable classification technique for our problem. We first run a preliminary performance evaluation to get some insight into which classification techniques can solve the current prediction problem efficiently. We evaluate the baseline performance of each technique with the default settings proposed by the ML library **sklearn**. We will check **9** algorithms:    

1) Baseline or Dummy Classifier: used as a reference to evaluate the effectiveness of the different algorithms.  
2) Logistic Regression  
3) Support Vector Classifier  
4) Naive Bayes   
5) Decision Tree Classifier   
6) Random Forest Classifier   
7) AdaBoost Classifier   
8) Gradient Boosting Classifier   
9) XGBoost Classifier   

> 🗒 As the dataset is **highly imbalanced**, the **_accuracy_score_** is not very informative for evaluating the algorithms. Instead, we use the **_f1_score_**, which balances precision and recall, and the **_roc_auc_score_**, which measures the ability of a model to separate the classes.
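
The point about accuracy can be illustrated with a small sketch on synthetic labels (an assumption: about 3% positives, not the project data). A "model" that always predicts the majority class gets a high accuracy while its f1 score and ROC AUC show that it has learned nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic labels (assumption: 3% positives), not the project data
y_true = np.array([1] * 30 + [0] * 970)

# Always predict the majority class, with a constant score for every row
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # no positive predictions
auc = roc_auc_score(y_true, y_score)

print(f"accuracy = {acc:.3f}, f1 = {f1:.3f}, roc_auc = {auc:.3f}")
```

This is the behaviour expected from `DummyClassifier(strategy='most_frequent')`, which is why it serves as the reference in the comparison.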

In [11]:
classifiers = [
    DummyClassifier(strategy='most_frequent'),
    LogisticRegression(),
    SVC(probability=True),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(random_state = seed),
    AdaBoostClassifier(random_state = seed),
    GradientBoostingClassifier(random_state = seed),
    XGBClassifier(objective = 'binary:logistic') 
]

In [12]:
iterables = [["Accuracy", "f1_score", "roc_auc_score"], ["Train", "Test"]]
ind = pd.MultiIndex.from_product(iterables)

metrics = pd.DataFrame(index = ind)
for clf in classifiers:
    name = str(clf).split('(')[0]
    scores = MyFunct.learn(clf, X_train, y_train, X_test, y_test, name)
    metrics[name] = scores
    
metrics

fitting DummyClassifier is launched
fitting DummyClassifier is done in 0.027981042861938477 s
fitting LogisticRegression is launched
fitting LogisticRegression is done in 1.015674114227295 s
fitting GaussianNB is launched
fitting GaussianNB is done in 0.10459709167480469 s
fitting DecisionTreeClassifier is launched
fitting DecisionTreeClassifier is done in 0.5917823314666748 s
fitting RandomForestClassifier is launched
fitting RandomForestClassifier is done in 15.229117631912231 s
fitting AdaBoostClassifier is launched
fitting AdaBoostClassifier is done in 6.393838167190552 s
fitting GradientBoostingClassifier is launched
fitting GradientBoostingClassifier is done in 21.827544927597046 s
fitting XGBClassifier is launched
fitting XGBClassifier is done in 13.220532894134521 s


| Metric | Set | DummyClassifier | LogisticRegression | GaussianNB | DecisionTreeClassifier | RandomForestClassifier | AdaBoostClassifier | GradientBoostingClassifier | XGBClassifier |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | Train | 0.967742 | 0.986331 | 0.978697 | 0.98861 | 0.98861 | 0.985931 | 0.986282 | 0.986941 |
| Accuracy | Test | 0.967742 | 0.985769 | 0.978705 | 0.983695 | 0.984117 | 0.985329 | 0.985435 | 0.985259 |
| f1_score | Train | 0.0 | 0.765451 | 0.69812 | 0.802619 | 0.807627 | 0.757587 | 0.764355 | 0.777253 |
| f1_score | Test | 0.0 | 0.755435 | 0.694556 | 0.717073 | 0.732227 | 0.748115 | 0.750827 | 0.749178 |
| roc_auc_score | Train | 0.5 | 0.986005 | 0.979144 | 0.994434 | 0.994135 | 0.985453 | 0.985247 | 0.989084 |
| roc_auc_score | Test | 0.5 | 0.985336 | 0.978506 | 0.921006 | 0.949716 | 0.98456 | 0.983951 | 0.984092 |
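
The helper `MyFunct.learn` comes from a project module that is not shown in this notebook. The sketch below is a hypothetical stand-in, written only from how the helper is called and from the log lines it prints: it fits the classifier, times the fit, and returns six scores in the order of the `MultiIndex` rows. The name `learn_sketch` and the synthetic demo data are assumptions.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def learn_sketch(clf, X_train, y_train, X_test, y_test, name):
    """Hypothetical stand-in for MyFunct.learn: fit, time and score a classifier."""
    print(f"fitting {name} is launched")
    t0 = time()
    clf.fit(X_train, y_train)
    print(f"fitting {name} is done in {time() - t0} s")

    scores = {}
    for part, X_part, y_part in (("Train", X_train, y_train), ("Test", X_test, y_test)):
        y_pred = clf.predict(X_part)
        # Probability of the positive class, needed for the ROC AUC
        y_proba = clf.predict_proba(X_part)[:, 1]
        scores[("Accuracy", part)] = accuracy_score(y_part, y_pred)
        scores[("f1_score", part)] = f1_score(y_part, y_pred, zero_division=0)
        scores[("roc_auc_score", part)] = roc_auc_score(y_part, y_proba)

    # Same row order as pd.MultiIndex.from_product(iterables)
    metric_names = ("Accuracy", "f1_score", "roc_auc_score")
    return [scores[(m, part)] for m in metric_names for part in ("Train", "Test")]


# Demo on synthetic, imbalanced data (not the project data)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
demo_scores = learn_sketch(LogisticRegression(), X_tr, y_tr, X_te, y_te, "LogisticRegression")
print(demo_scores)
```

Using `predict_proba` is the reason a classifier such as `SVC` has to be created with `probability=True` in this kind of loop.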


In [13]:
# save the results of this preliminary analysis
metrics.to_csv(results_path+'preliminary_analysis.csv')

> 🗒 In this preliminary analysis, almost all algorithms give close scores, but the **Logistic regression** and **XGBoost** classifiers are among the best of the algorithms evaluated (the recorded output contains no results for the Support Vector Classifier). Hence, we will keep these 2 algorithms for **hyperparameter tuning**.

## Logistic Regression

> 🗒 Logistic regression does not really have any critical hyperparameters to tune. We will not tune the regularization (scikit-learn's `LogisticRegression()` applies an L2 penalty with `C=1.0` by default, and we keep that default) because the preliminary analysis done with statsmodels logit did not reveal any anomaly. However, it is useful to evaluate its mean performance using the **k-fold cross validation** technique.

### Model Evaluation

In [31]:
scorings = ['accuracy', 'f1', 'roc_auc']
for s in scorings:
    print(f'\n************** Metric : {s} score ******************\n')
    scores = MyFunct.model_validation(LogisticRegression(), X_train, y_train, cv = cv, scoring = s)
    print(f"Classifier : {scores[0]} \nMean : {scores[1]} \nStd : {scores[2]}")


************** Metric : accuracy score ******************

fitting LogisticRegression is done in 31.844318866729736s
Classifier : LogisticRegression 
Mean : 0.9863263837272396 
Std : 0.0023254987432484654

************** Metric : f1 score ******************

fitting LogisticRegression is done in 31.69506049156189s
Classifier : LogisticRegression 
Mean : 0.7644634617786664 
Std : 0.04540174813269441

************** Metric : roc_auc score ******************

fitting LogisticRegression is done in 31.698376655578613s
Classifier : LogisticRegression 
Mean : 0.9859623680576236 
Std : 0.005786422693895489
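
`MyFunct.model_validation` is also defined outside this notebook. Judging from its arguments and from the printed triple (classifier name, mean, standard deviation), it may look roughly like the sketch below, which relies on scikit-learn's `cross_val_score`. The function name `model_validation_sketch`, the default arguments and the synthetic demo data are assumptions.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def model_validation_sketch(clf, X, y, cv=5, scoring="f1"):
    """Hypothetical stand-in for MyFunct.model_validation."""
    name = type(clf).__name__
    t0 = time()
    # One score per fold; a classifier with an integer cv uses stratified folds
    scores = cross_val_score(clf, X, y, cv=cv, scoring=scoring)
    print(f"fitting {name} is done in {time() - t0}s")
    return name, scores.mean(), scores.std()


# Demo on synthetic data (not the project data), with far fewer folds than cv = 100
X_demo, y_demo = make_classification(n_samples=600, weights=[0.8], random_state=0)
name, mean, std = model_validation_sketch(
    LogisticRegression(), X_demo, y_demo, cv=5, scoring="roc_auc"
)
print(f"Classifier : {name} \nMean : {mean} \nStd : {std}")
```

With `cv = 100`, each validation fold holds only about 2,277 of the 227,664 training rows, which probably explains why the f1 score varies much more across folds than accuracy or ROC AUC.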


## XGBoost Classifier

> 🗒 The XGBoost classifier has many **hyperparameters** to tune: about 30 [hyperparameters](https://xgboost.readthedocs.io/en/latest/parameter.html). Given the computation constraints, we will only tune the most important ones using the **GridSearchCV** technique. Our choice was inspired by the [Dataiku](https://blog.dataiku.com/narrowing-the-search-which-hyperparameters-really-matter) study of the most important hyperparameters of well-known algorithms. Better results can be obtained with further tuning.

### Model Selection

In [None]:
estimator = XGBClassifier(objective = 'binary:logistic', seed =seed)
params = {'learning_rate':[ 0.01, 0.1, 0.2],
          "n_estimators": [5, 10, 50, 100],
          'max_depth' : [2, 4, 6, 8]}
scoring ='f1'

best_estimator = MyFunct.model_selection(estimator, X_train, y_train, X_test, y_test, params, scoring)

Output:         
cv =  PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0]))        
Fitting 1 folds for each of 48 candidates, totalling 48 fits       
Tuning XGBClassifier hyperparameters is done in 692.7736556529999s           
          
Best Estimator           
Best Params         
       
{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100}           
Best score          
         
0.7488721804511278           
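
The output shows a `PredefinedSplit` with a single fold, which suggests that `MyFunct.model_selection` scores every candidate on the validation set instead of running a k-fold search. The sketch below is a hypothetical reconstruction of that idea, not the project's code. To stay light it tunes a `GradientBoostingClassifier` on synthetic data instead of `XGBClassifier`; the function name, the small grid and the data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split


def model_selection_sketch(estimator, X_train, y_train, X_val, y_val, params, scoring):
    """Hypothetical stand-in for MyFunct.model_selection."""
    X_all = np.vstack([X_train, X_val])
    y_all = np.concatenate([y_train, y_val])
    # -1: row is always used for training; 0: row belongs to the single validation fold
    test_fold = np.concatenate([
        np.full(len(y_train), -1),
        np.zeros(len(y_val), dtype=int),
    ])
    cv = PredefinedSplit(test_fold=test_fold)
    search = GridSearchCV(estimator, params, scoring=scoring, cv=cv, refit=True)
    search.fit(X_all, y_all)
    print("Best Params", search.best_params_)
    print("Best score", search.best_score_)
    # With refit=True the best estimator is refitted on training + validation rows
    return search.best_estimator_


# Demo on synthetic data with a small grid (not the project data or grid)
X_demo, y_demo = make_classification(n_samples=400, weights=[0.8], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
demo_params = {"n_estimators": [10, 20], "max_depth": [2, 3]}
best_demo = model_selection_sketch(
    GradientBoostingClassifier(random_state=0), X_tr, y_tr, X_va, y_va, demo_params, "f1"
)
```

A single predefined fold keeps the search cheap (48 fits in the run above), at the cost of a best score that depends on one particular validation set.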

### Model Evaluation

In [None]:
scorings = ['accuracy', 'f1', 'roc_auc']
for s in scorings:
    print(f'\n************** Metric : {s} score ******************\n')
    scores = MyFunct.model_validation(best_estimator, X_train, y_train, cv = cv, scoring = s)
    print(f"Classifier : {scores[0]} \nMean : {scores[1]} \nStd : {scores[2]}")

output:   
************** Metric : accuracy score ******************   
fitting XGBClassifier is done in 74.6988594532013s   
Classifier : XGBClassifier    
Mean : 0.9863395178860075    
Std : 0.00045854191434234075   
  
************** Metric : f1 score ******************    
fitting XGBClassifier is done in 73.22126317024231s     
Classifier : XGBClassifier        
Mean : 0.7662324256302723       
Std : 0.007874064578495559        
        
************** Metric : roc_auc score ******************        
fitting XGBClassifier is done in 70.36932730674744s      
Classifier : XGBClassifier       
Mean : 0.985375130047026     
Std : 0.0008332235597007553     

## Train the chosen models on the whole dataset

In [36]:
# train the model on the whole data
X1 = append(X_train,X_test,axis=0)
y1 = append(y_train,y_test)

lr_model = LogisticRegression()
t0= time()
lr_model.fit(X1, y1)
name = str(lr_model).split('(')[0]
print(f'fitting {name} is done in {time() - t0}s')

xgb_model = XGBClassifier(objective = 'binary:logistic', seed =seed, learning_rate = 0.1, max_depth = 4, n_estimators= 100)
t0= time()
xgb_model.fit(X1, y1)
name = str(xgb_model).split('(')[0]
print(f'fitting {name} is done in {time() - t0}s')

fitting LogisticRegression is done in 1.351513385772705s
fitting XGBClassifier is done in 12.032495737075806s


# Predict the target of the test set

In [39]:
# Read data without labels
X_without_labels = pd.read_csv(test_filepath)
print('Prediction set (without labels) :', X_without_labels.shape)

# Convert pandas DataFrames to numpy arrays before using scikit-learn
X_without_labels = X_without_labels.values

# preprocess
X_without_labels  = preprocessor.transform(X_without_labels)

# predict
name = str(lr_model).split('(')[0]
y_pred = lr_model.predict(X_without_labels)
y_pred_df = pd.DataFrame(y_pred, columns=['conversion'])
y_pred_df.to_csv(output_path+'conversion_data_test_predictions_'+name+'.csv', index=False)

name = str(xgb_model).split('(')[0]
y_pred = xgb_model.predict(X_without_labels)
y_pred_df = pd.DataFrame(y_pred, columns=['conversion'])
y_pred_df.to_csv(output_path+'conversion_data_test_predictions_'+name+'.csv', index=False)


Prediction set (without labels) : (31620, 5)
