# Conversion Prediction : Classification with Resampling
------
>In this part of the project, we will try to:  
>> analyse the effect of **balancing** the dataset using the **Resampling** techniques on the prediction performance.
------

### Table of Contents

* [1. Data Preparation](#section1)
    * [1.1. Load Data](#section21)
    * [1.2. Predictors and Target](#section21)
    * [1.3. Resampling](#section22)
    * [1.4. Training and Validation sets](#section22)
    * [1.5. Preprocessing pipeline](#section23)
* [2. Classification](#section2)
    * [2.1. Preliminary Analysis](#section21)
        * [2.1.1. Statsmodels logit](#section21)
    * [2.2. Logistic Regression](#section23)
        * [2.2.1. Model Evaluation](#section24)
    * [2.3. Train on the whole dataset](#section25)
* [3. Predict the target of the test set](#section2)

#### Import useful modules ⬇️⬇️ and global params

In [2]:
# generic libs
import os
import pandas as pd
from numpy import append
from time import time

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# ML tools
# pre_training tools
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# training tools
import statsmodels.api as sm 

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
#from xgboost import XGBClassifier

# predefined modules
from modules import MyFunctions as MyFunct

# Global parameters 
train_filepath = 'data/conversion_data_train.csv'
test_filepath = 'data/conversion_data_test.csv'
results_path = "results/"

if not os.path.exists("output"):
    os.mkdir("output")
output_path = 'output/'

seed = 0
cv = 100

# Data Preparation

## Load data

In [3]:
print("Loading dataset...")
dataset = pd.read_csv(train_filepath)
print("...Done.")
print()

Loading dataset...
...Done.



In [4]:
dataset.head()

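| | country | age | new_user | source | total_pages_visited | converted |
|---|---|---|---|---|---|---|
| 0 | China | 22 | 1 | Direct | 2 | 0 |
| 1 | UK | 21 | 1 | Ads | 3 | 0 |
| 2 | Germany | 20 | 0 | Seo | 14 | 1 |
| 3 | US | 23 | 1 | Seo | 3 | 0 |
| 4 | US | 28 | 1 | Direct | 3 | 0 |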


## Predictors and Target

In [5]:
# Separate target variable y from features X
y = dataset['converted']
X = dataset.drop('converted', axis = 'columns')

## Resampling

🗒 As the dataset is **highly imbalanced**, it would be better to **resample** the observations in order to **balance** the dataset. In general, there are two approaches:   
1) Under Sampling: Remove samples from the majority class.   
2) Over Sampling: Duplicate samples from the minority class.

In the current proposal we will use the **Under Sampling** approach from the library **imblearn**.
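
The under-sampling idea described above can be sketched in plain Python. This is only an illustration of the principle, not the imblearn implementation: the helper `random_under_sample` and the toy target `toy_y` are hypothetical names, and the 95/5 class split is made up.

```python
import random
from collections import Counter


def random_under_sample(labels, seed=0):
    """Return sorted row indices of a class-balanced subset of `labels` (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest one
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Toy target: 95 non-converted rows and 5 converted rows (made-up proportions)
toy_y = [0] * 95 + [1] * 5
kept_idx = random_under_sample(toy_y, seed=0)
balanced_y = [toy_y[i] for i in kept_idx]

print("before:", Counter(toy_y))
print("after: ", Counter(balanced_y))
```

All minority rows are kept, and the majority class loses rows until both classes have the same size. The price is a smaller dataset, which is the usual trade-off of under-sampling compared with over-sampling.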

In [8]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X, y = rus.fit_resample(X, y)

## Training and Validation sets

🗒 **_Stratify_**: Passing **stratify = y** to `train_test_split` makes the split stratified: each class appears in the training set and in the validation set in the same proportion as in the dataset being split, instead of leaving the class mix of each subset to chance.
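
To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration under assumptions, not how scikit-learn implements `train_test_split`: `stratified_split` and the 80/20 toy target are hypothetical.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so each class keeps its share in both subsets (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Split each class separately with the same test fraction
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with an 80/20 class mix (made-up values)
toy_y = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_y, test_size=0.2, seed=0)
train_counts = Counter(toy_y[i] for i in train_idx)
test_counts = Counter(toy_y[i] for i in test_idx)

print("train:", train_counts)
print("test: ", test_counts)
```

Both subsets keep the 80/20 class mix of the input, which is what `stratify = y` guarantees up to rounding.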

In [9]:
# Divide dataset 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify = y)

# Convert pandas DataFrames to numpy arrays before using scikit-learn
X_train = X_train.values
X_test = X_test.values
y_train = y_train.tolist()
y_test = y_test.tolist()

## Preprocessing pipeline

>🗒 The dataset mixes quantitative and qualitative predictors, so we define a separate preprocessing pipeline for each type.
>> 1. We **standardize** the numerical data before training so that features with large scales do not dominate the learning phase.
>> 2. We **encode** the categorical predictors using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.
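
The two preprocessing steps can be illustrated by hand on the five rows shown by `dataset.head()`. This is a rough sketch of what `StandardScaler` and `OneHotEncoder(drop='first')` compute, not their actual implementation; `standardize` and `one_hot_drop_first` are hypothetical helpers.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to unit variance (population std, as StandardScaler does)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot_drop_first(values):
    """One-hot encode, dropping the first category in sorted order as the reference level."""
    categories = sorted(set(values))
    kept = categories[1:]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Values taken from the first five rows of the dataset
ages = [22, 21, 20, 23, 28]
countries = ['China', 'UK', 'Germany', 'US', 'US']

scaled_ages = standardize(ages)
kept_categories, encoded = one_hot_drop_first(countries)

print([round(a, 3) for a in scaled_ages])
print(kept_categories)
for country, row in zip(countries, encoded):
    print(country, row)
```

Dropping the first category means that a row of zeros stands for the reference level (here `China`), which avoids perfectly collinear dummy columns in the logit model.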

In [10]:
# Create pipeline for numeric features 
#Num_X =['age', 'total_pages_visited'] 
num_X = [1,4]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Create pipeline for categorical features
#cat_X = ['country', 'new_user', 'source']
cat_X = [0,2,3]
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first'))
])

# Use ColumnTransformer to make a preprocessor object 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_X),
        ('cat', categorical_transformer, cat_X)
    ])

# Preprocessing on the train set (8 cols = 2 for numeric columns + 1 for new_user + 3 for country + 2 for source)
X_train = preprocessor.fit_transform(X_train)
X_test  = preprocessor.transform(X_test)

# Classification

## Preliminary Analysis

### Statsmodels logit

🗒 **_Statsmodels_**: As a preliminary analysis, we fit a logit model with statsmodels, whose summary gives detailed regression results, in order to confirm what we noticed in the EDA part.

In [11]:
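# Names of the one-hot encoded columns
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
cols = preprocessor.transformers_[1][1].named_steps['encoder'].get_feature_names().tolist()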
columns = ['const','age', 'total_pages_visited'] + cols

In [12]:
X2 = sm.add_constant(X_train)

logit = sm.Logit(y_train,X2)

logit_fit = logit.fit()

logit_fit.summary(xname=columns)

Optimization terminated successfully.
         Current function value: 0.157403
         Iterations 9


| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 14688 |
| Model: | Logit | Df Residuals: | 14679 |
| Method: | MLE | Df Model: | 8 |
| Date: | Thu, 14 Apr 2022 | Pseudo R-squ.: | 0.7729 |
| Time: | 06:54:53 | Log-Likelihood: | -2311.9 |
| converged: | True | LL-Null: | -10181.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.0279 | 0.198 | -10.226 | 0.000 | -2.417 | -1.639 |
| age | -0.5309 | 0.041 | -13.034 | 0.000 | -0.611 | -0.451 |
| total_pages_visited | 4.4927 | 0.088 | 50.800 | 0.000 | 4.319 | 4.666 |
| x0_Germany | 3.8345 | 0.244 | 15.695 | 0.000 | 3.356 | 4.313 |
| x0_UK | 3.5782 | 0.205 | 17.455 | 0.000 | 3.176 | 3.980 |
| x0_US | 3.2664 | 0.193 | 16.909 | 0.000 | 2.888 | 3.645 |
| x1_1 | -1.4625 | 0.078 | -18.712 | 0.000 | -1.616 | -1.309 |
| x2_Direct | -0.0656 | 0.111 | -0.591 | 0.555 | -0.283 | 0.152 |
| x2_Seo | 0.0261 | 0.090 | 0.291 | 0.771 | -0.150 | 0.202 |


****************************************************************                  
> 🗒 **Statistical Significance (P>|z|)**: **Resampling** confirms the **non-significance** of the predictor **source**. We may eliminate it from the prediction.

> 🗒 **Predictors Importance (coef)**: **Resampling** changes the ordering of the predictors and highlights the insights found in the EDA part. Ranked by the absolute value of their coefficients, the predictors are ordered as follows:     
**total_pages_visited, country, new_user, age, source**
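
One way to read the coefficients of the summary table is to convert them into odds ratios with `exp(coef)`. The short sketch below copies the coefficients from the table and ranks them by absolute value. Keep in mind that the numeric predictors were standardized, so their coefficients are per standard deviation, while the one-hot coefficients are relative to the dropped reference category (`China`, `new_user = 0`, `Ads`).

```python
from math import exp

# Coefficients copied from the logit summary table (constant omitted)
coefs = {
    'age': -0.5309,
    'total_pages_visited': 4.4927,
    'x0_Germany': 3.8345,
    'x0_UK': 3.5782,
    'x0_US': 3.2664,
    'x1_1': -1.4625,
    'x2_Direct': -0.0656,
    'x2_Seo': 0.0261,
}

# exp(coef) is the multiplicative change in the odds of conversion
odds_ratios = {name: exp(coef) for name, coef in coefs.items()}

# Rank predictors by the absolute size of their coefficient
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)

for name in ranked:
    print(f"{name:>20}: coef = {coefs[name]:+.4f}   odds ratio = {odds_ratios[name]:.3f}")
```

An odds ratio above 1 means the predictor raises the odds of conversion, and a value below 1 means it lowers them. The two `source` dummies have odds ratios very close to 1, in line with their large p-values.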

## Logistic Regression

### Model Evaluation

> 🗒 We evaluate the performance of the **LogisticRegression** classifier using the **f1_score**, by means of **k-fold cross-validation**.
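
The internals of `MyFunct.model_validation` are not shown in this notebook, so the following is only a sketch of the principle: the data are cut into k folds, an f1 score is computed on each held-out fold, and the mean and standard deviation of those scores are reported. To keep the example self-contained, the model-fitting step is replaced by a fixed list of made-up predictions; `f1_score` and `k_fold_indices` are hypothetical helpers. With scikit-learn, `cross_val_score(model, X, y, cv=k, scoring='f1')` is the usual way to get the per-fold scores.

```python
from statistics import mean, pstdev


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Made-up labels and predictions (a real run would predict with a fitted model)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0]

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(y_true), k=3):
    # A real cross-validation would fit the classifier on train_idx here
    fold_true = [y_true[i] for i in val_idx]
    fold_pred = [y_pred[i] for i in val_idx]
    fold_scores.append(f1_score(fold_true, fold_pred))

print("per-fold f1:", [round(s, 4) for s in fold_scores])
print("mean f1:", round(mean(fold_scores), 4), " std f1:", round(pstdev(fold_scores), 4))
```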

In [14]:
scores = MyFunct.model_validation(LogisticRegression(),X_train, y_train, cv = cv, scoring = 'f1')
print(f"Classifier : {scores[0]} \nMean_f1 : {scores[1]} \nStd_f1 : {scores[2]}")

fitting LogisticRegression is done in 1.6619296073913574s
Classifier : LogisticRegression 
Mean_f1 : 0.9364959161839149 
Std_f1 : 0.022319690058280755


> 🗒 The previous values of the **f1_score** (before resampling) are:  
Mean : 0.7644634617786664    
Std : 0.04540174813269441    

The mean **f1_score** increases by about **0.17** (17 percentage points). Not bad!! 

>> **Resampling** gives better scores.
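
The size of the improvement can be checked with a few lines of arithmetic on the two mean scores reported above. The sketch separates the absolute gain (in f1 points) from the relative gain (as a fraction of the previous score), since the two are easy to confuse.

```python
# Mean f1 scores reported in this notebook
old_f1 = 0.7644634617786664   # previous score
new_f1 = 0.9364959161839149   # score after resampling

absolute_gain = new_f1 - old_f1          # difference in f1 points
relative_gain = absolute_gain / old_f1   # gain as a fraction of the previous score

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```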

## Train on the whole dataset

In [20]:
# train the model on the whole data
X1 = append(X_train,X_test,axis=0)
y1 = append(y_train,y_test)

lr_model = LogisticRegression()
name = 'Resampling_'+str(lr_model).split('(')[0]

t0= time()
lr_model.fit(X1, y1)
print(f'fitting {name} is done in {time() - t0}s')

fitting Resampling_LogisticRegression is done in 0.06558585166931152s


# Predict the target of the test set

In [22]:
# Read data without labels
X_without_labels = pd.read_csv(test_filepath)
print('Prediction set (without labels) :', X_without_labels.shape)

# Convert pandas DataFrames to numpy arrays before using scikit-learn
X_without_labels = X_without_labels.values

# preprocess
X_without_labels  = preprocessor.transform(X_without_labels)

# predict
y_pred = lr_model.predict(X_without_labels)
y_pred_df = pd.DataFrame(y_pred, columns=['conversion'])
y_pred_df.to_csv(output_path+'conversion_data_test_predictions_'+name+'.csv', index=False)

Prediction set (without labels) : (31620, 5)
