In [None]:
!pip install category_encoders &> /dev/null

In [None]:
!pip install imbalanced-learn &> /dev/null

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Part02 - Modeling: <p>

<h2> Introduction: </h2> <p>
Modeling is a data mining step: by fitting a model, we can find the patterns that separate customers who churn from those who do not. In this notebook, we will build an appropriate model that predicts whether a customer will churn. The key word is "appropriate": we do not only want the model with the highest accuracy, we also want a model that is interpretable, so that we can explain why it predicts churn, and that is not biased. As I said in the previous notebook, we will try several machine learning algorithms:
<ol>
<li> Logistic Regression</li>
<li> KNN</li>
<li> Decision Tree</li>
<li> Random Forest</li>
<li> XGBoost</li> 
</ol><p>
In the modeling process, we have to do some data preprocessing, such as:
<ul>
<li> Scaling</li>
<li> Encoding</li>
<li> SMOTE</li>
<li> Polynomial features</li>
</ul><p>
At the end, we will evaluate the candidates to select the appropriate model, and we will save that model so that we can deploy it later.
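As a rough illustration of how the scaling, encoding, and polynomial steps listed above can be combined, here is a minimal sketch using scikit-learn only. The toy values are made up, the column names are borrowed from the dataset for readability, and the choice of `degree=2` is an assumption, not the setting used later. SMOTE is left out on purpose: it resamples rows, so it would need the `Pipeline` from `imblearn` rather than the scikit-learn one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy frame: column names mirror the dataset, the values are invented.
toy = pd.DataFrame({
    "Tenure": [4.0, 9.0, 0.0, 12.0],
    "CashbackAmount": [159.93, 120.90, 134.07, 129.60],
    "PreferredLoginDevice": ["Mobile Phone", "Computer", "Mobile Phone", "Computer"],
})

numeric_cols = ["Tenure", "CashbackAmount"]
categorical_cols = ["PreferredLoginDevice"]

# Numeric branch: add squared and interaction terms, then standardize.
numeric_steps = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
])

# Route each group of columns to its own transformer.
# sparse_threshold=0 forces a dense array so the result is easy to inspect.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

The two numeric columns expand to five polynomial columns, and the categorical column becomes one indicator column per category. Keeping every step inside one transformer object means the same fitted preprocessing can be reused on the test set.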


## Load Libraries:

In [None]:
# Data
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Analyze
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Model Preprocessing
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from category_encoders import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

# Model Evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, cross_val_score
from sklearn.metrics import classification_report, f1_score, plot_roc_curve

# Warnings
import warnings
warnings.filterwarnings('ignore')



## Load Clean Dataset:

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/EcommerceCustomerChurn/CleanDataset.csv')
df.head()

Unnamed: 0,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount,Satisfaction,AvgCashbackPerOrder
0,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3,Laptop & Accessory,Single,9,1,11.0,1.0,1.0,5.0,159.93,Unhappy,159.93
1,1,9.0,Mobile Phone,1,8.0,UPI,Male,3.0,4,Mobile Phone,Single,7,1,15.0,0.0,1.0,0.0,120.9,Neutral,120.9
2,1,9.0,Mobile Phone,1,30.0,Debit Card,Male,2.0,4,Mobile Phone,Single,6,1,14.0,0.0,1.0,3.0,120.28,Neutral,120.28
3,1,0.0,Mobile Phone,3,15.0,Debit Card,Male,2.0,4,Laptop & Accessory,Single,8,0,23.0,0.0,1.0,3.0,134.07,Happy,134.07
4,1,0.0,Mobile Phone,1,12.0,Credit Card,Male,3.0,3,Mobile Phone,Single,3,0,11.0,1.0,1.0,3.0,129.6,Happy,129.6


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Churn                        5630 non-null   int64  
 1   Tenure                       5630 non-null   float64
 2   PreferredLoginDevice         5630 non-null   object 
 3   CityTier                     5630 non-null   int64  
 4   WarehouseToHome              5630 non-null   float64
 5   PreferredPaymentMode         5630 non-null   object 
 6   Gender                       5630 non-null   object 
 7   HourSpendOnApp               5630 non-null   float64
 8   NumberOfDeviceRegistered     5630 non-null   int64  
 9   PreferedOrderCat             5630 non-null   object 
 10  MaritalStatus                5630 non-null   object 
 11  NumberOfAddress              5630 non-null   int64  
 12  Complain                     5630 non-null   int64  
 13  OrderAmountHikeFro

## Drop Bias Columns:
As I said before, we don't want to create a biased model. To that end, we remove the columns that could cause our model to favor certain groups of customers. In our dataset, these are the <strong>Gender</strong> and <strong>MaritalStatus</strong> features.

In [None]:
df.drop(columns=['Gender', 'MaritalStatus'], inplace = True)

In [None]:
df.columns

Index(['Churn', 'Tenure', 'PreferredLoginDevice', 'CityTier',
       'WarehouseToHome', 'PreferredPaymentMode', 'HourSpendOnApp',
       'NumberOfDeviceRegistered', 'PreferedOrderCat', 'NumberOfAddress',
       'Complain', 'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
       'DaySinceLastOrder', 'CashbackAmount', 'Satisfaction',
       'AvgCashbackPerOrder'],
      dtype='object')

In [None]:
df.shape

(5630, 18)

## Handling Multicollinearity:
In the previous notebook, we found that our data has a multicollinearity problem, especially in the <strong>HourSpendOnApp</strong>, <strong>NumberOfDeviceRegistered</strong>, <strong>OrderAmountHikeFromlastYear</strong>, <strong>CashbackAmount</strong>, and probably <strong>AvgCashbackPerOrder</strong> features. Multicollinearity makes the coefficients of a regression model unstable, because the independent features are also correlated with each other, so the effect of one feature cannot be cleanly separated from the effect of the others. To understand more about the impact of multicollinearity on linear models, <a href = 'https://medium.com/analytics-vidhya/what-is-multicollinearity-and-how-to-remove-it-413c419de2f'>read here</a>. <p>
In a tree-based model, multicollinearity does not affect the prediction results, because the model splits on one feature at a time; if you want to know more about tree-based models, you can <a href = 'https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html'>read here</a>. However, if we want to identify the most important features for predicting customer churn, the results can be misleading: when an important feature is correlated with other features, its importance is shared among them, so each of them can look less important than it really is <sup>[1]</sup>. To avoid that mistake, we will handle multicollinearity by dropping the feature that has the highest VIF (variance inflation factor) value. <p>
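To make the VIF criterion concrete: the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the other features. The sketch below is a minimal illustration on synthetic data, not on our dataset; the helper name `vif_with_intercept` and the columns `a`, `b`, `c` are made up for this example.

```python
import numpy as np

def vif_with_intercept(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    # With an intercept the residuals have zero mean, so this is the usual R^2.
    r_squared = 1.0 - residual.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b
X = np.column_stack([a, b, c])

vifs = [vif_with_intercept(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

The two nearly identical columns get a very large VIF, while the unrelated column stays close to 1. Note that this sketch adds an intercept to each auxiliary regression. As far as I know, `variance_inflation_factor` from statsmodels uses the matrix exactly as it is passed in and does not add a constant itself, so values computed on raw, uncentered features can come out larger than with this sketch.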
We could also use PCA to handle our multicollinearity problem. Dropping the feature that has the highest VIF value risks losing important information about the pattern of customer churn, and PCA avoids that risk. However, remember our key word, "appropriate": the disadvantage of PCA is its low interpretability, because each component is a mix of the original features. If we decided to use PCA, we would have difficulty explaining our model. <p>
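The trade-off can be seen in a small sketch. The data below is synthetic (two strongly correlated columns and one independent column); the column names are borrowed from our dataset only for readability, and `n_components=2` is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "CashbackAmount": base + rng.normal(scale=0.2, size=300),
    "AvgCashbackPerOrder": base + rng.normal(scale=0.2, size=300),
    "Tenure": rng.normal(size=300),
})

# PCA is sensitive to scale, so standardize first.
scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

# Each row shows how one component is built from the original features.
loadings = pd.DataFrame(pca.components_, columns=toy.columns, index=["PC1", "PC2"])
print(loadings.round(2))
print(pca.explained_variance_ratio_.round(2))

# The transformed columns are uncorrelated with each other.
components = pca.transform(scaled)
print(np.corrcoef(components, rowvar=False).round(2))
```

The components are uncorrelated, which removes the multicollinearity, but the first component loads on both cashback columns at once. A statement such as "PC1 drives churn" is therefore harder to turn into a business explanation than a statement about a single original feature.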
Okay, let's handle our multicollinearity problem.

In [None]:
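# Numeric columns only; the first one is the target (Churn)
numFeat = df.select_dtypes(include=np.number).columns.tolist()

# Check multicollinearity
def calc_vif(x):
    """Return a DataFrame with one VIF value per column of x."""
    vif = pd.DataFrame()
    vif['variable'] = x.columns
    vif['vif'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
    return vif

# Skip the target column when computing VIF
calc_vif(df[numFeat[1:]])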

Unnamed: 0,variable,vif
0,Tenure,3.397104
1,CityTier,4.171916
2,WarehouseToHome,4.585438
3,HourSpendOnApp,19.102331
4,NumberOfDeviceRegistered,14.825241
5,NumberOfAddress,4.086215
6,Complain,1.398222
7,OrderAmountHikeFromlastYear,15.675112
8,CouponUsed,3.280053
9,OrderCount,6.523937
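One way to apply the "drop the feature with the highest VIF" rule is to repeat it until every remaining VIF is below a cutoff. The sketch below is an assumption about how that loop could look, not the exact procedure used in this notebook: the helper names `vif_table` and `drop_high_vif` are hypothetical, the cutoff of 10 is only a common rule of thumb, and the data is synthetic. For brevity it reads the VIFs off the diagonal of the inverse correlation matrix, which corresponds to regressions that include an intercept.

```python
import numpy as np
import pandas as pd

def vif_table(frame):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

def drop_high_vif(frame, threshold=10.0):
    """Drop the highest-VIF column repeatedly until all VIFs are <= threshold."""
    kept = frame.copy()
    while kept.shape[1] > 1:
        vif = vif_table(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept

rng = np.random.default_rng(1)
base = rng.normal(size=400)
toy = pd.DataFrame({
    "x1": base,
    "x2": base + rng.normal(scale=0.05, size=400),  # almost identical to x1
    "x3": rng.normal(size=400),                     # independent feature
})

reduced = drop_high_vif(toy)
print(list(reduced.columns))
print(vif_table(reduced).round(2))
```

Only one column is removed per iteration, and the VIFs are recomputed each time, because dropping one of two correlated features usually brings the VIF of the other one down as well.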


# Source:
<sup>[1]</sup> https://medium.com/@manepriyanka48/multicollinearity-in-tree-based-models-b971292db140 <p>