# Customer Churn Prediction Notebook

- Customer retention is one of the primary KPIs for companies with a subscription-based business model. Competition is tough, particularly in the SaaS market, where customers are free to choose from plenty of providers. After one bad experience, a customer may simply move to a competitor, resulting in customer churn.

- Customer churn is the percentage of customers that stopped using your company’s product or service during a certain time frame. One of the ways to calculate a churn rate is to divide the number of customers lost during a given time interval by the number of active customers at the beginning of the period.

- Predicting customer churn is a challenging but extremely important business problem especially in industries where the cost of customer acquisition (CAC) is high such as technology, telecom, finance, etc. The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge additional potential revenue source for companies.
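
The churn-rate calculation described above can be sketched in a few lines. This is only an illustration: the function name `churn_rate` and the sample numbers are hypothetical and are not taken from the dataset used in this notebook.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Churn rate as a fraction of the customers active at the start of the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical example: 1,000 active customers at the start of the month, 50 lost
rate = churn_rate(1000, 50)
print(f"Monthly churn rate: {rate:.1%}")
```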

In [None]:
#!pip install xgboost

In [1]:
# Data Prep and Visuals
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

#set max rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 

# Evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report

In [2]:
def preprocess_null_values(df, column_name, replace_with=None, drop_threshold=0.05):
    # Replace null values with the given value, if one is provided
    if replace_with is not None:
        df[column_name] = df[column_name].fillna(replace_with)
        return df

    # Otherwise drop rows with nulls, but only if they make up less than the threshold
    if drop_threshold > 0:
        null_percentage = df[column_name].isnull().mean()
        if null_percentage < drop_threshold:
            df = df.dropna(subset=[column_name])

    # Always return a DataFrame so `df = preprocess_null_values(...)` never yields None
    return df

In [3]:
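def preprocess_duplicates(df, column_name=None, keep_first=True):
    # column_name (a label or list of labels) restricts the comparison to those columns;
    # None compares entire rows.
    # Assumption: keep_first=False is meant to drop every copy of a duplicated row.
    keep = 'first' if keep_first else False
    df.drop_duplicates(subset=column_name, keep=keep, inplace=True)
    return df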

In [4]:
# Read the data frame
df = pd.read_csv("October_Churn_label.csv")

In [5]:
# Display the top 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,month,Churn_1,Churn_2
0,0,2019-10-01 00:00:00 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c,Oct,1.0,1.0
1,1,2019-10-01 00:00:15 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c,Oct,1.0,1.0
2,2,2019-10-02 14:30:46 UTC,view,17302761,2053013553853497655,,,73.4,541312140,bda25b1a-8844-40ec-b430-a704ab39e9d5,Oct,1.0,1.0
3,3,2019-10-05 14:10:39 UTC,view,17700454,2053013558861496931,,lumene,19.0,541312140,58c59c3e-da37-4a57-8c04-f85e1e4ee77f,Oct,1.0,1.0
4,4,2019-10-05 14:11:38 UTC,view,17700020,2053013558861496931,,payot,83.58,541312140,23fb14a1-9fd3-4e35-a729-bfaa64f4e875,Oct,1.0,1.0


In [6]:
# df.drop(['event_time', 'product_id', 'category_code', 'user_id', 'user_session', 'month','category_id'], axis=1, inplace = True)
df.drop(['Unnamed: 0', 'event_time', 'product_id', 'category_code', 'user_id', 'user_session', 'month'], axis=1, inplace = True)

In [7]:
# Display info about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14320508 entries, 0 to 14320507
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   event_type   object 
 1   category_id  int64  
 2   brand        object 
 3   price        float64
 4   Churn_1      float64
 5   Churn_2      float64
dtypes: float64(3), int64(1), object(2)
memory usage: 655.5+ MB


In [8]:
df['brand'].nunique()

3296

In [9]:
# Check nulls
df.isnull().sum()

event_type           0
category_id          0
brand          2071421
price                0
Churn_1              1
Churn_2              1
dtype: int64

In [10]:
df = preprocess_null_values(df, column_name ='brand', replace_with='other')

In [11]:
df = preprocess_null_values(df, column_name = 'Churn_1')

In [12]:
df = preprocess_null_values(df, column_name = 'Churn_2')

In [13]:
# Check nulls again
df.isnull().sum()

event_type     0
category_id    0
brand          0
price          0
Churn_1        0
Churn_2        0
dtype: int64

In [14]:
# Label-encode category_id and brand (event_type is one-hot encoded later with get_dummies)
le = LabelEncoder()
le_cat_id = le.fit_transform(df['category_id'])
le_brand = le.fit_transform(df['brand'])

In [15]:
df_num = df.copy()  # explicit copy so the column assignments below do not affect df

In [16]:
df_num['category_id'] = le_cat_id
df_num['brand'] = le_brand

In [17]:
df_num = pd.get_dummies(df_num, drop_first = True, dtype=int)

In [18]:
df_num.head()

Unnamed: 0,category_id,brand,price,Churn_1,Churn_2,event_type_purchase,event_type_view
0,448,2656,35.79,1.0,1.0,0,1
1,448,2656,35.79,1.0,1.0,0,1
2,39,2238,73.4,1.0,1.0,0,1
3,155,1766,19.0,1.0,1.0,0,1
4,155,2276,83.58,1.0,1.0,0,1


In [19]:
# We will use the data frame where we had created dummy variables
y = df_num['Churn_1'].values
X = df_num.drop(columns = ['Churn_1', 'Churn_2'])

# Scaling all the variables to a range of 0 to 1
features = X.columns.values
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X))
X.columns = features

In [20]:
# Create Train & Test Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
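
The cells above fit the scaler on the full dataset before splitting. A common alternative is to split first and learn the scaling range from the training rows only, so that no information from the test set reaches the preprocessing step. The sketch below shows that ordering on synthetic stand-in data; the names `X_demo`, `y_demo`, and `scaler_demo` are hypothetical, and the `stratify` argument is an optional addition that keeps the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical; not the notebook's dataset)
rng = np.random.default_rng(101)
X_demo = rng.uniform(0, 100, size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# Split first, keeping the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

scaler_demo = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training min/max on test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside the 0 to 1 range, which is expected.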

## **Machine learning models**

### **AdaBoost**

AdaBoost is a boosting algorithm that works on the principle of stagewise addition, where multiple weak learners are combined into a single strong learner.

Unlike gradient boosting as used in XGBoost, AdaBoost gives each weak learner a weight (alpha) that depends on that learner's error: alpha is inversely related to the error, so more accurate learners have more influence on the final prediction.
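
The inverse relationship between a weak learner's error and its alpha weight can be illustrated with the classic binary AdaBoost formula, alpha = 0.5 * ln((1 - error) / error). This is a sketch of the textbook formulation only; the function name `learner_weight` is hypothetical, and scikit-learn's implementation differs in its details.

```python
import math


def learner_weight(error: float, learning_rate: float = 1.0) -> float:
    """Alpha of a weak learner in classic binary AdaBoost (textbook formula)."""
    if not 0.0 < error < 1.0:
        raise ValueError("error must be strictly between 0 and 1")
    return learning_rate * 0.5 * math.log((1.0 - error) / error)


# Lower weighted error gives a larger alpha; an error of 0.5 (random guessing) gives zero
for err in (0.1, 0.3, 0.5):
    print(f"error={err:.1f} -> alpha={learner_weight(err):.3f}")
```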

**Here are the hyperparameters:**

**estimator**: object, default=None
The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier initialized with max_depth=1.

**n_estimators**: int, default=50
The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early. Values must be in the range [1, inf).

**learning_rate**: float, default=1.0
Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters. Values must be in the range (0.0, inf).

**algorithm**: {‘SAMME’, ‘SAMME.R’}, default=‘SAMME.R’
If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.

**random_state**: int, RandomState instance or None, default=None
Controls the random seed given at each estimator at each boosting iteration. Thus, it is only used when estimator exposes a random_state. Pass an int for reproducible output across multiple function calls.

**base_estimator**: object, default=None
Older name for `estimator`, with the same meaning. It was deprecated in scikit-learn 1.2 and removed in 1.4; use `estimator` instead.

In [21]:
# AdaBoost Algorithm
from sklearn.ensemble import AdaBoostClassifier
model_2= AdaBoostClassifier()
result_2= model_2.fit(X_train,y_train)

In [22]:
prediction_test= model_2.predict(X_test)

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

0.6588089390671142


###  **XGBoost**

XGBoost (eXtreme Gradient Boosting) is an implementation of gradient boosting, an approach that also goes by the names stochastic gradient boosting, gradient boosting machines, and multiple additive regression trees.

Similar to random forest, Gradient Boosting Decision Trees (GBDT) is a decision tree ensemble learning algorithm that combines many trees to create a more accurate model. The two differ in how the trees are built: random forest constructs full decision trees independently on random bootstrap samples of the data set, whereas gradient boosting adds trees sequentially, each one fitted to correct the errors of the ensemble built so far.

XGBoost is a widely used machine learning library for regression, classification, and ranking tasks.

**It has the following features:**

- Parallel tree boosting.


- Regularization.


- Built-in features to manage missing data.


- At each iteration, the user is able to perform a cross-validation.


- In small to medium datasets, it performs well.


- It is made to be extremely effective, adaptable, and portable.


- To efficiently handle weighted data, it has a distributed weighted quantile sketch process.

The XGBoost classifier has different hyperparameters that are used to build models. Some of these will be used to improve the model and score.

**Here are the hyperparameters:**


* **learning_rate:** Shrinks the contribution of each tree by this factor. There is a trade-off between learning_rate and n_estimators.


* **n_estimators:** The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.


* **subsample:** The fraction of samples that will be used to fit each base learner. Stochastic gradient boosting occurs when the value is less than 1.0. This parameter interacts with n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.


* **colsample_bytree:** Subsample ratio of columns when constructing each tree.


* **nthread:** Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.


* **objective:** Specify the learning task and the corresponding learning objective or a custom objective function to be used.


* **silent:** Whether to print messages during construction.


* **random_state:** Random number seed, used for example when subsampling rows and columns. Pass an int for reproducible output across multiple function calls.


In [25]:
# XGBoost Algorithm
from xgboost import XGBClassifier
model_3= XGBClassifier()
result_3= model_3.fit(X_train, y_train)

In [26]:
prediction_test = model_3.predict(X_test)

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

0.6598952597824146


### **Logistic Regression**

Logistic regression is used to handle classification problems.

It models the relationship between the dependent variable and one or more independent variables by estimating probabilities with the logistic (sigmoid) function.

It is often used for predictive analytics and modeling, and extends to applications in machine learning. Logistic regression is easy to implement and interpret, and efficient to train.
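
To make the probability estimate concrete, here is a minimal sketch of the logistic equation: a weighted sum of the features is passed through the sigmoid function and then compared with a threshold. The function name `predict_proba_sketch`, the coefficients, and the feature values are all hypothetical; a fitted `LogisticRegression` model learns its coefficients from the training data.

```python
import math


def predict_proba_sketch(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for three features scaled to the 0..1 range
p = predict_proba_sketch([0.2, 0.7, 1.0], weights=[1.5, -0.8, 0.3], bias=-0.1)
label = int(p >= 0.5)  # 0.5 is the usual default decision threshold
print(f"probability={p:.3f}, predicted class={label}")
```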


**There are three main types of logistic regression:**

 * **Binary logistic regression** deals with two possible values, essentially: yes or no.
 
 
 * **Multinomial logistic regression** deals with three or more unordered classes.
 
 
 * **Ordinal logistic regression** deals with three or more classes in a predetermined order.

To develop a model, the Logistic Regression classifier offers many hyperparameters. We will use some of them to improve the model and its score.

**The hyperparameters are:**
* **penalty:** Specifies the norm used in the penalization. The newton-cg and lbfgs solvers support only L2 penalties.
   * `'none'`: no penalty is added;
   * `'l2'`: adds an L2 penalty term; this is the default choice;
   * `'l1'`: adds an L1 penalty term;
   * `'elasticnet'`: both L1 and L2 penalty terms are added.


* **C:** Inverse of regularization strength.


* **solver:** *(‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’)* Algorithm to use in the optimization problem. Default is ‘lbfgs’.
  * For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones;
  * For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss;
  * ‘liblinear’ is limited to one-versus-rest schemes.
  
  
* **random_state:** Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. 

In [25]:
# logistic regression model
from sklearn.linear_model import LogisticRegression
model_5= LogisticRegression()
result_5 = model_5.fit(X_train, y_train)

In [26]:
prediction_test = model_5.predict(X_test)

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

0.6567605948857036


### **Decision Tree**

A decision tree is a flowchart-like tree structure where each internal node denotes the feature, branches denote the rules and the leaf nodes denote the result of the algorithm.

It is a versatile supervised machine-learning algorithm, which is used for both classification and regression problems.

Decision trees are powerful on their own, and they are also the building blocks of Random Forest, which trains many trees on different subsets of the training data. This makes random forest one of the most powerful algorithms in machine learning.

**Here is the key terminology:**

- **Root Node:** It is the topmost node in the tree,  which represents the complete dataset. It is the starting point of the decision-making process.

- **Decision/Internal Node:** A node that symbolizes a choice regarding an input feature. Branching off of internal nodes connects them to leaf nodes or other internal nodes.

- **Leaf/Terminal Node:** A node without any child nodes that indicates a class label or a numerical value.

- **Splitting:** The process of splitting a node into two or more sub-nodes using a split criterion and a selected feature.

- **Branch/Sub-Tree:** A subsection of the decision tree starts at an internal node and ends at the leaf nodes.

- **Parent Node:** The node that divides into one or more child nodes.

- **Child Node:** The nodes that emerge when a parent node is split.

- **Impurity:** A measurement of the target variable’s homogeneity in a subset of data. It refers to the degree of randomness or uncertainty in a set of examples. The Gini index and entropy are two commonly used impurity measurements in decision trees for classification tasks.

- **Variance:** Variance measures how much the predicted and the target variables vary in different samples of a dataset. It is used for regression problems in decision trees. Mean squared error, mean absolute error, friedman_mse, or half Poisson deviance are used to measure the variance for regression tasks in the decision tree.

- **Information Gain:** Information gain is a measure of the reduction in impurity achieved by splitting a dataset on a particular feature in a decision tree. At each node, the feature that offers the greatest information gain is chosen as the splitting criterion, with the goal of creating pure subsets.

- **Pruning:** The process of removing branches from the tree that do not provide any additional information or lead to overfitting.
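
The impurity and information gain terms above can be made concrete with a small sketch. The helper names `gini`, `entropy`, and `information_gain` and the toy labels are hypothetical and only illustrate the definitions; scikit-learn computes these quantities internally when it grows a tree.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy node with a 50/50 class mix, split perfectly into two pure children
parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]
print(gini(parent), entropy(parent), information_gain(parent, left, right))
```

A perfect split removes all impurity, so its information gain equals the parent's impurity; a split that leaves both children with the same class mix as the parent has zero gain.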

In [21]:
# Decision Tree model
from sklearn.tree import DecisionTreeClassifier
model_6= DecisionTreeClassifier()
result_6 = model_6.fit(X_train, y_train)

In [22]:
prediction_test = model_6.predict(X_test)

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

0.662608617523631


### **Random Forest**

A random forest is an ensemble classifier that bases its estimate on the combination of many decision trees. It fits a number of decision tree classifiers on various subsamples of the dataset, and each tree considers only a random subset of the features when choosing its splits. The predictions of the individual trees are then aggregated, which makes the ensemble more accurate and more stable than any single tree. Random forest is currently one of the best-performing algorithms for many classification problems.

**It has the following features:**


- Diversity: Not all attributes/variables/features are considered while making an individual tree; each tree is different.


- Less affected by the curse of dimensionality: Since each tree does not consider all the features, the feature space is reduced.


- Parallelization: Each tree is created independently out of different data and attributes. This means we can fully use the CPU to build random forests.


- Train-Test split: In a random forest, each tree is trained on a bootstrap sample, so roughly one-third of the data is never seen by that tree and can be used to estimate its performance.


- Stability: Stability arises because the result is based on majority voting/ averaging.

The Random Forest classifier has different hyperparameters that are used to build models. Some of these will be used to improve the model and score.

**Here are the hyperparameters:**


- **n_estimators:** Number of trees the algorithm builds before averaging the predictions.


- **max_features:** Maximum number of features random forest considers when splitting a node.


- **min_samples_leaf:** Determines the minimum number of samples required to be at a leaf node.


- **criterion:** How to split the node in each tree? (Entropy/Gini impurity/Log Loss)


- **max_leaf_nodes:** Maximum leaf nodes in each tree


- **n_jobs:** it tells the engine how many processors it is allowed to use. If the value is 1, it can use only one processor, but if the value is -1, there is no limit.


- **random_state:** controls randomness of the sample. The model will always produce the same results if it has a definite value of random state and has been given the same hyperparameters and training data.


- **oob_score:** OOB means out-of-bag. It is a validation method built into random forest: roughly one-third of the samples are left out of each tree's bootstrap sample and are used to evaluate performance instead of training. These samples are called out-of-bag samples.
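
The size of the out-of-bag share follows from bootstrap sampling: drawing n rows with replacement from n rows leaves, on average, a fraction of about 1/e (roughly 37%) of the rows undrawn. The short simulation below illustrates this; the function name `oob_fraction` and the sample size are hypothetical choices for the sketch.

```python
import random


def oob_fraction(n_samples: int, seed: int = 0) -> float:
    """Fraction of rows never drawn in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


frac = oob_fraction(10_000)
print(f"out-of-bag fraction: {frac:.3f}")
```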

In [23]:
# Random forest model
from sklearn.ensemble import RandomForestClassifier
model_7= RandomForestClassifier()
result_7= model_7.fit(X_train, y_train)

In [24]:
prediction_test = model_7.predict(X_test)

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, prediction_test))

0.6634230671021261


### **Tuning Methods**

Hyperparameter tuning (or hyperparameter optimization) is the process of determining the right combination of hyperparameters that maximizes the model performance. It works by running multiple trials in a single training process. Each trial is a complete execution of your training application with values for your chosen hyperparameters, set within the limits you specify.

Once finished, this process gives you the set of hyperparameter values best suited to the model. It is an important step in any machine learning project because it leads to better results for a model.


**Hyperparameters of Tuning Methods *(Grid Search, Random Search, Bayesian Search)* are:**
* **estimator:** *(object)* a scikit-learn model.
* **param_grid:** *(dict or list of dictionaries)* This enables searching over any sequence of parameter settings.
* **scoring:** *(str, callable, list, tuple or dict)* Strategy to evaluate the performance of the cross-validated model on the test set.
* **n_jobs:** *(int)* Number of jobs to run in parallel. 
  * `None` means 1.
  * `-1` means using all processors.
* **refit:** *(bool, str, or callable)* Refit an estimator using the best found parameters on the whole dataset.
* **cv:** *(int, cross-validation generator or an iterable)* determines the cross-validation splitting strategy. Possible inputs for cv are:

  * None, to use the default 5-fold cross validation.
  * integer, to specify the number of folds in a (Stratified)KFold.
  * CV splitter.
  * An iterable yielding (train, test) splits as arrays of indices.
* **verbose:** *(int)* Controls the verbosity (how many messages are shown).
  * `>1`: the computation time for each fold and parameter candidate is displayed.
  * `>2` : the score is also displayed.
  * `>3` : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
* **error_score:** *(‘raise’ or numeric)* Value to assign to the score if an error occurs in estimator fitting.

There are different hyperparameter tuning methods; the ones considered in this task are:

- Grid Search 
- Random Search
- Bayesian Search

### **Random Search**

Random search is a stochastic approach: instead of evaluating every combination of hyperparameters, it relies solely on randomly sampling points from the feasible region of the problem, using a specified probability distribution or series of probability distributions.
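
The sampling idea can be sketched without any library. The parameter values below mirror the random forest search space used in this notebook, while the helper `sample_candidates` is hypothetical. Unlike this sketch, `RandomizedSearchCV` also cross-validates every sampled candidate, and it samples without replacement when all parameters are given as lists.

```python
import random

# Search space: each hyperparameter maps to the list of values it may take
param_space = {
    "n_estimators": [25, 50, 100, 150],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 6, 9],
    "max_leaf_nodes": [3, 6, 9],
}


def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter random hyperparameter combinations (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


# Size of the full grid that an exhaustive grid search would have to evaluate
n_combinations = 1
for values in param_space.values():
    n_combinations *= len(values)

candidates = sample_candidates(param_space, n_iter=10)
print(f"{len(candidates)} candidates sampled out of {n_combinations} possible combinations")
```

Ten candidates matches the default `n_iter=10` of `RandomizedSearchCV`, so a search with default settings explores only a small part of this grid.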


In [34]:
# Set the parameters of RF model
param_grid = { 
'n_estimators': [25, 50, 100, 150], 
'max_features': ['sqrt', 'log2', None], 
'max_depth': [3, 6, 9], 
'max_leaf_nodes': [3, 6, 9], 
} 

In [35]:
# Hyperparameter Tuning- RandomizedSearchCV
random_search = RandomizedSearchCV(RandomForestClassifier(), param_grid) 
random_search.fit(X_train, y_train) 
print(random_search.best_estimator_) 

RandomForestClassifier(max_depth=3, max_features=None, max_leaf_nodes=9,
                       n_estimators=25)


In [38]:
# Update the model
model_random = RandomForestClassifier(max_depth=3, 
                                      max_features=None, 
                                      max_leaf_nodes=9, 
                                      n_estimators=25) 
model_random.fit(X_train, y_train) 
pred_test_rand = model_random.predict(X_test) 

# Print the prediction accuracy
print (metrics.accuracy_score(y_test, pred_test_rand ))

0.6588475782868999


## **Result**

From all previous trials, we found that random forest performed best, with an accuracy of 0.663.

In our churn prediction model, we undertook the method of self-labeling for eight segments and subsequently trained a variety of machine learning models.
Through evaluation, it became clear that the random forest algorithm was the best trial, with an accuracy of 66.34% using default settings and 65.88% after the random search.
However, the obtained accuracy fell short of our expectations. We recognized that this low accuracy could be attributed to the self-labeling process itself.
As such, we made the decision to disregard these results as less valuable in our churn prediction. This highlights the importance of ensuring high-quality, well-labeled data in any predictive modeling task.


In [40]:
%%time
# Save the DataFrame as a Feather file in case the notebook crashes
# Feather preserves the column data types
import os
import pyarrow.feather as feather
os.makedirs('tmp', exist_ok=True)  # Make a temp dir for storing the feather file
feather.write_feather(df, './tmp/df')

CPU times: user 2.39 s, sys: 740 ms, total: 3.13 s
Wall time: 2.01 s


In [None]:
%%time
# Load the data back from the Feather file, which is lightweight and fast to read
df = pd.read_feather('./tmp/df')
df

CPU times: user 1.4 s, sys: 1.2 s, total: 2.6 s
Wall time: 1.22 s
