
**Goal of the project**

The aim of this project is to predict whether a travel customer will churn.

In [127]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

**Load the data**

In [128]:
# Load dataset
df = pd.read_csv('/content/customer_churn.csv')

In [129]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [130]:
# Examine the data
df.head()

| | age | frequentflyer | annualincomeclass | servicesopted | accountsyncedtosocialmedia | bookedhotelornot | target |
|---|---|---|---|---|---|---|---|
| 0 | 34 | No | Middle Income | 6 | No | Yes | 0 |
| 1 | 34 | Yes | Low Income | 5 | Yes | No | 1 |
| 2 | 37 | No | Middle Income | 3 | Yes | No | 0 |
| 3 | 30 | No | Middle Income | 2 | No | No | 0 |
| 4 | 30 | No | Low Income | 1 | No | No | 0 |


In [131]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age                         954 non-null    int64 
 1   frequentflyer               954 non-null    object
 2   annualincomeclass           954 non-null    object
 3   servicesopted               954 non-null    int64 
 4   accountsyncedtosocialmedia  954 non-null    object
 5   bookedhotelornot            954 non-null    object
 6   target                      954 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 52.3+ KB


**Examine the data**

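Running the Pandas value_counts( ) function on the target column shows that this is an imbalanced dataset: 730 customers did not churn (0) and 224 did (1), so churners make up roughly 23% of the records.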

In [132]:
df['target'].value_counts()

0    730
1    224
Name: target, dtype: int64
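To put a number on the imbalance, the counts above can be converted into class shares. This is a minimal sketch that rebuilds the target column from the reported counts rather than loading the CSV; the variable names are illustrative only.

```python
import pandas as pd

# Rebuild the target column from the counts reported above (730 zeros, 224 ones)
target = pd.Series([0] * 730 + [1] * 224, name="target")

# normalize=True returns class shares instead of raw counts
shares = target.value_counts(normalize=True)
print(shares.round(3))

# Ratio of the majority class to the minority class
imbalance_ratio = (target == 0).sum() / (target == 1).sum()
print(round(imbalance_ratio, 2))
```

On the real dataframe the same idea is simply df['target'].value_counts(normalize = True).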

**Check for missing values**

In [133]:
# Check for missing values
df.isnull().sum()

age                           0
frequentflyer                 0
annualincomeclass             0
servicesopted                 0
accountsyncedtosocialmedia    0
bookedhotelornot              0
target                        0
dtype: int64

We don’t have any missing values. We are good to go.

**Examine categorical data cardinality**

Next we will take a look at the “cardinality” of the categorical variables. Cardinality is just a technical way of saying the number of unique values a column holds.

In [134]:
df.select_dtypes(include = ['object']).agg(['count', 'nunique']).T

| | count | nunique |
|---|---|---|
| frequentflyer | 954 | 3 |
| annualincomeclass | 954 | 3 |
| accountsyncedtosocialmedia | 954 | 2 |
| bookedhotelornot | 954 | 2 |


All of these features are quite low in cardinality.

**Split the train and test data**

In [135]:
X = df.drop('target', axis = 1) 

In [136]:
y = df['target']

To divide X and y into the training and test datasets, we will use the train_test_split( ) function from scikit-learn. We assign 30% of the data to the test set using the argument test_size = 0.3, and use the stratify = y option to ensure the classes of the target variable appear in the same proportions in the train and test data. The random_state = 42 argument makes the split reproducible, so we get the same results each time we run the code rather than a different random mix.

In [137]:
from sklearn.model_selection import train_test_split

In [138]:
# Isolate X and y variables, and perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)
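One way to check that stratify did its job is to compare the churn rate across the subsets. The sketch below uses stand-in arrays with the same class counts as the target column (730 and 224); X_demo and y_demo are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the target column (730 / 224)
y_demo = np.array([0] * 730 + [1] * 224)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, the churn rate should be nearly identical in every subset
print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

On the real split, y_train.mean( ) and y_test.mean( ) give the same comparison.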

**Create a model pipeline**

Next, we will create a model pipeline. It encodes the categorical columns using ColumnTransformer( ) with a OneHotEncoder( ), and scales the data with RobustScaler( ) before passing it to the model.

I have also added some basic feature selection via SelectKBest( ), and have used SMOTEENN (SMOTE over-sampling followed by Edited Nearest Neighbours cleaning) to better handle class imbalance. I experimented with the SelectKBest k parameter until I found the number of features that worked best, which was around eight. Doing this greatly improved performance.

In [139]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import OneHotEncoder
from imblearn.combine import SMOTEENN
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import roc_auc_score

In [140]:
def get_pipeline(X, model):

    # Split the columns by dtype: object columns are treated as categorical
    categorical_columns = X.select_dtypes(include = ['object']).columns.tolist()
    numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()

    # One-hot encode the categoricals; binary columns collapse to a single 0/1 column
    # scikit-learn >= 1.2 uses sparse_output; on older versions pass sparse = False instead
    categorical_transformer = OneHotEncoder(drop = 'if_binary', sparse_output = False, handle_unknown = 'ignore')

    preprocessor = ColumnTransformer(transformers = [('numeric', 'passthrough', numeric_columns),
                                                     ('categorical', categorical_transformer, categorical_columns)])

    # The imblearn pipeline applies the SMOTEENN resampling step during fit only
    bundled_pipeline = imbpipeline(steps = [('preprocessor', preprocessor),
                                            ('smote', SMOTEENN(random_state = 42)),
                                            ('scaler', RobustScaler()),
                                            ('feature_selection', SelectKBest(score_func = mutual_info_classif, k = 8)),
                                            ('model', model)])

    return bundled_pipeline
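To see what the preprocessing step produces, here is a minimal sketch of the ColumnTransformer( ) on a few toy rows. The rows are hypothetical, and 'High Income' is an assumed third income level added for illustration. The sketch also sets sparse_threshold = 0 on the ColumnTransformer( ) to get a dense array, which differs from the pipeline above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; 'High Income' is an assumed level used only for illustration
toy = pd.DataFrame({
    "age": [34, 34, 37, 30],
    "servicesopted": [6, 5, 3, 2],
    "annualincomeclass": ["Middle Income", "Low Income", "High Income", "Middle Income"],
    "bookedhotelornot": ["Yes", "No", "No", "No"],
})

numeric = ["age", "servicesopted"]
categorical = ["annualincomeclass", "bookedhotelornot"]

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", "passthrough", numeric),
        ("categorical", OneHotEncoder(drop="if_binary"), categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(toy)

# 2 numeric columns + 3 income indicator columns + 1 column for the binary feature
print(encoded.shape)
```

Because of drop = 'if_binary', the two-level column becomes a single 0/1 indicator, while the three-level column keeps one indicator per level.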

**Select the best model**

Rather than simply selecting a single model, or repeating our code manually on a range of models, we can create another function to automatically test a wide range of possible models to determine the best one for our needs. To do this we first create a dictionary containing a selection of base classifiers, including XGBoost, Random Forest, Decision Tree, SVC, and Bernoulli Naive Bayes, among others.

We will create a Pandas dataframe to store the results. Then we will loop over the models, fit each one on the X_train and y_train data, generate predictions for X_test, and calculate the mean ROC AUC score from 10-fold cross-validation on the training data. That gives us a ROC AUC score on the X_test data (computed from the predicted class labels), plus the average cross-validated ROC AUC score on the training set.
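The cross-validation step can be sketched in isolation as follows. This is an illustration on synthetic data from make_classification( ) with a plain LogisticRegression( ), not the travel dataset or the full pipeline; X_demo and y_demo are stand-in names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a similar class imbalance (roughly 23% positives)
X_demo, y_demo = make_classification(
    n_samples=954, n_features=8, weights=[0.765], random_state=42
)

# For a classifier, cv=10 uses stratified 10-fold splitting by default
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=10, scoring="roc_auc"
)

print(scores.shape)             # one ROC AUC value per fold
print(round(scores.mean(), 3))  # the figure stored as roc_auc_cv
```

Passing the whole pipeline to cross_val_score( ), as the function below does, means the resampling, scaling and feature selection are refitted inside each fold.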

In [141]:
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import BernoulliNB

In [142]:
def select_model(X, y):

    # Candidate classifiers, all with default settings
    classifiers = {
        'DummyClassifier': DummyClassifier(strategy = 'most_frequent'),
        'XGBClassifier': XGBClassifier(),
        'XGBRFClassifier': XGBRFClassifier(),
        'LogisticRegression': LogisticRegression(),
        'LGBMClassifier': LGBMClassifier(),
        'RandomForestClassifier': RandomForestClassifier(),
        'DecisionTreeClassifier': DecisionTreeClassifier(),
        'ExtraTreeClassifier': ExtraTreesClassifier(),
        'GradientBoostingClassifier': GradientBoostingClassifier(),
        'BaggingClassifier': BaggingClassifier(),
        'AdaBoostClassifier': AdaBoostClassifier(),
        'HistGradientBoostingClassifier': HistGradientBoostingClassifier(),
        'KNeighborsClassifier': KNeighborsClassifier(),
        'SGDClassifier': SGDClassifier(),
        'BernoulliNB': BernoulliNB(),
        'LinearSVC': LinearSVC(),
        'SVC': SVC(),
        'CatBoostClassifier': CatBoostClassifier(silent = True),
    }

    rows = []

    for key, classifier in classifiers.items():

        print('*', key)

        start_time = time.time()

        # Fit on the data passed in; X_test and y_test come from the notebook's global scope
        pipeline = get_pipeline(X, classifier)
        pipeline.fit(X, y)
        y_pred = pipeline.predict(X_test)

        # Mean ROC AUC from 10-fold cross-validation on the data passed in
        cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'roc_auc')

        rows.append({'model': key,
                     'run_time': format(round((time.time() - start_time) / 60, 2)),
                     'roc_auc_cv': cv.mean(),
                     # Computed from hard class labels, not predicted probabilities
                     'roc_auc': roc_auc_score(y_test, y_pred)})

    # Build the dataframe once from the collected rows (DataFrame.append is unavailable in pandas >= 2.0)
    df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'roc_auc_cv', 'roc_auc'])
    df_models = df_models.sort_values(by = 'roc_auc_cv', ascending = False)

    return df_models

In [143]:
models = select_model(X_train, y_train)

* DummyClassifier
* XGBClassifier
* XGBRFClassifier
* LogisticRegression
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreeClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* HistGradientBoostingClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* LinearSVC
* SVC
* CatBoostClassifier


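Running the select_model( ) function on our training data takes a minute or so. Ranked by mean cross-validated ROC AUC, LGBMClassifier came out on top (0.926), but XGBClassifier achieved the highest ROC AUC on the test data among the top five (0.903), so we will treat XGBClassifier as the best model.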

In [144]:
models.head()

| | model | run_time | roc_auc_cv | roc_auc |
|---|---|---|---|---|
| 4 | LGBMClassifier | 0.02 | 0.926385 | 0.819539 |
| 8 | GradientBoostingClassifier | 0.03 | 0.918574 | 0.890299 |
| 17 | CatBoostClassifier | 0.2 | 0.918521 | 0.887042 |
| 1 | XGBClassifier | 0.02 | 0.916683 | 0.902951 |
| 11 | HistGradientBoostingClassifier | 0.07 | 0.913292 | 0.827001 |
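The two ROC AUC columns are not computed the same way. roc_auc_cv is scored by cross_val_score( ), while roc_auc is calculated from the hard 0/1 labels returned by predict( ). The sketch below, on synthetic stand-in data with a plain LogisticRegression( ), shows how a label-based score can differ from one based on predicted probabilities. It assumes the classifier exposes predict_proba( ), which not every model in the comparison does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the travel churn dataset
X_demo, y_demo = make_classification(
    n_samples=954, n_features=8, weights=[0.765], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

auc_from_labels = roc_auc_score(y_te, y_hat)                          # hard 0/1 labels
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # ranking scores
balanced_acc = balanced_accuracy_score(y_te, y_hat)

print(round(auc_from_labels, 3), round(auc_from_scores, 3), round(balanced_acc, 3))
```

With hard labels the ROC curve has a single interior point, so the label-based score reduces to the balanced accuracy, that is, the mean of the true positive rate and the true negative rate.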


**Examine the performance of the best model**

Finally, we can take our best model, the XGBClassifier, and fit it to the training data. To do this, we first define the selected model and pass it to get_pipeline( ) along with our training data. Then we fit( ) the pipeline on the training data and use predict( ) to generate predictions for the test data from the newly trained model.

In [145]:
selected_model = XGBClassifier()
bundled_pipeline = get_pipeline(X_train, selected_model)
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)

**Assess the performance of the model**

To examine how well the model performed in a little more detail we can make use of the confusion_matrix( ) function. In its output, rows are the actual classes and columns are the predicted classes.

The confusion matrix contains:

* 197 true negatives (the customers didn’t churn, and we predicted this correctly)
* 61 true positives (the customers did churn, and we predicted this correctly)
* 23 false positives (the customers didn’t churn, but we wrongly predicted that they would)
* 6 false negatives (the customers did churn, but we wrongly predicted that they wouldn’t)

Out of 287 predictions, we got it right 258 times (about 90%), and we got it wrong just 29 times.

In [146]:
from sklearn.metrics import confusion_matrix

In [147]:
confusion_matrix(y_test, y_pred)

array([[197,  23],
       [  6,  61]])
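The counts in the confusion matrix can be turned into the usual summary metrics. This is a small sketch using only the four numbers reported above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[197, 23],
               [6, 61]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that were correct
precision = tp / (tp + fp)           # share of predicted churners who really churned
recall = tp / (tp + fn)              # share of actual churners that were caught
specificity = tn / (tn + fp)         # share of non-churners correctly left alone
balanced_accuracy = (recall + specificity) / 2

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity),
                    ("balanced accuracy", balanced_accuracy)]:
    print(f"{name}: {value:.3f}")
```

The model catches about 91% of churners (recall), while roughly 73% of the customers it flags actually churn (precision). The balanced accuracy works out to about 0.903, the same value as the roc_auc reported for XGBClassifier in the model comparison, which is expected because that column was computed from hard class predictions.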