<a href="https://colab.research.google.com/github/KelvinLam05/travel_customer_churn_prediction/blob/main/travel_customer_churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**

The aim of this project is to predict whether a travel customer will churn.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

**Load the data**

In [None]:
# Load dataset
df = pd.read_csv('/content/customer_churn.csv')

In [None]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [None]:
# Examine the data
df.head()

Unnamed: 0,age,frequentflyer,annualincomeclass,servicesopted,accountsyncedtosocialmedia,bookedhotelornot,target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0


In [None]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age                         954 non-null    int64 
 1   frequentflyer               954 non-null    object
 2   annualincomeclass           954 non-null    object
 3   servicesopted               954 non-null    int64 
 4   accountsyncedtosocialmedia  954 non-null    object
 5   bookedhotelornot            954 non-null    object
 6   target                      954 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 52.3+ KB


**Examine the data**

Running the Pandas value_counts( ) function on the target column shows that this is an imbalanced dataset: 730 customers did not churn, while 224 (about 23%) did.

In [None]:
df['target'].value_counts()

0    730
1    224
Name: target, dtype: int64
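
To put a number on the imbalance, the class shares can be computed directly. The sketch below is only an illustration: it rebuilds a stand-in target column from the counts reported above rather than reading the real CSV, and the names `target`, `class_share` and `imbalance_ratio` are assumptions introduced here.

```python
import pandas as pd

# Stand-in target column rebuilt from the class counts reported above
# (assumption: the real column would come from df['target']).
target = pd.Series([0] * 730 + [1] * 224, name="target")

# normalize=True returns each class's share instead of its raw count.
class_share = target.value_counts(normalize=True).round(3)
print(class_share)

# Ratio of the majority class to the minority class.
imbalance_ratio = 730 / 224
print(round(imbalance_ratio, 2))
```

On the real data, `df['target'].value_counts(normalize = True)` would give the same shares.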

**Check for missing values**

In [None]:
# Check for missing values
df.isnull().sum()

age                           0
frequentflyer                 0
annualincomeclass             0
servicesopted                 0
accountsyncedtosocialmedia    0
bookedhotelornot              0
target                        0
dtype: int64

We don’t have any missing values. We are good to go.

**Examine categorical data cardinality**

Next we will take a look at the “cardinality” of the categorical variables. Cardinality is just a technical way of saying the number of unique values held within.

In [None]:
df.select_dtypes(include = ['object']).agg(['count', 'nunique']).T

Unnamed: 0,count,nunique
frequentflyer,954,3
annualincomeclass,954,3
accountsyncedtosocialmedia,954,2
bookedhotelornot,954,2


All of these features are quite low in cardinality.

**Split the train and test data**

In [None]:
X = df.drop('target', axis = 1) 

In [None]:
y = df['target']

To divide X and y into the train and test datasets we need for training and evaluating the model, we will use the train_test_split( ) function from scikit-learn. We will assign 30% of the data to the test set using the argument test_size = 0.3, and we will use the stratify = y option so that the churn and non-churn classes appear in the same proportions in the train and test data as in the full dataset. The random_state = 42 argument means we get the same split each time we run the code, rather than a new random split that may give us different results.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Perform a stratified train-test split on X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)
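
As a quick check of what stratify does, the sketch below splits a synthetic stand-in with the same class counts as the churn data and compares the share of churners before and after the split. The arrays `X_demo` and `y_demo` are placeholders made up for this illustration, not the project's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the churn data (assumption).
y_demo = np.array([0] * 730 + [1] * 224)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, the share of positives is nearly identical in every subset.
print("rows:", len(y_tr), len(y_te))
print("churn share (all, train, test):",
      round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without stratify, the churn share in a test set of this size could drift by a few percentage points from split to split.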

**Create a model pipeline**

Next, we will create a model pipeline. A ColumnTransformer( ) one-hot encodes the categorical columns and passes the numeric columns through unchanged, and a RobustScaler( ) scales the data before we pass it to the model.

I have also added some basic feature selection via SelectKBest( ), and have used SMOTEENN to resample the training data and better handle the class imbalance. I fiddled around with the SelectKBest parameters until I found the optimum number of features to keep. This was around eight features. Doing this greatly improved performance.

In [None]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import OneHotEncoder
from imblearn.combine import SMOTEENN
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import roc_auc_score

In [None]:
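def get_pipeline(X, model):
    """Build the preprocessing, resampling, scaling, feature selection and model pipeline."""

    # One-hot encode the categorical (object) columns; binary columns become a single 0/1 column.
    # Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use that name on newer versions.
    categorical_columns = X.select_dtypes(include = ['object']).columns.tolist()
    categorical_transformer = OneHotEncoder(drop = 'if_binary', sparse = False, handle_unknown = 'ignore')

    # Numeric columns are passed through unchanged
    numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()

    preprocessor = ColumnTransformer(transformers = [('numeric', 'passthrough', numeric_columns),
                                                     ('categorical', categorical_transformer, categorical_columns)])

    # The imblearn pipeline applies the SMOTEENN resampling step during fit only, never at predict time
    bundled_pipeline = imbpipeline(steps = [('preprocessor', preprocessor),
                                            ('smote', SMOTEENN(random_state = 42)),
                                            ('scaler', RobustScaler()),
                                            ('feature_selection', SelectKBest(score_func = mutual_info_classif, k = 8)),
                                            ('model', model)])

    return bundled_pipeline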
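
To see how many columns the preprocessor hands to SelectKBest( ), the encoding step can be run on its own. The sketch below is a toy illustration under stated assumptions: the frame `toy` only mirrors the cardinalities reported earlier (3, 3, 2 and 2), and the category labels 'A', 'B', 'C', 'Low', 'Middle' and 'High' are placeholders rather than the dataset's actual values. The encoder is left at its default output type so the snippet does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the reported cardinalities; all labels are hypothetical.
toy = pd.DataFrame({
    "age": [34, 34, 37, 30, 30, 28],
    "servicesopted": [6, 5, 3, 2, 1, 4],
    "frequentflyer": ["A", "B", "C", "A", "B", "C"],
    "annualincomeclass": ["Low", "Middle", "High", "Low", "Middle", "High"],
    "accountsyncedtosocialmedia": ["No", "Yes", "Yes", "No", "No", "Yes"],
    "bookedhotelornot": ["Yes", "No", "No", "No", "No", "Yes"],
})

numeric = ["age", "servicesopted"]
categorical = ["frequentflyer", "annualincomeclass",
               "accountsyncedtosocialmedia", "bookedhotelornot"]

preprocessor = ColumnTransformer(transformers=[
    ("numeric", "passthrough", numeric),
    # drop='if_binary' keeps one column for two-level features
    # and one column per level for everything else.
    ("categorical", OneHotEncoder(drop="if_binary"), categorical),
])

encoded = preprocessor.fit_transform(toy)
n_columns = encoded.shape[1]
print(n_columns)  # 2 numeric + 3 + 3 + 1 + 1 = 10
```

If the real data encodes the same way, k = 8 keeps eight of the ten encoded columns.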

**Select the best model**

Rather than simply selecting a single model, or repeating our code manually on a range of models, we can create another function to automatically test a wide range of possible models to determine the best one for our needs. To do this we first create a dictionary containing a selection of base classifiers, including XGBoost, Random Forest, Decision Tree, SVC, and a Bernoulli Naive Bayes among others.

We will create a Pandas dataframe in which to store the results. Then we will loop over each of the models, fit it using the X_train and y_train data, generate predictions from X_test, and calculate the mean ROC/AUC score from 10 rounds of cross-validation on the training data. That will give us the ROC/AUC score for the X_test data, plus the average cross-validated ROC/AUC score for the training data set. Note that the test score is computed from hard class predictions rather than predicted probabilities, so it is equivalent to balanced accuracy and is not directly comparable to the cross-validated score.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import BernoulliNB

In [None]:
def select_model(X, y):

    classifiers = {'DummyClassifier': DummyClassifier(strategy = 'most_frequent', random_state = 42),
                   'XGBClassifier': XGBClassifier(random_state = 42),
                   'XGBRFClassifier': XGBRFClassifier(random_state = 42),
                   'LogisticRegression': LogisticRegression(random_state = 42),
                   'LGBMClassifier': LGBMClassifier(random_state = 42),
                   'RandomForestClassifier': RandomForestClassifier(random_state = 42),
                   'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42),
                   'ExtraTreeClassifier': ExtraTreesClassifier(random_state = 42),
                   'GradientBoostingClassifier': GradientBoostingClassifier(random_state = 42),
                   'BaggingClassifier': BaggingClassifier(random_state = 42),
                   'AdaBoostClassifier': AdaBoostClassifier(random_state = 42),
                   'HistGradientBoostingClassifier': HistGradientBoostingClassifier(random_state = 42),
                   'KNeighborsClassifier': KNeighborsClassifier(),
                   'SGDClassifier': SGDClassifier(random_state = 42),
                   'BernoulliNB': BernoulliNB(),
                   'LinearSVC': LinearSVC(random_state = 42),
                   'SVC': SVC(random_state = 42),
                   'CatBoostClassifier': CatBoostClassifier(silent = True, random_state = 42)}

    rows = []

    for key, classifier in classifiers.items():

        print('*', key)

        start_time = time.time()

        # Fit on the training data passed in as X and y
        pipeline = get_pipeline(X, classifier)
        pipeline.fit(X, y)

        # X_test and y_test come from the notebook's global scope
        y_pred = pipeline.predict(X_test)

        # Mean ROC AUC over 10-fold cross-validation on the training data
        cv = cross_val_score(pipeline, X, y, cv = 10, scoring = 'roc_auc')

        # run_time is in minutes; roc_auc is scored on hard class predictions
        rows.append({'model': key,
                     'run_time': format(round((time.time() - start_time) / 60, 2)),
                     'roc_auc_cv': cv.mean(),
                     'roc_auc': roc_auc_score(y_test, y_pred)})

    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'roc_auc_cv', 'roc_auc'])
    df_models = df_models.sort_values(by = 'roc_auc_cv', ascending = False)

    return df_models

In [None]:
models = select_model(X_train, y_train)

* DummyClassifier
* XGBClassifier
* XGBRFClassifier
* LogisticRegression
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreeClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* HistGradientBoostingClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* LinearSVC
* SVC
* CatBoostClassifier


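Examining the output from the model selection step shows that we achieved very good results. HistGradientBoostingClassifier had the highest cross-validated ROC/AUC score (0.924), while XGBClassifier had the highest test ROC/AUC score (0.903) among the ten models shown.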

In [None]:
models.head(10)

Unnamed: 0,model,run_time,roc_auc_cv,roc_auc
11,HistGradientBoostingClassifier,0.06,0.924024,0.827001
17,CatBoostClassifier,0.2,0.923317,0.891588
4,LGBMClassifier,0.02,0.921789,0.819539
1,XGBClassifier,0.02,0.920196,0.902951
8,GradientBoostingClassifier,0.03,0.90844,0.890299
5,RandomForestClassifier,0.04,0.901021,0.872117
7,ExtraTreeClassifier,0.04,0.890131,0.854919
10,AdaBoostClassifier,0.03,0.889837,0.839009
12,KNeighborsClassifier,0.01,0.881458,0.845183
9,BaggingClassifier,0.02,0.874142,0.884125


**Assessing performance**

When it comes to assessing models, there’s more to it than simply picking the one with the best score. It’s where the model goes wrong that often matters. Too many false positives will waste the time of the sales team, while too many false negatives will mean the model is missing customers who go on to churn.
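
The link between these two error types and the precision and recall metrics can be made concrete with a small example. The labels below are invented for illustration (1 = churned, 0 = stayed); they are not taken from the project's data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp): penalises false positives
recall = recall_score(y_true, y_pred)        # tp / (tp + fn): penalises false negatives

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the five customers flagged as churners actually stayed, which lowers precision, and one of the four real churners was missed, which lowers recall.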



In [None]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [None]:
classifiers = {'XGBClassifier': XGBClassifier(random_state = 42), 
               'CatBoostClassifier': CatBoostClassifier(silent = True, random_state = 42),
               'GradientBoostingClassifier': GradientBoostingClassifier(random_state = 42)}

rows = []

for key in classifiers:

    print('*', key)
      
    pipeline = get_pipeline(X_train, classifiers[key])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # All three metrics are computed from the hard class predictions
    roc_auc = roc_auc_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    rows.append({'model': key,
                 'roc_auc': round(roc_auc, 3),
                 'precision': round(precision, 3),
                 'recall': round(recall, 3)})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df_models = pd.DataFrame(rows, columns = ['model', 'roc_auc', 'precision', 'recall'])

* XGBClassifier
* CatBoostClassifier
* GradientBoostingClassifier

In [None]:
df_models.sort_values(by = 'roc_auc', ascending = False).head()

Unnamed: 0,model,roc_auc,precision,recall
0,XGBClassifier,0.903,0.726,0.91
1,CatBoostClassifier,0.892,0.685,0.91
2,GradientBoostingClassifier,0.89,0.728,0.881


A for-profit organization has to spend its marketing budget as judiciously as possible, so profitability is key. Precision and the ROC-AUC score are therefore the most important metrics here. XGBClassifier is probably the better choice, as its ROC-AUC score (0.903) is higher than that of GradientBoostingClassifier (0.890). GradientBoostingClassifier has a slightly better precision score (0.728 versus 0.726), but the difference is minimal.



**Examine the performance of the best model**

Finally, we can take our best model, the XGBClassifier, and fit it to the training data. To do this, we will first define our selected model, then pass it to get_pipeline( ) along with our training data. Then, we will fit( ) the pipeline on the training data and use predict( ) to return predictions for the test data from the newly trained model.

In [None]:
selected_model = XGBClassifier(random_state = 42)
bundled_pipeline = get_pipeline(X_train, selected_model)
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)