<a href="https://colab.research.google.com/github/KelvinLam05/telecom_churn_prediction/blob/main/telecom_churn_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Goal of the project**


The ability to predict that a particular customer is at a high risk of churning, while there is still time to do something about it, represents a huge additional potential revenue source for every online business. Besides the direct loss of revenue that results from a customer abandoning the business, the costs of initially acquiring that customer may not have already been covered by the customer’s spending to date. (In other words, acquiring that customer may have actually been a losing investment.) Furthermore, it is always more difficult and expensive to acquire a new customer than it is to retain a current paying customer.

In this project, we’ll build a churn model for a contractual setting, that is, one where customers hold a subscription and churn is an observable event.

**Load the package**

In [82]:
# Importing library
import pandas as pd

**Load the data**

For this project I’ve used the [Iranian Churn](https://www.kaggle.com/datasets/royjafari/customer-churn) dataset from Kaggle. This is data from a telecoms provider, so we’ll be using it here to create a contractual churn model.

In [83]:
# Load dataset
df = pd.read_csv("/content/telecom's_churn_dataset.csv")

In [84]:
# Rename Pandas columns to lower case
df.columns = df.columns.str.lower()

In [85]:
# Examine the data
df.head()

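| | call failure | complains | subscription length | charge amount | seconds of use | frequency of use | frequency of sms | distinct called numbers | age group | tariff plan | status | age | customer value | fn | fp | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 38 | 0 | 4370 | 71 | 5 | 17 | 3 | 1 | 1 | 30 | 197.64 | 177.876 | 69.764 | 0 |
| 1 | 0 | 0 | 39 | 0 | 318 | 5 | 7 | 4 | 2 | 1 | 2 | 25 | 46.035 | 41.4315 | 60.0 | 0 |
| 2 | 10 | 0 | 37 | 0 | 2453 | 60 | 359 | 24 | 3 | 1 | 1 | 30 | 1536.52 | 1382.868 | 203.652 | 0 |
| 3 | 10 | 0 | 38 | 0 | 4198 | 66 | 1 | 35 | 1 | 1 | 1 | 15 | 240.02 | 216.018 | 74.002 | 0 |
| 4 | 3 | 0 | 38 | 0 | 2393 | 58 | 2 | 33 | 1 | 1 | 1 | 15 | 145.805 | 131.2245 | 64.5805 | 0 |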


In [86]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   call  failure            3150 non-null   int64  
 1   complains                3150 non-null   int64  
 2   subscription  length     3150 non-null   int64  
 3   charge  amount           3150 non-null   int64  
 4   seconds of use           3150 non-null   int64  
 5   frequency of use         3150 non-null   int64  
 6   frequency of sms         3150 non-null   int64  
 7   distinct called numbers  3150 non-null   int64  
 8   age group                3150 non-null   int64  
 9   tariff plan              3150 non-null   int64  
 10  status                   3150 non-null   int64  
 11  age                      3150 non-null   int64  
 12  customer value           3150 non-null   float64
 13  fn                       3150 non-null   float64
 14  fp                       3150 non-null   float64
 15  churn                    3150 non-null   int64  
dtypes: float64(3), int64(13)

**Define the target variable**


Running the Pandas value_counts() function on the target shows that this is an imbalanced dataset: only 495 of the 3,150 customers (about 16%) churned.

In [87]:
df['churn'].value_counts()

0    2655
1     495
Name: churn, dtype: int64
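
To put a number on the imbalance, the same call can return proportions instead of counts. The sketch below rebuilds a stand-in churn column from the counts printed above (it does not load the real dataset) and uses the normalize = True argument of value_counts().

```python
import pandas as pd

# Stand-in for df['churn'], rebuilt from the counts printed above
churn = pd.Series([0] * 2655 + [1] * 495, name = 'churn')

# normalize = True returns each class's share of the rows instead of its count
class_share = churn.value_counts(normalize = True)
print(class_share.round(3))
```

On the real data, the equivalent call would be df['churn'].value_counts(normalize = True).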

**Check for missing values**

Before moving on, we’ll check whether there are any null values to impute. There are none, so there is nothing to do.

In [88]:
# Check for missing values
df.isnull().sum()

call  failure              0
complains                  0
subscription  length       0
charge  amount             0
seconds of use             0
frequency of use           0
frequency of sms           0
distinct called numbers    0
age group                  0
tariff plan                0
status                     0
age                        0
customer value             0
fn                         0
fp                         0
churn                      0
dtype: int64

**Split the train and test data**

In [89]:
X = df.drop('churn', axis = 1) 

In [90]:
y = df['churn']

To divide X and y into the train and test datasets, we will use the train_test_split() function from scikit-learn. We assign 30% of the data to the test set using the argument test_size = 0.3, and we use the stratify = y option to ensure the churn and non-churn classes appear in the same proportions in the train and test data. The random_state = 42 argument makes the split reproducible, so we get the same results each time we run the code rather than a different random mix.

In [91]:
from sklearn.model_selection import train_test_split

In [92]:
# Perform a stratified 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, stratify = y)
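
As a quick check of what stratify does, the sketch below splits stand-in data with the same class counts as the churn column and compares the churn rate in each part. The arrays X_demo and y_demo are hypothetical placeholders (random features), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same class counts as the churn column (features are random noise)
rng = np.random.default_rng(42)
y_demo = np.array([0] * 2655 + [1] * 495)
X_demo = rng.normal(size = (len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.3, random_state = 42, stratify = y_demo)

# With stratify, the churn rate is almost identical in both splits
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(round(train_rate, 3), round(test_rate, 3))
```

Without stratify, the two rates can drift apart by chance, which matters more the rarer the positive class is.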

**Create a model pipeline**

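Next, we will create a model pipeline. A ColumnTransformer() selects the numeric columns and passes them through unchanged (every feature in this dataset is already numeric, so no encoding is needed). The pipeline then resamples the training data, scales it with RobustScaler(), and passes it to the model.

I have used SMOTEENN, which combines SMOTE over-sampling with Edited Nearest Neighbours cleaning, to better handle class imbalance.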

In [93]:
import time
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.combine import SMOTEENN
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

In [94]:
def get_pipeline(X, model):

    # All non-object columns are treated as numeric features
    numeric_columns = X.select_dtypes(exclude = ['object']).columns.tolist()

    # Pass the numeric columns through unchanged; any other column is dropped
    preprocessor = ColumnTransformer(transformers = [('numeric', 'passthrough', numeric_columns)])

    # imblearn's Pipeline applies the SMOTEENN resampler during fit only
    bundled_pipeline = imbpipeline(steps = [('preprocessor', preprocessor),
                                            ('smote', SMOTEENN(random_state = 42)),
                                            ('scaler', RobustScaler()),
                                            ('model', model)])
    return bundled_pipeline

**Select the best model**

Rather than simply selecting a single model, or repeating our code manually on a range of models, we can create another function to automatically test a wide range of possible models and determine the best one for our needs. To do this we first create a dictionary containing a selection of base classifiers, including XGBoost, Random Forest, Decision Tree, SVC, and Bernoulli Naive Bayes, among others.

We will create a Pandas dataframe in which to store the results. Then we will loop over each of the models, calculate its mean ROC AUC score from 5 rounds of cross-validation on the training data, fit it using the X_train and y_train data, and generate predictions from X_test. That gives us the cross-validated ROC AUC score on the training set (roc_auc_cv) plus a score on the held-out X_test data (roc_auc). Note that roc_auc is computed from hard class predictions rather than predicted probabilities, so for this binary problem it equals the balanced accuracy (the mean of the two class recalls).

In [95]:
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from catboost import CatBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB

In [96]:
def select_model(X, y, pipeline = None):

  classifiers = {}
  classifiers.update({'DummyClassifier': DummyClassifier(strategy = 'most_frequent', random_state = 42)})
  classifiers.update({'XGBClassifier': XGBClassifier(random_state = 42)})
  classifiers.update({'XGBRFClassifier': XGBRFClassifier(random_state = 42)})
  classifiers.update({'LGBMClassifier': LGBMClassifier(random_state = 42)})
  classifiers.update({'RandomForestClassifier': RandomForestClassifier(random_state = 42)})
  classifiers.update({'DecisionTreeClassifier': DecisionTreeClassifier(random_state = 42)})
  classifiers.update({'ExtraTreesClassifier': ExtraTreesClassifier(random_state = 42)})
  classifiers.update({'GradientBoostingClassifier': GradientBoostingClassifier(random_state = 42)})    
  classifiers.update({'BaggingClassifier': BaggingClassifier(random_state = 42)})
  classifiers.update({'AdaBoostClassifier': AdaBoostClassifier(random_state = 42)})
  classifiers.update({'HistGradientBoostingClassifier': HistGradientBoostingClassifier(random_state = 42)})
  classifiers.update({'KNeighborsClassifier': KNeighborsClassifier()})
  classifiers.update({'SGDClassifier': SGDClassifier(random_state = 42)})
  classifiers.update({'BernoulliNB': BernoulliNB()})
  classifiers.update({'SVC': SVC(random_state = 42)})
  classifiers.update({'CatBoostClassifier': CatBoostClassifier(silent = True, random_state = 42)})

  rows = []

  for key in classifiers:

      print('*', key)

      start_time = time.time()
      
      pipeline = get_pipeline(X, classifiers[key])

      # Mean ROC AUC across 5 cross-validation folds of the training data
      cv = cross_val_score(pipeline, X, y, cv = 5, scoring = 'roc_auc', n_jobs = -1)

      # Refit on all the training data, then predict the held-out test set
      # (X_test and y_test are read from the notebook's global scope)
      pipeline.fit(X, y)
      y_pred = pipeline.predict(X_test)

      # run_time is the elapsed time in minutes
      rows.append({'model': key,
                   'run_time': format(round((time.time() - start_time) / 60, 2)),
                   'roc_auc_cv': cv.mean(),
                   'roc_auc': roc_auc_score(y_test, y_pred)})

  # DataFrame.append() was removed in pandas 2.0, so build the frame from the collected rows
  df_models = pd.DataFrame(rows, columns = ['model', 'run_time', 'roc_auc_cv', 'roc_auc'])
  df_models = df_models.sort_values(by = 'roc_auc', ascending = False)
      
  return df_models

In [97]:
models = select_model(X_train, y_train)

* DummyClassifier
* XGBClassifier
* XGBRFClassifier
* LGBMClassifier
* RandomForestClassifier
* DecisionTreeClassifier
* ExtraTreesClassifier
* GradientBoostingClassifier
* BaggingClassifier
* AdaBoostClassifier
* HistGradientBoostingClassifier
* KNeighborsClassifier
* SGDClassifier
* BernoulliNB
* SVC
* CatBoostClassifier


In [98]:
models.head(10)

| | model | run_time | roc_auc_cv | roc_auc |
|---|---|---|---|---|
| 3 | LGBMClassifier | 0.02 | 0.965212 | 0.942233 |
| 15 | CatBoostClassifier | 0.55 | 0.967242 | 0.938227 |
| 10 | HistGradientBoostingClassifier | 0.04 | 0.964661 | 0.934221 |
| 8 | BaggingClassifier | 0.01 | 0.950193 | 0.933835 |
| 6 | ExtraTreesClassifier | 0.02 | 0.966524 | 0.930699 |
| 4 | RandomForestClassifier | 0.03 | 0.958578 | 0.923556 |
| 11 | KNeighborsClassifier | 0.01 | 0.940544 | 0.918635 |
| 1 | XGBClassifier | 0.02 | 0.957881 | 0.917622 |
| 7 | GradientBoostingClassifier | 0.06 | 0.954917 | 0.91376 |
| 5 | DecisionTreeClassifier | 0.0 | 0.89511 | 0.906279 |
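
One caveat when reading the two score columns: roc_auc_cv comes from scoring = 'roc_auc', which ranks customers by the model's scores or probabilities, while roc_auc is computed from the 0/1 output of predict(). The sketch below illustrates the difference on synthetic data with a simple LogisticRegression stand-in (an assumption for illustration, not the notebook's pipeline or models).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data and a simple stand-in model (not the notebook's pipeline)
X_demo, y_demo = make_classification(n_samples = 2000, weights = [0.85, 0.15], random_state = 42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.3, random_state = 42, stratify = y_demo)
model = LogisticRegression(max_iter = 1000).fit(X_tr, y_tr)

# Scoring hard 0/1 labels: a single operating point, equal to balanced accuracy
labels = model.predict(X_te)
auc_from_labels = roc_auc_score(y_te, labels)
balanced_acc = balanced_accuracy_score(y_te, labels)

# Scoring the positive-class probability: uses the full ranking of customers
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_from_labels, balanced_acc, auc_from_proba)
```

If a threshold-free ROC AUC on the test set is wanted, passing predict_proba(X_test)[:, 1] to roc_auc_score is one option for classifiers that expose probabilities.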


**Fit the best model**

Finally, we can take our best model, the LGBMClassifier, and fit it to the training data. To do this, we first define the selected model and pass it to get_pipeline() along with our training data. Then we call fit() on the training data and use predict() to return predictions for the test set from the newly trained model.

In [99]:
selected_model = LGBMClassifier(random_state = 42)
bundled_pipeline = get_pipeline(X_train, selected_model)
bundled_pipeline.fit(X_train, y_train)
y_pred = bundled_pipeline.predict(X_test)

**Examine the predictions**

To examine how well the model performed in a little more detail, we can use classification_report(). The classification report shows us the precision, recall, and F1 score for each class.




In [100]:
from sklearn.metrics import classification_report

In [101]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.94      0.96       797
           1       0.74      0.95      0.83       148

    accuracy                           0.94       945
   macro avg       0.87      0.94      0.90       945
weighted avg       0.95      0.94      0.94       945



Let’s unpack those results a little bit…

**Recall**


A churn class recall of 0.95 means that the model was able to catch 95% of the actual churn cases. This is the measure we really care about, because we want to miss as few of the true churn cases as possible.

**Precision**

Precision of the churn class measures how often a customer the model flags as a churner actually churns, so it falls whenever the model misclassifies a non-churn case as a churn case. Here, a churn precision of 0.74 means roughly one in four flagged customers would not have churned. That is not a problem in this case, because there are no significant consequences of identifying a customer as a churn risk when she isn’t.

**F1 score**

The F1 score is the harmonic mean of precision and recall. It helps give us a balanced idea of how the model is performing on the churn class. In this case a churn class F1 score of 0.83 is pretty good. 
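
The three metrics can be checked by hand from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are an assumption, chosen because they are consistent with the class supports (797 and 148) and the rounded scores in the report above.

```python
# Assumed churn-class counts, consistent with the report but not taken from the notebook
true_pos = 140    # churners correctly flagged
false_neg = 8     # churners the model missed
false_pos = 49    # non-churners flagged as churners
true_neg = 748    # non-churners correctly left alone

# Precision: of the customers flagged as churners, how many actually churned
precision = true_pos / (true_pos + false_pos)

# Recall: of the customers who actually churned, how many were flagged
recall = true_pos / (true_pos + false_neg)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.74 0.95 0.83
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1 score sits closer to the precision of 0.74 than a simple average would.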