# Telecom Churn Prediction with Boosting

Customer churn, also known as customer attrition, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay-TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as key business metrics because retaining an existing customer costs far less than acquiring a new one.

Churn prediction models estimate each customer's propensity to churn. Because these models produce a small, prioritized list of potential defectors, they help focus retention marketing programs on the subset of the customer base most vulnerable to churn.
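To make the idea of a "prioritized list" concrete, here is a minimal sketch that ranks customers by estimated churn probability. The data is synthetic (generated with `make_classification`), and the names `churn_risk`, `top_k`, and `priority_list` are illustrative assumptions, not part of the project. For brevity the model is scored on the same rows it was trained on; in practice you would rank customers the model has not seen.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for a customer table: 500 customers, 6 numeric features,
# with roughly a quarter of them labelled as churners (class 1).
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.75, 0.25], random_state=0)

model = AdaBoostClassifier(random_state=0)
model.fit(X, y)

# predict_proba returns one column per class; column 1 is the churn class.
churn_risk = model.predict_proba(X)[:, 1]

# Row indices of the 10 customers with the highest estimated churn risk,
# ordered from most to least at risk.
top_k = 10
priority_list = np.argsort(churn_risk)[::-1][:top_k]

print(priority_list)
print(churn_risk[priority_list].round(3))
```

A retention team would then contact the customers in `priority_list` first, rather than the whole customer base.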

For this project, we will explore the dataset of a telecom company and try to predict customer churn.

## Problem Statement

Using boosting, classify whether or not a customer will churn.

## About the Dataset

A snapshot of the dataset you will be working on:

![](https://github.com/commit-live-students/Data_Science_Masters_Program_2021/blob/main/16-ensemble_methods/gbm/images/dataset.png?raw=1)


## Why solve this project?

After completing this project, you will have a better understanding of how to build a boosting model. In this project, you will apply the following concepts.


- Handling missing values in data
- Applying AdaBoost
- Applying XGBoost
- Interpreting evaluation metrics


## Load data

The first step, as you know by now, is to load the dataset and see what it looks like. Additionally, split it into train and test sets.


## Instructions:

- Load the dataset from path using the `"read_csv()"` method from pandas and store it in a variable called `'df'`


- Store all the features (all columns except `'customerID'` and `'Churn'`) of `'df'` in a variable called `X`


- Store the target variable (`Churn`) of `'df'` in a variable called `y`


- Split `'X'` and `'y'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.3` and `random_state = 0`


## Hints:

- Use `X_train,X_test,y_train,y_test=train_test_split(X,y ,test_size=0.3,random_state=0)` to split the dataset.
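As a quick illustration of what `test_size` and `random_state` do, the sketch below runs the same split twice on a small toy frame. The frame `toy` and its column values are made up for this example; only the `train_test_split` call mirrors the hint above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the churn data: 10 rows, one feature, one target.
toy = pd.DataFrame({
    "tenure": range(10),
    "Churn": ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"],
})
X_toy = toy.drop(columns=["Churn"])
y_toy = toy["Churn"]

# Running the same call twice with the same random_state gives the same split.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(a_test.index.equals(b_test.index))

# With test_size=0.3, 3 of the 10 rows go to the test set and 7 to the train set.
print(len(a_train), len(a_test))
```

The same rule applied to the 7043 rows of the churn data gives the 4930/2113 split shown later in this notebook.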


## Test cases:

#df
Variable declaration
df.shape==(7043,21)
df.iloc[10,10]=='No'

#X_train
Variable declaration
X_train.iloc[10,10]=='Yes'

#X_test
Variable declaration
X_test.iloc[10,10]=='No'

#y_train
Variable declaration
y_train[10]=='No'

#y_test
Variable declaration
y_test[1]=='No'

In [None]:
#Pre Code
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
path='https://raw.githubusercontent.com/commit-live-students/Data_Science_Masters_Program_2021/main/16-ensemble_methods/gbm/data/telecom_churn.csv'

# Code starts here
#Reading of file
df = pd.read_csv(path)
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [None]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [None]:
#Extracting features
X=df.drop(['customerID','Churn'],axis=1)

#Extracting target class
y=df['Churn']

#Splitting data into train and test
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size = 0.3,random_state=0)

In [None]:
X_train.shape

(4930, 19)

In [None]:
X_test.shape

(2113, 19)

In [None]:
y_train

3580    Yes
2364     No
6813    Yes
789      No
561      No
       ... 
4931     No
3264     No
1653     No
2607    Yes
2732     No
Name: Churn, Length: 4930, dtype: object

## Success Message

Congrats! You have successfully loaded the dataset and split it into `train` and `test` sets.

# Clean data

In this task, we will try to replace the missing values and modify some column values.

## Instructions

- Replace the spaces (`' '`) in the `TotalCharges` column of `'X_train'` with `np.nan` and assign the result back to the same column. Do the same for `'X_test'`.

- Change the `TotalCharges` column of `'X_train'` to the `float` data type. Do the same for `'X_test'`.

- Fill the missing values (NaN) in the `TotalCharges` column of `'X_train'` with the column mean using `"fillna()"` and assign the result back to the same column. Do the same for `'X_test'`.

- Check whether any other NaN value exists in train data using `"isnull().sum()"`

- Label encode all categorical columns of `'X_train'` using `"LabelEncoder()"`. Do the same for `'X_test'`

- Using `"replace()"` function, replace the values of `'y_train'` in the following way :`{'No':0, 'Yes':1}`(i.e. Replace No with 0 and Yes with 1). Do the same with `y_test`.
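The `TotalCharges` steps above can be sketched on a toy column as follows. The values in `charges` are made up for illustration, and `pd.to_numeric(..., errors="coerce")` is shown only as a possible shortcut, not as the method this project requires.

```python
import numpy as np
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as strings, with a
# single-space string where the value is missing.
charges = pd.Series(["29.85", " ", "108.15", "1840.75"], name="TotalCharges")

# The two steps from the instructions: blank -> NaN, then cast to float.
step_by_step = charges.replace(" ", np.nan).astype(float)

# Possible shortcut: to_numeric with errors="coerce" turns any value that
# cannot be parsed as a number into NaN in a single call.
coerced = pd.to_numeric(charges, errors="coerce")

# Fill the gap with the mean of the observed values.
filled = coerced.fillna(coerced.mean())

print(step_by_step.isna().sum(), coerced.isna().sum())
print(filled.tolist())
```

Both routes leave exactly one NaN before filling, because the blank string is the only entry that is not a number.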

## Hints

- Use `"X_train.isnull().sum()"` to find how many null values exist in X_train

- You can label encode all categorical columns of `X_train` by writing code similar to

```python
cat_cols = X_train.select_dtypes(include='O').columns.tolist()

#Label encoding train data
for x in cat_cols:
    le = LabelEncoder()
    X_train[x] = le.fit_transform(X_train[x])


```
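One thing to keep in mind: fitting a separate `LabelEncoder` on each split yields the same codes only when every category appears in both splits. The sketch below, on made-up toy columns, shows what can happen otherwise, and how reusing the encoder fitted on the training column keeps the mapping consistent. This is an illustration of the encoder's behaviour, not a change to the instructions above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy columns: the test column happens to contain no "DSL" rows.
train_col = pd.Series(["DSL", "Fiber optic", "No", "DSL"])
test_col = pd.Series(["Fiber optic", "No"])

# Fit once on the training column and reuse the fitted encoder on test data.
le = LabelEncoder()
train_codes = le.fit_transform(train_col)
test_codes = le.transform(test_col)
print(le.classes_.tolist())   # categories in the order of their integer codes
print(test_codes.tolist())

# A fresh encoder fitted on the test column alone numbers its own categories
# from 0, so the same string can receive a different code.
refit_codes = LabelEncoder().fit_transform(test_col)
print(refit_codes.tolist())
```

Here "Fiber optic" is encoded as 1 by the reused encoder but as 0 by the encoder refitted on the test column.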

## Test cases

np.round(X_train.loc[28,'TotalCharges'],2)==np.round(6369.45,2)

np.round(X_test.loc[1,'TotalCharges'],2)==np.round(1889.5,2)

np.all(X_train.isnull().sum()[:]==0)

np.all(X_test.isnull().sum()[:]==0)


X_train.iloc[10,10]==2


X_test.iloc[10,10]==0


y_train[10]==0

y_test[1]==0

In [None]:
df['TotalCharges']

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: TotalCharges, Length: 7043, dtype: object

In [None]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Code starts here
#Replacing spaces with NaN in train dataset
X_train['TotalCharges']=X_train['TotalCharges'].replace(' ',np.nan)

#Replacing spaces with NaN in test dataset
X_test['TotalCharges']=X_test['TotalCharges'].replace(' ',np.nan)

#Converting the type of column from X_train to float
X_train['TotalCharges']=X_train['TotalCharges'].astype(float)

#Converting the type of column from X_test to float
X_test['TotalCharges']=X_test['TotalCharges'].astype(float)

#Filling missing values with the column mean (assigned back rather than modified in place)
X_train['TotalCharges']=X_train['TotalCharges'].fillna(X_train['TotalCharges'].mean())
X_test['TotalCharges']=X_test['TotalCharges'].fillna(X_test['TotalCharges'].mean())


In [None]:
X_train['TotalCharges'].isna().sum()

0

In [None]:
X_test['TotalCharges'].isna().sum()

0

In [None]:
#List the categorical (object dtype) columns
cat_cols = X_train.select_dtypes(include='O').columns.tolist()
cat_cols

['gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod']

In [None]:
y_train

3580    Yes
2364     No
6813    Yes
789      No
561      No
       ... 
4931     No
3264     No
1653     No
2607    Yes
2732     No
Name: Churn, Length: 4930, dtype: object

In [None]:
# Label encoding train data
for x in cat_cols:
    le = LabelEncoder()
    X_train[x] = le.fit_transform(X_train[x])

#Label encoding test data
for x in cat_cols:
    le = LabelEncoder()
    X_test[x] = le.fit_transform(X_test[x])

#Encoding train data target
y_train=y_train.replace({'No':0,'Yes':1})

#Encoding test data target
y_test=y_test.replace({'No':0,'Yes':1})

In [None]:
y_test

2200    0
4627    0
3225    0
2828    0
3768    0
       ..
4448    1
1231    0
3304    0
4805    0
5843    0
Name: Churn, Length: 2113, dtype: int64

## Success Message

Congrats! You have successfully filled missing values and cleaned the data.

# AdaBoost Implementation

X_train, X_test have been label encoded.

y_train and y_test have been encoded with the following dictionary {'No':0, 'Yes':1}

In this task, we will try to predict customer churn using AdaBoost.

## Instructions
- Print values of X_train, X_test, y_train, y_test to take a look at their transformed versions


- Initialise an AdaBoost model with `AdaBoostClassifier()` having `random_state=0` and save it to a variable called `'ada_model'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Store the prediction of `'X_test'` by `'ada_model'` in a variable called `'y_pred'`

- Find out the accuracy score between `'y_test'` and `'y_pred'` using `"accuracy_score()"` and save it in a variable called `'ada_score'`

- Since it's a slightly imbalanced dataset, find out the `confusion matrix` between `'y_test'` and `'y_pred'` using `"confusion_matrix()"` and save it in a variable called `'ada_cm'`


- Also find out the `classification_report` between `'y_test'` and `'y_pred'` using `"classification_report()"` and save it in a variable called `'ada_cr'`




## Hints

* Use `accuracy_score(y_test,y_pred)` to check the prediction accuracy of the model.

* Use `confusion_matrix(y_test,y_pred)` to check the confusion matrix of the model.

* Use `classification_report(y_test,y_pred)` to check the classification report of the model.
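To see how these three metrics relate, the following sketch recomputes accuracy, precision, recall, and F1 for the churn class by hand. It uses the AdaBoost confusion matrix printed in this section's output; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported for AdaBoost in this notebook
# (rows = actual class, columns = predicted class; class 1 = churn).
cm = np.array([[1371, 189],
               [243, 310]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the customers predicted to churn, how many did
recall = tp / (tp + fn)      # of the customers who churned, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The results match the class `1` row of the classification report: precision 0.62, recall 0.56, F1 0.59, with an overall accuracy of about 0.7956.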

## Test cases

#ada_score
Variable declaration
np.round(ada_score,2) == np.round(0.795551348793185,2)

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

# Code starts here
# Initialising AdaBoostClassifier model
ada_model=AdaBoostClassifier(random_state=0)

#Fitting the model on train data
ada_model.fit(X_train,y_train)

#Making prediction on test data
y_pred=ada_model.predict(X_test)

#Finding the accuracy score
ada_score=accuracy_score(y_test,y_pred)
print("ada_score:-",ada_score)

#Finding the confusion matrix
ada_cm=confusion_matrix(y_test,y_pred)
print("confusion_matrix",ada_cm)

#Finding the classification report
ada_cr=classification_report(y_test,y_pred)
print("classification_report",ada_cr)

ada_score:- 0.795551348793185
confusion_matrix [[1371  189]
 [ 243  310]]
classification_report               precision    recall  f1-score   support

           0       0.85      0.88      0.86      1560
           1       0.62      0.56      0.59       553

    accuracy                           0.80      2113
   macro avg       0.74      0.72      0.73      2113
weighted avg       0.79      0.80      0.79      2113




## Success Message

Congrats! You have successfully applied the AdaBoost model to make churn predictions.

# XGBoost Implementation

Let's also implement XGBoost for our customer churn problem and see how it performs in comparison to AdaBoost.

## Instructions

- Initialise an XGBoost Classifier model with `XGBClassifier()` having `random_state=0` and save it to a variable called `'xgb_model'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Store the prediction of `'X_test'` by `'xgb_model'` in a variable called `'y_pred'`

- Find out the accuracy score between `'y_test'` and `'y_pred'` using `"accuracy_score()"` and save it in a variable called `'xgb_score'`

- Since it's a slightly imbalanced dataset, find out the `confusion matrix` between `'y_test'` and `'y_pred'` using `"confusion_matrix()"` and save it in a variable called `'xgb_cm'`


- Also find out the `classification_report` between `'y_test'` and `'y_pred'` using `"classification_report()"` and save it in a variable called `'xgb_cr'`
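Why look beyond accuracy on an imbalanced dataset? The sketch below builds labels with the same class counts as this notebook's test set (1560 non-churners, 553 churners) and scores a baseline that always predicts "no churn". The `DummyClassifier` baseline and the placeholder feature matrix `X_dummy` are assumptions added for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the test set in this notebook:
# 1560 non-churners (0) and 553 churners (1).
y_true = np.array([0] * 1560 + [1] * 553)
X_dummy = np.zeros((len(y_true), 1))  # placeholder; the baseline ignores features

# Baseline that always predicts the most frequent class (0 = no churn).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_accuracy = accuracy_score(y_true, y_base)
base_recall = recall_score(y_true, y_base)  # recall for the churn class

print(round(base_accuracy, 4), base_recall)
```

The baseline reaches about 0.74 accuracy while catching no churners at all, so an accuracy near 0.80 is best read alongside the confusion matrix and the recall for class `1`.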

# Observation

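In the reference run behind the test cases, the XGBoost score (0.7965) is only about 0.001 better than the AdaBoost score (0.7956). Exact values depend on the installed library versions: in the run recorded in this notebook, default XGBoost scores 0.7908.

Let's see if we can make it perform better.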

## Instructions

- Initialise a grid search object with `GridSearchCV()` having `estimator=xgb_model` & `param_grid=parameters` and save it to a variable called `'clf_model'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Store the prediction of `'X_test'` by `'clf_model'` in a variable called `'y_pred'`

- Find out the accuracy score between `'y_test'` and `'y_pred'` using `"accuracy_score()"` and save it in a variable called `'clf_score'`

- Find out the `confusion matrix` between `'y_test'` and `'y_pred'` using `"confusion_matrix()"` and save it in a variable called `'clf_cm'`.


- Also find out the `classification_report` between `'y_test'` and `'y_pred'` using `"classification_report()"` and save it in a variable called `'clf_cr'`.

- Print and compare the accuracy score, confusion matrix, classification report between `'xgb_model'` and `'clf_model'`.


## Hints

- Use `accuracy_score(y_test,y_pred)` to check the prediction accuracy of the model.

- Use `confusion_matrix(y_test,y_pred)` to check the confusion matrix of the model.

- Use `classification_report(y_test,y_pred)` to check the classification report of the model.
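For a feel of what the grid search does, here is a minimal, self-contained sketch. It runs on synthetic data and swaps in scikit-learn's `GradientBoostingClassifier` for `XGBClassifier` so that it works without the `xgboost` package; the grid is a trimmed version of the notebook's `parameters`, and `cv=5` and `scoring="accuracy"` are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the churn features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Note that range(1, 3) covers max_depth values 1 and 2 only.
param_grid = {"learning_rate": [0.1, 0.2, 0.3], "max_depth": range(1, 3)}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,  # every parameter combination is scored with 5-fold cross-validation
)
search.fit(X_tr, y_tr)

print(search.best_params_)                  # combination with the best mean CV score
print(round(search.best_score_, 3))         # that mean cross-validated accuracy
print(round(search.score(X_te, y_te), 3))   # accuracy of the refitted best model
```

After fitting, `predict()` and `score()` on the search object use the best combination refitted on the full training data, which is why the notebook can call `predict` on the grid search object directly.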



## Test Case

#xgb_score
Variable declaration
np.round(xgb_score,2) == np.round(0.79649787032655,2)

#clf_score
Variable declaration
np.round(clf_score,2)== np.round(0.8017037387600567,2)


In [None]:
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

#Parameter list
parameters={'learning_rate':[0.1,0.15,0.2,0.25,0.3],
            'max_depth':range(1,3)}

# Code starts here
#Initializing the model
xgb_model=XGBClassifier(random_state=0)

#Fitting the model on train data
xgb_model.fit(X_train,y_train)

#Making prediction on test data
y_pred=xgb_model.predict(X_test)

#Finding the accuracy score
xgb_score=accuracy_score(y_test,y_pred)
print("accuracy_score",xgb_score)

#Finding the confusion matrix
xgb_cm=confusion_matrix(y_test,y_pred)
print("confusion_matrix",xgb_cm)

#Finding the classification report
xgb_cr=classification_report(y_test,y_pred)
print("classification_report",xgb_cr)

### GridSearch CV

#Initialising Grid Search
clf_model=GridSearchCV(estimator=xgb_model,param_grid=parameters)

#Fitting the model on train data
clf_model.fit(X_train,y_train)

#Making prediction on test data
y_pred=clf_model.predict(X_test)

#Finding the accuracy score
clf_score=accuracy_score(y_test,y_pred)
print("accuracy_score_with_GS",clf_score)

#Finding the confusion matrix
clf_cm=confusion_matrix(y_test,y_pred)
print("confusion_matrix_with_GS",clf_cm)

#Finding the classification report
clf_cr=classification_report(y_test,y_pred)
print("classification_report_with_GS",clf_cr)

#Code ends here

accuracy_score 0.7908187411263606
confusion_matrix [[1376  184]
 [ 258  295]]
classification_report               precision    recall  f1-score   support

           0       0.84      0.88      0.86      1560
           1       0.62      0.53      0.57       553

    accuracy                           0.79      2113
   macro avg       0.73      0.71      0.72      2113
weighted avg       0.78      0.79      0.79      2113

accuracy_score_with_GS 0.8021769995267393
confusion_matrix_with_GS [[1398  162]
 [ 256  297]]
classification_report_with_GS               precision    recall  f1-score   support

           0       0.85      0.90      0.87      1560
           1       0.65      0.54      0.59       553

    accuracy                           0.80      2113
   macro avg       0.75      0.72      0.73      2113
weighted avg       0.79      0.80      0.80      2113



## Success Message

Congrats! You have successfully applied XGBoost and predicted Churn.