# **Customer Churn Prediction**
by Okeke

Customer churn is the discontinuation of service by a provider's subscribers or customers. The availability of many service providers gives customers the opportunity to switch between them. In such a highly competitive market, customer retention requires deliberate strategies, and an accurate forecast helps an organization identify the customers most likely to leave, so that it can focus its retention policies on these "high risk" clients. In this project, I will build a model to predict the probability of customers leaving a telecommunications provider using a publicly available telco dataset.

In [1]:
# Loading the relevant libraries
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN
from sklearn.ensemble import RandomForestClassifier

In [2]:
#Load data
Teleco_df = pd.read_csv('Telco_churn.csv')
Teleco_df.head(10)

Unnamed: 0.1,Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,0,29.85,29.85,0,1,0,0,1,1,...,0,0,1,0,1,0,0,0,0,0
1,1,0,56.95,1889.5,0,0,1,1,0,1,...,0,0,0,1,0,0,1,0,0,0
2,2,0,53.85,108.15,1,0,1,1,0,1,...,0,0,0,1,1,0,0,0,0,0
3,3,0,42.3,1840.75,0,0,1,1,0,1,...,1,0,0,0,0,0,0,1,0,0
4,4,0,70.7,151.65,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,0
5,5,0,99.65,820.5,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,0
6,6,0,89.1,1949.4,0,0,1,1,0,0,...,0,1,0,0,0,1,0,0,0,0
7,7,0,29.75,301.9,0,1,0,1,0,1,...,0,0,0,1,1,0,0,0,0,0
8,8,0,104.8,3046.05,1,1,0,0,1,1,...,0,0,1,0,0,0,1,0,0,0
9,9,0,56.15,3487.95,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,1


In [3]:
#Checking data statistics
Teleco_df.describe()

Unnamed: 0.1,Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
count,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,...,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0
mean,3521.562144,0.1624,64.798208,2283.300441,0.265785,0.495307,0.504693,0.517491,0.482509,0.701507,...,0.219283,0.216297,0.33632,0.2281,0.3093,0.14562,0.118316,0.108362,0.118316,0.200085
std,2032.832448,0.368844,30.085974,2266.771362,0.441782,0.500014,0.500014,0.499729,0.499729,0.457629,...,0.41379,0.411748,0.472483,0.419637,0.462238,0.35275,0.323005,0.310859,0.323005,0.400092
min,0.0,0.0,18.25,18.8,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1762.75,0.0,35.5875,401.45,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3521.5,0.0,70.35,1397.475,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5282.25,0.0,89.8625,3794.7375,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,7042.0,1.0,118.75,8684.8,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [4]:
#Checking the data structure (columns, non-null counts, dtypes)
Teleco_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 52 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Unnamed: 0                               7032 non-null   int64  
 1   SeniorCitizen                            7032 non-null   int64  
 2   MonthlyCharges                           7032 non-null   float64
 3   TotalCharges                             7032 non-null   float64
 4   Churn                                    7032 non-null   int64  
 5   gender_Female                            7032 non-null   int64  
 6   gender_Male                              7032 non-null   int64  
 7   Partner_No                               7032 non-null   int64  
 8   Partner_Yes                              7032 non-null   int64  
 9   Dependents_No                            7032 non-null   int64  
 10  Dependents_Yes                           7032 no

In [5]:
#Checking the data shape
Teleco_df.shape

(7032, 52)

In [6]:
#Checking duplicates
Teleco_df.duplicated().sum()

0

In [7]:
#Dropping the 'Unnamed: 0' column
Teleco_df=Teleco_df.drop('Unnamed: 0',axis=1)
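
A column named `Unnamed: 0` usually appears when a DataFrame was saved together with its index and then read back without declaring it. As a minimal sketch, assuming that is how `Telco_churn.csv` was produced (the file's origin is not stated here), passing `index_col=0` to `pd.read_csv` avoids the extra column at load time. The tiny in-memory CSV and the names `csv_text`, `raw_df` and `clean_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for Telco_churn.csv: the leading empty header
# is what pandas names "Unnamed: 0" when a saved index is read back.
csv_text = ",SeniorCitizen,MonthlyCharges,Churn\n0,0,29.85,0\n1,0,56.95,0\n2,0,53.85,1\n"

# Default read: the saved index comes back as an ordinary column.
raw_df = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index makes the later drop() unnecessary.
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw_df.columns))
print(list(clean_df.columns))
```

Either approach gives the same feature set; dropping the column explicitly, as done above, works equally well.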

In [8]:
#Dropping the Churn column, as it will be used later as the label (target)
Teleco_data = Teleco_df.drop('Churn',axis=1)
Teleco_data.head()

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,29.85,29.85,1,0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,0,0
1,0,56.95,1889.5,0,1,1,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,53.85,108.15,0,1,1,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0,42.3,1840.75,0,1,1,0,1,0,1,...,1,0,0,0,0,0,0,1,0,0
4,0,70.7,151.65,1,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0


In [9]:
#Setting the Churn column as the label (target)
Target = Teleco_df['Churn']
Target.head(5)

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64

In [10]:
#Splitting the data into train and test samples
x_train,x_test,y_train,y_test=train_test_split(Teleco_data, Target, test_size=0.2)
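
The split above uses neither `random_state` nor `stratify`, so the exact scores change from run to run and the churn rate may differ slightly between the two samples. The sketch below shows both options of `train_test_split` on a synthetic stand-in for `Teleco_data` and `Target`; the data and the seed value 42 are arbitrary assumptions, not part of the original notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in for Teleco_data / Target: 1000 rows, 25% churners.
features = [[i] for i in range(1000)]
labels = [1] * 250 + [0] * 750

# random_state fixes the shuffle; stratify keeps the churn rate the
# same in the train and test samples.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
```

With stratification, both samples keep the 25% churn rate of the full synthetic set.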


# Training Model with Decision Tree Classifier without Data Resampling

In [11]:
#Let's use the decision tree algorithm to train the classifier
model_telco = DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)
model_telco.fit(x_train,y_train)

DecisionTreeClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

In [12]:
#Let's view the trained model performance score using the reserved test samples
model_telco.score(x_test,y_test)

0.7803837953091685

In [13]:
#Let's make a prediction using the reserved test samples
telco_pred = model_telco.predict(x_test)
telco_pred 

array([0, 0, 0, ..., 1, 0, 0])
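
Since the stated goal is the probability of a customer leaving, `predict_proba` may be more useful than the hard 0/1 labels returned by `predict`. The sketch below uses synthetic data from `make_classification` in place of the telco features; the 0.3 threshold and the names `churn_proba` and `flagged` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the telco features and churn labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", random_state=100,
                              max_depth=6, min_samples_leaf=8)
tree.fit(X_demo, y_demo)

# Column order follows tree.classes_, so column 1 is P(class == 1).
churn_proba = tree.predict_proba(X_demo)[:, 1]

# predict() returns the more probable class; a lower, hand-picked
# threshold flags more customers as being at risk.
flagged = churn_proba >= 0.3

print(churn_proba[:5], flagged.sum())
```

Lowering the threshold trades precision for recall on the churn class, which may suit a retention campaign where missing a churner is costlier than a false alarm.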

In [14]:
#Let's view the trained model's classification report
print(classification_report(y_test, telco_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      1033
           1       0.60      0.50      0.55       374

    accuracy                           0.78      1407
   macro avg       0.72      0.69      0.70      1407
weighted avg       0.77      0.78      0.77      1407
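
To put the 0.78 accuracy in context, it helps to compare it with a trivial model that always predicts "no churn". The arithmetic below uses only the support counts from the report above; the variable names are illustrative.

```python
# Class supports taken from the classification report above.
support_stay = 1033   # class 0 (did not churn)
support_churn = 374   # class 1 (churned)
total = support_stay + support_churn

# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = support_stay / total

# A recall of 0.50 on class 1 means roughly half of the churners are found.
# The count is approximate because the reported recall is rounded.
approx_churners_found = round(0.50 * support_churn)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"churners found: about {approx_churners_found} of {support_churn}")
```

The decision tree therefore improves on the majority-class baseline by only a few percentage points, which is why recall on the churn class is the more informative figure here.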



# Training Model with Decision Tree Classifier and Data Resampling
As can be observed from the results above (the accuracy and the classification report), the performance of the trained model is too low for production: recall on the churn class is only 0.50. A likely cause is the class imbalance in the data, where only about 27% of customers churn. Thus, I will use the SMOTEENN technique (SMOTE oversampling followed by Edited Nearest Neighbours cleaning) to manage the imbalance and boost the performance of the model.


In [15]:
#Using SMOTEENN to resample the data
from imblearn.combine import SMOTEENN
telco_sm = SMOTEENN(random_state=42)
X_resampled, y_resampled = telco_sm.fit_resample(Teleco_data,Target)

In [16]:
#Splitting the resampled data into train and test samples
xxrr_train, xxrr_test, yyrr_train, yyrr_test = train_test_split(X_resampled, y_resampled, test_size=0.2)

In [17]:
#Training the model with the resampled data 
model_telco_smote = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=6, min_samples_leaf=8)
model_telco_smote.fit(xxrr_train, yyrr_train)
yres_predict = model_telco_smote.predict(xxrr_test)
model_score_res = model_telco_smote.score(xxrr_test, yyrr_test)
print(model_score_res)
print(metrics.classification_report(yyrr_test, yres_predict))

0.9415254237288135
              precision    recall  f1-score   support

           0       0.93      0.94      0.94       531
           1       0.95      0.94      0.95       649

    accuracy                           0.94      1180
   macro avg       0.94      0.94      0.94      1180
weighted avg       0.94      0.94      0.94      1180



In [18]:
print(metrics.confusion_matrix(yyrr_test, yres_predict))

[[498  33]
 [ 36 613]]
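
The figures in the classification report can be recomputed by hand from this confusion matrix, where rows are the true classes and columns the predicted ones. The short calculation below does this for the churn class (class 1); the variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns predictions.
tn, fp = 498, 33
fn, tp = 36, 613

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_churn = tp / (tp + fp)   # of predicted churners, how many churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were found
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print(f"accuracy={accuracy:.4f} precision={precision_churn:.2f} "
      f"recall={recall_churn:.2f} f1={f1_churn:.2f}")
```

The results agree with the accuracy and the class 1 row of the report above.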


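As can be observed from the results obtained above, the accuracy, precision, recall and F1 scores greatly improved compared to when the unresampled data was used. Note that the test samples here are drawn from the resampled data, so these scores are not directly comparable with those measured on the original test samples.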

# Training the Model with Random Forest Classifier without Data Resampling

In [19]:
#Training the random forest without data resampling
telco_model_rf = RandomForestClassifier(n_estimators=100, criterion='gini', random_state = 100,max_depth=6, min_samples_leaf=8)
telco_model_rf.fit(x_train,y_train)

RandomForestClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

In [21]:
#Print out the accuracy
telco_model_rf.score(x_test,y_test)

0.7924662402274343

In [20]:
#Making predictions with the random forest model and printing the classification report
y_telco_pred = telco_model_rf.predict(x_test)
print(classification_report(y_test, y_telco_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87      1033
           1       0.67      0.44      0.53       374

    accuracy                           0.79      1407
   macro avg       0.74      0.68      0.70      1407
weighted avg       0.78      0.79      0.78      1407



# Training Model with Random Forest Classifier and Data Resampling

In [23]:
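#Training a classifier with the resampled data
#NOTE: this cell instantiates DecisionTreeClassifier, not RandomForestClassifier, so the results below match cell 17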
model_telco_smote_rf = DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=6, min_samples_leaf=8)
model_telco_smote_rf.fit(xxrr_train, yyrr_train)
yrf_predict = model_telco_smote_rf.predict(xxrr_test)
model_score_rf = model_telco_smote_rf.score(xxrr_test, yyrr_test)
print(model_score_rf)
print(metrics.classification_report(yyrr_test, yrf_predict))

0.9415254237288135
              precision    recall  f1-score   support

           0       0.93      0.94      0.94       531
           1       0.95      0.94      0.95       649

    accuracy                           0.94      1180
   macro avg       0.94      0.94      0.94      1180
weighted avg       0.94      0.94      0.94      1180



In [24]:
print(metrics.confusion_matrix(yyrr_test, yrf_predict))

[[498  33]
 [ 36 613]]


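The scores above are identical to the resampled decision tree results (accuracy of about 0.94), because the cell instantiates a DecisionTreeClassifier rather than a RandomForestClassifier. They therefore show the effect of the resampling strategy, not of the random forest.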
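
The heading of this section refers to a random forest, while the cell above instantiates `DecisionTreeClassifier`. A minimal sketch of the random forest variant follows, reusing the hyperparameters of the earlier `RandomForestClassifier` cell. It runs on balanced synthetic data standing in for `X_resampled` and `y_resampled`, so its score says nothing about the telco data; all names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Balanced synthetic stand-in for X_resampled / y_resampled.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same hyperparameters as the earlier random forest cell.
forest = RandomForestClassifier(n_estimators=100, criterion="gini",
                                random_state=100, max_depth=6,
                                min_samples_leaf=8)
forest.fit(X_tr, y_tr)
forest_score = forest.score(X_te, y_te)

print(type(forest).__name__, forest_score)
```

Swapping the class name in the notebook cell in the same way would be needed before the two resampled models can be compared.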

In [25]:
#Saving the trained random forest (fitted on the original, unresampled data) to a pickle file
import pickle
filename = 'telco_rf_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(telco_model_rf, f)

In [26]:
#Load the pickle file and score the loaded model on the resampled test samples
with open(filename, 'rb') as f:
    telco_load_model = pickle.load(f)
telco_model_score_rf = telco_load_model.score(xxrr_test, yyrr_test)
telco_model_score_rf

0.7576271186440678
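
The score of about 0.76 is lower than the 0.94 seen earlier because the pickled object is `telco_model_rf`, which was trained on the original data, while the test samples here come from the resampled data. A quick way to confirm that pickling itself changes nothing is a round-trip check, sketched below on synthetic data with a temporary file; the names and the small forest size are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the telco data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=20, random_state=100).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.sav")
    with open(model_path, "wb") as f:
        pickle.dump(forest, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# A restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X_demo) == forest.predict(X_demo)).all())

print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.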

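In this project, I used two machine learning models, a decision tree and a random forest, to predict whether customers will churn from a telco service provider. On the original data the models reached about 78% to 79% accuracy, and the decision tree reached about 94% on the SMOTEENN-resampled data. The models could be improved further using techniques such as PCA.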
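
As a rough illustration of the PCA idea mentioned above, the sketch below chains scaling, PCA and a random forest in a scikit-learn pipeline. It runs on synthetic data, and the 90% explained-variance target is an arbitrary assumption; whether PCA actually helps on the telco features would need to be tested.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the telco features and churn labels.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A float n_components keeps just enough components to explain that
# share of the variance (90% here).
pca_forest = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9),
    RandomForestClassifier(n_estimators=100, random_state=100),
)
pca_forest.fit(X_tr, y_tr)

n_kept = pca_forest.named_steps["pca"].n_components_
pca_score = pca_forest.score(X_te, y_te)

print(n_kept, pca_score)
```

Wrapping the steps in a pipeline ensures the scaler and PCA are fitted on the training samples only.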