
# Telco Customer Churn

"Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." 

[IBM Sample Data Sets]


https://www.kaggle.com/blastchar/telco-customer-churn

## Import Data

Upload the file from your hard drive


In [4]:
import io

import pandas as pd
from google.colab import files

# Prompts you to select a file: click "Choose Files", pick the dataset, and wait until the upload reaches 100%.
uploaded = files.upload()

# Take the file name from the returned dict rather than hard-coding it,
# because Colab renames re-uploaded files (e.g. 'Telco-Customer-Churn (2).xls').
filename = next(iter(uploaded))

# Despite its .xls extension the file is plain CSV, so read_csv is the right reader.
df2 = pd.read_csv(io.BytesIO(uploaded[filename]))

Saving Telco-Customer-Churn.xls to Telco-Customer-Churn (2).xls


Get the dataset from GitHub

In [15]:
df = pd.read_csv('https://raw.githubusercontent.com/MWFK/Machine-Learning-From-Zero-to-Hero/main/Telco-Customer-Churn.xls')
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)


# KNN

### Preprocessing

In [17]:
# converting TotalCharges column from string to float
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')

# dropping null entries and the customerID column
df = df.dropna()
df = df.drop(columns='customerID')
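
The coercion step above can be illustrated on a tiny frame. This is a minimal sketch with made-up values (the `toy` frame and its contents are assumptions, not rows from the dataset): `errors='coerce'` turns entries that cannot be parsed as numbers into `NaN`, and `dropna()` then removes those rows.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column: numbers stored as strings,
# with one blank entry that cannot be parsed (values are made up).
toy = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors='coerce' replaces unparseable entries (here the blank string) with NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# dropna() removes the rows that could not be converted;
# the identifier column carries no predictive signal and is dropped too
toy_clean = toy.dropna().drop(columns="customerID")

print(n_missing, len(toy_clean))
```

Checking `isna().sum()` before calling `dropna()` is a cheap way to see how many rows the conversion will cost.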

In [18]:
# creating dummy variables to resolve categorical features
# (a list comprehension keeps the column order stable, unlike a set difference)
list_to_remove = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']
categ_feats = [col for col in df.columns if col not in list_to_remove]
df = pd.get_dummies(df, columns=categ_feats)
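
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up three-row frame (the `toy` frame is an assumption for illustration). Each listed categorical column is replaced by one indicator column per distinct value, while the untouched columns keep their place at the front.

```python
import pandas as pd

# Made-up rows with one numeric column, one categorical column, and the target
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})

# Everything except the numeric column and the target is treated as categorical
keep_as_is = ["tenure", "Churn"]
categ = [c for c in toy.columns if c not in keep_as_is]

# One indicator column is created per distinct value of each categorical column
encoded = pd.get_dummies(toy, columns=categ)
print(list(encoded.columns))
```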

In [19]:
# Scale the numeric features
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# pulling out numeric features for normalizing
numeric_feats = ['tenure', 'MonthlyCharges', 'TotalCharges']
df_numeric_feats = df[numeric_feats]
df_categ_feats = df.drop(columns=numeric_feats)

# normalizing numeric features and converting back to dataframe
min_max_scaler = preprocessing.MinMaxScaler()
normalized_numeric_feats = min_max_scaler.fit_transform(df_numeric_feats)
normalized_numeric_feats = pd.DataFrame(normalized_numeric_feats, columns = numeric_feats, index=df_categ_feats.index)

# creating new dataframe with categorical features and the normalized numeric features
df_numeric_norm = pd.concat([df_categ_feats, normalized_numeric_feats], axis=1)

In [20]:
# splitting normalized X data into train and test sets
X_normalized = df_numeric_norm.drop('Churn', axis=1).values
y = df_numeric_norm['Churn'].values
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=21)
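
The cells above fit `MinMaxScaler` on every row before splitting, so the test rows contribute to the minimum and maximum used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data (the arrays `X_toy` and `y_toy` are made up for illustration); it is one possible arrangement, not the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(50, 3))  # made-up numeric features
y_toy = rng.integers(0, 2, size=50)        # made-up binary labels

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows
```

With this ordering the scaled test values can fall slightly outside [0, 1], which is expected: the test set is allowed to contain values the training set never saw.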

### Fine Tuning

In [26]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Normalized KNN classifier: identifying the best k value with GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 30)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X_train, y_train)
print('k-NN best n_neighbors:', knn_cv.best_params_, '\n')

k-NN best n_neighbors: {'n_neighbors': 27} 



### Training

In [29]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=27)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=27, p=2,
                     weights='uniform')
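
Instead of copying the best `n_neighbors` by hand, the fitted search object can supply it. A minimal sketch on synthetic data (`X_toy`, `y_toy`, and `search` are illustrative names, not from the notebook): with the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set and exposes it as `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the example is self-contained
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21
)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 30)}, cv=5)
search.fit(X_tr, y_tr)

# Read the winning value from the search instead of hard-coding it
best_k = search.best_params_["n_neighbors"]

# best_estimator_ is already refit on all of X_tr (refit=True is the default)
best_knn = search.best_estimator_
y_pred = best_knn.predict(X_te)
print("best k:", best_k)
```

This keeps the training cell in sync with the tuning cell if the data or the grid changes later.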

### Prediction

In [30]:
# predicted values
y_pred_knn_test = knn.predict(X_test)
print(y_pred_knn_test)

['Yes' 'No' 'No' ... 'Yes' 'No' 'No']


### Measure model performance

In [31]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print('k-NN test set confusion matrix:')
print(pd.DataFrame(confusion_matrix(y_test, y_pred_knn_test), index=['actual: no churn', 'actual: churn'], columns=['pred: no churn', 'pred: churn']), '\n')

print('k-NN test set classification report:')
print(classification_report(y_test, y_pred_knn_test))

k-NN test set confusion matrix:
                  pred: no churn  pred: churn
actual: no churn             883          124
actual: churn                187          213 

k-NN test set classification report:
              precision    recall  f1-score   support

          No       0.83      0.88      0.85      1007
         Yes       0.63      0.53      0.58       400

    accuracy                           0.78      1407
   macro avg       0.73      0.70      0.71      1407
weighted avg       0.77      0.78      0.77      1407
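
The figures in the `Yes` row of the report follow directly from the confusion matrix. As a worked check, using the four counts printed above:

```python
# Counts taken from the k-NN confusion matrix (rows: actual, columns: predicted)
tn, fp = 883, 124  # actual no churn
fn, tp = 187, 213  # actual churn

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.63 recall=0.53 f1=0.58 accuracy=0.78
```

The recall of 0.53 means that 187 of the 400 customers who actually churned were missed, which matters more for a retention program than the overall accuracy of 0.78.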



# Logistic Regression

In [32]:
# Logistic Regression classifier (L1 regularization)

from sklearn.linear_model import LogisticRegression

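# C is the inverse of the regularization strength: smaller values mean a stronger penalty
param_grid_L1 = {'C': np.arange(.5, 5, .5)}

# liblinear is one of the solvers that supports the L1 penalty
logreg_L1 = LogisticRegression(penalty='l1', solver='liblinear')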
logreg_L1_cv = GridSearchCV(logreg_L1, param_grid_L1, cv=5)
logreg_L1_cv.fit(X_train, y_train)

print('Lasso Reg best C value', logreg_L1_cv.best_params_, '\n')

y_pred_L1_test = logreg_L1_cv.predict(X_test)

print('Lasso Reg test set confusion matrix:')
print(pd.DataFrame(confusion_matrix(y_test, y_pred_L1_test), index=['actual: no churn', 'actual: churn'], columns=['pred: no churn', 'pred: churn']), '\n')
print('Lasso Reg test set classification report:')
print(classification_report(y_test, y_pred_L1_test))

Lasso Reg best C value {'C': 1.0} 

Lasso Reg test set confusion matrix:
                  pred: no churn  pred: churn
actual: no churn             909           98
actual: churn                197          203 

Lasso Reg test set classification report:
              precision    recall  f1-score   support

          No       0.82      0.90      0.86      1007
         Yes       0.67      0.51      0.58       400

    accuracy                           0.79      1407
   macro avg       0.75      0.71      0.72      1407
weighted avg       0.78      0.79      0.78      1407
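
One practical consequence of the L1 penalty is that it can drive some coefficients exactly to zero, and it typically does so more aggressively as `C` shrinks. The sketch below illustrates this on synthetic data (the dataset, the two `C` values, and the names `strong` and `weak` are assumptions chosen for illustration, not the notebook's settings).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only 3 carry signal
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Small C = strong penalty, large C = weak penalty
strong = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_toy, y_toy)
weak = LogisticRegression(penalty="l1", solver="liblinear", C=5.0).fit(X_toy, y_toy)

# Count the coefficients that the penalty set exactly to zero
n_zero_strong = int(np.sum(strong.coef_ == 0))
n_zero_weak = int(np.sum(weak.coef_ == 0))
print("zeroed coefficients:", n_zero_strong, "(C=0.05) vs", n_zero_weak, "(C=5.0)")
```

On the churn data, the same count on `logreg_L1_cv.best_estimator_.coef_` would show which dummy columns the tuned model ignores.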



# Random Forest
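
A minimal sketch of a random forest tuned and evaluated in the same grid-search pattern as the models above. It runs on synthetic stand-in data so that it is self-contained; the parameter grid and the names `rf_cv` and `param_grid_rf` are illustrative assumptions, not settings taken from the notebook. On the real data, `X_tr`, `X_te`, `y_tr`, and `y_te` would be replaced by `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed churn features and 'Yes'/'No' labels
X_toy, y_num = make_classification(n_samples=300, n_features=10, random_state=0)
y_toy = np.where(y_num == 1, "Yes", "No")
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=21
)

# Hypothetical grid: number of trees and maximum tree depth
param_grid_rf = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
rf_cv = GridSearchCV(RandomForestClassifier(random_state=21), param_grid_rf, cv=5)
rf_cv.fit(X_tr, y_tr)
print("Random Forest best params:", rf_cv.best_params_, "\n")

y_pred_rf = rf_cv.predict(X_te)

# Fix the label order so rows and columns match the headers below
cm = confusion_matrix(y_te, y_pred_rf, labels=["No", "Yes"])
print(pd.DataFrame(cm,
                   index=["actual: no churn", "actual: churn"],
                   columns=["pred: no churn", "pred: churn"]), "\n")
print(classification_report(y_te, y_pred_rf))
```

Tree ensembles do not depend on feature scale, so the min-max normalization is harmless here but not required.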


