Customer churn is one of the most important concerns for large companies. Because churn directly affects revenue, especially in the telecom sector, companies are looking for ways to predict which customers are likely to leave. Identifying the factors that increase churn makes it possible to take action to reduce it. The main contribution of your work is a churn prediction model that helps telecom operators identify the customers most likely to churn. Perform the following operations as you create this deep learning application.

1. Using the given dataset, extract the relevant features that can define customer churn.

2. Use your EDA (Exploratory Data Analysis) skills to find out which customer profiles are most strongly associated with churning.

3. Using the features in (1), define and train a Multi-Layer Perceptron model.

4. Evaluate the model's accuracy and calculate the AUC score.

5. Create a platform to host the model, either a web-based or a desktop application.

6. Allow users to enter new data in the application; your model should predict whether the new customer is likely to churn and report the model's confidence.

7. Record a short video to demonstrate how your application works.

8. Create a README.md file to briefly describe your project, functionalities, etc. This should include a link to the video.

Submission:

Create a GitHub Repository named YourID_Churning_Customers

Submit all files: Colab Notebook, Deployment files, README.

In [1]:
import os
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from google.colab import drive
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score


drive.mount('/content/drive')
df=pd.read_csv('/content/drive/My Drive/Colab Notebooks/CustomerChurn_dataset.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Looking for most relevant features; viewing dataset

In [2]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [5]:
df.columns.tolist()

['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']

In [6]:
df.drop(columns=['customerID'], inplace = True)

In [7]:
df.columns.tolist()

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']
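
The assignment's EDA step asks which customer profiles churn the most. One way to approach it is to compare churn rates across the values of a categorical column. The sketch below does this on a small made-up frame; the rows and the helper name `churn_rate_by` are illustrative assumptions, and only the column names `Contract` and `Churn` come from the dataset.

```python
import pandas as pd

# Made-up sample rows for illustration only; not taken from the dataset
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

def churn_rate_by(frame, column):
    """Return the share of churned customers for each value of `column`."""
    churned = frame["Churn"].eq("Yes")
    return churned.groupby(frame[column]).mean().sort_values(ascending=False)

rates = churn_rate_by(sample, "Contract")
print(rates)
```

On the real data, the same helper could be called as `churn_rate_by(df, 'Contract')` at this point in the notebook, while `Churn` still holds `Yes`/`No` values.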

Encoding data

In [8]:
# Split columns by dtype: numeric ones are kept as-is, object ones are label-encoded below
numericVariables = df.select_dtypes(include=['int64','float64'])
categoricalVariables = df.select_dtypes(include=['object'])

In [9]:
categoricalVariables = categoricalVariables.copy()
label_encoders = {}  # one fitted encoder per column, so new inputs can be encoded the same way

for column in categoricalVariables.columns:
    label_encoders[column] = LabelEncoder()
    categoricalVariables[column] = label_encoders[column].fit_transform(df[column])
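
`df.describe()` lists only three numeric columns, and `TotalCharges` ends up in the label-encoded group, which suggests it is stored as text. Label-encoding a continuous amount replaces each distinct string with an integer code that follows string order rather than numeric order. A possible alternative, sketched here on made-up values, is to convert the column with `pd.to_numeric` first. The blank entry and the choice to fill it with 0 are assumptions, not something the notebook shows.

```python
import pandas as pd

# Made-up values; the blank string stands in for a possibly missing amount
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(total_charges, errors="coerce")

# Assumption for this sketch: treat missing totals as 0
numeric = numeric.fillna(0.0)
print(numeric.tolist())
```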


Selecting most relevant features

In [10]:
new_df = pd.concat([numericVariables, categoricalVariables], axis=1)
new_df.columns.tolist()

['SeniorCitizen',
 'tenure',
 'MonthlyCharges',
 'gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']

In [11]:
y = new_df['Churn']
X = new_df.drop('Churn', axis = 1)

In [12]:
X_df =  pd.DataFrame(X)

EDIT: selecting the most relevant features (alternate version using RFECV)

In [17]:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

In [18]:
Xtrain,Xtest,Ytrain,Ytest=train_test_split(X,y,test_size=0.2,random_state=42)
Xtrain.shape

# Create a tree-based model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# RFECV object
rfecv = RFECV(estimator=model, step=1, cv=3, scoring='accuracy')

rfecv.fit(Xtrain, Ytrain)

# selected features
selected_features = Xtrain.columns[rfecv.support_]

In [19]:
optimal_num_features = rfecv.n_features_
support_mask = rfecv.support_
selected_features = X.columns[support_mask]


selected_features

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'gender', 'Partner',
       'Dependents', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'TotalCharges'],
      dtype='object')

Pre-processing

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

In [21]:
scaled=sc.fit_transform(X)
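
Note that `scaled` is fitted on all of `X` and is not passed to the model in the later cells, so the network trains on unscaled inputs. A common pattern is to fit the scaler on the training split only and reuse it for the test split. The sketch below shows this on made-up numbers; the array values and variable names are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values: two columns on very different scales
X_train = np.array([[1.0, 20.0], [3.0, 60.0], [5.0, 100.0]])
X_test = np.array([[3.0, 80.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_test_scaled)
```

Fitting on the training rows alone keeps test-set statistics out of preprocessing, and the same fitted scaler can later be saved alongside the model for deployment.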

In [22]:
scaled.tolist()

[[-0.43991649313097614,
  -1.2774445836787656,
  -1.1603229160349193,
  -1.0095586736769342,
  1.0345302279174904,
  -0.6540119291623984,
  -3.054010391622917,
  0.06272275030724751,
  -1.183233637371519,
  -0.9188377542300901,
  1.242549826454581,
  -1.0279101359737688,
  -0.9252621220729103,
  -1.1134954084787796,
  -1.1214051291685294,
  -0.8282068960390321,
  0.8297975015331794,
  0.39855772409659124,
  -0.3986075939976442],
 [-0.43991649313097614,
  0.06632741908223598,
  -0.2596289419448806,
  0.990531829475428,
  -0.9666223112813243,
  -0.6540119291623984,
  0.3274383095561751,
  -0.9915883008000169,
  -1.183233637371519,
  1.4073212332043186,
  -1.0299192537115092,
  1.2451106136850234,
  -0.9252621220729103,
  -1.1134954084787796,
  -1.1214051291685294,
  0.3712710329765761,
  -1.2051132934870794,
  1.3348626109585968,
  -0.9487623815517401],
 [-0.43991649313097614,
  -1.2367242199587352,
  -0.3626603559551803,
  0.990531829475428,
  -0.9666223112813243,
  -0.6540119291623984,

In [23]:
Xtrain.shape

(5634, 19)

Multi-Layer Perceptron

In [24]:
import tensorflow as tf
from tensorflow import keras

In [25]:
import keras
from keras.models import Model
from keras.layers import Input, Dense
from keras.optimizers import Adam
from keras.utils import to_categorical

In [26]:
!pip install --upgrade tensorflow




In [27]:
# Keras Functional API model
input_layer = Input(shape=(Xtrain.shape[1],))
hidden_layer_1 = Dense(32, activation='relu')(input_layer)
hidden_layer_2 = Dense(24, activation='relu')(hidden_layer_1)
hidden_layer_3 = Dense(12, activation='relu')(hidden_layer_2)
output_layer = Dense(1, activation='sigmoid')(hidden_layer_3)

model = Model(inputs=input_layer, outputs=output_layer)
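
To make the architecture concrete, the sketch below reproduces the same layer sizes (19 inputs, then 32, 24 and 12 units, then 1 sigmoid output) as a plain NumPy forward pass with random weights, and counts the trainable parameters. It illustrates the computation only and is not how Keras implements it; the weights and the input row are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [19, 32, 24, 12, 1]  # input width, three hidden layers, output

# One (weights, bias) pair per Dense layer, randomly initialised
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Apply ReLU on hidden layers and sigmoid on the final layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)          # ReLU
        else:
            x = 1.0 / (1.0 + np.exp(-x))    # sigmoid
    return x

n_params = sum(w.size + b.size for w, b in params)
prob = forward(rng.normal(size=(1, 19)), params)
print(n_params)    # 1745
print(prob.shape)  # (1, 1)
```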

In [28]:
model.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

model.fit(Xtrain, Ytrain, epochs=50, batch_size=32, validation_data=(Xtest, Ytest))

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7b2aac561ff0>

In [29]:
_, accuracy = model.evaluate(Xtrain, Ytrain)
accuracy*100



65.58395624160767

In [30]:
loss, accuracy = model.evaluate(Xtest, Ytest)
print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy*100:.4f}')

Test Loss: 0.6617
Test Accuracy: 65.9333
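
A useful reference point for this accuracy is the majority-class baseline, that is, the accuracy of always predicting "no churn". The sketch below computes it from class counts that appear in this notebook's printed outputs (4139 + 1035 non-churners and 1495 + 374 churners across the stratified train and test splits). The split evaluated here is a different, non-stratified one, so the comparison is approximate.

```python
# Class counts taken from the notebook's printed outputs
# (training class distribution plus test-set support in the classification report)
non_churn = 4139 + 1035
churn = 1495 + 374

total = non_churn + churn
baseline_accuracy = max(non_churn, churn) / total

print(total)                        # 7043
print(round(baseline_accuracy, 4))  # 0.7346
```

At roughly 73%, this baseline sits above the 65.9% test accuracy, which suggests the first network has not yet learned more than the class prior.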


In [31]:
def create_model(optimizer=None, hidden_unit=32):
    # Build the optimizer per call: a default Adam instance in the signature
    # would be created once and shared by every model this function builds.
    if optimizer is None:
        optimizer = Adam(learning_rate=0.0001)

    input_layer = Input(shape=(Xtrain.shape[1],))
    hidden_layer_1 = Dense(hidden_unit, activation='relu')(input_layer)
    hidden_layer_2 = Dense(24, activation='relu')(hidden_layer_1)
    hidden_layer_3 = Dense(12, activation='relu')(hidden_layer_2)
    output_layer = Dense(1, activation='sigmoid')(hidden_layer_3)

    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

    return model

In [32]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import matthews_corrcoef
from imblearn.metrics import geometric_mean_score
from sklearn.model_selection import cross_val_score
from numpy import mean
from numpy import std

In [33]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Split the data into train and test sets while preserving class distribution
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize the RandomOverSampler
oversampler = RandomOverSampler(sampling_strategy='auto', random_state=42)

# Apply random oversampling to the training data
X_train_resampled, y_train_resampled = oversampler.fit_resample(Xtrain, Ytrain)

# Print the original and resampled class distribution
print("Original class distribution:", np.bincount(Ytrain))
print("Resampled class distribution:", np.bincount(y_train_resampled))

Original class distribution: [4139 1495]
Resampled class distribution: [4139 4139]
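
The grid search in the later cells is fitted on `Xtrain` and `Ytrain`, so the resampled arrays are not used there. For readers unfamiliar with the technique, the sketch below shows what random oversampling does, written in plain NumPy on made-up labels. It is an illustration of the idea and not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up imbalanced labels: six of class 0, two of class 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

counts = np.bincount(y)
minority = counts.argmin()
n_extra = counts.max() - counts.min()

# Draw minority rows with replacement until both classes have equal counts
minority_idx = np.flatnonzero(y == minority)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_resampled = np.vstack([X, X[extra_idx]])
y_resampled = np.concatenate([y, y[extra_idx]])
print(np.bincount(y_resampled))  # [6 6]
```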


In [34]:
Ytrain.value_counts()

0    4139
1    1495
Name: Churn, dtype: int64

In [35]:
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [45]:
num_classes=2
epochs=250
batch_size=64

In [None]:
# Wrap the Keras model-building function with scikeras' KerasClassifier
# (`model=` replaces the deprecated `build_fn=` argument)
model = KerasClassifier(model=create_model, epochs=epochs, batch_size=batch_size, verbose=1)

# Define the hyperparameter grid
param_grid = {
    'model__optimizer': ['adam','adadelta','rmsprop'],
    'model__hidden_unit': [32, 64, 128]
}

# Create GridSearchCV instance
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=4, scoring='accuracy')

# Fit the grid search to the data
grid_search.fit(Xtrain, Ytrain)
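
This search is expensive, which is worth estimating before running it. The sketch below counts the model fits using scikit-learn's `ParameterGrid`, with the grid, `cv=4` and `epochs=250` copied from the cells above; the extra fit accounts for `GridSearchCV` refitting the best configuration by default.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'model__optimizer': ['adam', 'adadelta', 'rmsprop'],
    'model__hidden_unit': [32, 64, 128],
}
cv_folds = 4
epochs = 250

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds          # cross-validation fits
total_epochs = (n_fits + 1) * epochs      # +1 for the final refit on all training data

print(n_candidates, n_fits, total_epochs)  # 9 36 9250
```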


In [51]:
best_model = grid_search.best_estimator_
best_model


In [53]:
from sklearn.metrics import classification_report

In [56]:
# predict() returns hard class labels (0/1), so the ROC curve and AUC below
# are computed from labels rather than from predicted probabilities
y_pred = best_model.predict(Xtest)
fpr_mlp, tpr_mlp, _ = metrics.roc_curve(Ytest, y_pred)
auc_mlp = round(metrics.roc_auc_score(Ytest, y_pred), 4)
print("AUC:", auc_mlp)
y_pred = np.round(y_pred).ravel()
print("\nCR by library method=\n",
      classification_report(Ytest, y_pred))

AUC: 0.6761

CR by library method=
               precision    recall  f1-score   support

           0       0.82      0.92      0.87      1035
           1       0.67      0.43      0.52       374

    accuracy                           0.79      1409
   macro avg       0.74      0.68      0.69      1409
weighted avg       0.78      0.79      0.78      1409
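
The AUC of 0.6761 is close to the mean of the two recall values in the report (0.92 and 0.43), which is what `roc_auc_score` yields when it is given hard 0/1 labels instead of scores. The sketch below shows the difference on made-up values. On the fitted classifier, probabilities would typically come from `predict_proba`; that call is an assumption here, since the notebook does not use it.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted churn probabilities
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.45, 0.3, 0.4, 0.9]

# Hard labels at a 0.5 threshold discard the ranking information
y_label = [int(p >= 0.5) for p in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
print(round(auc_from_scores, 4), round(auc_from_labels, 4))
```

The two values differ because AUC measures how well the scores rank churners above non-churners, and thresholding collapses that ranking into two groups.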



Saving the model

In [None]:

import pickle
finalModel_mlp = best_model

# Specify the filename for the pickle file
pickle_filename = '/content/drive/My Drive/Colab Notebooks/finalModel_mlp.pkl'

# Save the model to a pickle file
with open(pickle_filename, 'wb') as file:
    pickle.dump(finalModel_mlp, file)
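
The assignment also asks the application to report a confidence alongside each prediction. One simple convention, sketched below, is to report the probability the model assigns to the class it predicts. The helper name `churn_with_confidence` and the 0.5 threshold are hypothetical choices for illustration, and the function takes a plain probability, so it does not depend on how the saved model is loaded.

```python
def churn_with_confidence(churn_probability, threshold=0.5):
    """Map a churn probability to a (label, confidence) pair.

    Confidence is the probability assigned to the predicted class.
    """
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    churns = churn_probability >= threshold
    confidence = churn_probability if churns else 1.0 - churn_probability
    return ("Churn" if churns else "No churn"), confidence

print(churn_with_confidence(0.82))
print(churn_with_confidence(0.10))
```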