## Problem Statement :
In the telecom industry, customers are able to choose from a pool of companies to cater their needs regarding communication and internet. Customers are very critical about the kind of services they receive and judge the entire company based on a single experience! These communication services have become so recurrent and inseparable from the daily routine that a 30 minute maintenance break kicks in anxiety in the users highlighting our taken-for-granted attitude towards these services! Coupled with high customer acquisition costs, churn analysis becomes very pivotal! Churn rate is a metric that describes the number of customers that cancelled or did not renew their subscription with the company. Thus, higher the churn rate, more customers stop buying from your business, directly affecting the revenue! Hence, based on the insights gained from the churn analysis, companies can build strategies, target segments, improve the quality of the services being provided to improve the customer experience, thus cultivating trust with the customers. That is why building predictive models and creating reports of churn analysis becomes key that paves the way for growth!

## Aim :
To classify the potential churn customers based on numerical and categorical features.
It is a binary classification problem for an imbalanced dataset.


## Dataset Attributes :
customerID : Customer ID
gender : Whether the customer is a male or a female
SeniorCitizen : Whether the customer is a senior citizen or not (1, 0)
Partner : Whether the customer has a partner or not (Yes, No)
Dependents : Whether the customer has dependents or not (Yes, No)
tenure : Number of months the customer has stayed with the company
PhoneService : Whether the customer has a phone service or not (Yes, No)
MultipleLines : Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService : Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity : Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup : Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection : Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport : Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV : Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies : Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract : The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling : Whether the customer has paperless billing or not (Yes, No)
PaymentMethod : The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges : The amount charged to the customer monthly
TotalCharges : The total amount charged to the customer
Churn : Whether the customer churned or not (Yes or No)

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
os.getcwd()
sns.set()
import warnings
warnings.filterwarnings('ignore')

In [2]:
data=pd.read_csv("/content/Telcom Data.csv")
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
data.shape

(7043, 21)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
data["Churn"].value_counts()

No     5174
Yes    1869
Name: Churn, dtype: int64

In [6]:
data["Churn"]=data["Churn"].replace({"Yes":1,"No":0})

In [7]:
data.columns


Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [8]:
for i in data.columns:
  print("************",i,"*****************")
  print()
  print(set(data[i].tolist()))
  print()


************ customerID *****************

{'5982-XMDEX', '3694-GLTJM', '5917-RYRMG', '7730-IUTDZ', '2560-WBWXF', '2523-EWWZL', '4686-GEFRM', '0831-JNISG', '4077-HWUYD', '7564-GHCVB', '0020-INWCK', '4737-AQCPU', '0004-TLHLJ', '4822-LPTYJ', '6959-UWKHF', '1794-HBQTJ', '1096-ADRUX', '8824-RWFXJ', '4955-VCWBI', '2080-SRCDE', '5356-RHIPP', '2674-MLXMN', '0348-SDKOL', '7080-TNUWP', '8155-IBNHG', '5337-IIWKZ', '6281-FKEWS', '3658-KIBGF', '3092-IGHWF', '4317-VTEOA', '4192-GORJT', '4998-IKFSE', '1092-GANHU', '2719-BDAQO', '4918-FYJNT', '5404-GGUKR', '8595-SIZNC', '6127-IYJOZ', '5514-YQENT', '4193-ORFCL', '5639-NTUPK', '8152-VETUR', '3235-ETOOB', '0042-JVWOJ', '5498-IBWPI', '2761-XECQW', '6008-NAIXK', '7954-MLBUN', '3668-QPYBK', '2514-GINMM', '5698-BQJOH', '2227-JRSJX', '4735-BJKOU', '9548-LERKT', '3780-YVMFA', '3932-CMDTD', '8809-XKHMD', '8532-UEFWH', '0947-IDHRQ', '8069-YQQAJ', '0266-GMEAO', '1575-KRZZE', '2043-WVTQJ', '9278-VZKCD', '4118-CEVPF', '2700-LUEVA', '5889-JTMUL', '9494-BDNNC', '559

In [9]:
# Merge No and No phone service in mulitple lines,online security, online backup, device protection,tech support, streaming
# Total charges to be conerted to numericl

In [10]:
# PyCaret is a simple, easy to learn, low-code machine learning library in Python. With PyCaret, you spend less time coding and more time on analysis.

# 
# Exploratory Data Analysis
# 
# Data Preprocessing
# I
# Model Training
# 
# Model Explainability
# 
# MLOps

In [11]:
!pip install pycaret



In [12]:
from pycaret.classification import *

In [20]:
train_data = data.sample(frac = 0.8,random_state = 101).reset_index(drop=True)
test_data = data.drop(train_data.index).reset_index(drop=True)

In [22]:
print(train_data.shape)
print(test_data.shape)

(5634, 21)
(1409, 21)


In [23]:
exp_clf = setup(data=train_data,target = "Churn",pca =True, pca_components=0.95,session_id = 125)

Unnamed: 0,Description,Value
0,Session id,125
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(5634, 21)"
4,Transformed data shape,"(5634, 3)"
5,Transformed train set shape,"(3943, 3)"
6,Transformed test set shape,"(1691, 3)"
7,Ordinal features,5
8,Numeric features,3
9,Categorical features,17


In [None]:
# Comparing all model

In [24]:
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ridge,Ridge Classifier,0.7925,0.0,0.3866,0.7058,0.4981,0.3814,0.4095,0.463
lda,Linear Discriminant Analysis,0.7882,0.8115,0.4388,0.6576,0.5249,0.3958,0.41,0.409
gbc,Gradient Boosting Classifier,0.786,0.8158,0.4065,0.664,0.5025,0.3765,0.3957,0.614
lr,Logistic Regression,0.7857,0.812,0.4502,0.6443,0.5288,0.3958,0.4072,0.544
ada,Ada Boost Classifier,0.7824,0.8063,0.4093,0.6581,0.4986,0.3702,0.3904,0.694
qda,Quadratic Discriminant Analysis,0.7725,0.8083,0.5243,0.5859,0.5519,0.4004,0.4024,0.445
nb,Naive Bayes,0.7718,0.8068,0.5167,0.5854,0.5474,0.3959,0.3982,0.319
lightgbm,Light Gradient Boosting Machine,0.7677,0.7946,0.4359,0.5886,0.4997,0.3531,0.3605,0.46
knn,K Neighbors Classifier,0.7634,0.7641,0.4653,0.5721,0.5125,0.3585,0.3623,0.385
xgboost,Extreme Gradient Boosting,0.7598,0.7862,0.4435,0.5658,0.4962,0.3418,0.3467,0.363


Processing:   0%|          | 0/65 [00:00<?, ?it/s]

In [None]:
# create a model

In [25]:
gbc=create_model('gbc')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.762,0.7938,0.4434,0.5732,0.5,0.3472,0.3521
1,0.7671,0.7974,0.3396,0.6207,0.439,0.3076,0.3299
2,0.8076,0.8115,0.4434,0.7344,0.5529,0.4397,0.4625
3,0.797,0.8431,0.4286,0.6923,0.5294,0.409,0.4281
4,0.7716,0.7818,0.3048,0.6531,0.4156,0.2962,0.3295
5,0.7919,0.7809,0.4381,0.6667,0.5287,0.4024,0.417
6,0.7919,0.8265,0.4381,0.6667,0.5287,0.4024,0.417
7,0.802,0.8749,0.4095,0.7288,0.5244,0.4116,0.4388
8,0.7868,0.8208,0.4,0.6667,0.5,0.3751,0.3949
9,0.7817,0.8277,0.419,0.6377,0.5057,0.3733,0.3868


Processing:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
# Hyper parameter Tuning

In [26]:
tuned_gbc = tune_model(gbc)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.7671,0.7963,0.4151,0.5946,0.4889,0.3442,0.3535
1,0.7747,0.8057,0.3302,0.6604,0.4403,0.3183,0.3483
2,0.8051,0.7965,0.3868,0.7736,0.5157,0.4102,0.4489
3,0.797,0.8377,0.3905,0.7193,0.5062,0.3922,0.4212
4,0.7665,0.7842,0.2476,0.6667,0.3611,0.2533,0.3
5,0.7893,0.7786,0.381,0.6897,0.4908,0.3716,0.3976
6,0.7944,0.8243,0.4095,0.6935,0.515,0.3953,0.4174
7,0.8046,0.8765,0.4,0.75,0.5217,0.4129,0.4451
8,0.797,0.8187,0.381,0.7273,0.5,0.3878,0.4198
9,0.8046,0.841,0.4667,0.7,0.56,0.4408,0.4557


Processing:   0%|          | 0/7 [00:00<?, ?it/s]

Fitting 10 folds for each of 10 candidates, totalling 100 fits


In [27]:
# Evaluate Model

evaluate_model(tuned_gbc)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

In [None]:
# predict Model

In [30]:
unseen_pred = predict_model(gbc, data=test_data)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,0.7984,0.8343,0.4357,0.7064,0.539,0.4191,0.4392


In [31]:
unseen_pred

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
0,2320-JRSDE,Female,0,Yes,Yes,1,Yes,No,No,No internet service,...,No internet service,No internet service,Month-to-month,Yes,Electronic check,19.900000,19.9,1,0,0.7301
1,2087-QAREY,Female,0,Yes,No,22,Yes,No,DSL,No,...,No,No,Month-to-month,Yes,Mailed check,54.700001,1178.75,0,0,0.8375
2,0601-WZHJF,Male,0,Yes,No,14,No,No phone service,DSL,No,...,Yes,Yes,Month-to-month,No,Electronic check,46.349998,667.7,1,0,0.7400
3,4423-JWZJN,Male,0,Yes,Yes,64,Yes,Yes,Fiber optic,No,...,No,Yes,One year,No,Credit card (automatic),90.250000,5629.15,0,0,0.9302
4,5143-WMWOG,Male,0,No,No,1,Yes,No,No,No internet service,...,No internet service,No internet service,Month-to-month,No,Electronic check,19.950001,19.95,1,0,0.7301
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1404,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,One year,Yes,Mailed check,84.800003,1990.5,0,0,0.6645
1405,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,Yes,One year,Yes,Credit card (automatic),103.199997,7362.9,0,0,0.9078
1406,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,Month-to-month,Yes,Electronic check,29.600000,346.45,0,0,0.8368
1407,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,Month-to-month,Yes,Mailed check,74.400002,306.6,1,1,0.6953


In [32]:
# Save Model

save_model(gbc,"gbc_model")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['SeniorCitizen', 'tenure',
                                              'MonthlyCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose='deprecated'))),
                 ('categorical_imputer',
                  TransformerWrapper(e...
                                             criterion='friedman_mse',

# New section