# Telco_Customer_Churn_Classification

#### INTRODUCTION

1) Telco: a telephone company, that is, a provider of telecommunications services such as telephony and data communications.

2) Customer churn rate is the percentage of a company's total customers who stop doing business with the company over a specified period of time.

3) In this dataset, the `Churn` column marks customers who left within the last month. When evaluated alongside other key customer retention metrics, churn rate is a powerful way to assess what a brand is doing well and where it needs to improve.
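
As a small illustration of the definition in item 2, the churn rate can be computed as the number of customers lost divided by the number of customers at the start of the period. This is only a sketch: the function name `churn_rate` and the figures are made up for the example and do not come from the dataset.

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Return the churn rate as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start


# Hypothetical figures: 1,000 customers at the start of the month, 50 of them left.
rate = churn_rate(50, 1000)
print(f"Monthly churn rate: {rate:.1f}%")
```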


#### PROBLEM STATEMENT

The company is losing customers. If we can predict whether a customer is likely to churn, we can take the necessary steps to retain them.

#### OBJECTIVE

1) Understand why customers are leaving.

2) Predict churn behavior in order to retain customers.

3) Analyze all relevant customer data and develop focused customer retention programs.

### Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

### Importing Dataset 

In [3]:
data = pd.read_csv('Telco-Customer-Churn.csv')

In [4]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Data Preprocessing

In [6]:
data.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

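`isna()` reports no missing values in the dataset. Note that blank strings are not counted as missing, which matters for the TotalCharges column handled later.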
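
A minimal sketch of why `isna()` can report zero while a text column still hides empty entries. The toy frame below uses made-up rows (it is not the Telco file), and the assumption is that the hidden entries are whitespace-only strings.

```python
import pandas as pd

# Toy frame with made-up values: one TotalCharges entry is a blank string.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# isna() finds nothing, because " " is a valid (non-null) string.
n_missing = int(toy["TotalCharges"].isna().sum())

# Stripping whitespace and comparing with "" exposes the blank entries.
n_blank = int(toy["TotalCharges"].str.strip().eq("").sum())

print(n_missing, n_blank)
```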

In [5]:
data.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [6]:
data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

Many of the columns have the object data type, so we have to label-encode them, that is, convert them to numeric values.

To label the data, we first convert the object columns to the category data type.

In [7]:
list_obj_cols = data.columns[data.dtypes == "object"].tolist()
for obj_cols in list_obj_cols:
    data[obj_cols] = data[obj_cols].astype("category")

In [8]:
list_obj_cols = data.columns[data.dtypes == "category"].tolist()

In [9]:
list_obj_cols

['customerID',
 'gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']

In [10]:
from sklearn.preprocessing import LabelEncoder

In [11]:
categ =['gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'Churn']

le = LabelEncoder()
data[categ] = data[categ].apply(le.fit_transform)
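
To see which integer each label receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a toy series with two Contract values from the table above; `le_demo` and `mapping` are illustrative names. One caveat, stated as an observation rather than a fix: because `apply` refits the same encoder on every column, the shared `le` object only keeps the classes of the last column it saw.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy series using two Contract values that appear in the data preview.
contract = pd.Series(["Month-to-month", "One year", "Month-to-month", "One year"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(contract)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}

print(mapping)
print(codes.tolist())
```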

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   customerID        7043 non-null   category
 1   gender            7043 non-null   int32   
 2   SeniorCitizen     7043 non-null   int64   
 3   Partner           7043 non-null   int32   
 4   Dependents        7043 non-null   int32   
 5   tenure            7043 non-null   int64   
 6   PhoneService      7043 non-null   int32   
 7   MultipleLines     7043 non-null   int32   
 8   InternetService   7043 non-null   int32   
 9   OnlineSecurity    7043 non-null   int32   
 10  OnlineBackup      7043 non-null   int32   
 11  DeviceProtection  7043 non-null   int32   
 12  TechSupport       7043 non-null   int32   
 13  StreamingTV       7043 non-null   int32   
 14  StreamingMovies   7043 non-null   int32   
 15  Contract          7043 non-null   int32   
 16  PaperlessBilling  7043 n

Now we have to convert the TotalCharges column into a numeric data type.

TotalCharges has some missing values. We could fill them with zeros, but I will use the mean, because it does not make sense that a customer pays nothing.

In [13]:
data.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [14]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

If we look closely, we can see that all customers with a null TotalCharges value have a tenure of 0 in the dataset.

It is reasonable to presume that these are new clients or customers on a trial period. Rather than dropping these rows, I will fill their TotalCharges with the column mean, as stated above.

In [15]:
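# Assign the result back instead of using a chained inplace fillna, which newer pandas versions warn about
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].mean())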
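
The conversion and imputation steps can be sketched on a toy frame. The rows are made up, and both fill strategies discussed in the text (mean and zero) are shown side by side so the difference is visible; which one is appropriate is a modeling choice, not something this sketch decides.

```python
import pandas as pd

# Toy frame with made-up values: the tenure-0 customer has a blank TotalCharges.
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors="coerce" turns unparseable strings (here the blank) into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Check which tenure values the NaN rows have.
tenure_of_missing = toy.loc[toy["TotalCharges"].isna(), "tenure"].tolist()

# Two imputation options: column mean (used in this notebook) or zero.
filled_mean = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())
filled_zero = toy["TotalCharges"].fillna(0)

print(tenure_of_missing)
print(filled_mean.tolist())
print(filled_zero.tolist())
```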

In [16]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,29.85,0
1,5575-GNVDE,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1889.5,0
2,3668-QPYBK,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,108.15,1
3,7795-CFOCW,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.3,1840.75,0
4,9237-HQITU,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.7,151.65,1


### DATA VISUALIZATION

a) Demographics

### SPLITTING THE DATASET INTO TRAIN AND TEST SET

In [18]:
df_dummies = pd.get_dummies(data)
df_dummies.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,customerID_9975-SKRNR,customerID_9978-HYCIN,customerID_9979-RGMZT,customerID_9985-MWVIX,customerID_9986-BONCE,customerID_9987-LUTYD,customerID_9992-RRAMN,customerID_9992-UJOEL,customerID_9993-LHIEB,customerID_9995-HOTOH
0,0,0,1,0,1,0,1,0,0,2,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,34,1,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,2,1,0,0,2,2,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,45,0,1,0,2,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,2,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
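
The output above contains one `customerID_...` column per customer, because `get_dummies` one-hot encodes every categorical column it finds, including the identifier. A possible alternative, shown here on a toy frame and not what the notebook does, is to drop the ID column before encoding.

```python
import pandas as pd

# Toy frame: customerID is categorical, as it is after the astype("category") step.
toy = pd.DataFrame({
    "customerID": pd.Categorical(["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"]),
    "tenure": [1, 34, 2],
    "Churn": [0, 0, 1],
})

# Encoding with the ID creates one extra column per distinct customer.
with_id = pd.get_dummies(toy)

# Dropping the ID first keeps only the real features.
without_id = pd.get_dummies(toy.drop(columns="customerID"))

print(with_id.shape, without_id.shape)
```

An identifier that is unique per row carries no information that generalizes to new customers, so a model can only memorize it.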


In [19]:
# Target vector: the encoded Churn labels
y = data.Churn.values

In [20]:
# Feature matrix: every column except the target
X = data.drop(columns='Churn')

In [21]:
print(y)

[0 0 1 ... 0 1 0]


In [22]:
print(X.shape)

(7043, 20)


In [23]:
data.drop(columns='customerID',inplace=True)

In [24]:
data.head(15)

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,0,0,1,0,1,0,1,0,0,2,0,0,0,0,0,1,2,29.85,29.85,0
1,1,0,0,0,34,1,0,0,2,0,2,0,0,0,1,0,3,56.95,1889.5,0
2,1,0,0,0,2,1,0,0,2,2,0,0,0,0,0,1,3,53.85,108.15,1
3,1,0,0,0,45,0,1,0,2,0,2,2,0,0,1,0,0,42.3,1840.75,0
4,0,0,0,0,2,1,0,1,0,0,0,0,0,0,0,1,2,70.7,151.65,1
5,0,0,0,0,8,1,2,1,0,0,2,0,2,2,0,1,2,99.65,820.5,1
6,1,0,0,1,22,1,2,1,0,2,0,0,2,0,0,1,1,89.1,1949.4,0
7,0,0,0,0,10,0,1,0,2,0,0,0,0,0,0,0,3,29.75,301.9,0
8,0,0,1,0,28,1,2,1,0,0,2,2,2,2,0,1,2,104.8,3046.05,1
9,1,0,0,1,62,1,0,0,2,2,0,0,0,0,1,0,0,56.15,3487.95,0


In [18]:
data.corr()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
gender,1.0,-0.001874,-0.001808,0.010517,0.005106,-0.006488,-0.006739,-0.000863,-0.015017,-0.012057,0.000549,-0.006825,-0.006421,-0.008743,0.000126,-0.011754,0.017352,-0.014569,4.8e-05,-0.008612
SeniorCitizen,-0.001874,1.0,0.016479,-0.211185,0.016567,0.008576,0.146185,-0.03231,-0.128221,-0.013632,-0.021398,-0.151268,0.030776,0.047266,-0.142554,0.15653,-0.038551,0.220173,0.102395,0.150889
Partner,-0.001808,0.016479,1.0,0.452676,0.379697,0.017706,0.14241,0.000891,0.150828,0.15313,0.16633,0.126733,0.137341,0.129574,0.294806,-0.014877,-0.154798,0.096848,0.318812,-0.150448
Dependents,0.010517,-0.211185,0.452676,1.0,0.159712,-0.001762,-0.024991,0.04459,0.152166,0.091015,0.080537,0.133524,0.046885,0.021321,0.243187,-0.111377,-0.040292,-0.11389,0.064535,-0.164221
tenure,0.005106,0.016567,0.379697,0.159712,1.0,0.008448,0.343032,-0.030359,0.325468,0.370876,0.371105,0.322942,0.289373,0.296866,0.671607,0.006152,-0.370436,0.2479,0.824757,-0.352229
PhoneService,-0.006488,0.008576,0.017706,-0.001762,0.008448,1.0,-0.020538,0.387436,-0.015198,0.024105,0.003727,-0.019158,0.055353,0.04387,0.002247,0.016505,-0.004184,0.247398,0.112851,0.011942
MultipleLines,-0.006739,0.146185,0.14241,-0.024991,0.343032,-0.020538,1.0,-0.109216,0.007141,0.117327,0.122318,0.011466,0.175059,0.180957,0.110842,0.165146,-0.176793,0.433576,0.452883,0.038037
InternetService,-0.000863,-0.03231,0.000891,0.04459,-0.030359,0.387436,-0.109216,1.0,-0.028416,0.036138,0.044944,-0.026047,0.107417,0.09835,0.099721,-0.138625,0.08614,-0.32326,-0.175429,-0.047291
OnlineSecurity,-0.015017,-0.128221,0.150828,0.152166,0.325468,-0.015198,0.007141,-0.028416,1.0,0.185126,0.175985,0.285028,0.044669,0.055954,0.374416,-0.157641,-0.096726,-0.053878,0.254308,-0.289309
OnlineBackup,-0.012057,-0.013632,0.15313,0.091015,0.370876,0.024105,0.117327,0.036138,0.185126,1.0,0.187757,0.195748,0.147186,0.136722,0.28098,-0.01337,-0.124847,0.119777,0.375362,-0.195525
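
Since the target is what matters here, it can be easier to read only the `Churn` column of the correlation matrix, sorted. The sketch below uses the first six rows of `tenure`, `MonthlyCharges` and `Churn` from the preview above as a toy frame, so the numbers it prints are not the full-dataset correlations.

```python
import pandas as pd

# First six rows of three columns, copied from the data preview above.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
    "Churn": [0, 0, 1, 0, 1, 1],
})

# Correlation of each feature with the target, most negative first.
churn_corr = toy.corr()["Churn"].drop("Churn").sort_values()

print(churn_corr)
```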


### DATA MODELING

In [25]:
# Logistic Regression

In [26]:
y = df_dummies.Churn.values
X = df_dummies.drop(columns = ['Churn'])

from sklearn.preprocessing import MinMaxScaler
features = X.columns.values
scaler = MinMaxScaler(feature_range=(0,1))

X = pd.DataFrame(scaler.fit_transform(X))
X.columns = features
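
For reference, min-max scaling to the range (0, 1) maps each value to (x - min) / (max - min). A short NumPy sketch with a few `tenure` values, including the column minimum 0 and maximum 72 reported by `describe()`, shows the arithmetic; the array is a hand-picked sample, not the full column.

```python
import numpy as np

# A few tenure values, including the column minimum (0) and maximum (72).
tenure = np.array([0, 1, 34, 45, 72], dtype=float)

# Min-max scaling: subtract the minimum, divide by the range.
scaled = (tenure - tenure.min()) / (tenure.max() - tenure.min())

print(np.round(scaled, 6))
```

The scaled values for tenure 1, 34 and 45 correspond to the 0.013889, 0.472222 and 0.625000 entries in the scaled `X` output.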

In [27]:
y

array([0, 0, 1, ..., 0, 1, 0])

In [28]:
X

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,customerID_9975-SKRNR,customerID_9978-HYCIN,customerID_9979-RGMZT,customerID_9985-MWVIX,customerID_9986-BONCE,customerID_9987-LUTYD,customerID_9992-RRAMN,customerID_9992-UJOEL,customerID_9993-LHIEB,customerID_9995-HOTOH
0,0.0,0.0,1.0,0.0,0.013889,0.0,0.5,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.472222,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.027778,1.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.625000,0.0,0.5,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.027778,1.0,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,1.0,0.0,1.0,1.0,0.333333,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7039,0.0,0.0,1.0,1.0,1.000000,1.0,1.0,0.5,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7040,0.0,0.0,1.0,1.0,0.152778,0.0,0.5,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7041,1.0,1.0,1.0,0.0,0.055556,1.0,1.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
data[['tenure','MonthlyCharges','TotalCharges']].describe()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0
mean,32.371149,64.761692,2283.300441
std,24.559481,30.090047,2265.000258
min,0.0,18.25,18.8
25%,9.0,35.5,402.225
50%,29.0,70.35,1400.55
75%,55.0,89.85,3786.6
max,72.0,118.75,8684.8


### LOGISTIC REGRESSION

In [40]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()


In [41]:
model.fit(x_train, y_train)

In [42]:
from sklearn import metrics
pred_lg = model.predict(x_test)
score = metrics.accuracy_score(y_test, pred_lg)  # argument order is (y_true, y_pred)
score*100

78.85024840312278

In [49]:
pred_lg_train = model.predict(x_train)
score = metrics.accuracy_score(y_train, pred_lg_train)  # argument order is (y_true, y_pred)
score*100

88.1966631167909
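
The training accuracy (88.20%) is noticeably higher than the test accuracy (78.85%), which is worth tracking for every model. The sketch below shows one way to report both numbers on a reproducible, stratified split. It runs on synthetic data from `make_classification`, not on the Telco data, and `train_test_accuracy` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.73, 0.27], random_state=0
)

# stratify keeps the class ratio equal in both splits; random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)


def train_test_accuracy(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (train accuracy, test accuracy) in percent."""
    model.fit(X_tr, y_tr)
    train_acc = 100 * accuracy_score(y_tr, model.predict(X_tr))
    test_acc = 100 * accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc


train_acc, test_acc = train_test_accuracy(
    LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te
)
print(f"train={train_acc:.2f}%  test={test_acc:.2f}%  gap={train_acc - test_acc:.2f}")
```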

### RANDOM FOREST CLASSIFIER

In [34]:
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
# max_features="sqrt" is what the deprecated "auto" option meant for classifiers
model_rf = RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1,
                                  random_state=50, max_features="sqrt",
                                  max_leaf_nodes=30)
model_rf.fit(X_train, y_train)
# Make predictions
predict_rfc = model_rf.predict(X_test)
print(metrics.accuracy_score(y_test, predict_rfc)*100)

73.31440738112136


In [50]:
predict_rfc_train = model_rf.predict(X_train)
print (metrics.accuracy_score(y_train, predict_rfc_train)*100)

72.98544550940717
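
The forest above is built with `oob_score = True`, but the out-of-bag estimate is never read. A minimal sketch of how it can be inspected through the fitted model's `oob_score_` attribute, again on synthetic data and with a smaller forest so it runs quickly (`rf_demo` is an illustrative name):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the Telco dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True scores each sample using only the trees that did not see it in training.
rf_demo = RandomForestClassifier(
    n_estimators=100, oob_score=True, max_leaf_nodes=30, random_state=50
)
rf_demo.fit(X_demo, y_demo)

# Accuracy estimated from out-of-bag samples, as a fraction between 0 and 1.
oob_accuracy = rf_demo.oob_score_
print(f"OOB accuracy: {oob_accuracy * 100:.2f}%")
```

The out-of-bag score gives a validation-style estimate without touching the test set.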


### XGBOOST

In [35]:
from xgboost import XGBClassifier
# Use a dedicated name so later cells that reassign `model` cannot replace this classifier
model_xgb = XGBClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
model_xgb.fit(X_train, y_train)
preds_xg = model_xgb.predict(X_test)
metrics.accuracy_score(y_test, preds_xg)*100

78.28246983676365

In [52]:
preds_xg_train = model_xgb.predict(X_train)
metrics.accuracy_score(y_train, preds_xg_train)*100

63.791267305644304

### SUPPORT VECTOR CLASSIFIER

In [36]:
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
# Keep the SVC in its own variable instead of attaching it as an attribute of another model
model_svm = SVC(kernel='linear')
model_svm.fit(X_train, y_train)
preds_svm = model_svm.predict(X_test)
metrics.accuracy_score(y_test, preds_svm)*100

79.77288857345636

In [59]:
preds_svm_train = model_svm.predict(X_train)
metrics.accuracy_score(y_train, preds_svm_train)*100

98.66879659211928

### ADA BOOST CLASSIFIER

In [37]:
# AdaBoost algorithm
from sklearn.ensemble import AdaBoostClassifier
# Defaults: n_estimators=50; the base learner is a depth-1 decision tree (a decision stump)
model_ab = AdaBoostClassifier()
model_ab.fit(X_train, y_train)
preds_ab = model_ab.predict(X_test)
metrics.accuracy_score(y_test, preds_ab)*100

80.1277501774308

In [58]:
# Separate variable for training-set predictions so the test predictions are not overwritten
preds_ab_train = model_ab.predict(X_train)
metrics.accuracy_score(y_train, preds_ab_train)*100

86.47497337593184
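
To compare the five models at a glance, the accuracies printed in the cells above can be collected in one place. The sketch below only restates those reported numbers (rounded to two decimals) and computes the train minus test gap; it does not retrain anything.

```python
# Accuracies in percent, copied from the outputs above and rounded to two decimals.
reported = {
    "Logistic Regression": {"test": 78.85, "train": 88.20},
    "Random Forest": {"test": 73.31, "train": 72.99},
    "XGBoost": {"test": 78.28, "train": 63.79},
    "SVC (linear)": {"test": 79.77, "train": 98.67},
    "AdaBoost": {"test": 80.13, "train": 86.47},
}

# Positive gap: the model scores better on data it was trained on.
gaps = {name: round(acc["train"] - acc["test"], 2) for name, acc in reported.items()}

best_test = max(reported, key=lambda name: reported[name]["test"])
largest_gap = max(gaps, key=gaps.get)

for name, acc in reported.items():
    print(f"{name:<20} test={acc['test']:6.2f}  train={acc['train']:6.2f}  gap={gaps[name]:+6.2f}")

print("Best test accuracy:", best_test)
print("Largest train/test gap:", largest_gap)
```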