# **Customer Churn Analysis**

### Table of Contents

* [Importing Libraries](#head1)
* [Loading Data](#head2)
* [Checking Null Values](#head3)
* [Checking Imbalanced Data](#head4)
* [Preprocessing](#head5)
* [Separate Features & Target Variable](#head6)
* [Train Test Split](#head7)
* [Categorical & Numeric Features](#head8)
* [Model Pipeline](#head9)
* [Best Hyperparameters for Random Forest Classifier Using RandomizedSearchCV in a Pipeline](#head10)
* [Model Prediction](#head11)
* [Model Evaluation](#head12)

### **Importing Libraries** <a id="head1"></a>

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore')

### **Loading Data** <a id="head2"></a>

In [2]:
telco_customer = pd.read_csv("ChurnTrainDataset.csv")

In [3]:
telco_customer.head()

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,OH,107.0,area_code_415,no,yes,26.0,161.6,123.0,27.47,195.5,103.0,16.62,254.4,103.0,11.45,13.7,3.0,3.7,1.0,no
1,NJ,137.0,area_code_415,no,no,0.0,243.4,114.0,41.38,121.2,110.0,10.3,162.6,104.0,7.32,12.2,5.0,3.29,0.0,no
2,OH,84.0,area_code_408,yes,no,0.0,299.4,71.0,50.9,61.9,88.0,5.26,196.9,89.0,8.86,6.6,7.0,1.78,2.0,no
3,OK,75.0,area_code_415,yes,no,0.0,166.7,113.0,28.34,148.3,122.0,12.61,186.9,121.0,8.41,10.1,3.0,2.73,3.0,no
4,MA,121.0,area_code_510,no,yes,24.0,218.2,88.0,37.09,348.5,108.0,29.62,212.6,118.0,9.57,7.5,7.0,2.03,3.0,no


In [4]:
telco_customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4250 entries, 0 to 4249
Data columns (total 20 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   state                          4232 non-null   object 
 1   account_length                 4216 non-null   float64
 2   area_code                      4234 non-null   object 
 3   international_plan             4250 non-null   object 
 4   voice_mail_plan                4237 non-null   object 
 5   number_vmail_messages          4216 non-null   float64
 6   total_day_minutes              4240 non-null   float64
 7   total_day_calls                4248 non-null   float64
 8   total_day_charge               4242 non-null   float64
 9   total_eve_minutes              4215 non-null   float64
 10  total_eve_calls                4233 non-null   float64
 11  total_eve_charge               4242 non-null   float64
 12  total_night_minutes            4248 non-null   f

### **Checking Null Values** <a id="head3"></a>

In [5]:
telco_customer.isnull().sum()

state                            18
account_length                   34
area_code                        16
international_plan                0
voice_mail_plan                  13
number_vmail_messages            34
total_day_minutes                10
total_day_calls                   2
total_day_charge                  8
total_eve_minutes                35
total_eve_calls                  17
total_eve_charge                  8
total_night_minutes               2
total_night_calls                 5
total_night_charge                7
total_intl_minutes                5
total_intl_calls                 13
total_intl_charge                30
number_customer_service_calls     3
churn                            22
dtype: int64

### **Checking Imbalanced Data** <a id="head4"></a>

In [6]:
# value_counts() sorts by frequency, so position 0 is the majority class
target_count = telco_customer['churn'].value_counts()
print('No Churn:', target_count.iloc[0])
print('Churn:', target_count.iloc[1])

No Churn: 3634
Churn: 594
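
The counts above point to a clear class imbalance. As a quick worked check (plain arithmetic on the printed counts, not a cell from the original notebook), the churn share and the accuracy of an always-"no churn" baseline can be sketched as:

```python
# Counts copied from the value_counts() output above
no_churn = 3634
churn = 594

total = no_churn + churn
churn_share = churn / total
# A model that always predicts "no churn" would reach this accuracy on the same rows
majority_baseline = no_churn / total

print(f"Churn share: {churn_share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is easier to judge against this baseline than against 50%.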


### **Preprocessing** <a id="head5"></a>

In [7]:
# Encoding categorical data using cat codes
# Note: cat.codes encodes missing values (NaN) as -1, so these columns hold no NaN afterwards
for col in telco_customer.columns[telco_customer.dtypes == 'object']:
    if col != 'churn':
        telco_customer[col] = telco_customer[col].astype('category').cat.codes

# Fill Null Values of target column
telco_customer['churn'] = telco_customer['churn'].fillna(telco_customer['churn'].mode()[0])

# Label Encoding of target column
le = LabelEncoder()
telco_customer['churn'] = le.fit_transform(telco_customer['churn'])
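
How `cat.codes` treats missing values can be seen on a toy column. The sketch below uses made-up data (not the churn dataset) and shows one possible way to turn the `-1` code back into `NaN` so that a later imputer still sees the gap. Applying this in the notebook would change the column dtype to `float64`, so the `int8`-based feature selection further down would need to be adapted; treat it as an illustration rather than a drop-in fix.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data for illustration)
s = pd.Series(['OH', 'NJ', np.nan, 'OH'])

# Categories are sorted ('NJ' -> 0, 'OH' -> 1); the missing value becomes -1
codes = s.astype('category').cat.codes
print(codes.tolist())

# Keep the codes where they are valid, put NaN where the code is -1
codes_with_nan = codes.astype('float64').where(codes != -1)
print(codes_with_nan.isnull().sum())
```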

### **Separate Features & Target Variable** <a id="head6"></a>

In [8]:
X = telco_customer.drop('churn', axis=1)
y = telco_customer['churn']

### **Train Test Split**  <a id="head7"></a>

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
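
With an imbalanced target, a plain random split can leave the train and test sets with slightly different churn rates. The notebook does not stratify; the sketch below, on synthetic labels chosen to roughly mimic the class ratio seen earlier (an assumption for illustration), shows what the `stratify` argument of `train_test_split` does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 860 negatives and 140 positives (illustrative, not the real data)
y_demo = np.array([0] * 860 + [1] * 140)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=17, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Both parts end up with the same positive share, which makes test metrics for the minority class a little less dependent on the random seed.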

### **Categorical & Numeric Features**  <a id="head8"></a>

In [10]:
cat_features = X_train.columns[X_train.dtypes == 'int8'].values
num_features = X_train.columns[X_train.dtypes == 'float64'].values

### **Model Pipeline** <a id="head9"></a>

In [11]:
# Imputation Transformer to fill null values
fill_null_col = ColumnTransformer([
        ('FillCat',SimpleImputer(strategy='most_frequent'),cat_features),
        ('FillNumeric',SimpleImputer(strategy='median'),num_features),
    ],remainder='passthrough')

# Scaling
scale_col = ColumnTransformer([
    ('scale', MinMaxScaler(),slice(0,18))
])

# Principal Component Analysis
pca_col = ColumnTransformer([
    ('PCA', PCA(n_components=10),slice(0,18))
])

# Model
classifier = RandomForestClassifier()
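
One detail of the cell above is easy to miss: the imputation step outputs the categorical block followed by the numeric block, 19 columns in total, while the scaling step selects `slice(0, 18)`. Since `ColumnTransformer` drops unselected columns by default, the last column is not passed on. Whether this is intended is not stated in the notebook. The following plain-Python sketch, using the column names shown in the fitted pipeline, identifies which column falls outside the slice:

```python
# Column order after fill_null_col: categorical block first, then numeric block
cat_cols = ['state', 'area_code', 'international_plan', 'voice_mail_plan']
num_cols = ['account_length', 'number_vmail_messages', 'total_day_minutes',
            'total_day_calls', 'total_day_charge', 'total_eve_minutes',
            'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
            'total_night_calls', 'total_night_charge', 'total_intl_minutes',
            'total_intl_calls', 'total_intl_charge',
            'number_customer_service_calls']
imputed_cols = cat_cols + num_cols

# Same slice as in the scaling ColumnTransformer
selected = imputed_cols[slice(0, 18)]
left_out = [c for c in imputed_cols if c not in selected]

print(len(imputed_cols), len(selected), left_out)
```

If all features should reach the model, `slice(0, 19)` in the scaling step (or `remainder='passthrough'`) would be one way to keep the last column.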

In [12]:
pipeline_model = Pipeline([('fill_null_col', fill_null_col),
                 ('scale_col', scale_col),
                 ('pca_col', pca_col),
                 ('classifier', classifier)])

pipeline_model.fit(X_train,y_train)
pipeline_model.score(X_test,y_test)

0.9

In [13]:
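# Imputation values learned when the pipeline was fitted
print(fill_null_col.named_transformers_['FillCat'].statistics_)
print(fill_null_col.named_transformers_['FillNumeric'].statistics_)
# Note: this refits scale_col (the same object used inside pipeline_model) on the raw X_train
print(scale_col.fit_transform(X_train))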

[49.  1.  0.  0.]
[100.     0.   180.4  100.    30.65 200.3  100.    17.02 200.2  100.
   9.01  10.3    4.     2.81   1.  ]
[[0.49019608 0.77056277 1.         ... 0.525      0.2        0.52592593]
 [0.84313725 0.31168831 1.         ... 0.34       0.55       0.34074074]
 [0.7254902  0.37229437 0.66666667 ... 0.67       0.15       0.67037037]
 ...
 [0.25490196 0.55844156 0.66666667 ... 0.615      0.2        0.61481481]
 [0.7254902  0.4978355  1.         ... 0.375      0.2        0.37592593]
 [0.43137255 0.23809524 0.33333333 ... 0.355      0.25       0.35555556]]


In [14]:
pipeline_model.steps

[('fill_null_col',
  ColumnTransformer(remainder='passthrough',
                    transformers=[('FillCat',
                                   SimpleImputer(strategy='most_frequent'),
                                   array(['state', 'area_code', 'international_plan', 'voice_mail_plan'],
        dtype=object)),
                                  ('FillNumeric',
                                   SimpleImputer(strategy='median'),
                                   array(['account_length', 'number_vmail_messages', 'total_day_minutes',
         'total_day_calls', 'total_day_charge', 'total_eve_minutes',
         'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
         'total_night_calls', 'total_night_charge', 'total_intl_minutes',
         'total_intl_calls', 'total_intl_charge',
         'number_customer_service_calls'], dtype=object))])),
 ('scale_col',
  ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 18, None))])),
 ('pca_col',
  ColumnTransformer(t

### **Best Hyperparameters for Random Forest Classifier Using RandomizedSearchCV in a Pipeline** <a id="head10"></a>

In [15]:
params = [{"classifier__criterion": ['gini', 'entropy'],
           # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
           "classifier__max_features": ['sqrt', 'log2'],
           "classifier__n_estimators": range(100, 501, 10),
           "classifier__max_depth": range(10, 30),
           "classifier__min_samples_leaf": [1, 2, 4],
           "classifier__min_samples_split": [2, 5, 10],
           "classifier__bootstrap": [True, False]
          }]

rf_randomized = RandomizedSearchCV(estimator=pipeline_model,
                                   param_distributions=params,
                                   cv=10, verbose=2, n_jobs=-1)

rf_randomized.fit(X_train,y_train)
rf_randomized.best_params_

Fitting 10 folds for each of 10 candidates, totalling 100 fits


{'classifier__n_estimators': 420,
 'classifier__min_samples_split': 5,
 'classifier__min_samples_leaf': 1,
 'classifier__max_features': 'sqrt',
 'classifier__max_depth': 15,
 'classifier__criterion': 'entropy',
 'classifier__bootstrap': True}
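
The log line "10 candidates" reflects the default `n_iter=10` of `RandomizedSearchCV`. To put that in perspective, the size of the full grid can be counted from the search space defined above. This is a back-of-the-envelope sketch, not part of the original notebook:

```python
from math import prod

# Number of options per hyperparameter, read off the search space above
options = {
    "criterion": 2,
    "max_features": 2,
    "n_estimators": len(range(100, 501, 10)),
    "max_depth": len(range(10, 30)),
    "min_samples_leaf": 3,
    "min_samples_split": 3,
    "bootstrap": 2,
}

total_combinations = prod(options.values())
sampled = 10  # candidates evaluated, as reported in the fit log

print(total_combinations)
print(f"Share of the grid sampled: {sampled / total_combinations:.4%}")
```

Only a tiny fraction of the combinations is tried, so the reported best parameters are best among ten random draws rather than a global optimum; raising `n_iter` or fixing `random_state` would make the search broader and repeatable.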

In [16]:
print(rf_randomized.best_score_)

0.9097058823529413


### **Model Prediction** <a id="head11"></a>

In [17]:
prediction = rf_randomized.predict(X_test)
prediction

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,

### **Model Evaluation** <a id="head12"></a>

In [18]:
conf_matrix = confusion_matrix(y_test, prediction)
print("confusion matrix")
print(conf_matrix)
print(classification_report(y_test,prediction))

confusion matrix
[[714  10]
 [ 75  51]]
              precision    recall  f1-score   support

           0       0.90      0.99      0.94       724
           1       0.84      0.40      0.55       126

    accuracy                           0.90       850
   macro avg       0.87      0.70      0.74       850
weighted avg       0.89      0.90      0.88       850
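
The class-1 (churn) row of the report can be reproduced by hand from the confusion matrix, which helps in reading it: accuracy is high, but fewer than half of the actual churners are found. A minimal worked check using the four printed counts:

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
tn, fp = 714, 10
fn, tp = 75, 51

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

The values match the report (0.84, 0.40, 0.55, 0.90). For a churn use case, the recall of 0.40 is usually the number to watch, since 75 of the 126 churners in the test set are missed.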

