# Machine Learning: Classification

### 3.1 Churn Prediction

* Data from https://www.kaggle.com/blastchar/telco-customer-churn

### 3.2 Data preparation

* Download the data, read it with pandas
* Look at the data
* Make column names and values look uniform
* Check if all the columns read correctly
* Check if the churn variable needs any preparation

In [95]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split

In [96]:
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [97]:
# transpose the DataFrame to see the entire record of a single entity
df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [98]:
# make the data uniform by processing inconsistencies

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in categorical_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')


In [99]:
# verify that the inconsistencies in the dataset have been removed

df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


In [100]:
# peruse the data types

df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

In [101]:
# zoom in on "totalcharges": its dtype is object, which suggests it mixes numbers with non-numeric strings (whitespace, underscores, etc.)
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [102]:
# attempt to convert "totalcharges" to numbers throws an error
# pd.to_numeric(df.totalcharges)

# ValueError: Unable to parse string "_" at position 488

In [103]:
# errors='coerce' forces pandas to replace values it can't parse with NaN

tc = pd.to_numeric(df.totalcharges, errors='coerce')

In [104]:
# check for missing values in total charges
tc.isnull().sum()


np.int64(11)
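
The effect of `errors='coerce'` can be seen on a small, made-up Series (the values below are illustrative, not rows from the dataset). Anything that cannot be parsed becomes `NaN`, which `isnull()` then counts:

```python
import pandas as pd

# Toy data (hypothetical values): two parseable strings and one "_" placeholder
s = pd.Series(['29.85', '_', '108.15'])

# errors='coerce' turns unparseable entries into NaN instead of raising ValueError
converted = pd.to_numeric(s, errors='coerce')

# Count the entries that could not be parsed
n_missing = int(converted.isnull().sum())

print(converted)
print(n_missing)
```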

In [105]:
# zoom in on the rows where total charges is missing, showing only a few columns

df[tc.isnull()][['customerid', 'totalcharges']]

Unnamed: 0,customerid,totalcharges
488,4472-lvygi,_
753,3115-czmzd,_
936,5709-lvoeq,_
1082,4367-nuyao,_
1340,1371-dwpaz,_
3331,7644-omvmy,_
3826,3213-vvolg,_
4380,2520-sgtta,_
5218,2923-arzlg,_
6670,4075-wkniu,_


In [106]:
# convert the column in place; values pandas can't parse become NaN

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')

In [107]:
# fill the missing values in total charges with zeros

df.totalcharges = df.totalcharges.fillna(0)

In [108]:
# confirm the rows that had missing total charges are now filled with zeros

df[tc.isnull()][['customerid', 'totalcharges']]

Unnamed: 0,customerid,totalcharges
488,4472-lvygi,0.0
753,3115-czmzd,0.0
936,5709-lvoeq,0.0
1082,4367-nuyao,0.0
1340,1371-dwpaz,0.0
3331,7644-omvmy,0.0
3826,3213-vvolg,0.0
4380,2520-sgtta,0.0
5218,2923-arzlg,0.0
6670,4075-wkniu,0.0


In [109]:
# examine the churn variable
df.churn

0        no
1        no
2       yes
3        no
4       yes
       ... 
7038     no
7039     no
7040     no
7041    yes
7042     no
Name: churn, Length: 7043, dtype: object

In [110]:
# compare against 'yes' to turn yes/no into True/False; preview the first 5 values
(df.churn == 'yes').head()

0    False
1    False
2     True
3    False
4     True
Name: churn, dtype: bool

In [111]:
# convert the True/False churn variables to 1's and 0's
(df.churn == 'yes').astype('int').head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

In [112]:
# write the 1's and 0's back into the churn column
df.churn = (df.churn == 'yes').astype('int')
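
One practical reason for the 0/1 encoding is that the mean of the column is then the fraction of customers who churned. A minimal sketch on made-up values (not taken from the dataset):

```python
import pandas as pd

# Toy target column (hypothetical values)
churn = pd.Series(['no', 'yes', 'no', 'yes'])

# The comparison yields booleans; astype('int') maps True -> 1 and False -> 0
churn_int = (churn == 'yes').astype('int')

# With a 0/1 column, the mean equals the share of 1's
churn_rate = churn_int.mean()

print(churn_int.tolist())
print(churn_rate)
```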

### 3.3 Setting up the validation framework

* Perform the train/validation/test split with Scikit-Learn

In [113]:
# train_test_split was imported above: from sklearn.model_selection import train_test_split

# review the train_test_split documentation
# to decide what value to pass for the "test_size" parameter
train_test_split?


Signature:
train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
Docstring:
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation,
``next(ShuffleSplit().split(X, y))``, and application to input data
into a single call for splitting (and optionally subsampling) data into a
one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with same length / shape[0]
    Allowed inputs are lists, numpy arrays, scipy-sparse

In [114]:
# specify 20% for the "test_size" parameter
# NOTE: train_test_split splits the data into two parts only: training and testing

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [115]:
# peruse the sizes of the dataset: df_full_train, df_test

len(df_full_train), len(df_test)

(5634, 1409)

In [116]:
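# NOTE: since df_test holds 20% of the data, df_full_train contains the remaining 80%
# df_full_train is further split into train (75% of df_full_train) and val (25% of df_full_train)
# 25% of 80% is 20% of the original data, so the final split is 60% / 20% / 20%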

df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [117]:
# confirm the sizes of the dataset: df_train, df_val and df_test
len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
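
The sizes above follow from simple arithmetic. The sketch below reproduces them without scikit-learn, under the assumption that a fractional `test_size` is rounded up to a whole number of rows (an assumption about the library's rounding that happens to match the counts printed above):

```python
import math

n_total = 7043  # number of rows in the dataset, as reported above

# First split: hold out 20% for testing (assumed to round up)
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# Second split: hold out 25% of the remainder for validation
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_full_train, n_test)    # 5634 1409
print(n_train, n_val, n_test)  # 4225 1409 1409

# Validation share of the whole dataset: roughly 20%
val_share = n_val / n_total
print(round(val_share, 3))
```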

In [118]:
# the data for all 3 datasets are shuffled. Also notice "churn" is still included and index is not sequential.
df_train.head().T

Unnamed: 0,3897,1980,6302,727,5104
customerid,8015-ihcgw,1960-uycnn,9250-wypll,6786-obwqr,1328-euzhc
gender,female,male,female,female,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,yes,yes
dependents,yes,no,no,yes,no
tenure,72,10,5,5,18
phoneservice,yes,yes,yes,yes,yes
multiplelines,yes,yes,yes,no,no
internetservice,fiber_optic,fiber_optic,fiber_optic,fiber_optic,no
onlinesecurity,yes,no,no,no,no_internet_service


In [119]:
# reset the index for all 3 datasets. This has no impact on the performance.

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [120]:
# confirm the index is now sequential
df_train

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,...,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,8015-ihcgw,female,0,yes,yes,72,yes,yes,fiber_optic,yes,...,yes,yes,yes,yes,two_year,yes,electronic_check,115.50,8425.15,0
1,1960-uycnn,male,0,no,no,10,yes,yes,fiber_optic,no,...,yes,no,no,yes,month-to-month,yes,electronic_check,95.25,1021.55,0
2,9250-wypll,female,0,no,no,5,yes,yes,fiber_optic,no,...,no,no,no,no,month-to-month,no,electronic_check,75.55,413.65,1
3,6786-obwqr,female,0,yes,yes,5,yes,no,fiber_optic,no,...,no,no,yes,no,month-to-month,yes,electronic_check,80.85,356.10,0
4,1328-euzhc,female,0,yes,no,18,yes,no,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,20.10,370.50,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4220,1309-xgfsn,male,1,yes,yes,52,yes,yes,dsl,no,...,yes,no,yes,yes,one_year,yes,electronic_check,80.85,4079.55,0
4221,4819-hjpiw,male,0,no,no,18,no,no_phone_service,dsl,no,...,no,no,no,no,month-to-month,no,mailed_check,25.15,476.80,0
4222,3703-vavcl,male,0,yes,yes,2,yes,no,fiber_optic,no,...,yes,yes,no,yes,month-to-month,no,credit_card_(automatic),90.00,190.05,1
4223,3812-lrzir,female,0,yes,yes,27,yes,yes,no,no_internet_service,...,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,electronic_check,24.50,761.95,0


In [121]:
# extract the churn target from each dataset as a NumPy array

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

In [122]:
# preview the data

y_train, y_val, y_test

(array([0, 0, 1, ..., 1, 0, 1]),
 array([0, 0, 0, ..., 0, 1, 1]),
 array([0, 0, 0, ..., 0, 0, 1]))

In [123]:
# delete the churn column from all 3 datasets so the target can't accidentally be used as a feature during training (target leakage)

del df_train['churn']
del df_val['churn']
del df_test['churn']
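
A compact alternative, shown here only as a sketch on a made-up frame: `DataFrame.pop` removes a column and returns it in one step, so extracting the target and deleting it cannot get out of sync.

```python
import pandas as pd

# Toy frame (hypothetical values) with one feature and the target
toy = pd.DataFrame({'tenure': [1, 34, 2], 'churn': [0, 0, 1]})

# pop() returns the column and drops it from the frame in place
y_toy = toy.pop('churn').values

print(y_toy)
print(list(toy.columns))
```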

In [125]:
# confirm deletion of "churn": the column is no longer included, and the index is sequential

df_train.head().T # churn non-inclusion can also be confirmed with df_train.dtypes



Unnamed: 0,0,1,2,3,4
customerid,8015-ihcgw,1960-uycnn,9250-wypll,6786-obwqr,1328-euzhc
gender,female,male,female,female,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,yes,yes
dependents,yes,no,no,yes,no
tenure,72,10,5,5,18
phoneservice,yes,yes,yes,yes,yes
multiplelines,yes,yes,yes,no,no
internetservice,fiber_optic,fiber_optic,fiber_optic,fiber_optic,no
onlinesecurity,yes,no,no,no,no_internet_service


### 3.4 Exploratory Data Analysis (EDA)

* Check missing values
* Look at the target variable (churn)
* Look at numerical and categorical variables
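
The three checks listed above could look roughly like this. The frame below is a small made-up stand-in (column names borrowed from the dataset, values hypothetical), so the numbers say nothing about the real data:

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values)
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45],
    'contract': ['month-to-month', 'one_year', 'month-to-month', 'two_year'],
    'churn': [1, 0, 1, 0],
})

# 1. Missing values per column
missing_per_column = toy.isnull().sum()

# 2. Distribution of the target as shares rather than counts
churn_share = toy.churn.value_counts(normalize=True)

# 3. Number of distinct values in a categorical column
n_unique = toy[['contract']].nunique()

print(missing_per_column)
print(churn_share)
print(n_unique)
```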

### 3.5 Feature importance: Churn rate and risk ratio

**Feature importance analysis (part of EDA) - identifying which features affect our target variable**

* Churn rate
* Risk ratio
* Mutual information - Later 
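
As a minimal sketch of the first two measures, assume that the churn rate of a group is the mean of the 0/1 churn column within that group, and that the risk ratio is the group's churn rate divided by the global churn rate (this section does not define either term, so both definitions are assumptions). The data below is made up for illustration:

```python
import pandas as pd

# Toy data (hypothetical values): one categorical feature and the 0/1 target
toy = pd.DataFrame({
    'partner': ['yes', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes'],
    'churn':   [0, 0, 1, 1, 0, 0, 1, 1],
})

# Global churn rate: share of churned customers overall
global_churn = toy.churn.mean()

# Churn rate within each group of the feature
group_churn = toy.groupby('partner').churn.mean()

# Risk ratio (assumed definition): group rate relative to the global rate
risk_ratio = group_churn / global_churn

print(global_churn)
print(group_churn)
print(risk_ratio)
```

Under these definitions, a risk ratio above 1 marks a group that churns more often than average, and a ratio below 1 marks a group that churns less often.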