# Classification

## Data Preparation

- Download the data, read it with pandas
- Look at the data
- Make column names and values look uniform
- Check if all columns read correctly
- Check if the churn variable needs any preparation

In [1]:
# import libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [2]:
# Show all columns when displaying a DataFrame, regardless of how many there are
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [3]:
# load the dataset
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
# normalize column names and string values: lowercase, with spaces replaced by underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")

categorical_columns = list(df.dtypes[df.dtypes == "object"].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(" ", "_")

In [5]:
df.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,7590-vhveg,female,0,yes,no,1,no,no_phone_service,dsl,no,yes,no,no,no,no,month-to-month,yes,electronic_check,29.85,29.85,no
1,5575-gnvde,male,0,no,no,34,yes,no,dsl,yes,no,yes,no,no,no,one_year,no,mailed_check,56.95,1889.5,no
2,3668-qpybk,male,0,no,no,2,yes,no,dsl,yes,yes,no,no,no,no,month-to-month,yes,mailed_check,53.85,108.15,yes
3,7795-cfocw,male,0,no,no,45,no,no_phone_service,dsl,yes,no,yes,yes,no,no,one_year,no,bank_transfer_(automatic),42.3,1840.75,no
4,9237-hqitu,female,0,no,no,2,yes,no,fiber_optic,no,no,no,no,no,no,month-to-month,yes,electronic_check,70.7,151.65,yes


In [6]:
# totalcharges holds numbers, but a few entries are not valid numbers, so pandas reads the column as object.
# errors='coerce' makes pd.to_numeric turn those unparseable entries into NaN instead of raising an error
tc = pd.to_numeric(df.totalcharges, errors='coerce')

In [7]:
# converting the totalcharges column to numeric values
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')

In [8]:
# filling the null values in totalcharges column with zeros
df.totalcharges = df.totalcharges.fillna(0)
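The effect of `errors='coerce'` followed by `fillna(0)` can be seen on a small example. This is a minimal sketch on a hypothetical toy series, not the real data; it assumes the unparseable entries are blank placeholders, which the earlier space-to-underscore replacement would have turned into `"_"`.

```python
import pandas as pd

# Hypothetical toy column: two valid numbers and one placeholder (assumed, not taken from the dataset)
raw = pd.Series(["29.85", "_", "1889.5"])

# "_" cannot be parsed as a number, so it becomes NaN instead of raising an error
parsed = pd.to_numeric(raw, errors="coerce")

# count how many entries failed to parse
n_missing = int(parsed.isnull().sum())

# replace the NaN values with zeros, as done for totalcharges
filled = parsed.fillna(0)

print(n_missing)
print(filled.tolist())
```

Counting the NaN values before filling them is a cheap way to see how many rows are affected by the replacement.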

In [9]:
# check the churn variable
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

In [10]:
# churn values are categorical (yes, no); convert them to integers (1 for yes, 0 for no)
df.churn = (df.churn == "yes").astype(int)

## Setting up the Validation Framework

In [11]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

In [12]:
# split the dataset into full train and test
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [13]:
# check the lengths of both full_train and test sets
len(df_full_train), len(df_test)

(5634, 1409)

In [14]:
# now split full_train into train and validation sets
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [15]:
# check the lengths of train, validation and test sets
len(df_train), len(df_val), len(df_test)

(4225, 1409, 1409)
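The second split uses `test_size=0.25` rather than 0.2 because it is taken from the full train set, which is already only 80% of the data: 25% of 80% is 20% of the total. The arithmetic below is a rough sketch of that reasoning; rounding up with `math.ceil` is an assumption chosen to reproduce the lengths shown above.

```python
import math

# total number of rows, from the full train and test lengths shown above
n_total = 5634 + 1409

# first split: 20% of all rows go to the test set
n_test = math.ceil(0.2 * n_total)
n_full_train = n_total - n_test

# second split: 25% of the full train set goes to validation
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

# share of the whole dataset that ends up in the validation set
val_share = n_val / n_total

print(n_train, n_val, n_test)
print(round(val_share, 2))
```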

In [16]:
# reset index of all datasets
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [17]:
# create target variables from all datasets
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

In [18]:
# delete the target variable from all sets so that it is not accidentally used as X variable
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

## EDA

- Check missing values
- Look at the target variable (churn)
- Look at numerical and categorical variables

In [19]:
df_full_train = df_full_train.reset_index(drop=True)

In [20]:
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [21]:
df_full_train.churn.value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [22]:
df_full_train.churn.value_counts(normalize=True)

0    0.730032
1    0.269968
Name: churn, dtype: float64

The <b>churn rate</b> is 0.269968, or about 27%.

In [25]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

0.27

Taking the mean also gives the churn rate: because churn is encoded as 0 and 1, the sum counts the churned customers (the <b>ones</b>), and dividing by the total number of customers n gives the fraction who churned.
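A small check of this equivalence, as a sketch on a hypothetical toy series plus the class counts reported by `value_counts()` above:

```python
import pandas as pd

# Hypothetical toy target: 2 churned customers out of 5
churn = pd.Series([0, 1, 0, 0, 1])

# mean of a 0/1 variable
rate_from_mean = churn.mean()

# number of ones divided by the number of rows
rate_from_counts = (churn == 1).sum() / len(churn)

# the same calculation with the counts shown above (4113 zeros, 1521 ones)
rate_full_train = 1521 / (4113 + 1521)

print(rate_from_mean, rate_from_counts)
print(round(rate_full_train, 6))
```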

In [27]:
numerical = ["tenure", "monthlycharges", "totalcharges"]

In [28]:
df_full_train.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [31]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

In [32]:
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## Feature importance: Churn Rate and Risk Ratio

Feature importance analysis (part of EDA) - identifying which features affect our target variable
 - Churn rate
 - Risk ratio
 - Mutual Information - later

#### Churn rate

In [33]:
df_full_train.head()

Unnamed: 0,customerid,gender,seniorcitizen,partner,dependents,tenure,phoneservice,multiplelines,internetservice,onlinesecurity,onlinebackup,deviceprotection,techsupport,streamingtv,streamingmovies,contract,paperlessbilling,paymentmethod,monthlycharges,totalcharges,churn
0,5442-pptjy,male,0,yes,yes,12,yes,no,no,no_internet_service,no_internet_service,no_internet_service,no_internet_service,no_internet_service,no_internet_service,two_year,no,mailed_check,19.7,258.35,0
1,6261-rcvns,female,0,no,no,42,yes,no,dsl,yes,yes,yes,yes,no,yes,one_year,no,credit_card_(automatic),73.9,3160.55,1
2,2176-osjuv,male,0,yes,no,71,yes,yes,dsl,yes,yes,no,yes,no,no,two_year,no,bank_transfer_(automatic),65.15,4681.75,0
3,6161-erdgd,male,0,yes,yes,71,yes,yes,dsl,yes,no,yes,yes,yes,yes,one_year,no,electronic_check,85.45,6300.85,0
4,2364-ufrom,male,0,no,no,30,yes,no,dsl,yes,yes,no,yes,yes,no,one_year,no,electronic_check,70.4,2044.75,0


In [35]:
churn_female = df_full_train[df_full_train.gender == "female"].churn.mean()
churn_female

0.27682403433476394

In [36]:
churn_male = df_full_train[df_full_train.gender == "male"].churn.mean()
churn_male

0.2632135306553911

In [37]:
global_churn = df_full_train.churn.mean()
global_churn

0.26996805111821087

In [38]:
churn_partner = df_full_train[df_full_train.partner == "yes"].churn.mean()
churn_partner

0.20503330866025166

In [39]:
churn_no_partner = df_full_train[df_full_train.partner == "no"].churn.mean()
churn_no_partner

0.3298090040927694
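The group churn rates above become easier to compare once they are related to the global rate. The sketch below assumes two common definitions that the notebook has not spelled out yet: the difference is the group rate minus the global rate, and the risk ratio is the group rate divided by the global rate. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 4 customers with a partner, 4 without
toy = pd.DataFrame({
    "partner": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "churn":   [0, 0, 0, 1, 1, 1, 0, 0],
})

# churn rate over all rows
global_rate = toy["churn"].mean()

# churn rate and group size for each value of the feature
summary = toy.groupby("partner")["churn"].agg(["mean", "count"])

# assumed definitions: difference and ratio relative to the global rate
summary["diff"] = summary["mean"] - global_rate
summary["risk"] = summary["mean"] / global_rate

print(summary)
```

Under these definitions, a risk ratio above 1 marks a group that churns more than average and a ratio below 1 a group that churns less. With the values computed above, customers without a partner (about 0.33) sit above the global rate (about 0.27), while customers with a partner (about 0.21) sit below it.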