# Naive Bayes - Class Exercise 1

## Introduction

## Metadata (Data Dictionary)

| No.| Variable | Data Type | Description |
|----|----------|-----------|-------------|
| 1  | customerID | string | ID of the customer |
| 2  | gender | string | Gender of the customer |
| 3  | SeniorCitizen | int | Whether the customer is a senior citizen (1) or not (0) |
| 4  | Partner | string | Whether the customer has a partner |
| 5  | Dependents | string | Whether the customer has dependent(s) |
| 6  | tenure | int | The duration as a customer (months) |
| 7  | PhoneService | string | Whether the customer subscribed to the phone service |
| 8  | MultipleLines | string | Whether the customer subscribed to multiple phone lines |
| 9  | InternetService | string | Type of Internet Service |
| 10 | OnlineSecurity | string | Whether the customer subscribed to online security |
| 11 | OnlineBackup | string | Whether the customer subscribed to online backup |
| 12 | DeviceProtection | string | Whether the customer subscribed to online device protection |
| 13 | TechSupport | string | Whether the customer subscribed to technical support |
| 14 | StreamingTV | string | Whether the customer subscribed to streaming TV |
| 15 | StreamingMovies | string | Whether the customer subscribed to streaming movies |
| 16 | Contract | string | Type of Contract |
| 17 | PaperlessBilling | string | Whether the customer activated paperless billing |
| 18 | PaymentMethod | string | Payment method of the customer |
| 19 | MonthlyCharges | float | Monthly charge of the customer |
| 20 | TotalCharges | float | Total Charges of the customer |
| 21 | Churn | string | Whether the customer left within the last month or not |


## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

In [2]:
df = pd.read_csv('Customer-Churn-Telco.csv')
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1.0,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34.0,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2.0,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45.0,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2.0,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24.0,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72.0,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11.0,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4.0,Yes,Yes,Fiber optic,No,...,No,,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


# Handling missing values

In [3]:
# Examine the missing values

columns_with_missing_values = df.columns[df.isnull().any()].to_list()
columns_with_missing_values

['Dependents',
 'tenure',
 'MultipleLines',
 'InternetService',
 'TechSupport',
 'StreamingTV',
 'PaperlessBilling',
 'PaymentMethod']

In [4]:
# Preview the first few unique values to infer each column's type

for column in columns_with_missing_values:
    print(column, df[column].unique()[:5])

Dependents ['No' 'Yes' nan]
tenure [ 1. 34.  2. 45.  8.]
MultipleLines ['No phone service' 'No' 'Yes' nan]
InternetService ['DSL' 'Fiber optic' 'No' nan]
TechSupport ['No' 'Yes' 'No internet service' nan]
StreamingTV ['No' 'Yes' 'No internet service' nan]
PaperlessBilling ['Yes' 'No' nan]
PaymentMethod ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)' nan]


In [5]:
# "tenure" is a numeric variable
# the rest are categorical variables
# We can fill the missing values with the mean for "tenure" and with the mode for the categorical variables

for column in columns_with_missing_values:
    if column == 'tenure':
        df[column] = df[column].fillna(df[column].mean())
    else:
        df[column] = df[column].fillna(df[column].mode()[0])

# Handle Categorical Values

In [6]:
# Extract the columns in object type

object_columns = [column for column in df.columns if df[column].dtype == np.dtype('object')]
object_columns

['customerID',
 'gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'TotalCharges',
 'Churn']

In [7]:
# Preview the values in object columns

for column in object_columns:
    print(column, df[column].unique()[:5])

customerID ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' '7795-CFOCW' '9237-HQITU']
gender ['Female' 'Male']
Partner ['Yes' 'No']
Dependents ['No' 'Yes']
PhoneService ['No' 'Yes']
MultipleLines ['No phone service' 'No' 'Yes']
InternetService ['DSL' 'Fiber optic' 'No']
OnlineSecurity ['No' 'Yes' 'No internet service']
OnlineBackup ['Yes' 'No' 'No internet service']
DeviceProtection ['No' 'Yes' 'No internet service']
TechSupport ['No' 'Yes' 'No internet service']
StreamingTV ['No' 'Yes' 'No internet service']
StreamingMovies ['No' 'Yes' 'No internet service']
Contract ['Month-to-month' 'One year' 'Two year']
PaperlessBilling ['Yes' 'No']
PaymentMethod ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
TotalCharges ['29.85' '1889.5' '108.15' '1840.75' '151.65']
Churn ['No' 'Yes']


<font color=red><b>Question:</b></font> Why is "TotalCharges" stored as an object column? Let's take a closer look.

In [8]:
# Among all object columns, "TotalCharges" looks like it should be numeric.
# This implies that the column contains some non-numeric characters

df['TotalCharges']

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: TotalCharges, Length: 7043, dtype: object

At a glance, the values all look numeric. We need to find the values that are not "number-like".

What does a "number-like" value look like? It should have at most one "." and all other characters should be digits.

In [9]:
# Build a boolean Series that is True for "number-like" values
mask = df['TotalCharges'].apply(lambda x: x.count('.') <= 1 and x.replace('.', '').isdigit())

# Show the values that are NOT "number-like"
df['TotalCharges'][~mask]

488      
753      
936      
1082     
1340     
3331     
3826     
4380     
5218     
6670     
6754     
Name: TotalCharges, dtype: object
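To see how the "number-like" check behaves on different strings, here is a minimal sketch on made-up sample values. The helper name `is_number_like` and the sample strings are illustrative only and are not part of the dataset. Note that this simple rule also rejects negative numbers, which is acceptable here because charges cannot be negative.

```python
def is_number_like(text):
    # At most one decimal point, and only digits remain once it is removed
    return text.count('.') <= 1 and text.replace('.', '').isdigit()

# Hypothetical sample strings, chosen to cover the edge cases
samples = ['29.85', '1889.5', ' ', '', '1.2.3', '-5.0']
results = {s: is_number_like(s) for s in samples}

for value, ok in results.items():
    print(repr(value), ok)
```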

<font color=red><b>Question:</b></font> What are they? <br>
They are effectively missing values. But each cell contains a blank space (" "), so it was not recognized as np.nan.<br>
We can change them to np.nan and then fill them with fillna(). However, does every cell contain exactly one blank space? We cannot tell by eye. So, we first remove the spaces to get an empty string ("") and then change the empty strings to np.nan.

In [10]:
# Replace ' ' (the blank space) by '' (empty string)
df['TotalCharges'] = df['TotalCharges'].map(lambda x: x.replace(' ', ''))

# Replace '' by np.nan
df['TotalCharges'] = df['TotalCharges'].replace('', np.nan)

# Convert it to float type
df['TotalCharges'] = df['TotalCharges'].astype('float64')

# Fill the missing values by mean
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())

# Check if there are still missing values
df['TotalCharges'].isnull().any()

False

"False" means there is no missing value anymore in "TotalCharges".
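As an alternative to the replace-then-cast steps, `pd.to_numeric` with `errors='coerce'` turns anything that cannot be parsed into NaN in one call. This is only a sketch on a made-up Series (`raw` is hypothetical, not the real column), assuming the blank cells are the only non-numeric entries.

```python
import pandas as pd

# Hypothetical stand-in for a column that mixes numbers and blank cells
raw = pd.Series(['29.85', ' ', '1889.5', '', '108.15'])

# Strip surrounding whitespace, then coerce unparseable strings to NaN
numeric = pd.to_numeric(raw.str.strip(), errors='coerce')

# Fill the missing values with the mean of the parsed values
filled = numeric.fillna(numeric.mean())
print(filled)
```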

In [11]:
# Extract object columns again
object_columns = [column for column in df.columns if df[column].dtype == np.dtype('object')]
object_columns

['customerID',
 'gender',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'Churn']

Now, "TotalCharges" is not in it anymore. The rest are truly object columns.

# OneHotEncoder
We can convert every object column into multiple binary variables, one per category.
Remember to exclude "customerID". You do NOT want to one-hot encode it.

In [12]:
# Extract the original columns and column names

raw_columns = df[object_columns[1:]]
raw_column_names = list(raw_columns.columns)

In [13]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
# Initialize the encoder
# drop="first" drops the first category of each feature, which then serves as the baseline


enc = OneHotEncoder(drop='first')
enc.fit(raw_columns)

In [15]:
# Make the encoded columns

encoded_columns = enc.transform(raw_columns).toarray()
encoded_columns

array([[0., 1., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 1., 0.],
       [1., 0., 0., ..., 0., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 0., 0.],
       [1., 1., 0., ..., 0., 1., 1.],
       [1., 0., 0., ..., 0., 0., 0.]])

In [16]:
# The encoded columns are stored as np.array
# To put them into df, we need to extract the column names

encoded_column_names = list(enc.get_feature_names_out())
encoded_column_names

['gender_Male',
 'Partner_Yes',
 'Dependents_Yes',
 'PhoneService_Yes',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_Fiber optic',
 'InternetService_No',
 'OnlineSecurity_No internet service',
 'OnlineSecurity_Yes',
 'OnlineBackup_No internet service',
 'OnlineBackup_Yes',
 'DeviceProtection_No internet service',
 'DeviceProtection_Yes',
 'TechSupport_No internet service',
 'TechSupport_Yes',
 'StreamingTV_No internet service',
 'StreamingTV_Yes',
 'StreamingMovies_No internet service',
 'StreamingMovies_Yes',
 'Contract_One year',
 'Contract_Two year',
 'PaperlessBilling_Yes',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check',
 'Churn_Yes']

In [17]:
# Put them into df

df[encoded_column_names] = encoded_columns
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_Yes
0,7590-VHVEG,Female,0,Yes,No,1.0,No,No phone service,DSL,No,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,5575-GNVDE,Male,0,No,No,34.0,Yes,No,DSL,Yes,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3668-QPYBK,Male,0,No,No,2.0,Yes,No,DSL,Yes,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
3,7795-CFOCW,Male,0,No,No,45.0,No,No phone service,DSL,Yes,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9237-HQITU,Female,0,No,No,2.0,Yes,No,Fiber optic,No,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24.0,Yes,Yes,DSL,Yes,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
7039,2234-XADUH,Female,0,Yes,Yes,72.0,Yes,Yes,Fiber optic,No,...,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
7040,4801-JZAZL,Female,0,Yes,Yes,11.0,No,No phone service,DSL,Yes,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
7041,8361-LTMKD,Male,1,Yes,No,4.0,Yes,Yes,Fiber optic,No,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0


In [18]:
label = 'Churn_Yes'
excluded_features = [label, 'customerID'] + raw_column_names
features = [feature for feature in list(df) if feature not in excluded_features]
features

['SeniorCitizen',
 'tenure',
 'MonthlyCharges',
 'TotalCharges',
 'gender_Male',
 'Partner_Yes',
 'Dependents_Yes',
 'PhoneService_Yes',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_Fiber optic',
 'InternetService_No',
 'OnlineSecurity_No internet service',
 'OnlineSecurity_Yes',
 'OnlineBackup_No internet service',
 'OnlineBackup_Yes',
 'DeviceProtection_No internet service',
 'DeviceProtection_Yes',
 'TechSupport_No internet service',
 'TechSupport_Yes',
 'StreamingTV_No internet service',
 'StreamingTV_Yes',
 'StreamingMovies_No internet service',
 'StreamingMovies_Yes',
 'Contract_One year',
 'Contract_Two year',
 'PaperlessBilling_Yes',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']

# Manual Way
We learned the theory behind the naive Bayes classifier in the lesson and practiced the calculation by hand.<br>
While it is still fresh in our memory, we can try to reproduce the manual work in code.
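As a reminder, the cells in this section follow Bayes' rule with A as the churn label and B = (B_1, ..., B_n) as the chosen features. The notation matches the column names used later in this section. Writing P(B) as a product of marginals is the simplification this notebook uses, not an exact identity.

$$
P(A=1 \mid B) = \frac{P(B \mid A=1)\,P(A=1)}{P(B)}, \qquad
P(B \mid A=1) \approx \prod_{i=1}^{n} P(B_i \mid A=1), \qquad
P(B) \approx \prod_{i=1}^{n} P(B_i)
$$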

In [19]:
# We will pick 4 variables for the demonstration

feature_set = ['SeniorCitizen', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes']

In [20]:
# Preview them
# They are all binary variables

df[feature_set]

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes
0,0,1.0,0.0,0.0
1,0,0.0,0.0,1.0
2,0,0.0,0.0,1.0
3,0,0.0,0.0,0.0
4,0,0.0,0.0,1.0
...,...,...,...,...
7038,0,1.0,1.0,1.0
7039,0,1.0,1.0,1.0
7040,0,1.0,1.0,0.0
7041,1,1.0,0.0,1.0


In [21]:
# Extract the feature columns and the label column

df_x = df[feature_set]
df_y = df[label]

In [22]:
# Determine the prior probability of y = 1

prior_prob = df_y.mean()
prior_prob

0.2653698707936959

In [23]:
# Extract all combinations among the feature set

combin_df = df[feature_set].drop_duplicates().sort_values(feature_set).reset_index(drop=True)
combin_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes
0,0,0.0,0.0,0.0
1,0,0.0,0.0,1.0
2,0,0.0,1.0,0.0
3,0,0.0,1.0,1.0
4,0,1.0,0.0,0.0
5,0,1.0,0.0,1.0
6,0,1.0,1.0,0.0
7,0,1.0,1.0,1.0
8,1,0.0,0.0,0.0
9,1,0.0,0.0,1.0


In [24]:
# We are going to create 2 DataFrames
# One stores the probability of each category of each feature
# The other stores the conditional probability

prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()

In [25]:
# We will demonstrate how to update prob_df and cond_prob_df, using "SeniorCitizen" feature

feature = 'SeniorCitizen'

In [26]:
# This will determine the probability of each value for  "SeniorCitizen"

prob = df[feature].value_counts() / df.shape[0]
prob

SeniorCitizen
0    0.837853
1    0.162147
Name: count, dtype: float64

In [27]:
# This will filter the DataFrame with only positive class
# Hence, it will determine the conditional probability of each value for  "SeniorCitizen"

filtered_df = df[df[label] == 1]
cond_prob = filtered_df[feature].value_counts() / filtered_df.shape[0]
cond_prob

SeniorCitizen
0    0.745318
1    0.254682
Name: count, dtype: float64

In [28]:
# Now, we can update it to prob_df (and to cond_prob_df)

prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
prob_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes
0,0.837853,0.0,0.0,0.0
1,0.837853,0.0,0.0,1.0
2,0.837853,0.0,1.0,0.0
3,0.837853,0.0,1.0,1.0
4,0.837853,1.0,0.0,0.0
5,0.837853,1.0,0.0,1.0
6,0.837853,1.0,1.0,0.0
7,0.837853,1.0,1.0,1.0
8,0.162147,0.0,0.0,0.0
9,0.162147,0.0,0.0,1.0


In [29]:
# We can start over and use a loop to handle all features

prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()

for feature in feature_set:
    prob = df[feature].value_counts() / df.shape[0]
    
    filtered_df = df[df[label] == 1]
    cond_prob = filtered_df[feature].value_counts() / filtered_df.shape[0]
    
    prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
    cond_prob_df[feature] = cond_prob_df[feature].map(lambda x: cond_prob[x])

In [30]:
prob_df.head()

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes
0,0.837853,0.516967,0.700412,0.096834
1,0.837853,0.516967,0.700412,0.903166
2,0.837853,0.516967,0.299588,0.096834
3,0.837853,0.516967,0.299588,0.903166
4,0.837853,0.483033,0.700412,0.096834


In [31]:
cond_prob_df.head()

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes
0,0.745318,0.642055,0.825575,0.090958
1,0.745318,0.642055,0.825575,0.909042
2,0.745318,0.642055,0.174425,0.090958
3,0.745318,0.642055,0.174425,0.909042
4,0.745318,0.357945,0.825575,0.090958


In [32]:
# Approximate P(B) by the product of the marginal probabilities (treating the features as independent)
combin_df['P(B)'] = prob_df.product(axis=1)

# Likewise, approximate P(B|A=1) by the product of the per-feature conditional probabilities
combin_df['P(B|A=1)'] = cond_prob_df.product(axis=1)

# Bayes' rule then gives the posterior probability of the target variable
combin_df['P(A=1|B)'] = combin_df['P(B|A=1)'] * prior_prob / combin_df['P(B)']
combin_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes,P(B),P(B|A=1),P(A=1|B)
0,0,0.0,0.0,0.0,0.029377,0.035934,0.324602
1,0,0.0,0.0,1.0,0.274001,0.359132,0.34782
2,0,0.0,1.0,0.0,0.012566,0.007592,0.160336
3,0,0.0,1.0,1.0,0.117199,0.075876,0.171804
4,0,1.0,0.0,0.0,0.027449,0.020033,0.193679
5,0,1.0,0.0,1.0,0.256015,0.200216,0.207532
6,0,1.0,1.0,0.0,0.011741,0.004233,0.095667
7,0,1.0,1.0,1.0,0.109506,0.042301,0.10251
8,1,0.0,0.0,0.0,0.005685,0.012279,0.573147
9,1,0.0,0.0,1.0,0.053026,0.122719,0.614143

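Because P(B) is built from marginal products, the posterior above is not guaranteed to sum to one across the two classes. A common alternative is to compute an unnormalised score for each class and divide by their sum. The sketch below does this on made-up toy data (`toy`, `toy_feats` and `class_score` are hypothetical names) and compares the result with `CategoricalNB`, using a very small `alpha` so that smoothing is negligible.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: two binary features and a binary label
toy = pd.DataFrame({
    'x1': [0, 0, 1, 1, 0, 1, 0, 1],
    'x2': [0, 1, 0, 1, 1, 0, 0, 1],
    'y':  [0, 0, 0, 1, 1, 1, 0, 1],
})
toy_feats = ['x1', 'x2']

def class_score(frame, feats, label, cls):
    # Prior of the class times the product of per-feature conditional probabilities
    subset = frame[frame[label] == cls]
    score = pd.Series(len(subset) / len(frame), index=frame.index)
    for f in feats:
        cond = subset[f].value_counts(normalize=True)
        score = score * frame[f].map(cond).fillna(0.0)
    return score

score_0 = class_score(toy, toy_feats, 'y', 0)
score_1 = class_score(toy, toy_feats, 'y', 1)

# Normalise over both classes instead of dividing by a product of marginals
manual_posterior = (score_1 / (score_0 + score_1)).to_numpy()

# alpha close to zero so that smoothing barely changes the estimates
toy_nb = CategoricalNB(alpha=1e-10).fit(toy[toy_feats], toy['y'])
sklearn_posterior = toy_nb.predict_proba(toy[toy_feats])[:, 1]

print(np.round(manual_posterior, 4))
print(np.round(sklearn_posterior, 4))
```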

In [33]:
# Extract a subset df

sub_df = df[feature_set+[label]]
sub_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes,Churn_Yes
0,0,1.0,0.0,0.0,0.0
1,0,0.0,0.0,1.0,0.0
2,0,0.0,0.0,1.0,1.0
3,0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,1.0,1.0
...,...,...,...,...,...
7038,0,1.0,1.0,1.0,0.0
7039,0,1.0,1.0,1.0,0.0
7040,0,1.0,1.0,0.0,0.0
7041,1,1.0,0.0,1.0,1.0


In [34]:
# Perform a vlookup to get P(A=1|B), which is the posterior probability, from our combin_df

sub_df = pd.merge(sub_df, combin_df[feature_set+['P(A=1|B)']], on=feature_set)
sub_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes,Churn_Yes,P(A=1|B)
0,0,1.0,0.0,0.0,0.0,0.193679
1,0,0.0,0.0,1.0,0.0,0.347820
2,0,0.0,0.0,1.0,1.0,0.347820
3,0,0.0,0.0,0.0,0.0,0.324602
4,0,0.0,0.0,1.0,1.0,0.347820
...,...,...,...,...,...,...
7038,0,1.0,1.0,1.0,0.0,0.102510
7039,0,1.0,1.0,1.0,0.0,0.102510
7040,0,1.0,1.0,0.0,0.0,0.095667
7041,1,1.0,0.0,1.0,1.0,0.366438


In [35]:
# If the posterior probability is larger than 0.5, we will predict it as Class 1 (positive)

sub_df['yhat'] = (sub_df['P(A=1|B)'] > 0.5).astype(int)
sub_df

Unnamed: 0,SeniorCitizen,Partner_Yes,Dependents_Yes,PhoneService_Yes,Churn_Yes,P(A=1|B),yhat
0,0,1.0,0.0,0.0,0.0,0.193679,0
1,0,0.0,0.0,1.0,0.0,0.347820,0
2,0,0.0,0.0,1.0,1.0,0.347820,0
3,0,0.0,0.0,0.0,0.0,0.324602,0
4,0,0.0,0.0,1.0,1.0,0.347820,0
...,...,...,...,...,...,...,...
7038,0,1.0,1.0,1.0,0.0,0.102510,0
7039,0,1.0,1.0,1.0,0.0,0.102510,0
7040,0,1.0,1.0,0.0,0.0,0.095667,0
7041,1,1.0,0.0,1.0,1.0,0.366438,0


In [36]:
from sklearn import metrics

In [37]:
metrics.confusion_matrix(sub_df[label], sub_df['yhat'])

array([[4889,  285],
       [1593,  276]], dtype=int64)

# Now let's go back to sklearn

In [38]:
from sklearn.naive_bayes import CategoricalNB

In [39]:
gnb = CategoricalNB()

In [40]:
df_yhat = gnb.fit(df_x, df_y).predict(df_x)

In [41]:
metrics.confusion_matrix(df_y, df_yhat)

array([[4889,  285],
       [1593,  276]], dtype=int64)

See, we get exactly the same confusion matrix as sklearn. It shows you are capable of writing the code behind the imported library!

However, this is not a great result.<br>
The precision is only about 49% (276 / 561) and the recall only about 15% (276 / 1869).<br>
We shall try again and include all features.

# Time to get serious. Train Test Split!

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

In [44]:
# We have set the "label" variable and the "features" variable previously
# Re-run it just in case they are changed

label = 'Churn_Yes'
excluded_features = [label, 'customerID'] + raw_column_names
features = [feature for feature in list(df) if feature not in excluded_features]

In [45]:
train_x = train_df[features]
train_y = train_df[label]

test_x = test_df[features]
test_y = test_df[label]

In [46]:
model = CategoricalNB()
model.fit(train_x, train_y)

train_yhat = model.predict(train_x)
test_yhat = model.predict(test_x)
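`CategoricalNB` expects every feature to be encoded as non-negative integer category codes, while "tenure", "MonthlyCharges" and "TotalCharges" are continuous. One option worth trying, sketched here on synthetic data rather than on the real columns, is to bin continuous values first with `KBinsDiscretizer`. The number of bins, the strategy and all the names below are assumptions for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for a continuous feature such as tenure (in months)
rng = np.random.default_rng(0)
toy_tenure = rng.uniform(0, 72, size=200)
toy_churn = (toy_tenure + rng.normal(0, 10, size=200) < 20).astype(int)

# Map each value to an integer bin index 0..5 using equal-width bins
binner = KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='uniform')
toy_binned = binner.fit_transform(toy_tenure.reshape(-1, 1)).astype(int)

# The binned feature is now a valid categorical input
toy_model = CategoricalNB().fit(toy_binned, toy_churn)
toy_pred = toy_model.predict(toy_binned)
print(np.unique(toy_binned))
```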

In [47]:
metrics.confusion_matrix(train_y, train_yhat)

array([[3092, 1041],
       [ 244, 1257]], dtype=int64)

In [48]:
metrics.confusion_matrix(test_y, test_yhat)

array([[723, 318],
       [ 85, 283]], dtype=int64)

Although the precision is still not good (~47%), the recall has improved significantly (from about 15% to about 77%)!

From here on, we will use the F1-score as the single metric to measure performance.

In [49]:
metrics.f1_score(test_y, test_yhat)

0.5841073271413829
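The F1-score is the harmonic mean of precision and recall, so it can be checked by hand against the test confusion matrix printed above. The sketch below hard-codes that matrix; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix from the cell above: rows = actual, columns = predicted
cm = np.array([[723, 318],
               [ 85, 283]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)

print('precision = {:.4f}, recall = {:.4f}, f1 = {:.4f}'.format(precision, recall, f1))
```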

# Backward Elimination
Remember that we did it for linear regression? We can try it here as well.

In [50]:
# To begin with, we are dropping no feature
features_to_drop = []

# Record the best F-score in the iterations
best_f1_score = 0

# We will loop until we have only 1 variable left
# So, if we have n variables, we will loop for n-1 times

for i in range(len(features)-1):

    # Record the best feature to drop at this iteration
    best_feature_to_drop = None
    
    # We make a for loop to drop a feature each time
    for feature_to_drop in features:
    
        # Skip if feature_to_drop is already in features_to_drop
        if feature_to_drop in features_to_drop:
            continue

        # Re-initiate the training features and the test features
        # Drop feature_to_drop and the features in features_to_drop
        train_x = train_df[features].drop(features_to_drop, axis=1).drop([feature_to_drop], axis=1)
        test_x = test_df[features].drop(features_to_drop, axis=1).drop([feature_to_drop], axis=1)
        
        model = CategoricalNB()
        model.fit(train_x, train_y)
        
        test_yhat = model.predict(test_x)
        f1_score = metrics.f1_score(test_y, test_yhat)

        if f1_score >= best_f1_score:
            best_f1_score = f1_score
            best_feature_to_drop = feature_to_drop

    if best_feature_to_drop is None:
        print('Iteration {}: No improvement. Stop.'.format(i))
        break
            
    features_to_drop.append(best_feature_to_drop)
    print('Iteration {}: Dropping {}, F1-score = {}'.format(i, best_feature_to_drop, best_f1_score))

Iteration 0: Dropping Dependents_Yes, F1-score = 0.5896907216494846
Iteration 1: Dropping MonthlyCharges, F1-score = 0.6010471204188482
Iteration 2: Dropping OnlineBackup_Yes, F1-score = 0.6056191467221644
Iteration 3: Dropping MultipleLines_Yes, F1-score = 0.6083333333333334
Iteration 4: Dropping gender_Male, F1-score = 0.6083333333333334
Iteration 5: No improvement. Stop.


It stopped at iteration 5 (the sixth pass), because dropping any further variable no longer improved the test F1-score.
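Because the loop chooses which features to drop by looking at the test-set F1-score, the final test score is likely to be somewhat optimistic. One common remedy is to carve a validation set out of the training data, select features on it, and keep the test set for a single final evaluation. The sketch below only shows the split on a made-up frame; the names and split sizes are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the prepared churn data
toy_df = pd.DataFrame({'x': np.arange(100), 'y': np.arange(100) % 2})

# First hold out a test set, then split the remainder into fit and validation parts
train_val_df, holdout_df = train_test_split(toy_df, test_size=0.2, random_state=0)
fit_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=0)

print(len(fit_df), len(val_df), len(holdout_df))
```

Feature selection would then compare F1-scores on `val_df`, and `holdout_df` would be scored only once at the end.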

In [51]:
# Extract all features except the features in features_to_drop

train_x = train_df[features].drop(features_to_drop, axis=1)
test_x = test_df[features].drop(features_to_drop, axis=1)

In [52]:
# Preview it to check

train_x

Unnamed: 0,SeniorCitizen,tenure,TotalCharges,Partner_Yes,PhoneService_Yes,MultipleLines_No phone service,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
2920,0,72.0,6155.40,1.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
2966,1,14.0,672.70,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
6099,0,71.0,1810.55,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5482,0,33.0,2405.05,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
2012,0,47.0,4533.70,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4931,0,15.0,1539.80,1.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3264,0,10.0,964.35,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1653,0,58.0,1185.95,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2607,1,1.0,69.75,0.0,1.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In [53]:
model = CategoricalNB()
model.fit(train_x, train_y)
test_yhat = model.predict(test_x)

In [54]:
metrics.confusion_matrix(test_y, test_yhat)

array([[741, 300],
       [ 76, 292]], dtype=int64)

In [55]:
metrics.f1_score(test_y, test_yhat)

0.6083333333333334

# Improving the model by tuning alpha (smoothing parameter)

In [56]:
best_alpha = None
best_f1_score = 0
for i in range(11):
    alpha = 0.5 + i * 0.05
    
    model = CategoricalNB(alpha=alpha)
    model.fit(train_x, train_y)
    test_yhat = model.predict(test_x)
    
    f1_score = metrics.f1_score(test_y, test_yhat)

    if f1_score > best_f1_score:
        best_alpha = alpha
        best_f1_score = f1_score
        
    print('Alpha: {:.3f}, F1_score: {:.3f}'.format(alpha, f1_score))

print('Best:', best_alpha, best_f1_score)

Alpha: 0.500, F1_score: 0.601
Alpha: 0.550, F1_score: 0.604
Alpha: 0.600, F1_score: 0.606
Alpha: 0.650, F1_score: 0.610
Alpha: 0.700, F1_score: 0.610
Alpha: 0.750, F1_score: 0.608
Alpha: 0.800, F1_score: 0.608
Alpha: 0.850, F1_score: 0.609
Alpha: 0.900, F1_score: 0.609
Alpha: 0.950, F1_score: 0.610
Alpha: 1.000, F1_score: 0.608
Best: 0.65 0.6096033402922756


Well, there isn't much improvement: the F1-score only moves from 0.608 at the default alpha of 1.0 to 0.610 at alpha = 0.65. But that is how tuning works.
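For intuition about what alpha does: additive (Laplace/Lidstone) smoothing adds alpha to every category count before normalising, so a category never seen in a class still gets a small non-zero probability. Here is a minimal sketch of that formula with made-up counts; the helper `smoothed_prob` is a hypothetical name, not an sklearn function.

```python
def smoothed_prob(count, class_total, n_categories, alpha):
    # (count + alpha) / (class_total + alpha * number of categories)
    return (count + alpha) / (class_total + alpha * n_categories)

# A binary feature where value 1 was never observed among 10 samples of a class
for alpha in (0.5, 0.65, 1.0):
    p_unseen = smoothed_prob(0, 10, 2, alpha)
    p_seen = smoothed_prob(10, 10, 2, alpha)
    print('alpha = {:.2f}: P(unseen) = {:.4f}, P(seen) = {:.4f}'.format(alpha, p_unseen, p_seen))
```

Larger values of alpha pull the estimates towards a uniform distribution, while smaller values stay closer to the raw frequencies.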