# Naive Bayes - Class Exercise 1

## Introduction

## Metadata (Data Dictionary)

| No.| Variable | Data Type | Description |
|----|----------|-----------|-------------|
| 1  | customerID | string | ID of the customer |
| 2  | gender | string | Gender of the customer |
| 3  | SeniorCitizen | string | Whether the customer is a senior citizen (1) or not (0) |
| 4  | Partner | string | Whether the customer has a partner |
| 5  | Dependents | string | Whether the customer has dependent(s) |
| 6  | tenure | int | The duration as a customer (months) |
| 7  | PhoneService | string | Whether the customer subscribed to the phone service |
| 8  | MultipleLines | string | Whether the customer subscribed to multiple phone lines |
| 9  | InternetService | string | Type of Internet Service |
| 10 | OnlineSecurity | string | Whether the customer subscribed to online security |
| 11 | OnlineBackup | string | Whether the customer subscribed to online backup |
| 12 | DeviceProtection | string | Whether the customer subscribed to online device protection |
| 13 | TechSupport | string | Whether the customer subscribed to technical support |
| 14 | StreamingTV | string | Whether the customer subscribed to streaming TV |
| 15 | StreamingMovies | string | Whether the customer subscribed to streaming movies |
| 16 | Contract | string | Type of Contract |
| 17 | PaperlessBilling | string | Whether the customer activated paperless billing |
| 18 | PaymentMethod | string | Payment method of the customer |
| 19 | MonthlyCharges | float | Monthly charge of the customer |
| 20 | TotalCharges | float | Total Charges of the customer |
| 21 | Churn | string | Whether the customer left within the last month or not |


## Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

In [None]:
df = pd.read_csv('_.csv')
df

# Handling missing values

In [None]:
# Examine the missing values

columns_with_missing_values = df.columns[df.isnull().any()].to_list()
columns_with_missing_values

In [None]:
# Preview the data types

for column in columns_with_missing_values:
    print(column, _)

In [None]:
# "tenure" is a numeric variable
# the rest are categorical variables
# We can fill the missing values with the mean for "tenure" and with the mode for the categorical variables

for column in columns_with_missing_values:
    if column == 'tenure':
        df[column] = df[column].fillna(_)
    else:
        df[column] = df[column].fillna(_)
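
The mean/mode idea described in the comments above can be tried on a small toy frame first. This is only a sketch on made-up values; the `toy` DataFrame and its contents are assumptions for illustration, not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the exercise dataset)
toy = pd.DataFrame({
    'tenure': [1.0, np.nan, 5.0],
    'Partner': ['Yes', None, 'Yes'],
})

# Numeric column: fill with the column mean
toy['tenure'] = toy['tenure'].fillna(toy['tenure'].mean())

# Categorical column: mode() returns a Series (ties are possible),
# so take its first entry
toy['Partner'] = toy['Partner'].fillna(toy['Partner'].mode()[0])

print(toy)
```

Note that `mode()` can return several values when there is a tie, which is why the sketch picks the first one explicitly.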

# Handle Categorical Values

In [None]:
# Extract the columns in object type

object_columns = [column for column in df.columns if _]
object_columns

In [None]:
# Preview the values in object columns

for column in object_columns:
    print(column, df[column].unique()[:5])

<font color=red><b>Question:</b></font> Why is "TotalCharges" in object type? We can take a closer look.

In [None]:
# Among all object columns, it seems "TotalCharges" should be numeric.
# It implies that this column contains some non-numeric characters

df['TotalCharges']

At a glance, the values look numeric. We need to set the "number-like" values aside and see what remains.

What does a "number-like" value look like? It should contain at most one "." and all other characters should be digits.
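
One possible way to express this rule in code is sketched below. The helper name `is_number_like` and the sample values are hypothetical; other formulations (for example a regular expression) would work just as well.

```python
def is_number_like(value):
    """Return True if value has at most one '.' and otherwise only digits."""
    text = str(value)
    # Remove at most one decimal point, then require that only digits remain
    return text.count('.') <= 1 and text.replace('.', '', 1).isdigit()

# Hypothetical sample values
samples = ['29.85', '1889', ' ', '', '1.2.3', 'abc']
print([is_number_like(s) for s in samples])
```

A function of this kind can be passed to `Series.apply` to build a boolean mask.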

In [None]:
# Make a boolean Series that is True for "number-like" values
mask = df['TotalCharges'].apply(_)

# Exclude the "number-like" values and show what remains
df['TotalCharges'][~mask]

<font color=red><b>Question:</b></font> What are they? <br>
They are missing values by nature. But each cell contains a blank space (" "), so it was not recognized as np.nan.<br>
We can change them to np.nan and then fill them with fillna(). However, do they all contain exactly one blank space character? We cannot tell by eye. So, we can replace the blank spaces with an empty string ("") first and then change the empty strings to np.nan.
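
As an alternative to the two-step replacement, the same cleanup can be sketched with `Series.str.strip` and `pd.to_numeric`. This swaps in a different technique from the one the exercise asks for, and the `raw` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values: numbers stored as text, plus blank cells of varying width
raw = pd.Series(['29.85', ' ', '1889.5', '   ', ''])

# Strip surrounding whitespace so every blank cell becomes the empty string
stripped = raw.str.strip()

# errors='coerce' turns anything that cannot be parsed (here: '') into NaN
numeric = pd.to_numeric(stripped, errors='coerce')

print(numeric.isna().tolist())
```

Stripping first means it does not matter how many blank characters a cell holds.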

In [None]:
# Replace ' ' (the blank space) by '' (empty string)
df['TotalCharges'] = df['TotalCharges'].map(_)

# Replace '' by np.nan
df['TotalCharges'] = df['TotalCharges'].replace(_, np.nan)

# Convert it to float type
df['TotalCharges'] = df['TotalCharges'].astype('float64')

# Fill the missing values by mean
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())

# Check if there are still missing values
df['TotalCharges'].isnull().any()

"False" means there is no missing value anymore in "TotalCharges".

In [None]:
# Extract object columns again
object_columns = [column for column in df.columns if df[column].dtype == np.dtype('object')]
object_columns

Now, "TotalCharges" is not in it anymore. The rest are truly object columns.

# OneHotEncoder
We can convert each object column into multiple binary variables, one per category.
Remember to exclude "customerID". You DO NOT want to do one-hot encoding on it.

In [None]:
# Extract the original columns and column names

raw_columns = df[_]
raw_column_names = list(raw_columns.columns)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Initialize the encoder
# drop="first" drops the first category of each feature and treats it as the baseline


enc = OneHotEncoder(drop='first')
enc.fit(raw_columns)

In [None]:
# Make the encoded columns

encoded_columns = enc.transform(raw_columns).toarray()
encoded_columns

In [None]:
# The encoded columns are stored as np.array
# To put them into df, we need to extract the column names

encoded_column_names = list(enc.get_feature_names_out())
encoded_column_names

In [None]:
# Put them into df

df[encoded_column_names] = encoded_columns
df

In [None]:
label = 'Churn_Yes'
excluded_features = [label, 'customerID'] + raw_column_names
features = [feature for feature in list(df) if feature not in excluded_features]
features

# Manual Way
We have learned the theory of the naive Bayes classifier in the lesson and practiced doing the calculation manually.<br>
While our memory is still fresh, we can try to reproduce the manual work in code.

In [None]:
# We will pick 4 variables for the demonstration

feature_set = ['SeniorCitizen', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes']

In [None]:
# Preview them
# They are all binary variables

df[_]

In [None]:
# Extract the feature columns and the label column

df_x = df[_]
df_y = df[_]

In [None]:
# Determine the prior probability of y = 1

prior_prob = _
prior_prob

In [None]:
# Extract all combinations among the feature set

combin_df = df[feature_set].drop_duplicates().sort_values(feature_set).reset_index(drop=True)
combin_df

In [None]:
# We are going to create 2 DataFrames
# One stores the probability of each category of each feature
# The other stores the conditional probability

prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()

In [None]:
# We will demonstrate how to update prob_df and cond_prob_df, using "SeniorCitizen" feature

feature = 'SeniorCitizen'

In [None]:
# This will determine the probability of each value for "SeniorCitizen"

prob = df[feature].value_counts() / df.shape[0]
prob

In [None]:
# Keep only the rows of the positive class
# Then determine the conditional probability of each value of "SeniorCitizen" given that class

filtered_df = df[df[label] == 1]
cond_prob = filtered_df[feature].value_counts() / filtered_df.shape[0]
cond_prob
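
The difference between the marginal and the conditional probability is easier to check on a tiny example. The sketch below uses made-up data and `value_counts(normalize=True)`, which gives the same shares as dividing the counts by the number of rows when there are no missing values.

```python
import pandas as pd

# Hypothetical toy data: one binary feature and a binary label
toy = pd.DataFrame({
    'SeniorCitizen': [0, 0, 0, 1, 1, 0, 1, 0],
    'Churn_Yes':     [0, 0, 1, 1, 1, 0, 0, 0],
})

# P(feature = v): share of each value over all rows
toy_prob = toy['SeniorCitizen'].value_counts(normalize=True)

# P(feature = v | label = 1): same share, restricted to the positive class
toy_positives = toy[toy['Churn_Yes'] == 1]
toy_cond_prob = toy_positives['SeniorCitizen'].value_counts(normalize=True)

print(toy_prob[1], toy_cond_prob[1])
```

Here 3 of 8 customers are senior citizens overall, but 2 of the 3 churned customers are, so the conditional probability is higher than the marginal one.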

In [None]:
# Now, we can update it to prob_df (and to cond_prob_df)

prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
prob_df

In [None]:
# We can restart and use a loop for all features

prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()

for feature in feature_set:
    prob = df[feature].value_counts() / df.shape[0]
    
    filtered_df = df[df[label] == 1]
    cond_prob = filtered_df[feature].value_counts() / filtered_df.shape[0]
    
    prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
    cond_prob_df[feature] = cond_prob_df[feature].map(lambda x: cond_prob[x])

In [None]:
prob_df.head()

In [None]:
cond_prob_df.head()

In [None]:
# Now, we can compute the joint probability
combin_df['P(B)'] = prob_df._

# And the joint conditional probability
combin_df['P(B|A=1)'] = cond_prob_df._

# Then, we can determine the posterior probability of the target variable
combin_df['P(A=1|B)'] = _
combin_df
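
The three columns above follow Bayes' theorem under the naive independence assumption. The arithmetic for a single feature combination can be sketched as follows, assuming each joint probability is simply the product of the per-feature probabilities; all numbers are made up.

```python
from math import prod

# Hypothetical numbers for one combination of two binary features
prior = 0.25                 # P(A=1)
cond_probs = [0.5, 0.4]      # P(B_i | A=1) for each feature
marginals = [0.5, 0.25]      # P(B_i) for each feature

# Under the naive independence assumption, joint probabilities are products
p_b_given_a = prod(cond_probs)   # P(B | A=1)
p_b = prod(marginals)            # P(B)

# Bayes' theorem: P(A=1 | B) = P(A=1) * P(B | A=1) / P(B)
posterior = prior * p_b_given_a / p_b
print(round(posterior, 3))
```

In a DataFrame, the same row-wise products can be taken across the feature columns.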

In [None]:
# Extract a subset df

sub_df = df[feature_set+[label]]
sub_df

In [None]:
# Perform a vlookup to get P(A=1|B), which is the posterior probability, from our combin_df

sub_df = pd.merge(sub_df, combin_df[feature_set+['P(A=1|B)']], on=feature_set)
sub_df

In [None]:
# If the posterior probability is larger than 0.5, we will predict it as Class 1 (positive)

sub_df['yhat'] = (_).astype(int)
sub_df

In [None]:
from sklearn import metrics

In [None]:
metrics.confusion_matrix(_, _)

# Now let's go back to sklearn

In [None]:
from sklearn.naive_bayes import CategoricalNB

In [None]:
gnb = CategoricalNB()

In [None]:
df_yhat = gnb.fit(_, _).predict(_)

In [None]:
metrics.confusion_matrix(_, _)

See, we get exactly the same result as sklearn. It shows that you are capable of writing the code behind the imported library!

However, it is not a great result.<br>
Neither the precision nor the recall is good.<br>
We shall try again and include all features.
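
Precision and recall can be read directly off a confusion matrix. The sketch below assumes the layout that `metrics.confusion_matrix` uses for binary labels (rows are true classes, columns are predicted classes); the counts are made up.

```python
# Hypothetical confusion matrix in sklearn's layout:
# rows are true classes, columns are predicted classes
#           pred 0  pred 1
toy_cm = [[50, 10],   # true 0: TN, FP
          [30, 10]]   # true 1: FN, TP

tn, fp = toy_cm[0]
fn, tp = toy_cm[1]

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(precision, recall)
```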

# Time to get serious. Train Test Split!

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

In [None]:
# We have set the "label" variable and the "features" variable previously
# Re-run it just in case they are changed

label = 'Churn_Yes'
excluded_features = [label, 'customerID'] + raw_column_names
features = [feature for feature in list(df) if feature not in excluded_features]

In [None]:
train_x = _
train_y = _

test_x = _
test_y = _

In [None]:
model = CategoricalNB()
model.fit(_, _)

train_yhat = model.predict(_)
test_yhat = model.predict(_)

In [None]:
metrics.confusion_matrix(_, _)

In [None]:
metrics.confusion_matrix(_, _)

Although the precision is still not good (~50%), it has improved significantly!

Subsequently, we will use the F1-score as the single metric to measure performance.
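
The F1-score is the harmonic mean of precision and recall. A from-scratch sketch for the positive class, using a hypothetical helper `f1_from_labels` and made-up labels, looks like this:

```python
def f1_from_labels(y_true, y_pred):
    """F1-score for the positive class (1), computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives; return 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
print(f1_from_labels(toy_true, toy_pred))
```

Because it is a harmonic mean, the F1-score stays low unless both precision and recall are reasonably high.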

In [None]:
metrics.f1_score(_, _)

# Backward Elimination
Remember that we did it for linear regression? We can try it here as well.

In [None]:
# To begin with, we are dropping no feature
features_to_drop = []

# Record the best F1-score in the iterations
best_f1_score = 0

# We will loop until we have only 1 variable left
# So, if we have n variables, we will loop for n-1 times

for i in range(len(features)-1):

    # Record the best feature to drop at this iteration
    best_feature_to_drop = None
    
    # We make a for loop to drop a feature each time
    for feature_to_drop in features:
    
        # Skip if feature_to_drop is already in features_to_drop
        if feature_to_drop in features_to_drop:
            continue

        # Re-initiate the training features and the test features
        # Drop feature_to_drop and the features in features_to_drop
        train_x = train_df[features].drop(features_to_drop, axis=1).drop([feature_to_drop], axis=1)
        test_x = test_df[features].drop(features_to_drop, axis=1).drop([feature_to_drop], axis=1)
        
        model = CategoricalNB()
        model.fit(train_x, train_y)
        
        test_yhat = model.predict(test_x)
        f1_score = metrics.f1_score(test_y, test_yhat)

        if f1_score >= best_f1_score:
            best_f1_score = _
            best_feature_to_drop = _

    if best_feature_to_drop is None:
        print('Iteration {}: No improvement. Stop.'.format(i))
        break
            
    features_to_drop.append(best_feature_to_drop)
    print('Iteration {}: Dropping {}, F1-score = {}'.format(i, best_feature_to_drop, best_f1_score))

It stopped at the 5th iteration, where dropping more variables would no longer improve the test performance.
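
The greedy loop above can be separated from the model by passing in a scoring function. The sketch below is a generic version under two assumptions: `score_fn` is a hypothetical callable that returns a score for a list of kept features, and the search starts from the score with all features rather than from 0.

```python
def backward_eliminate(features, score_fn):
    """Greedy backward elimination.

    score_fn is a hypothetical callable that takes a list of kept features
    and returns a score (higher is better), e.g. an F1-score.
    """
    dropped = []
    best_score = score_fn(list(features))  # baseline with all features
    while len(dropped) < len(features) - 1:
        best_candidate = None
        for candidate in features:
            if candidate in dropped:
                continue
            kept = [f for f in features if f not in dropped and f != candidate]
            score = score_fn(kept)
            if score >= best_score:
                best_score = score
                best_candidate = candidate
        if best_candidate is None:
            break  # no single drop matches or beats the best score so far
        dropped.append(best_candidate)
    return dropped, best_score

# Toy scoring rule: 'noise' hurts, every other feature helps a little
def toy_score(kept):
    return len([f for f in kept if f != 'noise']) - (1 if 'noise' in kept else 0)

print(backward_eliminate(['a', 'b', 'noise'], toy_score))
```

With the toy rule, only `'noise'` is dropped, because removing either of the other features lowers the score.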

In [None]:
# Extract all features except the features in features_to_drop

train_x = _
test_x = _

In [None]:
# Preview it to check

train_x

In [None]:
model = CategoricalNB()
model.fit(_, _)
test_yhat = model.predict(_)

In [None]:
metrics.confusion_matrix(_, _)

In [None]:
metrics.f1_score(_, _)

# Improving the model by tuning alpha (smoothing parameter)

In [None]:
best_alpha = None
best_f1_score = 0

for i in range(11):
    
    alpha = 0.5 + i * 0.05
    
    model = CategoricalNB(alpha=alpha)
    model.fit(_, _)
    test_yhat = model.predict(test_x)
    
    f1_score = metrics.f1_score(_, _)

    if f1_score > best_f1_score:
        best_alpha = _
        best_f1_score = _
        
    print('Alpha: {:.3f}, F1_score: {:.3f}'.format(_, _))

print('Best:', best_alpha, best_f1_score)

Well, there isn't much improvement. But that is how it works.
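
To see what alpha actually changes, it helps to look at additive smoothing on a single probability estimate. The sketch below uses a hypothetical helper `smoothed_prob` and made-up counts; it follows the usual additive smoothing formula, (count + alpha) / (class total + alpha × number of categories).

```python
def smoothed_prob(count, class_total, n_categories, alpha):
    """Additive (Laplace/Lidstone) smoothing of a category probability."""
    return (count + alpha) / (class_total + alpha * n_categories)

# Hypothetical case: a category never seen among 10 positive examples
# of a binary feature
for alpha in (0.5, 1.0):
    print(alpha, smoothed_prob(0, 10, 2, alpha))
```

A larger alpha pulls every estimate further toward a uniform distribution, so with many observations per category its effect is small, which fits the modest change seen above.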