# Naive Bayes - Class Exercise 1

## Introduction

## Metadata (Data Dictionary)

| No.| Variable | Data Type | Description |
|----|----------|-----------|-------------|
| 1  | customerID | string | ID of the customer |
| 2  | gender | string | Gender of the customer |
| 3  | SeniorCitizen | int | Whether the customer is a senior citizen (1) or not (0) |
| 4  | Partner | string | Whether the customer has a partner |
| 5  | Dependents | string | Whether the customer has dependent(s) |
| 6  | tenure | int | The duration as a customer (months) |
| 7  | PhoneService | string | Whether the customer subscribed to the phone service |
| 8  | MultipleLines | string | Whether the customer subscribed to multiple phone services |
| 9  | InternetService | string | Type of Internet Service |
| 10 | OnlineSecurity | string | Whether the customer subscribed to online security |
| 11 | OnlineBackup | string | Whether the customer subscribed to online backup |
| 12 | DeviceProtection | string | Whether the customer subscribed to online device protection |
| 13 | TechSupport | string | Whether the customer subscribed to technical support |
| 14 | StreamingTV | string | Whether the customer subscribed to streaming TV |
| 15 | StreamingMovies | string | Whether the customer subscribed to streaming movies |
| 16 | Contract | string | Type of Contract |
| 17 | PaperlessBilling | string | Whether the customer activated paperless billing |
| 18 | PaymentMethod | string | Payment method of the customer |
| 19 | MonthlyCharges | float | Monthly charge of the customer |
| 20 | TotalCharges | float | Total Charges of the customer |
| 21 | Churn | string | Whether the customer left within the last month or not |


## Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

In [None]:
df = pd.read_csv('_.csv')
df

## Handling missing values

In [None]:
# Examine the missing values

columns_with_missing_values = _
columns_with_missing_values
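
One possible way to list the columns that contain missing values, shown on a small made-up DataFrame rather than the exercise data. The names `toy`, `missing_flags`, and `toy_columns_with_missing` are assumptions introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the exercise dataset)
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 12.0],
    "Partner": ["Yes", "No", None],
    "Contract": ["One year", "Two year", "One year"],
})

# isnull().any() yields one boolean per column;
# keep the column names where that boolean is True
missing_flags = toy.isnull().any()
toy_columns_with_missing = list(missing_flags[missing_flags].index)
print(toy_columns_with_missing)
```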

In [None]:
# Preview the first few unique values of each column

for column in columns_with_missing_values:
    print(column, df[column].unique()[:5])

In [None]:
# "tenure" is a numeric variable
# the rest are categorical variables
# We can fill the missing values with the mean for "tenure" and with the mode for the categorical variables

for column in columns_with_missing_values:
    if column == 'tenure':
        df[column] = _
    else:
        df[column] = _
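
A minimal sketch of mean and mode imputation on made-up data, in case the two cases are unclear. The DataFrame `toy` is hypothetical; the calls are standard pandas (`fillna`, `mean`, `mode`).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the exercise dataset)
toy = pd.DataFrame({
    "tenure": [2.0, np.nan, 10.0],
    "Partner": ["Yes", "Yes", None],
})

# Numeric column: replace NaN with the column mean (NaN is skipped when averaging)
toy["tenure"] = toy["tenure"].fillna(toy["tenure"].mean())

# Categorical column: mode() returns a Series because ties are possible,
# so take its first entry
toy["Partner"] = toy["Partner"].fillna(toy["Partner"].mode()[0])
print(toy)
```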

## Handle Categorical Values

In [None]:
object_columns = [column for column in df.columns if df[column].dtype == np.dtype('object')]
object_columns

In [None]:
# Preview the first few unique values of each column

for column in object_columns:
    print(column, df[column].unique()[:5])

<font color=red><b>Question:</b></font> Why does "TotalCharges" have the object type? Let's take a closer look.

In [None]:
df['TotalCharges']

At a glance, the values look numeric. We need to find the values that are not "number-like".

What does a "number-like" value look like? It has at most one "." and every other character is a digit.
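
The rule above could be sketched with a regular expression, shown here on a made-up Series. This is only one possible approach; the pattern is slightly stricter than the rule as stated (it also rejects an empty string and a lone "."), and the names `toy` and `toy_mask` are assumptions for this sketch.

```python
import pandas as pd

# Toy values (hypothetical): three numbers, one blank space, one empty string
toy = pd.Series(["29.85", "1889", " ", "108.15", ""])

# One or more digits, optionally followed by "." and more digits
toy_mask = toy.str.fullmatch(r"\d+(\.\d*)?")

# Values that are NOT number-like
print(toy[~toy_mask])
```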

In [None]:
# Build a boolean Series that is True for "number-like" values
mask = _

# Show the values that are NOT "number-like"
df['TotalCharges'][~mask]

<font color=red><b>Question:</b></font> What are they? <br>
They are effectively missing values. Each cell holds a blank space (" ") rather than being empty, so it was not recognized as np.nan.<br>
We can change them to np.nan and then fill them with fillna(). However, does every such cell contain exactly one blank space? We cannot tell by eye. So we first replace each space with an empty string ("") and then convert the empty strings to np.nan.
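
As an alternative sketch of the same clean-up, `pd.to_numeric` with `errors="coerce"` turns unparseable strings into NaN in one step. This is not the approach taken in this exercise; the Series `toy` is made up for illustration.

```python
import pandas as pd

# Toy values (hypothetical): two numbers and one blank-space cell
toy = pd.Series(["29.85", " ", "108.15"])

# errors="coerce" converts anything that cannot be parsed into NaN
toy_numeric = pd.to_numeric(toy, errors="coerce")

# Fill the resulting NaN with the mean of the parsed values
toy_numeric = toy_numeric.fillna(toy_numeric.mean())
print(toy_numeric)
```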

In [None]:
df['TotalCharges'] = df['TotalCharges'].map(lambda x: x.replace(' ', ''))
df['TotalCharges'] = df['TotalCharges'].map(lambda x: np.nan if x == '' else x)
df['TotalCharges'] = df['TotalCharges'].astype('float64')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())
df['TotalCharges'].isnull().any()

"False" means there is no missing value anymore in "TotalCharges".

In [None]:
# Extract object columns again.
object_columns = [column for column in df.columns if df[column].dtype == np.dtype('object')]
object_columns

## OneHotEncoder
We can convert each object column into multiple binary variables that denote its categories.
Remember to exclude "customerID". You do NOT want to one-hot encode it.

In [None]:
raw_columns = df[object_columns[1:]]
raw_column_names = list(raw_columns.columns)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
enc = OneHotEncoder(drop='first')
enc.fit(raw_columns)

In [None]:
# Make the encoded columns
encoded_columns = enc.transform(raw_columns).toarray()
encoded_columns

In [None]:
# The encoded columns are stored as np.array
# To put them into df, we need to extract the column names
encoded_column_names = list(enc.get_feature_names_out())
encoded_column_names

In [None]:
# Put them into df
df[encoded_column_names] = encoded_columns
df

In [None]:
label = 'Churn_Yes'
excluded_features = [label, 'customerID'] + raw_column_names
features = [feature for feature in list(df) if feature not in excluded_features]
features

In [None]:
feature_set = ['SeniorCitizen', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes']

In [None]:
df[feature_set]

In [None]:
df[label]

In [None]:
df_x = df[feature_set]
df_y = df[label]

## Manual Way

In [None]:
# Determine the prior probability of y = 1
prior_prob = _
prior_prob

In [None]:
# Extract all combinations among the feature set
combin_df = df[feature_set].drop_duplicates().sort_values(feature_set).reset_index(drop=True)
combin_df

In [None]:
# We are going to create 2 DataFrames
# One stores the probability of features
# The other stores the conditional probability
prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()
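
For reference, these two tables feed Bayes' rule. The sketch below assumes, following the column names used later in this exercise, that A is the label and B is the vector of features B_i, and that both joint probabilities are approximated by per-feature products:

```latex
P(A=1 \mid B) = \frac{P(B \mid A=1)\, P(A=1)}{P(B)},
\qquad
P(B \mid A=1) \approx \prod_{i} P(B_i \mid A=1),
\qquad
P(B) \approx \prod_{i} P(B_i)
```

The product for P(B | A=1) is the naive Bayes independence assumption. The product for P(B) is an extra simplification, so the resulting ratio is not guaranteed to stay below 1.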

In [None]:
# We will demonstrate how to update prob_df and cond_prob_df, using "SeniorCitizen" feature
feature = 'SeniorCitizen'

In [None]:
# Estimate the probability of each value of "SeniorCitizen"
prob = _
prob

In [None]:
# Keep only the rows in the positive class,
# then estimate the conditional probability of each value of "SeniorCitizen"
filtered_df = df[df[label] == 1]
cond_prob = _
cond_prob
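
A small sketch of how a prior, a marginal, and a conditional probability can be estimated from relative frequencies, on made-up data. The DataFrame `toy` is hypothetical, and `value_counts(normalize=True)` is just one of several equivalent ways to get proportions.

```python
import pandas as pd

# Toy data (hypothetical): one binary feature and a binary label
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, 0],
    "Churn_Yes": [1, 0, 1, 1],
})

# Prior: the mean of a 0/1 label is the share of positives
toy_prior = toy["Churn_Yes"].mean()

# Marginal: share of rows taking each feature value
toy_prob = toy["SeniorCitizen"].value_counts(normalize=True)

# Conditional: same share, computed only on the positive-class rows
positives = toy[toy["Churn_Yes"] == 1]
toy_cond_prob = positives["SeniorCitizen"].value_counts(normalize=True)

print(toy_prior, toy_prob, toy_cond_prob, sep="\n")
```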

In [None]:
# Now, we can update it to prob_df (and to cond_prob_df)
prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
prob_df

In [None]:
# We can start over and loop through all the features

prob_df = combin_df.copy()
cond_prob_df = combin_df.copy()

for feature in feature_set:
    prob = df[feature].value_counts() / df.shape[0]
    
    filtered_df = df[df[label] == 1]
    cond_prob = filtered_df[feature].value_counts() / filtered_df.shape[0]
    
    prob_df[feature] = prob_df[feature].map(lambda x: prob[x])
    cond_prob_df[feature] = cond_prob_df[feature].map(lambda x: cond_prob[x])

In [None]:
prob_df.head()

In [None]:
cond_prob_df.head()

In [None]:
# Now, we can compute the joint probability
combin_df['P(B)'] = _

# And the joint conditional probability
combin_df['P(B|A=1)'] = _

# Then, we can determine the posterior probability of the target variable
combin_df['P(A=1|B)'] = _
combin_df
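
The arithmetic behind these three columns can be sketched on made-up numbers. The probabilities below are invented for illustration, and the sketch assumes each joint probability is the row-wise product of the per-feature probabilities.

```python
import pandas as pd

# Hypothetical per-feature probabilities for two feature combinations
toy_prob = pd.DataFrame({"f1": [0.8, 0.2], "f2": [0.5, 0.5]})         # P(B_i)
toy_cond_prob = pd.DataFrame({"f1": [0.6, 0.4], "f2": [0.25, 0.75]})  # P(B_i | A=1)
toy_prior = 0.3                                                       # P(A=1)

# Row-wise products give the (approximate) joint probabilities
toy_p_b = toy_prob.prod(axis=1)
toy_p_b_given_a = toy_cond_prob.prod(axis=1)

# Bayes' rule: posterior = prior * likelihood / evidence
toy_posterior = toy_prior * toy_p_b_given_a / toy_p_b
print(toy_posterior)
```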

In [None]:
sub_df = df[feature_set+[label]]
sub_df

In [None]:
sub_df = pd.merge(sub_df, combin_df[feature_set+['P(A=1|B)']], on=feature_set)
sub_df

In [None]:
sub_df['yhat'] = (_).astype(int)
sub_df
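
Turning a probability into a 0/1 prediction usually means comparing it with a cutoff. The sketch below assumes a cutoff of 0.5 with a strict inequality; both choices are assumptions, not something this exercise prescribes.

```python
import pandas as pd

# Hypothetical posterior probabilities
toy_posterior = pd.Series([0.12, 0.50, 0.73])

# The comparison gives booleans; astype(int) maps False/True to 0/1
toy_yhat = (toy_posterior > 0.5).astype(int)
print(toy_yhat.tolist())
```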

In [None]:
from sklearn import metrics

metrics.confusion_matrix(_, _)
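
As a reminder of the argument order, `confusion_matrix` takes the true labels first and the predictions second; rows correspond to true classes and columns to predicted classes. The labels below are made up.

```python
from sklearn import metrics

# Hypothetical true labels and predictions
toy_y_true = [0, 0, 1, 1]
toy_y_pred = [0, 1, 1, 1]

# Layout for binary labels: [[TN, FP], [FN, TP]]
toy_cm = metrics.confusion_matrix(toy_y_true, toy_y_pred)
print(toy_cm)
```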

## Now let's go back to sklearn

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
gnb = _()

In [None]:
df_yhat = gnb.fit(_, _).predict(_)

In [None]:
from sklearn import metrics

In [None]:
metrics.confusion_matrix(_, _)
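
For orientation, the fit-then-predict pattern looks like this on made-up binary features. The arrays `toy_x` and `toy_y` are hypothetical, and the model is left at its default settings.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data (hypothetical): the first feature marks class 1, the second class 0
toy_x = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
toy_y = np.array([1, 1, 0, 0])

toy_model = MultinomialNB()
# fit() returns the fitted model, so predict() can be chained
toy_yhat = toy_model.fit(toy_x, toy_y).predict(toy_x)
print(toy_yhat)
```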

See, the result is pretty similar to what we have calculated manually.