# Project: Naive Bayes Classifier
### Mobin Roohi
### SID: 610300060

In this project, we implement a Naive Bayes classifier from the ground up and compare it with the scikit-learn implementation on two datasets.

### a) What is Naive Bayes?
The optimal Bayes classifier uses Bayes' rule together with the class-conditional probability distributions:
$$P(\omega_i|\mathbf{x}) = \frac{P(\omega_i)p(\mathbf{x}|\omega_i)}{p(\mathbf{x})},$$

Since $p(\mathbf{x})$ is the same for every class, it can be dropped, and the discriminant function for the classifier becomes,
$$g_i(\mathbf{x}) = P(\omega_i)p(\mathbf{x}|\omega_i).$$
Using this discriminant function, the Bayes classifier is optimal in terms of the probability of classification error. However, two issues prevent it from being a practical classifier and make it mostly a theoretical benchmark to strive for. These issues are:

1. We assume that the class-conditional (likelihood) probability distributions are known and available to us. In reality, this is usually not the case, and these distributions need to be estimated, which may produce suboptimal results. Estimating them may also be too complex and challenging.

2. The estimation of these joint distributions, which are needed for the classifier, is computationally expensive.

Assuming that the joint distributions can be estimated, the remaining issue is the second one. To address it, we make the "naive" assumption that every pair of features is conditionally independent given the value of the class variable.
Thus we can rewrite the result of Bayes' theorem as:
$$P(\omega_i|\mathbf{x}) = \frac{P(\omega_i)p(\mathbf{x}|\omega_i)}{p(\mathbf{x})} = \frac{P(\omega_i)\prod_{j=1}^n p(x_j|\omega_i)}{p(\mathbf{x})}$$
where $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$.

We can rewrite this as the following discriminant function,
$$g_i(\mathbf{x})={P(\omega_i)\prod_{j=1}^n p(x_j|\omega_i)}$$
This is the discriminant function for the ***Naive Bayes classifier***: the Bayes classifier with the added assumption of conditional independence.
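
As a small illustration of this discriminant function, the sketch below evaluates $g_i(\mathbf{x})$ for two classes. The class names, priors, and per-feature likelihoods are made-up numbers chosen only for the example, not values from the datasets used later.

```python
import math

# Hypothetical priors P(w_i) for two classes
priors = {"omega_1": 0.4, "omega_2": 0.6}

# Hypothetical per-feature likelihoods p(x_j | w_i) for one observed vector x
likelihoods = {
    "omega_1": [0.8, 0.5, 0.1],
    "omega_2": [0.2, 0.5, 0.3],
}


def discriminant(prior, feature_likelihoods):
    """g_i(x) = P(w_i) * prod_j p(x_j | w_i)."""
    return prior * math.prod(feature_likelihoods)


scores = {c: discriminant(priors[c], likelihoods[c]) for c in priors}

# Dividing by the evidence p(x) turns the scores into posteriors,
# but it does not change which class has the largest score
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

predicted = max(scores, key=scores.get)
print(scores, posteriors, predicted)
```

Because the evidence is a common factor, picking the largest $g_i(\mathbf{x})$ and picking the largest posterior give the same decision.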

Naive Bayes classifiers can be fast compared to more complex methods. Decoupling the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution. This helps alleviate problems caused by the curse of dimensionality.
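
To make the saving concrete, the sketch below counts the free parameters needed per class for discrete features, once for a full joint table and once under the naive assumption. The helper names and the feature counts are hypothetical and only serve the illustration.

```python
def joint_parameters(n_features, n_values=2):
    """Free parameters of a full joint table p(x | class) over discrete features."""
    return n_values ** n_features - 1


def naive_parameters(n_features, n_values=2):
    """Free parameters when each feature has its own 1-D table p(x_j | class)."""
    return n_features * (n_values - 1)


for n in (5, 15, 30):
    print(n, joint_parameters(n), naive_parameters(n))
```

The joint table grows exponentially with the number of features, while the naive factorization grows only linearly.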

Although the assumption of conditional independence makes the task less computationally expensive, it may lead to more classification errors, since the assumption may not hold. Despite this and its simplicity, the Naive Bayes classifier has been shown to work well in a wide range of applications, such as text-specific applications like email spam detection.

Independent features are rare, but when they occur they are a good indicator that this classifier will perform well. Naive Bayes also works well in high-dimensional spaces, such as text classification, where the dimensionality of the data (e.g., the number of unique words in the dataset) is very high.

### b) Implementation From Scratch

First of all, we will familiarize ourselves with the dataset that we will be working with.

#### 1. Import Libraries

In [603]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### 2. Import Dataset

In [604]:
data = "./data/survey lung cancer.csv"

df = pd.read_csv(data, sep=',')

#### 3. Exploratory Data Analysis

In [605]:
# Dimensions (n_rows, n_columns); the last column is the label
df.shape

(309, 16)

In [606]:
# Preview of the first 5 rows
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


In [607]:
# Summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    object
 2   SMOKING                309 non-null    object
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    object
 6   CHRONIC DISEASE        309 non-null    object
 7   FATIGUE                309 non-null    object
 8   ALLERGY                309 non-null    object
 9   WHEEZING               309 non-null    object
 10  ALCOHOL CONSUMING      309 non-null    object
 11  COUGHING               309 non-null    object
 12  SHORTNESS OF BREATH    309 non-null    object
 13  SWALLOWING DIFFICULTY  309 non-null    object
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            

#### 4. Preprocess the Data
Convert the categorical values (`YES`/`NO`, `M`/`F`) into numerical values and remove the rows with missing data, which this dataset marks with the string `x`.

In [608]:
train_data = df.values
train_data[:5]

array([['M', '69', '1', 2, 2, '1', '1', '2', '1', '2', '2', '2', '2',
        '2', 2, 'YES'],
       ['M', '74', '2', 1, 1, '1', '2', '2', '2', '1', '1', '1', '2',
        '2', 2, 'YES'],
       ['F', '59', '1', 1, 1, '2', '1', '2', '1', '2', '1', '2', '2',
        '1', 2, 'NO'],
       ['M', '63', '2', 2, 2, '1', '1', '1', '1', '1', '2', '1', '1',
        '2', 2, 'NO'],
       ['F', '63', '1', 2, 1, '1', '1', '1', '1', '2', '1', '2', '2',
        '1', 1, 'NO']], dtype=object)

In [609]:
# Encode the categorical strings as numeric strings
train_data = df.values
train_data[train_data == "YES"] = "2"
train_data[train_data == "NO"] = "1"
train_data[train_data == "M"] = "1"
train_data[train_data == "F"] = "2"
# Drop every row that contains the missing-value marker 'x'
mask = (train_data == 'x')
rows_with_x = np.any(mask, axis=1)
train_data = train_data[~rows_with_x]
# Cast the cleaned object array to integers
train_data = train_data.astype(int)
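
The masking steps above can be checked on a tiny array. This is a minimal sketch on a made-up 3x4 object array (the values are hypothetical, not rows from the dataset), assuming that `x` is the only missing-value marker.

```python
import numpy as np

# Toy stand-in for df.values; the second row has a missing entry marked 'x'
toy = np.array([
    ["M", "69", "1", "YES"],
    ["F", "x", "2", "NO"],
    ["F", "59", "1", "NO"],
], dtype=object)

# Same encoding as in the cell above
toy[toy == "YES"] = "2"
toy[toy == "NO"] = "1"
toy[toy == "M"] = "1"
toy[toy == "F"] = "2"

# True for every row that contains at least one 'x'
rows_with_x = np.any(toy == "x", axis=1)

# Keep only the complete rows and convert the numeric strings to integers
clean = toy[~rows_with_x].astype(int)
print(clean)
```

Only the two complete rows survive, and every remaining entry is an integer.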

#### 5. Split Data Into Separate Training and Test Set

In [610]:
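# Features are all columns but the last; the last column is the label
x_data = train_data[:, :-1]
y_data = train_data[:, -1]
# Shuffle the row indices and use 80% for training and the rest for testing.
# No seed is set, so the split (and the metrics below) change from run to run.
train_size = int(0.8 * x_data.shape[0])
indices = np.arange(x_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:train_size]
test_indices = indices[train_size:]
x_train = x_data[train_indices]
y_train = y_data[train_indices]
x_test = x_data[test_indices]
y_test = y_data[test_indices]
y_train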

array([2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2])
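
The split above is not seeded, so it changes on every run. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng`; the function name `train_test_split_indices` and the seed value are hypothetical, and 295 is the row count implied by the 236 training and 59 test samples shown in this notebook.

```python
import numpy as np


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Return shuffled (train, test) index arrays that are identical for a fixed seed."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    train_size = int(train_fraction * n_samples)
    return indices[:train_size], indices[train_size:]


train_idx, test_idx = train_test_split_indices(295)
print(len(train_idx), len(test_idx))
```

With a fixed seed, the reported metrics can be reproduced exactly between runs, which makes the comparison between two classifiers easier to interpret.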

#### 6. Implement the Naive Bayes
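Here we implement the Naive Bayes classifier. The continuous feature (column 1, `AGE` in this dataset) is modeled with a Gaussian likelihood, and every other feature is modeled with per-class value counts and add-one (Laplace) smoothing, which we refer to as the multinomial part.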

In [611]:
class Naive_Bayes:
    """Naive Bayes with a Gaussian likelihood for column 1 and smoothed value counts elsewhere."""

    def __init__(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train
        self.means = {}           # class -> mean of the continuous feature (column 1)
        self.vars = {}            # class -> variance of the continuous feature
        self.priors = {}          # class -> P(class)
        self.class_counts = {}    # class -> number of training samples
        self.feature_counts = {}  # (class, feature index) -> {feature value: count}
        # The labels are assumed to be encoded as 1 and 2
        for class_value in [1, 2]:
            mask = self.y_train == class_value
            self.means[class_value] = np.mean(self.x_train[mask, 1])
            self.vars[class_value] = np.var(self.x_train[mask, 1])
            self.class_counts[class_value] = int(np.sum(mask))
            self.priors[class_value] = self.class_counts[class_value] / float(len(self.y_train))
            for i in range(self.x_train.shape[1]):
                if i != 1:
                    feature_values, counts = np.unique(self.x_train[mask, i], return_counts=True)
                    self.feature_counts[(class_value, i)] = dict(zip(feature_values, counts))

    def gaussian_density(self, class_val, x):
        mean = self.means[class_val]
        var = self.vars[class_val] + 1e-8  # small constant avoids division by zero
        return np.exp(- (x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def multinomial_probability(self, class_val, feature_index, feature_value):
        # Add-one (Laplace) smoothed frequency; a value unseen in this class has count 0
        counts = self.feature_counts[(class_val, feature_index)]
        return (counts.get(feature_value, 0) + 1) / (self.class_counts[class_val] + len(counts))

    def predict(self, x):
        # Sum log-probabilities instead of multiplying probabilities to avoid underflow
        posteriors = []
        for class_val in self.priors.keys():
            prior = np.log(self.priors[class_val])
            conditional_gaussian = np.log(self.gaussian_density(class_val, x[1]))
            conditional_multinomial = np.sum([np.log(self.multinomial_probability(class_val, i, x[i]))
                                              for i in range(len(x)) if i != 1])
            posterior = prior + conditional_gaussian + conditional_multinomial
            posteriors.append(posterior)
        # Classes are stored in the order 1, 2, so index 0 maps to label 1
        return np.argmax(posteriors) + 1

    def get_prediction(self, x_test):
        return [self.predict(x) for x in x_test]
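
Two details of the class above are easier to see in isolation: the add-one smoothing in `multinomial_probability` and the use of log-probabilities in `predict`. The sketch below reproduces both with plain Python; the counts and likelihood values are made up for the example, and `smoothed_probability` is a hypothetical stand-alone helper rather than part of the class.

```python
import math


def smoothed_probability(counts, value, class_count):
    """(count + 1) / (N + K), where K is the number of distinct values seen in the class."""
    return (counts.get(value, 0) + 1) / (class_count + len(counts))


# Hypothetical counts of one two-valued feature within a class of 40 training samples
counts = {1: 30, 2: 10}
p_seen = smoothed_probability(counts, 1, 40)    # (30 + 1) / (40 + 2)
p_unseen = smoothed_probability(counts, 3, 40)  # (0 + 1) / (40 + 2), never exactly zero

# Multiplying many small likelihoods underflows to 0.0,
# while the sum of their logarithms stays finite
likelihoods = [1e-5] * 100
product = math.prod(likelihoods)
log_sum = sum(math.log(p) for p in likelihoods)

print(p_seen, p_unseen, product, log_sum)
```

Without smoothing, a single unseen feature value would give a likelihood of zero and a log of negative infinity for the whole class, regardless of the other features.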


In [612]:
NB1 = Naive_Bayes(x_train, y_train)
y_train_pred = NB1.get_prediction(x_train)
y_pred = NB1.get_prediction(x_test)

Now that we have implemented Naive Bayes from scratch (using only NumPy), we import scikit-learn to evaluate the model with standard metrics and a confusion matrix.

In [613]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report,
)


In [614]:
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy_test}")

average_type = 'binary'  # scores only pos_label, which defaults to 1 (the "NO" class here)

precision_test = precision_score(y_test, y_pred, average=average_type)
recall_test = recall_score(y_test, y_pred, average=average_type)
f1_test = f1_score(y_test, y_pred, average=average_type)

print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")


Test Accuracy: 0.9322033898305084
Test Precision: 0.6
Test Recall: 0.6
Test F1 Score: 0.6


In [615]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[ 3  2]
 [ 2 52]]
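
The per-class metrics can be recomputed by hand from this matrix, which also shows which class the single precision and recall values above refer to. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels; the helper `precision_recall` below is a hypothetical name used only for this sketch.

```python
# Confusion matrix from the run above, label order [1 (NO), 2 (YES)]
cm = [[3, 2],
      [2, 52]]


def precision_recall(cm, k):
    """Precision and recall of the class at index k."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    return true_positives / predicted_k, true_positives / actual_k


p_no, r_no = precision_recall(cm, 0)
p_yes, r_yes = precision_recall(cm, 1)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

print(p_no, r_no, p_yes, r_yes, accuracy)
```

The values for class 1 (NO) are both 0.6, matching the printed precision and recall, while class 2 (YES) scores 52/54 on both.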


In [616]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.60      0.60      0.60         5
           2       0.96      0.96      0.96        54

    accuracy                           0.93        59
   macro avg       0.78      0.78      0.78        59
weighted avg       0.93      0.93      0.93        59



### c) Implement Using scikit-learn

In [617]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(x_train, y_train)
y_pred1 = model.predict(x_test)

In [618]:
accuracy_test = accuracy_score(y_test, y_pred1)
print(f"Test Accuracy: {accuracy_test}")

average_type = 'binary'

precision_test = precision_score(y_test, y_pred1, average=average_type)
recall_test = recall_score(y_test, y_pred1, average=average_type)
f1_test = f1_score(y_test, y_pred1, average=average_type)

print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")


Test Accuracy: 0.9152542372881356
Test Precision: 0.5
Test Recall: 0.6
Test F1 Score: 0.5454545454545454


In [619]:
cm = confusion_matrix(y_test, y_pred1)
print("Confusion Matrix:\n", cm)


Confusion Matrix:
 [[ 3  2]
 [ 3 51]]


Comparing accuracy, precision, recall, and F1-score, we can see that our model performs similarly to scikit-learn's model. Interestingly, across many runs the two models produced identical metrics.

Lung cancer detection is highly important, so we need classifiers that reliably flag cancer when it is present. In the confusion matrices, rows are the true labels (class 1 = NO, class 2 = YES). In the runs shown, our implementation misses 2 of the 54 cancer cases and scikit-learn's model misses 3, and both label 2 of the 5 healthy patients as having cancer. Note that the printed precision, recall, and F1 values are computed for class 1 (NO), because `average='binary'` uses `pos_label=1` by default. For an application this critical, even a few missed cancer cases are more than we can afford. Thus, although the accuracy of the classifier is good, we need to look for alternatives that miss cancer less frequently. Naive Bayes is better suited to tasks that are not as critical as cancer detection, like the classic example of email spam detection.

A possible reason for these errors is that the features here are highly dependent on one another, which violates the conditional independence assumption of the Naive Bayes classifier.

### d) Web Page Phishing Dataset
This time, we use Naive Bayes for a kind of task it is known to do well on. Naive Bayes performs well in detecting text-based fraud and spam. Here, we use it to detect which websites are phishing and which are not, based on character counts in their URLs, a task similar in nature to spam detection.

#### 1. Import Dataset

In [620]:
data = "./data/web-page-phishing.csv"

df = pd.read_csv(data, sep=',')

#### 2. Exploratory Data Analysis

In [621]:
# Dimensions (n_rows, n_columns); the last column is the label
df.shape

(100077, 20)

In [622]:
# Preview of the first 5 rows
df.head()

Unnamed: 0,url_length,n_dots,n_hypens,n_underline,n_slash,n_questionmark,n_equal,n_at,n_and,n_exclamation,n_space,n_tilde,n_comma,n_plus,n_asterisk,n_hastag,n_dollar,n_percent,n_redirection,phishing
0,37,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,77,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
2,126,4,1,2,0,1,3,0,2,0,0,0,0,0,0,0,0,0,1,1
3,18,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,55,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [623]:
# Summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100077 entries, 0 to 100076
Data columns (total 20 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   url_length      100077 non-null  int64
 1   n_dots          100077 non-null  int64
 2   n_hypens        100077 non-null  int64
 3   n_underline     100077 non-null  int64
 4   n_slash         100077 non-null  int64
 5   n_questionmark  100077 non-null  int64
 6   n_equal         100077 non-null  int64
 7   n_at            100077 non-null  int64
 8   n_and           100077 non-null  int64
 9   n_exclamation   100077 non-null  int64
 10  n_space         100077 non-null  int64
 11  n_tilde         100077 non-null  int64
 12  n_comma         100077 non-null  int64
 13  n_plus          100077 non-null  int64
 14  n_asterisk      100077 non-null  int64
 15  n_hastag        100077 non-null  int64
 16  n_dollar        100077 non-null  int64
 17  n_percent       100077 non-null  int64
 18  n_re

#### 3. Split the Training/Test Sets

In [666]:
# Shift every value by 1 so the 0/1 labels become 1/2, as the Naive_Bayes class expects
train_data = df.values + 1
x_data = train_data[:, 0:train_data.shape[1] - 1]
y_data = train_data[:, -1]
train_size = int(0.8 * x_data.shape[0])
indices = np.arange(x_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:train_size]
test_indices = indices[train_size:]
x_train = x_data[train_indices]
y_train = y_data[train_indices]
x_test = x_data[test_indices]
y_test = y_data[test_indices]
np.unique(y_train)


array([1, 2])

#### 4. Naive Bayes Implemented From Scratch

In [667]:
# Instantiate a new naive bayes classifier
NB2 = Naive_Bayes(x_train, y_train)
y_pred = NB2.get_prediction(x_test)

In [668]:
np.unique(y_pred)

array([1, 2])

In [669]:
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy_test}")

average_type = 'binary'  # scores only pos_label, which defaults to 1 (original label 0 after the +1 shift)

precision_test = precision_score(y_test, y_pred, average=average_type)
recall_test = recall_score(y_test, y_pred, average=average_type)
f1_test = f1_score(y_test, y_pred, average=average_type)

print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")

Test Accuracy: 0.8442745803357314
Test Precision: 0.8517757557968888
Test Recall: 0.9136560409287682
Test F1 Score: 0.8816314130558615


In [670]:
y_pred[:12]

[1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [671]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[11608  1097]
 [ 2020  5291]]


#### 5. Using scikit-learn

In [672]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [674]:
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy_test}")

average_type = 'binary' 

precision_test = precision_score(y_test, y_pred, average=average_type)
recall_test = recall_score(y_test, y_pred, average=average_type)
f1_test = f1_score(y_test, y_pred, average=average_type)

print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")

Test Accuracy: 0.712330135891287
Test Precision: 0.6933912365681199
Test Recall: 0.9802439984258166
Test F1 Score: 0.8122350485880128


In [665]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[12555   234]
 [ 5364  1863]]


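It seems that our implementation works better than scikit-learn's GaussianNB on this dataset. The likely reason is that our implementation uses a Gaussian only for column 1 and is otherwise a multinomial-style Naive Bayes based on smoothed value counts. The features here are small integer counts, so they behave more like categorical values than like normally distributed ones, and the large count-based part improves performance. (The GaussianNB confusion matrix above comes from cell 665, which ran before the split in cell 666, so its row totals differ from those of the other matrices.) As a check, we also fit scikit-learn's MultinomialNB below; its accuracy of about 0.818 is much closer to ours (0.844) than GaussianNB's (0.712).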
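
The argument about count-like features can be illustrated on synthetic data. The sketch below uses a made-up feature that is zero for most samples (not taken from the phishing dataset) and compares a fitted Gaussian density with the add-one smoothed frequency used by our implementation. A density and a probability are not the same quantity, so the comparison is only indicative.

```python
import math

# Hypothetical count feature within one class: mostly zeros, a few large values
values = [0] * 95 + [10] * 5

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)


def gaussian_density(x):
    """Density of the Gaussian fitted to the sample mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def smoothed_frequency(x):
    """Add-one smoothed relative frequency of the value x."""
    distinct = set(values)
    return (values.count(x) + 1) / (len(values) + len(distinct))


print(gaussian_density(0), smoothed_frequency(0))
```

The fitted Gaussian spreads its mass over values that never occur, so it gives the dominant value 0 a much lower score than the count-based estimate does.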

In [675]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [676]:
accuracy_test = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy_test}")

average_type = 'binary' 

precision_test = precision_score(y_test, y_pred, average=average_type)
recall_test = recall_score(y_test, y_pred, average=average_type)
f1_test = f1_score(y_test, y_pred, average=average_type)

print(f"Test Precision: {precision_test}")
print(f"Test Recall: {recall_test}")
print(f"Test F1 Score: {f1_test}")

Test Accuracy: 0.8175959232613909
Test Precision: 0.8188926458157227
Test Recall: 0.9149940968122786
Test F1 Score: 0.8642801382848221


In [677]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[11625  1080]
 [ 2571  4740]]
