# Naive Bayes

### Model
#### $P(y|X)=\frac{P(X|y)P(y)}{P(X)}$

if you have a lot of variables: <br>
#### $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$

__*!!! since the denominator is the same for both probabilities, it can be omitted from the calculation, and only need to consider the numerator.*__
###### $P(y|x_1, ..., x_n)=P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$ 
<br>
Naive Bayes is to use variable X to classify target y based on comparasion of probability of being target 1, target 2, target n

### Assumption
1. Features are independent to each other. 
2. Every feature is equally important.

**Step**
1. separate the dataset by target, you need to have a value count table for each class
2. calculate the probability of each target: $P(Y)$ = number of Y / total number of entities
3. for loop each target group:
    * sum up the frequency for each unique word
    * calculate the probability of each word in the target group, **remark: it is $P(X|y)$, conditional prob of x given by target y**
    
Now, we have $P(Y)$, $P(X|y)$ and P(X)<br>
**Input new data** <br>
for loop each target group:
1. multiple the conditional probability for the input data: $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$
2. compare probability (likelihood) and assign the target with highest prob to the input data

In [168]:
import pandas as pd
import numpy as np
import os
import re
from collections import Counter
import random

In [169]:
def loadDataSet():
    data = []
    target = []
    root_dir = "data/email"
    for sub_folder in os.listdir(root_dir):
        for txt_file in os.listdir(f'{root_dir}/{sub_folder}'):
            with open(f'{root_dir}/{sub_folder}/{txt_file}', encoding="latin-1") as f:
                file_content = f.read()
                words = list(filter(None, re.split(r"\W+", file_content)))
                words = [word.lower() for word in words if len(word) > 2]
                data.append(words)
                if sub_folder == 'ham':
                    target.append(0) # ham = 0
                else:
                    target.append(1) # spam = 
    return data,target

# create a unique word list from ham and spam
def create_vocab_list(train_X):
    vocab_set = set()  # create empty set
    for document in train_X:
        vocab_set = vocab_set | set(document)  # union of the two sets
    return list(vocab_set)

def Conditional_Prob(vocab_list, train_X, train_y):
    return_vec_0 = np.ones(len(vocab_list)) # a value_count list for ham 
    return_vec_1 = np.ones(len(vocab_list)) # a value_count list for spam 
    for num in range(len(train_y)):
        if train_y[num] == 0:
            for word in train_X[num]:
                return_vec_0[vocab_list.index(word)] += 1
        if train_y[num] == 1: 
            for word in train_X[num]:
                return_vec_1[vocab_list.index(word)] += 1
    P_0 = Counter(train_y)[0] / len(train_y)
    P_1 = 1 - P_0
    CP_0 = return_vec_0 / np.sum(return_vec_0)
    CP_1 = return_vec_1 / np.sum(return_vec_1)
    return P_0, P_1, CP_0, CP_1

In [170]:
def Classify(vocab_list, P_0, P_1, CP_0, CP_1, test):
    test_index = [vocab_list.index(i) for i in test if i in vocab_list]
    p0 = np.sum(np.log(CP_0[test_index]))*P_0
    p1 = np.sum(np.log(CP_1[test_index]))*P_1
    if p0 > p1:
        return 0
    else:
        return 1

In [171]:
data, target = loadDataSet()
score_list = []

# repeat 1000 to see how accurate the model
for rep in range(1000):
    # split data into train and test set
    test_index = random.sample(range(50), 10)
    train_X = [data[num] for num in range(len(data)) if num not in test_index]
    train_y = [target[num] for num in range(len(target)) if num not in test_index]
    test_X = [data[i] for i in test_index]
    test_y = [target[i] for i in test_index]

    vocab_list = create_vocab_list(train_X)
    P_0, P_1, CP_0, CP_1 = Conditional_Prob(vocab_list, train_X, train_y)
    result = []
    for sample in test_X:
        result.append(Classify(vocab_list, P_0, P_1, CP_0, CP_1, sample))
    a = np.array(result)
    b = np.array(test_y)
    score = sum(a == b) / len(test_y)
    score_list.append(score)
    if rep % 100 == 0:
        print(rep)

0
100
200
300
400
500
600
700
800
900


In [172]:
print(f'Score: {sum(score_list)/1000}')

Score: 0.8940999999999941
