# Naive Bayes

### Model
#### $P(y|X)=\frac{P(X|y)P(y)}{P(X)}$

With several features $x_1, ..., x_n$ that are assumed independent given $y$: <br>
#### $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1, ..., x_n)}$

__*!!! Since the denominator is the same for every target, it can be omitted from the calculation; only the numerators need to be compared.*__
###### $P(y|x_1, ..., x_n) \propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$ 
<br>
Naive Bayes uses the features X to classify a sample: it computes the probability of the sample belonging to target 1, target 2, ..., target n, compares them, and picks the target with the highest probability.
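
As a small worked example of this comparison, the sketch below scores two targets for a two-word input. All numbers, the labels `spam`/`ham`, and the helper `unnormalized_posterior` are made up for illustration only.

```python
# Hypothetical priors P(y) and conditional probabilities P(x|y), chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}

def unnormalized_posterior(label, words):
    """Numerator of Bayes' rule: P(y) * P(x_1|y) * ... * P(x_n|y)."""
    score = priors[label]
    for w in words:
        score *= likelihoods[label][w]
    return score

words = ["free", "meeting"]
scores = {label: unnormalized_posterior(label, words) for label in priors}

# Dividing by the sum of the numerators recovers proper posteriors,
# but the ranking of the targets is already fixed by the numerators alone.
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(scores, key=scores.get)
print(scores, posteriors, prediction)
```

Here the numerators are 0.4 × 0.30 × 0.02 = 0.0024 for `spam` and 0.6 × 0.05 × 0.20 = 0.006 for `ham`, so `ham` is chosen.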

### Assumption
1. Features are conditionally independent of each other given the target. 
2. Every feature is equally important.
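
Written out, the independence assumption means the joint likelihood of the features factorizes into one term per feature, which is why the numerator of the model can be expressed as a product:
#### $P(x_1, ..., x_n|y)=P(x_1|y)P(x_2|y)...P(x_n|y)$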

**Steps**
1. Separate the dataset by target; you need a value-count table for each class.
2. Calculate the probability of each target: $P(Y)$ = number of samples with target Y / total number of samples.
3. For each target group:
    * sum up the frequency of each unique word
    * calculate the probability of each word within the target group, **remark: this is $P(X|y)$, the conditional probability of x given target y**
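
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own implementation: the function name `train_naive_bayes` and the three toy documents are assumptions made for the example.

```python
from collections import Counter

def train_naive_bayes(documents, labels):
    """Return class priors P(y) and per-class word probabilities P(x|y)."""
    # Step 1: separate the documents by target and build a word-count table per class.
    word_counts = {}
    doc_counts = Counter(labels)
    for words, label in zip(documents, labels):
        word_counts.setdefault(label, Counter()).update(words)

    # Step 2: P(y) = number of documents with target y / total number of documents.
    total_docs = len(labels)
    priors = {label: n / total_docs for label, n in doc_counts.items()}

    # Step 3: P(x|y) = count of word x in class y / total word count of class y.
    cond_probs = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        cond_probs[label] = {w: c / total_words for w, c in counts.items()}
    return priors, cond_probs

# Toy data, made up for illustration.
docs = [["cheap", "pills", "cheap"], ["meeting", "tomorrow"], ["cheap", "meeting"]]
labels = ["spam", "ham", "ham"]
priors, cond_probs = train_naive_bayes(docs, labels)
print(priors)
print(cond_probs)
```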
    
Now we have $P(Y)$ and $P(X|y)$; $P(X)$ is the same for every target, so it is not needed for the comparison.<br>
**Input new data** <br>
For each target group:
1. Multiply the prior by the conditional probabilities of the input's features: $P(y|x_1, ..., x_n) \propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$
2. Compare the resulting scores across targets and assign the target with the highest score to the input data.
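
A rough sketch of this prediction loop is shown below. Two things in it go beyond the steps as written and are assumptions of the example: it sums logarithms instead of multiplying probabilities (to avoid numerical underflow on long inputs), and it applies add-`alpha` smoothing so that a word never seen in a class does not force the whole product to zero. The names `predict`, `alpha`, and `vocab_size`, and the toy counts, are hypothetical.

```python
import math
from collections import Counter

def predict(words, priors, word_counts, vocab_size, alpha=1.0):
    """Return the target with the highest log-score, plus the scores of all targets."""
    scores = {}
    for label, prior in priors.items():
        counts = word_counts[label]
        total = sum(counts.values())
        # log P(y) + sum_i log P(x_i|y); the shared denominator P(X) is omitted.
        score = math.log(prior)
        for w in words:
            # Add-alpha smoothing (an assumption of this sketch): unseen words get a small
            # non-zero probability. Counter returns 0 for missing words.
            score += math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy counts, made up for illustration.
priors = {"spam": 0.5, "ham": 0.5}
word_counts = {
    "spam": Counter({"cheap": 3, "pills": 2}),
    "ham": Counter({"meeting": 3, "tomorrow": 1, "cheap": 1}),
}
vocab = set(word_counts["spam"]) | set(word_counts["ham"])
label, scores = predict(["cheap", "pills"], priors, word_counts, len(vocab))
print(label, scores)
```

Because the logarithm is monotonic, comparing log-scores picks the same target as comparing the original products.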

In [14]:
import pandas as pd
import numpy as np
import os
import re
from collections import Counter

In [2]:
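ham_word_list = []
spam_word_list = []
# Without an explicit encoding, open() fell back to the platform default and failed with:
# UnicodeDecodeError: 'cp950' codec can't decode byte 0x92 in position 884: illegal multibyte sequence
# Decoding as UTF-8 with errors='ignore' skips any bytes that cannot be decoded.
for i in range(1, 26):
    with open(f'data/email/ham/{i}.txt', 'r', encoding='utf-8', errors='ignore') as f:
        ham_file = f.read()
    # split on non-word characters, drop empty strings, keep lowercase words longer than 2 characters
    ham_words = list(filter(None, re.split(r"\W+", ham_file)))
    ham_words = [j.lower() for j in ham_words if len(j) > 2]
    ham_word_list.extend(ham_words)

    with open(f'data/email/spam/{i}.txt', 'r', encoding='utf-8', errors='ignore') as f:
        spam_file = f.read()
    spam_words = list(filter(None, re.split(r"\W+", spam_file)))
    spam_words = [j.lower() for j in spam_words if len(j) > 2]
    spam_word_list.extend(spam_words)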

In [11]:
ham_word_list = []
spam_word_list = []
with open('data/email/ham/full.txt', 'r', encoding='utf-8') as f:
    ham_file = f.read()
ham_words = list(filter(None, re.split(r"\W+", ham_file)))
ham_words = [j.lower() for j in ham_words if len(j) > 2]
ham_word_list.extend(ham_words)

with open('data/email/spam/spam_full.txt', 'r', encoding='utf-8') as f:
    spam_file = f.read()
spam_words = list(filter(None, re.split(r"\W+", spam_file)))
spam_words = [j.lower() for j in spam_words if len(j) > 2]
spam_word_list.extend(spam_words)

In [59]:
# value-count table for each class: one row per unique word
value_count = Counter(ham_word_list)
ham_df = pd.DataFrame.from_dict(value_count, orient='index', columns=['count']).reset_index()
ham_df = ham_df.rename(columns={'index': 'word'})

value_count = Counter(spam_word_list)
spam_df = pd.DataFrame.from_dict(value_count, orient='index', columns=['count']).reset_index()
spam_df = spam_df.rename(columns={'index': 'word'})

# CP = P(word | class): the word's count divided by the total word count of that class
ham_df['CP'] = ham_df['count'] / ham_df['count'].sum()
spam_df['CP'] = spam_df['count'] / spam_df['count'].sum()

# --------------------------

In [63]:
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

In [64]:
# create a unique word list
def create_vocab_list(data_set):
    vocab_set = set()  # create empty set
    for document in data_set:
        vocab_set = vocab_set | set(document)  # union of the two sets
    return list(vocab_set)

In [70]:
# check if input word is in unique word list
def set_of_words_2_vec(vocab_list, input_set):
    return_vec = [0] * len(vocab_list)
    for word in input_set:
        if word in vocab_list:
            return_vec[vocab_list.index(word)] = 1
        else:
            print(f"The word '{word}' is not in my vocabulary!")
    return return_vec

In [81]:
listOPosts, listClasses = loadDataSet()
myVocabList = create_vocab_list(listOPosts)
set_of_words_2_vec(myVocabList, listOPosts[2])

[1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
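
One way such word vectors could feed the model described at the top is sketched below. It is an illustration under stated assumptions, not a continuation taken from the notebook: the vocabulary is sorted so the vector positions are reproducible, add-one smoothing is applied to the counts, and the names `to_vector`, `train`, and `classify` are hypothetical.

```python
import numpy as np

# Same toy posts and labels as loadDataSet() (1 is abusive, 0 not).
posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = np.array([0, 1, 0, 1, 0, 1])

# Sorted vocabulary (an assumption) so every run gives the same word positions.
vocab = sorted({w for post in posts for w in post})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(words):
    """Binary presence vector over the vocabulary; unknown words are skipped."""
    vec = np.zeros(len(vocab), dtype=int)
    for w in words:
        if w in index:
            vec[index[w]] = 1
    return vec

def train(X, y):
    """Return log P(y) and log P(x|y) for classes 0 and 1, with add-one smoothing."""
    log_priors = np.log(np.array([np.mean(y == c) for c in (0, 1)]))
    log_cond = []
    for c in (0, 1):
        counts = X[y == c].sum(axis=0)  # how often each word occurs in class c
        log_cond.append(np.log((counts + 1) / (counts.sum() + X.shape[1])))
    return log_priors, np.array(log_cond)

def classify(vec, log_priors, log_cond):
    """Pick the class with the highest log P(y) + sum of log P(x_i|y) over present words."""
    return int(np.argmax(log_priors + log_cond @ vec))

X = np.array([to_vector(p) for p in posts])
log_priors, log_cond = train(X, labels)
print(classify(to_vector(['love', 'my', 'dalmation']), log_priors, log_cond))
print(classify(to_vector(['stupid', 'garbage']), log_priors, log_cond))
```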