# Naive Bayes

### Model
#### $P(y|X)=\frac{P(X|y)P(y)}{P(X)}$

With many features $x_1, ..., x_n$, applying the independence assumption to each feature gives: <br>
#### $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$

__*!!! Since the denominator is the same for every target, it can be omitted from the comparison; only the numerator needs to be considered.*__
###### $P(y|x_1, ..., x_n) \propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$ 
<br>
Naive Bayes uses the features X to classify the target y: it computes the probability of each candidate target (target 1, target 2, ..., target n) and compares them.
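
To make the comparison concrete, here is a minimal sketch with hypothetical toy numbers (the priors and word probabilities below are made up for illustration and are not taken from the email data). It scores each target by its numerator only, then normalizes the scores to show that dropping the denominator does not change which target wins.

```python
import math

# Hypothetical toy probabilities (assumed for illustration only)
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 0.20, "order": 0.10},
    "ham": {"free": 0.02, "order": 0.05},
}

def score(label, words):
    """Unnormalized posterior: P(y) * P(x_1|y) * ... * P(x_n|y)."""
    return priors[label] * math.prod(likelihoods[label][w] for w in words)

words = ["free", "order"]
scores = {label: score(label, words) for label in priors}

# The target with the largest numerator is the prediction
prediction = max(scores, key=scores.get)

# Dividing every score by the same total rescales them to sum to 1
# without changing their order
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}

print(prediction, scores, posteriors)
```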

### Assumptions
1. Features are independent of each other. 
2. Every feature is equally important.

**Steps**
1. Separate the dataset by target.
2. Calculate the probability of each target: $P(Y)$
3. Loop over each target group:
    * count the frequency of each unique word
    * calculate the probability of each word in the target group, **remark: this is $P(X|y)$, the conditional probability of x given target y**
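
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the `train` function, the `docs_by_label` structure, and the toy documents are hypothetical names and data, not the notebook's own implementation.

```python
from collections import Counter

def train(docs_by_label):
    """docs_by_label maps each target to a list of tokenized documents
    (step 1: the data is already separated by target)."""
    total_docs = sum(len(docs) for docs in docs_by_label.values())
    priors, cond_probs = {}, {}
    for label, docs in docs_by_label.items():
        # Step 2: P(y) is the share of documents with this target
        priors[label] = len(docs) / total_docs
        # Step 3a: count the frequency of each unique word in the group
        counts = Counter(word for doc in docs for word in doc)
        total_words = sum(counts.values())
        # Step 3b: P(x|y) is the word frequency divided by the group's word total
        cond_probs[label] = {w: c / total_words for w, c in counts.items()}
    return priors, cond_probs

# Hypothetical toy data (assumed for illustration only)
toy = {
    "spam": [["free", "pills", "free"], ["order", "pills"]],
    "ham": [["meeting", "tomorrow"], ["order", "lunch"]],
}
priors, cond_probs = train(toy)
print(priors)
print(cond_probs["spam"])
```

Within each target group the word probabilities sum to 1, which is a quick sanity check on the counts.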
    
Now we have $P(Y)$, $P(X|y)$ and $P(X)$.<br>
**Input new data** <br>
Loop over each target group:
1. Multiply the conditional probabilities of the input features together with the prior: $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$
2. Compare the resulting probabilities and assign the target with the highest probability to the input data.
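
A minimal sketch of this prediction loop is shown below. It makes two assumptions that the steps above do not state: it sums log probabilities instead of multiplying raw probabilities (a product of many small numbers can underflow to zero), and it applies add-one smoothing so that a word never seen in a target group does not force the whole product to zero. The `predict` function and the toy counts are hypothetical.

```python
import math
from collections import Counter

def predict(words, priors, counts, vocab):
    """Return the most probable target and the log score of every target."""
    scores = {}
    for label, prior in priors.items():
        total = sum(counts[label].values())
        log_score = math.log(prior)
        for w in words:
            # Assumption: add-one smoothing, so unseen words get a small
            # nonzero probability instead of zero
            p = (counts[label][w] + 1) / (total + len(vocab))
            log_score += math.log(p)
        scores[label] = log_score
    # Highest log score corresponds to the highest probability
    return max(scores, key=scores.get), scores

# Hypothetical toy word counts per target (assumed for illustration only)
counts = {
    "spam": Counter({"free": 3, "pills": 2, "order": 1}),
    "ham": Counter({"meeting": 2, "order": 1, "lunch": 1}),
}
priors = {"spam": 0.5, "ham": 0.5}
vocab = set(counts["spam"]) | set(counts["ham"])

label, scores = predict(["free", "pills"], priors, counts, vocab)
print(label, scores)
```

Because the logarithm is monotonic, comparing log scores picks the same target as comparing the original products.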

In [23]:
import pandas as pd
import numpy as np
import os
import re

In [72]:
def read_words(path):
    """Return the lowercased tokens longer than 2 characters from a text file."""
    # The context manager closes the file once it has been read
    with open(path, 'r') as f:
        text = f.read()
    # Split on non-word characters; the length filter also drops empty strings
    return [w.lower() for w in re.split(r"\W+", text) if len(w) > 2]

ham_word_list = []
spam_word_list = []
# Emails are numbered 1.txt to 25.txt in each folder
for i in range(1, 26):
    ham_word_list.extend(read_words(f'data/email/ham/{i}.txt'))
    spam_word_list.extend(read_words(f'data/email/spam/{i}.txt'))

spam_word_list

['codeine',
 '15mg',
 'for',
 '203',
 'visa',
 'only',
 'codeine',
 'methylmorphine',
 'narcotic',
 'opioid',
 'pain',
 'reliever',
 'have',
 '15mg',
 '30mg',
 'pills',
 '15mg',
 'for',
 '203',
 '15mg',
 'for',
 '385',
 '15mg',
 'for',
 '562',
 'visa',
 'only',
 'hydrocodone',
 'vicodin',
 'brand',
 'watson',
 'vicodin',
 '750',
 '195',
 '120',
 '570',
 'brand',
 'watson',
 '750',
 '195',
 '120',
 '570',
 'brand',
 'watson',
 '325',
 '199',
 '120',
 '588',
 'noprescription',
 'required',
 'free',
 'express',
 'fedex',
 'days',
 'delivery',
 'for',
 'over',
 '200',
 'order',
 'major',
 'credit',
 'cards',
 'check',
 'you',
 'have',
 'everything',
 'gain',
 'incredib1e',
 'gains',
 'length',
 'inches',
 'yourpenis',
 'permanantly',
 'amazing',
 'increase',
 'thickness',
 'yourpenis',
 'betterejacu1ation',
 'control',
 'experience',
 'rock',
 'harderecetions',
 'explosive',
 'intenseorgasns',
 'increase',
 'volume',
 'ofejacu1ate',
 'doctor',
 'designed',
 'and',
 'endorsed',
 '100',
 'he