# Naive Bayes

### Model
#### $P(y|X)=\frac{P(X|y)P(y)}{P(X)}$

With several features $x_1, ..., x_n$, assuming they are independent of each other: <br>
#### $P(y|x_1, ..., x_n)=\frac{P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1)P(x_2)...P(x_n)}$

__*!!! Since the denominator does not depend on $y$, it is the same for every target and can be omitted when comparing them; only the numerator needs to be considered.*__
###### $P(y|x_1, ..., x_n) \propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$ 
<br>
Naive Bayes uses the features X to classify the target y: it compares the probability of each possible target (target 1, target 2, ..., target n) and picks the most probable one.
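To see why the denominator can be dropped when comparing targets, here is a minimal sketch with a single feature and two targets. The probabilities are hypothetical numbers chosen for illustration, not values from the email data.

```python
# Hypothetical numbers for one feature x and two targets.
prior = {"ham": 0.7, "spam": 0.3}         # P(y)
likelihood = {"ham": 0.1, "spam": 0.6}    # P(x | y)

# Numerator of Bayes' rule for each target: P(x | y) * P(y)
numerator = {y: likelihood[y] * prior[y] for y in prior}

# The denominator P(x) is the sum of the numerators over all targets,
# so dividing by it rescales every target by the same constant.
evidence = sum(numerator.values())
posterior = {y: numerator[y] / evidence for y in prior}

best_by_numerator = max(numerator, key=numerator.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_numerator, best_by_posterior)
```

Both rankings pick the same target, which is why only the numerator is needed for classification.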

### Assumption
1. Features are independent of each other given the target. 
2. Every feature is equally important.

**Steps**
1. separate the dataset by target
2. calculate the probability of each target: $P(Y)$
3. loop over each target group:
    * sum up the frequency of each unique word
    * calculate the probability of each word in the target group, **remark: this is $P(X|y)$, the conditional probability of x given target y**
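The steps above can be sketched in plain Python as follows. The documents, the targets and the names `docs`, `targets`, `prior` and `cond_prob` are hypothetical and only serve the illustration; no smoothing is applied, so a word that never appears in a target simply has no entry for it.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: tokenized documents and their targets.
docs = [["free", "money", "now"],
        ["meeting", "now"],
        ["free", "free", "offer"],
        ["lunch", "meeting"]]
targets = ["spam", "ham", "spam", "ham"]

# Step 1: separate the dataset by target.
groups = defaultdict(list)
for doc, target in zip(docs, targets):
    groups[target].append(doc)

# Step 2: P(y) is the share of documents that belong to each target.
prior = {t: len(g) / len(docs) for t, g in groups.items()}

# Step 3: per target, count each word and turn the counts into P(x | y).
cond_prob = {}
for t, g in groups.items():
    counts = Counter(word for doc in g for word in doc)
    total = sum(counts.values())
    cond_prob[t] = {w: c / total for w, c in counts.items()}

print(prior)
print(cond_prob)
```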
    
Now we have $P(Y)$ and $P(X|y)$ ($P(X)$ is the same for every target, so it is not needed for the comparison)<br>
**Input new data** <br>
loop over each target group:
1. multiply the prior by the conditional probabilities of the input data: $P(y|x_1, ..., x_n) \propto P(x_1|y)P(x_2|y)...P(x_n|y)P(y)$
2. compare the resulting scores and assign the target with the highest one to the input data
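A minimal sketch of this prediction step, assuming the parameters are stored in dictionaries. The values of `prior` and `cond_prob`, the `UNSEEN` floor and the `classify` helper are all hypothetical. The sketch sums logarithms instead of multiplying probabilities, which is a common way to avoid numerical underflow when many small numbers are involved; it does not change which target wins.

```python
import math

# Hypothetical parameters, as a training step might produce them.
prior = {"ham": 0.5, "spam": 0.5}                        # P(y)
cond_prob = {                                            # P(x | y)
    "ham": {"meeting": 0.5, "now": 0.25, "lunch": 0.25},
    "spam": {"free": 0.5, "money": 0.2, "now": 0.2, "offer": 0.1},
}
UNSEEN = 1e-6  # assumed floor for words a target has never seen

def classify(words):
    """Return the best target and the log-score of every target."""
    scores = {}
    for target in prior:
        # log P(y) + sum of log P(x_i | y)
        score = math.log(prior[target])
        for w in words:
            score += math.log(cond_prob[target].get(w, UNSEEN))
        scores[target] = score
    return max(scores, key=scores.get), scores

label, scores = classify(["free", "money", "now"])
print(label, scores)
```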

In [1]:
import pandas as pd
import numpy as np
import os

In [5]:
ham_path = 'data/email/ham/'
spam_path = 'data/email/spam/'

# read every ham email into a list, printing each one as it is loaded
ham = []
for filename in os.listdir(ham_path):
    file_path = os.path.join(ham_path, filename)
    with open(file_path, encoding='cp1252') as f:
        a = f.read()
        ham.append(a)
        print(a)

Hi Peter,

With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?

Let me know
Eugene
Ryan Whybrew commented on your status.

Ryan wrote:
"turd ferguson or butt horn."

Arvind Thirumalai commented on your status.

Arvind wrote:
""you know""


Reply to this email to comment on this status.


Thanks Peter.

I'll definitely check in on this. How is your book
going? I heard chapter 1 came in and it was in 
good shape. ;-)

I hope you are doing well.

Cheers,

Troy
Jay Stepp commented on your status.

Jay wrote:
""to the" ???"


Reply to this email to comment on this status.

To see the comment thread, follow the link below:


LinkedIn

Kerry Haloney requested to add you as a connection on LinkedIn:

Peter,

I'd like to add you to my professional network on LinkedIn.

- Kerry Haloney
 

Hi Peter,
 
The hotels are the ones that rent out the tent. They are all lined up on the hotel grounds : )) So much for being one with nature, more lik

In [14]:
def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = np.array([0,1,0,1,0,1])    #1 is abusive, 0 not
    return postingList, classVec

In [28]:
postingList,listClasses = loadDataSet()

In [29]:
postingList

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

In [30]:
listClasses

array([0, 1, 0, 1, 0, 1])

In [None]:
# extract the unique words across all posts
def createVocabList(dataSet):
    vocab_set = set()
    for post in dataSet:
        vocab_set.update(post)
    return list(vocab_set)

In [None]:
vocabList = createVocabList(postingList)
print(vocabList)

['licks', 'take', 'mr', 'not', 'ate', 'maybe', 'I', 'quit', 'worthless', 'park', 'to', 'love', 'steak', 'dog', 'stop', 'posting', 'garbage', 'has', 'food', 'problems', 'buying', 'my', 'how', 'please', 'flea', 'help', 'stupid', 'is', 'so', 'him', 'dalmation', 'cute']


In [None]:
# convert a list of words into a binary vector: 1 = word present, 0 = word absent
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
    return returnVec

In [None]:
print(setOfWords2Vec(vocabList, postingList[0]))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]


In [None]:
postingList,listClasses = loadDataSet()
vocabList = createVocabList(postingList)
trainMat = []
for doc in postingList:
    trainMat.append(setOfWords2Vec(vocabList, doc))
trainMat = np.array(trainMat)
trainMat[1]

array([0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

In [None]:
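def Naive_bayes(X, y, y_value):
    # add 1 to every cell so that no word ends up with zero probability
    X = X + 1
    prob_y = np.count_nonzero(y == y_value) / len(y)    # P(y)
    class_index = np.where(y == y_value)    # rows that belong to this target
    group = X[class_index]
    # smoothed word counts of this target, normalized so they sum to 1
    prob_x = group.sum(axis=0) / group.sum()    # P(X|y)
    return prob_x, prob_y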

In [None]:
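# keep the parameters of every target; prob_x, prob_y alone only hold the last one
params = {}
for y_value in np.unique(listClasses):
    prob_x, prob_y = Naive_bayes(trainMat, listClasses, y_value)
    params[y_value] = (prob_x, prob_y)
prob_x, prob_y    # values for the last target (y = 1)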

(array([0.02608696, 0.03478261, 0.02608696, 0.03478261, 0.02608696,
        0.03478261, 0.02608696, 0.03478261, 0.04347826, 0.03478261,
        0.03478261, 0.02608696, 0.02608696, 0.04347826, 0.03478261,
        0.03478261, 0.03478261, 0.02608696, 0.03478261, 0.02608696,
        0.03478261, 0.02608696, 0.02608696, 0.02608696, 0.02608696,
        0.02608696, 0.05217391, 0.02608696, 0.02608696, 0.03478261,
        0.02608696, 0.02608696]),
 0.5)
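The notebook stops after estimating the parameters, so here is a hedged sketch of how a new post could be scored with them. It repeats the toy data and the same estimation logic as `Naive_bayes` so that it runs on its own. The names `posts`, `labels`, `to_vec`, `naive_bayes`, `params` and `classify`, the sorted vocabulary (used only to get a stable word order) and the two test posts are assumptions made for this illustration.

```python
import numpy as np

posts = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
         ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
         ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
         ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
         ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
         ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = np.array([0, 1, 0, 1, 0, 1])    # 1 is abusive, 0 not

# Sorted so the word order is the same on every run (an assumption of this sketch).
vocab = sorted({w for post in posts for w in post})
word_index = {w: i for i, w in enumerate(vocab)}

def to_vec(words):
    """Binary vector: 1 where a vocabulary word occurs in `words`."""
    vec = np.zeros(len(vocab), dtype=int)
    for w in words:
        if w in word_index:
            vec[word_index[w]] = 1
    return vec

X = np.array([to_vec(p) for p in posts])

def naive_bayes(X, y, y_value):
    """Same estimation as the notebook's Naive_bayes."""
    X = X + 1
    prob_y = np.count_nonzero(y == y_value) / len(y)    # P(y)
    group = X[y == y_value]
    prob_x = group.sum(axis=0) / group.sum()            # P(X|y)
    return prob_x, prob_y

# One (prob_x, prob_y) pair per target.
params = {int(c): naive_bayes(X, labels, c) for c in np.unique(labels)}

def classify(words):
    """Pick the target with the highest log P(y) + sum of log P(x_i|y)."""
    vec = to_vec(words)
    scores = {c: np.log(p_y) + vec @ np.log(p_x)
              for c, (p_x, p_y) in params.items()}
    return max(scores, key=scores.get)

print(classify(['love', 'my', 'dalmation']))
print(classify(['stupid', 'garbage']))
```

Only the words present in the new post contribute to the score, because absent words have a 0 in `vec` and drop out of the dot product.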