# Bayes Estimation

In [33]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

## References

- https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html

In [46]:
dataset = pd.read_csv('/opt/datasetsRepo/smsspamcollection/SMSSpamCollection', sep='\t', names=['label', 'text'])
dataset['flag'] = dataset['label'].map({ "ham" : 0, "spam" : 1})

df = pd.concat([ 
    dataset.query('flag == 1').sample(50), 
    dataset.query('flag == 0').sample(50) 
], axis = 0).sample(frac=1, random_state=0)

df.head()

| | label | text | flag |
|---|---|---|---|
| 531 | spam | PRIVATE! Your 2003 Account Statement for 07815... | 1 |
| 2919 | ham | Thanks chikku..:-) gud nyt:-* | 0 |
| 160 | spam | You are a winner U have been specially selecte... | 1 |
| 5275 | ham | Oh yeah clearly it's my fault | 0 |
| 5481 | ham | Shall call now dear having food | 0 |


In [47]:
stop_words = set(stopwords.words('english'))

In [48]:
df['tokenized'] = df['text'].apply(lambda x: [i for i in word_tokenize(x.lower()) if i not in stop_words])

In [49]:
df.head()

| | label | text | flag | tokenized |
|---|---|---|---|---|
| 531 | spam | PRIVATE! Your 2003 Account Statement for 07815... | 1 | [private, !, 2003, account, statement, 0781529... |
| 2919 | ham | Thanks chikku..:-) gud nyt:-* | 0 | [thanks, chikku, .., :, -, ), gud, nyt, :, -, *] |
| 160 | spam | You are a winner U have been specially selecte... | 1 | [winner, u, specially, selected, 2, receive, £... |
| 5275 | ham | Oh yeah clearly it's my fault | 0 | [oh, yeah, clearly, 's, fault] |
| 5481 | ham | Shall call now dear having food | 0 | [shall, call, dear, food] |


## Naive Bayes

\begin{align}
    P(Y=y | X=x) &= \frac{P(X=x | Y=y) P(Y=y)}{P(X=x)}\\
    \\
    &\text{Where } \\
    P(X=x | Y=y) &= \prod_{\alpha=1}^{d} P([X]_\alpha = x_\alpha| Y = y)
\end{align}


- The "naive" assumption is that all features are conditionally independent of one another given the label Y.
- For example, in an email, every word is assumed to be independent of the other words given the label (spam or ham).
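
To make the factorization concrete, here is a minimal sketch of how the likelihood of a message could be computed as a product of per-word probabilities. The toy messages and the helper names `word_probs` and `likelihood` are assumptions made for illustration; they are not part of the SMS dataset or the notebook above. Word probabilities are estimated as plain relative frequencies, without smoothing.

```python
from collections import Counter
from math import prod

# Toy tokenized spam messages (hypothetical data, not from the SMS dataset).
spam_docs = [["win", "prize", "now"], ["win", "cash"]]


def word_probs(docs):
    """Estimate P(word | label) as the relative frequency of each token."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def likelihood(tokens, probs):
    """Naive Bayes likelihood: the product of the per-word probabilities."""
    # A word never seen for this label gets probability 0 (no smoothing here).
    return prod(probs.get(tok, 0.0) for tok in tokens)


p_word_given_spam = word_probs(spam_docs)

# P("win" | spam) * P("cash" | spam) = (2/5) * (1/5)
msg_likelihood = likelihood(["win", "cash"], p_word_given_spam)
print(msg_likelihood)
```

Note that a single unseen word drives the whole product to zero in this sketch, which is one reason practical implementations usually smooth the counts.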

## Bayes Classifier

\begin{align*}
    h(\vec{x}) &= \underset{y}{\operatorname{argmax}} \; \frac{P(\vec{x} | y) P(y)}{z}\\
    \\
    &= \underset{y}{\operatorname{argmax}} \; P(y) \prod_{\alpha} P([\vec{x}]_\alpha | y)\\
    \\
    &= \underset{y}{\operatorname{argmax}} \; \left( \log P(y) + \sum_\alpha \log P([\vec{x}]_\alpha | y) \right)
\end{align*}


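P.S. - In practice we avoid multiplying many probabilities directly, for several reasons (see the reference section), chiefly that a product of many values smaller than 1 quickly underflows floating-point precision. Hence we take the log, which converts the multiplication into an addition; because log is monotonically increasing, the argmax is unchanged.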
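
The log-space decision rule can be sketched as below. This is an illustrative sketch under stated assumptions, not the notebook's own implementation: the toy training set, the function names `fit` and `predict`, the add-one (Laplace) smoothing controlled by the hypothetical `alpha` parameter, and the choice to skip words outside the training vocabulary are all introduced here for the example.

```python
import math
from collections import Counter

# Toy training set of (tokens, flag) pairs; 1 = spam, 0 = ham (hypothetical data).
train = [
    (["win", "cash", "prize"], 1),
    (["win", "free", "prize", "now"], 1),
    (["call", "me", "now"], 0),
    (["shall", "call", "dear"], 0),
]


def fit(train, alpha=1.0):
    """Estimate log P(y) and log P(word | y) with add-alpha smoothing (an assumption)."""
    vocab = {tok for tokens, _ in train for tok in tokens}
    labels = sorted({y for _, y in train})
    log_prior, log_like = {}, {}
    for y in labels:
        docs = [tokens for tokens, label in train if label == y]
        log_prior[y] = math.log(len(docs) / len(train))
        counts = Counter(tok for tokens in docs for tok in tokens)
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[y] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like


def predict(tokens, log_prior, log_like):
    """Return argmax_y of log P(y) + sum over words of log P(word | y)."""
    scores = {
        # Words outside the training vocabulary are skipped (an assumption).
        y: log_prior[y] + sum(log_like[y][t] for t in tokens if t in log_like[y])
        for y in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_like = fit(train)
pred_spam = predict(["win", "prize"], log_prior, log_like)
pred_ham = predict(["call", "dear"], log_prior, log_like)

# Why logs: multiplying 100 probabilities of 1e-5 underflows to 0.0,
# while the equivalent sum of logs stays finite.
tiny_product = math.prod([1e-5] * 100)
log_sum = sum(math.log(1e-5) for _ in range(100))
print(pred_spam, pred_ham, tiny_product, log_sum)
```

The normalizer z is omitted in `predict` because it is the same for every label and therefore does not affect the argmax.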