# Naive Bayes
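$$ \begin{split} \mathop{\arg\max}_{c_k} p(y=c_k|x) &= \mathop{\arg\max}_{c_k} p(y=c_k)p(x|y=c_k) \\
& \left( \text{since } p(y=c_k|x) = \frac{p(y=c_k)p(x|y=c_k)}{p(x)} \text{ and } p(x) \text{ does not depend on } c_k \right) \\
&= \mathop{\arg\max}_{c_k} p(y=c_k)\prod_j p(x^{(j)}|y=c_k) \end{split} $$
The last step uses the "naive" assumption that the features $ x^{(j)} $ are conditionally independent given the class.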
Use maximum likelihood estimation (MLE) to estimate $ p(y=c_k) $ and $ p(x^{(j)}|y=c_k) $ from the dataset, where $ N $ is the number of training samples and $ I(\cdot) $ is the indicator function:
$$ \hat{p}(y=c_k) = \frac{\sum_i I(y_i=c_k)}{N} \\
\hat{p}(x^{(j)}=a_j|y=c_k) = \frac{\sum_i I(x_i^{(j)}=a_j,y_i=c_k)}{\sum_i I(y_i=c_k)}
$$
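Bayesian estimation adds a smoothing term $ \lambda > 0 $ to the counts in the MLE, so that a value never seen with a class does not get probability zero. With $ K $ classes and $ S_j $ possible values of feature $ j $:
$$ \hat{p}_\lambda(y=c_k) = \frac{\sum_i I(y_i=c_k) + \lambda}{N + K\lambda} \\
\hat{p}_\lambda(x^{(j)}=a_j|y=c_k) = \frac{\sum_i I(x_i^{(j)}=a_j,y_i=c_k) + \lambda}{\sum_i I(y_i=c_k) + S_j\lambda}
$$
Setting $ \lambda = 1 $ gives Laplace smoothing.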
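
The estimates above, with $ \lambda $ used as a smoothing term, can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `fit_naive_bayes` and `predict`, the toy dataset, and the choice to add $ K\lambda $ and $ S_j\lambda $ to the denominators (the usual Laplace-style form) are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math


def fit_naive_bayes(X, y, lam=1.0):
    """Estimate smoothed priors and per-feature conditionals from categorical data.

    Assumes lam > 0 so that every estimated probability is strictly positive.
    """
    n = len(y)
    classes = sorted(set(y))
    n_features = len(X[0])
    class_counts = Counter(y)
    # Distinct values observed for each feature j (S_j in the smoothed denominator).
    values = [sorted({row[j] for row in X}) for j in range(n_features)]
    # cond_counts[(j, c)][a] = number of samples with x^(j) = a and y = c.
    cond_counts = defaultdict(Counter)
    for row, label in zip(X, y):
        for j, a in enumerate(row):
            cond_counts[(j, label)][a] += 1
    prior = {c: (class_counts[c] + lam) / (n + lam * len(classes)) for c in classes}
    cond = {
        (j, c): {
            a: (cond_counts[(j, c)][a] + lam) / (class_counts[c] + lam * len(values[j]))
            for a in values[j]
        }
        for j in range(n_features)
        for c in classes
    }
    return prior, cond


def predict(x, prior, cond):
    """Return argmax_c p(y=c) * prod_j p(x^(j) | y=c), computed in log space.

    Feature values never seen during training raise KeyError in this sketch.
    """
    scores = {
        c: math.log(prior[c]) + sum(math.log(cond[(j, c)][a]) for j, a in enumerate(x))
        for c in prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: (outlook, windy) -> activity.
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
     ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]
y = ["play", "play", "stay", "play", "play", "stay"]

prior, cond = fit_naive_bayes(X, y, lam=1.0)
print(prior)
print(predict(("rainy", "yes"), prior, cond))
print(predict(("sunny", "no"), prior, cond))
```

Working in log space avoids underflow when many small probabilities are multiplied, and it does not change the argmax.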

# Naive Bayes in Scikit-learn
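Classifiers in `sklearn.naive_bayes` include `GaussianNB` (continuous features), `MultinomialNB` (count or frequency features such as term weights), and `BernoulliNB` (binary features).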
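
As a rough illustration of how the three classifiers are used, the sketch below fits each one on a small made-up dataset suited to its feature type. The arrays `X_cont`, `X_count`, `X_bin`, and `y` are hypothetical and exist only for this example. The `alpha` parameter is scikit-learn's additive smoothing term and plays the role of $ \lambda $.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Hypothetical toy data with two well-separated classes.
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],
                   [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]])   # continuous features
X_count = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
                    [0, 3, 2], [0, 2, 3], [1, 4, 2]])      # non-negative counts
X_bin = (X_count > 0).astype(int)                           # presence/absence
y = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X_cont, y)                # per-class Gaussian for each feature
mnb = MultinomialNB(alpha=1.0).fit(X_count, y)   # multinomial likelihood over counts
bnb = BernoulliNB(alpha=1.0).fit(X_bin, y)       # Bernoulli likelihood per feature

print(gnb.predict([[1.1, 2.0]]))
print(mnb.predict([[3, 0, 0]]))
print(bnb.predict([[0, 1, 1]]))
```

For text features such as TF-IDF weights, `MultinomialNB` is the usual starting point, since it accepts any non-negative feature values.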

## Document Classification
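Use the TF-IDF (term frequency-inverse document frequency) weight of each term in a document as a feature:
$$ \text{TF-IDF}(t) = TF(t) \times IDF(t) \\
TF(t) = \frac {\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}\\
IDF(t) = \ln\frac {\text{Total number of documents}}{\text{Number of documents containing term } t + 1} $$
In scikit-learn, TF-IDF features are computed by `sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stop_words)`. Its default formula differs slightly from the one above: it uses raw term counts for TF, a smoothed $ IDF(t) = \ln\frac{1 + n}{1 + df(t)} + 1 $, and L2-normalizes each document row.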
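
To make the TF and IDF formulas above concrete, here is a small plain-Python sketch that applies them literally. The helper `tf_idf` and the mini-corpus are hypothetical and used only for illustration. It is not what `TfidfVectorizer` computes, so its numbers will not match scikit-learn's output.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF(t) * IDF(t) per document, with IDF(t) = ln(N / (df(t) + 1))."""
    tokenized = [doc.split() for doc in docs]  # naive whitespace tokenization
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / (df[term] + 1))
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = ["dog park dog", "cute dog", "stop posting garbage", "stupid garbage"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

With this IDF definition, a term that appears in every document gets a slightly negative weight, because $ \ln\frac{N}{N+1} < 0 $. Smoothed variants avoid this.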

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
vect = TfidfVectorizer()

In [3]:
documents=[
    'my dog has flea problems help please',
    'maybe not take him to dog park stupid',
    'my dalmation is so cute I love him',
    'stop posting stupid worthless garbage',
    'mr licks ate my steak how to stop him',
    'quit buying worthlsess dog food stupid',
]
targets=[0,1,0,1,0,1] # 0 normal, 1 insult

In [4]:
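# Learn the vocabulary and IDF weights, then return a sparse matrix
# with one row per document and one column per term.
tf_matrix = vect.fit_transform(documents)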

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vect.get_feature_names_out().tolist())

['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless', 'worthlsess']


In [6]:
print(vect.vocabulary_)

{'my': 17, 'dog': 4, 'has': 8, 'flea': 5, 'problems': 22, 'help': 9, 'please': 20, 'maybe': 15, 'not': 18, 'take': 28, 'him': 10, 'to': 29, 'park': 19, 'stupid': 27, 'dalmation': 3, 'is': 12, 'so': 24, 'cute': 2, 'love': 14, 'stop': 26, 'posting': 21, 'worthless': 30, 'garbage': 7, 'mr': 16, 'licks': 13, 'ate': 0, 'steak': 25, 'how': 11, 'quit': 23, 'buying': 1, 'worthlsess': 31, 'food': 6}


In [7]:
print(tf_matrix)

  (0, 17)	0.2836156972830696
  (0, 4)	0.2836156972830696
  (0, 8)	0.40966431929307107
  (0, 5)	0.40966431929307107
  (0, 22)	0.40966431929307107
  (0, 9)	0.40966431929307107
  (0, 20)	0.40966431929307107
  (1, 4)	0.28007245489665356
  (1, 15)	0.4045463374809687
  (1, 18)	0.4045463374809687
  (1, 28)	0.4045463374809687
  (1, 10)	0.28007245489665356
  (1, 29)	0.33173378384997615
  (1, 19)	0.4045463374809687
  (1, 27)	0.28007245489665356
  (2, 17)	0.2836156972830696
  (2, 10)	0.2836156972830696
  (2, 3)	0.40966431929307107
  (2, 12)	0.40966431929307107
  (2, 24)	0.40966431929307107
  (2, 2)	0.40966431929307107
  (2, 14)	0.40966431929307107
  (3, 27)	0.3397724008063106
  (3, 26)	0.4024458035648648
  (3, 21)	0.49077900350475
  (3, 30)	0.49077900350475
  (3, 7)	0.49077900350475
  (4, 17)	0.25617597302407796
  (4, 10)	0.25617597302407796
  (4, 29)	0.30342942826735725
  (4, 26)	0.30342942826735725
  (4, 16)	0.3700294328328553
  (4, 13)	0.3700294328328553
  (4, 0)	0.3700294328328553
  (4, 25)	0