# Naive Bayes
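$$ \begin{split} \mathop{\mathrm{argmax}}_{c_k} p(y=c_k|x) &= \mathop{\mathrm{argmax}}_{c_k} p(y=c_k)p(x|y=c_k) \\
& \left( \text{since } p(y=c_k|x) = \frac{p(y=c_k)p(x|y=c_k)}{p(x)} \text{ and } p(x) \text{ does not depend on } c_k \right) \\
&= \mathop{\mathrm{argmax}}_{c_k} p(y=c_k)\prod_j p(x^{(j)}|y=c_k) \end{split} $$
The last step uses the "naive" assumption that the features $ x^{(j)} $ are conditionally independent given the class.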
Use maximum likelihood estimation (MLE) to estimate $ p(y=c_k) $ and $ p(x^{(j)}|y=c_k) $ from the training data, where $ N $ is the number of samples and $ I(\cdot) $ is the indicator function.
$$ \begin{split} \hat{p}(y=c_k) &= \frac{\sum_i I(y_i=c_k)}{N} \\
\hat{p}(x^{(j)}=a_j|y=c_k) &= \frac{\sum_i I(x_i^{(j)}=a_j,y_i=c_k)}{\sum_i I(y_i=c_k)} \end{split}
$$
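Bayesian estimation adds a smoothing term $ \lambda > 0 $ to the MLE counts so that no estimated probability is zero: $ \lambda $ is added to each numerator, and the denominator grows by $ K\lambda $ for the prior ($ K $ classes) or by $ S_j\lambda $ for the conditional ($ S_j $ possible values of feature $ j $). With $ \lambda = 1 $ this is Laplace smoothing.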
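
The estimates and the argmax rule can be sketched in plain Python. This is a minimal illustration for categorical features, not a library implementation: the function names `fit_naive_bayes` and `predict` and the toy weather-style dataset are made up for this example, and it assumes $ \lambda > 0 $ and that every feature value seen at prediction time also appeared in training.

```python
import math
from collections import Counter


def fit_naive_bayes(X, y, lam=1.0):
    """Estimate smoothed priors and conditionals for categorical features."""
    n = len(y)
    classes = sorted(set(y))
    n_features = len(X[0])
    class_counts = Counter(y)
    # Distinct values observed for each feature j (S_j in the smoothed estimate).
    values = [sorted({row[j] for row in X}) for j in range(n_features)]
    # joint[(j, a, c)] = number of samples with x^(j) = a and y = c.
    joint = Counter(
        (j, row[j], c) for row, c in zip(X, y) for j in range(n_features)
    )
    prior = {
        c: (class_counts[c] + lam) / (n + len(classes) * lam) for c in classes
    }
    cond = {
        (j, a, c): (joint[(j, a, c)] + lam) / (class_counts[c] + len(values[j]) * lam)
        for j in range(n_features)
        for a in values[j]
        for c in classes
    }
    return prior, cond


def predict(x, prior, cond):
    """Return argmax_c of log p(y=c) + sum_j log p(x^(j) | y=c)."""
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[(j, a, c)]) for j, a in enumerate(x)
        )
    return max(prior, key=score)


# Hypothetical toy data: (outlook, temperature) -> play?
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
     ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "no", "yes", "yes", "yes", "no"]

prior, cond = fit_naive_bayes(X, y, lam=1.0)
label = predict(("sunny", "hot"), prior, cond)
print(prior, label)
```

Scores are summed in log space so that a product of many small probabilities does not underflow; the argmax is unchanged because the logarithm is monotonic.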

# Naive Bayes in Scikit-learn
Classifiers: GaussianNB, MultinomialNB, BernoulliNB
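
As a rough guide, each classifier matches a feature type: GaussianNB for continuous values, MultinomialNB for counts, BernoulliNB for binary indicators. The sketch below fits all three on small made-up arrays (the data and variable names are assumptions for illustration only) using default parameters.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements: GaussianNB models each feature as a per-class Gaussian.
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.1, 3.9]])
# Non-negative counts (e.g. word counts): MultinomialNB.
X_count = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Presence/absence indicators: BernoulliNB.
X_bin = (X_count > 0).astype(int)

fitted = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_count),
    "BernoulliNB": (BernoulliNB(), X_bin),
}

predictions = {}
for name, (model, X) in fitted.items():
    # fit() returns the estimator itself, so the calls can be chained.
    predictions[name] = model.fit(X, y).predict(X)
    print(name, predictions[name])
```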

## Document Classification
Use the TF-IDF (term frequency-inverse document frequency) weight of each term in a document as a feature:
$$ \begin{split} \text{TF-IDF}(t) &= TF(t) \times IDF(t) \\
TF(t) &= \frac {\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}\\
IDF(t) &= \log_e\frac {\text{Total number of documents}}{\text{Number of documents containing term } t + 1} \end{split} $$
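
A worked sketch of these weights in plain Python may help. It computes the definition above and also the smoothed variant that scikit-learn's TfidfVectorizer uses by default (raw term count times $ \ln\frac{1+N}{1+df} + 1 $, then L2 normalization of each row), which is why library output differs from the definition. The helper names `textbook_tfidf` and `smoothed_tfidf_row` are made up for this example; the corpus is the one used in the cells that follow.

```python
import math
import re

documents = [
    'my dog has flea problems help please',
    'maybe not take him to dog park stupid',
    'my dalmation is so cute I love him',
    'stop posting stupid worthless garbage',
    'mr licks ate my steak how to stop him',
    'quit buying worthlsess dog food stupid',
]

# Mimic TfidfVectorizer's default tokenization: lowercase, tokens of 2+ word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)
vocab = sorted({term for doc in tokenized for term in doc})
# Document frequency: number of documents containing each term.
df = {term: sum(term in doc for doc in tokenized) for term in vocab}


def textbook_tfidf(term, doc):
    """TF-IDF as defined above: (count / doc length) * ln(N / (df + 1))."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / (df[term] + 1))
    return tf * idf


def smoothed_tfidf_row(doc):
    """Smoothed variant: count * (ln((1 + N) / (1 + df)) + 1), L2-normalized."""
    weights = {
        term: doc.count(term) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term in set(doc)
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = smoothed_tfidf_row(tokenized[0])
print(textbook_tfidf('flea', tokenized[0]))
print(row0['dog'], row0['flea'])
```

For the first document the smoothed variant should give roughly 0.2836 for 'dog' and 0.4097 for 'flea': 'dog' appears in three documents, so it receives a lower weight than 'flea', which appears in only one.
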
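The bag-of-words model represents each document by the terms it contains and how often they occur, ignoring word order.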

### TfidfVectorizer
sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stop_words)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
vect = TfidfVectorizer()

In [3]:
documents=[
    'my dog has flea problems help please',
    'maybe not take him to dog park stupid',
    'my dalmation is so cute I love him',
    'stop posting stupid worthless garbage',
    'mr licks ate my steak how to stop him',
    'quit buying worthlsess dog food stupid',
]
targets=[0,1,0,1,0,1] # 0 normal, 1 insult

In [4]:
tf_matrix = vect.fit_transform(documents)

In [5]:
print(vect.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless', 'worthlsess']


In [6]:
print(vect.vocabulary_)

{'my': 17, 'dog': 4, 'has': 8, 'flea': 5, 'problems': 22, 'help': 9, 'please': 20, 'maybe': 15, 'not': 18, 'take': 28, 'him': 10, 'to': 29, 'park': 19, 'stupid': 27, 'dalmation': 3, 'is': 12, 'so': 24, 'cute': 2, 'love': 14, 'stop': 26, 'posting': 21, 'worthless': 30, 'garbage': 7, 'mr': 16, 'licks': 13, 'ate': 0, 'steak': 25, 'how': 11, 'quit': 23, 'buying': 1, 'worthlsess': 31, 'food': 6}


In [7]:
tf_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.2836157 ,
        0.40966432, 0.        , 0.        , 0.40966432, 0.40966432,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.2836157 , 0.        , 0.        ,
        0.40966432, 0.        , 0.40966432, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.28007245,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.28007245, 0.        , 0.        , 0.        , 0.        ,
        0.40454634, 0.        , 0.        , 0.40454634, 0.40454634,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.28007245, 0.40454634, 0.33173378,
        0.        , 0.        ],
       [0.        , 0.        , 0.40966432, 0.40966432, 0.        ,
        0.        , 0.        , 0.        , ...

### CountVectorizer

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
c_vect = CountVectorizer()
c_matrix = c_vect.fit_transform(documents)

In [10]:
print(c_vect.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2

['ate', 'buying', 'cute', 'dalmation', 'dog', 'flea', 'food', 'garbage', 'has', 'help', 'him', 'how', 'is', 'licks', 'love', 'maybe', 'mr', 'my', 'not', 'park', 'please', 'posting', 'problems', 'quit', 'so', 'steak', 'stop', 'stupid', 'take', 'to', 'worthless', 'worthlsess']


In [11]:
c_matrix.toarray()

array([[0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 0, 1]], dtype=int64)
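
A minimal sketch of how these counts could feed a classifier, assuming MultinomialNB with Laplace smoothing (`alpha=1.0`) and the `targets` labels defined earlier. The two test sentences in `new_docs` are made up for illustration, and six training documents are far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

documents = [
    'my dog has flea problems help please',
    'maybe not take him to dog park stupid',
    'my dalmation is so cute I love him',
    'stop posting stupid worthless garbage',
    'mr licks ate my steak how to stop him',
    'quit buying worthlsess dog food stupid',
]
targets = [0, 1, 0, 1, 0, 1]  # 0 normal, 1 insult

# Learn the vocabulary and build the document-term count matrix.
c_vect = CountVectorizer()
c_matrix = c_vect.fit_transform(documents)

# alpha is the smoothing term (lambda in the estimates at the top).
clf = MultinomialNB(alpha=1.0)
clf.fit(c_matrix, targets)

# New text must go through transform(), not fit_transform(),
# so that it is mapped onto the vocabulary learned from the training set.
new_docs = ['stupid garbage', 'my cute dalmation']
pred = clf.predict(c_vect.transform(new_docs))
print(pred)
```

Words that were not in the training vocabulary are silently ignored by `transform()`, so a sentence made up entirely of unseen words is classified from the class priors alone.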