In [1]:
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd

%matplotlib inline

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes Classifier
It is one of the simplest machine learning models for text classification. It uses the probability distribution of tokens/words (counts) to classify documents. It is based on the famous **Bayes' theorem**, which goes like this:

`Prob(B | A) = Prob(A | B) * Prob(B) / Prob(A)`

Fair enough, right?
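
To make the formula concrete, here is a minimal sketch with made-up numbers. The probabilities below (how often a word shows up in spam and non-spam mail, and how common spam is) are assumptions chosen purely for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: Prob(B | A) = Prob(A | B) * Prob(B) / Prob(A)."""
    return likelihood * prior / evidence

# Hypothetical numbers, not measured from any data set.
p_word_given_spam = 0.6      # Prob(word | spam)
p_word_given_non_spam = 0.05 # Prob(word | non-spam)
p_spam = 0.2                 # Prob(spam)
p_non_spam = 1 - p_spam      # Prob(non-spam)

# Prob(word) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_non_spam * p_non_spam

# Prob(spam | word)
p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to 0.75.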

## Text Classification
Say we have a document **D** that belongs to class **C**. Using Bayes' theorem we can infer:  

`Prob(C | D) = Prob(D | C) * Prob(C) / Prob(D)`

So far so good.  

We know a document is made up of tokens (single words or combinations of words, commonly referred to as **n-grams**):  
`D = [d1, d2, d3, ...]`

```text
Prob(C | D)
= Prob(d1 | C) * Prob(d2 | C) * Prob(d3 | C) * ... * Prob(C) / Prob(D)
```

Remember, we have split Prob(D | C) into the individual probabilities of the tokens (n-grams) that make up
document D. This is why the Naive Bayes classifier is **naive**: it assumes the tokens are independent of each other given the class.  
  
Think of it as two independent events **A** and **B**. What is the probability of both events occurring together?  
`Prob(A, B) = Prob(A) * Prob(B)`
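
As a small sketch of that factorization: the per-token probabilities below are invented numbers, not estimates from any data. The only point is that, under the independence assumption, the likelihood of the whole document becomes a plain product.

```python
import math

# Hypothetical per-token probabilities Prob(di | C) for one class C.
token_probs_given_class = {"you": 0.2, "have": 0.1, "won": 0.05}

document = ["you", "have", "won"]

# Naive assumption: Prob(D | C) is the product of the per-token probabilities.
likelihood = math.prod(token_probs_given_class[token] for token in document)
print(likelihood)
```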

Now we can estimate the probabilities Prob(di | C) as:  
`(count of di in documents of class C) / (total number of tokens in documents of class C)`
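
A minimal sketch of this counting estimate, using a tiny made-up set of spam documents and plain whitespace tokenization (both are assumptions for illustration; `token_prob` is a hypothetical helper):

```python
from collections import Counter

# Hypothetical documents that all belong to one class C (spam).
spam_docs = ["you have won a lottery", "you have a bonus"]

# Flatten all documents of the class into one list of tokens.
tokens = [token for doc in spam_docs for token in doc.split()]
counts = Counter(tokens)
total_tokens = len(tokens)

def token_prob(token):
    """Estimate Prob(token | C) as count(token in C) / total tokens in C."""
    return counts[token] / total_tokens

print(token_prob("you"))   # "you" occurs 2 times out of 9 tokens
print(token_prob("bomb"))  # never seen in this class
```

Note that a token never seen in the class gets probability zero, which would wipe out the whole product. Additive smoothing (the `alpha` parameter of scikit-learn's `MultinomialNB`) exists to avoid exactly that.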

### Putting Things Into Perspective
And that is how we can find the probability of document **D** belonging to class **C**, assuming independence 
of the individual features (n-grams). 

Now, say we have classes:  
C1, C2, C3, ...  


And we want to classify a test document **D**. All we have to do is find the probability of this document **D**
belonging to each of the classes, and then choose the class for which **Prob(C | D)** is the highest.

#### Training Steps (somewhat)
It's nothing but counting the "stuff" that matters.
- tokenize the documents for each class
- each token can be unigram, bigram, ...
- extract features for each token -> counts or tf-idf
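
The steps above can be sketched in plain Python. This is only an illustration under simple assumptions (lowercasing and whitespace tokenization, unigrams plus bigrams, raw counts as features); the `ngrams` helper and the mini training set are hypothetical.

```python
from collections import Counter

def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max from a whitespace-tokenized text."""
    words = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

# Hypothetical mini training set: class label -> documents.
training_docs = {
    "spam": ["you have won", "you have a bonus"],
    "non_spam": ["i have a meeting"],
}

# "Training" is just counting n-grams separately for each class.
class_counts = {
    label: Counter(gram for doc in docs for gram in ngrams(doc))
    for label, docs in training_docs.items()
}

print(ngrams("you have won"))
print(class_counts["spam"]["you have"])
```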

#### Let's classify
- extract features (count) for the document to be classified
- calculate **Prob(C1 | D)**
- calculate **Prob(C2 | D)**
- calculate **Prob(C3 | D)**
- choose the Class **Ci** that has max probability

**Side note**:  
Since **Prob(D)** is constant, we can ignore the denominator part and just focus on the numerator's products.  

So, all we are doing is:  

Choose class **Ci** according to argmax{ Prob(Ci | D) }
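
Here is a rough from-scratch sketch of that argmax rule. Several choices in it are assumptions rather than anything prescribed above: the tiny training set is made up, tokens are unigrams split on whitespace, add-one smoothing (`alpha=1.0`) keeps unseen tokens from zeroing the product, and scores are summed in log space, a common trick to avoid numeric underflow when multiplying many small probabilities. `log_score` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical training data: class label -> documents.
training_docs = {
    "spam": ["you have won a lottery", "please click the link"],
    "non_spam": ["i have a meeting tomorrow", "call me later"],
}

token_counts = {
    label: Counter(token for doc in docs for token in doc.split())
    for label, docs in training_docs.items()
}
vocabulary = {token for counts in token_counts.values() for token in counts}
total_docs = sum(len(docs) for docs in training_docs.values())

def log_score(document, label, alpha=1.0):
    """log Prob(C) + sum of log Prob(di | C); Prob(D) is dropped as a constant."""
    counts = token_counts[label]
    total = sum(counts.values())
    score = math.log(len(training_docs[label]) / total_docs)  # log prior
    for token in document.split():
        if token not in vocabulary:
            continue  # ignore tokens never seen during training
        smoothed = (counts[token] + alpha) / (total + alpha * len(vocabulary))
        score += math.log(smoothed)
    return score

def classify(document):
    """Pick the class with the highest score (the argmax over classes)."""
    return max(training_docs, key=lambda label: log_score(document, label))

print(classify("you have won"))
print(classify("call me later"))
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest posterior probability.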

In [90]:
# noob documents for training :P
spam = [
    "you have won a lottery",
    "congratulations! you have a bonus",
    "this is bomb",
    "to use the credit, please click the link",
    "thank you for subscription. please click the link",
    "bomb"
]
Y_spam = [1] * len(spam)

non_spam = [
    "i am awesome",
    "i have a meeting tomorrow",
    "you are smart",
    "get me out of here",
    "call me later"
]
Y_non_spam = [0] * len(non_spam)

In [91]:
# feature extraction
count_vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(spam + non_spam)
X_train_vectorized = count_vectorizer.transform(spam + non_spam)

In [92]:
# Naive Bayes Model
model = MultinomialNB(alpha=0.1)
model.fit(X_train_vectorized, Y_spam + Y_non_spam)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [93]:
documents = [
    "call you",
    "you have won"
]
predictions = model.predict(count_vectorizer.transform(documents))
print(predictions)

[0 1]
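
`predict` only returns the winning class. To look at the per-class probabilities behind that choice, scikit-learn's `MultinomialNB` also offers `predict_proba`. The sketch below rebuilds the same toy model so it can run on its own; the variable names `texts`, `labels`, `vectorizer`, and `probabilities` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

spam = [
    "you have won a lottery",
    "congratulations! you have a bonus",
    "this is bomb",
    "to use the credit, please click the link",
    "thank you for subscription. please click the link",
    "bomb",
]
non_spam = [
    "i am awesome",
    "i have a meeting tomorrow",
    "you are smart",
    "get me out of here",
    "call me later",
]
texts = spam + non_spam
labels = [1] * len(spam) + [0] * len(non_spam)

# Same setup as above: unigram + bigram counts, smoothing alpha=0.1.
vectorizer = CountVectorizer(ngram_range=(1, 2)).fit(texts)
model = MultinomialNB(alpha=0.1).fit(vectorizer.transform(texts), labels)

documents = ["call you", "you have won"]
features = vectorizer.transform(documents)

# One row per document, one column per class in model.classes_.
probabilities = model.predict_proba(features)
for doc, row in zip(documents, probabilities):
    print(doc, dict(zip(model.classes_, row.round(3))))
```

Each row sums to 1, and the column with the largest value corresponds to the class that `predict` returns.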


In [89]:
# convert to pandas dataframe for seamless training
spam_df = pd.DataFrame(spam, columns=['text'])
spam_df['target'] = 1
non_spam_df = pd.DataFrame(non_spam, columns=['text'])
non_spam_df['target'] = 0

# final data
data = pd.concat([spam_df, non_spam_df], ignore_index=True)
data

Unnamed: 0,text,target
0,you have won a lottery,1
1,congratulations! you have a bonus,1
2,this is bomb,1
3,"to use the credit, please click the link",1
4,thank you for subscription. please click the link,1
5,bomb,1
6,i am awesome,0
7,i have a meeting tomorrow,0
8,you are smart,0
9,get me out of here,0
