# Naive Bayes

### Recap - Conditional Probabilities

P(A|B) = P(A and B)/P(B)  

Since P(A and B) = P(B|A) * P(A), this gives Bayes' rule:  

P(A|B) = P(B|A) * P(A) / P(B)  
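
As a quick sanity check, the two formulas can be compared on made-up counts. The numbers below are purely illustrative assumptions (they do not come from the IMDB data): A stands for "the review is positive" and B for "the review contains the word 'great'".

```python
# Toy counts over 100 reviews (made-up numbers, for illustration only).
# A = "review is positive", B = "review contains the word 'great'"
n_total = 100
n_a = 50          # positive reviews
n_b = 30          # reviews containing 'great'
n_a_and_b = 24    # positive reviews containing 'great'

p_a = n_a / n_total
p_b = n_b / n_total
p_a_and_b = n_a_and_b / n_total

# Definition of conditional probability
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' rule, computed from P(B|A), P(A) and P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, p_a_given_b_bayes)
```

Both routes give the same probability (about 0.8 here), which is what lets Naive Bayes estimate P(class | words) from the easier-to-count P(words | class).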

### Summary

This technique is very simple to use and runs pretty fast.  
It offers good baselines on easy tasks such as sentiment analysis: on the IMDB reviews below it reaches about 85% to 88% accuracy.  
Most of the effort goes into the feature representation of the data, such as BOW, TF-IDF, etc.  
  
### Additional Resources
- Week 2 of Natural Language Processing with Classification and Vector Spaces


In [14]:
import sys
sys.path.append('../src')
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
df = pd.read_parquet('../data/processed/imdb_dataset.parquet')
train_df = df[df['role']=='train']
test_df = df[df['role']=='test']
del df

### Simple BOW with vocab = 20k

In [3]:
train_df.shape

(39964, 3)

In [4]:
cv = CountVectorizer(max_features=20000)
train_x = cv.fit_transform(train_df['review'])
test_x = cv.transform(test_df['review'])

In [5]:
# MultinomialNB also accepts the sparse matrix directly; this dense copy is large (39964 x 20000)
train_x = pd.DataFrame(train_x.toarray())

In [6]:
model = MultinomialNB().fit(train_x, train_df['sentiment'])
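
For intuition about what a multinomial Naive Bayes fit involves, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the helper names `train_nb` and `predict_nb`, the whitespace tokenizer, and the four toy reviews are all assumptions made for illustration. It uses additive (Laplace) smoothing with `alpha=1.0`, which is also the default `alpha` of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive smoothing: every vocabulary word gets alpha extra counts.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the IMDB dataset.
toy_docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
toy_labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("great acting", log_prior, log_likelihood))
print(predict_nb("awful boring movie", log_prior, log_likelihood))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long reviews.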

In [7]:
y_pred = model.predict(test_x)

In [8]:
accuracy_score(test_df['sentiment'], y_pred)

0.8541251494619371
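
Accuracy alone does not show which class the errors fall in. The `classification_report` and `confusion_matrix` functions imported at the top can give that breakdown. The sketch below runs them on a tiny made-up dataset so that it is self-contained; the `toy_*` names and reviews are illustrative assumptions, not part of the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only.
toy_train = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
toy_labels = ["positive", "positive", "negative", "negative"]
toy_test = ["great acting", "awful boring movie"]
toy_true = ["positive", "negative"]

toy_cv = CountVectorizer()
toy_model = MultinomialNB().fit(toy_cv.fit_transform(toy_train), toy_labels)
toy_pred = toy_model.predict(toy_cv.transform(toy_test))

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(toy_true, toy_pred, labels=["negative", "positive"])
print(cm)
print(classification_report(toy_true, toy_pred))
```

On the notebook's own data the equivalent calls would be `confusion_matrix(test_df['sentiment'], y_pred)` and `classification_report(test_df['sentiment'], y_pred)`.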

### Simple BOW with n-grams up to 4 and vocab = 20k

In [10]:
cv = CountVectorizer(max_features=20000, ngram_range=(1,4))
train_x = cv.fit_transform(train_df['review'])
test_x = cv.transform(test_df['review'])
train_x = pd.DataFrame(train_x.toarray())

In [12]:
model = MultinomialNB().fit(train_x, train_df['sentiment'])
y_pred = model.predict(test_x)
accuracy_score(test_df['sentiment'], y_pred)

0.8675767237943404

### Simple TF-IDF with n-grams up to 4 and vocab = 20k

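TF-IDF down-weights terms that appear in many documents, and therefore in both classes, since they carry little information about the sentiment. The weighting is computed from document frequencies only; the class labels are not used. For example, this might penalize frequently mentioned movie titles that are not part of the stopword list.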

In addition, we often use the log of TF-IDF because ... 

In [15]:
cv = TfidfVectorizer(max_features=20000, ngram_range=(1,4))
train_x = cv.fit_transform(train_df['review'])
test_x = cv.transform(test_df['review'])
train_x = pd.DataFrame(train_x.toarray())

In [16]:
model = MultinomialNB().fit(train_x, train_df['sentiment'])
y_pred = model.predict(test_x)
accuracy_score(test_df['sentiment'], y_pred)

0.8764447987245915