# Naive Bayes
<div class="alert alert-block alert-info">
<b>Content:</b> In this notebook, 
    we demonstrate different versions of the Naive Bayes classifier on a dataset of newsgroup entries.
</div>

<div class="alert alert-block alert-warning">
<b>Time:</b> Executing this notebook takes 4-5 minutes. Run all cells at once at the beginning.
</div>

In [None]:
import numpy as np
import pandas as pd
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, CategoricalNB
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

# Loading and Initial Analysis of Dataset

In [None]:
# Load the dataset
data = fetch_20newsgroups()
# Get the text categories
text_categories = data.target_names
# define the training set
train_data = fetch_20newsgroups(subset="train", categories=text_categories)
# define the test set
test_data = fetch_20newsgroups(subset="test", categories=text_categories)

In [None]:
print("We have {} unique classes".format(len(text_categories)))
print("We have {} training samples".format(len(train_data.data)))
print("We have {} test samples".format(len(test_data.data)))

In [None]:
# Let's have a look at one sample from the test set
print(test_data.data[6])
type(test_data.data[6])

In [None]:
print(test_data.target_names[test_data.target[6]])

In [None]:
test_data.target_names

# Preprocessing

In [None]:
vec=CountVectorizer(stop_words='english', lowercase=True)
X_train=vec.fit_transform(train_data.data)
X_test=vec.transform(test_data.data)
len(vec.vocabulary_)

## Try Categorical Naive Bayes

In [None]:
start = time.time()
categorical_nb=CategoricalNB()
categorical_nb.fit(X_train.toarray(), train_data.target)
print(time.time()-start)

In [None]:
#predicted_categories = categorical_nb.predict(vec.transform(test_data.data).toarray())

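Categorical Naive Bayes is a poor fit for this data for two reasons. First, training is slow because of the large number of features, which also have to be converted to a dense array. Second, we cannot even use it for prediction (commented out above) when the data are counts rather than categories: the model treats every distinct count as its own category, so it has no probability estimate for a count that never occurred for that feature in the training set.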
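The difference between the two model assumptions can be sketched from scratch on a toy count matrix. This is a minimal illustration, not scikit-learn's implementation: the data, the helper names (`fit_categorical`, `fit_multinomial`, `multinomial_score`) and the smoothing value `alpha=1.0` are assumptions made for this example, and class priors are left out because both toy classes are equally frequent.

```python
import math
from collections import Counter, defaultdict

# Toy word-count matrix (hypothetical data): rows are documents, columns are words.
X_toy = [[2, 0, 1],
         [1, 0, 0],
         [0, 3, 1],
         [0, 1, 2]]
y_toy = [0, 0, 1, 1]


def fit_categorical(X, y):
    """Categorical view: count how often each exact value occurs per (class, feature)."""
    tables = defaultdict(Counter)
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            tables[(label, j)][value] += 1
    return tables


def fit_multinomial(X, y, alpha=1.0):
    """Multinomial view: one smoothed log-probability per (class, word)."""
    n_features = len(X[0])
    log_prob = {}
    for label in set(y):
        rows = [row for row, l in zip(X, y) if l == label]
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_prob[label] = [math.log((t + alpha) / denom) for t in totals]
    return log_prob


def multinomial_score(log_prob, row, label):
    """Log-likelihood (up to a constant) of a count vector under one class."""
    return sum(count * lp for count, lp in zip(row, log_prob[label]))


cat_tables = fit_categorical(X_toy, y_toy)
mnb_log_prob = fit_multinomial(X_toy, y_toy)

new_doc = [5, 0, 1]  # the count 5 never occurred for word 0 during training

# Categorical view: the value 5 is not a known category for feature 0.
seen_values = set(cat_tables[(0, 0)]) | set(cat_tables[(1, 0)])
print(5 in seen_values)  # False

# Multinomial view: any non-negative count can be scored.
scores = {label: multinomial_score(mnb_log_prob, new_doc, label) for label in (0, 1)}
print(max(scores, key=scores.get))  # 0
```

The categorical table only knows the values 0, 1 and 2 for the first word, so a document containing that word five times falls outside it. The multinomial model simply multiplies the count by the word's log-probability, which is why it scales to word counts.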

## Try Multinomial Naive Bayes

In [None]:
start = time.time()
multinomial_nb= MultinomialNB()
# Train the model using the training data
multinomial_nb.fit(X_train, train_data.target)
print(time.time()-start)

In [None]:
# Reuse the test matrix that was already vectorized in the preprocessing step
predicted_categories = multinomial_nb.predict(X_test)

In [None]:
accuracy_score(test_data.target, predicted_categories)

In [None]:
prod_data=[
    "I am the doctor", 
    "May Abraham find his Eva", 
    "Have you tried turning it off and on again?"
]
prod_pred=multinomial_nb.predict(vec.transform(prod_data))
np.array(text_categories)[prod_pred]

<div class="alert alert-block alert-info">
<b>Takeaways:</b> 

* How to run Naive Bayes in different versions
* How different assumptions about the data distribution influence runtime and applicability
</div>

<div class="alert alert-block alert-success">
<b>Play with:</b> 
    
* Try different new "fictitious" posts
* Repeat the experiments without excluding stop words and observe the results
</div>
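
One possible way to set up the stop-word experiment is to wrap vectorization, training and scoring in a single function and call it once per setting. The helper name `run_experiment` and the toy sentences below are assumptions made for this sketch; in the notebook you would pass `train_data.data`, `train_data.target`, `test_data.data` and `test_data.target` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def run_experiment(train_texts, train_labels, test_texts, test_labels, stop_words="english"):
    """Vectorize, train Multinomial Naive Bayes, and return (vocabulary size, accuracy)."""
    vec = CountVectorizer(stop_words=stop_words, lowercase=True)
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    model = MultinomialNB().fit(X_train, train_labels)
    accuracy = accuracy_score(test_labels, model.predict(X_test))
    return len(vec.vocabulary_), accuracy


# Hypothetical toy data so that the sketch runs without downloading the dataset.
toy_train = [
    "the engine of the car is loud",
    "the goalie stopped the puck",
    "a new car with a fast engine",
    "the team scored in the game",
]
toy_train_labels = [0, 1, 0, 1]
toy_test = ["this car has an engine", "the puck and the game"]
toy_test_labels = [0, 1]

# stop_words="english" removes stop words, stop_words=None keeps them.
for setting in ("english", None):
    vocab_size, accuracy = run_experiment(
        toy_train, toy_train_labels, toy_test, toy_test_labels, stop_words=setting
    )
    print(setting, vocab_size, accuracy)
```

Comparing the vocabulary size alongside the accuracy shows how many features the stop-word filter removes in addition to its effect on the predictions.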