# Naive Bayes
<div class="alert alert-block alert-info">
<b>Content:</b> In this notebook, 
    we demonstrate different versions of the Naive Bayes classifier on a dataset of newsgroup entries.
</div>

<div class="alert alert-block alert-warning">
<b>Time:</b> It takes 4-5 minutes to execute this notebook. Run all cells at once at the beginning.
</div>

In [1]:
import numpy as np, pandas as pd
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, CategoricalNB
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

# Loading and Initial Analysis of Dataset

In [2]:
# Load the dataset
data = fetch_20newsgroups()
# Get the text categories
text_categories = data.target_names
# define the training set
train_data = fetch_20newsgroups(subset="train", categories=text_categories)
# define the test set
test_data = fetch_20newsgroups(subset="test", categories=text_categories)

In [3]:
print("We have {} unique classes".format(len(text_categories)))
print("We have {} training samples".format(len(train_data.data)))
print("We have {} test samples".format(len(test_data.data)))

We have 20 unique classes
We have 11314 training samples
We have 7532 test samples


In [4]:
# Let's have a look at one sample (note: this one is taken from the test set)
print(test_data.data[6])
type(test_data.data[6])

From: PETCH@gvg47.gvg.tek.com (Chuck)
Subject: Daily Verse
Lines: 3

Dishonest money dwindles away, but he who gathers money little by little makes
it grow. 
Proverbs 13:11



str

In [5]:
print(test_data.target_names[test_data.target[6]])

soc.religion.christian


In [6]:
test_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

# Preprocessing

In [7]:
vec=CountVectorizer(stop_words='english', lowercase=True)
X_train=vec.fit_transform(train_data.data)
X_test=vec.transform(test_data.data)
len(vec.vocabulary_)

129796
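To make the preprocessing step more tangible, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`. The toy corpus, the stop-word set, and the helper names `tokenize` and `to_counts` are assumptions made for illustration; this is not scikit-learn's implementation, only similar in spirit.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = ["The cat sat on the mat", "The dog chased the cat"]
stop_words = {"the", "on"}

def tokenize(text):
    # Lowercase, keep runs of two or more word characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

# Vocabulary: sorted unique tokens mapped to column indices.
all_tokens = sorted({t for d in docs for t in tokenize(d)})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}

def to_counts(text):
    # One count per vocabulary entry; tokens outside the vocabulary are dropped.
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

rows = [to_counts(d) for d in docs]
print(vocabulary)
print(rows)
```

Each document becomes one row of counts with one column per vocabulary word, which is why the real feature matrix above has 129796 columns.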

## Try Categorical Naive Bayes

In [8]:
start = time.time()
categorical_nb=CategoricalNB()
categorical_nb.fit(X_train.toarray(), train_data.target)
print(time.time()-start)

391.21691489219666
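Besides the fitting itself, the call to `toarray()` materialises a dense matrix. A rough back-of-the-envelope estimate of its size, using the sample and vocabulary counts printed above and assuming 8-byte integer entries (the entry size is an assumption here):

```python
n_samples = 11314      # training samples reported above
n_features = 129796    # vocabulary size reported above
bytes_per_entry = 8    # assumption: 64-bit integer counts

n_entries = n_samples * n_features
dense_bytes = n_entries * bytes_per_entry

print(f"{n_entries:,} entries")
print(f"{dense_bytes / 1e9:.1f} GB")
```

Under these assumptions the dense matrix holds about 1.47 billion entries, on the order of 11.7 GB, whereas the sparse matrix stores only the non-zero counts.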


In [16]:
#predicted_categories = categorical_nb.predict(vec.transform(test_data.data).toarray())

Not only does the training time of Categorical Naive Bayes suffer from the large number of features; we also cannot really use it when the data are counts rather than categories.
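The mismatch between counts and categories can be illustrated on a single toy feature. A categorical model keeps one probability per distinct value seen in training and treats the counts 0, 1, 2 as unrelated labels, so a count that never occurred in training has no table entry. The sketch below is a simplified illustration with made-up data and a hypothetical helper `categorical_prob`, not scikit-learn's code.

```python
from collections import Counter

# Hypothetical training values of one feature (word counts) within one class.
train_counts = [0, 0, 1, 2, 0, 1]

# Categorical view: every distinct count is its own unordered category.
n_categories = max(train_counts) + 1
alpha = 1.0  # Laplace smoothing, an assumption for this sketch
freq = Counter(train_counts)
cat_prob = [
    (freq[v] + alpha) / (len(train_counts) + alpha * n_categories)
    for v in range(n_categories)
]

def categorical_prob(value):
    # Plain table lookup: fails for values above the training maximum.
    return cat_prob[value]

print(cat_prob)
try:
    categorical_prob(5)  # a count never seen during training
except IndexError:
    print("no probability stored for count 5")
```

A multinomial model avoids this by estimating one probability per word and class, with the count entering only as an exponent. To my knowledge, `CategoricalNB` offers a `min_categories` parameter to reserve room for unseen values, but that does not change the fact that counts are treated as unordered labels.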

## Try Multinomial Naive Bayes

In [10]:
start = time.time()
multinomial_nb= MultinomialNB()
# Train the model using the training data
multinomial_nb.fit(X_train, train_data.target)
print(time.time()-start)

0.34355974197387695


In [11]:
# Reuse the test matrix computed during preprocessing
predicted_categories = multinomial_nb.predict(X_test)

In [12]:
accuracy_score(test_data.target, predicted_categories)

0.8023101433882103
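For intuition about what the multinomial model computes, here is a minimal pure-Python sketch with Laplace smoothing on a made-up toy corpus. The data, the labels, and the helper names `log_joint` and `predict` are assumptions for illustration; the sketch follows the textbook formulation and is not meant to reproduce `MultinomialNB` exactly.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label).
train = [
    (["engine", "car", "wheel"], "autos"),
    (["car", "road", "engine"], "autos"),
    (["orbit", "rocket", "moon"], "space"),
    (["rocket", "launch", "orbit"], "space"),
]
alpha = 1.0  # Laplace smoothing

vocab = sorted({t for tokens, _ in train for t in tokens})
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)

def log_joint(tokens, label):
    # log P(class) + sum over tokens of log P(token | class)
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for t in tokens:
        if t in vocab:  # words outside the vocabulary are ignored
            p = (word_counts[label][t] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalised) log joint probability.
    return max(class_docs, key=lambda label: log_joint(tokens, label))

print(predict(["rocket", "engine", "orbit"]))
print(predict(["car", "road"]))
```

Training only requires counting words per class, which is consistent with the sub-second fit time observed above.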

In [20]:
prod_data=[
    "I am the doctor", 
    "May Abraham find his Eva", 
    "Have you tried turning it off an on again?",
    'holy macarony'
]
prod_pred=multinomial_nb.predict(vec.transform(prod_data))
np.array(text_categories)[prod_pred]

array(['sci.med', 'talk.religion.misc', 'comp.sys.mac.hardware',
       'soc.religion.christian'], dtype='<U24')

<div class="alert alert-block alert-info">
<b>Takeaways:</b> 

* How to run Naive Bayes in different versions
* The influence and consequences of different assumptions about the data distribution
</div>

<div class="alert alert-block alert-success">
<b>Play with:</b> 
    
* Try different new "fictitious" posts
* Try the same experiments without excluding stop words and observe the results
</div>