# <font color='#eb3483'> Naive Bayes Classifier </font>

The Naive Bayes Classifier (NBC) is a popular classification algorithm for natural language processing and text analysis. In this module we'll walk through how to train an NB model using scikit-learn and how to evaluate it.

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) 

## <font color='#eb3483'> Data Loading </font>

We are going to use a new dataset here: a few thousand post titles that I downloaded from Reddit (6,712, as `posts.shape` shows below). We will see how to predict a post's subreddit from its title.

In [4]:
posts = pd.read_csv("data/reddit.csv")

posts.head()

| | subreddit | title |
|---|---|---|
| 0 | news | Citibank fined $100 million for interest rate ... |
| 1 | news | Experts say locking up firearms reduces chance... |
| 2 | news | California sees $9 billion surplus, passes bud... |
| 3 | news | Paul Manafort ordered to jail after witness-ta... |
| 4 | news | Missouri woman recorded using racial slur on s... |


In [5]:
posts.shape

(6712, 2)

To keep the run time short, we will work with a random sample of 2000 posts.

In [6]:
posts = posts.sample(2000, random_state=42)

The target variable is `subreddit` and the independent variable is the post `title`.

In [7]:
posts.subreddit.value_counts()

news               238
startups           233
geek               226
Python             224
machinelearning    219
AskReddit          218
science            217
datascience        214
gadgets            211
Name: subreddit, dtype: int64

The sample contains posts from nine subreddits, and the classes are roughly balanced (between 211 and 238 posts each).

## <font color='#eb3483'> Text Processing </font>

Naive Bayes classifiers expect a numeric vector as input, so we have to vectorize the text. To do so we will use TF-IDF vectorizing (TF-IDF = Term Frequency - Inverse Document Frequency). Think of this as a little teaser for text processing; we'll cover it in more depth later. For now, just understand that it maps free text to a matrix with one row per document and one column per word, where each entry is a weight based on how frequent the word is in that document and how rare it is across all documents.
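To make the idea concrete, here is a minimal hand-rolled sketch of TF-IDF on three made-up titles. The function `tfidf` and the variable `toy_titles` are illustrative names, not part of the notebook. The IDF formula used is the smoothed variant `ln((1 + n) / (1 + df)) + 1`, which is my understanding of scikit-learn's default; `TfidfVectorizer` additionally normalizes each row, which this sketch skips.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document (unnormalized TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({
            # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
            word: count * (math.log((1 + n_docs) / (1 + df[word])) + 1)
            for word, count in tf.items()
        })
    return weights

# Made-up titles, only for illustration
toy_titles = ["python is great", "python beats java", "news about python"]
toy_weights = tfidf(toy_titles)
print(toy_weights[0])
```

The word `python` appears in every title, so it gets the lowest possible weight, while a word that appears in a single title (such as `great`) is weighted higher.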

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords

<font color='#eb3483'> **Stopwords removal** </font>

Stopwords are very common words that carry little semantic meaning on their own, so we can usually remove them. For example, in the sentence `"I have a red car"` the word `a` doesn't add any value to the sentence (it's just language "glue"). NLTK (an awesome text processing package we'll explore more later) has built-in stopword lists.

In [19]:
en_stop = stopwords.words('english')

In [21]:
en_stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
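As a rough illustration of what stopword removal does to a sentence, here is a small sketch in plain Python. The set `toy_stopwords` is a hand-picked, hypothetical subset (the real NLTK list is much longer), and `remove_stopwords` is an illustrative helper, not how `TfidfVectorizer` tokenizes internally.

```python
# Hypothetical mini stopword list; the real NLTK list is much longer.
toy_stopwords = {"i", "have", "a", "the", "is"}

def remove_stopwords(sentence, stop_words):
    """Lowercase, split on whitespace, and drop any token in stop_words."""
    return [tok for tok in sentence.lower().split() if tok not in stop_words]

kept = remove_stopwords("I have a red car", toy_stopwords)
print(kept)  # ['red', 'car']
```

Only the content-bearing words survive, and those are the words we want the classifier to focus on.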

In [22]:
vectorizer = TfidfVectorizer(stop_words=en_stop)

In [23]:
posts.shape

(2000, 2)

In [24]:
vectorized_text = vectorizer.fit_transform(posts.title).toarray()
vectorized_text.shape

(2000, 5798)

This matrix has one row for each of the 2000 post titles and one column for each of the 5798 distinct words that remain after stopword removal.

In [25]:
vectorized_text[:10]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectorized_text, 
                                                    posts.subreddit,
                                                    test_size=0.2)

## <font color='#eb3483'> Building a Naive Bayes Classifier </font>


Scikit-learn ships several implementations of the [Naive Bayes Classifier](http://scikit-learn.org/stable/modules/naive_bayes.html). We will look at three of them: `GaussianNB`, `BernoulliNB` and `MultinomialNB`. Each one computes the class probabilities under the assumption that the features follow a different distribution.

[GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) assumes that, within each class, every feature follows a Gaussian (Normal) distribution.
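The Gaussian assumption can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's actual code: the function name `gaussian_nb_log_scores` and the toy data are made up, and the variance floor only loosely mirrors the `var_smoothing` parameter.

```python
import numpy as np

def gaussian_nb_log_scores(X_train, y_train, x_new, var_smoothing=1e-9):
    """Return (classes, unnormalized log-posterior per class) for x_new."""
    classes = np.unique(y_train)
    # Small variance floor so constant features do not divide by zero
    eps = var_smoothing * X_train.var(axis=0).max()
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + eps
        log_prior = np.log(len(Xc) / len(X_train))
        # Sum of per-feature Gaussian log-densities (the "naive" independence step)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var)
        scores.append(log_prior + log_lik)
    return classes, np.array(scores)

# Made-up data: two features, two classes
toy_X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.1]])
toy_y = np.array(["a", "a", "b", "b"])
toy_classes, toy_scores = gaussian_nb_log_scores(toy_X, toy_y, np.array([1.1, 2.1]))
toy_pred = toy_classes[np.argmax(toy_scores)]
print(toy_pred)
```

Because the score is a sum over all features, a matrix with thousands of columns can produce very large gaps between classes. That is one plausible reason the probabilities from `predict_proba` tend to collapse to 0 or 1.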

In [27]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

In [28]:
gaussian_nb = GaussianNB()

In [29]:
gaussian_nb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [30]:
gaussian_nb.predict(X_test)[:10]

array(['news', 'science', 'machinelearning', 'startups',
       'machinelearning', 'machinelearning', 'science', 'gadgets',
       'science', 'geek'], dtype='<U15')

In [31]:
y_test[:10]

132                news
710                news
2441               geek
1212           startups
109                news
4141    machinelearning
432                news
3049            gadgets
2387               geek
2415               geek
Name: subreddit, dtype: object

In [32]:
gaussian_nb.predict_proba(X_test)[:6]

array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        1.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 1.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 1.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 1.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 1.00000000e+000,
        0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 2.16022148e-104,
        0.00000000e+000, 0.00000000e+000, 1.0000000

In [33]:
X_train[6]

array([0., 0., 0., ..., 0., 0., 0.])

We will evaluate each model with the micro-averaged F1 score, computed with cross-validation.

In [35]:
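from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

def f1_multilabel_cv(estimator, X, y):
    # Custom scorer with the (estimator, X, y) signature that cross_val_score expects.
    # Despite the name, this is a multiclass problem: each post has exactly one label.
    preds = estimator.predict(X)
    return f1_score(y, preds, average="micro")

# Mean micro-averaged F1 across the cross-validation folds
cross_val_score(gaussian_nb, 
                X=vectorized_text,
                y=posts.subreddit,
                scoring=f1_multilabel_cv).mean()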

0.599
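It helps to know what the micro average actually computes. The sketch below pools true positives, false positives and false negatives across all classes before computing F1; `micro_f1` and the toy labels are illustrative names and data, not taken from the notebook. When every sample has exactly one label, as here, the result coincides with plain accuracy.

```python
from collections import Counter

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gets a false positive
            fn[truth] += 1  # the true class gets a false negative
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * total_tp / (2 * total_tp + total_fp + total_fn)

# Made-up labels: 4 of 5 predictions are correct
toy_true = ["news", "geek", "news", "science", "geek"]
toy_pred = ["news", "news", "news", "science", "geek"]
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(micro_f1(toy_true, toy_pred), toy_accuracy)
```

So the scores reported in this section can be read as the fraction of posts assigned to the correct subreddit.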

The most commonly used implementations of NBC for text classification are [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB), which assumes that the data follows a [Multinomial Distribution](https://en.wikipedia.org/wiki/Multinomial_distribution), and [BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB), which assumes that the data follows a Bernoulli distribution (each variable, i.e. each word, is a binary value, either 0 or 1).
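To see what the multinomial assumption means in practice, here is a minimal from-scratch sketch in NumPy. It is not scikit-learn's implementation: the function names, the three-word vocabulary and the counts are all made up, and `alpha=1.0` is plain Laplace smoothing.

```python
import numpy as np

def fit_multinomial_nb(counts, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word probabilities per class."""
    classes = np.unique(labels)
    log_prior = np.log(np.array([(labels == c).mean() for c in classes]))
    # Total count of each word within each class, plus alpha to avoid zeros
    word_counts = np.array([counts[labels == c].sum(axis=0) for c in classes]) + alpha
    log_word_prob = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_word_prob

def predict_multinomial_nb(model, counts):
    classes, log_prior, log_word_prob = model
    # Each word occurrence adds its class-conditional log probability
    scores = counts @ log_word_prob.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Columns: counts of the (made-up) vocabulary ["python", "election", "model"]
toy_counts = np.array([[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]])
toy_labels = np.array(["Python", "Python", "news", "news"])
toy_model = fit_multinomial_nb(toy_counts, toy_labels)
toy_prediction = predict_multinomial_nb(toy_model, np.array([[1, 0, 0]]))
print(toy_prediction)
```

A Bernoulli model differs in that it only looks at whether a word is present or absent, and an absent word also contributes to the score.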

In [36]:
cross_val_score(MultinomialNB(), 
                X=vectorized_text,
                y=posts.subreddit,
                scoring=f1_multilabel_cv).mean()

0.708

Now let's try `CountVectorizer`, which produces raw word counts, instead of TF-IDF.

In [38]:
from sklearn.feature_extraction.text import CountVectorizer
vectorized_text_counts_binary = CountVectorizer(strip_accents="unicode", 
                                         stop_words=en_stop, binary=True
                                        ).fit_transform(posts.title)

vectorized_text_counts = CountVectorizer(strip_accents="unicode", 
                                         stop_words=en_stop
                                        ).fit_transform(posts.title)

In [39]:
cross_val_score(MultinomialNB(), 
                X=vectorized_text_counts,
                y=posts.subreddit,
                scoring=f1_multilabel_cv).mean()

0.7050000000000001

In this particular case, Multinomial NB performs about the same with counts (0.705) as with TF-IDF vectors (0.708).
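One caveat about the cross-validation runs above: the vectorizer was fit on all 2000 titles before splitting, so the vocabulary (and, for TF-IDF, the IDF weights) already reflects the validation folds. A common way to avoid this is to wrap the vectorizer and the classifier in a pipeline. The sketch below uses made-up titles and labels, and scikit-learn's built-in `"english"` stop word list and `"f1_micro"` scorer instead of the NLTK list and the custom scorer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up titles and labels, only to keep the example self-contained
toy_titles = [
    "python list comprehension tips", "pandas dataframe tricks in python",
    "python decorators explained", "debugging python scripts",
    "senate passes budget bill", "court rules on election case",
    "governor signs new law", "city council votes on budget",
]
toy_labels = ["Python"] * 4 + ["news"] * 4

# The vectorizer is refit on the training folds only, inside each CV split
toy_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
toy_scores = cross_val_score(toy_pipeline, toy_titles, toy_labels,
                             cv=2, scoring="f1_micro")
print(toy_scores.mean())
```

With the real data you would pass `posts.title` and `posts.subreddit` to `cross_val_score` in place of the toy lists.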

Similarly, we can test the Bernoulli NB classifier (here on the TF-IDF matrix).

In [40]:
cross_val_score(BernoulliNB(), 
                X=vectorized_text,
                y=posts.subreddit,
                scoring=f1_multilabel_cv).mean()

0.5755000000000001

We see that BernoulliNB performs much worse than MultinomialNB here (0.576 vs. 0.708).
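One detail worth keeping in mind for the BernoulliNB run: as far as I know, `BernoulliNB` thresholds its input at `binarize=0.0` by default, so any nonzero TF-IDF weight is simply treated as "word present". The sketch below checks this on made-up titles by comparing a model trained on TF-IDF weights with one trained on binary counts. Both vectorizers use their default settings here, so they share the same vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up titles and labels, only for illustration
toy_titles = ["python python tips", "python pandas tricks",
              "budget vote news", "election news news"]
toy_labels = ["Python", "Python", "news", "news"]

toy_tfidf = TfidfVectorizer().fit_transform(toy_titles)
toy_binary = CountVectorizer(binary=True).fit_transform(toy_titles)

# BernoulliNB thresholds its input at binarize=0.0 by default
proba_tfidf = BernoulliNB().fit(toy_tfidf, toy_labels).predict_proba(toy_tfidf)
proba_binary = BernoulliNB().fit(toy_binary, toy_labels).predict_proba(toy_binary)
same = np.allclose(proba_tfidf, proba_binary)
print(same)
```

In other words, feeding TF-IDF weights to BernoulliNB discards the weighting, so the gap to MultinomialNB reflects the presence/absence model itself rather than the choice of vectorizer.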