# Naive Bayes Classification in Python

In this notebook we will look at how to create a Naive Bayes classifier in Python. As before we will use a toy dataset from Scikit-Learn, in this case the 20 Newsgroups Dataset.

## 20-Newsgroups
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

In the days before the web, bored scientists and engineers would discuss matters of global concern over a text-based bulletin-board system (BBS) called USENET. "News" articles were distributed over USENET using a protocol called "Network News Transfer Protocol" or NNTP, and a text-based reader like "tin" would be used to read and write/respond to these articles.

News articles in USENET are divided into "Newsgroups" with names like alt.tv.simpsons which discusses the Simpsons TV show, and comp.programming which discusses - surprisingly - programming.

In this section we will use the 20 Newsgroups dataset, which consists of roughly 20,000 news articles gathered across 20 newsgroups. To begin, we import the various packages we need.

In [None]:
import numpy as np
import sklearn
from sklearn.datasets import fetch_20newsgroups


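We now load the dataset. As it is quite large we will only load the training data, and leave the testing data for later. We set 'shuffle' to True so that the articles are shuffled rather than returned in their stored order. Note that fetch_20newsgroups uses a fixed default random_state (42), so the shuffled order is the same on every run unless a different random_state is passed.

We then print the names of each newsgroup, given in "target_names". Note that the first run will take a few minutes because the dataset has to be downloaded, and the "In \[6\]" marker on the left will show as "In \[\*\]" while the cell is running.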

In [None]:
twenty_train = fetch_20newsgroups(subset = 'train', shuffle = True)
print(twenty_train.target_names)

After a few minutes you should see the names of the 20 newsgroups that are captured in the dataset. Let's look at how long this training dataset is:

In [None]:
FULL_LEN = 18846 # Total number of articles in Scikit-Learn's copy of the dataset (train + test)
train_len = len(twenty_train.data)
print("%%age of dataset used for training: %3.4f" % (train_len / FULL_LEN * 100.0))

So about 60% has been used for training, and 40% for testing. Now let's start classifying!

## Using the SK-Learn Naive Bayes Classifiers
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes

In the lecture we saw several types of Naive Bayes classifiers, including one that is a variant of Multinomial NB:

- Gaussian (GaussianNB), used for continuous features like temperature, height, etc.

- Multinomial (MultinomialNB), if each feature is some kind of count. 

- "Complement Naive Bayes" (ComplementNB), which is like MultinomialNB but does special calculations to overcome the effects of imbalanced datasets. Since the 20 Newsgroups dataset has roughly 1000 articles per newsgroup (i.e. it is balanced), we will not use this.

- Bernoulli (BernoulliNB), if each feature is a binary value.
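
As a rough illustration of how the feature type drives the choice of model, the sketch below fits three of these classifiers on tiny made-up arrays. All the data and variable names here are hypothetical and only meant to show which kind of input each class expects.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Hypothetical toy data: 4 samples, 2 classes (0 and 1).
y = np.array([0, 0, 1, 1])

# Continuous features (e.g. height in cm, temperature in C) -> GaussianNB
X_cont = np.array([[150.0, 20.5], [155.0, 21.0], [180.0, 30.2], [185.0, 29.8]])
gnb = GaussianNB().fit(X_cont, y)

# Count features (e.g. how often each of 3 words occurs) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Binary features (e.g. whether each word is present at all) -> BernoulliNB
X_bin = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_bin, y)

print(gnb.predict([[152.0, 20.0]]))
print(mnb.predict([[4, 0, 0]]))
print(bnb.predict([[0, 1, 1]]))
```

All three classes share the same fit/predict interface, so swapping one for another only requires changing the features that are fed in.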

Since our dataset consists of word counts, we will use the MultinomialNB model. There are several things we need to do first.

### Creating a Bag of Words
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

We begin by creating a "bag of words", which is essentially a count of how often each word occurs in each article. Let's do that now.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_counts = cv.fit_transform(twenty_train.data)

This creates a sparse matrix with one row per article and one column per word in the vocabulary. Internally it stores only the non-zero counts, together with the column indices of those counts and index pointers telling us where each article starts and ends. We can explore some of these:

In [None]:
print("# of articles: %d" % X_train_counts.shape[0])
print("Count for words in the first article: ", repr(X_train_counts[0].data))
print("Indexes of words in the first article: ", repr(X_train_counts[0].indices))

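print("Five of the words in the first article:\n")

# Look up the vocabulary once; get_feature_names_out() replaces the older
# get_feature_names(), which was removed in Scikit-Learn 1.2.
feature_names = cv.get_feature_names_out()
for ind in X_train_counts[0].indices[:5]:
    print(feature_names[ind], " ", end="")

print()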

Without further ado, let's create our classifier.

### Building the 20 Newsgroups Classifier

Since our data consists of counts of each word, we will use a Multinomial Naive Bayes classifier, as discussed in the lecture. We will import the MultinomialNB class, then call the "fit" method to train the model. We will also load up the test data from the dataset.

Note that when we create the word count for the documents in the test data, we use transform instead of fit_transform. This is because fit_transform would learn a new vocabulary from the test documents, and the resulting columns would no longer match the ones the classifier was trained on. All we want to do is to convert the documents in the test set into a bag of words using the vocabulary we already have.

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

# Train using our training data
clf.fit(X_train_counts, twenty_train.target)

# Load up the test data
twenty_test = fetch_20newsgroups(subset='test', shuffle = True)

# We call transform to turn twenty_test into a BOW. We DO NOT call
# fit_transform, which also learns the words.
X_test_counts = cv.transform(twenty_test.data)
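
To see why transform matters, the sketch below repeats the idea on made-up documents (the `toy_` and `refit_` names are hypothetical). Words that were never seen during fitting simply have no column and are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only.
toy_train = ["spam spam eggs", "eggs and ham"]
toy_test = ["ham and cheese"]          # "cheese" never appears in toy_train

toy_cv = CountVectorizer()
toy_train_counts = toy_cv.fit_transform(toy_train)   # learns the vocabulary
toy_test_counts = toy_cv.transform(toy_test)         # reuses that vocabulary

print(sorted(toy_cv.vocabulary_))      # words learned from toy_train only
print(toy_test_counts.toarray())       # "cheese" has no column, so it is dropped

# Same number of columns, so a model fitted on the training counts
# can be applied to the test counts.
print(toy_train_counts.shape[1] == toy_test_counts.shape[1])

# Refitting on the test set would instead learn a different vocabulary,
# whose columns no longer line up with the training matrix.
refit_cv = CountVectorizer().fit(toy_test)
print(sorted(refit_cv.vocabulary_))
```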

Now that we have our model trained, we can evaluate its performance. We will create a vector of predicted categories and compare them against the "ground truths":

In [None]:
predicted = clf.predict(X_test_counts)

# predicted == twenty_test.target produces a boolean vector that is True where the labels in
# predicted match those in the target, and False otherwise. np.mean treats True as 1 and False
# as 0, so it sums the matches and divides by the # of elements, effectively giving an accuracy rate.
perf = np.mean(predicted == twenty_test.target)
print("Our 20 Newsgroups raw-count classifier correctly classified %3.4f%% of the articles."
      % (perf * 100.0))
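
The accuracy trick used above can be checked on a handful of made-up labels. This is only a sketch with hypothetical data; it also shows that Scikit-Learn's `accuracy_score` gives the same number.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only.
toy_truth = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0, 1, 1, 2, 1])

matches = toy_pred == toy_truth     # boolean array: True where labels agree
toy_acc = np.mean(matches)          # True counts as 1, False as 0

print(matches)
print(toy_acc)
print(accuracy_score(toy_truth, toy_pred))
```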


Our accuracy is 77.28%! Not bad, but not great either. Let's use tf.idf instead of raw word counts, and see what happens.

### Using tf.idf For Classification

As mentioned in the lecture, raw word counts have two problems:

1. A bias towards long documents and terms that occur frequently across documents (and are thus very bad 'discriminators' - attributes that help us differentiate between document classes)

2. Zero count terms.

To fix this we use tf.idf, which if you recall, is defined as:

$$
x_i = \log(tf_{ik} + 1) \, \log\left(\frac{D}{d_{tf}}\right)
$$

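The $\log(tf_{ik} + 1)$ part adds 1 to each count so that the logarithm is still defined for zero-count terms (taking a log also keeps the numbers from becoming too huge), while the $\log(\frac{D}{d_{tf}})$ part punishes terms that occur across many documents. Here $D$ is the total number of documents and $d_{tf}$ is the number of documents that contain the term.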
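
To get a feel for the numbers, here is a direct NumPy transcription of the formula above on a tiny made-up count matrix. The data is hypothetical, and the use of natural logarithms is an assumption, since the formula does not fix the base. It is a sketch of the lecture formula, not of any particular library routine.

```python
import numpy as np

# Hypothetical raw counts: 3 documents (rows) x 3 terms (columns).
toy_tf = np.array([
    [2, 1, 0],
    [1, 0, 3],
    [4, 0, 0],
])

D = toy_tf.shape[0]                      # total number of documents
d_tf = np.count_nonzero(toy_tf, axis=0)  # documents containing each term

# x = log(tf + 1) * log(D / d_tf), using natural logarithms
# (the base of the logarithm is an assumption here).
toy_tfidf = np.log(toy_tf + 1) * np.log(D / d_tf)

print(d_tf)
print(np.round(toy_tfidf, 3))
```

The first term occurs in every document, so its weight is zero everywhere: it tells us nothing about which document we are looking at. The rarer second and third terms keep a positive weight wherever they occur.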

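To get the tf.idf BOW, we pass the raw-count BOW derived in the previous part through TfidfTransformer. Note that TfidfTransformer's default weighting differs slightly from the formula above (by default it uses the raw term frequency, a smoothed idf, and L2-normalizes each document), but the idea is the same. The rest of the code is fairly straightforward: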


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
tfidf_clf = MultinomialNB()
tfidf_clf.fit(X_train_tfidf, twenty_train.target)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted_tfidf = tfidf_clf.predict(X_test_tfidf)
perf_tfidf = np.mean(predicted_tfidf == twenty_test.target)

print("Our 20 Newsgroups tf-idf classifier correctly classified %3.4f%% of the articles."
      % (perf_tfidf * 100.0))

We see that we now have a slight improvement in performance of just over 0.11 percentage points.

### Using Stop Words

The CountVectorizer allows us to ignore stop-words. Stop-words are words like "a", "the", etc. that occur so frequently that they are meaningless for classification. You can find an example stop-word list at https://countwordsfree.com/stopwords.

Since the stop-words are removed by the CountVectorizer, we need to recreate all our counts. Let's do that now.

In [None]:
sw_count_vect = CountVectorizer(stop_words = 'english')
X_train_counts = sw_count_vect.fit_transform(twenty_train.data)
X_test_counts = sw_count_vect.transform(twenty_test.data)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
tfidf_clf = MultinomialNB()
tfidf_clf.fit(X_train_tfidf, twenty_train.target)
predicted_tfidf_sw = tfidf_clf.predict(X_test_tfidf)

perf_tfidf_sw = np.mean(predicted_tfidf_sw == twenty_test.target)

print("Our 20 Newsgroups tf-idf with stop-words classifier correctly classified %3.4f%% of the articles."
      % (perf_tfidf_sw * 100.0))

This gives us a substantial improvement of about 4 percentage points!
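
To see what the stop-word filter actually removes, the sketch below compares the vocabulary learned with and without it on a single made-up sentence (the sentence and variable names are hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentence, for illustration only.
toy_docs = ["the cat is on the mat and the dog is in the garden"]

plain_cv = CountVectorizer().fit(toy_docs)
sw_cv = CountVectorizer(stop_words='english').fit(toy_docs)

print(sorted(plain_cv.vocabulary_))   # every word of two or more letters
print(sorted(sw_cv.vocabulary_))      # only the content words remain
```

Only the content words survive, so the classifier no longer spends any of its vocabulary on words that appear in almost every article.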

## Using Pipelines
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

The current workflow for creating classifiers and regressors is tedious; fortunately Scikit-Learn provides a very useful structure called a Pipeline that lets us specify what objects to use to process the inputs to produce the outputs. The code below shows how to use a Pipeline. Essentially the Pipeline takes a list of tuples that contain:

- A string identifier that you can use later on to access a particular object in the pipeline.

- The object itself that is used to process the data.

Let's look at how to create our tf.idf classifier with stop-words:

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words = 'english')),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB()), ])
text_clf.fit(twenty_train.data, twenty_train.target)
predicted_pipeline = text_clf.predict(twenty_test.data)
perf_pipeline = np.mean(predicted_pipeline == twenty_test.target)
print("Our 20 Newsgroups pipeline classifier correctly classified %3.4f%% of the articles."
      % (perf_pipeline * 100.0))

As expected we have exactly the same results as before. Notice how much easier and more intuitive this is; we just specify the steps in our processing: CountVectorizer, which produces the bag of words, TfidfTransformer, which turns the raw counts into tf.idf scores, and finally MultinomialNB, which does the classification.

We provide labels like 'vect', 'tfidf' and 'clf' that let us access these individual objects. For example we could do:


In [None]:
print(text_clf['clf'])

As we can see, this returns the MultinomialNB we put into the pipeline.
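
The step labels are useful for more than lookups. The sketch below, on a made-up four-document corpus with hypothetical `toy_` names, uses the labels to set a hyperparameter with the `<label>__<parameter>` convention and to inspect a fitted step through `named_steps`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes (0 = pets, 1 = computing).
toy_docs = ["cat dog pet", "dog pet food", "python code bug", "code bug fix"]
toy_labels = [0, 0, 1, 1]

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Labels address hyperparameters as '<label>__<parameter>'.
toy_clf.set_params(clf__alpha=0.5)
toy_clf.fit(toy_docs, toy_labels)

# Each fitted step can be looked up by its label.
print(sorted(toy_clf.named_steps['vect'].vocabulary_))
print(toy_clf.named_steps['clf'].alpha)

# The pipeline takes raw text and runs every step in order.
print(toy_clf.predict(["pet cat", "python bug"]))
```

The same `<label>__<parameter>` names are what tools such as GridSearchCV expect when tuning a pipeline.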