# Naive Bayes and Scikit-Learn - Codealong 

## Introduction

In this lesson, we'll gain experience using sklearn to work with text data and implement a Naive Bayesian Classifier, including sklearn pipelines!

## Objectives

You will be able to:

* Implement Basic NLP Tasks including stemming/lemmatization, tokenization, and word vectorization
* Implement a machine learning classifier to process text, run the classifier, and validate results 

## Getting Started

In this lesson, we'll see an example of how we can use professional tools such as sklearn to work through a real-world NLP task. We'll build a pipeline that processes the text and then trains a Naive Bayesian Classifier on the _20 Newsgroups dataset_.  This tutorial has been modified from the tutorial available in the [sklearn documentation](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

## Loading Our Dataset

We need to start by loading in our dataset.  sklearn has provided a helper file to do this for us, called `fetch_data.py`.  

To load the data:

1. Open a terminal window
2. Navigate to this directory
3. Run the command `python fetch_data.py`

**_NOTE:_** This dataset is a decent size, coming in at ~14 MB compressed.  The helper file will download the archive and then decompress it, but it will only update you as each step finishes.  If it seems like it's frozen, don't worry--just let it finish! It should take a few minutes. 

When the helper file has finished, you'll see two new folders in this directory--`20news-bydate-test` and `20news-bydate-train`.

In order to make things move a bit more quickly, we'll limit ourselves to only 4 of the available 20 categories.  

In [1]:
!python fetch_data.py

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

Now, we'll load in only the files that contain articles matching those categories. 

In [3]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', categories=categories, 
                                  shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


We can check the names of our targets to confirm that we have the right ones. 

In [6]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Next, let's take a look at how many articles we have. 

In [15]:
len(twenty_train.data)

2257

In [28]:
twenty_train.data[0]

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

We can even take a look at the first few lines of an article, along with its label!

In [4]:
print("First three lines of article")
print('\n'.join(twenty_train.data[0].split('\n')[:3]))   # first 3 lines

First three lines of article
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


In [5]:
print('label: {}'.format(twenty_train.target_names[twenty_train.target[0]]))

label: comp.graphics


In [32]:
twenty_train.target[0]

1

In [33]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [34]:
twenty_train.target_names[1]

'comp.graphics'

In [8]:
twenty_train.target.shape

(2257,)

It's also a good habit to inspect our labels to get a feel for what they look like.


In [35]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
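As a small illustration of inspecting labels, the sketch below counts how often each label value occurs. It uses only the ten labels shown in the output above (copied by hand into a hypothetical `first_ten_labels` array), so it runs without the dataset.

```python
import numpy as np

# The first ten labels, copied from the output above (illustrative only)
first_ten_labels = np.array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

# return_counts=True gives each distinct label and how many times it occurs
label_values, label_counts = np.unique(first_ten_labels, return_counts=True)

for value, count in zip(label_values, label_counts):
    print(f"label {value}: {count} article(s)")
```

Passing `twenty_train.target` instead of the hand-copied array should give the class distribution for the whole training set.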

Now that we have our data, we can move on to preprocessing our text, which includes:

* Tokenizing our text
* Transforming our text to a vectorized format

Run the cell below to import everything we'll need for the remainder of this lab. 

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
np.random.seed(0)
%matplotlib inline

## Vectorizing Our Text

Now that we've loaded in the data, all that's left to do is to vectorize it, so that we can use it to train a **_Multinomial Naive Bayesian Classifier_**.

We'll start by using **_Count Vectorization_** and then convert everything to **_Term Frequencies_** to normalize the counts (otherwise, longer articles would naturally have higher word counts than shorter articles). 

In [22]:
count_vectorizer = CountVectorizer()
x_train_counts = count_vectorizer.fit_transform(twenty_train.data)

In [23]:
# 2257 newsgroup posts

len(twenty_train.data)

2257

In [24]:
x_train_counts

<2257x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 365886 stored elements in Compressed Sparse Row format>
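
To make the shape of that matrix more concrete, here is a minimal sketch of count vectorization on a two-sentence toy corpus. The `toy_` names and the sentences are made up for illustration; only `CountVectorizer`, `fit_transform`, and `vocabulary_` come from the lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: one row per document in the resulting matrix
toy_docs = [
    "the cat sat on the mat",
    "the dog sat",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)             # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(toy_counts.toarray())  # [[1 0 1 1 1 2]
                             #  [0 1 0 0 1 1]]
```

Each row is a document and each column is a vocabulary word, which is why the real matrix above has 2257 rows and 35788 columns. Most entries are zero, so sklearn stores the result as a sparse matrix.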

In [25]:
help(CountVectorizer.fit_transform)

Help on function fit_transform in module sklearn.feature_extraction.text:

fit_transform(self, raw_documents, y=None)
    Learn the vocabulary dictionary and return term-document matrix.
    
    This is equivalent to fit followed by transform, but more efficiently
    implemented.
    
    Parameters
    ----------
    raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    
    Returns
    -------
    X : array, [n_samples, n_features]
        Document-term matrix.



In [26]:
count_vectorizer.vocabulary_

{'from': 14887,
 'sd345': 29022,
 'city': 8696,
 'ac': 4017,
 'uk': 33256,
 'michael': 21661,
 'collier': 9031,
 'subject': 31077,
 'converting': 9805,
 'images': 17366,
 'to': 32493,
 'hp': 16916,
 'laserjet': 19780,
 'iii': 17302,
 'nntp': 23122,
 'posting': 25663,
 'host': 16881,
 'hampton': 16082,
 'organization': 23915,
 'the': 32142,
 'university': 33597,
 'lines': 20253,
 '14': 587,
 'does': 12051,
 'anyone': 5201,
 'know': 19458,
 'of': 23610,
 'good': 15576,
 'way': 34755,
 'standard': 30623,
 'pc': 24651,
 'application': 5285,
 'pd': 24677,
 'utility': 33915,
 'convert': 9801,
 'tif': 32391,
 'img': 17389,
 'tga': 32116,
 'files': 14281,
 'into': 18268,
 'format': 14676,
 'we': 34775,
 'would': 35312,
 'also': 4808,
 'like': 20198,
 'do': 12014,
 'same': 28619,
 'hpgl': 16927,
 'plotter': 25361,
 'please': 25337,
 'email': 12833,
 'any': 5195,
 'response': 27836,
 'is': 18474,
 'this': 32270,
 'correct': 9932,
 'group': 15837,
 'thanks': 32135,
 'in': 17556,
 'advance': 4378,

Note that once we've fitted our vectorizer as we did above, we can use its built-in dictionary to get the indices of any words we choose!

In [27]:
count_vectorizer.vocabulary_.get('quickly')

26727

Note that the output above represents the index of the word "quickly", not the actual count for how many times that word appears. However, we could use that index to look up the count, if we chose to!
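
One way such a lookup might go is sketched below on a made-up two-document corpus (the `toy_` names are hypothetical): the index from `vocabulary_` selects a column of the count matrix, and the row selects the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration
toy_docs = ["run quickly, very quickly", "walk slowly"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Column index of the word (this is an index, not a count)
col = toy_vectorizer.vocabulary_.get("quickly")

# Count of the word in the first document: row 0, column col
count_in_first_doc = int(toy_counts[0, col])

# Count of the word across all documents: sum down the column
total_count = int(toy_counts[:, col].sum())

# Words that never appeared during fitting are not in the dictionary
missing = toy_vectorizer.vocabulary_.get("dog")

print(col, count_in_first_doc, total_count, missing)  # 0 2 2 None
```

Because `.get()` returns `None` for unseen words, it is worth checking for that before indexing into the matrix.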

Once we have our Count Vectorizer, it's pretty easy to leverage sklearn's `TfidfTransformer` to convert these counts to **_Term Frequencies_** (which is what the 'tf' in 'tf-idf' stands for). 

In [46]:
tf_transformer = TfidfTransformer(use_idf=False).fit(x_train_counts)
x_train_tf = tf_transformer.transform(x_train_counts)

In [49]:
x_train_tf

<2257x35788 sparse matrix of type '<class 'numpy.float64'>'
	with 365886 stored elements in Compressed Sparse Row format>
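
The following sketch shows why this step removes the effect of document length. One detail worth knowing: with its default `norm='l2'` setting, `TfidfTransformer(use_idf=False)` scales each row to unit Euclidean length rather than dividing by the total word count. The two toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical documents: same words, but the second is twice as long
short_doc = "cat sat mat"
long_doc = "cat sat mat cat sat mat"

toy_counts = CountVectorizer().fit_transform([short_doc, long_doc])
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts)

print(toy_counts.toarray())  # raw counts differ: [[1 1 1], [2 2 2]]
print(toy_tf.toarray())      # normalized rows are identical

# Each row has Euclidean (L2) length 1 after the transform
row_norms = np.linalg.norm(toy_tf.toarray(), axis=1)
print(row_norms)
```

After the transform, both documents are represented by the same vector, so a classifier sees them as equally "about" cats, sitting, and mats regardless of length.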

## Fitting Our Classifier

Now that we've vectorized our data, we can create a `MultinomialNB` classifier and fit it to our vectorized data!


In [50]:
clf = MultinomialNB()
clf.fit(x_train_tf, twenty_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
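
To classify new documents with manually fitted objects, the new text has to go through the same vectorizer and transformer, using `.transform()` rather than `.fit_transform()` so that the vocabulary learned from the training data is reused. Here is a minimal sketch on a made-up four-document corpus; all `toy_` names, sentences, and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: 0 = graphics, 1 = medicine
toy_docs = [
    "the pixel shader renders the image",
    "render the image with a graphics card",
    "the doctor prescribed medicine",
    "medicine and the patient's treatment",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["graphics", "medicine"]

# Fit each step on the training data
toy_vectorizer = CountVectorizer()
toy_tf = TfidfTransformer(use_idf=False)
toy_x_train = toy_tf.fit_transform(toy_vectorizer.fit_transform(toy_docs))
toy_clf = MultinomialNB().fit(toy_x_train, toy_labels)

# For new text, only transform: do NOT refit the vectorizer or transformer
new_docs = ["graphics card image", "doctor and patient"]
toy_x_new = toy_tf.transform(toy_vectorizer.transform(new_docs))
predicted = toy_clf.predict(toy_x_new)

for doc, label in zip(new_docs, predicted):
    print(f"{doc!r} => {toy_names[label]}")
```

Calling `fit_transform` on the new documents instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier learned.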

Usually, we call `.fit()` and `.predict()` manually at first, so that we can change things around as needed while we experiment.  However, this can get a bit redundant--luckily, we can make use of sklearn's `Pipeline` class to automate many of the steps we've just done manually!

In [52]:
text_clf = Pipeline([('count_vectorizer', CountVectorizer()), 
                     ('tfidf_vectorizer', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

Now that we have our pipeline object that contains the vectorization and transformation steps as well as our classifier, we can easily pass in unprocessed data and call things like `.fit()` and let the pipeline take care of all the steps we've outlined!

In [53]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('count_vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
 ...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
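
As a rough check that a pipeline really does chain the same steps, the sketch below fits the three steps by hand and through a `Pipeline` on a made-up toy corpus and compares the predictions. The `toy_` and `manual_` names are hypothetical. Note that the pipeline above uses `TfidfTransformer()` with its default `use_idf=True`, whereas the earlier manual cell used `use_idf=False`, so the sketch uses the default in both places to keep the comparison fair.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data for illustration
toy_docs = [
    "render the image on screen",
    "graphics card and pixel shader",
    "the doctor saw the patient",
    "medicine for the patient",
]
toy_labels = [0, 0, 1, 1]
new_docs = ["pixel image", "patient medicine", "the screen"]

# Manual version: fit each step, then push new text through each step
manual_vec = CountVectorizer()
manual_tfidf = TfidfTransformer()
manual_nb = MultinomialNB()
x_train = manual_tfidf.fit_transform(manual_vec.fit_transform(toy_docs))
manual_nb.fit(x_train, toy_labels)
manual_pred = manual_nb.predict(
    manual_tfidf.transform(manual_vec.transform(new_docs))
)

# Pipeline version: one fit call and one predict call on raw text
toy_pipeline = Pipeline([
    ('count_vectorizer', CountVectorizer()),
    ('tfidf_vectorizer', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_pipeline.fit(toy_docs, toy_labels)
pipeline_pred = toy_pipeline.predict(new_docs)

predictions_match = np.array_equal(manual_pred, pipeline_pred)
print(predictions_match)
```

The individual fitted steps remain accessible by name, for example `toy_pipeline.named_steps['clf']`.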

## Evaluating Classifier Performance 

Recall that in order to really get a feel for how well our classifier is performing, we need to check its performance against data it hasn't seen before. We do this by splitting off some of our labeled data into a **_Test Set_**.  We already have a test set, thanks to the helper function that we used to load everything in. In the cell below, we'll use our pipeline object to create predictions.  We can then make use of the `metrics` module in sklearn to view a **_Classification Report_** that shows us how well our model performed! 

We'll start by loading in our test set, in the same way that we loaded in our training set.

In [57]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, 
                                 shuffle=True, random_state=0)
test_articles = twenty_test.data
test_labels = twenty_test.target

In [62]:
help(fetch_20newsgroups)

Help on function fetch_20newsgroups in module sklearn.datasets.twenty_newsgroups:

fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
    Load the filenames and data from the 20 newsgroups dataset (classification).
    
    Download it if necessary.
    
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    
    Read more in the :ref:`User Guide <20newsgroups_dataset>`.
    
    Parameters
    ----------
    data_home : optional, default: None
        Specify a download and cache folder for the datasets. If None,
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
    
    subset : 'train' or 'test', 'all', optional
        Select the dataset to load: 'train' for the training set, 'test'
        for the test set, 'all' for both, with shuffled ordering.
    
    categories : None or collect

Now, let's use our pipeline to create some predictions for our test data, and then compare the results to the corresponding labels.

In [58]:
test_predictions = text_clf.predict(test_articles)

# Fraction of correct predictions (each comparison is True/False, averaged as 1/0)
np.mean(test_predictions == test_labels) # Expected Output: 0.8348868175765646

0.8348868175765646
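
The averaging trick works because NumPy treats `True` as 1 and `False` as 0. A minimal sketch on made-up label arrays (the `toy_` names are hypothetical) shows that it agrees with sklearn's `metrics.accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions for six documents
toy_true = np.array([0, 1, 2, 2, 1, 0])
toy_pred = np.array([0, 1, 2, 1, 1, 2])

# Element-wise comparison gives a boolean array: True where the prediction is right
matches = toy_pred == toy_true

# Averaging booleans gives the fraction that are True (4 of 6 here)
manual_accuracy = np.mean(matches)

# sklearn's built-in accuracy metric computes the same quantity
sklearn_accuracy = metrics.accuracy_score(toy_true, toy_pred)

print(matches)
print(manual_accuracy, sklearn_accuracy)
```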

**_83.5% accuracy--pretty good!_**  Let's round out this lab by viewing a full **_Classification Report_** for how our model performed for each given category:

In [56]:
print(metrics.classification_report(test_labels, test_predictions, 
                              target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

             micro avg       0.83      0.83      0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502
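
In the report above, `soc.religion.christian` has low precision (0.65) but very high recall (0.99), the pattern we would expect when a model predicts one class too often. The sketch below reproduces that pattern on made-up binary labels (the `toy_` names are hypothetical) and computes precision and recall by hand from the confusion matrix.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: the model predicts class 1 for almost everything
toy_true = np.array([0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 1, 1, 1])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(toy_true, toy_pred)
print(cm)  # [[1 2]
           #  [0 4]]

# Precision for class 1: of everything predicted as 1, how much really was 1?
precision_1 = cm[1, 1] / cm[:, 1].sum()   # 4 / 6

# Recall for class 1: of everything that really was 1, how much did we find?
recall_1 = cm[1, 1] / cm[1, :].sum()      # 4 / 4

print(precision_1, recall_1)
print(metrics.precision_score(toy_true, toy_pred),
      metrics.recall_score(toy_true, toy_pred))
```

The over-predicted class catches every true member (recall of 1.0) at the cost of many false alarms (precision of about 0.67), while the other class loses recall because its documents were absorbed by the over-predicted one.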



## Summary

In this lesson, we worked through an example of how to use professional-quality tools such as **_sklearn_** to preprocess, vectorize, and classify real-world text data by predicting the categories of newsgroup posts using Naive Bayesian Classification. Great job!