# Naive Bayes example  
Based on the [scikit-learn tutorial on working with text data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).  

The [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) is one of the older "classic" natural-language-processing datasets.  Wikipedia has an [impressive page](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research) dedicated to datasets for machine learning.

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics',
              'sci.med']

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

## Featurize the text

In [4]:
count_vect = CountVectorizer(lowercase=True, tokenizer=None, stop_words='english',
                             analyzer='word', max_df=1.0, min_df=1,
                             max_features=None)

count_vect.fit(twenty_train.data)

target_names = twenty_train.target_names

In [5]:
X_train_counts = count_vect.transform(twenty_train.data)
print("The type of X_train_counts is {0}.".format(type(X_train_counts)))
print("The X matrix has {0} rows (documents) and {1} columns (words).".format(
        X_train_counts.shape[0], X_train_counts.shape[1]))

The type of X_train_counts is <class 'scipy.sparse.csr.csr_matrix'>.
The X matrix has 2257 rows (documents) and 35482 columns (words).


In [6]:
tfidf_transformer = TfidfTransformer(use_idf=True)
tfidf_transformer.fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
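
To make the reweighting step less of a black box, here is a minimal plain-Python sketch of the smoothed TF-IDF formula that `TfidfTransformer` documents for its defaults: idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The `tfidf` helper, the whitespace tokenizer, and the three-document corpus are assumptions made for illustration; they are not part of the notebook and do not reproduce `CountVectorizer`'s tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the L2-normalized tf-idf weight."""
    # Assumption: naive lowercase + whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]  # hypothetical corpus
vocab, weights = tfidf(toy_docs)
for tok, w in zip(vocab, weights[0]):
    print(f"{tok:>4}: {w:.3f}")
```

In the first toy document, "the" occurs in every document and so receives a lower weight than "cat" or "sat", even though all three tokens have the same raw count.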

## Training a Naive Bayes model

We have a multi-class classification problem with more features (35,482 tokens) than rows (2,257 documents).  
We will use sklearn's [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) algorithm.

In [7]:
nb_model = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
nb_model.fit(X_train_tfidf, twenty_train.target);
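
The `alpha=1.0` argument is additive (Laplace) smoothing: each per-class token total gets `alpha` added before normalizing, so a token never seen in a class still has a non-zero probability. The sketch below is a from-scratch illustration of that estimate on a tiny count matrix, not sklearn's implementation; `fit_multinomial_nb`, `predict`, and the toy data are hypothetical names and values introduced here.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: list of count rows, y: list of int labels.
    Returns (log_prior, feature_log_prob), both keyed by class label."""
    log_prior, feature_log_prob = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-token totals within the class, each smoothed by alpha.
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)  # equals class total + alpha * n_features
        feature_log_prob[c] = [math.log(t / denom) for t in totals]
    return log_prior, feature_log_prob

def predict(row, log_prior, feature_log_prob):
    """Pick the class with the highest log prior + count-weighted log likelihood."""
    scores = {
        c: log_prior[c] + sum(x * lp for x, lp in zip(row, feature_log_prob[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy counts; columns stand for ["god", "image", "doctor"].
X_toy = [[3, 0, 0], [2, 1, 0], [0, 4, 0], [0, 3, 1]]
y_toy = [0, 0, 1, 1]
log_prior, flp = fit_multinomial_nb(X_toy, y_toy)
print(predict([1, 0, 0], log_prior, flp))  # a document containing only "god"
print(predict([0, 2, 0], log_prior, flp))  # a document containing only "image"
```

Class 0 never contains the third token, yet its smoothed probability is 1/9 rather than 0, which keeps the log-likelihood finite.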

### Interpretability - tokens associated with each category

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_words = count_vect.get_feature_names_out()
n = 7  # number of top tokens to show for each category

for cat in range(len(target_names)):
    print(f"\nTarget: {cat}, name: {target_names[cat]}")
    log_prob = nb_model.feature_log_prob_[cat]
    i_topn = np.argsort(log_prob)[::-1][:n]
    features_topn = [feature_words[i] for i in i_topn]
    print(f"Top {n} tokens: ", features_topn)


Target: 0, name: alt.atheism
Top 7 tokens:  ['edu', 'keith', 'god', 'com', 'caltech', 'writes', 'people']

Target: 1, name: comp.graphics
Top 7 tokens:  ['graphics', 'edu', 'image', 'files', 'com', 'lines', 'university']

Target: 2, name: sci.med
Top 7 tokens:  ['edu', 'pitt', 'com', 'gordon', 'banks', 'geb', 'msg']

Target: 3, name: soc.religion.christian
Top 7 tokens:  ['god', 'jesus', 'edu', 'church', 'christians', 'people', 'christian']
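
Note that 'edu' appears in all four lists and 'com' in three: sorting `feature_log_prob_` surfaces tokens that are frequent within a class, not necessarily tokens that distinguish it from the others. One possible alternative, sketched here on made-up numbers, is to rank tokens by their log-probability in the class minus the mean log-probability in the remaining classes. The `distinctive_tokens` helper and the toy probabilities are hypothetical; with the real model you would pass `nb_model.feature_log_prob_` and `feature_words`.

```python
import math

def distinctive_tokens(feature_log_prob, feature_words, cat, n=2):
    """Rank tokens by log P(token | cat) minus the mean log P(token | other classes)."""
    others = [row for k, row in enumerate(feature_log_prob) if k != cat]
    scores = []
    for j, word in enumerate(feature_words):
        mean_other = sum(row[j] for row in others) / len(others)
        scores.append((feature_log_prob[cat][j] - mean_other, word))
    scores.sort(reverse=True)  # largest log-ratio first
    return [word for _, word in scores[:n]]

# Hypothetical two-class example: "edu" is equally common in both classes.
toy_words = ["edu", "god", "image"]
toy_probs = [[0.5, 0.4, 0.1],
             [0.5, 0.1, 0.4]]
toy_log_prob = [[math.log(p) for p in row] for row in toy_probs]
print(distinctive_tokens(toy_log_prob, toy_words, cat=0))
print(distinctive_tokens(toy_log_prob, toy_words, cat=1))
```

Here "edu" is the most probable token in both classes, but the log-ratio ranking puts "god" first for class 0 and "image" first for class 1.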


A [word cloud](https://amueller.github.io/word_cloud/) would be a nicer way to visualize the top tokens associated with each category; that is left as an exercise for the reader.

## Building a pipeline

In [9]:
from sklearn.pipeline import Pipeline
nb_pipeline = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('model', MultinomialNB()),
                        ])
nb_pipeline.fit(twenty_train.data, twenty_train.target); 

### Evaluating performance on the test set

In [10]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = nb_pipeline.predict(docs_test)
accuracy = np.mean(predicted == twenty_test.target)
print("\nThe accuracy on the test set is {0:0.3f}.".format(accuracy))


The accuracy on the test set is 0.835.
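
A single accuracy figure can hide a class that the model rarely gets right. As a rough illustration, the sketch below computes per-class recall from two label lists in plain Python; `per_class_recall` and the toy labels are hypothetical and are not results from this notebook. On the real data, `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix` give the same kind of breakdown.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: class 1 is always misclassified as class 0.
y_true_toy = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
recall_toy = per_class_recall(y_true_toy, y_pred_toy)
print(f"accuracy: {accuracy_toy:.2f}")
print(f"recall per class: {recall_toy}")
```

In this toy case the overall accuracy is 0.80 even though no example of class 1 is ever predicted correctly.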


In [11]:
len(docs_test)

1502