# Naive Bayes Example 2 - Text analysis

__https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html__

One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified. Here we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these short documents into categories.

## Step One - Imports

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

from sklearn.metrics import confusion_matrix


## Step Two - Get the data

In [None]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
data.target_names

Select a few categories and download the training and test datasets:

In [None]:
categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

Print out a sample document from the training set:

In [None]:
print(train.data[5])


In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers. In other words, we have to preprocess the data.

For this we will use the TF-IDF vectorizer and create a pipeline that attaches it to a multinomial naive Bayes classifier:

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
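To make the vectorization step more concrete, here is a minimal sketch of what `TfidfVectorizer` produces on its own. The three toy sentences are made up for illustration and are not part of the 20 Newsgroups data; the vectorizer is used with its default settings, which lowercase the text, keep tokens of two or more characters, and scale each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the vectorizer
toy_docs = [
    "the rocket reached orbit",
    "the image was rendered in 3d",
    "the rocket image",
]

vec = TfidfVectorizer()
X = vec.fit_transform(toy_docs)   # sparse matrix: one row per document

# One column per distinct token found in the corpus
print(X.shape)
print(sorted(vec.vocabulary_))

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(row_norms)
```

Each document becomes one row of numbers, which is the form the multinomial naive Bayes classifier in the pipeline expects as input.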

## Step Three - Train the model and predict

In [None]:
model.fit(train.data, train.target)
labels = model.predict(test.data)
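Once a pipeline like this is fitted, it can also label a single new string. The sketch below shows the idea on a self-contained toy problem, so it runs without downloading the corpus. The training sentences, the two category names, and the helper `predict_category` are all assumptions made for illustration; with the notebook's objects you would pass `model` and `train.target_names` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: two documents per category
toy_texts = [
    "rocket launch orbit satellite",
    "nasa shuttle orbit mission",
    "pixel shader render image",
    "polygon render graphics card",
]
toy_targets = [0, 0, 1, 1]
toy_names = ["space", "graphics"]

# Same pipeline structure as above: TF-IDF features, then naive Bayes
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_targets)

def predict_category(text, model=toy_model, names=toy_names):
    """Return the category name predicted for a single string."""
    pred = model.predict([text])   # predict expects a list of documents
    return names[pred[0]]

print(predict_category("the shuttle reached orbit"))
print(predict_category("render a polygon image"))
```

Words that the vectorizer never saw during training (such as "reached" here) are simply ignored at prediction time, so the decision rests on the known words only.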

In [None]:
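from sklearn.metrics import confusion_matrix

# Rows of mat are true labels, columns are predicted labels
mat = confusion_matrix(test.target, labels)

# Plot the transpose so that the x axis is the true label
# and the y axis is the predicted label
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');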
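The heatmap can also be summarized numerically. The following is a small sketch, using made-up label arrays rather than the newsgroup predictions, of how overall accuracy and per-class recall and precision can be read off a confusion matrix. Note that `confusion_matrix` returns true labels along the rows, whereas the plot above shows the transpose.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true labels, columns are predicted labels
mat = confusion_matrix(y_true, y_pred)
print(mat)

# Correct predictions sit on the diagonal
accuracy = np.trace(mat) / mat.sum()

# Recall: fraction of each true class that was predicted correctly
recall = np.diag(mat) / mat.sum(axis=1)

# Precision: fraction of each predicted class that was actually correct
precision = np.diag(mat) / mat.sum(axis=0)

print(accuracy, recall, precision)
```

Applied to the newsgroup results, the same arithmetic on `mat` shows which categories the classifier confuses most often: large off-diagonal entries mark pairs of topics whose vocabulary overlaps.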

