# 20 Newsgroups Classifier

The idea is to find the topic of a document. We'll use a set of documents, included in **sklearn**, known as **20 Newsgroups**. It's a set of approximately 18,000 emails distributed across 20 distinct topics such as **christian religion**, **atheism**, **guns**, **mac hardware**, **graphic computing**, etc. Some of the topics are closely related to each other, while others are not.

We are going to create a classifier for a subset of the topics. This subset has a total of 5 topics: 3 related topics and 2 unrelated topics.

We start by importing the necessary modules. Then we select the topics we will be working with and load the training data for them from the **20 Newsgroups** dataset.

In [2]:
%matplotlib inline

In [3]:
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# These are the selected topics.
cats = ['rec.sport.baseball', 'rec.sport.hockey', 'rec.autos', 'sci.crypt', 'soc.religion.christian']

# Load the training data for the chosen categories.
newsgroup_train = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=42)

# List the categories of the fetched data.
list(newsgroup_train.target_names)

['rec.autos',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'soc.religion.christian']

2. How many files does your dataset have?
Print the topic and the first 30 lines of the 4th file.
Print the categories of the first 20 documents.

Let's see:
* How many emails we have in our dataset.
* What an email looks like.
* The categories of the first 20 emails.

In [10]:
# Get the number of emails in the dataset.
size = newsgroup_train.filenames.shape
print('Emails in dataset ---> ' + str(size))

# Get the topic of the 4th email.
tema_index = newsgroup_train.target[3]
tema = newsgroup_train.target_names[tema_index]
print('Topic of the 4th email ---> ' + str(tema))

# Get the first 30 lines of the 4th email (the end of a slice is exclusive).
primeras_lineas = newsgroup_train.data[3].split('\n')[0:30]
print('First 30 lines of the 4th email ---> ')
for linea in primeras_lineas:
    print(linea)
print()

# Get the topics of the first 20 emails.
primeras_cat_ind = newsgroup_train.target[0:20]
print('Topics of the first 20 emails --->')
print()
for categoria in primeras_cat_ind:
    print(newsgroup_train.target_names[categoria])


Emails in dataset ---> (2985,)
Topic of the 4th email ---> rec.sport.baseball
First 30 lines of the 4th email ---> 


I'm not quite sure how these numbers are generated.  It appears that in
a neutral park Bo's HR and slugging tend to drop (he actually loses two
home runs).  Or do they?  What is "equivalent average?"

One thing, when looking at Bo's stats, is that you can see that KC took
away some homers.  Normally, you expect some would-be homers to go for
doubles or triples in big parks, or to be caught, and for that matter you
expect lots of doubles and triples anyway.  But Bo, despite his speed, 
hit very few doubles and not that many triples.  So I would expect his
value to have risen quite considerably in a neutral park.  


Felix Jose has been a .350/.440 player in a fairly neutral park.
I would offhand guess the `89-`90 Bo at around a .330/.530 player.
Maybe .330/.550 .  Not even close.


I'd put him about there too.  

Note: I hadn't realized the media had hyped him so much.  

The emails contain metadata, mainly email addresses, university names, and linked articles. Keeping this information can cause the classifier to overfit the training data, because it learns to recognize senders and signatures instead of the topic itself. If we want our classifier to be useful with text outside of this context, we must avoid using the metadata.

In [11]:
# Get a new training set, this one without metadata.
newsgroup_train = fetch_20newsgroups(subset='train', categories=cats, shuffle=True, random_state=42,
                                     remove=('headers','footers','quotes'))
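To give an intuition for what removing the headers means, here is a minimal sketch that drops everything before the first blank line of a post. The helper name `strip_header` and the sample post are made up for illustration. This is a simplified view and not necessarily how `fetch_20newsgroups` handles `remove=('headers', ...)` internally.

```python
def strip_header(text):
    """Drop everything up to and including the first blank line.

    If the text contains no blank line, an empty string is returned.
    """
    _header, _blank, body = text.partition('\n\n')
    return body


# Hypothetical post: three header lines, a blank line, then the body.
raw_post = (
    "From: someone@example.edu\n"
    "Subject: Re: stats\n"
    "Organization: Example University\n"
    "\n"
    "I'm not quite sure how these numbers are generated.\n"
)

body = strip_header(raw_post)
print(body)
```

After this step the sender address and the organization, which are strong but topic-unrelated signals, are no longer visible to the vectorizer.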

We are going to vectorize the emails using two different vectorizers. Vectorizing creates a bag-of-words representation, in which each email becomes a "bag" of words and the grammar and order of the words don't matter. The only thing that matters is the multiplicity of each word (after stripping the stop words).

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
vectors = vectorizer.fit_transform(newsgroup_train.data)

print('Size of the vocabulary ---> ' + str(len(vectorizer.vocabulary_)) + ' words.')

c_vectorizer = CountVectorizer()
count_vectors = c_vectorizer.fit_transform(newsgroup_train.data)

tam = np.max(count_vectors.sum(axis=1))
print('Maximum amount of words in an email ---> ' + str(tam) + ' words.')

Size of the vocabulary ---> 30975 words.
Maximum amount of words in an email ---> 10176 words.


Train a MultinomialNB classifier and check its performance.

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Get the test dataset and vectorize it
newsgroup_test = fetch_20newsgroups(subset='test', categories=cats, remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)
vectors_test = vectorizer.transform(newsgroup_test.data)

# Train the classifier.
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroup_train.target)

# Predict the value of the test dataset.
pred = clf.predict(vectors_test)

# Get some performance metrics.
print('Classification Report de Multinomial Naive Bayes --->')
print(metrics.classification_report(newsgroup_test.target, pred, target_names=newsgroup_test.target_names))
print('Confusion Matrix --->')
print(metrics.confusion_matrix(newsgroup_test.target, pred))
print('')

Classification Report de Multinomial Naive Bayes --->
                        precision    recall  f1-score   support

             rec.autos       0.95      0.87      0.91       396
    rec.sport.baseball       0.95      0.84      0.89       397
      rec.sport.hockey       0.79      0.94      0.86       399
             sci.crypt       0.93      0.91      0.92       396
soc.religion.christian       0.91      0.94      0.93       398

           avg / total       0.91      0.90      0.90      1986

Confusion Matrix --->
[[344   6  27   9  10]
 [  5 335  36  10  11]
 [  4   5 376   6   8]
 [  6   5  19 360   6]
 [  2   2  15   3 376]]
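The precision and recall columns of the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix printed above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
labels = ['rec.autos', 'rec.sport.baseball', 'rec.sport.hockey',
          'sci.crypt', 'soc.religion.christian']

# Confusion matrix of the MultinomialNB classifier, copied from the output above.
confusion = [
    [344,   6,  27,   9,  10],
    [  5, 335,  36,  10,  11],
    [  4,   5, 376,   6,   8],
    [  6,   5,  19, 360,   6],
    [  2,   2,  15,   3, 376],
]

n = len(confusion)
row_totals = [sum(row) for row in confusion]  # true emails per class
col_totals = [sum(confusion[r][c] for r in range(n)) for c in range(n)]  # predictions per class

# Precision: correct predictions of a class / all predictions of that class.
precision = [round(confusion[i][i] / col_totals[i], 2) for i in range(n)]
# Recall: correct predictions of a class / all true emails of that class.
recall = [round(confusion[i][i] / row_totals[i], 2) for i in range(n)]

for name, p, r in zip(labels, precision, recall):
    print('%-24s precision=%.2f recall=%.2f' % (name, p, r))
```

The third column of the matrix is the largest: many emails from other topics are predicted as `rec.sport.hockey`, which is why that class has the lowest precision.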



Now let's train an SGDClassifier and check its performance.

In [14]:
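# Train a new classifier.
# max_iter=5 with tol=None runs exactly 5 epochs; it replaces the n_iter=5
# argument, which recent scikit-learn releases no longer accept.
clf2 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)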
clf2.fit(vectors, newsgroup_train.target)

# Predict the result of the test dataset.
pred2 = clf2.predict(vectors_test)

# Get some performance metrics.
print('Classification Report de SGDClassifier --->')
print(metrics.classification_report(newsgroup_test.target, pred2, target_names=newsgroup_test.target_names))
print('Confusion Matrix --->')
print(metrics.confusion_matrix(newsgroup_test.target, pred2))
print('')

Classification Report de SGDClassifier --->
                        precision    recall  f1-score   support

             rec.autos       0.80      0.93      0.86       396
    rec.sport.baseball       0.91      0.84      0.87       397
      rec.sport.hockey       0.93      0.91      0.92       399
             sci.crypt       0.95      0.86      0.90       396
soc.religion.christian       0.92      0.94      0.93       398

           avg / total       0.90      0.90      0.90      1986

Confusion Matrix --->
[[370   7   4   9   6]
 [ 27 334  23   5   8]
 [ 15  13 362   1   8]
 [ 35   9   1 339  12]
 [ 17   4   1   3 373]]
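Since both reports show very similar averages, a single overall accuracy figure makes the comparison easier. The sketch below derives it from the two confusion matrices printed above (correct predictions on the diagonal divided by all predictions). The helper name `accuracy` is only illustrative.

```python
def accuracy(confusion):
    """Fraction of emails on the diagonal, i.e. classified correctly."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Confusion matrices copied from the outputs above.
confusion_nb = [
    [344,   6,  27,   9,  10],
    [  5, 335,  36,  10,  11],
    [  4,   5, 376,   6,   8],
    [  6,   5,  19, 360,   6],
    [  2,   2,  15,   3, 376],
]
confusion_sgd = [
    [370,   7,   4,   9,   6],
    [ 27, 334,  23,   5,   8],
    [ 15,  13, 362,   1,   8],
    [ 35,   9,   1, 339,  12],
    [ 17,   4,   1,   3, 373],
]

acc_nb = accuracy(confusion_nb)
acc_sgd = accuracy(confusion_sgd)
print('MultinomialNB accuracy ---> %.3f' % acc_nb)
print('SGDClassifier accuracy ---> %.3f' % acc_sgd)
```

On this test set the two classifiers are within about one percentage point of each other, but they make different mistakes: MultinomialNB over-predicts `rec.sport.hockey`, while SGDClassifier over-predicts `rec.autos`.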



Let's try a new approach: we are going to collapse the related topics into a single new category. After that we will retrain the classifiers and measure their new performance.

In [15]:
# Define two sets of categories. One with the unrelated topics and the other with the related ones.
sub_cats1 = ['sci.crypt', 'soc.religion.christian']
sub_cats2 = ['rec.sport.baseball', 'rec.sport.hockey', 'rec.autos']

# Get the training data for the set of categories that are unrelated.
sub_data1_train = fetch_20newsgroups(subset='train', categories=sub_cats1, remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)

# Get the training data for the set of categories that are related.
sub_data2_train = fetch_20newsgroups(subset='train', categories=sub_cats2, remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)

# Collapse the data into one category with name 'rec'.
sub_data2_train.target = np.ones(len(sub_data2_train.target))*2
sub_data2_train.target_names = ['rec']

# Merge the two datasets.
sub_data1_train.target_names.extend(sub_data2_train.target_names)
sub_data1_train.target = np.append(sub_data1_train.target, sub_data2_train.target)
sub_data1_train.filenames = np.append(sub_data1_train.filenames, sub_data2_train.filenames)
sub_data1_train.data.extend(sub_data2_train.data)

# Repeat the above procedure, this time with the test data.
sub_data1_test = fetch_20newsgroups(subset='test', categories=sub_cats1, remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)

sub_data2_test = fetch_20newsgroups(subset='test', categories=sub_cats2, remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)

sub_data2_test.target = np.ones(len(sub_data2_test.target))*2
sub_data2_test.target_names = ['rec']

sub_data1_test.target_names.extend(sub_data2_test.target_names)
sub_data1_test.target = np.append(sub_data1_test.target, sub_data2_test.target)
sub_data1_test.filenames = np.append(sub_data1_test.filenames, sub_data2_test.filenames)
sub_data1_test.data.extend(sub_data2_test.data)

# Create a vectorizer and vectorize the data.
vectorizer2 = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
vectors2 = vectorizer2.fit_transform(sub_data1_train.data)
vectors_test2 = vectorizer2.transform(sub_data1_test.data)

# Create a classifier and train it.
clf3 = MultinomialNB(alpha=.01)
clf3.fit(vectors2, sub_data1_train.target)

# Predict the results of the test data.
pred3 = clf3.predict(vectors_test2)

# Print the performance.
print('Classification report de Multinomial Naives Bayes --->')
print(metrics.classification_report(sub_data1_test.target, pred3, target_names=sub_data1_test.target_names))
print('Confusion Matrix --->')
print(metrics.confusion_matrix(sub_data1_test.target, pred3))
print('')

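# Create the second classifier and train it.
# max_iter=5 with tol=None runs exactly 5 epochs; it replaces the n_iter=5
# argument, which recent scikit-learn releases no longer accept.
clf4 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)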
clf4.fit(vectors2, sub_data1_train.target)

# Predict the results of the test data.
pred4 = clf4.predict(vectors_test2)

# Print the performance.
print('Classification report de SGDClassifier --->')
print(metrics.classification_report(sub_data1_test.target, pred4, target_names=sub_data1_test.target_names))
print('Confusion Matrix --->')
print(metrics.confusion_matrix(sub_data1_test.target, pred4))
print('')



Classification report de Multinomial Naives Bayes --->
                        precision    recall  f1-score   support

             sci.crypt       0.96      0.89      0.92       396
soc.religion.christian       0.94      0.94      0.94       398
                   rec       0.95      0.97      0.96      1192

           avg / total       0.95      0.95      0.95      1986

Confusion Matrix --->
[[ 354    5   37]
 [   3  374   21]
 [  13   20 1159]]

Classification report de SGDClassifier --->
                        precision    recall  f1-score   support

             sci.crypt       1.00      0.62      0.77       396
soc.religion.christian       0.99      0.76      0.86       398
                   rec       0.83      1.00      0.91      1192

           avg / total       0.89      0.88      0.87      1986

Confusion Matrix --->
[[ 247    1  148]
 [   0  303   95]
 [   1    3 1188]]
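An alternative to fetching and merging two datasets is to remap the labels of a single five-topic dataset. The sketch below shows only the remapping step on plain Python lists. The function name `collapse_labels` and its parameters are hypothetical, and it was not used to produce the results above.

```python
def collapse_labels(targets, target_names, prefix='rec.', merged_name='rec'):
    """Merge every class whose name starts with `prefix` into one class.

    Returns the remapped targets and the new list of class names, with the
    merged class placed last.
    """
    new_names = [n for n in target_names if not n.startswith(prefix)]
    new_names.append(merged_name)
    name_to_new = {name: i for i, name in enumerate(new_names)}
    # For each old class index, look up the index of its new class.
    old_to_new = [
        name_to_new[merged_name] if name.startswith(prefix) else name_to_new[name]
        for name in target_names
    ]
    return [old_to_new[t] for t in targets], new_names


target_names = ['rec.autos', 'rec.sport.baseball', 'rec.sport.hockey',
                'sci.crypt', 'soc.religion.christian']
targets = [0, 1, 2, 3, 4, 2, 3]  # toy labels, one per email

new_targets, new_names = collapse_labels(targets, target_names)
print(new_names)
print(new_targets)
```

This yields the same class order as the merged dataset (`sci.crypt`, `soc.religion.christian`, `rec`) and keeps the labels as integers, without touching the documents themselves.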

