<a href="https://colab.research.google.com/github/RaminParker/Text-Classification-with-Python/blob/master/Text_classification_20newsgroups_(2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is part 2 of the news article classification series. Check out [part 1](https://github.com/RaminParker/Text-Classification-with-Python/blob/master/Text_Classification_using_scikit_learn_(1).ipynb).

In [0]:
import os
import string
import numpy as np
import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import TfidfVectorizer

from imblearn.metrics import classification_report_imbalanced
from imblearn.under_sampling import RandomUnderSampler

from sklearn.datasets import fetch_20newsgroups
from collections import Counter

Define the categories you are interested in:

In [0]:
# categories = ['alt.atheism',
#  'comp.graphics',
#  'comp.os.ms-windows.misc',
#  'comp.sys.ibm.pc.hardware',
#  'comp.sys.mac.hardware',
#  'comp.windows.x',
#  'misc.forsale',
#  'rec.autos',
#  'rec.motorcycles',
#  'rec.sport.baseball',
#  'rec.sport.hockey',
#  'sci.crypt',
#  'sci.electronics',
#  'sci.med',
#  'sci.space',
#  'soc.religion.christian',
#  'talk.politics.guns',
#  'talk.politics.mideast',
#  'talk.politics.misc',
#  'talk.religion.misc']

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

In [0]:
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

Look at the data:

In [0]:
print(newsgroups_train.data[0][:80])
print(newsgroups_train.data[1][:80])
print(newsgroups_train.data[2][:80])

From: rych@festival.ed.ac.uk (R Hawkes)
Subject: 3DS: Where did all the texture 
Subject: Re: Biblical Backing of Koresh's 3-02 Tape (Cites enclosed)
From: kmcva
From: Mark.Perew@p201.f208.n103.z1.fidonet.org
Subject: Re: Comet in Temporary O


The dataset already comes split into a training and a test subset. Assign features and labels:

In [0]:
X_train = newsgroups_train.data
X_test = newsgroups_test.data

y_train = newsgroups_train.target
y_test = newsgroups_test.target

In [0]:
X_train[1]

"Subject: Re: Biblical Backing of Koresh's 3-02 Tape (Cites enclosed)\nFrom: kmcvay@oneb.almanac.bc.ca (Ken Mcvay)\nOrganization: The Old Frog's Almanac\nLines: 20\n\nIn article <20APR199301460499@utarlg.uta.edu> b645zaw@utarlg.uta.edu (stephen) writes:\n\n>Seems to me Koresh is yet another messenger that got killed\n>for the message he carried. (Which says nothing about the \n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n>In the mean time, we sure learned a lot about evil and corruption.\n>Are you surprised things have gotten that rotten?\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.\n-- \nThe Old Frog's Almanac - A Salute to That Old Frog Hisse'f, Ryugen Fisher \n     (604) 245-3205 (v32) (604) 245-4366 (2400x4) SCO XENIX 2.3.2 GT \n  Ladysmi

In [0]:
X_train[1].split()[0]  # split the text on whitespace and take the first token

'Subject:'

The data comes as a dictionary-like object:

In [0]:
# newsgroups_train 

In [0]:
print(newsgroups_train.keys())  # get the keys of the dictionary-like object

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [0]:
print(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


# The usual scikit-learn pipeline

In [0]:
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB()) # build pipeline

In [0]:
pipe.fit(X_train, y_train) # fit to data

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
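
To make the first pipeline step less opaque, here is a minimal pure-Python sketch of the weighting that the defaults printed above describe (`lowercase=True`, `smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`, and the same token pattern). The toy corpus and the helper name `tfidf_sketch` are assumptions made for illustration. This is not scikit-learn's implementation, which works on sparse matrices.

```python
import math
import re
from collections import Counter

# Same token pattern as in the pipeline repr: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf_sketch(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: the number of documents each term occurs in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Hypothetical toy corpus, not taken from the newsgroups data.
toy_docs = [
    "The shuttle reached orbit",
    "The renderer draws the texture",
    "The orbit of the comet",
]
vocabulary, rows = tfidf_sketch(toy_docs)
for term, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "the" receives the lowest weight because it occurs in every document, while "shuttle" and "reached" receive the highest because they occur only there.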

In [0]:
y_pred = pipe.predict(X_test)

In [0]:
print(classification_report_imbalanced(y_test, y_pred)) # classification report 

                   pre       rec       spe        f1       geo       iba       sup

          0       0.67      0.94      0.86      0.79      0.90      0.82       319
          1       0.96      0.92      0.99      0.94      0.95      0.90       389
          2       0.87      0.98      0.94      0.92      0.96      0.92       394
          3       0.97      0.36      1.00      0.52      0.60      0.33       251

avg / total       0.87      0.84      0.94      0.82      0.88      0.78      1353
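
The column headers abbreviate precision (`pre`), recall (`rec`), specificity (`spe`), F1 score (`f1`), geometric mean (`geo`), index balanced accuracy (`iba`) and support (`sup`). The sketch below shows how most of these can be derived for a single class in a one-vs-rest fashion. The helper `one_vs_rest_metrics` and the toy labels are hypothetical and only serve as illustration; `iba` is left out.

```python
import math


def one_vs_rest_metrics(y_true, y_pred, label):
    """Precision, recall, specificity, F1 and geometric mean for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    geo = math.sqrt(recall * specificity)
    return {"pre": precision, "rec": recall, "spe": specificity,
            "f1": f1, "geo": geo}


# Hypothetical toy labels, not taken from the newsgroups data.
y_true_toy = [0, 0, 0, 1, 1, 1, 3, 3, 3, 3]
y_pred_toy = [0, 0, 1, 1, 1, 1, 3, 0, 0, 1]
metrics_class_3 = one_vs_rest_metrics(y_true_toy, y_pred_toy, label=3)
print(metrics_class_3)

# Consistency check with the reported row for class 3 (rec=0.36, spe=1.00).
print(round(math.sqrt(0.36 * 1.00), 2))
```

The toy example reproduces the pattern of class #3 in the report: the classifier rarely predicts the class, so precision and specificity are high while recall is low.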



--> The recall of class #3 (`talk.religion.misc`) is low, mainly because of class imbalance: it has the smallest support of the four classes.
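
A quick way to look at the imbalance is to count the labels per class. The sketch below rebuilds a label vector from the supports printed in the report, because the training counts are not shown here; in the notebook itself one would presumably call `Counter(y_train)` instead. The variable names `supports` and `y_from_report` are illustrative assumptions.

```python
from collections import Counter

target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Supports of the test set, copied from the classification report.
supports = [319, 389, 394, 251]
y_from_report = [label for label, n in enumerate(supports) for _ in range(n)]

counts = Counter(y_from_report)
largest = max(counts.values())
for label, n in sorted(counts.items()):
    print(f"{target_names[label]:20s} {n:4d}  ({n / largest:.2f} of largest class)")
```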

# Balancing the classes before classification

To improve the prediction of class #3, balance the classes before training the naive Bayes classifier.

--> Use **RandomUnderSampler** to equalize the number of samples across all classes before training.
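
As a rough illustration of the idea, random undersampling can be sketched in pure Python: every class is cut down to the size of the smallest class by drawing samples without replacement. The function `random_undersample`, its `seed` parameter and the toy data are hypothetical; this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter, defaultdict


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    n_keep = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        # Draw without replacement so no sample is duplicated.
        kept.extend(rng.sample(by_class[label], n_keep))
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data: class 1 is under-represented.
X_toy = [f"doc {i}" for i in range(10)]
y_toy = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
X_res, y_res = random_undersample(X_toy, y_toy)
print(Counter(y_toy))
print(Counter(y_res))
```

Resampling should only affect the training data. The imbalanced-learn pipeline applies samplers during `fit` and skips them during `predict`, which is why `make_pipeline_imb` is used here rather than scikit-learn's `make_pipeline`.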

In [0]:
pipe = make_pipeline_imb(TfidfVectorizer(), RandomUnderSampler(), MultinomialNB())  # build pipeline

In [0]:
pipe.fit(X_train, y_train) # fit to data



Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('randomundersampler',
                 RandomUnderSampler(random_state=None, ratio=None,
                                    replacement=False, 

In [0]:
y_pred = pipe.predict(X_test)  # predict on the test set

In [0]:
print(classification_report_imbalanced(y_test, y_pred))  # classification report

                   pre       rec       spe        f1       geo       iba       sup

          0       0.69      0.91      0.87      0.78      0.89      0.79       319
          1       0.97      0.85      0.99      0.91      0.92      0.83       389
          2       0.96      0.88      0.98      0.92      0.93      0.86       394
          3       0.80      0.73      0.96      0.76      0.84      0.69       251

avg / total       0.87      0.85      0.95      0.85      0.90      0.80      1353



Although the averaged results are almost identical, resampling corrected the poor recall of class #3 (0.36 to 0.73). The cost is lower recall for the other three classes and lower precision for class #3 (0.97 to 0.80). Overall, the results are slightly better: the average F1 score rises from 0.82 to 0.85.
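
The trade-off can be made explicit by comparing the per-class recall of both runs. The values below are copied from the two reports; the unweighted (macro) mean is an extra summary that the reports do not print, since their `avg / total` row appears to be weighted by support.

```python
# Per-class recall copied from the two reports (classes 0 to 3).
recall_before = [0.94, 0.92, 0.98, 0.36]
recall_after = [0.91, 0.85, 0.88, 0.73]

for label, (before, after) in enumerate(zip(recall_before, recall_after)):
    print(f"class {label}: {before:.2f} -> {after:.2f} ({after - before:+.2f})")

# Unweighted mean over the four classes.
macro_before = sum(recall_before) / len(recall_before)
macro_after = sum(recall_after) / len(recall_after)
print(f"macro recall: {macro_before:.2f} -> {macro_after:.2f}")
```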