## Lab 6: Naive Bayes Text Classification

# News Classifiction

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date.

The news are from each of the following 20 newsgroups.

    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc

In [1]:
#import required libraries
import pandas as pd

In [2]:
#import 20 news group dataset from scikit learn datasets
from sklearn.datasets import fetch_20newsgroups

In [3]:
dataset= fetch_20newsgroups()


In [4]:
#The sklearn.datasets.fetch_20newsgroups function is a data fetching / caching
#functions that downloads the data archive from the original 20 newsgroups 
#website, extracts the archive contents in the ~/scikit_learn_data/20news_home 
#folder and calls the sklearn.datasets.load_files on either the training or 
#testing set folder.

#load 20 news group train subset - You need Internet for it/Or place the data file in Anaconda directory
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)
#load 20 news group test subset
newsgroups_test = fetch_20newsgroups(subset='train', shuffle=True)

In [5]:
# print all target labels - dataset.target_names
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [6]:
#prepare a list of categories 'alt.atheism', 'comp.graphics','sci.space'
catlist = ['alt.atheism','comp.graphics','sci.space']


In [7]:
#load 20 news group train subset with three categories 'alt.atheism', 
#'comp.graphics','sci.space' by passing the list to fetch_20newsgroups 
newsgroups_train = fetch_20newsgroups(subset='train', categories=catlist, shuffle=True)

#load 20 news group test subset with three categories 'alt.atheism', 
#'comp.graphics','sci.space' by passing the list to fetch_20newsgroups 
newsgroups_test = fetch_20newsgroups(subset='test', categories=catlist, shuffle=True)

In [8]:
# print new training set target names(labels)
print(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space']


In [9]:
X = newsgroups_train.data
y = newsgroups_train.target

In [10]:
# print shape of targets
#X = pd.DataFrame(X)
#y = pd.DataFrame(y)
#X.shape ,y.shape
len(X), len(y)

(1657, 1657)

In [11]:
#print training set filenames - dataset.filenames
print(newsgroups_train.filenames)

['C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60869'
 'C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38633'
 'C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\alt.atheism\\53534'
 ...
 'C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60915'
 'C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60176'
 'C:\\Users\\RGUKT\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\60929']


In [12]:
# print new training data of 6th article - use dataset.data[index]
print(newsgroups_train.data[5])

From: dietz@cs.rochester.edu (Paul Dietz)
Subject: Commercial mining activities on the moon
Organization: University of Rochester
Lines: 38

In article <1993Apr20.152819.28186@ke4zv.uucp> gary@ke4zv.UUCP (Gary Coffman) writes:

 > be the site of major commercial activity. As far as we know it has no
 > materials we can't get cheaper right here on Earth or from asteroids
 > and comets, aside from the semi-mythic He3 that *might* be useful in low
 > grade fusion reactors.

I don't know what a "low grade" fusion reactor is, but the major
problem with 3He (aside from the difficulty in making any fusion
reactor work) is that its concentration in lunar regolith is just so
small -- on the order of 5 ppb or so, on average (more in some
fractions, but still very small).  Massive amounts of regolith would
have to be processed.

This thread reminds me of Wingo's claims some time ago about the moon
as a source of titanium for use on earth.  As I recall, Wingo wasn't
content with being assured that

In [35]:
#by using countvectorizer convert train data into numeric format considering only 1500 features
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
count_vect1 = CountVectorizer(max_features = 1500)
vectors = count_vect1.fit_transform(newsgroups_train.data)
vectors = vectors.toarray()

In [36]:
# use MultinomialNB(alpha=.01) for training - alpha is Laplace Smoothing 

from sklearn.naive_bayes import MultinomialNB
bnb = MultinomialNB(alpha=0.01)
# fit
bnb.fit(vectors,y)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [37]:
#by using countvectorizer convert test data into numeric format considering only 1500 features
vectors_test = count_vect1.transform(newsgroups_test.data)
vectors_test =vectors_test.toarray() 

In [38]:
#predict target labels for testing set
pred = bnb.predict(vectors_test)
pred.shape

(1102,)

In [39]:
y_test  = newsgroups_test.target
y_test.shape

(1102,)

In [40]:
#find accuacy score on test set
from sklearn import metrics
metrics.accuracy_score(y_test, pred)

0.9428312159709619

In [41]:
# used TfidfVectorizer insted of ContVectorizer and use multinomialNB
#find test set accuracy

count_vect = TfidfVectorizer(max_features = 1500)
vectors_test = count_vect.fit_transform(newsgroups_train.data)
vectors_test =vectors_test.toarray()

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB(alpha=0.01)
# fit
mnb.fit(vectors,y)

#by using Tfidfvectorizer convert test data into numeric format considering only 1500 features
vectors_test = count_vect.transform(newsgroups_test.data)
vectors_test =vectors_test.toarray() 

#predict target labels for testing set
pred = mnb.predict(vectors_test)

#find accuacy score on test set
from sklearn import metrics
metrics.accuracy_score(y_test, pred)


0.9337568058076225

In [51]:
#try with avoiding stopwords and repeat the same


#by using Countvectorizer convert train data into numeric format considering only 1500 features
count_vect = CountVectorizer(max_features = 1500,stop_words='english')
vectors = count_vect.fit_transform(newsgroups_train.data)
vectors =vectors.toarray()

from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB(alpha=0.01)
# fit
mnb.fit(vectors,y)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [52]:

#by using Countvectorizer convert test data into numeric format considering only 1500 features
vectors_test = count_vect.transform(newsgroups_test.data)
vectors_test =vectors_test.toarray() 

#predict target labels for testing set
pred = mnb.predict(vectors_test)

#find accuacy score on test set
from sklearn import metrics
metrics.accuracy_score(newsgroups_test.target, pred)

0.9437386569872959

In [326]:
#Extra - try grid search for count vectorizer, tfidf, different classifiers 


In [327]:
#Extra - try with stemming also