Importing the 20 Newsgroups dataset, which consists of 11314 articles in the training set and 7532 articles in the test set across 20 classes.

In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Next we vectorize the articles in the corpus.
For this we use scikit-learn's CountVectorizer to create a sparse matrix holding the count of each word in each article.
For better results we then reweight those counts by TF-IDF (term frequency times inverse document frequency) using scikit-learn's TfidfTransformer.
Before that, here is the first raw article in the training set.
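
To make the TF-IDF weighting concrete, here is a minimal sketch that recomputes it by hand on a toy corpus and compares the result with TfidfTransformer. The three sentences are made up for illustration, and the formula assumes the transformer's default settings (smoothed idf, L2 row normalisation).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus (hypothetical, not taken from 20 Newsgroups).
docs = ["the car is fast", "the car is red", "the engine is loud"]

counts = CountVectorizer().fit_transform(docs)      # sparse term counts
tfidf = TfidfTransformer().fit_transform(counts)    # defaults: smooth_idf=True, norm='l2'

# Recompute the same weights by hand.
tf = counts.toarray().astype(float)
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                           # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1           # smoothed inverse document frequency
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

matches = np.allclose(manual, tfidf.toarray())
print(matches)
```

Words that occur in every document (here "the" and "is") get the smallest idf, so after weighting they contribute less than rarer words such as "engine".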

In [15]:
print(train.data[0])  # raw text of the first training article

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Count how often each vocabulary word occurs in each article (sparse matrix).
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)

# Reweight the raw counts by TF-IDF.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Print the non-zero (row, column) weights of the first article.
print(X_train_tfidf[0])


  (0, 56979)	0.0574701540748513
  (0, 75358)	0.3538350134970617
  (0, 123162)	0.25970902457356887
  (0, 118280)	0.21186807208281694
  (0, 50527)	0.05461428658858725
  (0, 124031)	0.10798795154169123
  (0, 85354)	0.03696978508816317
  (0, 114688)	0.06214070986309587
  (0, 111322)	0.019156718024950434
  (0, 123984)	0.036854292634593756
  (0, 37780)	0.3813389125949312
  (0, 68532)	0.07325812342131598
  (0, 114731)	0.1444727551278406
  (0, 87620)	0.0356718631408158
  (0, 95162)	0.03447138409326312
  (0, 64095)	0.035420924271313554
  (0, 98949)	0.16068606055394935
  (0, 90379)	0.01992885995664587
  (0, 118983)	0.03708597805061915
  (0, 89362)	0.06521174306303765
  (0, 79666)	0.10936401252414275
  (0, 40998)	0.07801368196918111
  (0, 92081)	0.09913274493911224
  (0, 76032)	0.01921946305222309
  (0, 4605)	0.06332603952480324
  :	:
  (0, 37565)	0.03431760442478462
  (0, 113986)	0.17691750674853085
  (0, 83256)	0.08844382496462175
  (0, 86001)	0.07000411445838192
  (0, 51730)	0.0971474405797672

With the sparse matrix in hand, we now apply classification algorithms to the vectorized articles to predict the classes of the articles in the test dataset.
We start with K-Nearest Neighbours.


In [5]:
from sklearn.pipeline import Pipeline
from sklearn import neighbors
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', neighbors.KNeighborsClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
np.mean(predicted == test.target)

0.6591874668082847

Now we apply a linear Support Vector Machine. SGDClassifier with its default hinge loss fits a linear SVM by stochastic gradient descent.


In [38]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
import warnings
warnings.filterwarnings("ignore")
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
np.mean(predicted == test.target)

0.8516994158258099
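
Because SGDClassifier trains stochastically, the accuracy can shift slightly from run to run. The sketch below spells out the settings that make it a linear SVM and fixes the seed. The toy corpus and the choice `random_state=0` are assumptions made for illustration, not part of the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = cars, 1 = space.
texts = ["engine bumper doors car", "car engine specs",
         "orbit rocket launch", "rocket moon orbit"]
labels = [0, 0, 1, 1]

svm_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # loss="hinge" and penalty="l2" are the defaults, written out here for clarity;
    # random_state pins the shuffling so repeated runs give the same model.
    ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=0)),
])
svm_clf.fit(texts, labels)

sgd = svm_clf.named_steps["clf"]
prediction = svm_clf.predict(["car engine"])
print(sgd.loss, sgd.coef_.shape, prediction)
```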

Now we apply multinomial Naive Bayes.

In [39]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
np.mean(predicted == test.target)


0.7738980350504514

So far we have used the Bag-of-Words technique to vectorize the dataset.
Here we apply the n-grams technique to create the sparse matrix instead. Let's look at an example of how n-grams differ from Bag-of-Words and what they actually do.

In [12]:
ngram_vectorizer = CountVectorizer()
counts = ngram_vectorizer.fit_transform(['Anagh Anagh is Chutiya', 'Anmol is smart'])
print("Bag-of-Words")
print(ngram_vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(counts.toarray().astype(int))

# 'char_wb' builds character n-grams inside word boundaries, padding each word with a space.
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
counts = ngram_vectorizer.fit_transform(['Anagh', 'Anmol'])
print("Bi-grams")
print(ngram_vectorizer.get_feature_names_out().tolist())
print(counts.toarray().astype(int))

# Same analyzer, now with three-character windows.
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
counts = ngram_vectorizer.fit_transform(['Anagh', 'Anmol'])
print("Tri-grams")
print(ngram_vectorizer.get_feature_names_out().tolist())
print(counts.toarray().astype(int))


Bag-of-Words
['anagh', 'anmol', 'chutiya', 'is', 'smart']
[[2 0 1 1 0]
 [0 1 0 1 1]]
Bi-grams
[' a', 'ag', 'an', 'gh', 'h ', 'l ', 'mo', 'na', 'nm', 'ol']
[[1 1 1 1 1 0 0 1 0 0]
 [1 0 1 0 0 1 1 0 1 1]]
Tri-grams
[' an', 'agh', 'ana', 'anm', 'gh ', 'mol', 'nag', 'nmo', 'ol ']
[[1 1 1 0 1 0 1 0 0]
 [1 0 0 1 0 1 0 1 1]]
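
To see where features such as ' a' and 'h ' come from, here is a simplified pure-Python sketch of what the 'char_wb' analyzer does: each word is lowercased, padded with one space on either side, and cut into overlapping windows of n characters. The function name `char_wb_ngrams` is hypothetical, and scikit-learn's own implementation also handles edge cases (such as words shorter than n) that this sketch ignores.

```python
def char_wb_ngrams(text, n):
    """Simplified sketch of character n-grams taken inside word boundaries."""
    grams = []
    for word in text.lower().split():
        padded = " " + word + " "              # one space of padding per side
        for i in range(len(padded) - n + 1):   # slide a window of n characters
            grams.append(padded[i:i + n])
    return grams

bigrams = char_wb_ngrams("Anagh", 2)
trigrams = char_wb_ngrams("Anmol", 3)
print(sorted(set(bigrams)))   # [' a', 'ag', 'an', 'gh', 'h ', 'na']
print(sorted(set(trigrams)))  # [' an', 'anm', 'mol', 'nmo', 'ol ']
```

These are exactly the columns that carry a 1 for 'Anagh' in the bi-gram matrix and for 'Anmol' in the tri-gram matrix above.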


What if we want a Bag-of-2-grams instead, that is, to join every two consecutive words in a document into a single token and then vectorize?

In [60]:
ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2))
counts = ngram_vectorizer.fit_transform(['Today it is Allahabad', 'Tomorrow it will be Prayagraj'])
print("Bag-of-2grams")
print(ngram_vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(counts.toarray().astype(int))

Bag-of-2grams
['be prayagraj', 'is allahabad', 'it is', 'it will', 'today it', 'tomorrow it', 'will be']
[[0 1 1 0 1 0 0]
 [1 0 0 1 0 1 1]]


Now we apply n-grams to our dataset and then run the SVM classification algorithm. We compare the accuracies of uni-gram, bi-gram and tri-gram vectorization at the character level.

In [46]:
text_clf = Pipeline([('vect', CountVectorizer(analyzer='char_wb', ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For uni-grams : ",np.mean(predicted == test.target))

text_clf = Pipeline([('vect', CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For bi-grams : ",np.mean(predicted == test.target))

text_clf = Pipeline([('vect', CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For tri-grams : ",np.mean(predicted == test.target))


For uni-grams :  0.15493892724375996
For bi-grams :  0.6437865108868827
For tri-grams :  0.806558682952735


Now we apply Bag-of-1-grams (the same as Bag-of-Words), Bag-of-2-grams and Bag-of-3-grams to our dataset, which was our actual intention all along.

In [64]:
text_clf = Pipeline([('vect', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For Bag-of-1-grams : ",np.mean(predicted == test.target))

text_clf = Pipeline([('vect', CountVectorizer(analyzer='word', ngram_range=(2, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For Bag-of-2-grams : ",np.mean(predicted == test.target))

text_clf = Pipeline([('vect', CountVectorizer(analyzer='word', ngram_range=(3, 3))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])
text_clf.fit(train.data, train.target)
predicted = text_clf.predict(test.data)
print("For Bag-of-3-grams : ",np.mean(predicted == test.target))

For Bag-of-1-grams :  0.8499734466277217
For Bag-of-2-grams :  0.8021773765268189
For Bag-of-3-grams :  0.7169410515135423
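
The three near-identical pipelines in each comparison can be folded into one helper and a loop. The following is a minimal sketch, not the code that produced the numbers above: the name `ngram_accuracy`, the fixed `random_state=0` and the toy sentences are all assumptions chosen so that the example runs without downloading anything.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

def ngram_accuracy(train_texts, train_labels, test_texts, test_labels, analyzer, n):
    """Fit count -> TF-IDF -> SGD for a single n-gram size and return test accuracy."""
    clf = Pipeline([
        ("vect", CountVectorizer(analyzer=analyzer, ngram_range=(n, n))),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),  # fixed seed is an assumption
    ])
    clf.fit(train_texts, train_labels)
    predicted = clf.predict(test_texts)
    return float(np.mean(predicted == np.asarray(test_labels)))

# Hypothetical toy data: 0 = cars, 1 = space.
toy_train = ["the car engine is loud", "the car doors are small",
             "the rocket reached orbit today", "the moon orbit is stable"]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["the car engine is small", "the rocket orbit is stable"]
toy_test_labels = [0, 1]

results = {}
for n in (1, 2, 3):
    results[n] = ngram_accuracy(toy_train, toy_train_labels,
                                toy_test, toy_test_labels, "word", n)
    print("For Bag-of-%d-grams : " % n, results[n])
```

With the real data, the same loop would be called with train.data, train.target, test.data and test.target, and with analyzer set to 'word' or 'char_wb'.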
