<a href="https://colab.research.google.com/github/RaminParker/Text-Classification-with-Python/blob/master/Text_classification_20newsgroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification: 20newsgroups

This notebook is a personal summary based on the following article by Javed Shaikh: [Machine Learning, NLP: Text Classification using scikit-learn, python and NLTK.](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a)

# Loading the data set

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. 

In [0]:
from sklearn.datasets import fetch_20newsgroups

In [0]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [0]:
twenty_train.target_names  # list all category names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Check what the data looks like:

In [0]:
print(twenty_train.data[0][:80])
print(twenty_train.data[1][:80])
print(twenty_train.data[2][:80])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Pos
From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Ca
From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: PB questions...
Organ


The test set:

In [0]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

In [0]:
print(twenty_test.data[0][:80])
print(twenty_test.data[1][:80])
print(twenty_test.data[2][:80])

From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-
From: Rick Miller <rick@ee.uwm.edu>
Subject: X-Face?
Organization: Just me.
Line
From: mathew <mathew@mantis.co.uk>
Subject: Re: STRONG & weak Atheism
Organizati


# Extracting features from text files

We use a bag-of-words model. Briefly, we split each text file into words, assign each unique word an integer id, and count how many times each word occurs in each document. Each unique word in the vocabulary corresponds to one feature (a descriptive feature).
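
The counting step can be sketched in plain Python. This is only an illustration on a two-sentence toy corpus: the sentences and the helper name `to_count_vector` are made up for this sketch, and real tokenization is more careful than `str.split`.

```python
from collections import Counter

# Toy corpus, invented for this sketch (not part of 20 Newsgroups).
docs = ["the cat sat", "the cat saw the dog"]

# Assign every unique word an integer id (alphabetical order).
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

def to_count_vector(doc):
    """Return the word counts of `doc`, ordered by vocabulary id."""
    counts = Counter(doc.split())
    return [counts[word] for word in unique_words]

count_matrix = [to_count_vector(doc) for doc in docs]

print(vocabulary)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(count_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one word, which is the document-term layout used in the rest of the notebook.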

[What is a TF-IDF Matrix (video)](https://youtu.be/G1bof7UL9RU?t=52)

## Transform Data: Word Count - CountVectorizer

To understand what CountVectorizer does, look at [this very simple example](https://youtu.be/0kPRaYSgblM?t=555)

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)  # learn the vocabulary and build the document-term matrix

In [0]:
X_train_counts.shape # --> Document-Term matrix. [n_samples, n_features]

(11314, 130107)
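
The shape above means 11,314 documents and 130,107 distinct tokens. Almost every entry of such a matrix is zero, which is why `fit_transform` returns a SciPy sparse matrix instead of a dense array. A small sketch on a toy corpus (the two sentences and the `toy_` names are made up for illustration):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this sketch.
toy_docs = ["The cat sat.", "The cat saw the dog."]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Terms ordered by their column index in the matrix.
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(issparse(toy_counts))  # True
print(toy_counts.shape)      # (2, 5)
print(toy_counts.nnz)        # number of stored non-zero counts
print(terms)
print(toy_counts.toarray())  # dense view, only sensible for tiny data
```

By default `CountVectorizer` lowercases the text and drops punctuation, so "The" and "the" end up in the same column.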

## Term Frequency Inverse Document Frequency (TF-IDF)

**Term Frequency:** how often a given word appears within a document.

**Inverse Document Frequency:** a weight that downscales words that appear in many documents.

[Simple example](https://youtu.be/0kPRaYSgblM?t=755)


More specifically:

**TF:** count(word) / total number of words, computed per document.

**TF-IDF:** Finally, we can reduce the weight of very common words (the, is, an, etc.) that occur in nearly all documents. This is called TF-IDF: term frequency times inverse document frequency.

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer

In [0]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [0]:
X_train_tfidf.shape

(11314, 130107)
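
With its default settings, `TfidfTransformer` differs slightly from the textbook definition: it multiplies the raw counts by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below recomputes this by hand on a made-up 2x5 count matrix and compares the result with scikit-learn. It assumes the default parameters (`norm='l2'`, `use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy document-term counts (2 documents, 5 terms), invented for this sketch.
counts = np.array([[1, 0, 1, 0, 1],
                   [1, 1, 0, 1, 2]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # term frequency times idf
# Scale each row (document) to unit L2 norm.
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

A term that occurs in every document gets an idf of exactly 1, so it is kept but never boosted, while rarer terms get a weight above 1.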

# Running ML algorithms

There are various algorithms that can be used for text classification. We will start with the simplest one: Naive Bayes (NB).

(Note: there are many variants of NB)

In [0]:
from sklearn.naive_bayes import MultinomialNB

In [0]:
# Training Naive Bayes (NB) classifier on training data.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target) 
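
To classify new text with separately fitted objects like these, the text has to pass through the same vectorizer and transformer, using `transform` rather than `fit_transform` so that the training vocabulary and idf weights are reused. A minimal sketch on made-up toy data (the documents, labels, and the names `vect`, `tfidf`, and `nb` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy training set, invented for this sketch: 1 = spam, 0 = not spam.
train_docs = ["buy cheap pills now", "cheap pills online",
              "meeting at noon tomorrow", "project meeting notes"]
train_labels = [1, 1, 0, 0]

vect = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(vect.fit_transform(train_docs))
nb = MultinomialNB().fit(X_train, train_labels)

# New documents: transform only, so nothing is re-learned from them.
new_docs = ["cheap pills", "meeting notes"]
X_new = tfidf.transform(vect.transform(new_docs))
print(nb.predict(X_new))  # [1 0]
```

Keeping three fitted objects in sync by hand is error-prone, which is the motivation for the pipeline in the next section.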

# Building a pipeline 

We can write less code and do all of the above, by building a pipeline as follows:

In [0]:
from sklearn.pipeline import Pipeline

In [0]:
# The step names 'vect', 'tfidf' and 'clf' are arbitrary, but they are reused later as parameter prefixes in the grid search.
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())]) # build pipeline

In [0]:
text_clf = text_clf.fit(twenty_train.data, twenty_train.target) # Training Naive Bayes (NB) classifier on training data.
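
A fitted pipeline accepts raw strings directly, because vectorizing and weighting happen inside it. The sketch below shows this on a made-up four-document corpus, reusing two of the newsgroup names as labels. The toy sentences and the name `toy_clf` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch.
toy_docs = ["the rocket reached orbit", "nasa launched a satellite",
            "the goalie made a save", "the team won the hockey game"]
toy_labels = ["sci.space", "sci.space", "rec.sport.hockey", "rec.sport.hockey"]

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
toy_clf.fit(toy_docs, toy_labels)

# predict() takes raw text; each step is applied in order.
print(toy_clf.predict(["satellite in orbit", "hockey team"]))

# Fitted steps remain accessible under the names chosen above.
print('orbit' in toy_clf.named_steps['vect'].vocabulary_)  # True
```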

# Performance of NB Classifier

Now we will test the performance of the NB classifier on the test set.

In [0]:
import numpy as np

In [0]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True) # load the test data

In [0]:
predicted = text_clf.predict(twenty_test.data)

In [0]:
np.mean(predicted == twenty_test.target) # accuracy

0.7738980350504514
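
The `np.mean(predicted == target)` expression is plain accuracy: the fraction of predictions that match the true labels. As a sketch, the same number can be obtained from `sklearn.metrics.accuracy_score`, and `confusion_matrix` shows which classes get mixed up. The label arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for this sketch.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)  # both 0.8

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```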

# Support Vector Machines (SVM)

In [0]:
from sklearn.linear_model import SGDClassifier

In [0]:
# Build a linear SVM pipeline: SGDClassifier with hinge loss fits a linear SVM using stochastic gradient descent.
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42))])

In [0]:
text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)



In [0]:
predicted_svm = text_clf_svm.predict(twenty_test.data)

In [0]:
np.mean(predicted_svm == twenty_test.target) # accuracy

0.8248805098247477

# Grid Search

Fine-tune the hyperparameters!

In [0]:
from sklearn.model_selection import GridSearchCV

Here we create a dictionary of the parameters we would like to tune.
Each parameter name starts with the name of the pipeline step it belongs to (the arbitrary names we chose above), followed by a double underscore and the parameter name.

E.g. vect__ngram_range: here we tell the grid search to try unigrams only as well as unigrams plus bigrams, and to pick whichever performs better.

In [0]:
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3)}
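
The `<step name>__<parameter name>` convention can be checked on the pipeline itself. The sketch below rebuilds the same three-step pipeline under the hypothetical name `pipe`, lists the relevant keys from `get_params()`, and sets one value with `set_params`, which uses the same naming scheme.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
param_names = pipe.get_params().keys()
for name in ('vect__ngram_range', 'tfidf__use_idf', 'clf__alpha'):
    print(name, name in param_names)

# set_params follows the same naming and updates the nested estimator.
pipe.set_params(clf__alpha=1e-2)
print(pipe.named_steps['clf'].alpha)  # 0.01
```

A misspelled key, or a prefix that does not match a step name, makes `set_params` raise an error, so `get_params().keys()` is a quick way to check a grid before a long search.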

Next, we create an instance of the grid search by passing the classifier, the parameters, and n_jobs=-1, which tells it to use all available cores on the machine.


In [0]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [0]:
# This might take a few minutes to run, depending on the machine configuration.
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)




Lastly, to see the best mean cross-validated score and the corresponding parameters, run the following code:

In [0]:
gs_clf.best_score_  # mean cross-validated accuracy of the best parameter combination

0.9157684864695698

In [0]:
gs_clf.best_params_ # optimal parameters

{'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
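
Note that `best_score_` is the mean validation accuracy over the cross-validation folds of the training data, so it is not directly comparable to the test-set accuracies computed earlier. The sketch below illustrates where the number comes from, using a made-up eight-document corpus and `cv=2` (chosen only because the toy data is tiny). The names `docs`, `labels`, `pipe`, `grid`, and `search` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch: 0 = space, 1 = hockey.
docs = ["the rocket reached orbit", "nasa launched a satellite",
        "the shuttle docked in orbit", "a probe left for mars",
        "the goalie made a save", "the team won the hockey game",
        "he scored a goal on the ice", "the puck hit the post"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__alpha': (1e-2, 1e-3)}

search = GridSearchCV(pipe, grid, cv=2).fit(docs, labels)

# One mean validation score per parameter combination (2 x 2 = 4 candidates).
mean_scores = search.cv_results_['mean_test_score']
print(len(mean_scores))
# best_score_ is simply the entry belonging to the best candidate.
print(search.best_score_ == mean_scores[search.best_index_])
print(search.best_params_)
```

For an estimate comparable to the earlier results, the refitted search object can be evaluated on the held-out data, for example with `gs_clf.score(twenty_test.data, twenty_test.target)`.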

## Grid search for SVM

In [0]:
# define parameter range
parameters_svm = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf-svm__alpha': (1e-2, 1e-3)}

In [0]:
# Create an instance of the grid search
gs_clf_svm = GridSearchCV(text_clf_svm, parameters_svm, n_jobs=-1)

In [0]:
# This might take a few minutes to run, depending on the machine configuration.
gs_clf_svm = gs_clf_svm.fit(twenty_train.data, twenty_train.target)



In [0]:
gs_clf_svm.best_score_

0.9047198366213406

In [0]:
gs_clf_svm.best_params_

{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

For further tips, see the final steps in the [article](https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a).