# Text classification in Python

This is a short practical guide to text classification using machine learning.

For this notebook we will use the 20 newsgroups dataset.

We will limit our choice of categories to 4 to get faster execution times.

Scikit-learn has a built-in function to download this dataset.

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

Print the size of the dataset.

In [2]:
len(twenty_train.data)

2257

Print the first few lines of a sample.

In [3]:
print("\n".join(twenty_train.data[13].split("\n")[:10]))

Subject: So what is Maddi?
From: madhaus@netcom.com (Maddi Hausmann)
Organization: Society for Putting Things on Top of Other Things
Lines: 12

As I was created in the image of Gaea, therefore I must
be the pinnacle of creation, She which Creates, She which
Births, She which Continues.

Or, to cut all the religious crap, I'm a woman, thanks.


## Bag of words approach

The `CountVectorizer` component handles text preprocessing and tokenizing, and it can optionally filter stop words (none are removed with the default settings used here).

Once fitted, the vectorizer has built a dictionary that maps each word to a feature index.
The index of a word is the column of the count matrix that stores how often that word occurs in each document.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)
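To make the vocabulary mapping concrete, here is a minimal sketch on a two-sentence toy corpus. The sentences and the `toy_` names are made up for illustration and are not part of the dataset. It shows that a word's index is simply the column where its counts are stored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus (illustrative only, not from 20 newsgroups).
toy_docs = ["the cat sat", "the dog sat on the cat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index in the count matrix.
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab.items(), key=lambda item: item[1]))

# One row per document, one column per vocabulary word.
print(toy_counts.toarray())
```

With the default settings the columns follow alphabetical order of the vocabulary, so the column for `the` holds 1 for the first sentence and 2 for the second.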

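TF-IDF (term frequency-inverse document frequency) reflects how important a word is to a document in a collection of documents. Words that appear in many documents are down-weighted.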

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
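As a rough check of what `TfidfTransformer` computes with its default settings (smoothed IDF followed by L2 normalization of each row), the sketch below reproduces the weights by hand. The count matrix is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 2 documents x 5 terms (illustrative only).
counts = np.array([[1, 0, 0, 1, 1],
                   [1, 1, 1, 1, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # term frequency times IDF
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's default transformer.
sklearn_tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that occur in both documents get the minimum IDF of 1, while terms that occur in only one document are weighted up.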

## Naive Bayes classifier

Let's start off with a baseline example using a Multinomial Naive Bayes classifier.

In [6]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Using the trained classifier we can now predict the category for some example sentences.

In [7]:
docs_new = [
    'God is love', 
    'OpenGL on the GPU is fast', 
    'The human visual cortex is a part of the brain'
]

# calculate tf-idf for the new sentences
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'The human visual cortex is a part of the brain' => sci.med
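Besides the predicted label, the classifier can also report how confident it is. The sketch below uses a tiny made-up corpus so that it runs without downloading anything; the documents, labels and `toy_` names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus with two classes (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "opengl gpu render", "gpu shader render"]
toy_labels = [0, 0, 1, 1]
toy_names = ["religion", "graphics"]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(toy_docs), toy_labels)

new_docs = ["faith and prayer", "gpu render"]
# Reuse the fitted vocabulary: call transform, not fit_transform.
new_counts = toy_vect.transform(new_docs)
toy_pred = toy_nb.predict(new_counts)
toy_proba = toy_nb.predict_proba(new_counts)

for doc, label, proba in zip(new_docs, toy_pred, toy_proba):
    print('%r => %s (p=%.2f)' % (doc, toy_names[label], proba[label]))
```

Words that were not seen during fitting (here `and`) are ignored by `transform`.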


### Building a pipeline

We can simplify the above code by combining the steps into a single pipeline.

In [8]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
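To illustrate that the pipeline is only a convenience wrapper, the following sketch compares it with the manual chain of `fit_transform`, `transform` and `predict` calls. It uses a made-up toy corpus so it runs offline; all `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "opengl gpu render", "gpu shader render"]
toy_labels = [0, 0, 1, 1]
new_docs = ["church faith", "shader gpu"]

# Manual chain: every step is fitted and applied explicitly.
vect = CountVectorizer()
tfidf = TfidfTransformer()
X_toy = tfidf.fit_transform(vect.fit_transform(toy_docs))
nb = MultinomialNB().fit(X_toy, toy_labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline: the same steps behind a single fit/predict interface.
toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
pipe_pred = toy_pipe.fit(toy_docs, toy_labels).predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))
```

The step names (`vect`, `tfidf`, `clf`) are arbitrary labels; they become useful later for addressing parameters of individual steps.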

### Evaluation on the test set

In [9]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)

print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 83.49%
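The accuracy above is just the fraction of predictions that match the true labels. A small sketch with made-up label arrays shows that the `np.mean` expression agrees with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels (illustrative only).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 2])

# Comparing the arrays gives booleans; their mean is the share of correct predictions.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print("Accuracy: %.2f%%" % (manual_accuracy * 100))
print(np.isclose(manual_accuracy, library_accuracy))
```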


## Support Vector Machine classifier

SVM is widely regarded as one of the best text classification algorithms. Here we train a linear SVM with stochastic gradient descent using `SGDClassifier`.

In [10]:
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42)),
])
text_clf.fit(twenty_train.data, twenty_train.target)  

predicted = text_clf.predict(docs_test)

print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 91.28%
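The `loss='hinge'` setting is what makes `SGDClassifier` behave like a linear SVM. As a sketch of the idea, the hinge loss for a label `y` in {-1, +1} and a decision score `s` is `max(0, 1 - y * s)`: it is zero for points classified correctly with a margin of at least 1 and grows linearly otherwise. The labels and scores below are made up for illustration.

```python
import numpy as np

def hinge_loss(y, scores):
    """Per-sample hinge loss for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * scores)

# Made-up labels and decision scores (illustrative only).
y = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.3, -1.5, 0.4])

losses = hinge_loss(y, scores)
print(losses)
```

The second sample is on the correct side but inside the margin, so it still incurs a small loss; the last sample is misclassified and incurs the largest one.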


Print a more detailed performance analysis.

In [11]:
from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



Print the confusion matrix.

In [12]:
metrics.confusion_matrix(twenty_test.target, predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
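The per-class precision and recall in the report can be recomputed from this matrix: rows are true classes and columns are predicted classes, so the diagonal holds the correct predictions. A short sketch using the values printed above:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([[258,  11,  15,  35],
               [  4, 379,   3,   3],
               [  5,  33, 355,   3],
               [  5,  10,   4, 379]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sums: how often each class was predicted
recall = true_pos / cm.sum(axis=1)     # row sums: how often each class truly occurs
accuracy = true_pos.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print("Accuracy: %.2f%%" % (accuracy * 100))
```

The first row explains the low recall for `alt.atheism`: 35 of its 319 documents were assigned to `soc.religion.christian`.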

## Parameter tuning using grid search

Instead of tweaking the parameters of the various components of the chain by hand, we can run an exhaustive search for the best parameters over a grid of possible values.

In [13]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)           
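Parameter names follow the pattern `<step name>__<parameter>`, where the step name is the label given in the pipeline. To see which combinations the search will try, the grid can be expanded with `ParameterGrid`; this sketch only enumerates the candidates and does not fit anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }

grid = ParameterGrid(parameters)
print(len(grid))  # 2 * 2 * 2 candidate settings

for candidate in grid:
    print(candidate)
```

Every candidate is fitted once per cross-validation fold, so the cost of the search grows quickly as more values are added to the grid.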

Example classification (the fitted `GridSearchCV` object behaves like a normal model):

In [14]:
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]

'soc.religion.christian'

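Print the best mean cross-validated score and the parameters that produced it. This score is computed on held-out folds of the training set, so it is not directly comparable to the test set accuracy.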

In [15]:
gs_clf.best_score_

0.96544085068675234

In [16]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
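The scores of all candidates, not just the best one, are available in `cv_results_`. The sketch below shows how to read them on a made-up toy problem. For brevity it assumes a Naive Bayes pipeline with a single tuned parameter and `cv=2`, which differs from the search above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "faith god prayer", "prayer church god",
            "opengl gpu render", "gpu shader render",
            "shader opengl gpu", "render shader opengl"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('clf', MultinomialNB())])
toy_grid = {'clf__alpha': [0.1, 1.0]}
toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2).fit(toy_docs, toy_labels)

# One entry per candidate: its parameters, mean validation score and rank.
results = toy_search.cv_results_
for params, mean_score, rank in zip(results['params'],
                                    results['mean_test_score'],
                                    results['rank_test_score']):
    print(params, mean_score, rank)
```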


In [17]:
predicted = gs_clf.predict(docs_test)

print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 91.28%


## Random forest classifier

Let's see how `RandomForestClassifier` performs on the text data.

In [18]:
from sklearn.ensemble import RandomForestClassifier

forest = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', RandomForestClassifier())
])
forest = forest.fit(twenty_train.data, twenty_train.target)   
predicted = forest.predict(docs_test)
print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 70.64%


As you can see, the random forest does not perform very well on this text data. Tuning its parameters would probably improve the score, but overall it is not the best choice for text classification.
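One parameter worth trying first is the number of trees: older scikit-learn releases default to 10 trees and newer ones to 100. The sketch below shows how to set it inside the pipeline. The value `n_estimators=200` is an arbitrary assumption, not a tuned setting, and the toy corpus is made up so the example runs offline; no claim is made about the resulting accuracy on the 20 newsgroups data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up toy data (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "opengl gpu render", "gpu shader render"]
toy_labels = [0, 0, 1, 1]

tuned_forest = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         # n_estimators=200 is an illustrative guess, not a tuned value.
                         ('clf', RandomForestClassifier(n_estimators=200,
                                                        random_state=42)),
])
tuned_forest.fit(toy_docs, toy_labels)

n_trees = len(tuned_forest.named_steps['clf'].estimators_)
print(n_trees)
```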

## Logistic regression

Let's also try out logistic regression to classify the text.

In [19]:
from sklearn.linear_model import LogisticRegression
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(C=1e5))
])
logreg = logreg.fit(twenty_train.data, twenty_train.target)   
predicted = logreg.predict(docs_test)
print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 92.54%


As you can see, logistic regression produces good results. It tends to use more computing resources, but it is still viable for production applications.
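The `C=1e5` used in the cell above deserves a note: `C` is the inverse of the regularization strength, so such a large value means the weights are barely penalized. The sketch below illustrates the effect by comparing the size of the learned coefficients for `C=1e5` and the default `C=1.0`. The toy corpus, the `coef_norm` helper and `max_iter=1000` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "opengl gpu render", "gpu shader render"]
toy_labels = [0, 0, 1, 1]

def coef_norm(C):
    """Hypothetical helper: fit the pipeline and return the norm of the weights."""
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression(C=C, max_iter=1000))])
    pipe.fit(toy_docs, toy_labels)
    return np.linalg.norm(pipe.named_steps['clf'].coef_)

weak_reg_norm = coef_norm(1e5)     # almost no regularization
default_reg_norm = coef_norm(1.0)  # scikit-learn default

print(weak_reg_norm > default_reg_norm)
```

Weaker regularization lets the weights grow, which can help on large datasets but also raises the risk of overfitting, so `C` is a natural candidate for a grid search.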

## Approaches based on Neural Networks

Basic example using a multilayer perceptron to classify the text.

In [35]:
from sklearn.neural_network import MLPClassifier

mlp = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', MLPClassifier(solver='lbfgs', alpha=0.01,
                                      hidden_layer_sizes=(20,),
                                      random_state=1))
])
mlp = mlp.fit(twenty_train.data, twenty_train.target)   
predicted = mlp.predict(docs_test)
print("Accuracy: %.2f%%" % (np.mean(predicted == twenty_test.target) * 100))

Accuracy: 92.74%
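To get a feel for the network behind `hidden_layer_sizes=(20,)`, the sketch below fits the same kind of pipeline on a made-up three-class toy corpus and inspects the weight matrices: one from the input features to the 20 hidden units, and one from the hidden units to the output classes. The toy data and names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Made-up toy data with three classes (illustrative only).
toy_docs = ["god faith church", "church prayer faith",
            "opengl gpu render", "gpu shader render",
            "doctor patient medicine", "medicine disease doctor"]
toy_labels = [0, 0, 1, 1, 2, 2]

toy_mlp = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MLPClassifier(solver='lbfgs', alpha=0.01,
                                          hidden_layer_sizes=(20,),
                                          random_state=1))])
toy_mlp.fit(toy_docs, toy_labels)

n_features = len(toy_mlp.named_steps['vect'].vocabulary_)
net = toy_mlp.named_steps['clf']

# coefs_[0]: input -> hidden layer, coefs_[1]: hidden layer -> output.
print(n_features, [weights.shape for weights in net.coefs_])
```

On the real dataset the first matrix has one row per vocabulary word, which is where most of the model's parameters are.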


# TODO: CNN, LSTM, Word2Vec, Doc2Vec