scikit-learn classifiers & pipelines; big data apps

In [1]:
import sklearn

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
# The data is downloaded from http://qwone.com/~jason/20Newsgroups/

In [3]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [4]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [5]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

The files themselves are loaded in memory in the data attribute. 

In [6]:
len(twenty_train.data)

2257

For reference the filenames are also available:

In [7]:
len(twenty_train.filenames)

2257

In [15]:
print("\n".join(twenty_train.data[2256].split("\n"))[:5])


From:


In [None]:
print(twenty_train.target_names[twenty_train.target[2250]])

Supervised learning algorithms require a category label for each document in the training set. In this case, the category is the name of the newsgroup, which also happens to be the name of the folder holding the individual documents.
For speed and space efficiency, scikit-learn loads the target attribute as an array of integers, each of which is the index of a category name in the target_names list. The category integer id of each sample is stored in the target attribute:

In [16]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

In [17]:
twenty_train.target[:136]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2, 3, 1, 0, 0, 1, 1, 2, 0, 3, 0, 3, 0,
       3, 1, 1, 1, 3, 3, 2, 2, 2, 3, 2, 3, 2, 3, 0, 0, 0, 1, 3, 0, 1, 1,
       2, 0, 3, 3, 1, 2, 1, 2, 0, 0, 2, 1, 2, 3, 0, 1, 0, 3, 1, 2, 1, 1,
       2, 0, 3, 1, 3, 2, 0, 3, 0, 1, 1, 2, 0, 1, 2, 2, 2, 2, 1, 1, 0, 2,
       1, 2, 0, 1, 1, 3, 1, 0, 1, 2, 1, 0, 0, 3, 0, 2, 3, 0, 3, 1, 0, 2,
       3, 2, 3, 3, 1, 2, 3, 2, 2, 0, 0, 2, 0, 2, 3, 0, 2, 3, 0, 0, 0, 3,
       2, 3, 2, 2])

We can get back the category names as follows:

In [18]:

for t in twenty_train.target[:136]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
soc.religion.christian
comp.graphics
alt.atheism
alt.atheism
comp.graphics
comp.graphics
sci.med
alt.atheism
soc.religion.christian
alt.atheism
soc.religion.christian
alt.atheism
soc.religion.christian
comp.graphics
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
soc.religion.christian
sci.med
soc.religion.christian
sci.med
soc.religion.christian
alt.atheism
alt.atheism
alt.atheism
comp.graphics
soc.religion.christian
alt.atheism
comp.graphics
comp.graphics
sci.med
alt.atheism
soc.religion.christian
soc.religion.christian
comp.graphics
sci.med
comp.graphics
sci.med
alt.atheism
alt.atheism
sci.med
comp.graphics
sci.med
soc.religion.christian
alt.atheism
comp.graphics
alt.atheism
soc.religion.christian
comp.graphics
sci.med
comp.graphics
comp.graphics
sci.med
alt.atheism
soc
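The lookup above can also be summarized rather than printed line by line. The following is a minimal sketch in plain Python, using only the category names and the first ten integer labels shown in the outputs above; the variable names are chosen for illustration.

```python
from collections import Counter

# Category names and the first ten integer labels, copied from the outputs above.
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
target = [1, 1, 3, 3, 3, 3, 3, 2, 2, 2]

# Each integer label is an index into target_names.
names = [target_names[t] for t in target]

# Count how many of these ten documents fall into each category.
counts = Counter(names)
for name, n in counts.most_common():
    print(name, n)
```

Among these ten documents, no alt.atheism label appears, so that category is simply absent from the counts.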

Let us apply what we learnt today!

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
count_vect = CountVectorizer()

In [21]:
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [22]:
X_train_counts.shape

(2257, 35788)

In [23]:
X_train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Can we track down a single word? How?

In [27]:
count_vect.vocabulary_.get(u'athlete')

5726


Is this 5726 one of the 2257 or of the 35788 slots?

In [None]:
count_vect.vocabulary_.get(u'algorithm')
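One way to explore the question is a toy corpus small enough to inspect by hand. The sketch below assumes a hypothetical two-document corpus (not part of the 20 Newsgroups data) and checks what the number stored in vocabulary_ refers to.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, used only for illustration.
toy_docs = ["the cat sat", "the dog sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary terms.
n_docs, n_terms = toy_counts.shape

# vocabulary_ maps each term to its column index in the count matrix.
col = toy_vect.vocabulary_["the"]

print(toy_counts.shape)      # (number of documents, number of distinct terms)
print(0 <= col < n_terms)    # the index addresses a column, not a row
print(toy_counts[1, col])    # occurrences of "the" in the second document
```

In this sketch the value returned by vocabulary_ is a column index, so in the notebook it would point into the 35788 vocabulary slots rather than the 2257 documents.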

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer

In [29]:
tfidf_transformer = TfidfTransformer()

In [30]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [31]:
X_train_tfidf.shape

(2257, 35788)

We can calculate tf only:

In [32]:

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)



In [33]:
X_train_tf = tf_transformer.transform(X_train_counts)


In [34]:
X_train_tf.shape

(2257, 35788)
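To see what the tf-only transformation does to the numbers, here is a small sketch on a hypothetical 2 x 3 count matrix (the values are made up for illustration). With use_idf=False and the default norm='l2', each row of counts is rescaled to unit Euclidean length, so long and short documents become comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical raw counts: 2 documents x 3 terms.
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0]])

# Term frequencies only (no idf weighting); the result is sparse, so densify it.
toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()

# With the default norm='l2', every non-empty row has length 1.
row_norms = np.linalg.norm(toy_tf, axis=1)
print(toy_tf)
print(row_norms)
```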


THE CLASSIFIER

Let us train a naïve Bayes classifier; scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [35]:

from sklearn.naive_bayes import MultinomialNB

In [36]:
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [37]:
docs_new = ["I am interested in NLP and would like to learn more about vectors and vectorizers for language processing and NLP in general.",

"For those who like to experiment with vectors in dealing with various data, it might be useful to work with NLP vectorizers.",

"A significant computational experiment that would involve language will also have to involve vectors, so you will want to look into vectorizers.",

"Even if you are not interested in NLP, but you are dealing with data, you can use vectors to organize your data."]

In [38]:
X_new_counts = count_vect.transform(docs_new)

In [39]:
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [40]:
predicted = clf.predict(X_new_tfidf)

In [41]:
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
   

'I am interested in NLP and would like to learn more about vectors and vectorizers for language processing and NLP in general.' => sci.med
'For those who like to experiment with vectors in dealing with various data, it might be useful to work with NLP vectorizers.' => sci.med
'A significant computational experiment that would involve language will also have to involve vectors, so you will want to look into vectorizers.' => soc.religion.christian
'Even if you are not interested in NLP, but you are dealing with data, you can use vectors to organize your data.' => sci.med
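A predicted label alone does not say how confident the classifier is. As a hedged illustration, the sketch below trains MultinomialNB on a hypothetical four-document corpus (label 0 for graphics, label 1 for medicine; both the texts and the labels are invented) and uses predict_proba to inspect the class probabilities for one new sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = graphics, label 1 = medicine.
toy_docs = ["image pixel render shader", "pixel image graphics card",
            "doctor patient medicine", "patient disease treatment doctor"]
toy_labels = [0, 0, 1, 1]

toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_docs))
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# New text goes through transform (not fit_transform), so the vocabulary
# and idf weights learned from the training data are reused.
X_new = toy_tfidf.transform(toy_vect.transform(["render the pixel image"]))
toy_pred = toy_clf.predict(X_new)[0]
toy_proba = toy_clf.predict_proba(X_new)[0]
print(toy_pred, toy_proba)
```

Words that never occurred in the training data (here "the") are silently ignored by the vectorizer, which is one reason predictions on out-of-domain sentences can look arbitrary.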



THE PIPELINE

In [42]:

from sklearn.pipeline import Pipeline

In [43]:

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

In [44]:

text_clf.fit(twenty_train.data, twenty_train.target) 


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...False,
         use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
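A pipeline bundles the vectorizer, the transformer and the classifier so that fit and predict run all three steps in order. The sketch below, on a hypothetical toy corpus (texts and labels are invented for illustration), checks that a pipeline gives the same predictions as chaining the steps by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = graphics, label 1 = medicine.
toy_docs = ["image pixel render shader", "pixel image graphics card",
            "doctor patient medicine", "patient disease treatment doctor"]
toy_labels = [0, 0, 1, 1]
toy_new = ["pixel shader image", "doctor treatment"]

# Manual chaining: fit each step, then repeat the transforms on new text.
toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
toy_nb = MultinomialNB()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(toy_docs))
toy_nb.fit(X_toy, toy_labels)
manual = toy_nb.predict(toy_tfidf.transform(toy_vect.transform(toy_new)))

# The same three steps wrapped in a pipeline.
toy_pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
toy_pipe.fit(toy_docs, toy_labels)
piped = toy_pipe.predict(toy_new)

print(list(manual), list(piped))
```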



EVALUATING THE PERFORMANCE OF THE PIPELINE


In [45]:
import numpy as np

In [46]:

twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

In [47]:

docs_test = twenty_test.data

In [48]:

predicted = text_clf.predict(docs_test)

In [49]:

np.mean(predicted == twenty_test.target)

0.8348868175765646


That is, we obtained 83.5% accuracy.
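The expression np.mean(predicted == target) works because comparing two arrays yields booleans, and the mean of booleans is the fraction that are True. A small sketch with hypothetical labels for six documents:

```python
import numpy as np

# Hypothetical true and predicted labels for six test documents.
y_true = np.array([0, 1, 2, 2, 3, 1])
y_pred = np.array([0, 1, 2, 3, 3, 0])

# Element-wise comparison: True where the prediction is correct.
matches = (y_pred == y_true)

# True counts as 1 and False as 0, so the mean is the accuracy.
accuracy = np.mean(matches)
print(matches.sum(), "of", len(y_true), "correct; accuracy =", accuracy)
```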


Let us compare this performance with the performance of another classifier.

Let us experiment with a linear support vector machine (SVM):

In [50]:

from sklearn.linear_model import SGDClassifier

In [51]:
text_clf1 = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None))])

In [52]:

text_clf1.fit(twenty_train.data, twenty_train.target) 

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...ty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False))])

In [55]:
predicted1 = text_clf1.predict(docs_test)

In [56]:

np.mean(predicted1 == twenty_test.target)            


0.9127829560585885


Scikit-learn has the tools to evaluate classifiers rigorously and in minute detail:

In [57]:

from sklearn import metrics

In [58]:

print(metrics.classification_report(twenty_test.target, predicted1, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



In [59]:
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])

In [60]:
text_clf.fit(twenty_train.data, twenty_train.target) 

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...False,
         use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [61]:
predicted = text_clf.predict(docs_test)

In [62]:
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

           avg / total       0.88      0.83      0.84      1502




What are precision and recall?


Let us hit this Wiki entry https://bit.ly/2jmIrtX

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_true.
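The definitions above can be worked through in a few lines of plain Python. In this sketch the counts tp, fp and fn are hypothetical, and the helper names precision_recall and f_beta are chosen for illustration; the last line reuses the rounded precision and recall for alt.atheism from the naïve Bayes report above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one class: 8 hits, 2 false alarms, 4 misses.
p, r = precision_recall(tp=8, fp=2, fn=4)
f1 = f_beta(p, r)            # beta = 1: precision and recall weighted equally
f2 = f_beta(p, r, beta=2.0)  # beta = 2: recall weighted more heavily
print(p, r, f1, f2)

# Rounded precision and recall for alt.atheism in the naive Bayes report.
f1_atheism = round(f_beta(0.97, 0.60), 2)
print(f1_atheism)
```

Because recall is lower than precision in the hypothetical counts, the recall-heavy F2 score comes out below F1.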

See more here: https://bit.ly/2hQCvdh 

Source: http://scikit-learn.org 