Ref.: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42) 
# Output: The returned dataset is a scikit-learn “bunch”

In [None]:
print('Categories:')
for i, cat in enumerate(twenty_train.target_names):
    print(i,' : ',cat)

In [None]:
print(len(twenty_train.data)) # 2257
print(len(twenty_train.filenames)) # 2257

In [None]:
print('################ Data ############')
print("\n".join(twenty_train.data[0].split("\n")[:40]))

print('################ Target ############')
print(twenty_train.target_names[twenty_train.target[0]])

In [None]:
# For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:
twenty_train.target[:10]

In [None]:
# It is possible to get back the category names as follows:
[twenty_train.target_names[t] for t in twenty_train.target[:10]]

## Extracting features from text files
Ref.: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Bag of Words
The bag-of-words representation implies that `n_features` is the number of distinct words in the corpus: this number is typically larger than 100,000.

If `n_samples == 10000`, storing X as a numpy array of type `float32` would require `10000 x 100000 x 4 bytes = 4GB` in RAM.

Fortunately, most values in X will be zeros, since any given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.

`scipy.sparse` matrices are data structures that do exactly this, and `scikit-learn` has built-in support for these structures.
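
To make the saving concrete, here is a rough back-of-the-envelope sketch comparing the dense array above with a CSR-style sparse layout. The figure of 1,000 non-zero entries per document and the 4-byte index size are assumptions chosen for illustration, not measurements from this dataset.

```python
# Memory estimate: dense float32 array vs. a CSR-style sparse layout.
n_samples = 10_000
n_features = 100_000
bytes_per_value = 4   # float32
bytes_per_index = 4   # assumed int32 column indices and row pointers

dense_bytes = n_samples * n_features * bytes_per_value

nnz_per_doc = 1_000   # assumed number of distinct words per document
nnz = n_samples * nnz_per_doc

# CSR keeps one value and one column index per non-zero entry,
# plus one row pointer per row and one extra pointer at the end.
sparse_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_samples + 1) * bytes_per_index

print(f"dense : {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

Under these assumptions the sparse layout needs roughly 80 MB instead of 4 GB.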

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

xtrain_counts = count_vect.fit_transform(twenty_train.data)
xtrain_counts.shape # (2257, 35788)

In [None]:
# CountVectorizer supports counts of N-grams of words or consecutive characters. 
# Once fitted, the vectorizer has built a dictionary of feature indices
count_vect.vocabulary_.get('game') # 15077
# The index of a word in the vocabulary is the column of the count matrix that holds that word's counts.
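
As a rough illustration of what the vectorizer builds, the sketch below constructs a vocabulary and a count matrix by hand for two made-up sentences. The `tokenize` helper is hypothetical and only approximates `CountVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters); it is not the library's implementation.

```python
import re
from collections import Counter

# Two made-up documents, used only for illustration.
docs = ["The game was a good game", "OpenGL renders the scene"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: each distinct token gets a column index (alphabetical order here).
distinct = sorted({tok for doc in tokenized for tok in doc})
vocabulary = {tok: i for i, tok in enumerate(distinct)}

# Count matrix as a list of rows: counts[d][j] is how often term j occurs in document d.
counts = []
for doc in tokenized:
    c = Counter(doc)
    counts.append([c.get(tok, 0) for tok in vocabulary])

print(vocabulary)
print(counts)
```

Even in this tiny example several entries are zero, which is the sparsity the previous section describes.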

### From Occurrences to Frequencies
Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for **“Term Frequency times Inverse Document Frequency”**.

Both `tf` and `tf–idf` can be computed as follows:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
# tf_transformer = TfidfTransformer(use_idf=False)
# xtrain_tf = tf_transformer.fit_transform(xtrain_counts)

tfidf_transformer = TfidfTransformer()
xtrain_tfidf = tfidf_transformer.fit_transform(xtrain_counts)
xtrain_tfidf.shape # (2257, 35788)
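
The two steps described above can be sketched by hand on a toy count matrix. This follows the prose definition (counts divided by document length, then multiplied by an inverse document frequency). The smoothed idf formula `ln((1 + n) / (1 + df)) + 1` is, to my understanding, what `TfidfTransformer` uses by default, but the transformer then L2-normalizes each row rather than dividing by document length, so the numbers below are not expected to match its output exactly.

```python
import math

# Toy count matrix (hypothetical data): rows are documents, columns are terms.
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# tf: divide each count by the total number of words in the document.
tf = [[c / sum(row) for c in row] for row in counts]

# df: number of documents in which each term appears at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: terms that occur in every document get weight 1, rarer terms more.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# tf-idf: scale each term frequency by the term's idf weight.
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

for row in tfidf:
    print([round(v, 3) for v in row])
```

The first term occurs in all three documents and keeps a weight of 1, while the two rarer terms are weighted up relative to it.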

### Training a classifier

Now that we have our features, we can train a classifier to try to predict the category of a post.

Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task.

Note: `scikit-learn` includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(xtrain_tfidf, twenty_train.target)
clf

In [None]:
# To predict the category of a new document, we extract features with the same chain as before,
# but call transform instead of fit_transform, since the vectorizer and transformer are already fitted to the training set.
docs_new = ['God is love', 'OpenGL on the GPU is fast', 'I am God loving if not God fearing', 'I am the prince of Asia']
xnew_counts = count_vect.transform(docs_new)
xnew_tfidf = tfidf_transformer.transform(xnew_counts)

preds = clf.predict(xnew_tfidf)
print(preds)
for doc,pred in zip(docs_new, preds):
    cat = twenty_train.target_names[pred] # Converting numeric category to string
    print('%r => %s' % (doc, cat))
'''
[3 1 3 2]
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'I am God loving if not God fearing' => soc.religion.christian
'I am the prince of Asia' => sci.med
'''    

### Building a pipeline
In order to make the vectorizer => transformer => classifier easier to work with, `scikit-learn` provides a `Pipeline` class that behaves like a compound classifier:

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])

# The names vect, tfidf and clf (classifier) are arbitrary. 
# We shall see their use in the section on grid search, below. 
# We can now train the model with a single command:
text_clf.fit(twenty_train.data, twenty_train.target)

### Evaluation of the performance on the test set

In [None]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
preds = text_clf.predict(twenty_test.data)

In [None]:
import numpy as np
acc = np.mean(preds == twenty_test.target) # 0.8348868175765646
print('Accuracy = ', acc)

from sklearn.metrics import classification_report, confusion_matrix
creport = classification_report(twenty_test.target, preds, target_names=twenty_test.target_names)
print(creport)

confusion_matrix(twenty_test.target, preds)

### Evaluation of the performance on the test set using SVM Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
svm_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),
])
svm_clf.fit(twenty_train.data, twenty_train.target)  

preds = svm_clf.predict(twenty_test.data)
acc = np.mean(preds == twenty_test.target)
print('Accuracy = ', acc) # 0.9127829560585885

creport = classification_report(twenty_test.target, preds, target_names=twenty_test.target_names)
print(creport)
confusion_matrix(twenty_test.target, preds)

### Parameter tuning using grid search

Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of the best parameters on a grid of possible values. 

We try out all classifiers on either words or bigrams, with or without idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
             }
gs_clf = GridSearchCV(svm_clf, parameters, n_jobs=-1)

# Let’s perform the search on a smaller subset of the training data to speed up the computation:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400]) # The data was shuffled on load, so the first 400 samples are a random subset

# The object’s best_score_ and best_params_ attributes store the best mean score and 
# the parameter setting corresponding to that score:
print('Best Score : ', gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
'''
Best Score :  0.9
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
'''    
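
To see what "exhaustive" means here, the sketch below enumerates every combination in the grid with `itertools.product`. It only illustrates the search space; it is not how `GridSearchCV` is implemented. Each key has the form `<step name>__<parameter>`, where the step name is the one given in the pipeline.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

# Every combination of one value per parameter is a candidate setting.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

print(len(candidates))
for cand in candidates:
    print(cand)
```

With two values for each of three parameters there are 2 x 2 x 2 = 8 candidate settings, and each one is fitted and scored on every cross-validation fold.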

In [None]:
preds = gs_clf.predict(twenty_test.data)
acc = np.mean(preds == twenty_test.target)
print('Accuracy = ', acc) # 0.859520639147803

creport = classification_report(twenty_test.target, preds, target_names=twenty_test.target_names)
print(creport)
confusion_matrix(twenty_test.target, preds)
'''
Accuracy =  0.859520639147803
                        precision    recall  f1-score   support

           alt.atheism       0.93      0.71      0.80       319
         comp.graphics       0.84      0.96      0.90       389
               sci.med       0.91      0.83      0.87       396
soc.religion.christian       0.80      0.91      0.85       398

           avg / total       0.87      0.86      0.86      1502

array([[225,  12,  14,  68],
       [  2, 373,   8,   6],
       [  5,  45, 329,  17],
       [  9,  13,  12, 364]], dtype=int64)
'''
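
The figures in the report can be recomputed from the confusion matrix alone. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# Confusion matrix copied from the output above: cm[i][j] counts samples
# whose true class is i and whose predicted class is j.
cm = [
    [225, 12, 14, 68],
    [2, 373, 8, 6],
    [5, 45, 329, 17],
    [9, 13, 12, 364],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

report = {}
for i, name in enumerate(labels):
    support = sum(cm[i])                     # samples whose true class is i
    predicted = sum(row[i] for row in cm)    # samples predicted as class i
    precision = cm[i][i] / predicted
    recall = cm[i][i] / support
    report[name] = (precision, recall, support)
    print(f"{name:>22}  precision={precision:.2f}  recall={recall:.2f}  support={support}")

print(f"accuracy = {accuracy:.4f}")
```

The low recall for `alt.atheism` comes mostly from the 68 posts in the first row that were predicted as `soc.religion.christian`.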