## 2.3 Working With Text Data

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroups posts) on twenty different topics. 

In this section we will see how to: 
- load the file contents and the categories
- extract feature vectors suitable for machine learning 
- train a linear model to perform categorization 
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

### 2.3.2 Loading the 20 newsgroups dataset

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 
               'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42)
print(twenty_train.target_names)
print(len(twenty_train.data))
print(len(twenty_train.filenames))
print("\n".join(twenty_train.data[0].split("\n")[:3])) 
print(twenty_train.target_names[twenty_train.target[0]]) 

### 2.3.3 Extracting features from text files
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.
#### Bags of words
The most intuitive way to do so is to use a bags of words representation:
1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
1. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary. 

The number of features equals the number of distinct words in the corpus, so X can be very wide. Fortunately, most values in X will be zeros, since a given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.
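The two steps above, and the memory saving from sparse storage, can be sketched in plain Python. This is only an illustration under simplifying assumptions: the toy `docs` list is made up, words are split on whitespace, and each row is a dict holding just the non-zero counts. It is not how `CountVectorizer` is implemented.

```python
# Toy corpus standing in for the newsgroup posts (hypothetical data).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

# Step 1: assign a fixed integer id to each distinct word.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: count occurrences per document, keeping only non-zero entries.
# sparse_counts[i] maps feature index j -> count of word j in document i.
sparse_counts = []
for doc in docs:
    row = {}
    for word in doc.split():
        j = vocabulary[word]
        row[j] = row.get(j, 0) + 1
    sparse_counts.append(row)

# Compare the size of the dense matrix with what is actually stored.
n_cells = len(docs) * len(vocabulary)
n_stored = sum(len(row) for row in sparse_counts)
print(vocabulary)
print(sparse_counts)
print(f"stored {n_stored} of {n_cells} cells")
```

Even on three short documents most cells of the dense matrix would be zero. The gap widens quickly as the vocabulary grows.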

##### Tokenizing text with scikit-learn


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(count_vect.vocabulary_.get('algorithm'))
X_train_counts.shape

##### From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics. 

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”. 
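A small worked example may make the two definitions concrete. The sketch below uses made-up counts and the textbook idf, log(n_docs / document frequency). That formula is an assumption of this example: `TfidfTransformer` by default uses a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math

# Hypothetical raw counts: one dict per document, word -> occurrences.
counts = [
    {"the": 4, "gpu": 2, "is": 2},
    {"the": 1, "faith": 1},
    {"the": 2, "gpu": 1, "render": 1},
]
n_docs = len(counts)


def term_frequencies(doc_counts):
    """Divide each count by the total number of words in the document."""
    total = sum(doc_counts.values())
    return {word: c / total for word, c in doc_counts.items()}


# Document frequency: the number of documents each word appears in.
doc_freq = {}
for doc_counts in counts:
    for word in doc_counts:
        doc_freq[word] = doc_freq.get(word, 0) + 1


def idf(word):
    """Textbook inverse document frequency (an assumption of this sketch)."""
    return math.log(n_docs / doc_freq[word])


tf = [term_frequencies(doc_counts) for doc_counts in counts]
tfidf = [{w: v * idf(w) for w, v in doc_tf.items()} for doc_tf in tf]

for i, row in enumerate(tfidf):
    print(i, row)
```

The documents have different lengths, yet "the" has the same tf in each of them. Because it appears in every document its idf is zero, so its tf–idf weight vanishes, while rarer words such as "gpu" keep a positive weight.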

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print(X_train_tf[0,177])
X_train_tf.shape

In [None]:
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
print(X_train_tfidf[0,177])
X_train_tfidf.shape

### 2.3.4 Training a classifier


In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf,twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new,predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category])) 

### 2.3.5 Building a pipeline


In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',MultinomialNB())
])
text_clf.fit(twenty_train.data, twenty_train.target)

### 2.3.6 Evaluation of the performance on the test set


In [None]:
import numpy as np 
twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories,
                                 shuffle=True,
                                 random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

Let’s see if we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes).

In [None]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',SGDClassifier(loss='hinge',
                         penalty='l2',
                         alpha=1e-3,
                         random_state=42,
                         max_iter=5,
                         tol=None))
])
text_clf.fit(twenty_train.data,twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(twenty_test.target,predicted,
                            target_names=twenty_test.target_names))
print(confusion_matrix(twenty_test.target,predicted))

### 2.3.7 Parameter tuning using grid search

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1,1),(1,2)],
    'tfidf__use_idf': (True,False),
    'clf__alpha': (1e-2,1e-3)
}
gs_clf = GridSearchCV(text_clf,parameters,cv=5,n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print(twenty_train.target_names[gs_clf.predict(['God is love'])[0]])
print(gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %s" % (param_name,gs_clf.best_params_[param_name]))
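To get a feel for how much work the search does, the grid can be expanded by hand. The sketch below copies the `parameters` dict from the cell above and enumerates every combination with `itertools.product`. The names `cv_folds`, `names` and `candidates` are introduced here for illustration only, and the code does not call scikit-learn.

```python
from itertools import product

# Same grid as in the cell above; keys follow the step__parameter naming.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Build one dict per combination of parameter values.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {cv_folds} folds = "
      f"{len(candidates) * cv_folds} fits")
```

Each candidate is fitted once per fold, so even this small grid triggers dozens of pipeline fits. That is why the search above runs on only the first 400 documents and spreads the work over all cores with `n_jobs=-1`.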