# Text processing - Text Classification pipeline

In this notebook we will practice the following items:
- Applying supervised machine learning to text data, specifically text classification (into topics) using the 20 Newsgroups data
- Getting familiar with the `Pipeline` object



In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

from sklearn import preprocessing
from sklearn import metrics

from sklearn.datasets import fetch_20newsgroups

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

## Text Classification

For this task we will use a dataset called “Twenty Newsgroups”. This is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups (topics). The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

We will use the built-in [dataset loader for 20 newsgroups](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset) from scikit-learn. Our task is to train a classifier that correctly assigns a new post to one of the topics (newsgroups) based on its content. We will use part of the examples provided [here](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier).

In [None]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=12) # use sklearn's method


Let's take a look at some of the documents (feel free to change the document id you look at):

In [None]:
doc_id=11
print(twenty_train.data[doc_id])  # print the selected document
print("its topic id is:", twenty_train.target[doc_id])
print("its topic name is:", twenty_train.target_names[twenty_train.target[doc_id]])

Let's take a look at the 20 topics:

In [None]:
twenty_train.target_names

It's time to turn it into a feature matrix (do you remember how to do it?)

In [None]:
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

Wow! Over 120,000 features! That's too many, and we don't need all of them. Let's limit ourselves to the top 10,000 features (the most frequent terms across the corpus):


In [None]:
count_vect = CountVectorizer(stop_words="english",max_features=10000)
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
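To see what `max_features` does, here is a minimal sketch on a made-up toy corpus (the documents and the names `toy_docs`, `full_vect` and `small_vect` are assumptions for illustration, not part of the 20 newsgroups data). The vectorizer keeps only the terms with the highest total count across the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
toy_docs = [
    "space shuttle launch",
    "space station orbit",
    "hockey game tonight",
    "space hockey",
]

# Without a limit, every distinct term becomes a feature
full_vect = CountVectorizer()
full_vect.fit(toy_docs)
print(sorted(full_vect.vocabulary_))

# With max_features=2, only the two most frequent terms are kept
# ("space" appears 3 times, "hockey" twice, everything else once)
small_vect = CountVectorizer(max_features=2)
small_vect.fit(toy_docs)
print(sorted(small_vect.vocabulary_))
```

Terms dropped this way are simply ignored when a document is transformed, so rare words no longer contribute to the feature matrix.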

That's more reasonable. (You can always test again later what happens if you keep the larger number of features, or reduce the number even more aggressively.)

As seen earlier, it is now recommended to normalize the data into relative frequencies: each document's counts are divided by the total count in that document (L1 normalization).

In [None]:
X_train_normalized = preprocessing.normalize(X_train_counts, norm='l1')
#X_train_normalized.toarray()
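As a small illustration of what the L1 normalization above does, the following sketch applies it to a tiny hand-made count matrix (`toy_counts` is an assumption for illustration; the real `X_train_counts` is a sparse matrix, but `normalize` treats dense input the same way). Each row is divided by its own sum, so the values become relative frequencies:

```python
import numpy as np
from sklearn import preprocessing

# Hand-made term counts: 3 documents (rows) x 3 terms (columns)
toy_counts = np.array([
    [2, 1, 1],
    [0, 3, 0],
    [5, 0, 5],
])

# Divide every row by the sum of its absolute values (L1 norm)
toy_normalized = preprocessing.normalize(toy_counts, norm='l1')
print(toy_normalized)

# Every non-empty row now sums to 1
print(toy_normalized.sum(axis=1))
```

This makes long and short documents comparable, since the classifier sees the proportion of each term rather than its raw count.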

Money time! Time to train the classifier. We will use the Naive Bayes classifier (SVMs also work well for text).

In [None]:
clf = MultinomialNB().fit(X_train_normalized, twenty_train.target)


OK, let's evaluate the model on the test set. But...

Before we run it, we need to pass the test data through the same steps of feature extraction, filtering and normalization (exactly as in the training phase). We have to use the same vectorizer object (otherwise we will get different feature ids). This can get complicated, and that's why we have the `Pipeline` object, which comes to our aid.
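To make the manual route concrete, here is a minimal sketch on a made-up toy corpus (all `toy_*` names and documents are assumptions for illustration). The key point is that the already fitted vectorizer is reused with `transform`, not `fit_transform`, so the test matrix has the same columns as the training matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB

# Toy data, made up for illustration only (0 = space, 1 = hockey)
toy_train_docs = [
    "space shuttle launch",
    "space station orbit",
    "hockey game tonight",
    "hockey team wins game",
]
toy_train_labels = [0, 0, 1, 1]
toy_test_docs = ["shuttle orbit", "hockey tonight"]

# Training phase: fit the vectorizer, normalize, train the classifier
toy_vect = CountVectorizer()
toy_X_train = preprocessing.normalize(toy_vect.fit_transform(toy_train_docs), norm='l1')
toy_clf = MultinomialNB().fit(toy_X_train, toy_train_labels)

# Test phase: reuse the SAME fitted vectorizer with transform() only,
# then apply the same normalization before predicting
toy_X_test = preprocessing.normalize(toy_vect.transform(toy_test_docs), norm='l1')
toy_predicted = toy_clf.predict(toy_X_test)
print(toy_X_train.shape, toy_X_test.shape)
print(toy_predicted)
```

Every preprocessing step has to be repeated by hand in the right order, which is easy to get wrong as the number of steps grows.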


## `Pipeline` Object

In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [None]:

text_clf_nb = Pipeline([
    ('vect', CountVectorizer(stop_words="english",max_features=10000)),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf', MultinomialNB()),
])

The names `vect`, `norm` and `clf` (classifier) are arbitrary. We can use them, for example, to perform a grid search for suitable hyperparameters. We will now train the model with a single command:

In [None]:
text_clf_nb.fit(twenty_train.data, twenty_train.target)
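The step names mentioned above are what let a grid search address the hyperparameters of individual steps, using the `<step name>__<parameter>` convention. The following is a minimal sketch on a made-up toy corpus so that it runs quickly; the documents, the candidate values in `param_grid` and the `toy_*` names are assumptions for illustration, not tuned recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only (0 = space, 1 = hockey)
toy_docs = [
    "space shuttle launch", "space station orbit",
    "rocket launch orbit", "shuttle station space",
    "hockey game tonight", "hockey team wins game",
    "team game score", "hockey score tonight",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('norm', preprocessing.Normalizer(norm='l1')),
    ('clf', MultinomialNB()),
])

# Keys are "<step name>__<parameter name>"
param_grid = {
    'vect__max_features': [5, None],
    'clf__alpha': [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny
search = GridSearchCV(toy_pipeline, param_grid, cv=2)
search.fit(toy_docs, toy_labels)
print(search.best_params_)
```

On the real data the same pattern should apply to `text_clf_nb` with `twenty_train.data` and `twenty_train.target`, at the cost of a longer run time.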

What's next?

Correct, evaluation on the test set. Evaluating the predictive accuracy of the model is equally easy:

In [None]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=12)
docs_test = twenty_test.data
predicted = text_clf_nb.predict(docs_test)
np.mean(predicted == twenty_test.target)


We achieved 64.8% accuracy.
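The accuracy above is computed as the mean of an element-wise comparison between predictions and true labels. As a small sanity check on hand-made label arrays (`toy_true` and `toy_pred` are assumptions for illustration), this gives the same value as `metrics.accuracy_score`:

```python
import numpy as np
from sklearn import metrics

# Hand-made labels: 4 of the 5 predictions are correct
toy_true = np.array([0, 1, 2, 2, 1])
toy_pred = np.array([0, 1, 1, 2, 1])

# Fraction of positions where prediction and truth agree
manual_accuracy = np.mean(toy_pred == toy_true)

# Same quantity via scikit-learn
sklearn_accuracy = metrics.accuracy_score(toy_true, toy_pred)

print(manual_accuracy, sklearn_accuracy)
```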

In [None]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))