# Natural Language Processing

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroup posts) on twenty different topics.

In this section we will see how to:

- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In [1]:
from sklearn.datasets import fetch_20newsgroups

To get faster execution times for this first example, we will work on a partial dataset with only 4 of the 20 categories available in the dataset.

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories)

## Explore the dataset

How many records do we have in the training dataset?

In [4]:
len(twenty_train.data)

2257

In [21]:
print("\n".join(twenty_train.data[50].split("\n")[:5]))

From: ab@nova.cc.purdue.edu (Allen B)
Subject: Re: TIFF: philosophical significance of 42
Organization: Purdue University
Lines: 39



In [22]:
print(twenty_train.target_names[twenty_train.target[50]])

comp.graphics


In [23]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int64)

In [24]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
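
To make the mapping between integer targets and category names more concrete, here is a minimal sketch that counts labels. It uses only the ten targets printed above and assumes the four category names are stored in sorted order, as the outputs above suggest; `sample_targets` and `label_counts` are illustrative names, not part of the dataset object.

```python
from collections import Counter

# Category names in the order implied by the outputs above
# (index 1 -> comp.graphics, 2 -> sci.med, 3 -> soc.religion.christian).
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

# The first ten integer targets, as printed by twenty_train.target[:10].
sample_targets = [1, 1, 3, 3, 3, 3, 3, 2, 2, 2]

# Count how many of the sampled posts fall into each category.
label_counts = Counter(target_names[t] for t in sample_targets)

for name in target_names:
    # A Counter returns 0 for categories that never occur in the sample.
    print(f"{name}: {label_counts[name]}")
```

The same idea should carry over to the full training set by iterating over `twenty_train.target` instead of the ten-element sample.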


## Vectorization

Currently, each post is a raw text string. We need to convert each one into a numeric vector that scikit-learn's models can work with.

We'll do that in three steps using the bag-of-words model:

1. Count how many times each word occurs in each document (term frequency).
2. Weigh the counts so that tokens that are frequent across the corpus get a lower weight (inverse document frequency).
3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm).

Let's begin with the first step:

Each vector will have as many dimensions as there are unique words in the corpus. We will first use scikit-learn's **CountVectorizer**, which converts a collection of text documents to a matrix of token counts.

We can imagine this as a two-dimensional matrix, where one dimension is the entire vocabulary (one row per word) and the other is the documents (one column per post). Note that scikit-learn stores the transpose of this picture: the matrix it returns has one row per document and one column per word.

For example:

<table border="1">
<tr>
<th></th> <th>Document 1</th> <th>Document 2</th> <th>...</th> <th>Document N</th>
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>


Since the vocabulary is large and each document uses only a small fraction of it, we can expect most of the counts to be zero. Because of this, scikit-learn outputs a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix).
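
To make the count matrix less abstract, here is a small sketch on a made-up three-document corpus (not taken from the newsgroups data). The names `toy_docs`, `toy_vect`, and `toy_counts` are illustrative. It shows that the result is sparse, that rows correspond to documents and columns to vocabulary terms, and how to look up a single count.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# A tiny, made-up corpus used only for illustration.
toy_docs = ["the cat sat", "the cat saw the dog", "dogs bark"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary terms.
n_docs, n_terms = toy_counts.shape

# Only non-zero counts are stored; density is the fraction of stored entries.
density = toy_counts.nnz / (n_docs * n_terms)

# Look up the column index of a term, then read its count in the second document.
the_column = toy_vect.vocabulary_["the"]
the_count_in_doc2 = toy_counts[1, the_column]

print(issparse(toy_counts), toy_counts.shape)
print(f"density: {density:.2f}, count of 'the' in second document: {the_count_in_doc2}")
```

With the default settings, tokens are lowercased and single-character tokens are dropped, so the vocabulary here is built from words of two or more characters.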


In [25]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [27]:
count_vect.vocabulary_.get('algorithm')

4690

After the counting, the term weighting and normalization can be done with [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf), using scikit-learn's `TfidfTransformer`.

____
### So what is TF-IDF?
TF-IDF stands for *term frequency-inverse document frequency*, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Typically, the tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

**TF: Term Frequency**, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

*TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).*

**IDF: Inverse Document Frequency**, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little meaning. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

*IDF(t) = log_e(Total number of documents / Number of documents with term t in it).*

See below for a simple example.

**Example:**

Consider a document containing 100 words wherein the word cat appears 3 times. 

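The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm to keep the arithmetic simple, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm used in the formula above, the idf would be ln(10,000) ≈ 9.21 and the weight ≈ 0.28; the choice of base only rescales every weight by the same constant factor.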
____
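
The arithmetic in the example can be checked in a few lines of plain Python. The value 4 corresponds to a base-10 logarithm; the sketch also prints the natural-log variant for comparison. All variable names are illustrative.

```python
import math

# Numbers from the worked example: "cat" occurs 3 times in a 100-word document
# and appears in 1,000 out of 10,000,000 documents.
term_count, doc_length = 3, 100
n_documents, n_documents_with_term = 10_000_000, 1_000

# Term frequency: occurrences divided by document length.
tf = term_count / doc_length

# Inverse document frequency with two different logarithm bases.
# The base only rescales every weight by the same constant factor.
idf_base10 = math.log10(n_documents / n_documents_with_term)
idf_natural = math.log(n_documents / n_documents_with_term)

print(f"tf = {tf:.2f}")
print(f"base-10 idf = {idf_base10:.2f}, tf-idf = {tf * idf_base10:.2f}")
print(f"natural-log idf = {idf_natural:.2f}, tf-idf = {tf * idf_natural:.2f}")
```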

Let's see how we can do this in scikit-learn:

In [43]:
from sklearn.feature_extraction.text import TfidfTransformer
tf = TfidfTransformer()
X_train_tf = tf.fit_transform(X_train_counts)
X_train_tf.shape

(2257, 35788)
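
Note that `TfidfTransformer` does not compute exactly the textbook formula described earlier. With its default settings (`smooth_idf=True`, `norm='l2'`) it uses the raw counts as tf, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. The sketch below reproduces this by hand on a small made-up count matrix and compares the result with the transformer's output; `toy_counts` and the other names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# A made-up count matrix: 6 documents (rows) by 3 terms (columns).
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]])

n_docs = toy_counts.shape[0]

# Document frequency: in how many documents each term occurs at least once.
doc_freq = (toy_counts > 0).sum(axis=0)

# Smoothed idf, as used by the transformer's default settings.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weigh the raw counts, then scale every row to unit L2 length.
weighted = toy_counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Because of the smoothing and the added 1, a term that occurs in every document still gets a non-zero weight, unlike in the plain log(N / df) formula.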

With the posts represented as vectors, we can finally train our classifier. Almost any classification algorithm could be used at this point. For a [variety of reasons](http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf), the Naive Bayes classifier is a good first choice. Later we will also try a support vector machine.

In [30]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB().fit(X_train_tf, twenty_train.target)

In [49]:
docs_new = ['God the father', 'going to church', 'High spec laptop', 'High Priest', 'angry mob']

# Reuse the fitted vectorizer and transformer: call transform, not fit_transform,
# so the new documents are mapped onto the training vocabulary.
new_count = count_vect.transform(docs_new)
new_tf = tf.transform(new_count)

predicted = nb.predict(new_tf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God the father' => soc.religion.christian
'going to church' => soc.religion.christian
'High spec laptop' => comp.graphics
'High Priest' => soc.religion.christian
'angry mob' => soc.religion.christian
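
One thing worth keeping in mind when reading predictions like these: a fitted `CountVectorizer` only counts words that were in its training vocabulary, so a new document made up entirely of unseen words becomes an all-zero vector, and `MultinomialNB` then falls back on the class priors. Whether this is what happens for 'angry mob' above is not verified here; the sketch below only illustrates the mechanism on a made-up three-document corpus, using raw counts without the TF-IDF step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: two "religion" documents and one "graphics" document.
toy_docs = ["god church faith", "church prayer", "graphics card pixel"]
toy_labels = [1, 1, 0]  # 1 = religion, 0 = graphics (hypothetical labels)

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(toy_docs), toy_labels)

# Neither word occurs in the training vocabulary, so the vector is all zeros.
unseen = toy_vect.transform(["angry mob"])

# With no evidence from the features, the more frequent class wins.
unseen_prediction = toy_nb.predict(unseen)[0]
print(unseen.nnz, unseen_prediction)
```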


In [51]:
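from sklearn.pipeline import Pipeline

# Chain the vectorizer, the TF-IDF transformer and the classifier into one estimator,
# so that raw text can be passed directly to fit() and predict().
text_nb = Pipeline([
    ('vect', CountVectorizer()),
    ('tf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])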

In [56]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
                                categories=categories)
text_nb.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...linear_tf=False, use_idf=True)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [57]:
docs_test = twenty_test.data

In [59]:
predict = text_nb.predict(docs_test)
np.mean(predict==twenty_test.target)

0.8348868175765646
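
The expression `np.mean(predict == twenty_test.target)` is the fraction of test posts whose predicted label equals the true label, i.e. the accuracy. As a small sketch on made-up label arrays (`y_true` and `y_pred` are illustrative), the same number can be obtained with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for six documents.
y_true = np.array([0, 1, 2, 3, 3, 1])
y_pred = np.array([0, 1, 2, 3, 1, 1])

# Comparing the arrays gives booleans; their mean is the share of correct predictions.
manual_accuracy = np.mean(y_pred == y_true)

# The library function computes the same quantity.
library_accuracy = accuracy_score(y_true, y_pred)

print(f"{manual_accuracy:.4f} {library_accuracy:.4f}")
```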

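Now let's try a support vector machine. `SGDClassifier` with its default hinge loss trains a linear SVM using stochastic gradient descent.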

In [61]:
from sklearn.linear_model import SGDClassifier

# Same feature extraction steps as before, with a linear SVM as the final step.
text_sv = Pipeline([
    ('vect', CountVectorizer()),
    ('tf', TfidfTransformer()),
    ('svm', SGDClassifier()),
])

In [62]:
text_sv.fit(twenty_train.data, twenty_train.target)



Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])

In [64]:
predictsv = text_sv.predict(docs_test)
np.mean(predictsv==twenty_test.target)

0.9234354194407457
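
The accuracy above may vary slightly between runs: the printed pipeline shows `shuffle=True` with no `random_state` set, so `SGDClassifier` visits the training data in a different order each time. As a hedged sketch on a made-up corpus (the documents, labels, and the helper `fit_svm` are all illustrative), fixing `random_state` makes two fits produce identical weights:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Made-up two-class corpus, used only for illustration.
toy_docs = [
    "god church faith prayer", "church prayer bible", "faith bible god", "prayer god church",
    "graphics card pixel image", "image pixel render", "render graphics card", "pixel card image",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def fit_svm(seed):
    """Fit the text pipeline with a fixed random seed for the SGD shuffling."""
    pipe = Pipeline([
        ('vect', CountVectorizer()),
        ('tf', TfidfTransformer()),
        ('svm', SGDClassifier(loss='hinge', random_state=seed)),
    ])
    return pipe.fit(toy_docs, toy_labels)


first = fit_svm(seed=0)
second = fit_svm(seed=0)

# With the same seed, both runs learn exactly the same weight vector.
same_weights = np.array_equal(first.named_steps['svm'].coef_,
                              second.named_steps['svm'].coef_)
print(same_weights)
```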

In [67]:
from sklearn import metrics

# Per-class precision, recall and F1 for the SVM pipeline.
print(metrics.classification_report(twenty_test.target, predictsv, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.96      0.82      0.89       319
         comp.graphics       0.91      0.96      0.94       389
               sci.med       0.95      0.92      0.93       396
soc.religion.christian       0.89      0.97      0.93       398

             micro avg       0.92      0.92      0.92      1502
             macro avg       0.93      0.92      0.92      1502
          weighted avg       0.93      0.92      0.92      1502
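
The summary rows of the report can be recomputed from the per-class rows. The sketch below copies the rounded recall and support values from the SVM report above (so results agree only up to rounding) and shows the difference between the macro average, which treats every class equally, and the weighted average, which weights each class by its support. The dictionary names are illustrative.

```python
# Per-class recall and support, copied from the SVM report above (rounded values).
recall = {
    'alt.atheism': 0.82,
    'comp.graphics': 0.96,
    'sci.med': 0.92,
    'soc.religion.christian': 0.97,
}
support = {
    'alt.atheism': 319,
    'comp.graphics': 389,
    'sci.med': 396,
    'soc.religion.christian': 398,
}

total = sum(support.values())

# Macro average: plain mean over classes, regardless of class size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class contributes in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(total)
print(f"macro recall: {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The support-weighted recall is the same quantity as the overall accuracy, which is why it lands close to the 0.92 accuracy computed earlier.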



In [69]:
# Per-class precision, recall and F1 for the Naive Bayes pipeline, for comparison.
print(metrics.classification_report(twenty_test.target, predict, target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

             micro avg       0.83      0.83      0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502
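
Both pipelines above use default hyperparameters. A grid search over the pipeline's parameters could look roughly like the following sketch. It runs on a made-up eight-document corpus so that it is self-contained; the corpus, the labels, and the particular parameter values in `param_grid` are assumptions chosen for illustration, not tuned recommendations. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up two-class corpus, used only for illustration.
toy_docs = [
    "god church faith prayer", "church prayer bible", "faith bible god", "prayer god church",
    "graphics card pixel image", "image pixel render", "render graphics card", "pixel card image",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

# Candidate settings for the feature extraction steps and the classifier.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams only, or unigrams plus bigrams
    'tf__use_idf': [True, False],           # with or without idf weighting
    'nb__alpha': [0.1, 1.0],                # smoothing strength of Naive Bayes
}

# Try every combination with 2-fold cross-validation and keep the best one.
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, the same pattern should apply by fitting on `twenty_train.data` and `twenty_train.target`, typically with more folds; after fitting, `search` can be used like any other estimator to call `predict` on the test documents.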

