# Working with Text Data

This section draws heavily from the [official scikit-learn tutorial on text classification](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

Working with text data is a particularly attractive use case for machine learning. It's also often a messy one that can involve working with a lot of boilerplate code. Scikit-Learn provides many features for working with text data.

In this section, we're going to work with the canonical "20 newsgroups" data set.

Newsgroups are like Reddit before Reddit.

From the [web site](http://qwone.com/~jason/20Newsgroups/):


> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups = fetch_20newsgroups(subset='train')

The training portion of the dataset has been downloaded, cached on your machine, and loaded into memory. `fetch_20newsgroups` returns a dictionary-like `Bunch` object whose attributes include `data`, `filenames`, `target`, and `target_names`.

In [None]:
newsgroups.filenames[0]

In [None]:
len(newsgroups.filenames)

In [None]:
len(newsgroups.data)

In [None]:
print(newsgroups.data[0])

Eventually, we'll want to build a classifier that predicts the newsgroup each document came from. The newsgroup names are stored in `target_names`.

In [None]:
newsgroups.target_names

## Bag of Words

First, we need to turn our text into numerical features. A common assumption for doing machine learning on text is what's known as the bag of words assumption: the order in which words occur in a document doesn't matter for discerning the general meaning of the document. Building bag of words features is commonly done in the following steps:

1. Build what's called a *vocabulary*, which is a mapping between integers and the possible words, $w$, in your *corpus*, or collection of documents.
2. Using this *vocabulary*, count how many times each word occurs in each document.

What you're left with is a matrix $X$, where each value $X[i,j]$ is the count of word $j$ in document $i$.
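To make the two steps concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It only illustrates the idea and is not how scikit-learn implements it; the names `toy_corpus`, `words`, `vocabulary`, and `X` are our own.

```python
from collections import Counter

# Toy corpus (made up for illustration).
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

# Step 1: build the vocabulary, mapping each distinct word to a column index.
words = sorted({word for doc in toy_corpus for word in doc.split()})
vocabulary = {word: j for j, word in enumerate(words)}

# Step 2: count each word per document.
# X[i][j] is the count of word j in document i.
X = []
for doc in toy_corpus:
    word_counts = Counter(doc.split())
    X.append([word_counts.get(word, 0) for word in words])

print(vocabulary)
for row in X:
    print(row)
```

Note that "cat" and "cats" end up in different columns. A bag of words model knows nothing about word forms unless you normalize them yourself.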

$X$ is a matrix of dimension `n_documents` by `n_vocabulary`. This is large. Luckily, most words don't occur in every document. If they did, we would not be able to separate the documents according to topics.

For this reason, bag of words documents are often high-dimensional, sparse datasets. We don't need to keep the zeros in memory.
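As a rough sketch of what "not keeping the zeros" means, we can store only the non-zero cells of a small made-up matrix in a dictionary keyed by position. This dictionary-of-keys layout is just one simple option chosen for illustration. The SciPy sparse matrices that scikit-learn returns use more compact formats, but the idea is the same.

```python
# A small, mostly-zero count matrix (made-up values).
dense = [
    [0, 2, 0, 0, 0, 1],
    [0, 0, 0, 3, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

# Keep only the non-zero entries, keyed by (row, column).
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

n_cells = len(dense) * len(dense[0])
density = len(sparse) / n_cells

print(sparse)
print(f"{len(sparse)} of {n_cells} cells stored, density = {density:.2f}")
```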

## Tokenizing Text

Ok, so how do we do this? Text is often really messy: it has punctuation, and it has a bunch of words that every text has to have but that don't necessarily convey topical meaning. These words, such as "the," "a," or "an," are called *stop words*.

We turn human writing into feature vectors by taking care of these issues. The step that splits raw text into individual words, or *tokens*, is called *tokenization*.
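Here is a minimal sketch of that cleanup using only the standard library. The stop-word list is a tiny hypothetical one, and the regular expression (runs of two or more word characters) is just one reasonable choice, similar in spirit to what text vectorizers typically do by default.

```python
import re

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def tokenize(text):
    """Lowercase the text, extract word tokens, and drop stop words."""
    # Tokens are runs of two or more word characters, so punctuation
    # and single-character tokens are discarded.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(tokenize("The GPU is fast, and the price is a bargain!"))
```

The result depends heavily on the stop-word list you pick. Here "and" survives only because it isn't in our small set.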

scikit-learn provides some nice facilities for building a dictionary of features and transforming documents into feature vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorizer = CountVectorizer()

In [None]:
X_train_counts = count_vectorizer.fit_transform(newsgroups.data)

In [None]:
X_train_counts

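Sparse matrices matter here. Stored densely, this matrix would take up the following number of megabytes.

In [None]:
import numpy as np
# Bytes per element times the number of elements, converted to megabytes
X_train_counts.dtype.itemsize * np.prod(X_train_counts.shape, dtype=np.int64) / 1000 ** 2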

The fitted `CountVectorizer` transformer has a `vocabulary_` attribute, a dictionary that maps each term to its feature index.

In [None]:
type(count_vectorizer.vocabulary_)

In [None]:
count_vectorizer.vocabulary_.get('algorithm')

In [None]:
idx = count_vectorizer.vocabulary_.get('algorithm')
X_train_counts[:, idx].sum()

### Occurrences to Frequencies

There's one issue so far: the number of occurrences is correlated with document length, so longer documents get larger counts. Instead, we should look at the *term frequency*, the number of occurrences of a word in a document divided by the total number of words in that document. With $w_{ij}$ the count of word $j$ in document $i$, the term frequency is

$$tf_{ij}=\frac{w_{ij}}{\sum_{j'}w_{ij'}}$$
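A quick worked example shows why this helps. In the sketch below (made-up counts, our own helper name `term_frequencies`), the second document is just the first one repeated twice, so the raw counts differ but the term frequencies are identical.

```python
# Raw counts for two made-up documents over a four-word vocabulary.
toy_counts = [
    [2, 1, 0, 1],   # short document, 4 words in total
    [4, 2, 0, 2],   # same proportions, but twice as long
]

def term_frequencies(row):
    """Divide each count by the row total (L1 normalization)."""
    total = sum(row)
    return [count / total for count in row] if total else list(row)

toy_tf = [term_frequencies(row) for row in toy_counts]
for row in toy_tf:
    print(row)
```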

One way to compute this is to L1-normalize each row of the count matrix.

In [None]:
from sklearn.preprocessing import normalize

tf = normalize(X_train_counts, norm='l1', axis=1)

In [None]:
tf[:3, :].sum(1)

That's great. What's the most frequently used term in one of the documents?

In [None]:
word_idx = tf[1234, :].argmax()

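We can look up the word for this index using `get_feature_names_out`, which returns the terms ordered by feature index. (In scikit-learn versions before 1.0, this method was called `get_feature_names`.)

In [None]:
count_vectorizer.get_feature_names_out()[word_idx]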

Another important concept is *inverse document frequency*, a measure of how informative a word is. Stop words and other words that are common across a corpus will still have a high term frequency. Inverse document frequency downweights terms that occur in many documents and upweights rare ones. The inverse document frequency is

$$idf = \log\left(\frac{N_{\text{documents}}}{N_{\text{documents with term}}}\right)$$

You'll often see the variant

$$idf = \log\left(\frac{N_{\text{documents}}}{1 + N_{\text{documents with term}}}\right)$$

which avoids division by zero in case your vocabulary contains terms that appear in none of your documents.

So tf-idf is

$$\text{tf-idf} = tf \times idf$$
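To see the effect of the weighting, here is a small pure-Python sketch of the textbook definition (the unsmoothed $\log(N/df)$ form) on a made-up tokenized corpus. The names `toy_docs`, `doc_freq`, and `tf_idf` are ours.

```python
import math

# Made-up tokenized corpus.
toy_docs = [
    ["the", "car", "is", "fast"],
    ["the", "gpu", "is", "fast"],
    ["the", "car", "is", "for", "sale"],
]
n_docs = len(toy_docs)

# Document frequency: the number of documents containing each term.
doc_freq = {}
for doc in toy_docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tf_idf(term, doc):
    """Textbook tf-idf: relative frequency times log(N / df)."""
    term_freq = doc.count(term) / len(doc)
    inv_doc_freq = math.log(n_docs / doc_freq[term])
    return term_freq * inv_doc_freq

# "the" appears in every document, so its idf is log(1) = 0.
print(tf_idf("the", toy_docs[0]))
# "gpu" appears in a single document, so it gets a positive weight.
print(tf_idf("gpu", toy_docs[1]))
```

Both words have the same term frequency in their documents, but only the rare one ends up with any weight.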

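scikit-learn actually uses a *slightly* different definition. With its default settings, `TfidfTransformer` computes $idf = \ln\left(\frac{1 + N_{\text{documents}}}{1 + N_{\text{documents with term}}}\right) + 1$, multiplies it by the raw term count, and then scales each document vector to unit Euclidean length.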
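As far as we understand the scikit-learn documentation, the default behavior is a smoothed idf, $\ln\left(\frac{1 + N}{1 + df}\right) + 1$, applied to raw counts and followed by L2 normalization of each document. The sketch below reproduces that recipe by hand on made-up numbers. It is an illustration under those assumptions, not the library's code, and the helper names are ours.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm else list(row)

n_toy_docs = 3
toy_doc_freqs = [3, 2, 1]    # made-up document frequencies for three terms
toy_term_counts = [1, 1, 2]  # made-up raw counts of those terms in one document

idfs = [smoothed_idf(n_toy_docs, df) for df in toy_doc_freqs]
weights = l2_normalize([count * idf for count, idf in zip(toy_term_counts, idfs)])

print(idfs)
print(weights)
```

With this variant, a term that appears in every document gets an idf of exactly 1 rather than 0, so it is downweighted relative to rarer terms but not discarded.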

Of course, scikit-learn provides a transformer for tf-idf.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()

X_tfidf = tfidf.fit_transform(X_train_counts)

## Classifying with Text Data

Let's train a classifier. One of the first off-the-shelf, textbook classifiers for text documents is the Naive Bayes classifier. Naive Bayes is so named because it applies Bayes' theorem to classify samples while relying on a naive (and usually incorrect) assumption that the features are independent given the class.

We estimate the probability that a record with $k$ features $\boldsymbol{w}=(w_1,\dots,w_k)$ belongs to class $C_j$, one of $J$ potential classes, as

$$p(C_j\mid\boldsymbol{w}) = \frac{p(C_j)p(\boldsymbol{w} \mid C_j)}{p(\boldsymbol{w})}$$

Thus, the probability of $\boldsymbol{w}$ belonging to $C_j$ is based on our prior belief about the incidence of $C_j$, the likelihood of observing the data that we did, $\boldsymbol{w}$, given that the true class is $C_j$, and the evidence $p(\boldsymbol{w})$, which is the same for every class and can be ignored.

We are, therefore, interested only in the numerator, which is the joint probability

$$p(C_j,w_1,\dots,w_k)$$

We want to select the class $C_j$ that maximizes this probability. By assuming that the features are independent given the class, this probability is equivalent to

$$p(C_j)\prod_{i=1}^k p(w_i\mid C_j)$$

so the classifier is the solution to

$$\hat{y} = \underset{j \in \{1,\dots,J\}}{\text{argmax}}p(C_j)\prod_{i=1}^k p(w_i \mid C_j)$$

For text documents, we rely on the bag of words assumption and, thus, the event probability model is multinomial. The word frequencies in a document of class $C_j$ are generated by a multinomial distribution with parameter $\boldsymbol{p} = (p_1,\dots,p_k)$, where $p_i = p(w_i \mid C_j)$. As usual, we take logarithms, which turns the product into a sum and avoids numerical underflow, so the classifier becomes

$$\hat{y} = \underset{j \in \{1,\dots,J\}}{\text{argmax}}\left[\log p(C_j) + \sum_{i=1}^k \log p(w_i \mid C_j)\right]$$

where the maximum likelihood estimate for the prior $p(C_j)$ is simply the relative frequency that we observe for each class and, similarly, the MLE for $p(w_i \mid C_j)$ is the relative frequency of each term in each class. We often use Laplace smoothing to avoid assigning zero probability to words that never appear in the training data for a class. So instead of calculating the raw relative frequencies, we assume a uniform prior on all words and add one to their occurrences when counting.
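To tie the formulas together, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing, using only the standard library. The training data and all names (`toy_train`, `log_posterior`, `toy_predict`) are made up for illustration, and ignoring words outside the training vocabulary is our own simplifying assumption.

```python
import math
from collections import Counter, defaultdict

# Made-up training data: (tokens, class label).
toy_train = [
    (["engine", "car", "wheel"], "autos"),
    (["car", "dealer", "price"], "autos"),
    (["orbit", "rocket", "launch"], "space"),
    (["rocket", "engine", "nasa"], "space"),
]

vocab = sorted({token for tokens, _ in toy_train for token in tokens})
class_docs = Counter(label for _, label in toy_train)  # documents per class
term_counts = defaultdict(Counter)                     # term counts per class
for tokens, label in toy_train:
    term_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log p(C_j | w) with add-alpha (Laplace) smoothing."""
    # Log prior: relative frequency of the class in the training data.
    score = math.log(class_docs[label] / len(toy_train))
    # Smoothed denominator: total terms in the class plus alpha per vocabulary word.
    total = sum(term_counts[label].values()) + alpha * len(vocab)
    for token in tokens:
        if token in vocab:  # assumption: words never seen in training are ignored
            score += math.log((term_counts[label][token] + alpha) / total)
    return score

def toy_predict(tokens):
    """Return the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(toy_predict(["car", "price"]))      # words seen only in "autos" documents
print(toy_predict(["rocket", "launch"]))  # words seen only in "space" documents
```

Without the `alpha` term, a word like "car" would have probability zero under "space", and the logarithm would be undefined. Smoothing keeps every class in the running.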

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

clf.fit(X_tfidf, newsgroups.target)

In [None]:
docs = [
    'What kind of car is this',
    'This GPU and RAM in my new computer is awesome',
    'ESA is doing some cool things',
    'This old thing is for sale'
]

In [None]:
X_new = count_vectorizer.transform(docs)

In [None]:
X_tfidf_new = tfidf.transform(X_new)

In [None]:
predictions = clf.predict(X_tfidf_new)

In [None]:
for doc, category in zip(docs, predictions):
    target_name = newsgroups.target_names[category]
    print(f"{doc}: {target_name}")

#### Exercise

Turn the above work from the raw 20 newsgroups data to a Naive Bayes classifier into a Pipeline.

## Going Beyond 

There are a lot of great libraries for working with text data in Python. Two very popular ones are

* [NLTK](http://www.nltk.org/)
* [gensim](https://radimrehurek.com/gensim/)

But there are [many more](https://github.com/keon/awesome-nlp#user-content-python).

### Deep Learning 

We should point out that much of modern NLP and text modeling takes advantage of advances in [Deep Learning](https://github.com/keon/awesome-nlp#deep-learning-for-nlp) that allow estimators to go beyond the bag of words assumption. There are a few examples in the [TensorFlow Tutorials](https://www.tensorflow.org/tutorials/).