# STA 141B Lecture 12

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

### Topics

* Natural Language Processing

### Datasets

### References

* [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
* [Applied Text Analysis with Python][atap], chapters 1, 3, 4
* [Scikit-Learn Documentation][skl], especially the section about [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US
[skl]: https://scikit-learn.org/stable/documentation.html

In [None]:
import numpy as np
import pandas as pd

import nltk
import nltk.corpus

# nltk.download("gutenberg")
# nltk.download("punkt")

In [None]:
alice = nltk.corpus.gutenberg.raw("carroll-alice.txt")

## Strings, String Methods, and Regular Expressions

How does word tokenization actually work?

The simplest strategy is to split at whitespace. You can do this with Python's built-in string methods:

In [None]:
alice.split()

Splitting on whitespace doesn't separate punctuation from words, so punctuation marks stay attached to neighboring tokens. You can use regular expressions to split on more complex patterns. Python's built-in __re__ module provides regular expression functions.

In [None]:
import re

re.split("[ ,.:;!']", alice)

What if we also want to split at newlines?

### Escape Sequences and Raw Strings

In Python strings, backslash `\` marks the beginning of an _escape sequence_. Escape sequences are special codes for writing characters that you can't otherwise type. For example, `\n` is a new line character and `\t` is a tab character.

Since `\` has a special meaning in strings, to write a literal `\` you must use the escape sequence `\\`.

You can see the actual characters in a string by printing the string:
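
As a small illustration, the sketch below builds a made-up string (the variable name `text` is just for this example) and shows it two ways: `repr()` shows the escape sequences as you would type them, while `print()` shows the characters they stand for.

```python
# A string with a newline escape, a tab escape, and an escaped backslash.
text = "first line\n\tindented\\"

# repr() shows the escape sequences; print() shows the actual characters.
print(repr(text))
print(text)

# Each escape sequence counts as a single character in the string.
print(len("\n"), len("\t"), len("\\"))
```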

The regular expression language is independent of Python and also uses backslash `\` to mark the beginning of an escape sequence. Regex escape sequences disable special behavior for characters. For example, `.` matches any character, but `\.` only matches a literal `.`.

As a result, writing a regular expression in an ordinary Python string is awkward. For example, to match a literal `\`, we need to write `\\` in regular expressions, which is `\\\\` in an ordinary Python string.

Python provides _raw strings_, where `\` has no special meaning for Python, to help solve this problem. You can create a raw string by putting an `r` before the starting quote:
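
Here is a minimal sketch comparing an ordinary string with a raw string. The variable names and the `path` value are invented for this example.

```python
import re

ordinary = "\\\\"  # ordinary string: four typed backslashes give two characters
raw = r"\\"        # raw string: two typed backslashes give two characters

# Both spellings produce the same two-character string.
print(ordinary == raw, len(raw))

# In a raw string, the letter n after a backslash is not a newline.
print(len(r"\n"), len("\n"))

# The regex \\ matches one literal backslash in the text.
path = "C:\\temp"
print(re.findall(r"\\", path))
```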

Even raw strings can't end in a single `\` (or any odd number of backslashes); this is a limitation of the Python parser.

### More Regular Expressions

Now we can write a better regular expression to split with:
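
One possible version is sketched below. It uses a short made-up `sample` string in place of `alice` so that the result is easy to read; this is an assumption for illustration, not necessarily the pattern used in lecture. It adds `\n` to the character set, and a `+` so that a run of separators counts as a single split point.

```python
import re

sample = "Down, down, down.\nWould the fall never come to an end!"

# Split at runs of spaces, newlines, and the listed punctuation marks.
pieces = re.split(r"[ ,.:;!'\n]+", sample)

# A separator at the very end of the text leaves an empty string behind,
# so drop empty pieces.
tokens = [p for p in pieces if p]
print(tokens)
```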

The regular expression language includes _character classes_ that describe common sets of characters. The whitespace class `\s` and the word class `\w` are useful here. So to split on any whitespace character:
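
A minimal sketch, again using a made-up `sample` string rather than `alice`. The `+` is added here so that consecutive whitespace characters are treated as one separator.

```python
import re

sample = "Oh dear!\tOh dear!\nI shall be late!"

# \s matches spaces, tabs, newlines, and other whitespace characters.
tokens = re.split(r"\s+", sample)
print(tokens)
```

Notice that the punctuation is still attached to the tokens.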

Capitalizing a character class inverts its meaning, so to split on all non-word characters:
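
A minimal sketch with a made-up `sample` string. One thing to watch for: an apostrophe is a non-word character, so contractions get split in two.

```python
import re

sample = "Oh dear! I shan't be late!"

# \W matches any character that is not a letter, digit, or underscore.
pieces = re.split(r"\W+", sample)

# Drop the empty string left behind by the trailing "!".
tokens = [p for p in pieces if p]
print(tokens)
```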

Rather than splitting the text, you can also approach the problem from the perspective of extracting tokens. The `findall()` function returns all matches for a regular expression:
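
A minimal sketch with a made-up `sample` string. The second pattern is one possible way to keep contractions together; it is an illustration, not a complete tokenizer.

```python
import re

sample = "Oh dear! I shan't be late!"

# Extract runs of word characters.
tokens = re.findall(r"\w+", sample)
print(tokens)

# Also allow apostrophes inside a token, so "shan't" stays in one piece.
with_apostrophes = re.findall(r"[\w']+", sample)
print(with_apostrophes)
```

Unlike `re.split()`, this approach never produces empty strings.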

Tokenizing natural languages is a difficult problem. Some tokenizers work better for certain kinds of documents than others.

Before building your own tokenizer, try the tokenizers included with __nltk__, in the `nltk.tokenize` submodule.

## Standardizing Text

We standardize numerical data in order to make fair comparisons, comparisons that are not influenced by the location and scale of the data. Similarly, you can standardize text (tokens) to make sure comparisons are fair and accurate.

For example, `"Cat"` and `"cat"` are the same word even though they're different tokens. Converting all characters to lowercase is one way to standardize a document.

Some common standardization techniques for text are:

* Lowercasing
* Stemming: Use patterns to remove prefixes and suffixes from words.
* Lemmatization: Look up each token in a dictionary and replace it with a root word. Similar to stemming, but more accurate.
* Stopword Removal: Remove tokens that don't contribute meaning. For example, "the" is meaningless on its own.
* Identifying Outliers: Identify and possibly remove non-standard "words" like numbers, misspellings, code, etc.

How and whether you should standardize a document or corpus depends on what kind of analysis you want to do. There is no formula; you must think carefully and experiment to determine which standardization techniques work best for your problem.

### Lowercasing

You can use Python's string methods for simple text transformations.
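
For example, here is a small sketch that lowercases a made-up list of tokens with the `.lower()` string method:

```python
tokens = ["The", "Cat", "saw", "the", "cat"]

# Lowercase every token so that "Cat" and "cat" are counted as the same word.
lowered = [t.lower() for t in tokens]
print(lowered)

# Number of distinct tokens before and after lowercasing.
print(len(set(tokens)), len(set(lowered)))
```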

### Stemming

_Stemming_ runs an algorithm on each token to remove affixes (prefixes and suffixes). The result is called a _stem_.

Stemming is useful if you want to ignore affixes.

For example, most English verbs use suffixes to mark the tense. We write "They fish" (present) and "They fished" (past). Without any standardization, the tokens "fish" and "fished" would be treated as separate words. Stemming converts both tokens to the common stem "fish":

If you want to stem an entire document, use a list comprehension:
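
To show the idea without any extra downloads, the sketch below uses a deliberately crude, hypothetical `toy_stem()` function that strips a single suffix. It is not the algorithm used by the stemmers in `nltk.stem`. In practice you would call the `.stem()` method of one of those stemmers (for example `PorterStemmer`) in the same list comprehension.

```python
def toy_stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip the suffix if at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["they", "fish", "fished", "fishing", "alice", "was", "tired"]

# Stem every token in the document with a list comprehension.
stems = [toy_stem(t) for t in tokens]
print(stems)
```

Even this toy version shows the typical behavior: "fish", "fished", and "fishing" share a stem, while "tired" becomes "tir", which is not a word.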

Stemmers use a sequence of rules to determine the stem for each token, but natural languages are full of special cases and exceptions. So as you can see in the example above, some stems are not words ("alic"), and sometimes tokens that seem like they should have the same stem don't.

Several different stemmers are provided in the `nltk.stem` submodule.

### Lemmatization

_Lemmatization_ looks up each token in a dictionary to find a root word, or _lemma_.

Lemmatization serves the same purpose as stemming. Lemmatization is more accurate, but requires a dictionary and usually takes longer.

The WordNet lemmatizer requires part of speech information in order to lemmatize words. You can get approximate part of speech information with __nltk__'s `pos_tag()` function.

By default, `pos_tag()` returns Penn Treebank POS tags, which are closely related to the [Brown POS tags][brown]. The lemmatizer uses WordNet POS tags instead, so you can use this function to convert the tags:

[brown]: https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

In [None]:
from nltk.corpus import wordnet

def wordnet_pos(tag):
    """Map a Penn Treebank (or Brown) POS tag to a WordNet POS tag, using the tag's first letter."""
    
    table = {"N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV, "J": wordnet.ADJ}
    
    # Default to a noun.
    return table.get(tag[0], wordnet.NOUN)

The `nltk.stem` submodule also provides several different lemmatizers.

### Stopword Removal

_Stopwords_ are words that appear frequently but don't add meaning.

In English, "the", "a", and "at" are examples. However, exactly which words are stopwords depends on your analysis. Words that are meaningless in one analysis might be very important in others.

You can filter out stopwords with a list comprehension:
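
A minimal sketch, assuming a small hand-picked stopword set (the `stopwords` set and the tokens below are invented for this example):

```python
# A hand-picked set of stopwords; a set makes membership tests fast.
stopwords = {"the", "a", "at", "was"}

tokens = ["the", "cat", "saw", "the", "dog", "at", "a", "park"]

# Keep only the tokens that are not stopwords.
kept = [t for t in tokens if t not in stopwords]
print(kept)
```

Lowercase the tokens first if the stopword list is lowercase, or `"The"` will slip through.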

__nltk__ also provides a stopwords corpus that contains common stopwords for several languages.

In [None]:
nltk.download("stopwords")

## Exploring Documents

A simple way to explore a document is by looking at frequency distributions for tokens.

You can use the `FreqDist()` function to construct a frequency distribution from a list of tokens.

Frequency distribution objects have a few methods to provide summary information.

The `.most_common()` method returns the most common tokens and their frequencies:

A _hapax_ is a token that only occurs once within a document. The `.hapaxes()` method returns the hapaxes:
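
The sketch below illustrates these ideas with `collections.Counter` from the standard library as a stand-in for `FreqDist()`, on a made-up list of tokens. `Counter` also has a `.most_common()` method, but it has no `.hapaxes()` method, so the hapaxes are found with a list comprehension here.

```python
from collections import Counter

tokens = "the cat saw the dog and the dog ran".split()

# Count how many times each token occurs.
freq = Counter(tokens)

# The two most common tokens and their counts.
top = freq.most_common(2)
print(top)

# Hapaxes: tokens that occur exactly once.
hapaxes = [token for token, count in freq.items() if count == 1]
print(hapaxes)
```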

The `.plot()` method displays a plot of word frequencies, sorted from most to least frequent word.

The first parameter controls how many words to display. The second parameter controls whether the plot is cumulative.

## Feature Engineering

Most statistical techniques take numbers as input. You may have already noticed this when working with categorical data. We can't compute the mean, median, standard deviation, or z-score if the observations aren't numbers. While we can fit linear models, it takes extra work because we have to create, or _engineer_, indicator variables.

We face the same problem with natural language data. We need to _quantify_ documents, or turn them into numbers, so that we can use a wider variety of statistical techniques. We can do this by engineering features from our documents.

So: what kinds of features can we create for language data?

### Term Frequencies

One solution is to extend the idea of frequency analysis. We used frequency analysis to study individual documents, but what if we compute the word frequencies for every document in our corpus, and use those frequencies as features?

Let's try this for a small corpus:

In [None]:
corpus = ["The cat saw the dog was angry.", "The dog saw the cat was angry.", "The canary saw the iguana was sad."]


Notice that when we use term frequencies as features, we lose information about the order of the words in each document.

The first and second documents contain the same words, but in different orders. The word frequency features for these two documents are identical.
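
To make this concrete, here is a sketch that computes term frequencies for the small corpus using only the standard library. The `tokenize()` helper and its simple `\w+` pattern are assumptions for this example. The `CountVectorizer` class in `sklearn.feature_extraction.text` builds the same kind of matrix, although its default tokenizer differs slightly.

```python
import re
from collections import Counter

corpus = [
    "The cat saw the dog was angry.",
    "The dog saw the cat was angry.",
    "The canary saw the iguana was sad.",
]

def tokenize(doc):
    """Lowercase a document and extract runs of word characters."""
    return re.findall(r"\w+", doc.lower())

# One Counter of term frequencies per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is every distinct term in the corpus, in sorted order.
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary term.
tf = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in tf:
    print(row)
```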

The __scikit-learn__ package (included with Anaconda) provides functions to help with feature engineering. The `sklearn.feature_extraction.text` submodule is specifically for extracting features from text documents.

One problem with term frequencies is that some terms have high frequencies simply because they appear frequently in the language. These terms can cause documents to appear similar even if they are otherwise different.

While removing stopwords takes care of some high-frequency words, there may also be high-frequency words that have meaning and need to be kept.

### One-hot Encoding

We can avoid emphasis on high-frequency words by ignoring frequency altogether. Instead, we can create indicator variables for individual words. The indicator is 1 if the word appears in the document, and 0 otherwise.

In machine learning, an indicator variable is also called a _one-hot encoding_.

The `sklearn.preprocessing` submodule of __scikit-learn__ provides a function for one-hot encoding.

As with term frequencies, we lose information about the order of the words in the document.
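
The sketch below builds the indicator variables by hand with the standard library, as an illustration of the idea rather than of the __scikit-learn__ function. The tokenization pattern is an assumption for this example.

```python
import re

corpus = [
    "The cat saw the dog was angry.",
    "The dog saw the cat was angry.",
    "The canary saw the iguana was sad.",
]

# The set of distinct terms in each document.
token_sets = [set(re.findall(r"\w+", doc.lower())) for doc in corpus]

# The vocabulary is every distinct term in the corpus, in sorted order.
vocab = sorted(set().union(*token_sets))

# 1 if the term appears in the document, 0 otherwise.
onehot = [[int(term in terms) for term in vocab] for terms in token_sets]

print(vocab)
for row in onehot:
    print(row)
```

The column for "the" is now 1 in every document, even though "the" appears twice in each.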

One-hot encoding is an extreme transformation: every term is equally important. This means terms that are relatively rare or unique still might be underemphasized (this is also a problem for term frequencies).

### Term Frequency-Inverse Document Frequency

_Term frequency-inverse document frequency_ (tf-idf) statistics put terms on approximately the same scale while also emphasizing relatively rare terms. There are [several different tf-idf statistics](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

The _smoothed tf-idf_, for a term $t$ and document $d$, is given by:

$$
\operatorname{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \left( \frac{N}{1 + n_t} \right)
$$

where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$.
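
As a worked example, the sketch below applies exactly this formula to the small three-document corpus from the term frequency section, using raw counts for $\operatorname{tf}(t, d)$. The `tf_idf()` helper and the tokenization pattern are assumptions for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "The cat saw the dog was angry.",
    "The dog saw the cat was angry.",
    "The canary saw the iguana was sad.",
]

# Term counts for each document.
counts = [Counter(re.findall(r"\w+", doc.lower())) for doc in corpus]
N = len(counts)

def tf_idf(term, doc_counts):
    """tf(t, d) * log(N / (1 + n_t)), with tf as the raw count."""
    n_t = sum(1 for c in counts if term in c)
    return doc_counts[term] * math.log(N / (1 + n_t))

print(tf_idf("canary", counts[2]))  # appears in 1 of 3 documents
print(tf_idf("cat", counts[0]))     # appears in 2 of 3 documents
print(tf_idf("the", counts[0]))     # appears in all 3 documents
```

With this variant, a term that appears in every document gets a negative weight, because $N / (1 + n_t) < 1$. Other variants add constants inside or outside the logarithm to avoid this, so results from a library may not match this formula exactly.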

The `sklearn.feature_extraction.text` submodule of __scikit-learn__ provides a function for computing tf-idf:

In long documents or documents with many high-frequency terms, we can further reduce the emphasis on these terms by taking the logarithm of the term frequency. To do this, set `sublinear_tf = True` in the `TfidfVectorizer()` function.

## The Bag-of-words Model

The one-hot encoding, term frequencies, and TF-IDF scores all ignore word order.

The _bag-of-words model_ assumes that the order of words in a document doesn't matter. Imagine taking the words in each document and dumping them into a bag, where they get all mixed up. Note that in this case "model" means a way of thinking about a problem, not a statistical model.

While the order of words in a document might seem important, the bag-of-words model is surprisingly useful. The bag-of-words model is a good place to start if you want to use statistical methods on language data.

## Measuring Similarity

We can measure the _similarity_ of two documents by computing the distance between their term frequency vectors. There are many different ways we can measure distance:

* Minkowski distance, a family of distances that includes Euclidean distance and Manhattan distance.
* Mahalanobis distance, a generalization of the Euclidean distance between z-scores that also accounts for correlation between variables.
* Cosine similarity, the cosine of the angle between two vectors (the cosine distance is one minus this value). See [here](https://stats.stackexchange.com/a/235676/29695) for an explanation of how cosine similarity is related to correlation.
* And others...

The cosine similarity (one minus the cosine distance) often works well for language data. The cosine similarity between two vectors $a$ and $b$ is defined as:

$$
\frac{a \cdot b}{\Vert a \Vert \Vert b \Vert}
$$

where $\Vert \cdot \Vert$ is the Euclidean norm.

The `TfidfVectorizer()` function already divides the returned tf-idf vectors by their Euclidean norms, so we can compute cosine similarity as a simple dot product:
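
The sketch below shows the same idea with the standard library only. As an assumption for this example, it uses hand-typed term frequency vectors for the small three-document corpus instead of `TfidfVectorizer()` output, and normalizes them with a hypothetical `normalize()` helper.

```python
import math

def normalize(v):
    """Divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Term frequency vectors over the vocabulary
# angry, canary, cat, dog, iguana, sad, saw, the, was
doc1 = [1, 0, 1, 1, 0, 0, 1, 2, 1]  # "The cat saw the dog was angry."
doc2 = [1, 0, 1, 1, 0, 0, 1, 2, 1]  # "The dog saw the cat was angry."
doc3 = [0, 1, 0, 0, 1, 1, 1, 2, 1]  # "The canary saw the iguana was sad."

# After normalizing, the dot product is the cosine of the angle.
sim_12 = dot(normalize(doc1), normalize(doc2))
sim_13 = dot(normalize(doc1), normalize(doc3))
print(sim_12, sim_13)
```

The first two documents have identical vectors, so their similarity is 1 (up to rounding). The first and third share only "saw", "the", and "was", which gives 6/9.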

## The n-gram Model

Remember how the bag-of-words model ignores word order?

We can extend the model to keep some order by taking sequences of words instead of individual words. Sequences of two or three words are called _bigrams_ and _trigrams_, respectively. A sequence of $n$ words is called an _n-gram_.

__nltk__ provides functions to extract n-grams:
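
To show what an n-gram extractor returns, here is a sketch with a hypothetical `simple_ngrams()` helper written with the standard library. It is an illustration of the idea, not the __nltk__ implementation.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip the token list with copies of itself shifted by 1, 2, ..., n - 1.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the cat saw the dog".split()

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```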

Notice that a separate n-gram was computed for each word in the original document. So for the bigrams in the example, we get every pair of words that appears in the document.

The n-gram model assumes that nearby words have the strongest effect on the meaning of each word.

We can use n-grams to identify phrases that are particularly common in a document. We can also use the n-gram model to engineer features, the same way we used the bag-of-words model. That is, we can compute frequencies, one-hot encodings, TF-IDFs, and other features:
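
For instance, the sketch below computes bigram frequencies for the first two documents of the small corpus. The `tokenize()` and `bigrams()` helpers are assumptions for this example. The two documents have identical word frequencies, but their bigram frequencies differ.

```python
import re
from collections import Counter

corpus = [
    "The cat saw the dog was angry.",
    "The dog saw the cat was angry.",
]

def tokenize(doc):
    """Lowercase a document and extract runs of word characters."""
    return re.findall(r"\w+", doc.lower())

def bigrams(tokens):
    """Return every pair of consecutive tokens."""
    return list(zip(tokens, tokens[1:]))

unigram_counts = [Counter(tokenize(doc)) for doc in corpus]
bigram_counts = [Counter(bigrams(tokenize(doc))) for doc in corpus]

# Same bag of words, different bags of bigrams.
print(unigram_counts[0] == unigram_counts[1])
print(bigram_counts[0] == bigram_counts[1])

# Bigrams that appear in the first document but not the second.
only_first = sorted(set(bigram_counts[0]) - set(bigram_counts[1]))
print(only_first)
```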