# Corpus Linguistics with Python

This session will demonstrate how Python can be used for corpus linguistics. We will work primarily with the **Natural Language Toolkit (NLTK)** library (http://www.nltk.org), which is an excellent platform for examining human language data favoured by teachers of computational linguistics. It has built-in corpora, lexical resources, and a comprehensive array of text processing libraries.

After a quick review of relevant Python concepts, we will explore NLTK's core corpus linguistics functions: tokenization, frequency distributions of keywords, part-of-speech tagging, stemming, lemmatization, n-grams, collocations, and concordances. This will allow for a descriptive understanding of the corpus (word categories, counts, and contexts), which sets the stage for the detection of themes via topic modelling.

We will also examine text preprocessing features offered by **scikit-learn** (http://scikit-learn.org), a user-friendly Python library for machine learning. It offers a different approach to tokenization, and uses these tokens to transform documents into numerical vectors. Such vectors serve as input features for text classification algorithms.

We will apply these functions to a few forum posts from an online newsgroups dataset, and examine them individually as well as comparatively.

## Review of basic Python

Let's start by reviewing some fundamental concepts, functions, and data structures in Python that will be used throughout the workshop.

For a more thorough introduction to Python programming for the humanities, I recommend Folgert Karsdorp's tutorial: http://www.karsdorp.io/python-course/

**Note**: I cannot overstate the importance of search engines in the programming process, and of sites such as Stack Overflow and Quora. More often than not, coding is 80% Googling for the solution and 20% adapting search results to your specific problem. Chances are that someone else has had your exact problem or something very similar to it. It's important to be familiar with basic concepts, but there is no need to memorise complex commands.

### Numbers, strings, and variables

Numbers and strings are two of Python's built-in data types: they are the basic ingredients of the language. **Numbers** can be integers (whole numbers such as `12` and `1000`) or floats (numbers with decimal points such as `3.1415`). **Strings** are sequences of text characters and are always contained within either single `' '` or double quotes `" "` (but not a mixture of the two). 

All data types in Python are implemented as **objects** - think of them as nouns. The *type* of the object determines what actions can be done to it: numbers can, e.g., be added to each other, whereas strings have to be concatenated. Note that a string and a number cannot be combined with `+`; one of them must first be converted to the other's type.

We can store data types in **variables**, which require a name (label). We use the equal sign `=` to assign values to named variables.

In [None]:
1 + 1

In [None]:
'Hello' + 'Python'

In [None]:
'Hello ' + 'Python'

In [None]:
'Hello ' + 1 # This raises a TypeError: a string and a number cannot be concatenated.

In [None]:
'Hello ' + '1'
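
The `TypeError` raised by `'Hello ' + 1` can be avoided by converting the number to a string first. Here is a minimal sketch of two common ways to do this; the variable names are made up for illustration.

```python
count = 1

# Convert the integer explicitly with str() before concatenating.
greeting = 'Hello ' + str(count)

# Or let an f-string do the conversion.
greeting_f = f'Hello {count}'

print(greeting)
print(greeting_f)
print(int('12') + 1) # Going the other way: string to number.
```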

In [None]:
message = 'Hello Python!'

### Functions and methods

**Functions** in Python perform actions - think of them as verbs. They are denoted by parentheses `()`. `print()` is the most basic function in Python, and is pretty self-explanatory: it prints input values directly to the screen.

You can check the type of a variable with the `type()` function, and you can count the number of characters in a string with the `len()` function.

**Methods** are a subcategory of functions that apply to specific objects (data types). They are called using dot notation: `object.method()`. Python has a number of useful *string methods*:
- `.upper()` makes everything uppercase.
- `.lower()` makes everything lowercase.
- `.count()` adds up the number of times a character/character sequence appears.
- `.find()` returns the position at which a character/character sequence first appears, or `-1` if it is not found (note that Python counting starts at 0, not 1).
- `.replace()` changes one character/character sequence for another.

In [None]:
print('Hello Python!')

In [None]:
print(message)

In [None]:
type(message)

In [None]:
len(message)

In [None]:
message.upper()

In [None]:
message.lower()

In [None]:
message.count('o')

In [None]:
message.find('Python')

In [None]:
message.replace('Python', 'Yin') # Put your name here!

### Lists

Lists are a very useful data type in Python, and should be used to store values when order matters. They are declared using brackets `[]` and indexed starting at 0.

Lists are highly *mutable*: you can add to them using the `.append()` method, remove from them using the `.remove()` method, change their values through their indexes, and slice them. For a full list of list methods (no pun intended!), see the Python documentation: https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

The `len()` function can be used on lists to calculate their length, and the `sorted()` function can be used to sort their elements in alphabetical order.

In [None]:
best_cities_list = ['London', 'New York', 'Paris']
type(best_cities_list)

In [None]:
best_cities_list.append('Tartu')
best_cities_list

In [None]:
best_cities_list.remove('Paris')
best_cities_list

In [None]:
best_cities_list[0] # First item in the list.

In [None]:
best_cities_list[-1] # Last item in the list.

In [None]:
best_cities_list[1] = 'Barcelona' # Replace the second item in a list (remember that Python counting starts at 0).
best_cities_list

In [None]:
best_cities_list[:2] # The first two items in the list.

In [None]:
best_cities_list[1:] # The second item to the last item in the list.

In [None]:
len(best_cities_list)

In [None]:
sorted(best_cities_list)

### Dictionaries

Dictionaries are a powerful data type in Python that stores related information. They are declared using braces (curly brackets) `{}` and contain a collection of `key:value` pairs. Unlike lists, they are not indexed by position: keys are used as opposed to index numbers to extract values. (Since Python 3.7, dictionaries do remember the order in which pairs were inserted, but values are still looked up by key.) Keys must be unique (a key can only have one value), but values do not have to be (different keys can have the same value). As such, they work in a similar way to the dictionaries that we know and love: you look up a word (the key) to find its definition (the value).

The `len()` function can also be used on dictionaries. Two dictionary-specific methods are `.keys()` and `.values()`, which display all of the keys and values in the dictionary, respectively.

Elements can be removed from dictionaries using the `del` command.

Python documentation: https://docs.python.org/3/tutorial/datastructures.html#dictionaries

In [None]:
best_cities_dict = {'London': 'United Kingdom', 'New York': 'United States', 'Paris': 'France'}
type(best_cities_dict)

In [None]:
best_cities_dict['Barcelona'] = 'Spain' # Add to a dictionary.
best_cities_dict # In Python 3.7+, the key:value pairs appear in the order in which they were entered.

In [None]:
best_cities_dict['Paris'] # Access values of a dictionary through their keys.

In [None]:
best_cities_dict['London'] = 'England' # Change values of a dictionary through their keys.
best_cities_dict

In [None]:
print(best_cities_dict.keys()) 
print(best_cities_dict.values()) 

In [None]:
len(best_cities_dict)

In [None]:
del best_cities_dict['Paris']
best_cities_dict
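
Looking up a key that does not exist with square brackets raises a `KeyError`. A gentler option is the `.get()` method, which accepts a fallback value; it turns up again in the lemmatization step later on. The sketch below uses a made-up dictionary and also shows `.items()`, which lets us loop over keys and values together.

```python
tag_names = {'N': 'noun', 'V': 'verb', 'J': 'adjective'}

# .get() returns the value for a key, or the fallback if the key is absent.
print(tag_names.get('V', 'other'))
print(tag_names.get('X', 'other'))

# .items() yields (key, value) pairs for looping.
for letter, name in tag_names.items():
    print(letter, '->', name)
```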

### Sets

A set stores an unordered collection of items for fast lookup - no indexes are used. There are no duplicates, so by definition every item in a set is unique. They are similar to dictionaries, but have no key:value pairs.

Create a set with curly braces `{}` or the `set()` function. Note that if you are creating an empty set, you have to use `set()`, as `{}` represents an empty dictionary.

Add to a set using the `.add()` method and remove from a set using the `.remove()` method. The `len()` function can be used to calculate its length. The `sorted()` function can be used to transform a set into an alphabetical list.

Python documentation: https://docs.python.org/3/tutorial/datastructures.html#sets

In [None]:
best_cities_set = {'London', 'New York', 'Paris', 'London'} 
print(type(best_cities_set))
best_cities_set # Note that 'London' only appears once even though it is added twice.

In [None]:
empty_set = set()
empty_set

In [None]:
empty_set.add('Tartu')
empty_set

In [None]:
empty_set.remove('Tartu')
empty_set

In [None]:
len(empty_set)

In [None]:
sorted(best_cities_set)
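
Sets also support comparison operations, which are handy when comparing the vocabularies of two texts. A minimal sketch with two made-up vocabularies:

```python
vocab_a = {'engine', 'launch', 'fuel', 'orbit'}
vocab_b = {'engine', 'fuel', 'brake', 'tyre'}

shared = vocab_a & vocab_b # Items in both sets (intersection).
only_a = vocab_a - vocab_b # Items in vocab_a but not in vocab_b (difference).
combined = vocab_a | vocab_b # Items in either set (union).

print(sorted(shared))
print(sorted(only_a))
print(len(combined))
```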

### If and for statements

`if` and `for` statements are two of Python's most fundamental control flow tools. They are extremely intuitive and readable. `if` is used for conditional execution, whereas `for` is used to iterate over the elements of an iterable object (e.g., a list or a string).

Note that the body of the statement needs to be *indented* (press tab once, or use four spaces). This is extremely important.

Python documentation: https://docs.python.org/3/tutorial/controlflow.html

In [None]:
if len(best_cities_list) >= 3: # Try changing this to 4.
    print(best_cities_list)
else:
    print("The length of best_cities is less than 3.")

In [None]:
for city in best_cities_dict:
    print(city, 'is in', best_cities_dict[city])
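
A `for` loop that builds a new list can often be written more compactly as a *list comprehension*, a pattern used repeatedly later in this workshop. The sketch below shows both forms side by side on a made-up list, plus a comprehension that filters with `if`.

```python
cities = ['London', 'New York', 'Tartu']

# A for loop that builds a new list step by step.
lengths_loop = []
for city in cities:
    lengths_loop.append(len(city))

# The same result written as a list comprehension.
lengths_comp = [len(city) for city in cities]

# A comprehension can also filter with a trailing if.
long_names = [city for city in cities if len(city) > 5]

print(lengths_loop, lengths_comp, long_names)
```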

## Import packages and modules

So far we have only been using the built-in functions from Python's standard library. While these are quite extensive, in order to execute more specialised tasks (e.g., those related to corpus linguistics or machine learning), we will have to import tailor-made packages and modules. A **module** is a Python file that contains functions for specific non-standard tasks; modules are organised into file hierarchies called **packages**. A **library** is a more generic term that refers to a published module, package, or group of packages.

The Python Standard Library: https://docs.python.org/3/library/
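
There are a few different ways to write an import statement. The sketch below illustrates three of them using modules from the standard library only, so nothing needs to be installed.

```python
# Import a whole module and access its contents with dot notation.
import math
print(math.sqrt(16))

# Import a single name from a module so it can be used directly.
from collections import Counter
print(Counter('banana').most_common(1))

# Import a module under a shorter alias.
import statistics as stats
print(stats.mean([1, 2, 3, 4]))
```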

In [None]:
# We need these modules for most of our corpus linguistics functions.
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.tag import pos_tag, map_tag
from nltk import bigrams
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import wordnet as wn
from nltk import WordNetLemmatizer

# We need this module for tokenization.
from sklearn.feature_extraction.text import CountVectorizer

# We need these packages for plotting.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
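# We also need to download a few NLTK-specific packages that were not included in the general NLTK download.
# Note: recent NLTK releases may ask for differently named resources (e.g., 'punkt_tab').
# If a LookupError appears later on, download the resource named in the error message.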
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

## Load and examine dataset

For this workshop we will be using the **20 newsgroups** dataset from scikit-learn, one of the most famous corpora for text classification and clustering. It contains approximately 20,000 newsgroup forum posts spread across 20 labelled topics. To keep things simple and clear, we will examine three groups of posts from different categories: automobiles, space, and guns. These topics are largely unrelated to each other, so their vocabularies should be fairly distinct. Note that `fetch_20newsgroups` loads only the training subset by default, so the counts below cover part of each category.

When downloading the posts from our categories of interest, we will strip newsgroup-related metadata by setting the `remove` argument to `('headers', 'footers', 'quotes')`. We will then extract the raw text (`.data`) from each group into a separate variable. This raw text is a list: each element of the list is a forum post, so the length of the entire list represents the total number of documents (forum posts in the category).

Scikit-learn documentation: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
<br>20 newsgroups dataset homepage: http://qwone.com/~jason/20Newsgroups/

In [None]:
from sklearn.datasets import fetch_20newsgroups

cars = fetch_20newsgroups(categories=['rec.autos'], remove=('headers', 'footers', 'quotes'))
space = fetch_20newsgroups(categories=['sci.space'], remove=('headers', 'footers', 'quotes'))
guns = fetch_20newsgroups(categories=['talk.politics.guns'], remove=('headers', 'footers', 'quotes'))

cars_raw = cars.data
space_raw = space.data
guns_raw = guns.data

type(cars_raw)

In [None]:
print('Number of posts about cars:', len(cars_raw))
print('Number of posts about space:', len(space_raw))
print('Number of posts about guns:', len(guns_raw))

Let's take a look at the first post in each group to get a sense of their language and style. Each post is a string.

In [None]:
print(type(cars_raw[0]))
print('***')
print(cars_raw[0])
print('***')
print(space_raw[0])
print('***')
print(guns_raw[0])

## Tokenization

Now that we've downloaded the forum posts, let's tokenize them so that we can analyse their words in various ways. This will transform the strings into lists of 'words' (tokens). There are multiple ways to tokenize text:
1. The simplest method is with Python's `.split()` method for strings. This allows us to tokenize on spaces.
2. NLTK's `word_tokenize` is a more advanced method. It first splits the text into sentences and then applies a version of the Treebank Word Tokenizer to each sentence. Contractions are split, and most punctuation and special characters are kept as separate tokens.
3. Scikit-learn's `CountVectorizer` is an alternate advanced method that goes even further than NLTK. By default, words with only one character are discarded, punctuation is ignored, and special characters are stripped. 

Each method has its advantages and disadvantages, and there is no 'best' tokenization approach: it all depends on the research question. Scikit-learn's `CountVectorizer` disregards punctuation because such tokens are not important for most text classification algorithms, whereas NLTK retains punctuation because it is meaningful for many corpus linguistics questions. From the perspective of machine learning, selecting the right tokenization approach can be viewed as part of feature engineering: tokens are extremely significant features for text classification algorithms, as they greatly influence the output.

`word_tokenize` documentation: https://www.nltk.org/api/nltk.tokenize.html
<br> Treebank Word Tokenizer documentation: http://www.nltk.org/_modules/nltk/tokenize/treebank.html
<br>`CountVectorizer` documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
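
To get a feel for how these approaches differ before running the real tokenizers, here is a rough sketch using only Python's `re` module on a made-up sentence. The second pattern follows the default `token_pattern` documented for `CountVectorizer`; the third pattern is only a crude stand-in for a punctuation-preserving tokenizer and is *not* how NLTK's `word_tokenize` actually works.

```python
import re

text = "i don't think a 2-ton rocket is cheap!"

# 1. Whitespace splitting keeps punctuation attached to words.
split_tokens = text.split()

# 2. Words of two or more characters, as in CountVectorizer's documented default pattern.
pattern_tokens = re.findall(r'\b\w\w+\b', text)

# 3. Rough stand-in for a punctuation-preserving tokenizer (not NLTK's actual rules).
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(split_tokens)
print(pattern_tokens)
print(punct_tokens)
```

Notice how the same sentence yields a different number of tokens under each approach, which is why token counts should always be reported together with the tokenizer used.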

### Compare tokenizers for one post

Let's begin by testing the tokenizers on the first post in the space forum. As our ultimate goal is to detect topics, we can ignore capitalisation (this will reduce the number of features, which makes modelling easier). Let's also clean our text by removing new lines, which are represented by `\n`.

We can compare not only the tokens themselves, but also the total number of tokens produced by each method.

In [None]:
sample_text = space_raw[0].lower()
sample_text = sample_text.replace('\n',' ')
print(sample_text)

In [None]:
tokenized_python = sample_text.split() # Split on any run of whitespace to avoid empty tokens.
print(type(tokenized_python))
print(tokenized_python)
print('Number of tokens (Python):', len(tokenized_python))

In [None]:
tokenized_scikit = CountVectorizer().build_tokenizer()(sample_text)
print(type(tokenized_scikit))
print(tokenized_scikit)
print('Number of tokens (Scikit):', len(tokenized_scikit))

In [None]:
tokenized_nltk = nltk.word_tokenize(sample_text)
print(type(tokenized_nltk))
print(tokenized_nltk)
print('Number of tokens (NLTK):', len(tokenized_nltk))

As scikit-learn's `CountVectorizer` produces the 'cleanest' results for our purposes, we will use it going forward. We can modify its parameters to remove numbers, words of fewer than three characters, and the most common words in the English language (stop words). These tokens are not helpful for identifying the theme of a text.

We remove numbers and words of less than three characters using a *regular expression* in the `token_pattern` argument of `CountVectorizer`. `r` indicates raw string, `\b` indicates the word boundary, `[A-Za-z]` indicates that only tokens containing alphabetic letters (either uppercase or lowercase) will be considered, and `{3,}` indicates that the token must be at least three characters. More about regular expressions: https://docs.python.org/3/library/re.html

Scikit-learn stopwords: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py 

In [None]:
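# Use build_analyzer() rather than build_tokenizer(): only the analyzer applies the stop word filter.
tokenized_scikit2 = CountVectorizer(token_pattern=r'\b[A-Za-z]{3,}\b', stop_words='english').build_analyzer()(sample_text)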
print(type(tokenized_scikit2))
print(tokenized_scikit2)
print('Number of tokens (Scikit modified):', len(tokenized_scikit2))
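
Conceptually, this filtering happens in two separate steps: the token pattern decides what counts as a token, and the stop word list then removes unwanted tokens. The sketch below reproduces both steps with plain Python on a made-up sentence. The tiny `STOP_WORDS` set is a hypothetical stand-in; scikit-learn's English list is much longer.

```python
import re

# Hypothetical mini stop list for illustration only.
STOP_WORDS = {'the', 'and', 'was', 'for'}

text = 'the 2nd rocket was launched in 1993 and landed ok'

# Step 1: keep only alphabetic tokens of three or more letters.
tokens = re.findall(r'\b[A-Za-z]{3,}\b', text)

# Step 2: drop stop words from the token list.
content_tokens = [t for t in tokens if t not in STOP_WORDS]

print(tokens)
print(content_tokens)
```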

### Tokenize all forum posts

Now that we have seen how different tokenizers work and selected the one best-suited to our objectives, we can scale up tokenization to the entire corpora of forum text: each corpus contains the posts in one category of interest. This will allow us to compare their vocabularies in various ways.

At the moment, our corpora are represented as lists of strings (each string is a document, or forum post). To facilitate tokenization, let's merge all of the documents in each corpus into one big string using the `.join()` method. We will pass this into the tokenizer.

In [None]:
cars_corpus = ' '.join(cars_raw)
cars_corpus = cars_corpus.lower()
cars_corpus = cars_corpus.replace('\n',' ')

space_corpus = ' '.join(space_raw)
space_corpus = space_corpus.lower()
space_corpus = space_corpus.replace('\n',' ')

guns_corpus = ' '.join(guns_raw)
guns_corpus = guns_corpus.lower()
guns_corpus = guns_corpus.replace('\n',' ')

print('Number of characters in cars corpus:', len(cars_corpus))
print('Number of characters in space corpus:', len(space_corpus))
print('Number of characters in guns corpus:', len(guns_corpus))
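
As the same three steps are repeated for every corpus, they could also be wrapped in a small helper function. This is just one possible sketch: `build_corpus` is a made-up name, and the toy list stands in for a real list of forum posts.

```python
def build_corpus(posts):
    """Join a list of posts into one lowercase string without newlines."""
    corpus = ' '.join(posts)
    corpus = corpus.lower()
    return corpus.replace('\n', ' ')

# Toy stand-in for a list of forum posts.
toy_posts = ['First post.\nSecond line.', 'Another POST']
toy_corpus = build_corpus(toy_posts)
print(toy_corpus)
```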

In [None]:
print('First 1000 characters in cars corpus:')
print(cars_corpus[:1000])
print('')
print('First 1000 characters in space corpus:')
print(space_corpus[:1000])
print('')
print('First 1000 characters in guns corpus:')
print(guns_corpus[:1000])

In [None]:
# build_analyzer() applies both the token pattern and the stop word filter (build_tokenizer() ignores stop_words).
analyzer = CountVectorizer(token_pattern=r'\b[A-Za-z]{3,}\b', stop_words='english').build_analyzer()
tokenized_cars = analyzer(cars_corpus)
tokenized_space = analyzer(space_corpus)
tokenized_guns = analyzer(guns_corpus)

print('First 100 tokens in cars corpus:', tokenized_cars[:100])
print('First 100 tokens in space corpus:', tokenized_space[:100])
print('First 100 tokens in guns corpus:', tokenized_guns[:100])

Tokenization allows us to count the total number of words (tokens) in each corpus, as well as the total number of *unique* words (types) in each corpus using a Python set.

In [None]:
print('Number of tokens in cars corpus:', len(tokenized_cars))
print('Number of tokens in space corpus:', len(tokenized_space))
print('Number of tokens in guns corpus:', len(tokenized_guns))

In [None]:
print('Number of unique tokens in cars corpus:', len(set(tokenized_cars)))
print('Number of unique tokens in space corpus:', len(set(tokenized_space)))
print('Number of unique tokens in guns corpus:', len(set(tokenized_guns)))

## Stemming

At the moment, words with the same root meaning such as 'car' and 'cars' are treated as separate types in our tokenized corpora. As distinguishing between these different forms is not important for topic modelling, we can simplify our tokens by **stemming** them: transforming their words into their root forms by removing affixes. This reduces the number of features, thus simplifying the model, but might not be lexicographically correct. 

NLTK offers three options for English words: Porter, Snowball (Porter2), and Lancaster.
- The Porter stemming algorithm is the oldest stemming algorithm supported in NLTK, developed in 1979 and published in 1980. It is also the most computationally intensive.
- The Lancaster stemming algorithm was published in 1990, and can be more aggressive than Porter; it is also the fastest. 
- The SnowballStemmer currently supports 15 languages and is nearly universally regarded as an improvement over Porter (it's in between Porter and Lancaster with regard to aggressiveness and speed).

The input for these stemmers is tokenized text (a list of tokens), not the raw text (a string): each stemmer's `.stem()` method takes one token at a time, so we apply it to every item in the list. Let's compare their performance on our cars corpus.

NLTK stemmer documentation: http://www.nltk.org/howto/stem.html

In [None]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer('english', ignore_stopwords=True)
lancaster_stemmer = LancasterStemmer()

In [None]:
tokenized_cars_porter = [porter_stemmer.stem(w) for w in tokenized_cars]
print('First 100 tokens from Porter stemmer:', tokenized_cars_porter[:100])
print('Number of unique tokens after Porter stemming:', len(set(tokenized_cars_porter)))
print('')
tokenized_cars_snowball = [snowball_stemmer.stem(w) for w in tokenized_cars]
print('First 100 tokens from Snowball stemmer:', tokenized_cars_snowball[:100])
print('Number of unique tokens after Snowball stemming:', len(set(tokenized_cars_snowball)))
print('')
tokenized_cars_lancaster = [lancaster_stemmer.stem(w) for w in tokenized_cars]
print('First 100 tokens from Lancaster stemmer:', tokenized_cars_lancaster[:100])
print('Number of unique tokens after Lancaster stemming:', len(set(tokenized_cars_lancaster)))
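
To see why stemmers sometimes produce strings that are not real words, it helps to look at a deliberately naive suffix stripper. The `naive_stem` function below is a toy written for this illustration; it is *not* the Porter, Snowball, or Lancaster algorithm, all of which apply many more rules and conditions.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy rule set for illustration only."""
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip the suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['cars', 'parked', 'driving', 'thus', 'car']:
    print(w, '->', naive_stem(w))
```

Rule-based stripping merges 'cars' with 'car' as intended, but it also clips words such as 'thus' and 'driving' into non-words, because the rules know nothing about meaning.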

## Part-of-speech (PoS) tagging and lemmatizing

As all linguists are well aware, stemming can create non-real words, such as 'thu' from 'thus'. **Lemmatization** aims to obtain the canonical (lexicographically/grammatically correct) form of the root word, the so-called *lemma*. Lemmatization is computationally more difficult and expensive than stemming, as it works best when the PoS tag of each token is known. NLTK's lemmatizer looks words up in WordNet, an English lexical database: http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html

It is also useful to examine the **PoS categories** in and of themselves - they represent a level above that of specific words (types). Which words are most popular in any given PoS category? What is the frequency of different PoS categories in different corpora? NLTK's default PoS tagger uses tags from the Penn Treebank project: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [None]:
tokenized_cars_pos = nltk.pos_tag(tokenized_cars)
tokenized_space_pos = nltk.pos_tag(tokenized_space)
tokenized_guns_pos = nltk.pos_tag(tokenized_guns)

print('First ten PoS-tagged tokens in cars corpus:', tokenized_cars_pos[:10])
print('First ten PoS-tagged tokens in space corpus:', tokenized_space_pos[:10])
print('First ten PoS-tagged tokens in guns corpus:', tokenized_guns_pos[:10])

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize(token, tag):
    tag = {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
    }.get(tag[0], wn.NOUN)
    return lemmatizer.lemmatize(token, tag) 

In [None]:
tokenized_cars_lemmas = []

for token, pos in tokenized_cars_pos:
    try:
        tokenized_cars_lemmas.append(lemmatize(token, pos))
    except KeyError:
        tokenized_cars_lemmas.append(token)

print('First 100 tokens from lemmatization:', tokenized_cars_lemmas[:100])
print('Number of unique tokens after lemmatization:', len(set(tokenized_cars_lemmas)))

In [None]:
tokenized_space_lemmas = []

for token, pos in tokenized_space_pos:
    try:
        tokenized_space_lemmas.append(lemmatize(token, pos))
    except KeyError:
        tokenized_space_lemmas.append(token)

print('First 100 tokens from lemmatization:', tokenized_space_lemmas[:100])
print('Number of unique tokens after lemmatization:', len(set(tokenized_space_lemmas)))

In [None]:
tokenized_guns_lemmas = []

for token, pos in tokenized_guns_pos:
    try:
        tokenized_guns_lemmas.append(lemmatize(token, pos))
    except KeyError:
        tokenized_guns_lemmas.append(token)

print('First 100 tokens from lemmatization:', tokenized_guns_lemmas[:100])
print('Number of unique tokens after lemmatization:', len(set(tokenized_guns_lemmas)))

Let's rank the frequencies of the different PoS tags in each corpus, and plot the frequencies to visually compare them.

In [None]:
cars_tag_fd = nltk.FreqDist(tag for (word, tag) in tokenized_cars_pos)
space_tag_fd = nltk.FreqDist(tag for (word, tag) in tokenized_space_pos)
guns_tag_fd = nltk.FreqDist(tag for (word, tag) in tokenized_guns_pos)

print('Most common PoS tags in cars corpus:', cars_tag_fd.most_common())
print('')
print('Most common PoS tags in space corpus:', space_tag_fd.most_common())
print('')
print('Most common PoS tags in guns corpus:', guns_tag_fd.most_common())

In [None]:
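# The 'seaborn' style was renamed 'seaborn-v0_8' in newer matplotlib releases; use whichever is available.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')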

plt.figure(figsize=(15, 5))
plt.title('PoS Tag Frequencies in Cars Corpus')
cars_tag_fd.plot()

plt.figure(figsize=(15, 5))
plt.title('PoS Tag Frequencies in Space Corpus')
space_tag_fd.plot()

plt.figure(figsize=(15, 5))
plt.title('PoS Tag Frequencies in Guns Corpus')
guns_tag_fd.plot()

Using the PoS tag, we can filter for only one lexical category (e.g., nouns). To compare its frequency across corpora, we can calculate the proportion of words with the tag in the corpora.

In [None]:
cars_corpus_nouns = [(token, pos) for token, pos in tokenized_cars_pos if pos.startswith('N')]
space_corpus_nouns = [(token, pos) for token, pos in tokenized_space_pos if pos.startswith('N')]
guns_corpus_nouns = [(token, pos) for token, pos in tokenized_guns_pos if pos.startswith('N')]

print('Proportion of nouns in cars corpus:', len(cars_corpus_nouns)/len(tokenized_cars))
print('First ten nouns in cars corpus:', cars_corpus_nouns[:10])
print('Proportion of nouns in space corpus:', len(space_corpus_nouns)/len(tokenized_space))
print('First ten nouns in space corpus:', space_corpus_nouns[:10])
print('Proportion of nouns in guns corpus:', len(guns_corpus_nouns)/len(tokenized_guns))
print('First ten nouns in guns corpus:', guns_corpus_nouns[:10])

## Word-level calculations

Tokenization allows us to count the words (tokens) in a text in a variety of different ways. We have already counted the total number of tokens and types (unique tokens) in each corpus. From these counts, we can write our own function to calculate *lexical diversity* (also known as lexical richness): the number of types (unique words) divided by the number of tokens (total words). We can use the lemmatized corpora to group words together that have the same root meaning, as they do not increase semantic diversity.

In [None]:
def lexical_diversity(corpus):
    return len(set(corpus))/len(corpus)

print('Lexical diversity of cars corpus:', lexical_diversity(tokenized_cars_lemmas))
print('Lexical diversity of space corpus:', lexical_diversity(tokenized_space_lemmas))
print('Lexical diversity of guns corpus:', lexical_diversity(tokenized_guns_lemmas))
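
One caveat is that this ratio tends to fall as a text gets longer, because common words keep repeating while new types become rarer. Corpora of different sizes are therefore not directly comparable. One simple workaround, sketched below on toy data, is to compare equal-sized samples; `truncated_diversity` is a made-up helper, and taking the first *n* tokens is only one of several possible sampling choices.

```python
def lexical_diversity(tokens):
    """Number of types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def truncated_diversity(tokens, n):
    """Lexical diversity of the first n tokens only."""
    return lexical_diversity(tokens[:n])

short_text = 'the cat sat on the mat'.split()
long_text = short_text * 3 # Same vocabulary, three times as many tokens.

print(lexical_diversity(short_text))
print(lexical_diversity(long_text))
print(truncated_diversity(long_text, len(short_text)))
```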

Now let's examine the frequency of specific words (types) in the forum posts. After we generate the frequency distribution, we can see what the most common words are in each corpus. We can also display the PoS categories of the most common words, and sort the words within a certain PoS category by frequency: e.g., what are the most popular *nouns* in each corpus? Again, we will use the lemmatized corpora to group words together that have the same root meaning.

In [None]:
fdist_cars = FreqDist(tokenized_cars_lemmas)
fdist_space = FreqDist(tokenized_space_lemmas)
fdist_guns = FreqDist(tokenized_guns_lemmas)

In [None]:
print(fdist_cars['car'])
print(fdist_space['space'])
print(fdist_guns['gun'])

In [None]:
print('Most frequent words in cars corpus:', fdist_cars.most_common(20))
print('Most frequent words in space corpus:', fdist_space.most_common(20))
print('Most frequent words in guns corpus:', fdist_guns.most_common(20))
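
NLTK's `FreqDist` is built on top of Python's `collections.Counter`, so the core idea can be sketched with the standard library alone. The toy token list below is made up; the last step picks out the words that occur exactly once.

```python
from collections import Counter

tokens = ['car', 'engine', 'car', 'brake', 'engine', 'car', 'tyre']
counts = Counter(tokens)

print(counts['car']) # Count of a single word.
print(counts.most_common(2)) # The two most frequent words with their counts.

# Words that occur exactly once.
once_only = [word for word, n in counts.items() if n == 1]
print(once_only)
```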

In [None]:
cars_type_tag_fd = nltk.FreqDist(tokenized_cars_pos)
space_type_tag_fd = nltk.FreqDist(tokenized_space_pos)
guns_type_tag_fd = nltk.FreqDist(tokenized_guns_pos)

print('Most frequent words and their PoS in cars corpus:', cars_type_tag_fd.most_common(20))
print('Most frequent words and their PoS in space corpus:', space_type_tag_fd.most_common(20))
print('Most frequent words and their PoS in guns corpus:', guns_type_tag_fd.most_common(20))

In [None]:
print('Most popular nouns in cars corpus:', [typetag[0] for (typetag, _) in cars_type_tag_fd.most_common() if typetag[1] == 'NN'][:10])
print('Most popular nouns in space corpus:', [typetag[0] for (typetag, _) in space_type_tag_fd.most_common() if typetag[1] == 'NN'][:10])
print('Most popular nouns in guns corpus:', [typetag[0] for (typetag, _) in guns_type_tag_fd.most_common() if typetag[1] == 'NN'][:10])

Now let's look at some more peculiar words: those that appear only once (hapax legomena), those that are extremely long, and those that are both long and frequently occurring. Such words often add a different perspective on a corpus of text (they're a bit like linguistic outliers!).

In [None]:
print('Number of hapax legomena in cars corpus:', len(fdist_cars.hapaxes()))
print('Number of hapax legomena in space corpus:', len(fdist_space.hapaxes()))
print('Number of hapax legomena in guns corpus:', len(fdist_guns.hapaxes()))

In [None]:
print('First 20 hapax legomena in cars corpus:', fdist_cars.hapaxes()[:20])
print('First 20 hapax legomena in space corpus:', fdist_space.hapaxes()[:20])
print('First 20 hapax legomena in guns corpus:', fdist_guns.hapaxes()[:20])

In [None]:
vocab_cars = set(tokenized_cars_lemmas)
vocab_space = set(tokenized_space_lemmas)
vocab_guns = set(tokenized_guns_lemmas)

# Let's define a long word as a word with more than 10 characters.
long_words_cars = [word for word in vocab_cars if len(word) > 10] 
long_words_space = [word for word in vocab_space if len(word) > 10]
long_words_guns = [word for word in vocab_guns if len(word) > 10]

print('First 20 long words in cars corpus:', long_words_cars[:20])
print('First 20 long words in space corpus:', long_words_space[:20])
print('First 20 long words in guns corpus:', long_words_guns[:20])

In [None]:
# Let's define frequent words as those that occur more than 10 times.
long_frequent_words_cars = [word for word in vocab_cars if len(word) > 10 and fdist_cars[word] > 10]
long_frequent_words_space = [word for word in vocab_space if len(word) > 10 and fdist_space[word] > 10]
long_frequent_words_guns = [word for word in vocab_guns if len(word) > 10 and fdist_guns[word] > 10]

print('Long frequent words in cars corpus:', sorted(long_frequent_words_cars))
print('Long frequent words in space corpus:', sorted(long_frequent_words_space))
print('Long frequent words in guns corpus:', sorted(long_frequent_words_guns))

## N-grams and collocations

N-grams are sequences of *n* adjacent words: 2-grams (bigrams) are pairs of neighbouring words, 3-grams (trigrams) are runs of three neighbouring words, etc. Some tools also count words that co-occur within a wider window, but here the words must be directly next to each other.
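
Extracting n-grams from a token list takes only a few lines of plain Python. The helper below, `make_ngrams`, is a made-up name for this sketch; it slides a window of size *n* over the tokens.

```python
def make_ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['the', 'space', 'shuttle', 'landed', 'safely']
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```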

N-gram collocations are n-grams that occur more often than we would expect based on the frequency of the individual words. We can compute them in Python using Pointwise Mutual Information (a measure of association used in statistics).
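
For a bigram (x, y), PMI is the base-2 logarithm of P(x, y) / (P(x) P(y)): the probability of seeing the two words together, divided by the probability we would expect if they were independent. The sketch below computes it by hand for a toy token list. It estimates every probability by dividing counts by the total number of tokens, which is a simplifying assumption; NLTK's own implementation may differ in its details.

```python
import math
from collections import Counter

tokens = ['new', 'york', 'is', 'big', 'new', 'york', 'is', 'far', 'new', 'car']

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def pmi(bigram):
    """log2 of the observed bigram probability over the probability expected by chance."""
    w1, w2 = bigram
    p_bigram = bigram_counts[bigram] / total
    p_expected = (word_counts[w1] / total) * (word_counts[w2] / total)
    return math.log2(p_bigram / p_expected)

for bigram in [('new', 'york'), ('is', 'big')]:
    print(bigram, bigram_counts[bigram], round(pmi(bigram), 2))
```

Note that ('is', 'big') occurs only once yet scores higher than ('new', 'york'), which occurs twice. PMI rewards pairs made of rare words, which is why it is usually combined with a minimum frequency threshold.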

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder_cars = BigramCollocationFinder.from_words(tokenized_cars)
bigram_finder_space = BigramCollocationFinder.from_words(tokenized_space)
bigram_finder_guns = BigramCollocationFinder.from_words(tokenized_guns)

# Top 20 bigrams.
print('Top 20 bigrams in cars corpus:', bigram_finder_cars.nbest(bigram_measures.pmi, 20))
print('Top 20 bigrams in space corpus:', bigram_finder_space.nbest(bigram_measures.pmi, 20))
print('Top 20 bigrams in guns corpus:', bigram_finder_guns.nbest(bigram_measures.pmi, 20))

Let's filter the results to only see the top 20 bigrams that appear at least five times.

In [None]:
bigram_finder_cars.apply_freq_filter(5)
bigram_finder_space.apply_freq_filter(5)
bigram_finder_guns.apply_freq_filter(5)

print('Top 20 frequent bigrams in cars corpus:', bigram_finder_cars.nbest(bigram_measures.pmi, 20))
print('Top 20 frequent bigrams in space corpus:', bigram_finder_space.nbest(bigram_measures.pmi, 20))
print('Top 20 frequent bigrams in guns corpus:', bigram_finder_guns.nbest(bigram_measures.pmi, 20))

We can also apply a word filter to remove bigrams containing specific words.

In [None]:
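# Filters modify the finder in place, so the frequency filter from the previous cell
# is already active; re-applying it here is harmless.
bigram_finder_cars.apply_freq_filter(5)
bigram_finder_space.apply_freq_filter(5)
bigram_finder_guns.apply_freq_filter(5)

# A one-element tuple needs a trailing comma: ('emx') is just the string 'emx',
# so `w in ('emx')` would be a substring test that also removes words such as 'e' or 'em'.
bigram_finder_cars.apply_word_filter(lambda w: w in ('gov', 'blah'))
bigram_finder_space.apply_word_filter(lambda w: w in ('emx',))
bigram_finder_guns.apply_word_filter(lambda w: w in ('ifas',))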

print('Top 20 frequent bigrams in cars corpus (filtered):', bigram_finder_cars.nbest(bigram_measures.pmi, 20))
print('Top 20 frequent bigrams in space corpus (filtered):', bigram_finder_space.nbest(bigram_measures.pmi, 20))
print('Top 20 frequent bigrams in guns corpus (filtered):', bigram_finder_guns.nbest(bigram_measures.pmi, 20))
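
One thing to watch out for when writing word filters is how Python treats parentheses around a single string. The short sketch below (with made-up variable names and words) shows the difference between testing membership in a string and in a one-element tuple.

```python
stop_string = ('ifas')   # parentheses alone do nothing: this is the string 'ifas'
stop_tuple = ('ifas',)   # the trailing comma makes it a one-element tuple

words = ['if', 'as', 'ifas', 'rifle']

# `in` on a string is a substring test; `in` on a tuple is a membership test.
removed_by_string = [w for w in words if w in stop_string]
removed_by_tuple = [w for w in words if w in stop_tuple]

print(removed_by_string)  # ['if', 'as', 'ifas']
print(removed_by_tuple)   # ['ifas']
```

With the plain string, common words such as 'if' and 'as' would be filtered out as well, which is rarely what we want.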

In [None]:
# Repeat for trigrams.
trigram_measures = nltk.collocations.TrigramAssocMeasures()
trigram_finder_cars = TrigramCollocationFinder.from_words(tokenized_cars)
trigram_finder_space = TrigramCollocationFinder.from_words(tokenized_space)
trigram_finder_guns = TrigramCollocationFinder.from_words(tokenized_guns)

trigram_finder_cars.apply_freq_filter(5)
trigram_finder_space.apply_freq_filter(5)
trigram_finder_guns.apply_freq_filter(5)

print('Top 20 frequent trigrams in cars corpus:', trigram_finder_cars.nbest(trigram_measures.pmi, 20))
print('Top 20 frequent trigrams in space corpus:', trigram_finder_space.nbest(trigram_measures.pmi, 20))
print('Top 20 frequent trigrams in guns corpus:', trigram_finder_guns.nbest(trigram_measures.pmi, 20))

## Searching text (in context)

We've been looking at words in isolation as well as in clusters of two and three (bigrams and trigrams). Let's zoom out and examine the context of these words. NLTK provides some very useful functions for this.

To preserve as much of the original context as we can (including punctuation and special characters), we can tokenize the corpora with NLTK's `word_tokenize`. These tokenized corpora then have to be wrapped in NLTK `Text` objects, which provide the following search methods:
- **Concordances** allow us to see keywords in context (KWIC).
- For any given word, we can calculate **similar words** in the corpus: which words occur in a similar range of contexts?
- For any given set of two or more words, we can examine the **common contexts** that they share in the corpus (if any).
- We can visualise the locations of words in the text by generating a **lexical dispersion plot**. Each stripe in the plot represents an instance of a word, and each row represents the entire text. Location is measured by 'word offset': the number of words (tokens) from the beginning of the text at which the word can be found.
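
To get a feel for what a concordance does, here is a simplified keyword-in-context sketch in plain Python. The function `kwic` and the sample sentence are hypothetical; it uses a window counted in tokens, whereas NLTK's `concordance` works with a line width in characters, so the output will look different.

```python
def kwic(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{token}] {right}')
    return lines

# Toy sentence, not taken from the newsgroups data.
sample = 'The engine gives smooth acceleration but the acceleration fades at speed'.split()

kwic_lines = kwic(sample, 'acceleration')
for line in kwic_lines:
    print(line)
```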

In [None]:
tokenized_cars_nltk = nltk.word_tokenize(cars_corpus)
tokenized_space_nltk = nltk.word_tokenize(space_corpus)
tokenized_guns_nltk = nltk.word_tokenize(guns_corpus)

cars_text = nltk.Text(tokenized_cars_nltk)
space_text = nltk.Text(tokenized_space_nltk)
guns_text = nltk.Text(tokenized_guns_nltk)

In [None]:
cars_text.concordance('acceleration')

In [None]:
space_text.concordance('acceleration')

In [None]:
space_text.similar('space')

In [None]:
space_text.common_contexts(['space', 'lunar'])

In [None]:
plt.style.use('classic')
plt.figure(figsize=(15, 5)) 
space_text.dispersion_plot(['sun', 'earth', 'moon', 'satellite'])
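
The positions that a dispersion plot draws can also be inspected directly. The sketch below, with a hypothetical helper `word_offsets` and a toy token list, collects the word offset of every exact match of a word.

```python
def word_offsets(tokens, word):
    """Return the index (counted in tokens from the start) of every exact match of word."""
    return [i for i, token in enumerate(tokens) if token == word]

# Toy token list, not taken from the newsgroups data.
sample_tokens = ['the', 'moon', 'orbits', 'the', 'earth', 'and', 'the', 'earth', 'orbits', 'the', 'sun']

offsets = {word: word_offsets(sample_tokens, word) for word in ['sun', 'earth', 'moon', 'satellite']}
print(offsets)
```

A word that never occurs, such as 'satellite' here, simply gets an empty list, which corresponds to an empty row in the plot.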

## Open-ended exercises and questions

1. Rerun the above analyses for other forum topics. It would be interesting to compare two topics in the same category: e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware (PC vs Mac hardware).
2. What are the most common adjectives in the forum corpora? The most common verbs?
3. What are the most common short words (length < 5)?
4. Play around with the various tools that NLTK provides for searching text (concordances, similar words, lexical dispersion plots) with different corpora.