# Working with n-grams

An n-gram is a sequence of n consecutive words in a text. Counting n-grams is a simple method for examining the context of words and how words occur together.

We will go through some of the most common n-grams, namely bigrams and trigrams, and apply the method to our parliamentary data.
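
To make the idea concrete, here is a minimal sketch that builds bigrams and trigrams from a short token list using only plain Python. The example sentence and the helper name `make_ngrams` are made up for illustration; they are not part of NLTK or of the parliamentary data.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Zip n staggered copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical example sentence, already split into tokens
example_tokens = ['vi', 'skal', 'have', 'en', 'ny', 'lov']

example_bigrams = make_ngrams(example_tokens, 2)
example_trigrams = make_ngrams(example_tokens, 3)

print(example_bigrams)
print(example_trigrams)
```

A list of six tokens yields five bigrams and four trigrams, since each n-gram starts one token further into the text.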

## Libraries
For this analysis, we need pandas and a number of specific modules from the NLTK library.

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ucloud/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import pandas as pd

from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
from nltk.metrics import BigramAssocMeasures, TrigramAssocMeasures
from nltk.stem.snowball import SnowballStemmer

## Data preparation
In order to find n-grams in our data, we first need to get our text into a suitable format. We only want to look at the text data, so what we need is one big text string. As demonstrated in a [previous notebook](./Loading%20and%20cleaning%20data.ipynb#Single-party-text-extraction), there are two ways to obtain a string with the text data. We can either load the data into a DataFrame and extract all the data from the `text` column, or we can load the string from a single text file.

To use the DataFrame method, we provide the path to a data file and load it into a DataFrame with pandas' `read_csv()` function.

In [None]:
data_file = '/work/Common-files/Data/Datasæt3/20201.csv' # Path to data file

df = pd.read_csv(data_file)

We extract and join the text from the `text` column into a string with the following command.

In [None]:
text_str = ' '.join(df['text'])

Notice that we haven't filtered the data by party, so `text_str` contains all the text in the data file.

If we want to focus on a specific party, we extract only the text from rows where the `group_name` field equals the abbreviated party name.

In [None]:
party = 'LA'

party_str = ' '.join(df[df['group_name'] == party]['text'])
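
As a hedged sketch of what this filter does, the following builds a tiny, made-up DataFrame with the same `group_name` and `text` columns and joins the text for one party. The rows and the `toy_` names are invented for illustration. The `dropna()` call is an extra precaution that is not in the cell above: `' '.join()` raises a `TypeError` if a row has no text.

```python
import pandas as pd

# Invented rows that mimic the structure of the real data
toy_df = pd.DataFrame({
    'group_name': ['LA', 'S', 'LA', 'LA'],
    'text': ['første tale', 'anden tale', None, 'fjerde tale'],
})

toy_party = 'LA'

# Boolean mask: True for rows that belong to the chosen party
mask = toy_df['group_name'] == toy_party

# Select the text of those rows, skip missing values and join into one string
toy_party_str = ' '.join(toy_df.loc[mask, 'text'].dropna())

print(toy_party_str)
```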

Alternatively, we can load the data from a previously saved text file on the disk.

In [None]:
text_file = '' # Path to text file

with open(text_file, encoding='utf-8') as f:  # Assumes the file was saved as UTF-8
    text_data = f.read()

Finally, we split the text string into a list of tokens. Again, we can use the fast but naïve `split()` method or the sophisticated but slower `word_tokenize()` function from NLTK.

In [None]:
split_tokens = text_str.split()

In [None]:
nltk_tokens = word_tokenize(text_str)

### Data cleaning
This is a good time to apply a bit of data cleaning before we proceed.

The text may contain a lot of symbols from misread OCR or additional punctuation, which we would like to get rid of. We can filter our tokens and keep only those that are alphabetic, i.e. consist solely of letters. This will also remove any punctuation from our list of tokens. Notice that this only works well if we have used `word_tokenize()` to create our list of tokens, because `split()` leaves punctuation attached to the words.

We iterate over the list of tokens, apply the string method `isalpha()` and append the words that pass the test to a new list of tokens. In order to standardise the text further, we convert all words to lower case when appending to the new list.

In [None]:
tokens = []

for word in nltk_tokens:
    if word.isalpha():
        tokens.append(word.lower())

This quick cleaning will remove the worst offenders from our analysis. You are welcome to apply other steps of cleaning as well.
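
The caveat about `split()` can be seen in a small sketch. The sentence is made up, and stripping with `string.punctuation` is only one possible workaround, shown here as an assumption rather than as the method used in this notebook.

```python
import string

# Hypothetical example sentence
sentence = 'Det er ulovligt, sagde ministeren.'

raw_tokens = sentence.split()

# isalpha() rejects 'ulovligt,' and 'ministeren.' because of the attached punctuation
kept_naive = [w.lower() for w in raw_tokens if w.isalpha()]

# Possible workaround: strip leading and trailing punctuation before testing
stripped = [w.strip(string.punctuation) for w in raw_tokens]
kept_stripped = [w.lower() for w in stripped if w.isalpha()]

print(kept_naive)
print(kept_stripped)
```

With the naïve approach, two of the five words are lost entirely, which would distort any n-gram that contains them.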

## Build bigram function

With our data prepared, we can build the functions that will identify the n-grams in our text. We start with bigrams.

In [None]:
def bigrams_score(word, source, freq_filter=2):
    """Identify bigrams that include a specific word.
    
    Based on a specific target word and a list of tokens (source), the function returns
    a list of bigrams including the target word. Optionally the frequency filter can be adjusted.
    """
    
    # Initiate bigram scoring measures
    bigram_measures = BigramAssocMeasures()
    
    # Generation of bigrams
    finder = BigramCollocationFinder.from_words(source)

    # Frequency filter which determines the number of times a bigram must occur
    finder.apply_freq_filter(freq_filter)  
    
    # Create filter that discards bigrams without the target word
    word_filter = lambda *w: word not in w
    
    # Application of the word filter
    finder.apply_ngram_filter(word_filter)
    
    # Return bigrams with scores of likelihood
    return finder.score_ngrams(bigram_measures.likelihood_ratio)

In the function, we first create an instance of NLTK's `BigramAssocMeasures` class, which provides the scoring measures. We then use `BigramCollocationFinder.from_words()` to collect all bigrams in our text source.

Next, we create a frequency filter, which sets a threshold of how many times a bigram should appear in the text before we want to keep it. By default, we set the frequency filter to 2.

We then create a _lambda_ expression that flags all bigrams that don't contain our target word. Lambda expressions can be thought of as quick, disposable functions that are only needed once. Learn more about lambda expressions and their use cases in the [documentation](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions).

Finally, we apply our filtering lambda expression (`word_filter`) before we return a list of scored bigrams.
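
The behaviour of the word filter can be checked in isolation. In the sketch below, the candidate bigrams are invented and the list comprehension merely imitates what `apply_ngram_filter()` does: it removes every n-gram for which the filter returns `True`.

```python
target = 'ulovlig'

# Same shape as the filter in the function: True means "discard this n-gram"
word_filter = lambda *w: target not in w

# Invented candidate bigrams
candidates = [
    ('ulovlig', 'handling'),
    ('en', 'ny'),
    ('er', 'ulovlig'),
    ('ulovlige', 'handlinger'),
]

# Keep only the n-grams that the filter does not flag
kept = [ngram for ngram in candidates if not word_filter(*ngram)]

print(kept)
```

Two things are worth noticing. Because of `*w`, the same filter accepts any number of words and therefore also works for trigrams. And the test is an exact match on whole tokens, so the inflected form `ulovlige` is not treated as a hit for `ulovlig`.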

### Testing the function
Now we can test `bigrams_score` by passing a target word and our list of tokens to the function.

In [None]:
bigrams_score('ulovlig', tokens)

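The function returns a list of bigrams that meet our requirements, i.e. that pass both the word filter and the frequency filter. Each bigram is accompanied by a likelihood-ratio score, which is a statistical measure of how strongly the two words are associated: the higher the score, the less likely it is that the words occur together by chance. The list is sorted with the highest score first.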
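
To see the shape of the output without loading the full dataset, here is a minimal sketch that runs the same steps on a small, invented token list. The tokens are made up, so the scores themselves carry no meaning; the point is only to show which bigrams survive the two filters and how the result is structured.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Invented toy corpus, already tokenised and lower-cased
toy_tokens = ('ulovlig handling er en ulovlig handling og en '
              'ulovlig aftale er en aftale').split()

finder = BigramCollocationFinder.from_words(toy_tokens)

# Keep bigrams that occur at least twice
finder.apply_freq_filter(2)

# Discard bigrams that do not contain the target word
finder.apply_ngram_filter(lambda *w: 'ulovlig' not in w)

scored = finder.score_ngrams(BigramAssocMeasures().likelihood_ratio)

for bigram, score in scored:
    print(bigram, round(score, 2))
```

Only `('ulovlig', 'handling')` and `('en', 'ulovlig')` remain: `('ulovlig', 'aftale')` occurs just once and is removed by the frequency filter, while `('er', 'en')` occurs twice but lacks the target word.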

## Build trigram function
Similarly, we can build a trigram function in order to identify sequences of three words that occur together.

In [None]:
def trigrams_score(word, source, freq_filter=2):
    """Identify trigrams that include a specific word.
    
    Based on a specific target word and a list of tokens (source), the function returns
    a list of trigrams including the target word. Optionally the frequency filter can be adjusted.
    """
    
    # Initiate trigram scoring measures
    trigram_measures = TrigramAssocMeasures()
    
    # Generation of trigrams
    finder = TrigramCollocationFinder.from_words(source)

    # Frequency filter which determines the number of times a trigram must occur
    finder.apply_freq_filter(freq_filter)
    
    # Create filter that discards trigrams without the target word
    word_filter = lambda *w: word not in w
    
    # Application of the word filter
    finder.apply_ngram_filter(word_filter)

    # Return trigrams with scores of likelihood
    return finder.score_ngrams(trigram_measures.likelihood_ratio)

In [None]:
trigrams_score('raske', tokens)

## Stemming
We can process our data in various ways in order to optimise our n-grams. One of them is stemming.

Stemming is the process of stripping the inflectional endings from a word so that only the word stem remains.

To stem our words, we need a stemmer. NLTK includes an implementation of the [Snowball stemmers](https://snowballstem.org/), which build on the original Porter stemmer and cover a number of languages, including Danish.

First we need to initiate the stemmer; then we iterate over our tokens, stem each word and append it to a new list of stemmed tokens. Because the stemming process is language specific, we need to pass the language of our text to the stemmer.

In [None]:
language = 'danish'

stemmer = SnowballStemmer(language)

stemmed_tokens = []

for word in tokens:
    stem = stemmer.stem(word)
    stemmed_tokens.append(stem)

Now we can pass our list of stemmed tokens to the n-gram functions and inspect how the output is affected.

In [None]:
bigrams_score('sprog', stemmed_tokens)

In [None]:
trigrams_score('kritisk', stemmed_tokens, freq_filter=3)
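
One way to inspect what the stemmer actually merges is to group the original words by their stem. The sketch below does this for a handful of hand-picked Danish words; the word list and the name `stem_groups` are illustrative assumptions, and the exact stems depend on the Snowball rules for Danish.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('danish')

# Hand-picked example words (not taken from the dataset)
sample_words = ['sprog', 'sproget', 'sprogene', 'kritisk', 'kritiske']

# Map each stem to the set of original words that reduce to it
stem_groups = defaultdict(set)
for word in sample_words:
    stem_groups[stemmer.stem(word)].add(word)

for stem, words in sorted(stem_groups.items()):
    print(stem, sorted(words))
```

Because the n-gram functions match the target word exactly, the target should be given in its stemmed form when searching in `stemmed_tokens`.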

## Wrap up

In this notebook we have demonstrated how we can build our own tools to identify n-grams. We have encountered some of the more technical aspects of NLTK, but by wrapping our analyses in functions, we can get to the results fairly easily.

We have also encountered stemming, which can be a useful tool whenever we want to count words across a text, and we cleaned the data by removing punctuation and OCR noise.