# NLTK - Natural Language Toolkit

Natural Language Toolkit (NLTK) is a Python library that allows us to easily perform various text analyses. NLTK includes a large number of useful tools. In this notebook, we will look at a few of them.

## Setup
We start by importing the NLTK library and the pandas library.

In [None]:
import pandas as pd

import nltk

If NLTK is not installed, we can install it by typing `pip install nltk` into a terminal or by using the command below directly from within the notebook.

In [None]:
!pip install nltk

When NLTK is installed and imported, we can download additional material to extend the functionality of the library. NLTK includes a great number of corpora, models, stop word lists and other Natural Language Processing (NLP) tools.

We can either download specific parts of the tools or we can download everything at once with the command below. Notice that it is only necessary to download additional data the first time the NLTK installation is used.

In [None]:
nltk.download('all')
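Downloading everything can take a while. As a sketch of the selective alternative, the helper below downloads only packages that are not already present. The selection of packages and the helper name `ensure_resources` are assumptions based on what this notebook uses. Recent NLTK releases may ask for `punkt_tab` instead of `punkt`; the `LookupError` message names the missing package.

```python
import nltk

# Package ids mapped to the path NLTK uses to look them up locally.
# This selection is an assumption based on what this notebook uses.
RESOURCES = {
    'punkt': 'tokenizers/punkt',       # models used by word_tokenize()
    'stopwords': 'corpora/stopwords',  # stop word lists
}


def ensure_resources(resources=RESOURCES):
    """Download each package only if it is not already available locally."""
    missing = []
    for package, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(package)
    for package in missing:
        nltk.download(package, quiet=True)
    return missing
```

Calling `ensure_resources()` returns the list of packages that had to be downloaded, which is empty on later runs.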

## Load data

Once again, we need our text data in the form of a single text string. We can either load the data from a CSV file and extract the text column, or load it directly from a text file. Only one of the two cells below is needed, since both assign to `text_str`. For more details on this process, see the [N-grams notebook](./N-grams.ipynb).

In [None]:
data_file = '/work/Common-files/Data/Datasæt3/20191.csv' # Path to data file

df = pd.read_csv(data_file)

text_str = ' '.join(df['text'])
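Joining the column directly raises a `TypeError` if any row has a missing value, because pandas stores missing values as floats. The sketch below shows one way to guard against this. The tiny in-memory CSV and the names `df_example` and `example_str` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV with one empty text field (row 2).
csv_data = io.StringIO('id,text\n1,Første tale\n2,\n3,Anden tale\n')
df_example = pd.read_csv(csv_data)

# dropna() removes missing values; astype(str) guards against non-string cells.
texts = df_example['text'].dropna().astype(str)
example_str = ' '.join(texts)
print(example_str)
```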

In [None]:
text_file = '' # Path to text file

with open(text_file, encoding='utf-8') as f:  # Assumes the file is UTF-8 encoded
    text_str = f.read()

### Prepare the data
In order to work with our text data, we need to process our text string a bit.

First, we convert the text into a list of tokens with the NLTK `word_tokenize()` function. We also create an NLTK `Text` object, which allows us to apply various NLTK methods.

In [None]:
tokens = nltk.word_tokenize(text_str)
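To get a feel for what a token list looks like, here is a small sketch using `wordpunct_tokenize()`, a simpler regex-based NLTK tokenizer that needs no downloaded models. It is a stand-in for illustration only, and `word_tokenize()` may split some text differently. The sample sentence is made up. `word_tokenize()` also accepts a `language` argument, which may be worth setting to `'danish'` for this data.

```python
from nltk.tokenize import wordpunct_tokenize

sample = 'Hej, verden! Det er 2020.'

# Splits into runs of word characters and runs of punctuation.
sample_tokens = wordpunct_tokenize(sample)
print(sample_tokens)
```

Note that punctuation marks and numbers become tokens of their own, which is why we filter them out later in the notebook.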

The `Text` object is created from the list of tokens.

In [None]:
text = nltk.Text(tokens)

## NLTK methods

With our `Text` object we can perform a number of text analyses.

### `count()`
The simplest method is `count()`, which returns the count of a specific term.

In [None]:
text.count('grundloven')
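Keep in mind that `count()` matches tokens exactly, so it is case-sensitive. A minimal sketch with a made-up token list shows the difference that lowercasing the tokens first can make:

```python
import nltk

example_tokens = ['Grundloven', 'er', 'gammel', ',', 'men', 'grundloven', 'gælder', '.']
example_text = nltk.Text(example_tokens)

# Exact match: the capitalised 'Grundloven' is not counted.
exact_count = example_text.count('grundloven')

# Lowercasing all tokens first counts both variants.
lowered_text = nltk.Text([token.lower() for token in example_tokens])
lowered_count = lowered_text.count('grundloven')

print(exact_count, lowered_count)
```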

### `collocation_list()`
As in the n-grams notebook, `collocation_list()` returns a list of the most common word pairs in the text. Note that `collocation_list()` does not work in some versions of NLTK. If this is the case, try `collocations()` instead, which prints the pairs rather than returning them.

In [None]:
text.collocation_list()
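If neither method works in a given installation, a rough stand-in is to count adjacent word pairs directly with `nltk.bigrams()` and `FreqDist`. This is only a sketch of the idea: it ranks pairs by raw frequency, whereas NLTK's collocation methods apply additional filtering and scoring, so the results will differ. The token list is made up.

```python
import nltk

example_tokens = ['det', 'danske', 'samfund', 'og', 'det', 'danske', 'folketing',
                  'og', 'det', 'danske', 'samfund']

# Count every pair of adjacent tokens.
pair_counts = nltk.FreqDist(nltk.bigrams(example_tokens))
print(pair_counts.most_common(3))
```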

### `concordance()`
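The `concordance()` method prints each occurrence of a specific term together with its surrounding context. The output can be adjusted with the keyword parameters `width` (the length of each line in characters) and `lines` (the maximum number of lines to show).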

In [None]:
text.concordance('samfundssind')
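To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `kwic` and the `window` parameter (counted in tokens rather than characters) are assumptions made for this example.

```python
def kwic(tokens, term, window=3):
    """Return each occurrence of term with up to `window` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{token}] {right}'.strip())
    return lines


example_tokens = 'vi skal vise samfundssind nu og her'.split()
for line in kwic(example_tokens, 'samfundssind', window=2):
    print(line)
```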

### `similar()`
In order to identify words that appear in a similar context, we can use the `similar()` method. This can sometimes be useful if we want to look for common OCR errors.

In [None]:
text.similar('lovgivning')

### `dispersion_plot()`
The `dispersion_plot()` method lets us visualise where terms occur across our text. Each occurrence is plotted at its position in the token list. If the text has a temporal aspect, like our parliamentary data sorted by date, `dispersion_plot()` can approximate a timeline. However, because the horizontal axis measures token position rather than time, the timeline is skewed whenever the amount of text is unevenly spread over the period.

The method accepts a list of one or more terms as input.

In [None]:
terms = ['epidemi', 'pandemi', 'grundloven', 'mink', 'samfundssind']

text.dispersion_plot(terms)
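The data behind such a plot is simply the token positions of each term. The sketch below computes those positions in plain Python, which can be handy when we want the numbers rather than the figure. The helper name `term_offsets` and the token list are made up for illustration.

```python
def term_offsets(tokens, terms):
    """Map each term to the list of token positions where it occurs."""
    offsets = {term: [] for term in terms}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


example_tokens = ['mink', 'og', 'pandemi', 'og', 'mink', 'igen']
example_offsets = term_offsets(example_tokens, ['mink', 'pandemi', 'epidemi'])
print(example_offsets)
```

Terms that never occur, like `'epidemi'` here, get an empty list, which corresponds to an empty row in the plot.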

## Frequency distribution
Frequency distribution is another useful tool built into NLTK which gives us a quick overview of the most common words in our text.

We generate the frequency distribution from our list of tokens with the `FreqDist()` function.

In [None]:
fdist = nltk.FreqDist(tokens)

We can then inspect the most common words. The `most_common()` method returns a number of the most common tokens and how many times they appear in the text.

In [None]:
fdist.most_common(10)
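A frequency distribution can also be queried directly. The sketch below uses a made-up token list to show a single-token lookup, the total token count with `N()`, and `most_common()`:

```python
import nltk

example_tokens = ['og', 'det', 'og', 'er', 'og', 'det', '.']
example_fdist = nltk.FreqDist(example_tokens)

og_count = example_fdist['og']         # count of a single token
total = example_fdist.N()              # total number of tokens counted
top_two = example_fdist.most_common(2)

print(og_count, total, top_two)
```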

The immediate results are not very interesting. This is a good time for some data cleaning.

### Removing stop words
Our initial results included a lot of short, uninteresting words. These are commonly known as stop words. We can exclude these from our analysis by applying a list of stop words.

For this purpose, NLTK has a built-in list of stop words that we can use.

In [None]:
stopwords = nltk.corpus.stopwords.words('danish')

We then filter the list of tokens against the stop words. First, we create a new, empty list for the filtered tokens. We then iterate over the list of tokens and, for each word, check that its lower-case form is not in the stop words list. To get rid of punctuation and numbers, we also check that the word consists only of alphabetic characters with the string method `isalpha()`. If both conditions are met, we append the word to our new list.

In [None]:
filtered_tokens = []

for word in tokens:
    if word.lower() not in stopwords and word.isalpha():
        filtered_tokens.append(word)
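The same filter can be written more compactly as a list comprehension. Storing the stop words in a `set` also makes each membership check faster than searching a list, which can matter for large texts. The sketch below uses a made-up miniature stop word list and token list.

```python
# Hypothetical miniature stop word list, stored as a set for fast lookups.
example_stopwords = {'og', 'i', 'det', 'er'}
example_tokens = ['Det', 'er', 'en', 'pandemi', ',', 'og', 'mink', 'i', '2020', '.']

# Keep alphabetic tokens whose lower-case form is not a stop word.
example_filtered = [
    word for word in example_tokens
    if word.lower() not in example_stopwords and word.isalpha()
]
print(example_filtered)
```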

Now we can create a new frequency distribution from the filtered tokens.

In [None]:
fdist_filtered = nltk.FreqDist(filtered_tokens)

In [None]:
fdist_filtered.most_common(10)

The results are better than before but we still have a lot of uninteresting words.

### Custom stop words list
Whenever the NLTK stop words list is insufficient we can supply our own list of stop words which can be tailored to a specific domain.

We load a custom stop words list from a text file and save it in the variable `my_stopwords`.

In [None]:
stopwords_file = '/work/Common-files/Diverse/dk_stopord.txt' # Path to text file with stop words.

with open(stopwords_file, encoding='utf-8') as f:  # Assumes the file is UTF-8 encoded
    my_stopwords = f.read().split()

In [None]:
my_stopwords

We now create a new list for words filtered against our custom stop words list.

Note that this time we also store each word in lower case, so that capitalised and lower-case variants of the same word are counted together.

In [None]:
clean_tokens = []

for word in tokens:
    if word.lower() not in my_stopwords and word.isalpha():
        clean_tokens.append(word.lower())

We create a new frequency distribution and inspect the results.

In [None]:
fdist_clean = nltk.FreqDist(clean_tokens)

In [None]:
fdist_clean.most_common(20)

Again, the results are better but not perfect.

### Word length
Most of the common words in our filtered text are short and do not carry a lot of meaning. If we assume that short words in general are uninteresting, we can filter our text again and only keep words above a certain length.

Below, we create a new list from `filtered_tokens` and only keep words that are longer than six characters.

In [None]:
long_tokens = []

for word in filtered_tokens:
    if len(word) > 6:
        long_tokens.append(word)
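The threshold of six characters is a judgment call. One way to inform that choice is to look at how word lengths are distributed, by building a frequency distribution over lengths instead of words. The sketch below uses a made-up token list.

```python
import nltk

example_tokens = ['vi', 'har', 'en', 'grundlov', 'og', 'en', 'lovgivning']

# Count how many tokens there are of each length.
length_dist = nltk.FreqDist(len(word) for word in example_tokens)

# Word lengths from shortest to longest with their counts.
length_counts = sorted(length_dist.items())
print(length_counts)
```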

We create a frequency distribution of the long words and inspect the most common words.

In [None]:
fdist_long = nltk.FreqDist(long_tokens)

In [None]:
fdist_long.most_common(20)

Now our results are somewhat interesting, as we have a lot more meaningful words.

### Plotting frequency distribution
For a visual representation of the frequency distribution we can use the `plot()` method. We supply the method with the number of words we want to include.

Note that if we call the method without a number, it will attempt to plot every unique word in the text, which is slow and produces an unreadable plot.

In [None]:
fdist_long.plot(25)

We can add a title to our plot with the `title` keyword.

In [None]:
fdist_long.plot(30, title='Most common words')

We can also count the terms cumulatively with the `cumulative` keyword.

In [None]:
fdist_long.plot(30, title='Most common words (Cumulative)', cumulative=True)

## Wrap up

NLTK is a very large and powerful Python library and the possibilities are virtually endless. In this notebook, we have scratched the surface and demonstrated some of the tools.

We have converted our data to a `Text` object, which allows us to perform a number of analyses with little effort. We have also worked with frequency distributions and refined our data for a better result.

The examples in this notebook are adapted from the book [Natural Language Processing with Python](https://www.nltk.org/book/), which is a recommended resource if you want to learn more about NLTK.