# Collecting and Preparing Text for Topic Modelling using Gensim

---
---
## More About Tokenising and Normalisation

In the last workshop, in notebook `workshop-1-basics/2-collecting-and-preparing.ipynb`, we cleaned and prepared the text _The Iliad of Homer_ (translated by Alexander Pope (1899)) by:
* Tokenising the text into individual words.
* Normalising the text:
 * into lowercase,
 * removing punctuation,
 * removing non-words (empty strings, numerals, etc.),
 * removing stopwords.

One form of normalisation we didn't do last time is making sure that different _inflections_ of the same word are counted together. In English, words are modified to express quantity, tense, etc. (i.e. _declension_ and _conjugation_ for those who remember their language lessons!).

For example, 'fish', 'fishes', 'fishy' and 'fishing' are all formed from the root 'fish'. Last workshop, all these words would have been counted as different words, which may or may not be desirable.

### Stemming and Lemmatization

There are two main ways to normalise for inflection:

* **Stemming** - reducing a word to a stem by removing endings (a **stem** may not be an actual word).
* **Lemmatization** - reducing a word to its meaningful base form using its context (a **lemma** is typically a proper word in the language).
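
To make the difference concrete, here is a deliberately naive sketch. `toy_stem` and `toy_lemmatize` are hypothetical helpers invented for this illustration; they are not the Porter or WordNet algorithms, which are considerably more sophisticated.

```python
# Toy illustration only: NOT the Porter Stemmer or the WordNet lemmatizer.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer looks words up rather than chopping them (tiny made-up dictionary).
TOY_LEMMAS = {"fishes": "fish", "fishing": "fish", "houses": "house", "was": "be"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for word in ["fishes", "fishing", "houses", "was"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stem of `houses` comes out as `hous`, which is not a word, while the lookup returns `house`. The lookup can also handle irregular forms such as `was`, which no amount of suffix chopping will turn into `be`.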

To do this we can use several facilities provided by NLTK. There are many different ways to stem and lemmatize words, but we will compare the results of the [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) and the [WordNet](https://wordnet.princeton.edu/) lemmatizer.

First, let's get the H.G. Wells book _The First Men in the Moon_ from Project Gutenberg:

In [8]:
import requests
response = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/1/1013/1013.txt')
text = response.text
text[681:900]

'THE FIRST MEN IN THE MOON\r\n\r\nby H.G. Wells\r\n\r\n\r\n\r\n\r\nChapter 1\r\n\r\n\r\n\r\n\r\nMr. Bedford Meets Mr. Cavor at Lympne\r\n\r\nAs I sit down to write here amidst the shadows of vine-leaves under the\r\nblue sky of southern Italy, it com'

Then we pick out one sentence from the book to use as an example:

In [10]:
hg_wells = text[118017:118088]
hg_wells

'All about us on the sunlit slopes frothed and swayed the darting shrubs'
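
Hard-coded character offsets such as `118017:118088` only work for one exact copy of the file. A minimal sketch of a more robust alternative is to search for the opening words of the sentence. The `sample` string below is a made-up stand-in for the downloaded `text`, and `extract_sentence` is a hypothetical helper.

```python
def extract_sentence(text, start_phrase, end_char="."):
    """Return the text from start_phrase up to (not including) the next end_char."""
    start = text.find(start_phrase)
    if start == -1:
        raise ValueError(f"Phrase not found: {start_phrase!r}")
    end = text.find(end_char, start)
    if end == -1:
        end = len(text)  # no end character found: take the rest of the text
    return text[start:end]

# Made-up stand-in for the full book text
sample = ("The sun rose. All about us on the sunlit slopes frothed and "
          "swayed the darting shrubs. We stared.")
sentence = extract_sentence(sample, "All about us")
print(sentence)
```

In the real file the lines are broken with `\r\n`, so it is safest to search for a short phrase that is unlikely to be split across two lines.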

Next we tokenise the sentence:

In [15]:
from nltk import word_tokenize

tokens = word_tokenize(hg_wells)
tokens

['All',
 'about',
 'us',
 'on',
 'the',
 'sunlit',
 'slopes',
 'frothed',
 'and',
 'swayed',
 'the',
 'darting',
 'shrubs']

And use the Porter Stemmer to find the word stems:

In [16]:
from nltk import PorterStemmer

porter = PorterStemmer()
stems = [porter.stem(token) for token in tokens]
stems

['all',
 'about',
 'us',
 'on',
 'the',
 'sunlit',
 'slope',
 'froth',
 'and',
 'sway',
 'the',
 'dart',
 'shrub']

To compare these stems with lemmas, we download the WordNet lemmatizer and use it:

In [17]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
lemmas

['All',
 'about',
 'u',
 'on',
 'the',
 'sunlit',
 'slope',
 'frothed',
 'and',
 'swayed',
 'the',
 'darting',
 'shrub']

What do you think about the results? Perhaps surprisingly, the lemmatizer seems to have performed more poorly than the stemmer since `frothed` and `darting` have not been reduced to `froth` and `dart`.
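
One way to compare the two results is to line them up and show only the tokens where the stemmer and the lemmatizer disagree. The lists below are copied from the outputs above so that the sketch runs on its own.

```python
tokens = ['All', 'about', 'us', 'on', 'the', 'sunlit', 'slopes',
          'frothed', 'and', 'swayed', 'the', 'darting', 'shrubs']
stems = ['all', 'about', 'us', 'on', 'the', 'sunlit', 'slope',
         'froth', 'and', 'sway', 'the', 'dart', 'shrub']
lemmas = ['All', 'about', 'u', 'on', 'the', 'sunlit', 'slope',
          'frothed', 'and', 'swayed', 'the', 'darting', 'shrub']

# Keep only the rows where the stem and the lemma are different
differences = [(token, stem, lemma)
               for token, stem, lemma in zip(tokens, stems, lemmas)
               if stem != lemma]

for token, stem, lemma in differences:
    print(f'{token:10} stem: {stem:10} lemma: {lemma}')
```

Besides the verbs, the comparison highlights that the stemmer lowercases its output and that the lemmatizer, which treats every word as a noun unless told otherwise, has turned `us` into `u`.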

The different rules used to stem and lemmatize words are called _algorithms_ and they can result in different stems and lemmas. If the precise details of this are important to your research, you should compare the results of the various algorithms. Stemmers and lemmatizers are also available in many languages, not just English.

---
### Going Further: Improving Lemmatization with Part-of-Speech (POS) Tagging

To improve the lemmatizer's performance we can tell it which _part of speech_ each word is. A part of speech is the role a word plays in the sentence, e.g. verb, noun, adjective, etc. Working out these labels automatically is known as **part-of-speech tagging (POS tagging)**.

NLTK has a POS tagger so let's download it:

In [21]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [20]:
# Generate the POS tags for each token
tags = nltk.pos_tag(tokens)
tags

[('All', 'DT'),
 ('about', 'IN'),
 ('us', 'PRP'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('sunlit', 'NN'),
 ('slopes', 'VBZ'),
 ('frothed', 'VBN'),
 ('and', 'CC'),
 ('swayed', 'VBN'),
 ('the', 'DT'),
 ('darting', 'NN'),
 ('shrubs', 'NN')]

These tags that NLTK generates are from the [Penn Treebank II tag set](https://www.clips.uantwerpen.be/pages/MBSP-tags). For example, now we know that `frothed` is a 'verb, past participle' (VBN).

Unfortunately, the NLTK lemmatizer accepts WordNet tags (`ADJ, ADV, NOUN, VERB = 'a', 'r', 'n', 'v'`) instead! In theory, at least, if we pass the tagging information to the lemmatizer, the results are better.

In [None]:
# Mapping of tokens to WordNet POS tags
tags = [('All', 'n'),
 ('about', 'n'),
 ('us', 'n'),
 ('on', 'n'),
 ('the', 'n'),
 ('sunlit', 'a'),
 ('slopes', 'v'),
 ('frothed', 'v'),
 ('and', 'n'),
 ('swayed', 'v'),
 ('the', 'n'),
 ('darting', 'a'),
 ('shrubs', 'n')]

lemmas = [lemmatizer.lemmatize(*tag) for tag in tags]
lemmas

Now `frothed` has been reduced to `froth`. In practice, however, we may wish to [experiment](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/) with other lemmatizers to get the best results. The [SpaCy](https://spacy.io/) Python library has an excellent alternative lemmatizer, for example.
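
Typing out the WordNet tags by hand is only practical for a single sentence. A common approach, sketched here, is to convert each Penn Treebank tag by looking at how it starts. `penn_to_wordnet` is a hypothetical helper, and falling back to noun for everything else is an assumption made here because noun is the lemmatizer's default. The tagged pairs are copied from the `nltk.pos_tag` output above.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'r', 'n' or 'v')."""
    if penn_tag.startswith('J'):    # JJ, JJR, JJS: adjectives
        return 'a'
    if penn_tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return 'v'
    if penn_tag.startswith('RB'):   # RB, RBR, RBS: adverbs
        return 'r'
    return 'n'                      # assumption: treat everything else as a noun

penn_tags = [('All', 'DT'), ('about', 'IN'), ('us', 'PRP'), ('on', 'IN'),
             ('the', 'DT'), ('sunlit', 'NN'), ('slopes', 'VBZ'),
             ('frothed', 'VBN'), ('and', 'CC'), ('swayed', 'VBN'),
             ('the', 'DT'), ('darting', 'NN'), ('shrubs', 'NN')]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in penn_tags]
print(wordnet_tags)
```

Each pair can be passed to the lemmatizer with `lemmatizer.lemmatize(*tag)`, as in the cell above. Note that the conversion inherits any mistakes the tagger made: here `sunlit` and `darting` come out as nouns rather than adjectives.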

---

---
### Going Further: Beyond NLTK to SpaCy

NLTK was the first open-source Python library for Natural Language Processing (NLP), originally released in 2001, and it is still a valuable tool for teaching and research. Much of the literature uses NLTK code in its examples, which is why I chose to write this course using NLTK. As you may deduce from the part-of-speech tagging example (above), NLTK does have its limitations though.

In many ways NLTK has been overtaken in efficiency and ease of use by other, more modern libraries, such as [SpaCy](https://spacy.io/). SpaCy is designed to use less computer memory and split workloads across multiple processor cores (or even computers) so that it can handle very large corpora easily. It also has excellent documentation. If you are serious about text-mining with Python for a large research dataset, I recommend that you try SpaCy. If you have understood the text-mining principles we have covered with NLTK, you will have no trouble using SpaCy as well.

---

---
---
## Gensim Python Library for Topic Modelling

[Gensim](https://radimrehurek.com/gensim/) is an open-source library that specialises in topic modelling. It is powerful, easy to use and is designed to work with very large corpora. (Another Python library, [scikit-learn](https://scikit-learn.org), also has topic modelling, but we won't cover that here.)

### Collecting the Example Corpus: US Presidential Inaugural Addresses

First, we are going to load a corpus of speeches, `nltk.corpus.inaugural`, that comes packaged with NLTK. This is the C-Span Inaugural Address Corpus (public domain) that contains the inaugural address of every US president from 1789 to 2009.

In [22]:
import nltk
nltk.download('inaugural')
inaugural = nltk.corpus.inaugural

[nltk_data] Downloading package inaugural to
[nltk_data]     /usr/local/share/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


To get an idea of what is inside, we can list the files:

In [23]:
files = inaugural.fileids()
files[0:10]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt']

And examine the first few words of each file:

In [24]:
for file in files[0:10]:
    print(inaugural.words(file))

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]
['Fellow', 'citizens', ',', 'I', 'am', 'again', ...]
['When', 'it', 'was', 'first', 'perceived', ',', 'in', ...]
['Friends', 'and', 'Fellow', 'Citizens', ':', 'Called', ...]
['Proceeding', ',', 'fellow', 'citizens', ',', 'to', ...]
['Unwilling', 'to', 'depart', 'from', 'examples', 'of', ...]
['About', 'to', 'add', 'the', 'solemnity', 'of', 'an', ...]
['I', 'should', 'be', 'destitute', 'of', 'feeling', ...]
['Fellow', 'citizens', ',', 'I', 'shall', 'not', ...]
['In', 'compliance', 'with', 'an', 'usage', 'coeval', ...]
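
Each file name encodes the year and the president's surname, which is handy metadata to keep alongside the text. A small sketch, using a few of the file names listed above; `parse_fileid` is a hypothetical helper, not part of NLTK.

```python
def parse_fileid(fileid):
    """Split a name like '1789-Washington.txt' into (1789, 'Washington')."""
    stem = fileid.rsplit('.', 1)[0]        # drop the '.txt' extension
    year, president = stem.split('-', 1)   # split on the first hyphen only
    return int(year), president

sample_files = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']
for fileid in sample_files:
    print(parse_fileid(fileid))
```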


---
#### Going Further: Corpora for Learning and Practicing Text-Mining
It is difficult to source pre-prepared corpora for learning and practicing text-mining. The documents must be good quality, easily available and distributed with a license that allows text-mining. NLTK comes with a number of corpora you can download from [`nltk_data`](http://www.nltk.org/nltk_data/) but these are quite old and limited in scope. It's worth searching around for [lists of corpora](https://nlpforhackers.io/corpora/) but bear in mind you must determine the true source and licensing of any corpus for yourself.

---

### Pre-Processing Text in Gensim
Before we can start to do topic modelling we must — of course! — clean and prepare the text by tokenising, removing stopwords, stemming, and so on. We could do this with NLTK, as we have learnt, but Gensim can do that for us too.

The defaults of `preprocess_string()` and `preprocess_documents()` use the following _filters_:

* Strip any HTML or XML tags
* Replace punctuation characters with spaces
* Remove repeating whitespace characters and turn tabs and line breaks into spaces
* Remove digits
* Remove stopwords
* Remove words with length less than 3 characters
* Lowercase
* Stem the words using a Porter Stemmer
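
Conceptually, each filter is just a function that takes a string and returns a string, and the filters are applied one after another before the result is split into tokens. The sketch below imitates that idea in plain Python. The three stand-in filters and `apply_filters` are hypothetical simplifications, not Gensim's own implementations.

```python
import re

def strip_digits(text):
    """Replace runs of digits with a space."""
    return re.sub(r'\d+', ' ', text)

def strip_punct(text):
    """Replace anything that is not a word character or whitespace with a space."""
    return re.sub(r'[^\w\s]', ' ', text)

def drop_short(text, min_length=3):
    """Keep only words with at least min_length characters."""
    return ' '.join(word for word in text.split() if len(word) >= min_length)

def apply_filters(text, filters):
    """Pass the text through each filter in turn, then split on whitespace."""
    for text_filter in filters:
        text = text_filter(text)
    return text.split()

result = apply_filters('In 1789, it was a NEW era!',
                       [str.lower, strip_digits, strip_punct, drop_short])
print(result)
```

Because every filter has the same shape (string in, string out), filters can be added, removed or reordered simply by editing the list.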

Using Gensim, we will preprocess just the _first_ file in the corpus as an example:

In [34]:
import gensim
from gensim.parsing.preprocessing import *

washington = files[0]
text = inaugural.raw(washington)
tokens = preprocess_string(text)
tokens[0:10]

['fellow',
 'citizen',
 'senat',
 'hous',
 'repres',
 'vicissitud',
 'incid',
 'life',
 'event',
 'fill']

Hmm, what has happened here to our tokens? 😕

The Porter Stemmer that comes with Gensim does not give us real words, and this will make our topics less readable.

We can do something about this, but the code is a bit more advanced. Feel free to skip over the next section and start reading again at 'Pre-Processing the Corpus and Saving to File'.

---
#### Going Further: Using SpaCy's Lemmatizer to Get Real Words

In order to lemmatize the words instead, we have to specify a _list of filters_ that we want `preprocess_string()` to apply.

Before that we will import an alternative lemmatizer from the [SpaCy](https://spacy.io/) library, as it is better by default than the NLTK one.

In [26]:
!spacy download en

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
    100% |████████████████████████████████| 11.1MB 5.6MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.1.0
You are using pip version 10.0.1, however version 19.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/Users/mary/digita

In [27]:
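# Note: these import paths match the spaCy 2.1 release installed above
# and may differ in other versions of spaCy
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.lemmatizer import LOOKUP

# Build a lookup-table lemmatizer and keep a reference to its lookup() method
lemmatize = Lemmatizer(lookup=LOOKUP).lookup
lemmatize('swayed')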

'sway'

(👆👆👆 All you need to understand here is that we are using SpaCy's lemmatizer rather than NLTK's. If you don't understand the code, you can skip over it and continue.)

Now we apply a list of filters, which are in fact the same as the defaults, except that lowercasing is done with the string method `str.lower` (applied last rather than first) and the Gensim stemmer is left out:

In [28]:
filters = [strip_tags, 
           strip_punctuation, 
           strip_multiple_whitespaces, 
           strip_numeric, 
           remove_stopwords, 
           strip_short,
           str.lower]

# Pre-process the tokens with the filters
tokens = preprocess_string(text, filters=filters)

# Lemmatize the filtered tokens with SpaCy's lemmatizer
lemmas = [lemmatize(token) for token in tokens]
lemmas[0:10]

['fellow',
 'citizen',
 'senate',
 'house',
 'representative',
 'among',
 'vicissitude',
 'incident',
 'life',
 'event']

Now we have real words for the tokens, instead of awkward stems. We'll use these lemmatized tokens for our topic modelling example.
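
The position of `str.lower` in the list matters. The presence of `among` in the output above suggests that stopword removal is case-sensitive, so a capitalised "Among" slips through when lowercasing happens last. The sketch below reproduces the effect with a tiny made-up stopword list; `drop_stopwords` and `run` are hypothetical stand-ins, not Gensim functions.

```python
TOY_STOPWORDS = {'among', 'the', 'to'}  # made-up list for illustration

def drop_stopwords(text):
    # Case-sensitive comparison (an assumption, see the paragraph above)
    return ' '.join(word for word in text.split() if word not in TOY_STOPWORDS)

def run(text, filters):
    """Apply each filter in turn, then split into tokens."""
    for text_filter in filters:
        text = text_filter(text)
    return text.split()

text = 'Among the vicissitudes incident to life'
lower_last = run(text, [drop_stopwords, str.lower])
lower_first = run(text, [str.lower, drop_stopwords])
print(lower_last)   # 'among' survives
print(lower_first)  # 'among' is removed
```

Moving `str.lower` to the front of `filters` would therefore remove a few more stopwords, at the cost of slightly different tokens from those shown above.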

---

---
---
## Pre-Processing the Corpus and Saving to File

### Reading Files and Writing to File
Last workshop I glossed over how we save to text files and read them back in again. I offered the guide [Reading and Writing Files in Python](https://realpython.com/read-write-files-python/#opening-and-closing-a-file-in-python), an excellent in-depth look that I recommend.

In brief, in order to open files we use the `open()` function and the keyword `with`.

For reading:

`with open(file, 'r') as reader:`

For writing:

`with open(file, 'w') as writer:`

Then whatever you put inside the code block will run with the file open and ready. Once your code has finished running the file is safely closed.
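
A quick way to convince yourself that the file really is closed is to check the file object's `closed` attribute. This sketch writes to a temporary folder so that it does not clutter the notebook folder; the file name `example.txt` is arbitrary.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'example.txt'

    with open(path, 'w') as writer:
        writer.write('Hello, file!')
        print(writer.closed)   # False: still inside the block

    print(writer.closed)       # True: closed automatically on leaving the block
    closed_after = writer.closed
```

The file is closed even if the code inside the block raises an error, which is the main reason to prefer `with` over calling `close()` yourself.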

We can create and then write a text file with the `write()` method:

In [29]:
with open('blackhole.txt', 'w') as writer:
    writer.write('At the center of a black hole lies a singularity.')

> Now go to the Jupyter notebook folder `workshop-2-topic-modelling`, open the newly created text file `blackhole.txt` and inspect its contents!

We can read this file back in to a string with the `read()` method:

In [30]:
with open('blackhole.txt', 'r') as reader:
    sentence = reader.read()
    
sentence

'At the center of a black hole lies a singularity.'

To write a list of strings in one go use `writelines()`, and to read all the lines of a file into a list use `readlines()`. To read a single line at a time, use `readline()` or loop over the file object. For all the details, see the tutorial linked above.
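
The line-based methods can be sketched as follows. Note that `writelines()` does not add line breaks for you, and `readlines()` returns every line in a list; to process one line at a time, loop over the file object. The temporary folder and the file name `lines.txt` are arbitrary choices for the demonstration.

```python
import tempfile
from pathlib import Path

lines = ['first line\n', 'second line\n', 'third line\n']

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'lines.txt'

    # writelines() writes each string as-is, so the '\n' must already be there
    with open(path, 'w') as writer:
        writer.writelines(lines)

    # readlines() returns a list with one string per line
    with open(path, 'r') as reader:
        all_lines = reader.readlines()

    # Looping over the file object reads one line at a time
    with open(path, 'r') as reader:
        stripped = [line.rstrip('\n') for line in reader]

print(all_lines)
print(stripped)
```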

### Pre-Processing Speeches and Saving Tokens to Text File
We can now put everything we have learnt together to pre-process our entire corpus of speeches, and save the clean lemma tokens to text files, ready to be loaded in the next notebook `3-topic-modelling-and-visualising.ipynb`.

Let's step through this code now:

1. Create a location for the `data/inaugural-test` folder where we want to save the files:

In [35]:
from pathlib import Path
location = Path('data', 'inaugural-test')
location.mkdir(parents=True, exist_ok=True)  # create the folder if it does not already exist

2. Loop over all the files in turn, using the Gensim `preprocess_string` function to prepare them, and save them as individual files:

In [36]:
for file in files:
    
    print(f'Processing file: {file}')
    
    text = inaugural.raw(file)
    tokens = preprocess_string(text, filters=filters)
    lemmas = [lemmatize(token) for token in tokens]
    
    with open(location / file, 'w') as writer:
        writer.write(' '.join(lemmas))

Processing file: 1789-Washington.txt
Processing file: 1793-Washington.txt
Processing file: 1797-Adams.txt
Processing file: 1801-Jefferson.txt
Processing file: 1805-Jefferson.txt
Processing file: 1809-Madison.txt
Processing file: 1813-Madison.txt
Processing file: 1817-Monroe.txt
Processing file: 1821-Monroe.txt
Processing file: 1825-Adams.txt
Processing file: 1829-Jackson.txt
Processing file: 1833-Jackson.txt
Processing file: 1837-VanBuren.txt
Processing file: 1841-Harrison.txt
Processing file: 1845-Polk.txt
Processing file: 1849-Taylor.txt
Processing file: 1853-Pierce.txt
Processing file: 1857-Buchanan.txt
Processing file: 1861-Lincoln.txt
Processing file: 1865-Lincoln.txt
Processing file: 1869-Grant.txt
Processing file: 1873-Grant.txt
Processing file: 1877-Hayes.txt
Processing file: 1881-Garfield.txt
Processing file: 1885-Cleveland.txt
Processing file: 1889-Harrison.txt
Processing file: 1893-Cleveland.txt
Processing file: 1897-McKinley.txt
Processing file: 1901-McKinley.txt
Processing

> Feel free to inspect these files now in the folder `data/inaugural-test`. If for some reason you have changed the code and it's not worked properly, don't worry! I've created a proper set to use in `data/inaugural`.
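
When the saved files are needed again, they can be read back into lists of tokens by splitting on spaces, which reverses the `' '.join(lemmas)` step. This is a minimal sketch under stated assumptions: `load_tokens` is a hypothetical helper, and the demonstration writes two tiny made-up files to a temporary folder rather than touching `data/inaugural-test`.

```python
import tempfile
from pathlib import Path

def load_tokens(folder):
    """Return {file name: list of tokens} for every .txt file in folder."""
    corpus = {}
    for path in sorted(Path(folder).glob('*.txt')):
        with open(path, 'r') as reader:
            corpus[path.name] = reader.read().split()
    return corpus

with tempfile.TemporaryDirectory() as folder:
    # Made-up stand-ins for the pre-processed speeches
    (Path(folder) / '1789-Example.txt').write_text('fellow citizen senate')
    (Path(folder) / '1793-Example.txt').write_text('fellow citizen')
    corpus = load_tokens(folder)

print(corpus)
```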

---
---
## Summary

In this notebook we have covered:

* Stemming and lemmatization
* Gensim Python library for topic modelling
* Pre-processing the text with Gensim
* Reading from and writing to text files

👌👌👌

In the next notebook `3-topic-modelling-and-visualising.ipynb` we will walk through a full example of topic modelling using Gensim and the speeches we have prepared.