# Introduction to Topic Modelling with Python

---
---
## What is Topic Modelling?

Topic modelling is a _distant reading_ technique for finding structure in large collections of text, without actually reading everything by eye. If you have hundreds or thousands of documents and want to understand roughly what your corpus contains, then topic modelling may be for you.

A topic modelling programme finds the words that appear frequently together in a document and groups them together to form 'topics'. A **topic** is a mixture of words that is supposed to characterise (part of) the content of a document — its theme or underlying ideas. For example, one topic of this [Wikipedia article](https://en.wikipedia.org/wiki/Black_hole) is:

* black, hole, mass, star

![First picture of a supermassive black hole, captured in 2019](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Black_hole_-_Messier_87.jpg/320px-Black_hole_-_Messier_87.jpg "First picture of a supermassive black hole, captured in 2019")

Not too surprising, you may think. We could say the topic seems pretty accurate from our perspective. What about a document that we are less familiar with? Here is a topic of a [speech](https://er.jsc.nasa.gov/seh/ricetalk.htm) made by John F. Kennedy at Rice University in 1962:

* space, new, year, man

![Charles Conrad Jr., Apollo 12 Commander, examines the unmanned Surveyor III spacecraft on the Moon](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4e/Surveyor_3-Apollo_12.jpg/274px-Surveyor_3-Apollo_12.jpg "Charles Conrad Jr., Apollo 12 Commander, examines the unmanned Surveyor III spacecraft on the Moon")

This is Kennedy's famous 'we choose to go to the moon' speech. Notice that 'moon' is not in this topic; but the speech does cover the history of humankind's ("man's") endeavours and emphasises a forward-looking perspective (the "new"-ness of advancements).

From these simplified examples, we can see that human intervention is still required to interpret what topics might 'mean'. Topic modelling is not magic; it is a tool that requires informed use and careful review, just like any other.

### So... Why Do Topic Modelling?
In the humanities, topic modelling may be used to support different approaches to large text corpora, such as:

* Survey a collection that is too big to read closely e.g. [Computational Historiography: Data Mining in a Century of Classics Journals](http://www.perseus.tufts.edu/publications/02-jocch-mimno.pdf) (PDF)
* Look at thematic trends over time in an archive e.g. [Topic Modeling Martha Ballard's Diary](http://www.cameronblevins.org/posts/topic-modeling-martha-ballards-diary/)
* Create metadata for an archive to improve accessibility e.g. [Topic modelling for the valorisation of digitised archives of the European Commission](https://ieeexplore.ieee.org/abstract/document/7840981)
* Understand current trends in social media relevant to your discipline e.g. [Mining the Open Web with ‘Looted Heritage’](https://electricarchaeology.ca/2012/06/08/mining-the-open-web-with-looted-heritage-draft/)

### Alternatives to Topic Modelling in Python
If you are looking to explore the topics of a few documents in a casual way, you can use the online digital texts environment [Voyant](), which allows you to upload or copy-and-paste texts and explore a corpus with a number of graphical tools, including topics.

For serious research, a well-known tool for topic modelling is [MALLET](http://mallet.cs.umass.edu/topics.php), a programme (written in Java) that you download to your computer. You have to type commands to use MALLET, but otherwise it does a great deal of the work for you. [Getting Started with Topic Modeling and MALLET](https://programminghistorian.org/en/lessons/topic-modeling-and-mallet) from Programming Historian gives a step-by-step tutorial on MALLET.

There is a graphical interface for MALLET called [Topic Modeling Tool](https://github.com/senderle/topic-modeling-tool) that is a bit easier to use. The [Quickstart Guide](https://senderle.github.io/topic-modeling-tool/documentation/2017/01/06/quickstart.html) will get you up and running.

If you are looking to use R rather than Python, then `tidytext` is a popular NLP library that will help you work with the `topicmodels` package. The book _Text Mining with R_ devotes [chapter 6](https://www.tidytextmining.com/topicmodeling.html) to this.

---
**With the alternatives out of the way, let's see how we can do topic modelling in Python!**

---
---

## How to Join In with Coding

* **Edit** any cell and try changing the code, or delete it and write your own.

* Before running a cell, try to **guess** what the output will be by thinking through what will happen.

* If you encounter an **error**, realise this is normal. Errors happen all the time and by reading the error message you will learn something new.

* Remember: you cannot break the notebook or your computer, so **don't be afraid to experiment**.

**Let's get coding!**

---
---
## Recap of Python Basics
Welcome back! Let's recap the Python that we learnt last time. Any questions?
### Strings
Create a _string_ and store it with a _name_:

In [202]:
my_sentence = 'The Moon formed 4.51 billion years ago.'
my_sentence

'The Moon formed 4.51 billion years ago.'

_Slice_ a string. Remember that indexing in Python starts at 0.

In [203]:
my_sentence[0:20]

'The Moon formed 4.51'

Transform a string with string methods. Important: the original string `my_sentence` is unchanged. Instead, a string method _returns_ a new string.

In [204]:
my_sentence.swapcase()

'tHE mOON FORMED 4.51 BILLION YEARS AGO.'
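
To see for yourself that the original string is left alone, here is a small sketch that stores the result of the method under a new name (`swapped` is just a name chosen for this illustration) and then prints both strings:

```python
my_sentence = 'The Moon formed 4.51 billion years ago.'

# The method returns a new string ...
swapped = my_sentence.swapcase()

# ... and the original is left exactly as it was
print(swapped)
print(my_sentence)
```

If you want to keep the transformed string, you have to store it with a name, as done here with `swapped`.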

Test a string with string methods:

In [205]:
my_sentence.islower()

False

Test a string to see if it contains another string:

In [206]:
'f' in my_sentence

True

Create a _list_ of strings:

In [207]:
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
          'The Moon was first reached in September 1959']
my_list

['The Moon formed 4.51 billion years ago',
 "The Moon is Earth's only permanent natural satellite",
 'The Moon was first reached in September 1959']

Index a list (a negative index counts from the end):

In [208]:
my_list[-1]

'The Moon was first reached in September 1959'
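
To make the difference between a single index and a slice concrete, here is a small sketch using the same list. A single index gives back one item, whereas a slice (written with a colon) gives back a new list. The names `last_item`, `first_two` and `last_as_list` are only for this illustration.

```python
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']

# A single index returns one item (a string here)
last_item = my_list[-1]

# A slice returns a new list, even if it holds only one item
first_two = my_list[0:2]
last_as_list = my_list[-1:]

print(type(last_item).__name__, type(first_two).__name__)
print(last_as_list)
```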

Create a transformed list of strings with a _list comprehension_:

In [209]:
new_list = [string.upper() for string in my_list if 'Earth' in string]
new_list

["THE MOON IS EARTH'S ONLY PERMANENT NATURAL SATELLITE"]

### Imports
`import` a _module_ and use it. A module is simply code 'written by someone else' in another file (or files).

In [210]:
import requests
response = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/1/0/1/1013/1013.txt')
text = response.text
text[681:900]

'THE FIRST MEN IN THE MOON\r\n\r\nby H.G. Wells\r\n\r\n\r\n\r\n\r\nChapter 1\r\n\r\n\r\n\r\n\r\nMr. Bedford Meets Mr. Cavor at Lympne\r\n\r\nAs I sit down to write here amidst the shadows of vine-leaves under the\r\nblue sky of southern Italy, it com'

`import` [Natural Language Tool Kit](http://www.nltk.org/) (NLTK) to help with natural language processing (NLP):

In [211]:
from nltk import word_tokenize

tokens = word_tokenize(text)
tokens[126:146]

['THE',
 'FIRST',
 'MEN',
 'IN',
 'THE',
 'MOON',
 'by',
 'H.G',
 '.',
 'Wells',
 'Chapter',
 '1',
 'Mr.',
 'Bedford',
 'Meets',
 'Mr.',
 'Cavor',
 'at',
 'Lympne',
 'As']

### Functions
_Call_ a _function_ with _arguments_. For example, here the function `most_common()` takes a single argument `10`, to give us the ten most common tokens.

In [212]:
from nltk.probability import FreqDist

freqdist = FreqDist(tokens)
freqdist.most_common(10)

[(',', 4612),
 ('.', 3851),
 ('the', 3639),
 ('and', 2538),
 ('of', 2483),
 ('I', 2190),
 ('a', 1809),
 ('to', 1658),
 ("''", 1214),
 ('``', 1160)]

---
---
## More About Tokenising and Normalisation

In the last workshop, in notebook `workshop-1-basics/2-collecting-and-preparing.ipynb`, we cleaned and prepared the text _The Iliad of Homer_ (translated by Alexander Pope (1899)) by:
* Tokenising the text into individual words.
* Normalising the text:
 * into lowercase,
 * removing punctuation,
 * removing non-words (empty strings, numerals, etc.),
 * removing stopwords.

One form of normalisation we didn't do last time is making sure that different _inflections_ of the same word are counted together. In English, words are modified to express quantity, tense, etc. (i.e. _declension_ and _conjugation_ for those who remember their language lessons!).

For example, 'fish', 'fishes', 'fishy' and 'fishing' are all formed from the root 'fish'. Last workshop, all these words would have been counted as different words, which may or may not be desirable.

### Stemming and Lemmatization

There are two main ways to normalise for inflection:

* **Stemming** - reducing a word to a stem by removing endings (a **stem** may not be an actual word).
* **Lemmatization** - reducing a word to its meaningful base form using its context (a **lemma** is typically a proper word in the language).
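
To make the distinction concrete, here is a deliberately crude sketch in plain Python. Neither function is a real algorithm: `toy_stem()`, `toy_lemmatize()` and the `TOY_LEMMAS` table are hypothetical names invented for this illustration, and they are far simpler than the Porter Stemmer or the WordNet lemmatizer. The point is only that a stemmer chops off endings by rule, while a lemmatizer looks up a known base form.

```python
from collections import Counter

def toy_stem(word):
    """Strip a few common endings by rule (NOT the Porter algorithm)."""
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip if at least three letters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A hand-written lookup table standing in for a real lemma dictionary
TOY_LEMMAS = {'fishes': 'fish', 'fishing': 'fish', 'ponies': 'pony'}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

words = ['fish', 'fishes', 'fishing', 'ponies']
stems = [toy_stem(word) for word in words]
lemmas = [toy_lemmatize(word) for word in words]

print(stems)           # the stem of 'ponies' is not a real word
print(lemmas)          # every lemma is a real word
print(Counter(stems))  # the three 'fish' words are now counted together
```

Notice the trade-off: the rule-based stemmer needs no dictionary but can produce non-words, while the lookup only works for words that are in its table.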

To do this we can use several facilities provided by NLTK. There are many different ways to stem and lemmatize words, but we will compare the results of the [Porter Stemmer](https://tartarus.org/martin/PorterStemmer/) and [WordNet](https://wordnet.princeton.edu/) lemmatizer.

In [213]:
hg_wells = text[118017:118088]
hg_wells

'All about us on the sunlit slopes frothed and swayed the darting shrubs'

In [214]:
tokens = word_tokenize(hg_wells)

from nltk import PorterStemmer

porter = PorterStemmer()
stems = [porter.stem(token) for token in tokens]
stems

['all',
 'about',
 'us',
 'on',
 'the',
 'sunlit',
 'slope',
 'froth',
 'and',
 'sway',
 'the',
 'dart',
 'shrub']

In [215]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/mary/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [216]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
lemmas

['All',
 'about',
 'u',
 'on',
 'the',
 'sunlit',
 'slope',
 'frothed',
 'and',
 'swayed',
 'the',
 'darting',
 'shrub']

What do you think about the results? Perhaps surprisingly, the lemmatizer seems to have performed more poorly than the stemmer since `frothed` and `darting` have not been reduced to `froth` and `dart`.

The different rules used to stem and lemmatize words are called _algorithms_ and they can result in different stems and lemmas. If the precise details of this are important to your research, you should compare the results of the various algorithms. Stemmers and lemmatizers are also available in many languages, not just English.

---
#### Going Further: Improving Lemmatization with Part-of-Speech Tagging

To improve the lemmatizer's performance we can tell it which _part of speech_ each word is, which is known as **part-of-speech tagging (POS tagging)**. A part of speech is the role a word plays in the sentence, e.g. verb, noun, adjective, etc.

In [217]:
# Download the POS tagger
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/mary/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [218]:
# Generate the POS tags for each token
tags = nltk.pos_tag(tokens)
tags

[('All', 'DT'),
 ('about', 'IN'),
 ('us', 'PRP'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('sunlit', 'NN'),
 ('slopes', 'VBZ'),
 ('frothed', 'VBN'),
 ('and', 'CC'),
 ('swayed', 'VBN'),
 ('the', 'DT'),
 ('darting', 'NN'),
 ('shrubs', 'NN')]

These tags that NLTK generates are from the [Penn Treebank II tag set](https://www.clips.uantwerpen.be/pages/MBSP-tags). For example, now we know that `frothed` is a 'verb, past participle' (VBN).

Unfortunately, the NLTK lemmatizer accepts WordNet tags (`ADJ, ADV, NOUN, VERB = 'a', 'r', 'n', 'v'`) instead! In theory, at least, if we pass the tagging information to the lemmatizer, the results are better.

In [219]:
# Mapping of tokens to WordNet POS tags
tags = [('All', 'n'),
 ('about', 'n'),
 ('us', 'n'),
 ('on', 'n'),
 ('the', 'n'),
 ('sunlit', 'a'),
 ('slopes', 'v'),
 ('frothed', 'v'),
 ('and', 'n'),
 ('swayed', 'v'),
 ('the', 'n'),
 ('darting', 'a'),
 ('shrubs', 'n')]

lemmas = [lemmatizer.lemmatize(*tag) for tag in tags]
lemmas

['All',
 'about',
 'u',
 'on',
 'the',
 'sunlit',
 'slop',
 'froth',
 'and',
 'sway',
 'the',
 'darting',
 'shrub']

Now `frothed` has been reduced to `froth` and `swayed` to `sway`. Notice, though, that `slopes` has become `slop` because the tagger mistook it for a verb, so the lemmas are only as good as the tags. In practice we may wish to [experiment](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/) with other lemmatizers to get the best results. The [SpaCy](https://spacy.io/) Python library has an excellent alternative lemmatizer, for example.
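
The WordNet tags in the cell above were typed in by hand. For a longer text you would want to convert the Penn Treebank tags automatically. Here is a minimal sketch of one way to do that, assuming the simple convention that tags starting with `JJ`, `RB` and `VB` are adjectives, adverbs and verbs, and that everything else is treated as a noun. The helper `penn_to_wordnet()` is a hypothetical name for this illustration and is not part of NLTK.

```python
def penn_to_wordnet(penn_tag):
    """Convert a Penn Treebank tag to a WordNet tag ('a', 'r', 'v' or 'n')."""
    if penn_tag.startswith('JJ'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if penn_tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    if penn_tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    return 'n'       # assumption: everything else is treated as a noun

# A few of the (token, Penn tag) pairs generated by nltk.pos_tag() above
penn_tags = [('sunlit', 'NN'), ('slopes', 'VBZ'), ('frothed', 'VBN'),
             ('swayed', 'VBN'), ('shrubs', 'NN')]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in penn_tags]
print(wordnet_tags)
```

Because the tagger labelled `sunlit` as a noun (`NN`), this sketch would pass `'n'` for it rather than the `'a'` in the hand-written mapping. An automatic conversion simply inherits whatever mistakes the tagger makes.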

---

---
#### Going Further: Beyond NLTK to SpaCy

NLTK was the first open-source Python library for Natural Language Processing (NLP), originally released in 2001, and it is still a valuable tool for teaching and research. Much of the literature uses NLTK code in its examples, which is why I chose to write this course using NLTK. As you may deduce from the part-of-speech tagging example (above), NLTK does have its limitations though.

In many ways NLTK has been overtaken in efficiency and ease of use by other, more modern libraries, such as [SpaCy](https://spacy.io/). SpaCy is designed to use less computer memory and split workloads across multiple processor cores (or even computers) so that it can handle very large corpora easily. It also has excellent documentation. If you are serious about text-mining with Python for a large research dataset, I recommend that you try SpaCy. If you have understood the text-mining principles we have covered with NLTK, you will have no trouble using SpaCy as well.

---

---
---
## Gensim Python Library for Topic Modelling

[Gensim](https://radimrehurek.com/gensim/) is an open-source library that specialises in topic modelling. It is powerful, easy to use and is designed to work with very large corpora. (Another Python library, [scikit-learn](https://scikit-learn.org), also has topic modelling, but we won't cover that here.)

### Collecting the Example Corpus: US Presidential Inaugural Addresses

First, we are going to load a corpus of speeches `nltk.corpus.inaugural` that comes packaged into NLTK. This is the C-Span Inaugural Address Corpus (public domain) that contains the inaugural address of every US president from 1789 to 2009.

In [220]:
import nltk
nltk.download('inaugural')
inaugural = nltk.corpus.inaugural

[nltk_data] Downloading package inaugural to /home/mary/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


To get an idea of what is inside, we can list the files:

In [268]:
files = inaugural.fileids()
files[0:10]

['1789-Washington.txt',
 '1793-Washington.txt',
 '1797-Adams.txt',
 '1801-Jefferson.txt',
 '1805-Jefferson.txt',
 '1809-Madison.txt',
 '1813-Madison.txt',
 '1817-Monroe.txt',
 '1821-Monroe.txt',
 '1825-Adams.txt']

And examine the first few words of each file:

In [222]:
for file in files[0:10]:
    print(inaugural.words(file))

['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', ...]
['Fellow', 'citizens', ',', 'I', 'am', 'again', ...]
['When', 'it', 'was', 'first', 'perceived', ',', 'in', ...]
['Friends', 'and', 'Fellow', 'Citizens', ':', 'Called', ...]
['Proceeding', ',', 'fellow', 'citizens', ',', 'to', ...]
['Unwilling', 'to', 'depart', 'from', 'examples', 'of', ...]
['About', 'to', 'add', 'the', 'solemnity', 'of', 'an', ...]
['I', 'should', 'be', 'destitute', 'of', 'feeling', ...]
['Fellow', 'citizens', ',', 'I', 'shall', 'not', ...]
['In', 'compliance', 'with', 'an', 'usage', 'coeval', ...]


---
#### Going Further: Corpora for Learning and Practicing Text-Mining
It is difficult to source pre-prepared corpora for learning and practicing text-mining. The documents must be good quality, easily available and distributed with a license that allows text-mining. NLTK comes with a number of corpora you can download from [`nltk_data`](http://www.nltk.org/nltk_data/) but these are quite old and limited in scope. It's worth searching around for [lists of corpora](https://nlpforhackers.io/corpora/) but bear in mind you must determine the true source and licensing of any corpus for yourself.

---

### Pre-Processing Text in Gensim
Before we can start to do topic modelling we must — of course! — clean and prepare the text by tokenising, removing stopwords, stemming, and so on. We could do this with NLTK, as we have learnt, but Gensim can do that for us too.

The defaults of `preprocess_string()` and `preprocess_documents()` use the following _filters_:

* Strip any HTML or XML tags
* Replace punctuation characters with spaces
* Remove repeating whitespace characters and turn tabs and line breaks into spaces
* Remove digits
* Remove stopwords
* Remove words with length less than 3 characters
* Lowercase
* Stem the words using a Porter Stemmer
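
A rough way to picture a chain of filters is that each one is a function that takes a string and returns a cleaned-up string, and the functions are applied one after another before the result is split into tokens. The sketch below illustrates that idea in plain Python. It is not Gensim's code: `strip_punct()`, `drop_short()` and `apply_filters()` are hypothetical stand-ins written for this illustration, and only three simple filters are shown.

```python
import re

def strip_punct(s):
    """Replace anything that is not a letter, digit, underscore or whitespace with a space."""
    return re.sub(r'[^\w\s]', ' ', s)

def drop_short(s, minsize=3):
    """Keep only the words that have at least `minsize` characters."""
    return ' '.join(word for word in s.split() if len(word) >= minsize)

def apply_filters(s, filters):
    """Apply each filter in turn, then split the cleaned string into tokens."""
    for f in filters:
        s = f(s)
    return s.split()

toy_filters = [str.lower, strip_punct, drop_short]
tokens = apply_filters('Fellow-Citizens of the Senate!', toy_filters)
print(tokens)
```

Since every filter receives the output of the one before it, the order of the list can change the result.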

In [237]:
import gensim
from gensim.parsing.preprocessing import *

# Pre-process the first file in the corpus as an example
washington = files[0]
text = inaugural.raw(washington)
tokens = preprocess_string(text)
tokens[0:10]

['fellow',
 'citizen',
 'senat',
 'hous',
 'repres',
 'vicissitud',
 'incid',
 'life',
 'event',
 'fill']

The Porter Stemmer that comes with Gensim does not give us real words; this will make our topics less readable. In order to lemmatize the words instead, we have to specify a _list of filters_ that we want `preprocess_string()` to apply.

Before that we will import an alternative lemmatizer from the [SpaCy](https://spacy.io/) library, as it is better by default than the NLTK one.

In [290]:
!spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/home/mary/PycharmProjects/intro-to-text-mining-with-python/venv/lib/python3.6/site-packages/en_core_web_sm
-->
/home/mary/PycharmProjects/intro-to-text-mining-with-python/venv/lib/python3.6/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [235]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en.lemmatizer import LOOKUP
lemmatize = Lemmatizer(lookup=LOOKUP).lookup
lemmatize('swayed')

'sway'

(👆👆👆 All you need to understand here is that we are using SpaCy's lemmatizer rather than NLTK's. If you don't understand the code, you can skip over it and continue.)

Now we apply a list of filters, which are in fact the same as the defaults, except that lowercasing is done with the string method `str.lower()` and the Gensim stemmer is left out:

In [288]:
filters = [strip_tags, 
           strip_punctuation, 
           strip_multiple_whitespaces, 
           strip_numeric, 
           remove_stopwords, 
           strip_short,
           str.lower]

# Pre-process the tokens with the filters
tokens = preprocess_string(text, filters=filters)

# Lemmatize the filtered tokens with SpaCy's lemmatizer
lemmas = [lemmatize(token) for token in tokens]
lemmas[0:10]

['fellow',
 'citizen',
 'stand',
 'today',
 'humble',
 'task',
 'grateful',
 'trust',
 'bestow',
 'mindful']

---
---
## More Python Essentials
Before we go on to the next notebook `2-topic-modelling-fundamentals.ipynb` we need to cover a few more Python essentials.

### Looping with `for` loops
We have already seen **`for` loops** in passing when we created new lists using list comprehensions:

In [271]:
game = ['rock', 'paper', 'scissors']
new_list = [move for move in game]
new_list

['rock', 'paper', 'scissors']

This is a special form of loop for comprehensions — but it essentially works the same as the normal kind.

The normal kind of `for` loop looks like this:

In [242]:
for move in game:
    print(move)

rock
paper
scissors


A `for` loop goes over every item in a list in turn — and runs whatever code is inside its _block_. It makes sure that every item is visited, and then it stops when it gets to the end. We call this **iteration**; the loop _iterates_ over the list. (Loops also work for many other types of things, like strings.)

Note that a _block_ of code in Python is indicated by _indenting_ the code by several spaces (typically four spaces).
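
To show that a list comprehension and a normal `for` loop really do the same job, here is a sketch that rewrites the comprehension from the recap as a loop. It uses the list method `append()`, which we have not covered in this notebook: it adds one item to the end of a list. The names `from_comprehension`, `from_loop` and `letters` are only for this illustration.

```python
my_list = ['The Moon formed 4.51 billion years ago',
           "The Moon is Earth's only permanent natural satellite",
           'The Moon was first reached in September 1959']

# List comprehension, as in the recap
from_comprehension = [string.upper() for string in my_list if 'Earth' in string]

# The same thing written as a normal for loop
from_loop = []
for string in my_list:
    if 'Earth' in string:
        from_loop.append(string.upper())

print(from_loop == from_comprehension)

# Loops work on strings too: each item is a single character
letters = []
for letter in 'Moon':
    letters.append(letter)
print(letters)
```

Note the two levels of indentation in the first loop: the `if` line is inside the `for` block, and the `append()` line is inside the `if` block.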

You can also put loops inside other loops:

In [276]:
lists = [
    [0, 1, 2],
    [True, False, True],
    ['straw', 'twigs', 'bricks']
]

for lst in lists:
    for item in lst:
        print(item)

0
1
2
True
False
True
straw
twigs
bricks


### Dictionaries
Dictionaries are a form of _mapping_. They map **keys** to **values**. You can think of it like the index at the back of a book, where the key is a word and its value is the page number where you can find that word in the book. To find the page number of a word, you look through the index and find the word you want (the key) and then look at the number (the value).

```
agriculture, 228 
air freight, 46 
airplane food, 19 
alcohol, 165 
alfalfa, 242 
```

_etc._


The Python dictionary is called a `dict` and it can hold (almost) any type of key and value: strings, numbers, Booleans (`True`, `False`) and more.

To create a new `dict` we use curly braces `{}` and inside put each key and value separated by a colon `:`

In [277]:
my_dict = {
    'agriculture': 228,
    'air freight': 46,
    'airplane food': 19,
    'alcohol': 165,
    'alfalfa': 242
}
my_dict

{'agriculture': 228,
 'air freight': 46,
 'airplane food': 19,
 'alcohol': 165,
 'alfalfa': 242}

So now we can find the page number (value) of any of these words (keys) by putting the key in square brackets `[]`:

In [278]:
my_dict['agriculture']

228

To add a new key-value pair to the dictionary we can use the key in square brackets `[]` and assign the new value to it with the assignment operator `=`. In this example, the new key is `'allergies'` and the new value is `210`.

In [279]:
my_dict['allergies'] = 210
my_dict

{'agriculture': 228,
 'air freight': 46,
 'airplane food': 19,
 'alcohol': 165,
 'alfalfa': 242,
 'allergies': 210}
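
Dictionaries and loops work well together. As a small sketch of why this matters for text-mining, here is one way to count how often each word appears in a list. The word list is made up for this illustration. It uses two dictionary methods that we have not covered: `get()`, which returns a fallback value when a key is missing, and `items()`, which gives each key together with its value.

```python
words = ['moon', 'star', 'moon', 'hole', 'star', 'moon']

counts = {}
for word in words:
    # get() returns the current count, or 0 if the word is not yet a key
    counts[word] = counts.get(word, 0) + 1

print(counts)

# Loop over the key-value pairs
for word, count in counts.items():
    print(word, count)
```

Assigning to `counts[word]` either adds a new key or replaces the value of an existing one, which is exactly the behaviour we saw with `my_dict['allergies'] = 210`.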

### Tuples
A tuple is a bit like a list, except unlike a list, tuples cannot be changed. You cannot add or remove items from a tuple once you have created it. Tuples are known as _immutable_.

NB: Tuple is often pronounced 'toople' if you are from the UK, or 'tupple' if you are from the US, but it doesn't really matter.

To create a new tuple we use parentheses `()`:

In [285]:
my_tuple = (1, 5.0, 'ten-thousand')
my_tuple


(1, 5.0, 'ten-thousand')

As with a list, you can _index_ a tuple to access its items:

In [286]:
my_tuple[2]

'ten-thousand'

But unlike a list, you cannot assign a new value to any of its items:

In [287]:
my_tuple[2] = 'rainbows and unicorns'

TypeError: 'tuple' object does not support item assignment
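
We have in fact already met tuples in this notebook: the output of `most_common()` in the recap is a list of `(token, count)` tuples. A handy trick with tuples is _unpacking_, where each item is given its own name in a single step. Here is a short sketch using the first three pairs from that output (the names `token`, `count` and `total` are only for this illustration):

```python
# The first three (token, count) pairs from the FreqDist output earlier
most_common = [(',', 4612), ('.', 3851), ('the', 3639)]

# Unpack a single tuple into two names
token, count = most_common[2]
print(token, count)

# Unpack each tuple in turn inside a for loop
total = 0
for token, count in most_common:
    total = total + count
print(total)
```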

### Reading Files and Writing to File
Last workshop I glossed over how we save to text files and read them back in again. I offered this guide [Reading and Writing Files in Python](https://realpython.com/read-write-files-python/#opening-and-closing-a-file-in-python), which is an excellent in-depth look that I recommend.

In brief, in order to open files we use the `open()` function and the keyword `with`.

For reading:

`with open(file, 'r') as reader:`

For writing:

`with open(file, 'w') as writer:`

Then whatever you put inside the code block will run with the file open and ready. Once your code has finished running the file is safely closed.

We can create and then write a text file with the `write()` method:

In [250]:
with open('blackhole.txt', 'w') as writer:
    writer.write('At the center of a black hole lies a singularity.')

> Now go to the Jupyter notebook folder `workshop-2-topic-modelling`, open the newly created text file `blackhole.txt` and inspect its contents!

We can read this file back in to a string with the `read()` method:

In [249]:
with open('blackhole.txt', 'r') as reader:
    sentence = reader.read()
    
sentence

'At the center of a black hole lies a singularity.'

To write a list of strings to a file in one go use `writelines()`, and likewise, to read a whole file into a list of lines use `readlines()` (or `readline()` to read a single line at a time). For all the details, see the tutorial linked above.
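
Here is a minimal sketch of working with a file one line at a time. To avoid cluttering the notebook folder it writes to a temporary folder using the standard `tempfile` module, which is an addition for this illustration. The file name `terms.txt` and its contents are made up.

```python
import tempfile
from pathlib import Path

lines = ['black hole\n', 'event horizon\n', 'singularity\n']

# Work in a temporary folder so that no files are left behind
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder, 'terms.txt')

    # writelines() writes each string in the list; it does not add line breaks
    with open(path, 'w') as writer:
        writer.writelines(lines)

    # readlines() returns a list with one string per line
    with open(path, 'r') as reader:
        read_back = reader.readlines()

    # Looping over the file object reads one line at a time
    stripped = []
    with open(path, 'r') as reader:
        for line in reader:
            stripped.append(line.strip())

print(read_back)
print(stripped)
```

Each line keeps its line break (`\n`) when it is read back in, which is why the last loop uses the string method `strip()` to remove it.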

---
---
## Pre-Processing the Corpus and Saving to File
We can now put everything we have learnt together to process our entire corpus of speeches, and save the clean lemma tokens to text files, ready to be loaded in the next notebook `2-topic-modelling-fundamentals`.

Let's step through this code now:

1. Create a location for the `data/inaugural-test` folder where we want to save the files:

In [269]:
from pathlib import Path
location = Path('data', 'inaugural-test')
# Create the folder if it does not already exist
location.mkdir(parents=True, exist_ok=True)

2. Loop over all the files in turn, using the Gensim `preprocess_string` function to prepare them, and save them as individual files:

In [264]:
for file in files:
    
    print(f'Processing file: {file}')
    
    text = inaugural.raw(file)
    tokens = preprocess_string(text, filters=filters)
    lemmas = [lemmatize(token) for token in tokens]
    
    with open(location / file, 'w') as writer:
        writer.write(' '.join(lemmas))

Processing file: 1789-Washington.txt
Processing file: 1793-Washington.txt
Processing file: 1797-Adams.txt
Processing file: 1801-Jefferson.txt
Processing file: 1805-Jefferson.txt
Processing file: 1809-Madison.txt
Processing file: 1813-Madison.txt
Processing file: 1817-Monroe.txt
Processing file: 1821-Monroe.txt
Processing file: 1825-Adams.txt
Processing file: 1829-Jackson.txt
Processing file: 1833-Jackson.txt
Processing file: 1837-VanBuren.txt
Processing file: 1841-Harrison.txt
Processing file: 1845-Polk.txt
Processing file: 1849-Taylor.txt
Processing file: 1853-Pierce.txt
Processing file: 1857-Buchanan.txt
Processing file: 1861-Lincoln.txt
Processing file: 1865-Lincoln.txt
Processing file: 1869-Grant.txt
Processing file: 1873-Grant.txt
Processing file: 1877-Hayes.txt
Processing file: 1881-Garfield.txt
Processing file: 1885-Cleveland.txt
Processing file: 1889-Harrison.txt
Processing file: 1893-Cleveland.txt
Processing file: 1897-McKinley.txt
Processing file: 1901-McKinley.txt
Processing

> Feel free to inspect these files now in the folder `data/inaugural-test`. If for some reason you have changed the code and it hasn't worked properly, don't worry! I've created a proper set to use in `data/inaugural`.
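
Since each file was saved as lemmas joined by single spaces, reading it back into a list of tokens is just a matter of reversing the `' '.join()` with the string method `split()`. The sketch below shows the idea on two tiny made-up files in a temporary folder. The file names, their contents and the `documents` dictionary are all hypothetical and are not the real corpus.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    location = Path(folder)

    # Two made-up files standing in for the saved speeches
    (location / '1789-Example.txt').write_text('fellow citizen senate')
    (location / '1793-Example.txt').write_text('fellow citizen call')

    # Map each file name to its list of tokens
    documents = {}
    for path in sorted(location.glob('*.txt')):
        with open(path, 'r') as reader:
            documents[path.name] = reader.read().split()

print(documents)
```

This brings together loops, dictionaries and file reading from the sections above.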

---
---
## Summary

In this notebook we have covered:

* Recap of Python basics from last workshop:
 * Strings and lists
 * Imports
 * Functions
* Stemming and lemmatization
* Gensim Python library for topic modelling
* More Python essentials:
 * Loops
 * Dictionaries and tuples
 * Reading and writing files

👌👌👌

In the next notebook `2-topic-modelling-fundamentals` we will walk through a full example of topic modelling using Gensim and the speeches we have prepared.