# Exploratory Topic Modelling Using Python
##### by Mike Bryant and Maria Dermentzi

This notebook aims to walk readers through the process of topic modelling in Python and accompanies the article (to be) published in the European Holocaust Research Infrastructure (EHRI) Document Blog entitled "Exploratory Topic Modelling Using Python".

#### Credits:
The transcripts that form the corpus in this tutorial were obtained through the [United States Holocaust Memorial Museum](https://www.ushmm.org/) (USHMM).

## What is Topic Modelling?

Topic modelling is a technique by which documents within a corpus are clustered based on how certain groups of terms are used together within the text. The commonalities between such term groupings tend to form what we would normally call “topics”, providing a way to automatically categorise documents by their structural content, rather than by any a priori knowledge system. Topic modelling is generally most effective when a corpus is large and diverse, so that the individual documents within it are not too similar in composition. In EHRI, of course, we focus on the Holocaust, so the documents available to us are naturally restricted in scope. It was an interesting experiment, however, to test to what extent a corpus of Holocaust-related documents could be topic modelled, and what “topics” would emerge from it.

The specific type of topic modelling we’re looking at is called latent Dirichlet allocation (LDA), the subject of an influential paper by Blei et al. (2003).
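LDA is usually described through a generative story: each document draws its own mixture of topics from a Dirichlet distribution, and each word is produced by first picking a topic from that mixture and then picking a word from that topic. The following is a minimal sketch of that story using only the Python standard library. The two toy topics, their vocabularies and the function names are made up for illustration and are not part of this tutorial's pipeline.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's topic mixture
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word according to that topic's word distribution
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Two made-up topics, each a probability distribution over a tiny vocabulary
toy_topics = [
    {"train": 0.5, "station": 0.3, "journey": 0.2},
    {"school": 0.4, "family": 0.4, "village": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 2) for t in theta], words)
```

Fitting an LDA model works in the opposite direction: given only the words of each document, it estimates the topic mixtures and the word distributions that could plausibly have produced them.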

## The Dataset/Putting Together the Corpus
We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.

Most of the testimonies comprising our corpus come from survivors of the Holocaust. These testimonies are usually the output of an interview process, which typically follows a certain structure. For example, the Oral History Interview Guidelines published by the USHMM state that interviews with Holocaust survivors are usually structured in three parts: ‘prewar life, the Holocaust and wartime experiences, and postwar experiences’ (United States Holocaust Memorial Museum, 2007, p. 26). There are definite limitations to what topic modelling can reveal about a collection of documents that relate to the same general subject (in this case, the Holocaust) and that follow a more-or-less similar structure, but the results are nonetheless interesting and potentially useful.

## The Process

We import the Python libraries that we are going to use. If you are running this notebook on your local machine, you might need to install any libraries that are not already on your computer. A requirements.txt file is also provided in the GitHub repository.

In [1]:
import pandas as pd
import re
import requests
import numpy as np
import spacy
!python -m spacy download en_core_web_sm
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from warnings import filterwarnings
filterwarnings('ignore')
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from bs4 import BeautifulSoup

Collecting en-core-web-sm==3.3.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mdermentzi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To create our corpus, first, we need to create a list of the USHMM oral testimonies which are in English and have a transcript attached to them.

Using the USHMM [collections search tool](https://collections.ushmm.org/search/?f%5Bavailability%5D%5B%5D=transcript&f%5Bavailability%5D%5B%5D=english&f%5Brecord_type_facet%5D%5B%5D=Oral+History&per_page=50), we notice that, as of the 14th of June 2022, there are 2,894 oral history records which list a transcript and are in English. This means that if we set the number of search results to be displayed per page to 50 (the maximum allowed), we will get 58 pages of results in total.

Saving this number in the `pages` variable will help us iterate over the search result pages and scrape their content for further analysis.
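The page count quoted above follows from a ceiling division of the number of results by the page size. A minimal sketch of that arithmetic, where the variable names `total_results` and `per_page` are our own and only the figures come from the text:

```python
import math

total_results = 2894  # English oral history records listing a transcript (14 June 2022)
per_page = 50         # maximum number of results the search tool shows per page

# Any partially filled last page still counts as a page, hence the ceiling
pages = math.ceil(total_results / per_page)
print(pages)  # 58
```

In the scraping cell below, the same number is read from the server's JSON response rather than hard-coded, so it stays correct if the collection grows.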

#### Note:
The next few cells take a long time to run. You don't need to run them because their output is provided within the GitHub repository. We kept this code in the notebook for demonstration purposes only. To load the data and experiment with this topic model and its parameters, you can load the associated `pickle` files using the `.read_pickle()` calls provided later in the tutorial.

In [171]:
# The URL to be used for each HTTP request
url = "https://collections.ushmm.org/search/?f%5Bavailability%5D%5B%5D=transcript&f%5Bavailability%5D%5B%5D=english&f%5Brecord_type_facet%5D%5B%5D=Oral+History&per_page=50"

# We create an empty list, where we will store the responses to our HTTP requests
responses = []

# We request the first page of results as JSON and store the total number of result pages
res = requests.get(url + "&page=1", headers={"Accept": "application/json"})
response = res.json()
pages = response['response']['pages']['total_pages']

# We iterate over every page and request its content; store the response in the
# responses list; and print the status code of each response to check whether
# the request was successful (the 200 status code means the request was successful).
for page in range(pages):
    responses.append(requests.get(url + f"&page={page+1}"))
    print(f"{page+1}: {responses[-1].status_code}")

1: 200
2: 200
3: 200
4: 200
5: 200
6: 200
7: 200
8: 200
9: 200
10: 200
11: 200
12: 200
13: 200
14: 200
15: 200
16: 200
17: 200
18: 200
19: 200
20: 200
21: 200
22: 200
23: 200
24: 200
25: 200
26: 200
27: 200
28: 200
29: 200
30: 200
31: 200
32: 200
33: 200
34: 200
35: 200
36: 200
37: 200
38: 200
39: 200
40: 200
41: 200
42: 200
43: 200
44: 200
45: 200
46: 200
47: 200
48: 200
49: 200
50: 200
51: 200
52: 200
53: 200
54: 200
55: 200
56: 200
57: 200
58: 200


We check whether the number of responses in the `responses` array matches the total number of pages as expected.

In [172]:
len(responses)

58

Using the Beautiful Soup library (Richardson, n.d.), which makes scraping content from HTML and XML pages easier, we extract the links to each individual oral history record found in the responses that we obtained in the code cells above.

To do this, we go back to the USHMM collections search tool and inspect the HTML code of a search result page using our browser's developer tools. We notice that the links that we are looking for can be found within the `div` elements with the class `documentHeader`. For each `response` in our `responses`, we:
1. parse the content of the `response`
2. iterate over every `div` element with the class `documentHeader`
3. extract the value of the `href` attribute of the first `a` element in the `div`
4. append it to the `hrefs` list, which starts out empty.


In [173]:
hrefs = []
for response in responses:
    soup = BeautifulSoup(response.text, "html.parser")     
    for div in soup.find_all('div',attrs={"class" : "documentHeader"}):
        m = div.find('a')['href']
        hrefs.append(m)

We observe the output of the previous cell by printing the total number of links that were extracted and the first link in our resulting array.

We notice that we have extracted 2,894 links, which matches the total number of search results returned by the USHMM collections search tool.

In [174]:
print(len(hrefs), hrefs[0])

2894 /search/catalog/irn509709
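To try the extraction logic without contacting the USHMM server, the steps above can be sketched on a made-up HTML fragment. This sketch deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs. The class name `HeaderLinkParser`, the sample markup and the `irn` identifiers are all invented for illustration.

```python
from html.parser import HTMLParser

class HeaderLinkParser(HTMLParser):
    """Collect the first link found inside each <div class="documentHeader">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of <div> tags inside a documentHeader div
        self.found = False  # whether the current header has already yielded a link
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1  # a nested div inside a header
            elif "documentHeader" in (attrs.get("class") or "").split():
                self.depth = 1   # entering a new header
                self.found = False
        elif tag == "a" and self.depth > 0 and not self.found and "href" in attrs:
            self.hrefs.append(attrs["href"])
            self.found = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

# Made-up fragment imitating the structure of a search result page
sample_html = """
<div class="documentHeader"><h3><a href="/search/catalog/irn000001">First</a></h3></div>
<div class="other"><a href="/ignored">Not a record</a></div>
<div class="documentHeader row"><a href="/search/catalog/irn000002">Second</a><a href="/extra">More</a></div>
"""

parser = HeaderLinkParser()
parser.feed(sample_html)
print(parser.hrefs)  # ['/search/catalog/irn000001', '/search/catalog/irn000002']
```

Only the first link of each header is kept, which mirrors the `div.find('a')` call in the Beautiful Soup version.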


We know that by appending ".json" to the end of a testimony URL, we get a .json file containing all metadata related to the record. In the next code cell, we create an array containing the URL of each Hypertext Transfer Protocol (HTTP) request that we need to send to the USHMM server in order to get the transcript and other metadata of each record.

In [None]:
json_requests = [f"https://collections.ushmm.org{href}.json" for href in hrefs]

In [None]:
# We observe the first url to check whether our code worked as expected
json_requests[0]

'https://collections.ushmm.org/search/catalog/irn504618.json'

We are now ready to send the requests to the server and get the transcripts and other metadata. We store the metadata that is of interest in a Python `dictionary`, which we append to the `testimonies` array. For every record that is successfully processed, we print its RG_number (a unique identifying number assigned to the record by USHMM staff).

In [None]:
count = 0   # records processed successfully
errors = 0  # records skipped because of an error
testimonies = []
for item in json_requests:
    try:
        res = requests.get(item).json()['response']['document']
        # Most fields are lists; we keep the first value, or None if the field is missing
        rg_number = res['rg_number'] if 'rg_number' in res else None
        date = res['display_date'][0] if 'display_date' in res else None
        text = res['fnd_content_web'][0] if 'fnd_content_web' in res else None
        conditions_access = res['conditions_access'][0] if 'conditions_access' in res else None
        conditions_use = res['conditions_use'][0] if 'conditions_use' in res else None
        rights_condition_access = res['rights_condition_access'][0] if 'rights_condition_access' in res else None
        rights_condition_use = res['rights_condition_use'][0] if 'rights_condition_use' in res else None

        testimonies.append({"RG_number": rg_number,
                            "text": text,
                            "display_date": date,
                            # Fall back to the rights_* fields when the conditions_* fields are missing
                            "conditions_access": conditions_access or rights_condition_access or None,
                            "conditions_use": conditions_use or rights_condition_use or None,
                           })
        print(rg_number)
        count += 1
    except AttributeError:
        errors += 1

RG-50.030.0124
RG-50.155.0004
RG-50.155.0011
RG-50.999.0617
RG-50.030.0272
RG-50.030.0294
RG-50.470.0015
RG-50.030.0349
RG-50.030.0325
RG-50.030.0333
RG-50.030.0275
RG-50.042.0013
RG-50.477.0803
RG-50.999.0505
RG-50.477.1158
RG-50.244.0008
RG-50.165.0066
RG-50.154.0011
RG-90.008.0016
RG-90.121.0020
RG-90.121.0016
RG-50.030.0429
RG-50.147.0016
RG-50.477.0981
RG-50.030.0430
RG-50.477.1060
RG-50.999.0533
RG-50.477.0967
RG-50.462.0084
RG-50.462.0069
RG-50.477.1403
RG-50.477.1378
RG-50.010.0015
RG-50.010.0011
RG-50.096.0001
RG-50.030.0248
RG-50.042.0005
RG-50.030.0384
RG-50.155.0002
RG-50.147.0010
RG-50.477.0784
RG-50.233.0064
RG-50.233.0069
RG-50.233.0006
RG-50.477.1458
RG-50.999.0708
RG-50.549.05.0004
RG-50.106.0036
RG-50.999.0649
RG-50.999.0593
RG-50.477.0361
RG-50.030.0215
RG-50.030.0128
RG-50.147.0009
RG-50.030.0424
RG-50.030.0873
RG-50.467.0002
RG-50.030.0464
RG-90.217.0041
RG-50.462.0081
RG-50.462.0039
RG-50.477.1412
RG-50.235.0001
RG-50.030.0138
RG-50.002.0015
RG-50.030.0279
RG-50.4

In [None]:
# We observe how many testimonies we managed to retrieve
len(testimonies)

2894
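The fallback used when collecting each record's metadata (prefer the `conditions_*` field, otherwise use the corresponding `rights_condition_*` field) can be tried offline. The following sketch uses hypothetical helper names (`first_value`, `pick_condition`) and made-up records that only imitate the shape of the USHMM JSON documents.

```python
def first_value(document, key):
    """Return the first item of a list-valued field, or None if it is missing or empty."""
    values = document.get(key)
    return values[0] if values else None

def pick_condition(document, primary, fallback):
    """Prefer the primary field and fall back to the secondary one."""
    return first_value(document, primary) or first_value(document, fallback)

# Made-up records: one with the primary field, one with only the fallback, one with neither
records = [
    {"conditions_access": ["No restrictions on access"]},
    {"rights_condition_access": ["There are no known restrictions on access to this material."]},
    {},
]

resolved = [pick_condition(r, "conditions_access", "rights_condition_access") for r in records]
print(resolved)
```

Records for which neither field is present end up with `None`, which is what later shows up as a null value in the DataFrame.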

We are now ready to transfer the transcripts and their associated metadata into a tabular structure called a pandas `DataFrame` (similar to a spreadsheet), which will allow us to analyse and manipulate the data more easily using the pandas (Reback et al., 2020) Python library.

In [None]:
df = pd.DataFrame.from_dict(testimonies)

By calling the `.info()` method on a pandas DataFrame, we get an overview of its contents. We can see that `df` consists of 2,894 entries and that, while the `RG_number` and `text` columns have no null entries, the `display_date`, `conditions_access` and `conditions_use` columns are null in some rows.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2894 entries, 0 to 2893
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          2894 non-null   object
 1   text               2894 non-null   object
 2   display_date       2798 non-null   object
 3   conditions_access  2810 non-null   object
 4   conditions_use     2378 non-null   object
dtypes: object(5)
memory usage: 113.2+ KB


In [17]:
# This is to increase the amount of text displayed within the columns
pd.set_option('display.max_colwidth', 500)

We observe the first 10 rows of the DataFrame. Notice that some of the entries list the conditions under which they can be used.

In [None]:
df.head(10)

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use
0,RG-50.030.0124,"United States Holocaust Memorial Museum Interview with Avram Lazar November 16, 1990 RG-50.030*124 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following oral history testimony is the result of a videotaped interview with Avram Lazar, conducted by Linda Kuzmack on November 16, 1990 on behalf of ...",1990 November 16,No restrictions on access,Restrictions on use. Interview cannot be used for sale in the Museum Shop. Interview cannot be used by a third party for creation of a work for commercial purposes.
1,RG-50.155.0004,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1993 January 26,No restrictions on access,No restrictions on use
2,RG-50.155.0011,"This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked f...",1985 April 17,No restrictions on access,No restrictions on use
3,RG-50.999.0617,"1 Good morning, and welcome to the United States Holocaust Memorial Museum. My name is Bill Benson. I am the host of the museum's public program, First Person. Thank you for joining us today. We are in our 18th year of First Person. Our First Person today is Mr. Harry Markowicz, whom you shall meet shortly. This 2017 season of First Person is made possible by the generosity of the Louis Franklin Smith Foundation, with additional funding from the Arlene and Daniel Fisher Foundation. We are g...",2017 July 06,No restrictions on access,No restrictions on use
4,RG-50.030.0272,"United States Holocaust Memorial Museum Interview with Peter Feigl August 23, 1995 RG-50.030*0272 PREFACE The following oral history testimony is the result of a videotaped interview with Peter Feigl, conducted by Neenah Ellis on August 23, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Ho...",1995 August 23,No restrictions on access,No restrictions on use
5,RG-50.030.0294,"United States Holocaust Memorial Museum Interview with Felix Horn July 19, 1994 RG-50.030*0294 PREFACE The following oral history testimony is the result of a videotaped interview with Felix Horn, conducted by Sandra Bradley on July 19, 1994 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Holoc...",1994 July 19,No restrictions on access,No restrictions on use
6,RG-50.470.0015,"WILLIAM MCWORKMAN: Perhaps I should start by saying that I was in the 12th armored division--one of several armored divisions in the 3rd and 7th Army who drove south toward Austria. Our original mission as Munich, but and OSS agent who we had sent forward earlier in the week, we found from the Munich burgermeister or mayor that they were going to give up the city without a fight, so our mission was changed so we headed more to the south and the east on the west side of Amersee down through--...",1995 February 21,No restrictions on access,No restrictions on use
7,RG-50.030.0349,"United States Holocaust Memorial Museum Interview with Dr. Jacques Godel August 28, 1995 RG-50.030*0349 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following oral history testimony is the result of a taped interview with Dr. Jacques Godel, conducted on August 28, 1995 on behalf of the United S...",1995 August 15,No restrictions on access,No restrictions on use
8,RG-50.030.0325,"United States Holocaust Memorial Museum Interview with Fritz Schnaittacher May 9, 1995 RG-50.030*0325 PREFACE The following oral history testimony is the result of a videotaped interview with Fritz Schnaittacher, conducted by Randy Goldman on May 9, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United ...",1995 May 09,No restrictions on access,No restrictions on use
9,RG-50.030.0333,"United States Holocaust Memorial Museum Interview with Rudolph Haas June 13, 1995 RG-50.030*0333 PREFACE The following oral history testimony is the result of a videotaped interview with Rudolph Haas, conducted by Joan Ringelheim on June 13, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States H...",1995 June 13,No restrictions on access,No restrictions on use


The following line of code saves the current DataFrame in a [`pickle`](https://docs.python.org/3/library/pickle.html) file, so that next time we run this notebook we can read the saved pickle file and continue from this point onwards without having to go through the scraping and HTTP requests process again.

#### Note: 
While the code to achieve this is provided, the file itself is not provided as part of this tutorial. However, you can import the DataFrame that includes only the unrestricted testimonies further below.

In [None]:
df.to_pickle("ushmm_oral_testimonies_df.pkl")

The following line reads the saved `pickle` file from the disk and converts it into a new DataFrame. 

In [87]:
## Read the already saved DataFrame. Uncomment (by removing the hash symbol) to run.
# df = pd.read_pickle("ushmm_oral_testimonies_df.pkl")
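For readers unfamiliar with pickling, the save-and-restore round trip can be sketched with the standard library's `pickle` module, which pandas builds on. The file name `records.pkl` and the sample record are made up, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

# A made-up stand-in for the scraped data
records = [{"RG_number": "A-1", "text": "made-up transcript"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.pkl")

    # Serialise the Python object to disk
    with open(path, "wb") as fh:
        pickle.dump(records, fh)

    # Read it back and rebuild an equal object
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == records)  # True
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.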

Some records come with restrictions on use or access. In the following cells, we check what these restrictions are and how many records they concern, so that we can later filter the DataFrame to include only unrestricted records.

In [88]:
pd.set_option('display.max_rows', None)
df['conditions_access'].value_counts()

No restrictions on access                                                                                                                                                                                                                                                                                                     2767
There are no known restrictions on access to this material.                                                                                                                                                                                                                                                                     32
Restrictions on access. Access to this interview is restricted to onsite at the United States Holocaust Memorial Museum. Requests for access outside the Museum must be submitted to the Tauber Holocaust Library of the Jewish Family and Children's Services of San Francisco, the Peninsula, Marin and Sonoma Counties.       8
Restrictions on access. See don

In [89]:
df['conditions_use'].value_counts()

No restrictions on use                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      1972
Restrictions on use. Interview can not be used for sale in the Museum Shop.                                                                                                                                                                                                                                                                                                                                                                                                                            

To be on the safe side copyright-wise (see also the USHMM's [Terms of Use](https://www.ushmm.org/copyright-and-legal-information/terms-of-use)), we create a copy of the DataFrame that only includes records which have no restrictions on use or access or for which the USHMM states that no further permission is required to use them.

In [90]:
# Keep only records with no restrictions on access and no restrictions on use.
# .copy() gives us an independent DataFrame that we can safely modify later on.
unrestricted_df = df[((df['conditions_access'] == "No restrictions on access")
                        | (df['conditions_access'] == "There are no known restrictions on access to this material."))
                        & ((df['conditions_use'] == 'No restrictions on use')
                        | (df['conditions_use'] == "To the best of the Museum's knowledge, there are no known copyright restrictions on the material(s) in this collection, or the material is in the public domain. You do not require further permission from the Museum to use this material."))].copy()

This leaves us with 2,003 records.

In [91]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2003 entries, 1 to 2893
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          2003 non-null   object
 1   text               2003 non-null   object
 2   display_date       1912 non-null   object
 3   conditions_access  2003 non-null   object
 4   conditions_use     2003 non-null   object
dtypes: object(5)
memory usage: 93.9+ KB
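The same kind of filter can also be written with `Series.isin`, which some readers find easier to extend when the list of accepted values grows. The sketch below runs on a made-up four-row DataFrame. The names `ACCESS_OK` and `USE_OK` are our own, and the accepted values are a shortened version of those used in the cell above.

```python
import pandas as pd

# Accepted values (shortened, for illustration only)
ACCESS_OK = ["No restrictions on access",
             "There are no known restrictions on access to this material."]
USE_OK = ["No restrictions on use"]

toy = pd.DataFrame({
    "RG_number": ["A-1", "A-2", "A-3", "A-4"],
    "conditions_access": ["No restrictions on access",
                          "No restrictions on access",
                          "Restrictions on access.",
                          "There are no known restrictions on access to this material."],
    "conditions_use": ["No restrictions on use",
                       None,
                       "No restrictions on use",
                       "No restrictions on use"],
})

# A row is kept only if both its access and its use condition are on the accepted lists
mask = toy["conditions_access"].isin(ACCESS_OK) & toy["conditions_use"].isin(USE_OK)
toy_unrestricted = toy[mask].copy()
print(toy_unrestricted["RG_number"].tolist())  # ['A-1', 'A-4']
```

Rows with a missing condition, such as `A-2`, are excluded, which matches the cautious approach taken in this tutorial.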


In [92]:
unrestricted_df.head()

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use
1,RG-50.155.0004,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1993 January 26,No restrictions on access,No restrictions on use
2,RG-50.155.0011,"This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked f...",1985 April 17,No restrictions on access,No restrictions on use
3,RG-50.999.0617,"1 Good morning, and welcome to the United States Holocaust Memorial Museum. My name is Bill Benson. I am the host of the museum's public program, First Person. Thank you for joining us today. We are in our 18th year of First Person. Our First Person today is Mr. Harry Markowicz, whom you shall meet shortly. This 2017 season of First Person is made possible by the generosity of the Louis Franklin Smith Foundation, with additional funding from the Arlene and Daniel Fisher Foundation. We are g...",2017 July 06,No restrictions on access,No restrictions on use
4,RG-50.030.0272,"United States Holocaust Memorial Museum Interview with Peter Feigl August 23, 1995 RG-50.030*0272 PREFACE The following oral history testimony is the result of a videotaped interview with Peter Feigl, conducted by Neenah Ellis on August 23, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Ho...",1995 August 23,No restrictions on access,No restrictions on use
5,RG-50.030.0294,"United States Holocaust Memorial Museum Interview with Felix Horn July 19, 1994 RG-50.030*0294 PREFACE The following oral history testimony is the result of a videotaped interview with Felix Horn, conducted by Sandra Bradley on July 19, 1994 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Holoc...",1994 July 19,No restrictions on access,No restrictions on use


Research on text duplication and its effects on the LDA model (Schofield, Thompson, et al., 2017) suggests that the LDA model can generally handle text duplication well and duplicate texts need to cover a substantial part of the corpus before they begin to hinder the model's ability to come up with good topics. That being said, for demonstration purposes, we decided to find and remove duplicate texts in this tutorial despite the fact that they cover a rather small fraction of the corpus. An alternative would be to train the model without removing duplicate text and then if the model were to come up with a small number of topics dedicated to the repeated texts, we could safely ignore those topics (Schofield, Thompson, et al., 2017).

In [93]:
unrestricted_df[unrestricted_df.duplicated(subset='text')].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 168 to 2837
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          53 non-null     object
 1   text               53 non-null     object
 2   display_date       48 non-null     object
 3   conditions_access  53 non-null     object
 4   conditions_use     53 non-null     object
dtypes: object(5)
memory usage: 2.5+ KB


In [94]:
unrestricted_df.drop_duplicates(subset=['text'], keep='last', inplace=True)

In [95]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1950 entries, 1 to 2893
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1950 non-null   object
 1   text               1950 non-null   object
 2   display_date       1864 non-null   object
 3   conditions_access  1950 non-null   object
 4   conditions_use     1950 non-null   object
dtypes: object(5)
memory usage: 91.4+ KB
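To see what the two pandas calls above do, here is a sketch on a made-up DataFrame. `duplicated()` flags every repeat after the first occurrence, while `drop_duplicates(keep='last')` keeps the final occurrence of each repeated text. The number of rows removed is the same either way.

```python
import pandas as pd

toy = pd.DataFrame({
    "RG_number": ["A-1", "A-2", "A-3", "A-4"],
    "text": ["same transcript", "unique transcript", "same transcript", "another one"],
})

# Rows whose text has already appeared earlier in the DataFrame
dupes = toy[toy.duplicated(subset="text")]

# Keep the last occurrence of each distinct text
deduped = toy.drop_duplicates(subset=["text"], keep="last")

print(dupes["RG_number"].tolist())    # ['A-3']
print(deduped["RG_number"].tolist())  # ['A-2', 'A-3', 'A-4']
```

This version returns a new DataFrame instead of using `inplace=True`, which leaves the original untouched for comparison.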


As briefly mentioned at the beginning of this notebook, we noticed through trial and error that the model would come up with topics comprising German or Dutch words. This made us suspect that some non-English transcripts had slipped into our dataset. To test this assumption, we searched for transcripts containing the word "nicht". Interviewees in Holocaust-related oral testimonies often switch between languages, so most of the transcripts returned by this search were still predominantly in English. However, at least three of them turned out to be in other languages, and we removed them.
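A keyword search like this can be tried on a few made-up strings first. The sketch below is only an illustrative variation and not the check used in this notebook: it counts whole-word matches with the standard library's `re` module, so that transcripts can be ranked for manual review. The sample texts and identifiers are invented.

```python
import re

# Made-up transcripts: mostly English, fully English, and fully German
transcripts = {
    "A-1": "He told the guard, 'Ich verstehe nicht', and then switched back to English.",
    "A-2": "We lived in a small village before the war.",
    "A-3": "Ich habe es nicht gewusst und wir haben nicht gefragt.",
}

# \b restricts matches to the whole word, regardless of capitalisation
pattern = re.compile(r"\bnicht\b", flags=re.IGNORECASE)

hits = {rg: len(pattern.findall(text)) for rg, text in transcripts.items()}

# Transcripts with at least one match, most matches first
to_review = sorted((rg for rg, n in hits.items() if n > 0), key=lambda rg: -hits[rg])
print(hits)
print(to_review)
```

A single match does not mean a transcript is in German, as the first example shows, so the flagged transcripts still need to be read before anything is removed.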

In [15]:
unrestricted_df[unrestricted_df['text'].str.contains("nicht")]

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use
6,RG-50.030.0461,"United States Holocaust Memorial Museum Interview with Louise Segaar November 10, 2000 RG-50.030*0461 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following oral history testimony is the result of a taped interview with Louise Segaar, conducted by Katie Davis on November 10, 2000 on behalf of t...",2000 November 10,No restrictions on access,No restrictions on use
45,RG-50.030.0226,"United States Holocaust Memorial Museum Interview with Karl Stojka April 29, 1992 RG-50.030*0226 PREFACE The following oral history testimony is the result of a taped interview with Karl Stojka, conducted on April 29, 1992 on behalf of the United States Holocaust Memorial Museum. The interview is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Holocaust Memorial Museum. The reader should bear in mind ...",1992 April 29,No restrictions on access,No restrictions on use
51,RG-50.469.0002,"Herta Gelber 1 May 20, 1997 Interview with Herta Gelber May 20, 1997 Q: This is tape one of an United States Holocaust Memorial Museum Interview with Mrs. Herta Gelber conducted by Christian Kloesch on May 20, 1997 in Queens New York. Let us start the interview with the most basic facts. Could you please, tell me your name at birth, when and where you were born? A: My maiden name was Herta Gewing. I was born April the 8th 1920 in Leoben, Steiermark. Q: Could you tell me a little bit about yo...",1997 May 20,No restrictions on access,No restrictions on use
179,RG-50.462.0872,"Project Kaved Interview #3, July 24, 1997 Subject: Jeanette Rothschild J.F.: Today is the 24th of July 97. We’re interviewing Jeanette Rothschild, everybody knows Mrs. Rothschild as Aunt Jenny. Her maiden name is Fernbacher. She was born in the town of Großmannsdorf, Bavaria, on September 13th, 1898. So she will be celebrating very shortly her 99th birthday. I would like to thank Jenny very much for permitting me to interview her. Aunt Jenny, I would like you to tell us about your very earl...",1997 July 24,No restrictions on access,No restrictions on use
221,RG-50.030.0367,"United States Holocaust Memorial Museum Interview with Norman Belfer May 31, 1996 RG-50.030*0367 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following oral history testimony is the result of a videotaped interview with Norman Belfer, conducted by Joan Ringelheim on May 31, 1996 on behalf of the...",1996 May 31,No restrictions on access,No restrictions on use
240,RG-50.030.0753,"United States Holocaust Memorial Museum Interview with David Halivni June 13, 2014 RG-50.030*0753 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following interview is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the Unit...",2014 June 13,No restrictions on access,No restrictions on use
269,RG-50.030.0663,"United States Holocaust Memorial Museum Interview with Eric J. Hamberg August 8, 2012 RG 50.030*0663 http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. PREFACE The following interview is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the ...",2012 August 08,No restrictions on access,No restrictions on use
288,RG-50.030.0387,"United States Holocaust Memorial Museum Interview with Stefan Czyzewski April 8, 1998 RG-50.030*0387 PREFACE The following oral history testimony is the result of a videotaped interview with Stefan Czyzewski, conducted by Katie Davis on April 8, 1998 on behalf of the United States Holocaust Memorial Museum. The interview took place in Minneapolis, MN and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United Sta...",1998 April 08,No restrictions on access,No restrictions on use
326,RG-50.030.0334,"United States Holocaust Memorial Museum Interview with Gerda Schild Haas June 12, 1995 RG-50.030*0334 PREFACE The following oral history testimony is the result of a videotaped interview with Gerda Schild Haas, conducted by Joan Ringelheim on June 12, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the Unite...",1995 June 12,No restrictions on access,No restrictions on use
626,RG-50.030.0225,"United States Holocaust Memorial Museum Interview with Eva Rozencwajig Stock July 26, 1989 RG-50.030*0225 PREFACE The following oral history testimony is the result of a videotaped interview with Eva Rozencwajig Stock, conducted by Linda Kuzmack on July 26, 1989 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the...",1989 July 26,No restrictions on access,No restrictions on use


In [None]:
not_in_english = ["RG-90.143.0004", "RG-50.030.0488","RG-50.028.0037"]

In [None]:
unrestricted_df.drop(unrestricted_df[unrestricted_df['RG_number'].isin(not_in_english)].index, inplace=True)

Next, we want to see if there are any transcripts that are blank, i.e. that consist of only whitespace characters. We use a regular expression to match any transcript made up solely of whitespace characters and replace it with the value `NaN` using the numpy package (Harris et al., 2020), which makes such rows easier to detect and remove later on.

In [None]:
unrestricted_df['text'] = unrestricted_df['text'].replace(r'^\s+$', np.nan, regex=True)
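
To make the behaviour of this pattern concrete, here is a minimal sketch that applies the same idea to plain Python strings. The helper `blank_to_none` and the sample strings are hypothetical and not part of the notebook. Note that `\s+` requires at least one whitespace character, so a completely empty string is only caught by the `\s*` variant.

```python
import re

# Hypothetical helper: mirrors the idea of replacing whitespace-only text with a missing value.
def blank_to_none(texts, pattern=r"\s+"):
    """Return a copy of `texts` where strings fully matching `pattern` become None."""
    compiled = re.compile(pattern)
    return [None if compiled.fullmatch(t) else t for t in texts]

samples = ["An interview transcript.", "   \n\t", ""]

only_whitespace = blank_to_none(samples)               # same idea as r'^\s+$'
whitespace_or_empty = blank_to_none(samples, r"\s*")   # also catches the empty string

print(only_whitespace)
print(whitespace_or_empty)
```

If empty strings can occur in a corpus, the `\s*` variant is probably the safer choice.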

In [None]:
unrestricted_df[unrestricted_df['text'].isnull()]

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use
1095,RG-50.030.0411,,2001 February 21,No restrictions on access,No restrictions on use


In [None]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1947 entries, 1 to 2893
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1947 non-null   object
 1   text               1946 non-null   object
 2   display_date       1861 non-null   object
 3   conditions_access  1947 non-null   object
 4   conditions_use     1947 non-null   object
dtypes: object(5)
memory usage: 91.3+ KB


In [None]:
unrestricted_df.dropna(axis=0, subset=(['text']), inplace=True)

In [None]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1946 entries, 1 to 2893
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1946 non-null   object
 1   text               1946 non-null   object
 2   display_date       1860 non-null   object
 3   conditions_access  1946 non-null   object
 4   conditions_use     1946 non-null   object
dtypes: object(5)
memory usage: 91.2+ KB


Following the aforementioned steps, we were left with 1,946 unrestricted, non-null transcripts, which are (hopefully) mostly in English. We save this DataFrame into another `pickle` file so that we don't have to repeat this process from scratch next time we need access to this DataFrame.

In [None]:
unrestricted_df.to_pickle("unrestricted_df.pkl")

In [175]:
# Read the already saved DataFrame.
# This loads the file exported in the previous code cell, which was uploaded to Zenodo
df = pd.read_pickle("https://zenodo.org/record/6670234/files/unrestricted_df.pkl")

We load a list of default stopwords from the `nltk` package (Bird et al., 2009). Stopwords are the most common words that tend to appear in texts. These words are considered to be unhelpful when it comes to topic inference because they are present in almost every text regardless of its topic, thus being uninformative. In our case, we also want to extend this stopword list with custom words that are repeated within our corpus, which we know are uninformative. For example, there are some repeating headers that were automatically added to the majority of these transcripts and include words such as "verbatim", "transcript", "spoken", "word", "spelling", etc. There are other words such as "indecipherable" which are part of certain remarks made by the person who created the transcript but which are not related to the topic of the testimony.

Stopword removal should be approached with caution. According to prior studies (Schofield, Magnusson, et al., 2017), removing stopwords from the resulting topics after training the model is likely to lead to similar results as removing them beforehand. Thus, stopword removal might not be as beneficial in topic inference as it is thought to be. Avoiding this step might save researchers time and mitigate the chances of biased stopword selection (Schofield, Magnusson, et al., 2017). If one decides to remove stopwords, it is suggested to remove only the most "obvious" (Schofield, Magnusson, et al., 2017) and highly frequent words.
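
As a minimal sketch of the alternative mentioned above, removing stopwords from the topics after training rather than from the texts beforehand, the following snippet filters a topic's top-word list. The topics, weights, stopword set, and the helper `drop_stopwords` are all hypothetical and only illustrate the idea.

```python
# Hypothetical top words for two topics, as (word, weight) pairs; not taken from a trained model.
topics = {
    0: [("the", 0.050), ("ghetto", 0.024), ("and", 0.020), ("bread", 0.009)],
    1: [("of", 0.040), ("school", 0.012), ("the", 0.011), ("teacher", 0.005)],
}

# A deliberately small list of "obvious" stopwords (an assumption for this sketch).
obvious_stopwords = {"the", "and", "of"}

def drop_stopwords(topic_words, stopword_set):
    """Remove stopwords from a topic's top-word list, keeping the original order."""
    return [(word, weight) for word, weight in topic_words if word not in stopword_set]

cleaned_topics = {tid: drop_stopwords(words, obvious_stopwords) for tid, words in topics.items()}

for tid, words in cleaned_topics.items():
    print(tid, [word for word, _ in words])
```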

In [97]:
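# Note: this rebinds the name `stopwords` to a plain list, so re-running this cell
# fails unless `stopwords` is imported again first.
stopwords = stopwords.words("english")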

In [98]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [99]:
stopwords.extend(['ushmm','archive', 'archives', 'preface','captioning', 'interview','interviewer', 'rg', 'yeah', 'didnt', "well", 'archival', 'transcript', 'put', 'oral', 
                  'indecipherable', 'thing', 'recording', 'source','like','dont','one','uh','go','got', 'know', 'people', 'go', 'would',
                  'us', 'said','went','came','q','a','tape','question','answer', 'information','catalog', 'collection', 'word', 
                  'spelling', 'accuracy', 'collection', 'testimony', 'reader', 'prose','error', 'result','verbatim','primary','record'])

In [100]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Another popular technique of text preprocessing for topic modelling is to lemmatise or stem the words in the corpus. Lemmatisation is the process of replacing the word in the corpus with its lemma, meaning the base form of the word as it would appear in the dictionary (Manning et al., 2009). Stemming is the process of removing the suffixes of words and keeping only the part of the word that precedes them (Manning et al., 2009).
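
The difference between the two techniques can be illustrated with a toy sketch. It does not use a real stemmer or lemmatiser: the suffix list and the lemma lookup table below are hypothetical stand-ins, chosen only to show that stemming cuts off endings mechanically, whereas lemmatisation maps a word to its dictionary form.

```python
# Toy illustration only: real projects would use a library such as spaCy or NLTK.
# Both the suffix list and the lemma lookup table are hypothetical.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMA_LOOKUP = {"families": "family", "children": "child", "went": "go", "studies": "study"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a tiny dictionary; fall back to the word itself."""
    return LEMMA_LOOKUP.get(word, word)

for word in ["families", "children", "went", "studies"]:
    print(word, "->", toy_stem(word), "|", toy_lemma(word))
```

In this sketch, stemming turns "families" into the non-word "famil" and leaves "went" unchanged, while the lookup returns "family" and "go".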

When it comes to inferring topic models, previous studies suggest that lemmatisation improves topic coherence (Martin & Johnson, 2015; Lau et al., 2014). Martin and Johnson (2015) maintain that keeping only the nouns of the lemmatised texts improves topic coherence and speed. However, other studies suggest that applying stemming and lemmatisation during text preprocessing offers little benefit and, in some cases, might even harm the topic model (Schofield & Mimno, 2016).

In this tutorial, we will demonstrate how to lemmatise the texts and keep only certain parts of speech (in this case the nouns), using spaCy (Honnibal et al., 2020), a popular Natural Language Processing (NLP) Python library. However, we encourage readers to consider a lighter text preprocessing approach before resorting to such techniques.

In [9]:
# Load the English pipeline, excluding the components we do not need
nlp = spacy.load("en_core_web_sm", exclude=["ner", "parser"])
nlp.max_length = 3000000  # Allow long transcripts (the default limit is 1,000,000 characters)

# This function is inspired by Mattingly's (2021, 2022) tutorials on topic modelling

def lemmatisation(text, allowed_postags=["NOUN"]):
    """
    Lemmatise a document and keep only selected parts of speech.

    The text is parsed with the spaCy pipeline and tokenised. A token is kept
    only if its part of speech is in allowed_postags (nouns by default), it is
    not in the stopword list, it is not punctuation, and it consists solely of
    alphabetic characters. The lemmas of the kept tokens are returned as a list.
    """
    doc = nlp(text, disable=['ner', 'parser'])
    lemmas = []
    for token in doc:
        if token.pos_ in allowed_postags and token.lower_ not in stopwords and not token.is_punct and token.is_alpha:
            lemmas.append(token.lemma_)
    return lemmas



We apply the lemmatisation function to the text of each transcript in the DataFrame.

#### Warning: This process takes a long time to run. The output is provided in `unrestricted_lemmatized_df.pkl`, which you can load in a code cell below.

In [None]:
unrestricted_df['lemmas'] = unrestricted_df['text'].apply(lemmatisation)

In [None]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1946 entries, 1 to 2893
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1946 non-null   object
 1   text               1946 non-null   object
 2   display_date       1860 non-null   object
 3   conditions_access  1946 non-null   object
 4   conditions_use     1946 non-null   object
 5   lemmas             1946 non-null   object
dtypes: object(6)
memory usage: 106.4+ KB


In [None]:
unrestricted_df.head()

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use,lemmas
1,RG-50.155.0004,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1993 January 26,No restrictions on access,No restrictions on use,[]
3,RG-50.999.0617,"1 Good morning, and welcome to the United States Holocaust Memorial Museum. My name is Bill Benson. I am the host of the museum's public program, First Person. Thank you for joining us today. We are in our 18th year of First Person. Our First Person today is Mr. Harry Markowicz, whom you shall meet shortly. This 2017 season of First Person is made possible by the generosity of the Louis Franklin Smith Foundation, with additional funding from the Arlene and Daniel Fisher Foundation. We are g...",2017 July 06,No restrictions on access,No restrictions on use,"[morning, name, host, museum, program, today, year, today, season, generosity, funding, sponsorship, series, conversation, survivor, person, account, experience, guest, volunteer, museum, museum, website, guest, account, experience, survivor, minute, time, opportunity, question, life, story, survivor, decade, individual, account, slide, presentation, introduction, photograph, sibling, parent, photo, mother, family, mother, left, back, row, braid, other, picture, aunt, uncle, grandmother, mid..."
4,RG-50.030.0272,"United States Holocaust Memorial Museum Interview with Peter Feigl August 23, 1995 RG-50.030*0272 PREFACE The following oral history testimony is the result of a videotaped interview with Peter Feigl, conducted by Neenah Ellis on August 23, 1995 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Ho...",1995 August 23,No restrictions on access,No restrictions on use,"[history, behalf, place, part, testimony, right, mind, error, name, name, father, name, year, name, parent, father, citizen, world, war, engineering, engineer, time, company, office, number, country, mother, today, part, father, year, date, convention, law, citizen, moment, father, citizenship, father, parent, circumstance, father, movie, actress, time, period, parent, matter, father, engineer, family, father, mother, word, grandmother, widow, year, child, girl, girl, engineer, time, time, a..."
5,RG-50.030.0294,"United States Holocaust Memorial Museum Interview with Felix Horn July 19, 1994 RG-50.030*0294 PREFACE The following oral history testimony is the result of a videotaped interview with Felix Horn, conducted by Sandra Bradley on July 19, 1994 on behalf of the United States Holocaust Memorial Museum. The interview took place in Washington, DC and is part of the United States Holocaust Memorial Museum's collection of oral testimonies. Rights to the interview are held by the United States Holoc...",1994 July 19,No restrictions on access,No restrictions on use,"[history, behalf, place, part, testimony, right, mind, error, name, name, beginning, slate, bit, childhood, background, war, son, class, family, birth, parent, mother, time, mom, path, career, dad, locksmith, son, child, father, responsibility, care, family, sister, brother, work, family, school, mom, school, locksmith, living, year, need, education, night, school, architect, class, family, childhood, sister, year, apartment, house, grandparent, family, cousin, day, cousin, win, parent, moth..."
6,RG-50.470.0015,"WILLIAM MCWORKMAN: Perhaps I should start by saying that I was in the 12th armored division--one of several armored divisions in the 3rd and 7th Army who drove south toward Austria. Our original mission as Munich, but and OSS agent who we had sent forward earlier in the week, we found from the Munich burgermeister or mayor that they were going to give up the city without a fight, so our mission was changed so we headed more to the south and the east on the west side of Amersee down through--...",1995 February 21,No restrictions on access,No restrictions on use,"[division, division, mission, agent, week, burgermeister, mayor, city, fight, mission, south, side, objective, objective, war, point, thing, side, side, mission, column, armor, mile, division, artillery, commander, town, camp, river, camp, village, kilometer, north, camp, number, group, division, artillery, member, division, impression, movement, area, horror, body, part, camp, pile, shack, prisoner, door, prisoner, fire, mass, body, stench, impression, blueness, skin, skin, hue, chicken, bo..."


Next, we want to investigate whether there are any transcripts for which no lemma was output, resulting in an empty lemmas column value. We remove these rows from our DataFrame.

In [None]:
unrestricted_df[unrestricted_df['lemmas'].str.len()==0]

Unnamed: 0,RG_number,text,display_date,conditions_access,conditions_use,lemmas
1,RG-50.155.0004,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1993 January 26,No restrictions on access,No restrictions on use,[]
38,RG-50.155.0002,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",approximately 1989 July 12,No restrictions on access,No restrictions on use,[]
147,RG-50.165.0109,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1989 April 28,No restrictions on access,No restrictions on use,[]
179,RG-50.447.0017,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1991 October 10,No restrictions on access,No restrictions on use,[]
206,RG-50.419.0005,"This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked f...",1988 February 24,No restrictions on access,No restrictions on use,[]
209,RG-50.010.0073,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1979 June 29,No restrictions on access,No restrictions on use,[]
210,RG-50.010.0033,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",1983 January 29,No restrictions on access,No restrictions on use,[]
218,RG-50.264.0001,"http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org Contact reference@ushmm.org for further information about this collection This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. http://collections.ushmm.org C...",,No restrictions on access,No restrictions on use,[]
241,RG-50.050.0009,"This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy.\n",,No restrictions on access,No restrictions on use,[]
324,RG-50.419.0004,"This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked for spelling or accuracy. This is a verbatim transcript of spoken word. It is not the primary source, and it has not been checked f...",1988 February 11,No restrictions on access,No restrictions on use,[]


In [None]:
unrestricted_df.drop(unrestricted_df[unrestricted_df['lemmas'].str.len()==0].index, inplace=True)

In [None]:
unrestricted_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1873 entries, 3 to 2893
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1873 non-null   object
 1   text               1873 non-null   object
 2   display_date       1794 non-null   object
 3   conditions_access  1873 non-null   object
 4   conditions_use     1873 non-null   object
 5   lemmas             1873 non-null   object
dtypes: object(6)
memory usage: 102.4+ KB


We save the unrestricted, deduplicated, lemmatised DataFrame in a pickle file for faster future imports. This is the last DataFrame that we will save in this way, and we suggest you load it if you want to experiment with different parameters of the LDA model.


In [28]:
unrestricted_df.to_pickle("unrestricted_lemmatized_df.pkl")

In [176]:
# Read already saved dataframe
df = pd.read_pickle("https://zenodo.org/record/6670234/files/unrestricted_lemmatized_df.pkl")

In [177]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1873 entries, 3 to 2893
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   RG_number          1873 non-null   object
 1   text               1873 non-null   object
 2   display_date       1794 non-null   object
 3   conditions_access  1873 non-null   object
 4   conditions_use     1873 non-null   object
 5   lemmas             1873 non-null   object
dtypes: object(6)
memory usage: 102.4+ KB


Having lemmatised the transcripts, the next step is to create a dictionary that maps each of the lemmas to a unique integer identifier using the `Dictionary` class of the `gensim` library (Řehůřek, 2022a). Then, we filter out words that appear in more than 60 per cent of the transcripts (Řehůřek, 2022b), which is thought to improve the quality of the topics. Because we do not set `no_below`, `filter_extremes` also applies its default value and drops words that appear in fewer than five transcripts.

In [112]:
dictionary = Dictionary(documents=df['lemmas'].to_list(), prune_at=None)
dictionary.filter_extremes(no_above=0.6, keep_n=None)  # Filter out words that appear too often
dictionary.compactify() # Assign new ids to words
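
To show what filtering by document frequency means, here is a minimal pure-Python sketch that does not use gensim. The documents and the helper `filter_by_document_frequency` are hypothetical; the sketch only mimics the `no_above` threshold and ignores the lower `no_below` threshold that `filter_extremes` also applies.

```python
from collections import Counter

# Hypothetical lemmatised documents (lists of tokens).
docs = [
    ["family", "school", "war"],
    ["family", "ghetto", "bread"],
    ["family", "school", "ship"],
    ["family", "camp", "war"],
    ["family", "camp", "bread"],
]

def filter_by_document_frequency(documents, no_above=0.6):
    """Keep tokens that occur in at most `no_above` (a fraction) of the documents."""
    # Count in how many documents each token occurs (not how many times overall).
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * len(documents)
    kept = sorted(token for token, df in doc_freq.items() if df <= max_docs)
    # Assign consecutive integer ids, similar in spirit to a token-to-id dictionary.
    return {token: idx for idx, token in enumerate(kept)}

token2id = filter_by_document_frequency(docs)
print(token2id)
```

Here "family" appears in every document and is therefore dropped, while the remaining tokens receive consecutive ids.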

In [113]:
temp = dictionary[0]  # This is only to "load" the dictionary
dictionary.id2token

{0: 'account',
 1: 'addition',
 2: 'address',
 3: 'adult',
 4: 'afternoon',
 5: 'age',
 6: 'ally',
 7: 'anniversary',
 8: 'apartment',
 9: 'applause',
 10: 'area',
 11: 'arm',
 12: 'arrest',
 13: 'arrow',
 14: 'assistance',
 15: 'atmosphere',
 16: 'attempt',
 17: 'audience',
 18: 'aunt',
 19: 'authority',
 20: 'back',
 21: 'ball',
 22: 'beach',
 23: 'beginning',
 24: 'behavior',
 25: 'belonging',
 26: 'bicycle',
 27: 'bike',
 28: 'birth',
 29: 'birthday',
 30: 'bit',
 31: 'bleach',
 32: 'block',
 33: 'book',
 34: 'border',
 35: 'boxcar',
 36: 'boy',
 37: 'braid',
 38: 'building',
 39: 'business',
 40: 'call',
 41: 'cannon',
 42: 'car',
 43: 'card',
 44: 'care',
 45: 'career',
 46: 'cart',
 47: 'case',
 48: 'cash',
 49: 'cavalry',
 50: 'chance',
 51: 'change',
 52: 'cheek',
 53: 'circumstance',
 54: 'citizen',
 55: 'city',
 56: 'civilian',
 57: 'class',
 58: 'coast',
 59: 'cocktail',
 60: 'concentration',
 61: 'concept',
 62: 'condition',
 63: 'contraption',
 64: 'conversation',
 65: 'c

Next, we need to create a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation of the words in each document. Each document will get transformed into a vector, where each feature will represent the number of times a specific word in the dictionary appears in the document (Řehůřek, 2022a). Converting the documents into vectors allows us to perform computations between them.

In [31]:
corpus = [dictionary.doc2bow(doc) for doc in df['lemmas'].to_list()]  # convert list of tokens to bag of word representation
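
The shape of a bag-of-words vector can be sketched in plain Python as follows. The mapping, the document, and the helper `to_bag_of_words` are hypothetical; the sketch only imitates the idea of turning a token list into sparse (id, count) pairs and is not gensim's implementation.

```python
from collections import Counter

# Hypothetical token-to-id mapping and document, for illustration only.
token2id = {"bread": 0, "camp": 1, "ghetto": 2, "school": 3}

def to_bag_of_words(tokens, mapping):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the mapping."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return sorted(counts.items())

document = ["ghetto", "bread", "family", "ghetto", "camp", "bread", "ghetto"]
bow = to_bag_of_words(document, token2id)
print(bow)
```

Words that are not in the dictionary ("family" in this sketch) are simply ignored, and word order is discarded.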

In [32]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 7646
Number of documents: 1873


It is now time to train the LDA model. First, we need to parameterise it. The most important parameter is our desired number of topics. We can experiment with different numbers. In this tutorial, we will first try three and then six topics. A list of all the parameters that can be tweaked and what they mean can be found in Gensim's [documentation pages](https://radimrehurek.com/gensim/models/ldamodel.html).

In [36]:
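# Set training parameters. (Řehůřek, 2022b)
num_topics = 3  # The number of topics
passes = 20  # The number of times the algorithm will go through the entire corpus
iterations = 400  # Upper limit on the inference iterations per document
chunksize = 50  # The number of documents processed in each training chunk
eval_every = None  # Don't evaluate model perplexity, takes too much time
random_state = 0  # This is used to make this process reproducible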

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model_3_topics = LdaModel(
    corpus=corpus,
    id2word=id2word,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    chunksize=chunksize,
    eval_every=eval_every,
    random_state=random_state
)
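
As a rough back-of-the-envelope check of what these settings imply, the sketch below counts how many chunks the corpus is split into and how many chunk-level updates the training performs in total. It assumes gensim's default `update_every=1` (one model update per chunk), which the notebook does not set explicitly; the corpus size is the 1,873 documents reported above.

```python
import math

num_documents = 1873  # size of the corpus reported earlier in the notebook
chunksize = 50        # documents processed per chunk
passes = 20           # full sweeps over the corpus

# Assumption: one model update per chunk (gensim's default update_every=1).
chunks_per_pass = math.ceil(num_documents / chunksize)
total_updates = chunks_per_pass * passes

print(chunks_per_pass, total_updates)
```

A smaller `chunksize` or a larger `passes` value means more updates and, in general, longer training.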

Once our model is trained, we can print the topics, their IDs, and their top words.

In [40]:
model_3_topics.print_topics(num_words=30)

[(0,
  '0.024*"ghetto" + 0.011*"girl" + 0.009*"morning" + 0.009*"bread" + 0.008*"soldier" + 0.008*"water" + 0.007*"guy" + 0.007*"street" + 0.007*"barrack" + 0.006*"city" + 0.006*"factory" + 0.006*"truck" + 0.006*"boy" + 0.005*"clothe" + 0.005*"piece" + 0.005*"bit" + 0.005*"wood" + 0.005*"door" + 0.005*"prisoner" + 0.005*"car" + 0.005*"hospital" + 0.005*"couple" + 0.005*"hand" + 0.005*"army" + 0.005*"head" + 0.005*"hour" + 0.005*"doctor" + 0.004*"husband" + 0.004*"front" + 0.004*"document"'),
 (1,
  '0.010*"course" + 0.008*"kid" + 0.008*"apartment" + 0.008*"husband" + 0.008*"point" + 0.008*"bit" + 0.007*"survivor" + 0.007*"today" + 0.007*"program" + 0.007*"book" + 0.007*"uncle" + 0.006*"business" + 0.006*"daughter" + 0.006*"fact" + 0.006*"cousin" + 0.006*"community" + 0.005*"aunt" + 0.005*"girl" + 0.005*"age" + 0.005*"law" + 0.005*"museum" + 0.005*"picture" + 0.005*"store" + 0.005*"teacher" + 0.005*"world" + 0.005*"ship" + 0.005*"paper" + 0.004*"memory" + 0.004*"boy" + 0.004*"idea"'),
 

Next, we get a list of the top words per topic, together with the average topic coherence score (Řehůřek, 2022b).

In [38]:
top_topics = model_3_topics.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -0.4192.
[([(0.010235569, 'course'),
   (0.008294283, 'kid'),
   (0.00810182, 'apartment'),
   (0.007842252, 'husband'),
   (0.0076915333, 'point'),
   (0.0076701264, 'bit'),
   (0.0074346303, 'survivor'),
   (0.007205605, 'today'),
   (0.0065850904, 'program'),
   (0.0065850313, 'book'),
   (0.006566549, 'uncle'),
   (0.0064366832, 'business'),
   (0.0061594904, 'daughter'),
   (0.0058665173, 'fact'),
   (0.005732997, 'cousin'),
   (0.0055864793, 'community'),
   (0.005269119, 'aunt'),
   (0.005225825, 'girl'),
   (0.005131903, 'age'),
   (0.0049128905, 'law')],
  -0.36755665126422304),
 ([(0.023521075, 'ghetto'),
   (0.010591484, 'girl'),
   (0.00856532, 'morning'),
   (0.008532367, 'bread'),
   (0.008260889, 'soldier'),
   (0.007680615, 'water'),
   (0.007058362, 'guy'),
   (0.0065634847, 'street'),
   (0.0065272273, 'barrack'),
   (0.006084359, 'city'),
   (0.0059131007, 'factory'),
   (0.00588439, 'truck'),
   (0.0057917554, 'boy'),
   (0.0054929475, 'clot
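
By default, `top_topics` scores topics with the UMass coherence measure, which rewards topics whose top words tend to appear in the same documents. On a real corpus the scores are typically negative, and values closer to zero indicate more coherent topics. The sketch below is a simplified, pure-Python illustration of that idea, not Gensim's implementation, which differs in details such as smoothing and aggregation. The function name and the three toy documents are made up for the example, and on such a tiny corpus the smoothing term can even push a score above zero.

```python
import math
from itertools import combinations

def umass_style_coherence(top_words, documents):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words.

    D(w) is the number of documents containing w, and D(w_i, w_j) the
    number containing both. Simplified illustration only.
    """
    doc_sets = [set(doc) for doc in documents]
    scores = []
    # combinations keeps list order, so w_j always ranks above w_i in top_words.
    for w_j, w_i in combinations(top_words, 2):
        docs_with_j = sum(1 for doc in doc_sets if w_j in doc)
        docs_with_both = sum(1 for doc in doc_sets if w_j in doc and w_i in doc)
        if docs_with_j == 0:
            continue  # Skip words that never occur in the corpus.
        scores.append(math.log((docs_with_both + 1) / docs_with_j))
    return sum(scores) / len(scores) if scores else 0.0

# Toy lemmatised documents (hypothetical, for illustration only).
toy_docs = [
    ["ghetto", "bread", "morning"],
    ["ghetto", "bread", "soldier"],
    ["army", "officer", "soldier"],
]

co_occurring = umass_style_coherence(["ghetto", "bread"], toy_docs)
never_together = umass_style_coherence(["ghetto", "army"], toy_docs)
print(co_occurring, never_together)
```

The pair of words that share documents scores higher than the pair that never appears together, which is the behaviour the average coherence figure summarises across all topics.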

Next, we visualise the topic model to observe and analyse the results. According to Sievert and Shirley (2014), the creators of LDAvis, setting the λ value to about 0.6 usually helps with interpreting a topic correctly. We highly recommend reading their paper to better understand what each component of this visualisation means.
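
The λ value controls the *relevance* ranking that Sievert and Shirley (2014) define as λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), where p(w|t) is the probability of a term within a topic and p(w) is its overall probability in the corpus. A minimal sketch of that formula follows. The probabilities are invented for illustration and are not taken from our models.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance of a term to a topic, following Sievert and Shirley (2014)."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical terms: (probability within the topic, probability in the whole corpus).
terms = {
    "course": (0.020, 0.020),  # Common everywhere, so its lift is 1.
    "ghetto": (0.015, 0.004),  # Slightly rarer in the topic, but far more specific to it.
}

rankings = {}
for lam in (1.0, 0.6):
    rankings[lam] = sorted(terms, key=lambda w: relevance(*terms[w], lam=lam), reverse=True)
    print(f"lambda={lam}: {rankings[lam]}")
```

With λ = 1, terms are ranked purely by their probability within the topic. Lowering λ favours terms that are distinctive for the topic, which is why the order of the two toy terms flips.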

In [39]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(model_3_topics, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'model_3_topics.html')
vis


We repeat the same process as in the previous cells, but this time we train a model with six topics.

In [63]:
from gensim.models import LdaModel

# Set training parameters.
num_topics = 6
chunksize = 50
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.
random_state = 0

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model_6_topics = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every,
    random_state=random_state
)

In [67]:
model_6_topics.print_topics(num_words=15)

[(0,
  '0.028*"program" + 0.021*"museum" + 0.019*"today" + 0.016*"mom" + 0.015*"document" + 0.013*"point" + 0.013*"survivor" + 0.012*"question" + 0.009*"course" + 0.008*"hiding" + 0.008*"right" + 0.008*"uncle" + 0.008*"photograph" + 0.007*"account" + 0.007*"page"'),
 (1,
  '0.013*"husband" + 0.012*"kid" + 0.011*"business" + 0.010*"book" + 0.010*"girl" + 0.010*"apartment" + 0.009*"daughter" + 0.009*"cousin" + 0.008*"uncle" + 0.008*"bit" + 0.008*"store" + 0.007*"boy" + 0.007*"picture" + 0.007*"age" + 0.006*"letter"'),
 (2,
  '0.035*"army" + 0.022*"hm" + 0.016*"officer" + 0.015*"guy" + 0.015*"soldier" + 0.013*"gun" + 0.012*"commander" + 0.012*"unit" + 0.010*"date" + 0.010*"tank" + 0.010*"order" + 0.010*"training" + 0.008*"division" + 0.008*"front" + 0.008*"company"'),
 (3,
  '0.026*"prisoner" + 0.011*"guard" + 0.011*"number" + 0.010*"concentration" + 0.010*"material" + 0.010*"body" + 0.009*"fact" + 0.009*"death" + 0.009*"area" + 0.007*"point" + 0.007*"officer" + 0.007*"picture" + 0.007*"b

In [65]:
top_topics = model_6_topics.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -0.5520.
[([(0.0134981675, 'husband'),
   (0.012279362, 'kid'),
   (0.010557862, 'business'),
   (0.009887477, 'book'),
   (0.009662605, 'girl'),
   (0.009569545, 'apartment'),
   (0.009253288, 'daughter'),
   (0.008691938, 'cousin'),
   (0.008139245, 'uncle'),
   (0.008045915, 'bit'),
   (0.007935302, 'store'),
   (0.0067197382, 'boy'),
   (0.006613084, 'picture'),
   (0.006523004, 'age'),
   (0.006437961, 'letter'),
   (0.006409668, 'aunt'),
   (0.0062828977, 'teacher'),
   (0.0061952183, 'class'),
   (0.006175173, 'survivor'),
   (0.0061427653, 'ship')],
  -0.43010212990966906),
 ([(0.038258873, 'ghetto'),
   (0.014282089, 'girl'),
   (0.013659356, 'bread'),
   (0.011419821, 'morning'),
   (0.011235567, 'water'),
   (0.009680779, 'soldier'),
   (0.009606886, 'city'),
   (0.0094234925, 'guy'),
   (0.009216708, 'street'),
   (0.008732312, 'factory'),
   (0.008283716, 'piece'),
   (0.008190177, 'wood'),
   (0.007613677, 'clothe'),
   (0.0074029677, 'boy'),
   (

In [66]:
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(model_6_topics, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(vis, 'model_6_topics.html') # Exports visualisation as a standalone html file
vis


Having trained our models, we might want to see which topics the model has assigned to a specific document. We can do this in the following way:

In [90]:
# Select the lemmatised tokens of a single interview by its RG number.
doc_lemmas = df.loc[df['RG_number'] == "RG-50.470.0015"]['lemmas'].values[0]
topics = model_6_topics.get_document_topics(dictionary.doc2bow(doc_lemmas))
topics.sort(key=lambda t: t[1], reverse=True)  # Sort the topics from highest to lowest probability.

print(topics)

[(3, 0.72978616), (2, 0.12371493), (5, 0.0975869), (4, 0.033792496), (1, 0.01229007)]
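
Note that the list above contains only five of the six topics: `get_document_topics` leaves out topics whose probability falls below the model's `minimum_probability` threshold (0.01 by default in Gensim). When a fixed-length vector is more convenient, for example to compare documents, the missing entries can be filled with zeros. A minimal sketch, using the output above as input; the helper name `to_dense` is our own:

```python
def to_dense(topic_pairs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, probability in topic_pairs:
        dense[topic_id] = probability
    return dense

# Output of get_document_topics for RG-50.470.0015, copied from above.
topics = [(3, 0.72978616), (2, 0.12371493), (5, 0.0975869), (4, 0.033792496), (1, 0.01229007)]

dense = to_dense(topics, num_topics=6)
# Index of the largest entry, i.e. the topic the model considers most prominent.
dominant_topic = max(range(len(dense)), key=dense.__getitem__)

print(dense)
print(f"Dominant topic: {dominant_topic}")
```

Alternatively, passing a very small `minimum_probability` to `get_document_topics` should make Gensim return nearly all topics itself.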


To check which words are associated with a topic, you can use the following code:

In [98]:
terms = model_6_topics.show_topic(3)

for term in terms:
    print(f"{term[0]} | probability: {term[1]}")

prisoner | probability: 0.025587216019630432
guard | probability: 0.010809059254825115
number | probability: 0.010611455887556076
concentration | probability: 0.010144725441932678
material | probability: 0.009822363033890724
body | probability: 0.009642868302762508
fact | probability: 0.008959434926509857
death | probability: 0.008893161080777645
area | probability: 0.008517954498529434
point | probability: 0.0074925413355231285


To check the probability of a word belonging to a topic:

In [132]:
term_topics = model_6_topics.get_term_topics(dictionary.token2id['ghetto'], minimum_probability=1e-16)

term_topics.sort(key=lambda t: t[1], reverse=True)

print(term_topics)

[(5, 0.038262364), (2, 2.01423e-08), (0, 1.2646704e-08), (3, 1.1083079e-08)]


Finally, you might want to save the topic model that you have trained to be able to load it and use it for future predictions. Here is how to do this:

In [117]:
# Save model to disk.
model_6_topics.save("model_6_topics")

In [20]:
# Load a potentially pretrained model from disk.
lda = LdaModel.load("model_6_topics")

In [21]:
lda.print_topics()

[(0,
  '0.028*"program" + 0.021*"museum" + 0.019*"today" + 0.016*"mom" + 0.015*"document" + 0.013*"point" + 0.013*"survivor" + 0.012*"question" + 0.009*"course" + 0.008*"hiding"'),
 (1,
  '0.013*"husband" + 0.012*"kid" + 0.011*"business" + 0.010*"book" + 0.010*"girl" + 0.010*"apartment" + 0.009*"daughter" + 0.009*"cousin" + 0.008*"uncle" + 0.008*"bit"'),
 (2,
  '0.035*"army" + 0.022*"hm" + 0.016*"officer" + 0.015*"guy" + 0.015*"soldier" + 0.013*"gun" + 0.012*"commander" + 0.012*"unit" + 0.010*"date" + 0.010*"tank"'),
 (3,
  '0.026*"prisoner" + 0.011*"guard" + 0.011*"number" + 0.010*"concentration" + 0.010*"material" + 0.010*"body" + 0.009*"fact" + 0.009*"death" + 0.009*"area" + 0.007*"point"'),
 (4,
  '0.023*"course" + 0.014*"community" + 0.010*"organization" + 0.010*"government" + 0.008*"idea" + 0.008*"point" + 0.008*"situation" + 0.007*"problem" + 0.007*"member" + 0.007*"city"'),
 (5,
  '0.038*"ghetto" + 0.014*"girl" + 0.014*"bread" + 0.011*"morning" + 0.011*"water" + 0.010*"soldier"

You can also make predictions on new documents. For example, let's create a random toy document (in its lemmatised form):

In [181]:
unseen_doc = ['ghetto', 'food', 'girl', 'guard', 'concentration', 'commander', 'greece']

In [190]:
predictions = lda[dictionary.doc2bow(unseen_doc)]
predictions.sort(key=lambda t: t[1], reverse=True)
print(predictions)

[(5, 0.29797548), (3, 0.21292675), (1, 0.20727375), (2, 0.12905647), (4, 0.09619402), (0, 0.05657351)]
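
One caveat when scoring new documents: `doc2bow` silently ignores tokens that are not in the dictionary, so a document made up mostly of unseen words is judged on very little evidence. The sketch below checks vocabulary coverage before predicting. It uses a small stand-in mapping so that it runs on its own; in the notebook you would pass `dictionary.token2id` instead. The helper name and the stand-in vocabulary are hypothetical, and which tokens are actually missing depends on your own dictionary.

```python
def split_by_vocabulary(tokens, token2id):
    """Separate tokens found in the vocabulary from those that are not."""
    known = [t for t in tokens if t in token2id]
    unknown = [t for t in tokens if t not in token2id]
    return known, unknown

# Stand-in for dictionary.token2id (hypothetical IDs, for illustration only).
toy_token2id = {"ghetto": 0, "girl": 1, "guard": 2, "concentration": 3, "commander": 4}

unseen_doc = ['ghetto', 'food', 'girl', 'guard', 'concentration', 'commander', 'greece']
known, unknown = split_by_vocabulary(unseen_doc, toy_token2id)

print(f"In vocabulary: {known}")
print(f"Would be ignored by doc2bow: {unknown}")
```

If a large share of the tokens ends up in the second list, the predicted topic distribution should be read with caution.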


## References:  

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit.  O’Reilly Media, Inc.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Harris, C. R., Millman, K. J., Walt, S. J. van der, Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., Kerkwijk, M. H. van, Brett, M., Haldane, A., Río, J. F. del, Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303

Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, 530–539. https://doi.org/10.3115/v1/E14-1056

Martin, F., & Johnson, M. (2015). More Efficient Topic Modelling Through a Noun Only Approach. Proceedings of the Australasian Language Technology Association Workshop 2015, 111–115. https://aclanthology.org/U15-1013

Mattingly, W. J. B. (2021, February 23). What is Latent Dirichlet Allocation LDA (Topic Modeling for Digital Humanities 03.01). https://www.youtube.com/watch?v=o7OqhzMcDfs

Mattingly, W. J. B. (2022). Implementing LDA in Python—Introduction to Python for Humanists. In Introduction to Python for Digital Humanities. https://python-textbook.pythonhumanities.com/04_topic_modeling/03_03_lda_model_demo.html

Reback, J., McKinney, W., jbrockmendel, Van den Bossche, J., Augspurger, T., Cloud, P., gfyoung, Sinhrks, Klein, A., Roeschke, M., Hawkins, S., Tratner, J., She, C., Ayd, W., Petersen, T., Garcia, M., Schendel, J., Hayden, A., MomIsBestFriend, … Mortada Mehyar. (2020). pandas-dev/pandas: Pandas 1.0.3. Zenodo. https://doi.org/10.5281/zenodo.3715232

Řehůřek, R. (n.d.-a). Corpora and Vector Spaces. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html

Řehůřek, R. (n.d.-b). LDA Model. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py

Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.

Richardson, L. (n.d.). Beautiful soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Schofield, A., Magnusson, M., & Mimno, D. (2017). Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 432–436. https://aclanthology.org/E17-2069

Schofield, A., & Mimno, D. (2016). Comparing Apples to Apple: The Effects of Stemmers on Topic Models. Transactions of the Association for Computational Linguistics, 4, 287–300. https://doi.org/10.1162/tacl_a_00099

Schofield, A., Thompson, L., & Mimno, D. (2017). Quantifying the Effects of Text Duplication on Semantic Models. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2737–2747. https://doi.org/10.18653/v1/D17-1290

Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. https://doi.org/10.3115/v1/W14-3110

United States Holocaust Memorial Museum. (2007). Oral History Interview Guidelines. https://www.ushmm.org/m/pdfs/20121003-oral-history-interview-guide.pdf

## Further Reading/Resources

#### Tutorials:
Mattingly, W. J. B. (2021, February 23). What is Latent Dirichlet Allocation LDA (Topic Modeling for Digital Humanities 03.01). https://www.youtube.com/watch?v=o7OqhzMcDfs

Mattingly, W. J. B. (2022). Implementing LDA in Python—Introduction to Python for Humanists. In Introduction to Python for Digital Humanities. https://python-textbook.pythonhumanities.com/04_topic_modeling/03_03_lda_model_demo.html

Řehůřek, R. (n.d.-a). Corpora and Vector Spaces. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html

Řehůřek, R. (n.d.-b). LDA Model. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py

#### Other readings:
Schmidt, B. M. (2012). Words alone: Dismantling topic models in the humanities. Journal of Digital Humanities, 2(1), 49–65.

Schofield, A. K. (2019). Text Processing for the Effective Application of Latent Dirichlet Allocation. Retrieved 17 June 2022, from https://ecommons.cornell.edu/handle/1813/67305

Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. https://doi.org/10.3115/v1/W14-3110
