# BERTopic and Parnell

## Google Colab

File paths on Google Drive
- Navigate to 'My Drive' then
- Speech report files in this directory location: 'parnell_files/sources'
- Speech register file in this directory location: 'parnell_files/speeches'

In Google Colab
- Before running anything go to 'Edit' in the toolbar at the top of the screen
- Click on 'Notebook settings' in the dropdown menu
- Change 'Hardware accelerator' to 'T4 GPU'
- Save changes

You should now be ready to run this notebook in Google Colab!

## Research Question

In this session we will explore the use of topic modelling in relation to Parnell's speeches. We will look for the main topics which appear in the speech reports and also explore filtering the corpus by date range, place, publication and keywords, before performing topic modelling on the resulting dataset. The results will be visualised in a series of graphs showing different aspects of the topics.

Having extracted some topics from the speech reports, we will be able to return to the tools used in the network analysis session to gain some additional context for our overall analysis.

## Topic Modelling

Topic models enable us to identify the latent topics within a collection of texts. They capture the underlying themes across a corpus by clustering together similar texts based on their content, essentially using the co-occurrence of words or semantic content within documents as the basis of this process.

We will be using BERTopic, a highly effective topic modelling tool that uses BERT (Bidirectional Encoder Representations from Transformers) as its basis. Transformer models enable word meanings to be recognised in context, thereby allowing for greater accuracy in terms of identifying the semantic associations which make up topics.

## Interactivity

All cells are alterable but some cells that are easy to work with interactively are highlighted at the end of the notebook. These are highlighted with the phrase 'Alterable Cells' above them in blue and have instructions

# Python Code

In [None]:
!pip install bertopic

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/MyDrive/parnell_files

Import additional Python Libraries.

In [291]:
#libraries for working with files
import glob
import os
from pathlib import Path
from natsort import natsorted
from natsort import os_sorted
#libraries for data analysis and manipulation
import pandas as pd
import string
import re
from bs4 import BeautifulSoup
from datetime import datetime
#nlp libraries
import spacy
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from gensim.utils import simple_preprocess
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from bertopic import BERTopic
#libraries for visualisations
import plotly.express as px

#optimise notebook and spacy settings
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Functions

In this notebook we will be using more functions than in previous weeks. Essentially functions divide the code up into smaller units, with each unit having a specific task. A key advantage of functions is that they can be "called" repeatedly with different inputs, so you don't have to write out a very similar piece of code repeatedly. They also keep everything well-organized and enable modularity, so that independent parts of the code can be checked and maintained separately. 

Functions are usually kept at the top of the code, after importing libraries and creating global settings, and are called during the execution of the code below. Explanations for what each function does are kept in a "docstring", in quotation marks after the "def" statement which begins each function.

### Data Extraction Functions

Functions which perform the various aspects of getting from a list of file paths through to extracting specific parts of each file.

In [292]:
def soup_objects(file_paths):
    '''Takes either a list of file paths or a file path.
    Returns a list of beautiful soup objects or single
    beautiful soup object, depending on the input.
    '''
    if type(file_paths) == list:
        soup_list = []
        for path in file_paths:
            with path.open("r", encoding="utf-8") as xml:
                source = BeautifulSoup(xml, "lxml-xml")
                soup_list.append(source)
        return soup_list
    else:
        with file_paths.open("r", encoding="utf-8") as xml:
            soup_object = BeautifulSoup(xml, "lxml-xml")
        return soup_object

In [293]:
def tei_extractor(soup_obj, element, attributes=False):
    '''Takes Beautiful soup object or list of objects,
    element using element name and, where necessary, attributes.
    Returns list of elements for all input files or a list of
    elements for input file, depending on input.
    '''
    attrib_dict ={}
    if attributes:
        attrib_dict = {attr: True for attr in attributes}

    if type(soup_obj) == list:
        elem_ls = [obj.find(element, attrib_dict) for obj in soup_obj]
        return elem_ls
    else:
        elem_ls = soup_obj.find_all(element, attrib_dict)
        return elem_ls

In [294]:
def tei_values(object_list, attribute=False):
    '''Takes a list of beautiful soup elements, if attribute
    value is being extracted include name of that attribute.
    Return element or attribute value depending on input(s)
    '''
    if attribute:
        values = [obj[attribute] for obj in object_list]
        return values
    else:
        values = [obj.get_text() for obj in object_list]
        return values

### Data Cleaning and Dataframe Functions

Functions to perform text cleaning, remove stopwords, convert results into dataframe format and clean dataframe format data.

In [295]:
def text_cleaning(text):
    '''Takes as input a string, removes/replaces special characters, newlines,
    possessive apostrophes, hyphens, underscores, digits and makes single space.
    Keeps punctuation in place.
    Returns clean string.
    '''
    text = text.replace(u"\xa0", u" ").replace("&", "and").replace("|", " ")
    text = text.replace("\n", " ").replace("’", "'").replace("'s ", ' ')
    text = text.replace("-", " "). replace("–", " ").replace("_", " ").replace("—", " ")
    non_digit_text = re.sub(r"\b\d+\b", "", text)
    sing_space_text = re.sub(r"\s\s+", " ", non_digit_text)
    sing_space_text = sing_space_text.strip()
    return sing_space_text

In [296]:
def punct_removal(text):
    '''Takes as input a string and removes punctuation, removes extra spacing.
    Returns string without punctuation.
    '''
    text = text.translate(str.maketrans(" ", " ", string.punctuation))
    text = re.sub(r"\s\s+", " ", text)
    text = text.strip()
    return text

In [297]:
def remove_stopwords(text, stopwords):
    ''' Take as input a string and list of stopwords, tokenizes
    string and removes words contained in stopwords.
    Returns re-joined string without stopwords.
    '''
    tokenized_text = text.split()
    non_stop_text = [token for token in tokenized_text if token not in stopwords]
    return ' '.join(non_stop_text)

In [298]:
def create_dataframe(data, columns):
    '''Takes as input a list of lists of data and a list of columns.
    Returns a dataframe.
    '''
    df = pd.DataFrame(data)
    df = df.transpose()
    df.columns = columns
    return df

In [299]:
def dataframe_cleaning(dataframe, clean_column=None):
    '''Takes as input a dataframe and makes lowercase, strips leading and
    trailing spaces, standardises apostrophes. Applies text_cleaning function
    to column if identified as clean column parameter.
    Returns lowercase/cleaned dataframe.
    '''
    lower_dataframe = dataframe.applymap(lambda x: x.lower())
    lower_dataframe = lower_dataframe.applymap(lambda x: x.replace("’", "'"))
    if clean_column:
        lower_dataframe[clean_column] = lower_dataframe[clean_column].apply(lambda x: text_cleaning(x))
    clean_dataframe = lower_dataframe.applymap(lambda x: x.strip())
    return clean_dataframe

### Dataframe Filtering

Functions to filter dataframe by date range and keywords.

In [300]:
def dataframe_date_window(dataframe, start_date, end_date):
    '''Takes as input dataframe, start date and end date.
    Returns dataframe with just the rows where the date is on/between
    the start date and end date.
    '''
    mask = (dataframe['date'] > start_date) & (dataframe['date'] <= end_date)
    win_dataframe = dataframe.loc[mask]
    return win_dataframe

In [301]:
def dataframe_keyword_any(dataframe, column, keywords):
    '''Takes as input dataframe, column name and a list of keywords.
    Returns dataframe with just rows where column contains any
    of the words in keywords as substring.
    '''
    keyword_df = dataframe[dataframe[column].apply
                           (lambda cell: any(re.search(word, cell) for word in keywords))]
    return keyword_df

In [302]:
def dataframe_keyword_all(dataframe, column, keywords):
    '''Takes as input dataframe, column name and a list of keywords.
    Returns dataframe with just rows where column contains all
    of the words in keywords as substring.
    '''
    keyword_df = dataframe[dataframe[column].apply
                           (lambda cell: all(re.search(word, cell) for word in keywords))]
    return keyword_df

### Topic Modelling, Sentence Counts and Visualisation

Functions to perform topic modelling and counts of sentences per year and return visualisations.

In [303]:
def bertopic_topics(dataframe, topic_model):
    '''Takes as input a dataframe and Bertopic topic model tool.
    Extracts dates and sentences for each dataframe row as lists.
    Fits sentence list to topic model, creates dictionary of topics/sentences.
    Returns topics, topic/sentence dictionary and topic model.
    '''
    sent_list = dataframe["sentence"].to_list()
    topics, probs = topic_model.fit_transform(sent_list)

    topic_docs = {topic: [] for topic in set(topics)}
    for topic, doc in zip(topics, sent_list):
        topic_docs[topic].append(doc)
    return (topics, topic_docs, topic_model)

In [304]:
def bertopic_time(dataframe, topic_model):
    '''Takes as input a dataframe and Bertopic topic model.
    Extracts dates and sentences for each dataframe row as lists.
    Sends sentence list/dates to topics_over_time provided by BERTopic 
    to create visualisation.
    Returns visualisation'''
    sent_list = dataframe["sentence"].to_list()
    date_list = dataframe["date"].to_list()
    #parameters can be adjusted for visualisation
    topics_over_time = topic_model.topics_over_time(docs=sent_list,
                                                timestamps=date_list,
                                                nr_bins=30
                                              )
    fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=15, height=500, width=1000)
    fig.update_layout(yaxis_title = "Count")
    return (fig)

In [305]:
def year_counts(dataframe):
    '''Takes as input a dataframe.
    Returns a series with number of sentences per year.
    '''
    dataframe["year"] = dataframe["date"].dt.year
    year_counts = dataframe.groupby(["year"]).size()
    return year_counts

In [306]:
def year_counts_plot(year_counts):
    '''Takes as input a year count series.
    Returns a plotly bar chart of number of sentences per year.
    '''
    year_counts_plot = px.bar(year_counts,
                          title="Sentence Count by Year",
                          labels= {
                              "value":"Count",
                              "year":"Year"
                          }
                         )
    year_counts_plot.update_layout(showlegend=False)
    return year_counts_plot

## Data Extraction

We begin by extracting the data we need from the speech report files and the speech register file.

In [307]:
# get paths for speech reports
dir_path = Path('sources/')
#get filepaths to all the XML files in the directory
xml_files = (file for file in dir_path.iterdir() if file.is_file() and file.name.lower().endswith('.xml'))
#sort xml file paths numerically using os_sorted library
xml_files = os_sorted(xml_files)

#get path for speech register file
speech_file = Path('speeches/parnell_speeches.xml')

#basename returns filename removing directory path.
#split to remove ".xml" extension so that we can use later in dataframe as id
filenames = []
for path in xml_files:
    filename = os.path.basename(path)
    filename = filename.split(".")[0]
    filenames.append(filename)

In [308]:
#extract speech reports and speech register as beautiful soup objects using function
report_objects = soup_objects(xml_files)
speech_object = soup_objects(speech_file)

In [309]:
#speech register beautiful soup objects extracted as lists using function
#speech id, speech place and data
speech_objs = tei_extractor(speech_object, element='speech_id')
place_objs = tei_extractor(speech_object, element='place', attributes=['key'])
date_objs = tei_extractor(speech_object, element='date', attributes=['when'])

#speech reports beautiful soup objects extracted as lists using function
#speech id, publication name, text
speech_rep_objs = tei_extractor(report_objects, element='term', attributes=['key'])
publication_objs = tei_extractor(report_objects, element='title', attributes=['key', 'level'])
text_objs = tei_extractor(report_objects, element='body')

In [310]:
#use function to extract values from tei elements extracted above
speech_ids = tei_values(speech_objs)
speech_places = tei_values(place_objs)
speech_dates = tei_values(date_objs, attribute='when')

speech_rep_ids = tei_values(speech_rep_objs, attribute='key')
publications = tei_values(publication_objs)
texts = tei_values(text_objs)

#for speech ids, if more than one id present, shown by inclusion of comma, convert into a list
#some reports refer to more than one speech, so we need to capture them all 
speech_rep_ids = [item.split(',') if ',' in item else item for item in speech_rep_ids]

## Data Preparation

The next stage is to convert the speech data into a dataframe format where we can easily manipulate it and get different subsets prior to analysis/visualisation.

### Speech Report Dataframe

We begin by converting the speech report data into a dataframe and performing some data cleaning on the text to standardise it and make it more suitable for data analysis. This includes making the text lower case, removing extra spacing and some special characters.

In [311]:
#prepare data for speech report dataframe, make list of lists of data and list of column names
speech_rep_data = [filenames, speech_rep_ids, publications, texts]
speech_rep_columns = ['filename', 'speech_id', 'publication', 'sentence']
#use function to turn the above lists into dataframe
speech_rep_df = create_dataframe(speech_rep_data, speech_rep_columns)
#if there is more than one speech id for a report, the report data will appear as row for each id
speech_rep_df = speech_rep_df.explode('speech_id')
#use function to clean the text in the dataframe and standardise it
speech_rep_df = dataframe_cleaning(speech_rep_df, clean_column='sentence')
speech_rep_df

Unnamed: 0,filename,speech_id,publication,sentence
0,parnell_source_00001,speech_00001,the nation,the home rule league great meeting in the rotu...
1,parnell_source_00002,speech_00001,the freeman's journal,the home rule league on saturday evening a pub...
2,parnell_source_00003,speech_00001,the nation,"the week ""though beaten, we are not vanquished..."
3,parnell_source_00004,speech_00001,the irish times,irish home rule league a public meeting of the...
4,parnell_source_00005,speech_00002,the freeman's journal,"mr. charles stewart parnell, in seconding the ..."
...,...,...,...,...
656,parnell_source_00660,speech_00396,the freeman's journal,"fellow countrymen and fellow citizens, it is n..."
657,parnell_source_00661,speech_00397,the freeman's journal,"mr. chairman, fellow citizens, and people of t..."
658,parnell_source_00662,speech_00398,the freeman's journal,"people of mallow, i certainly did not expect t..."
659,parnell_source_00663,speech_00399,the freeman's journal,"people of dungarvan, i will, through you, expr..."


### Sentence Tokenization

We then use the sentence tokenizer to divide the text in each row into sentences before amending the dataframe so that each sentence has its own row with the appropriate report data for that sentence. The sentence will be the unit we use for our topic modelling process.

Having done this we remove punctuation from all sentences in the sentence column and remove rows where the sentence is less than 3 words as these tend not to be proper sentences (crowd reactions etc).

In [312]:
#initialize nltk abbreviation words, these will be added to the sentence tokenizer
#they will prevent the tokenizer from reading some full stops as sentence-enders
punkt_param = PunktParameters()
#we can add our own abbreviation words, e.g. "hon." and "mr." frequently have full stops in the reports
punkt_param.abbrev_types = set(['hon', 'mr', 'rev', 'dr', 'm.p', 'c.s', 'c.v', 'c.e', 't.l', 'j.r', 'j.j', 'a.j',
                            'r.b', 'j.g', 'j.l', 'j.r', 'j.f', 'n.b', 'p.j', 'c.j', 't.d', 'r', 'p.p', 'l.p', 'c.c', 'wm',
                            'capt', 'messrs', 'patk', '1d', '2d', '3d', '4d', '5d', '6d', '7d', '8d', '9d', '10d', '11d',
                            '1/2d', '3/4d', 'prof', 'per cent', 'adm', '2s', '1,400,000/', '400,000/'])

#initialize nltk sentence detector for dividing text into sentences
sentence_tokenizer = PunktSentenceTokenizer(punkt_param)

#apply sentence tokenizer to each text in the dataframe to convert into a list of sentences
speech_rep_df['sentence'] = speech_rep_df['sentence'].apply(lambda x: sentence_tokenizer.tokenize(x))
#use explode on the sentence column, so that each sentence is converted into its own row
speech_rep_df = speech_rep_df.explode('sentence')
#now that the text has been divided into sentences, we can remove punctuation using function
speech_rep_df['sentence'] = speech_rep_df['sentence'].apply(lambda x: punct_removal(x))
#we then remove rows with very short sentences from our dataframe, likely to be crowd reactions etc
speech_rep_df = speech_rep_df[speech_rep_df['sentence'].apply(lambda x: len(x.split()) > 3)]
speech_rep_df

Unnamed: 0,filename,speech_id,publication,sentence
0,parnell_source_00001,speech_00001,the nation,the home rule league great meeting in the rotu...
0,parnell_source_00001,speech_00001,the nation,there was an immense attendance the platform t...
0,parnell_source_00001,speech_00001,the nation,mr charles stewart parnell high sheriff of wic...
0,parnell_source_00001,speech_00001,the nation,the following report of the proceedings is tak...
0,parnell_source_00001,speech_00001,the nation,in view of the unwise course adopted by our op...
...,...,...,...,...
660,parnell_source_00664,speech_00400,the freeman's journal,the application is a perfectly disgraceful one...
660,parnell_source_00664,speech_00400,the freeman's journal,we have had a good legal opinion that all the ...
660,parnell_source_00664,speech_00400,the freeman's journal,the chairman said they had already sent out fo...
660,parnell_source_00664,speech_00400,the freeman's journal,the chairman you must settle that yourselves


### Speech Register Dataframe

We then convert the speech register into a dataframe before applying some data cleaning to make lowercase and remove spacing. We also convert dates into datetime format.

In [313]:
#prepare data for speech register dataframe, make list of lists of data and list of column names
speech_data = [speech_ids, speech_places, speech_dates]
speech_columns = ['speech_id', 'place', 'date']
#use function to turn the above lists into dataframe
speech_df = create_dataframe(speech_data, speech_columns)
#use function to clean the text in the dataframe and standardise it
speech_df = dataframe_cleaning(speech_df)
#make speech id index so we can use it when we merge dataframes below
speech_df.set_index('speech_id', inplace=True)
#convert date column to datetime format, enables us to manipulate dataframe using dates
speech_df['date'] = pd.to_datetime(speech_df['date'], format='%Y-%m-%d')
#drop empty rows
speech_df = speech_df.dropna(axis=0)
speech_df

Unnamed: 0_level_0,place,date
speech_id,Unnamed: 1_level_1,Unnamed: 2_level_1
speech_00001,"dublin, ireland",1874-07-11
speech_00002,"dublin, ireland",1875-01-21
speech_00003,"dublin, ireland",1875-01-22
speech_00004,"navan, ireland",1875-04-12
speech_00005,"london, england",1875-04-26
...,...,...
speech_00396,"cork, ireland",1881-10-02
speech_00397,"cork, ireland",1881-10-02
speech_00398,"mallow, ireland",1881-10-03
speech_00399,"dungarvan, ireland",1881-10-05


### Joint Dataframe

Having converted our speech reports and register into dataframes, we then join the dataframes together using the speech id contained in both dataframes. We end up with each sentence row in the final dataframe containing the appropriate speech data as well.

In [314]:
#join the speech register and speech report dataframes indexing on speech_id
df_all = speech_rep_df.merge(speech_df, left_on='speech_id', right_index=True)
df_all

Unnamed: 0,filename,speech_id,publication,sentence,place,date
0,parnell_source_00001,speech_00001,the nation,the home rule league great meeting in the rotu...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,there was an immense attendance the platform t...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,mr charles stewart parnell high sheriff of wic...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,the following report of the proceedings is tak...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,in view of the unwise course adopted by our op...,"dublin, ireland",1874-07-11
...,...,...,...,...,...,...
660,parnell_source_00664,speech_00400,the freeman's journal,the application is a perfectly disgraceful one...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,we have had a good legal opinion that all the ...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,the chairman said they had already sent out fo...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,the chairman you must settle that yourselves,"waterford, ireland",1881-10-05


## Topic Modelling

Having prepared our main dataframe, we are now able to begin our topic modelling stage, which will involve manipulating the data in different ways before performing topic modelling on the resulting data subsets and getting visualisations of the results.

### Sentence Count by Year

We first do a count visualisation for the number of sentences in each year of the dataset to use alongside our topic modelling process. This gives an idea of the distribution of the currently transcribed speech report material over time.

In [315]:
#make copy of main dataframe
df_year_counts = df_all.copy()
#use function to get a count of all sentences by year
year_counts_all = year_counts(df_year_counts)
#use function to get visualisation of year/sentence count
report_sent_nums = year_counts_plot(year_counts_all)
report_sent_nums

### Initializing BERTopic

We then initialize our topic model, setting our embedding model, which is a Sentence Transformers model, alongside the minimum topic size, which is the number of times a topic needs to occur in order to be included. Other parameters are also available and can be seen on the BERTopic website.

Sentence Transformers creates embeddings for all the input sentences in our dataset and uses the similarity of these embeddings to calculate sentence similarity. In the context of BERTopic these embeddings are used to calculate topics.

BERTopic is stochastic by nature and this means that every time the tool is run we get different results, although the difference between runs should ideally be relatively small.

As mentioned before, the sentence is the unit of comparison in the BERTopic process. So sentences with similar content form clusters with each other.

In [316]:
#initialize BERTopic topic model with parameters, uses a sentence transformers model to calculate topics
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", nr_topics="auto", top_n_words=15, min_topic_size=15)

### Removing Stopwords

Next we remove stopwords from the dataframe, stopwords are words that will tend to lack specific meaning within a dataset due to their high frequency of appearance. We are also able to extend them to include our own stopwords specific to a dataset, such as crowd reactions.

In [319]:
#initialize stopwords list
stop_words = stopwords.words('english')
#extend stop words to include extra words
stop_words.extend(['every', 'would', 'cheer', 'hiss', 'applause', 'groan' 'could', 'upon', 'may', 'go',
                   'said', 'say', 'know', 'far', 'come', 'put', 'us', 'parnell', 'ireland', 'irish', 'mp',
                   'mr', 'dr', 'laughter', 'laugh', 'italics', 'irishman'])

#copy from main dataframe
df_stop = df_all.copy()
#use function to remove stopwords from all sentences in dataframe
df_stop['sentence'] = df_stop['sentence'].apply(lambda x: remove_stopwords(x, stop_words))
df_stop

Unnamed: 0,filename,speech_id,publication,sentence,place,date
0,parnell_source_00001,speech_00001,the nation,home rule league great meeting rotundo great s...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,immense attendance platform gallery admission ...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,charles stewart high sheriff wicklow occupied ...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,following report proceedings taken somewhat ab...,"dublin, ireland",1874-07-11
0,parnell_source_00001,speech_00001,the nation,view unwise course adopted opponents notably e...,"dublin, ireland",1874-07-11
...,...,...,...,...,...,...
660,parnell_source_00664,speech_00400,the freeman's journal,application perfectly disgraceful one one cann...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,good legal opinion sales since passing land ac...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,chairman already sent forms requesting particu...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,chairman must settle,"waterford, ireland",1881-10-05


## <font color="blue">Alterable Cells - Dataframe Filtering</font>

<font color="blue">General guidelines:</font>

<ul style="color: blue;">
<li>Fill out dates in 'YYYY-MM-DD' format e.g. '1880-01-15'</li>   
<li>Add one or more keywords in quotes, separated by a comma to list e.g keywords = ['labor', 'labour'].</li>
<li>Use lower-case for all letters in searches.</li>
<li>keywords will act as substrings, so 'labour' will find 'labourer'.</li>
<li>To prevent words acting as substring, search with a space on either side. So ' rent ' will not find 'different', 'referent', etc.</li>
<li>Use alongside tools from previous sessions for a fuller exploration of the dataset.</li>
</ul>

<ul style="color: blue;">
<li>Run cells underneath each header sequentially</li>
<li>Try running repeatedly and looking at how patterns of topics shift</li>
<li>Try different keywords and date ranges to see what comes up</li>
</ul>

### <font color="blue">Visualisation - All Data (Stopwords Removed)</font>

<font color="blue">Nothing to amend here in terms of keywords/dates etc, but try running repeatedly (in sequence) for different results.</font>

In [320]:
topics_data_all = bertopic_topics(df_stop, topic_model)
model_all = topics_data_all[2]

<font color="blue">Below are the topics ordered by frequency of allocation to a sentence. Within each topic are the most prominent words for that topic with their weighting score. Ignore topic 0 as this refers to outliers not allocated a topic.</font>

In [321]:
model_all.get_topics()

{-1: [('hear', 0.008336596734031958),
  ('people', 0.008275234277201244),
  ('land', 0.00789695756486613),
  ('hon', 0.00762179364396049),
  ('one', 0.0075590223280279515),
  ('cheers', 0.007486549742302133),
  ('government', 0.007343448443136852),
  ('bill', 0.007309446993899213),
  ('right', 0.0072107178111463135),
  ('country', 0.0071012761359954815),
  ('great', 0.006951028356623023),
  ('question', 0.006922067142516249),
  ('could', 0.006768653408794319),
  ('gentleman', 0.006292872381913528),
  ('time', 0.006210062084489615)],
 0: [('cheers', 0.00884170540627568),
  ('land', 0.008259987929660857),
  ('hear', 0.007631504287281451),
  ('right', 0.007612700358110571),
  ('government', 0.007602332630250162),
  ('people', 0.007299329206541224),
  ('hon', 0.007170543423671186),
  ('question', 0.006692889620379164),
  ('country', 0.006636271314889421),
  ('great', 0.006593262346023479),
  ('gentleman', 0.0063573593741663645),
  ('one', 0.0062612083243022655),
  ('act', 0.006217397822469

<font color="blue">Topics are generated through a process of clustering, where documents begin in one cluster and are then gradually divided on the basis of similarity. As such, clusters of topics can have an underlying similarity. The below visualisation shows clusters of similar topics.</font>

<font color="blue">Use the slider at the bottom and the hover functionality for more detail on what the topics are.</font>

In [322]:
model_all.visualize_topics()

<font color="blue">The below visualisation is a hierarchical clustering tree, showing how clusters are related to each other and how larger topics are divided into smaller topics on the basis of similarity.</font>

In [323]:
model_all.visualize_hierarchy()

<font color="blue">Finally, we have the topics over time visualisation, showing the distribution of topics over time.</font>

In [324]:
model_all_time = bertopic_time(df_stop, model_all)
model_all_time

### <font color="blue">Date Range</font>

<font color="blue">Set dates in first cell and then run cells underneath sequentially to get results. Error messages can occur if the format of the dates is incorrect or if not enough data for model. Experiment with different date ranges.</font>

In [325]:
#set date range
#date format - 'YYYY-MM-DD'
start_date = '1886-01-01'
end_date = '1886-12-31'
#use function to restrict dataframe to a date window
df_year_range = dataframe_date_window(df_stop, start_date, end_date)

In [326]:
topics_data_years = bertopic_topics(df_year_range, topic_model)
model_years = topics_data_years[2]

In [327]:
model_years.get_topics()

{-1: [('hon', 0.027388893673581195),
  ('gentleman', 0.02261176085708787),
  ('right', 0.021808524233359196),
  ('member', 0.021394056231123722),
  ('oshea', 0.019747707362421756),
  ('party', 0.019087322529284214),
  ('house', 0.0185507512081673),
  ('hear', 0.01798106424482604),
  ('night', 0.016821634303529794),
  ('question', 0.01616408474233609),
  ('cheers', 0.015119802718125462),
  ('people', 0.014346634320929742),
  ('bill', 0.014174417727321286),
  ('future', 0.013514946770771986),
  ('could', 0.013241409556203104)],
 0: [('right', 0.021313921626800944),
  ('hon', 0.01916326272152016),
  ('gentleman', 0.018409956708576),
  ('lynch', 0.01721341352736029),
  ('oshea', 0.015946903918599067),
  ('hear', 0.015474672715601826),
  ('cheers', 0.015367209936277939),
  ('parliament', 0.014060617149259916),
  ('people', 0.013999354117616294),
  ('galway', 0.013915958735278901),
  ('one', 0.01319334131297726),
  ('ulster', 0.012999852852343953),
  ('could', 0.012867574330641547),
  ('part

In [328]:
model_years.visualize_topics()

In [329]:
model_years.visualize_hierarchy()

In [330]:
model_years_time = bertopic_time(df_year_range, model_years)
model_years_time

### <font color="blue">Publication</font>

<font color="blue">Set publication keywords or phrases in first cell and then run cells underneath sequentially to get results and visualisations. Error messages can occur if the format of the publication_word list is incorrect or if not enough data for model.</font>

<font color="blue">Experiment with different publication searches. My initial search for 'freeman' and 'journal' will pick up 'The Freeman's Journal' and 'The Weekly Freeman's Journal'.</font>

In [331]:
publication_words = ['freeman', 'journal']
#use function to restrict dataframe to rows where publication contains any of publication_words
df_publication = dataframe_keyword_any(df_stop, column='publication', keywords=publication_words)
df_publication

Unnamed: 0,filename,speech_id,publication,sentence,place,date
1,parnell_source_00002,speech_00001,the freeman's journal,home rule league saturday evening public meeti...,"dublin, ireland",1874-07-11
1,parnell_source_00002,speech_00001,the freeman's journal,platform also well filled appearance successio...,"dublin, ireland",1874-07-11
1,parnell_source_00002,speech_00001,the freeman's journal,besides gentlemen named present messrs cs high...,"dublin, ireland",1874-07-11
1,parnell_source_00002,speech_00001,the freeman's journal,half past eight oclock motion alfred webb chai...,"dublin, ireland",1874-07-11
1,parnell_source_00002,speech_00001,the freeman's journal,voice never,"dublin, ireland",1874-07-11
...,...,...,...,...,...,...
660,parnell_source_00664,speech_00400,the freeman's journal,application perfectly disgraceful one one cann...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,good legal opinion sales since passing land ac...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,chairman already sent forms requesting particu...,"waterford, ireland",1881-10-05
660,parnell_source_00664,speech_00400,the freeman's journal,chairman must settle,"waterford, ireland",1881-10-05


In [332]:
topics_publication = bertopic_topics(df_publication, topic_model)
model_publication = topics_publication[2]

In [333]:
model_publication.get_topics()

{-1: [('hear', 0.011278975089672221),
  ('cheers', 0.010812028778575147),
  ('people', 0.009406464856206912),
  ('land', 0.009024169274190025),
  ('one', 0.008725679335953574),
  ('country', 0.00856470130214614),
  ('question', 0.008303090445536225),
  ('great', 0.007827765299344291),
  ('day', 0.007363087663055115),
  ('shall', 0.007059933194415506),
  ('men', 0.006945976092676333),
  ('present', 0.006817993054029212),
  ('time', 0.006630736852559665),
  ('could', 0.006613613347219981),
  ('government', 0.006604740847028268)],
 0: [('irishmen', 0.037459382740181416),
  ('cheers', 0.03327526445959369),
  ('thank', 0.020666365612532277),
  ('loud', 0.019227972026544287),
  ('mayor', 0.016975672577562502),
  ('voice', 0.015404640707322968),
  ('address', 0.014899007767174432),
  ('gentlemen', 0.014351965452552247),
  ('men', 0.014212894706484139),
  ('meeting', 0.013988328030920275),
  ('magnificent', 0.012840981414328372),
  ('great', 0.012815950596593212),
  ('courage', 0.0124579766127

In [334]:
model_publication.visualize_topics()

In [335]:
model_publication.visualize_hierarchy()

In [336]:
model_publication_time = bertopic_time(df_publication, model_publication)
model_publication_time

### <font color="blue">Place</font>

<font color="blue">Set place keywords or phrases in first cell and then run cells underneath sequentially to get results and visualisations. Error messages can occur if the format of the place_words list is incorrect or if not enough data for model.</font>

<font color="blue">Experiment with different place searches. Can use a series of place names to get more than one place</font>

In [355]:
place_words = ['london']
#use function to restrict dataframe to rows where place name contains any of place_words
df_place = dataframe_keyword_any(df_stop, column='place', keywords=place_words)
df_place

Unnamed: 0,filename,speech_id,publication,sentence,place,date
16,parnell_source_00017,speech_00005,hansard's parliamentary debates,supporting motion hon member cavan observed ar...,"london, england",1875-04-26
16,parnell_source_00017,speech_00005,hansard's parliamentary debates,hon member derry r smyth although agreed princ...,"london, england",1875-04-26
16,parnell_source_00017,speech_00005,hansard's parliamentary debates,chief secretary open foe course opposed also n...,"london, england",1875-04-26
16,parnell_source_00017,speech_00005,hansard's parliamentary debates,reason hon member derry given approving princi...,"london, england",1875-04-26
16,parnell_source_00017,speech_00005,hansard's parliamentary debates,coercion necessary district prevent catholics ...,"london, england",1875-04-26
...,...,...,...,...,...,...
639,parnell_source_00643,speech_00382,the freeman's journal,entered lengthy contrast policies governments ...,"london, england",1881-06-22
639,parnell_source_00643,speech_00382,the freeman's journal,view apprehension accession conservatives power,"london, england",1881-06-22
639,parnell_source_00643,speech_00382,the freeman's journal,view general election might occur moment advis...,"london, england",1881-06-22
639,parnell_source_00643,speech_00382,the freeman's journal,land league change one single inch platform,"london, england",1881-06-22


In [338]:
topics_place = bertopic_topics(df_place, topic_model)
model_place = topics_place[2]

In [339]:
model_place.get_topics()

{-1: [('government', 0.009930230745380688),
  ('hon', 0.009828502645045391),
  ('bill', 0.009118582707266825),
  ('house', 0.008645000990781722),
  ('right', 0.008428682420456255),
  ('gentleman', 0.008172711491207123),
  ('could', 0.008057167797414557),
  ('people', 0.007733899543180564),
  ('member', 0.007137670044042689),
  ('one', 0.007047263934926646),
  ('members', 0.006904967756014047),
  ('country', 0.006764423555422183),
  ('time', 0.0064689966774007),
  ('great', 0.006465014373836054),
  ('question', 0.0063870693174196275)],
 0: [('right', 0.013876069957524067),
  ('gentleman', 0.013685747407151167),
  ('hon', 0.01356847169376498),
  ('secretary', 0.009954184100182905),
  ('amendment', 0.00949336460390386),
  ('question', 0.009303586622941493),
  ('government', 0.009299027807815642),
  ('order', 0.009068985348629537),
  ('member', 0.008778966452336785),
  ('chief', 0.008393263038370803),
  ('parliament', 0.008345102542811431),
  ('landlords', 0.008269262930130285),
  ('tenant

In [340]:
model_place.visualize_topics()

In [341]:
model_place.visualize_hierarchy()

In [342]:
model_place_time = bertopic_time(df_place, model_place)
model_place_time

### <font color="blue">Sentences Containing Any Keywords in List</font>

<font color="blue">Set sentence keywords or phrases in first cell and then run cells underneath sequentially to get results and visualisations. Error messages can occur if the format of the keywords_any list is incorrect or if not enough data for model.</font>

<font color="blue">Experiment with different keyword searches. Can use a series of keywords and phrases to get a range of results</font>

In [343]:
keywords_any = ['america']
#use function to restrict dataframe to rows where sentence contains any of keywords_any
df_keyword_any = dataframe_keyword_any(df_stop, column='sentence', keywords=keywords_any)
df_keyword_any

Unnamed: 0,filename,speech_id,publication,sentence,place,date
9,parnell_source_00010,speech_00004,the drogheda argus,england remember example set american colonies...,"navan, ireland",1875-04-12
10,parnell_source_00011,speech_00004,the freeman's journal,england remember example set american colonies...,"navan, ireland",1875-04-12
11,parnell_source_00012,speech_00004,the dundalk democrat,england remember example set american colonies...,"navan, ireland",1875-04-12
12,parnell_source_00013,speech_00004,the nation,england remember example set american colonies...,"navan, ireland",1875-04-12
68,parnell_source_00071,speech_00017,the cork examiner,contended fenian prisoners done moral wrong qu...,"london, england",1875-08-01
...,...,...,...,...,...,...
654,parnell_source_00658,speech_00394,the freeman's journal,per cent call tenant farmers admire “see fine ...,"maryborough, ireland",1881-09-26
654,parnell_source_00658,speech_00394,the freeman's journal,commencement new system american competition c...,"maryborough, ireland",1881-09-26
657,parnell_source_00661,speech_00397,the freeman's journal,helped helped past countries descendants irish...,"cork, ireland",1881-10-02
657,parnell_source_00661,speech_00397,the freeman's journal,exiled countrymen america work renewed vigour ...,"cork, ireland",1881-10-02


In [344]:
topics_keyword_any = bertopic_topics(df_keyword_any, topic_model)
model_keyword_any = topics_place[2]

In [345]:
model_keyword_any.get_topics()

{-1: [('america', 0.07411963417692041),
  ('great', 0.03905965537664548),
  ('country', 0.03427551920001184),
  ('people', 0.03312199194872208),
  ('american', 0.03177520279583188),
  ('government', 0.030479648659017308),
  ('hear', 0.026936439001383643),
  ('states', 0.025764817401458015),
  ('irishmen', 0.02275736721523579),
  ('england', 0.02155991133738988),
  ('time', 0.02147152311022702),
  ('opinion', 0.02147152311022702),
  ('last', 0.021249921669766365),
  ('months', 0.021102291763543054),
  ('two', 0.02085557362550014)],
 0: [('america', 0.05952642358933751),
  ('land', 0.0466203900760042),
  ('people', 0.040503874506769004),
  ('help', 0.02907010743478977),
  ('famine', 0.02819103830551288),
  ('england', 0.0254955258228148),
  ('landlords', 0.0251705699156365),
  ('relief', 0.023794604705789164),
  ('country', 0.023760302660395922),
  ('american', 0.023549652786254476),
  ('charity', 0.022623233867167403),
  ('government', 0.019894139742670444),
  ('time', 0.019380071623193

In [346]:
model_keyword_any.visualize_topics()

In [347]:
model_keyword_any.visualize_hierarchy()

In [348]:
keyword_any_time = bertopic_time(df_keyword_any, model_keyword_any)
keyword_any_time