Import libraries for the BERTopic topic modelling process

In [1]:
import pickle
import pandas as pd
import cufflinks as cf

import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import *

from ipywidgets import FloatProgress
from bertopic import BERTopic
from natsort import natsorted

cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

This function takes a dataframe and a topic model. It converts the date column and the sentence text column into lists, fits the topic model to the sentence list, then builds a dictionary with topic numbers as keys and the list of sentences assigned to each topic as values.

The sentence list, topics and date list are then used to make a visualisation of topic frequency over time, with the traces drawn as scattered points rather than lines.

Returns the visualisation and the dictionary of topics/sentences as a tuple, in that order.

In [2]:
def bertopic_time(dataframe, topic_model):
    sent_list = dataframe["text"].to_list()
    date_list = dataframe["date"].to_list()
    # Fit the model and assign a topic number to every sentence
    topics, probs = topic_model.fit_transform(sent_list)

    # Group the sentences by their assigned topic number
    topic_docs = {topic: [] for topic in set(topics)}
    for topic, doc in zip(topics, sent_list):
        topic_docs[topic].append(doc)

    # Topic frequency per time bin
    topics_over_time = topic_model.topics_over_time(
        docs=sent_list,
        topics=topics,
        timestamps=date_list,
        global_tuning=True,
        evolution_tuning=True,
        nr_bins=20,
    )
    fig = topic_model.visualize_topics_over_time(
        topics_over_time, top_n_topics=15, height=500, width=1000
    )
    fig.update_layout(yaxis_title="Count")
    # Draw scattered points instead of lines
    for trace in fig.data:
        trace.update(mode="markers")
    return (fig, topic_docs)
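
The grouping step inside bertopic_time can be tried on its own without fitting a model. The sketch below is a minimal illustration: the sentences are short toy strings, the topic numbers are invented, and `group_docs_by_topic` is a hypothetical helper name rather than part of BERTopic.

```python
from collections import defaultdict


def group_docs_by_topic(topics, docs):
    """Map each topic number to the list of documents assigned to it."""
    grouped = defaultdict(list)
    for topic, doc in zip(topics, docs):
        grouped[topic].append(doc)
    return dict(grouped)


# Toy input: the topic numbers are invented for illustration
# (-1 is the label BERTopic gives to outlier documents).
docs = ["mead lambeth april fossils", "thank", "iron pupil", "care leadbeater"]
topics = [0, -1, 0, 1]

topic_docs = group_docs_by_topic(topics, docs)
print(topic_docs)
```

Using a defaultdict avoids pre-building the keys from `set(topics)`; the resulting dictionary has the same shape as the one returned by bertopic_time.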

This function takes a dataframe and the tuple returned by the bertopic_time function and builds a new dataframe with a "topic" column. The new dataframe contains only the topics shown in the visualisation, which are limited to the top n topics in the dataset, not all generated topics. It is created for the social network visualisations for each topic in notebook four.

First, the tuple is split into the visualisation and the topic/sentence dictionary. Duplicates are removed from each sentence list by converting it to a set and back to a list.

The dataframe is then searched for the sentences belonging to each topic. Rows whose text matches are copied into a new dataframe with a "topic" column containing the topic number. The dataframes for all topics are collected in a list and concatenated into one dataframe.

The trace data from the visualisation is then used to build ordered lists of the topic names and numbers it displays. The final dataframe is restricted to these "top topics", and the topic numbers created above are replaced with the topic names from the visualisation.

Returns the dataframe for the visualisation topics, with a topic name for each sentence.

In [3]:
def doc_topics_maker(dataframe, vis_obj):
    # Unpack the (figure, topic -> sentences) tuple returned by bertopic_time
    visual, doc_topics = vis_obj
    # Remove duplicate sentences within each topic
    doc_topics = {topic: list(set(docs)) for topic, docs in doc_topics.items()}

    df_list = []
    for topic, docs in doc_topics.items():
        mask = dataframe["text"].isin(docs)
        # Copy so that adding a column does not write to a view of the input
        top_docs_df = dataframe.loc[mask].copy()
        top_docs_df["topic"] = topic
        df_list.append(top_docs_df)
    doc_topics_df = pd.concat(df_list)

    # Trace names start with the topic number (e.g. "0_plant_botany_..."), so
    # natural sorting lines them up with the topic numbers 0..n-1
    topic_names = natsorted([trace.name for trace in visual.data])
    topic_nums = list(range(len(topic_names)))

    doc_topics_df = doc_topics_df.loc[doc_topics_df["topic"].isin(topic_nums)].copy()
    # Rename within the "topic" column only, so numbers in other columns are untouched
    doc_topics_df["topic"] = doc_topics_df["topic"].map(dict(zip(topic_nums, topic_names)))
    return doc_topics_df
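
The renaming step relies on the trace names sorting into topic-number order. The sketch below shows, in plain Python with no natsort dependency, why an ordinary string sort is not enough and how the leading number can be parsed instead. It assumes every name starts with the topic number followed by an underscore; two of the names are invented for illustration, and `topic_number` is a hypothetical helper.

```python
def topic_number(trace_name):
    """Return the integer prefix of a name such as '14_edinburgh_glasgow'."""
    return int(trace_name.split("_", 1)[0])


trace_names = [
    "14_edinburgh_glasgow_dublin_scotland",
    "2_example_topic",  # invented name
    "0_plant_botany_flora_herbarium",
    "10_another_example",  # invented name
]

# A plain string sort compares character by character, so "10_..." and
# "14_..." come before "2_...".
lexicographic = sorted(trace_names)

# Sorting on the parsed number gives the order the topic numbers imply.
by_number = sorted(trace_names, key=topic_number)

# Mapping from topic number to display name, keyed on the parsed prefix
# rather than on list position.
number_to_name = {topic_number(name): name for name in trace_names}

print(lexicographic)
print(by_number)
```

Keying the mapping on the parsed prefix keeps each name attached to its own topic number even if a trace is missing from the figure.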

This function takes a dataframe and a topic model, extracts the sentences from the dataframe, fits the topic model to them and returns a dendrogram of clusters for the most prominent topics.

In [4]:
def bertopic_tree(dataframe, topic_model, topics_num=15):
    sent_list = dataframe["text"].to_list()
    topics, probs = topic_model.fit_transform(sent_list)
    fig = topic_model.visualize_hierarchy(top_n_topics=topics_num)
    return fig

This function counts the number of letters in each year across the corpus and returns a series of letter counts per year.

In [5]:
def year_counts(dataframe):
    # Group on the year of each date without adding a column to the input,
    # which may be a filtered slice of another dataframe
    years = dataframe["date"].dt.year.rename("year")
    return dataframe.groupby(years).size()
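
A small example of the counting step, using four invented letter dates. It is a sketch that assumes the "date" column already has a datetime dtype, which the `.dt` accessor requires.

```python
import pandas as pd

# Invented dates for illustration; the real data is loaded from pickle files.
letters = pd.DataFrame(
    {"date": pd.to_datetime(["1820-04-24", "1820-11-02", "1822-12-16", "1828-11-11"])}
)

# Group on the year extracted from each date and count the rows per group.
years = letters["date"].dt.year.rename("year")
letters_per_year = letters.groupby(years).size()

print(letters_per_year)
```

The result is a series indexed by year, which is the shape the plotting function in the next cell expects.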

Function which takes the year counts series for letters per year and converts into a bar plot.

In [6]:
def year_counts_plot(year_counts):
    year_counts_plot = px.bar(year_counts, 
                          title="Letter Count by Year",
                          labels= {
                              "value":"Count",
                              "year":"Year"
                          }
                         )
    year_counts_plot.update_layout(showlegend=False)
    return year_counts_plot

This function creates a topic frequency over time graph that omits outlying years with very small or very large numbers of letters. It takes a dataframe, the year counts series of letters per year, a topic model, and lower and upper limits on letters per year, which have amendable default values.

First, the function builds the year window by dropping years whose letter count is at or below the lower limit or at or above the upper limit. It then keeps only the rows of the main dataframe dated in one of the remaining years. Finally, it passes the resulting dataframe to bertopic_time and returns the visualisation of topics over time.

In [7]:
def year_window_vis(dataframe, year_counts, topic_model, low_lim=20, high_lim=150):
    year_counts_window = year_counts[(year_counts > low_lim) & (year_counts < high_lim)]
    year_window = year_counts_window.index.tolist()
    year_win_df = dataframe[dataframe["date"].dt.year.isin(year_window)]
    year_win_vis_obj = bertopic_time(year_win_df, topic_model)
    year_win_vis = year_win_vis_obj[0]
    return year_win_vis
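
The windowing logic can be checked separately from the topic model. The sketch below uses invented counts and sentence rows; only the filtering steps mirror the function above.

```python
import pandas as pd

# Invented counts of letters per year, indexed by year.
counts = pd.Series({1820: 5, 1826: 160, 1834: 40, 1848: 75}, name="letters")

low_lim, high_lim = 20, 150

# Strict inequalities: a year with exactly low_lim or high_lim letters is dropped.
window = counts[(counts > low_lim) & (counts < high_lim)]
years_kept = window.index.tolist()

# Invented sentence rows; only those dated in a kept year survive.
sentences = pd.DataFrame(
    {
        "date": pd.to_datetime(["1820-04-24", "1826-02-16", "1834-08-20", "1848-08-21"]),
        "text": ["a", "b", "c", "d"],
    }
)
in_window = sentences[sentences["date"].dt.year.isin(years_kept)]
print(in_window)
```

Here 1820 falls below the lower limit and 1826 exceeds the upper limit, so only the 1834 and 1848 rows remain.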

Import dataframes of clean data created in notebook zero, one with letters divided into sentences and the other with letter texts as one string.

In [8]:
sent_df = pd.read_pickle("pickle/henslow_sentences.pkl")
df = pd.read_pickle("pickle/henslow_texts.pkl")

Remove rows whose text column is empty, which can occur when parts of the text are removed during the data cleaning process. The remaining token lists are then joined into space-separated strings.

In [9]:
sent_df = sent_df[sent_df["text"].str.len() != 0].copy()
sent_df.reset_index(inplace=True, drop=True)
sent_df["text"] = sent_df["text"].str.join(" ")
sent_df

Unnamed: 0,letter,date,sender,recipient,text
0,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",mead lambeth april fossils
1,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",remembrance sedgwick favour amm
2,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",thank
3,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",clark lecture mot
4,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",iron pupil
...,...,...,...,...,...
9155,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",london novr bennett act subcommittee managemen...
9156,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",advantage catalogue collection bird philosophi...
9157,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",care leadbeater
9158,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",refer case bird jenyn promise
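
The cleaning step can be illustrated on a toy dataframe. This sketch assumes, as the `.str.join(" ")` call suggests, that each "text" cell holds a list of tokens; the rows are invented.

```python
import pandas as pd

# Invented rows: each "text" cell is a list of tokens, and one list is empty.
toy = pd.DataFrame({"text": [["iron", "pupil"], [], ["thank"]]})

# .str.len() on a column of lists gives the list length, so 0 marks empty rows.
# .copy() avoids assigning into a view of the original frame.
cleaned = toy[toy["text"].str.len() != 0].copy()
cleaned.reset_index(drop=True, inplace=True)

# .str.join(" ") concatenates the tokens of each list into one string.
cleaned["text"] = cleaned["text"].str.join(" ")
print(cleaned)
```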


Use a boolean mask to create a dataframe restricted to the 1820s for use in visualisation.

In [10]:
mask_20 = (sent_df["date"] >= "1820/01/01") & (sent_df["date"] < "1830/01/01")
sent_df_year_restrict_20 = sent_df.loc[mask_20]
sent_df_year_restrict_20

Unnamed: 0,letter,date,sender,recipient,text
0,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",mead lambeth april fossils
1,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",remembrance sedgwick favour amm
2,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",thank
3,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",clark lecture mot
4,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",iron pupil
...,...,...,...,...,...
9155,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",london novr bennett act subcommittee managemen...
9156,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",advantage catalogue collection bird philosophi...
9157,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",care leadbeater
9158,letters_1252.xml,1828-11-11,"Yarrell, William","Henslow, J. S.",refer case bird jenyn promise
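
A date-range mask of this kind can be checked on a few invented dates at the edges of the decade. The sketch below uses a half-open interval (start included, end excluded) so that every day of the 1820s falls inside the range; this is one way to write the bounds.

```python
import pandas as pd

# Invented dates around the edges of the 1820s.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["1819-12-31", "1820-01-01", "1829-12-31", "1830-01-01"])}
)

# Half-open interval: 1820-01-01 is included, 1830-01-01 is excluded.
start, end = pd.Timestamp("1820-01-01"), pd.Timestamp("1830-01-01")
mask = (toy["date"] >= start) & (toy["date"] < end)
decade = toy.loc[mask]
print(decade)
```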


Use a boolean mask to create a dataframe restricted to the period 1845 to 1857 for use in visualisation.

In [11]:
mask_late = (sent_df["date"] >= "1845/01/01") & (sent_df["date"] < "1858/01/01")
sent_df_year_restrict_late = sent_df.loc[mask_late]
sent_df_year_restrict_late

Unnamed: 0,letter,date,sender,recipient,text
2608,letters_229.xml,1845-01-19,"Maund, Benjamin","Henslow, J. S.",bromsgrove january professor dict
2609,letters_229.xml,1845-01-19,"Maund, Benjamin","Henslow, J. S.",claim
2610,letters_229.xml,1845-01-19,"Maund, Benjamin","Henslow, J. S.",vulgaris elatior veris variety species opinion
2611,letters_229.xml,1845-01-19,"Maund, Benjamin","Henslow, J. S.",inconvenient note evidence notice
2612,letters_229.xml,1845-01-19,"Maund, Benjamin","Henslow, J. S.",scientist variety williams pitmaston seed cows...
...,...,...,...,...,...
9138,letters_1250.xml,1848-08-21,"Henslow, J. S.","Ransome, George",duplicate tube specimen
9139,letters_1250.xml,1848-08-21,"Henslow, J. S.","Ransome, George",sphaeria attack fructifie scape neck
9140,letters_1250.xml,1848-08-21,"Henslow, J. S.","Ransome, George",importunate hint sort book
9141,letters_1250.xml,1848-08-21,"Henslow, J. S.","Ransome, George",lindley vegetable kingdom


Create dataframes restricted to correspondence sent by Henslow and correspondence sent to him for visualisations.

In [12]:
henslow_sent_df = sent_df.loc[sent_df["sender"] == "Henslow, J. S."]
henslow_df = df.loc[df["sender"] == "Henslow, J. S."]
non_henslow_sent_df = sent_df.loc[sent_df["sender"] != "Henslow, J. S."]
non_henslow_df = df.loc[df["sender"] != "Henslow, J. S."]

Initialise the BERTopic model with its parameters: the sentence embedding model, automatic topic reduction and the minimum topic size.

In [13]:
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", nr_topics="auto", min_topic_size=15)

Create visualisation of topics over time for all input data using bertopic_time function.

In [14]:
all_vis_obj = bertopic_time(sent_df, topic_model)
all_vis = all_vis_obj[0]
all_vis

Create visualisation of topics over time for 1820s letters using bertopic_time function.

In [15]:
sent_df_year_restrict_20_vis_obj = bertopic_time(sent_df_year_restrict_20, topic_model)
sent_df_year_restrict_20_vis = sent_df_year_restrict_20_vis_obj[0]
sent_df_year_restrict_20_vis

Create visualisation of topics over time for later-period letters using bertopic_time function.

In [16]:
sent_df_year_restrict_late_obj = bertopic_time(sent_df_year_restrict_late, topic_model)
sent_df_year_restrict_late_vis = sent_df_year_restrict_late_obj[0]
sent_df_year_restrict_late_vis

Use the doc_topics_maker function to create a dataframe for each visualisation, limited to the topics shown in that visualisation and giving the topic name taken from the visualisation for each row. These dataframes are created for use with the social network visualisations in notebook four.

In [17]:
doc_topics_df = doc_topics_maker(sent_df, all_vis_obj)
doc_topics_df

Unnamed: 0,letter,date,sender,recipient,text,topic
21,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",pentacrinitis caryophyllea,0_plant_botany_flora_herbarium
23,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",madriporite tubipore entrochi carypohyllea scoria,0_plant_botany_flora_herbarium
40,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",addenda plant wynch,0_plant_botany_flora_herbarium
43,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",alisma ranunculoide officinalis arenaria verna...,0_plant_botany_flora_herbarium
46,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",arenarius draba dryas octopetala epilobium als...,0_plant_botany_flora_herbarium
...,...,...,...,...,...,...
8839,letters_1221.xml,1834-08-20,"Henslow, J. S.","Phillips, John",request edinburgh proposes council,14_edinburgh_glasgow_dublin_scotland
8849,letters_1222.xml,1834-02-10,"Henslow, J. S.","Phillips, John",edinburgh generality answer query,14_edinburgh_glasgow_dublin_scotland
8852,letters_1222.xml,1834-02-10,"Henslow, J. S.","Phillips, John",forego week edinburgh sd sportsman,14_edinburgh_glasgow_dublin_scotland
8865,letters_1223.xml,1836-08-08,"Henslow, J. S.","Phillips, John",trust stop dublin,14_edinburgh_glasgow_dublin_scotland


In [18]:
doc_topics_df_20s = doc_topics_maker(sent_df, sent_df_year_restrict_20_vis_obj)
doc_topics_df_20s

Unnamed: 0,letter,date,sender,recipient,text,topic
7,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",punctatus,0_plant_specimen_arvensis_vulgaris
20,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",nautilus complanatus,0_plant_specimen_arvensis_vulgaris
21,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",pentacrinitis caryophyllea,0_plant_specimen_arvensis_vulgaris
23,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",madriporite tubipore entrochi carypohyllea scoria,0_plant_specimen_arvensis_vulgaris
28,letters_3.xml,1821-07-02,"Cumming, James","Henslow, J. S.",evening result specimen goodness insert paper ...,0_plant_specimen_arvensis_vulgaris
...,...,...,...,...,...,...
6491,letters_864.xml,1826-02-16,"Wood, J.","Henslow, J. S.",illeg,14_illeg_lamb_thro_health
6599,letters_887.xml,1826-03-10,"Lamb, J.","Henslow, J. S.",illeg illeg meeting hour lamb,14_illeg_lamb_thro_health
6868,letters_948.xml,1826-06-10,"Burch, J.","Henslow, J. S.",illeg,14_illeg_lamb_thro_health
6879,letters_951.xml,1826-07-02,"Lamb, J.","Henslow, J. S.",lamb,14_illeg_lamb_thro_health


In [19]:
doc_topics_df_late = doc_topics_maker(sent_df, sent_df_year_restrict_late_obj)
doc_topics_df_late

Unnamed: 0,letter,date,sender,recipient,text,topic
876,letters_75.xml,1828-07-07,"Greville, R. K.","Henslow, J. S.",glasgow,0_museum_ipswich_suffolk_meeting
890,letters_75.xml,1828-07-07,"Greville, R. K.","Henslow, J. S.",paris,0_museum_ipswich_suffolk_meeting
2625,letters_230.xml,1845-02-06,"Henslow, J. S.","Whewell, William",appointment curator biggs arrangement superint...,0_museum_ipswich_suffolk_meeting
2646,letters_237.xml,1846-08-28,"Henslow, J. S.","Whewell, William",suffolk august whewell silence envelope hitcha...,0_museum_ipswich_suffolk_meeting
2665,letters_238.xml,1847-01-25,"Henslow, J. S.","Jenyns, Leonard",town,0_museum_ipswich_suffolk_meeting
...,...,...,...,...,...,...
8004,letters_1126.xml,1855-05-14,"Henslow, J. S.",Unknown,cambridge,14_cambridge_oxford_provost_minority
8214,letters_1154.xml,1858-03-21,"Hooker, J. D.","Henslow, J. S.",cambridge,14_cambridge_oxford_provost_minority
8494,letters_1182.xml,1856-10-28,"Henslow, J. S.","Gould, John.",home cambridge downing_college cambridge night...,14_cambridge_oxford_provost_minority
8707,letters_1206.xml,1826-02-26,"Henslow, J. S.","Dale, J. C.",cambridge feb,14_cambridge_oxford_provost_minority


Use the bertopic_tree function to create a dendrogram visualisation of clusters for the most prominent topics, with the dataframe and topic model as parameters.

In [20]:
all_tree_vis = bertopic_tree(sent_df, topic_model)
all_tree_vis

Use year_counts and year_counts_plot functions to get a count of letters in each year and visualise.

In [21]:
year_counts_all = year_counts(df)
letter_nums = year_counts_plot(year_counts_all)
letter_nums

Create visualisation of topics over time for letters sent by Henslow using bertopic_time function.

In [22]:
henslow_vis_obj = bertopic_time(henslow_sent_df, topic_model)
henslow_vis = henslow_vis_obj[0]
henslow_vis

Use year_counts and year_counts_plot functions to get a count of letters from Henslow in each year and visualise.

In [23]:
year_counts_henslow = year_counts(henslow_df)
letter_nums_hen = year_counts_plot(year_counts_henslow)
letter_nums_hen

Create visualisation of topics over time for letters sent to Henslow using bertopic_time function.

In [24]:
non_henslow_vis_obj = bertopic_time(non_henslow_sent_df, topic_model)
non_henslow_vis = non_henslow_vis_obj[0]
non_henslow_vis

Use year_counts and year_counts_plot functions to get a count of letters to Henslow in each year and visualise.

In [25]:
year_counts_non_henslow = year_counts(non_henslow_df)
letter_nums_non_hen = year_counts_plot(year_counts_non_henslow)
letter_nums_non_hen

Use year_window_vis function to create a visualisation of topics over time, omitting years with very high or low numbers of letters.

In [26]:
all_win_vis = year_window_vis(sent_df, year_counts_all, topic_model)
all_win_vis

Use year_window_vis function to create a visualisation of topics over time for letters sent by Henslow, omitting years with very high or low numbers of letters.

In [27]:
henslow_win_vis = year_window_vis(henslow_sent_df, 
                                  year_counts_henslow, 
                                  topic_model, 
                                  low_lim=10, 
                                  high_lim=150
                                 )
henslow_win_vis

Use year_window_vis function to create a visualisation of topics over time for letters sent to Henslow, omitting years with very high or low numbers of letters.

In [28]:
non_henslow_win_vis = year_window_vis(non_henslow_sent_df, 
                                      year_counts_non_henslow, 
                                      topic_model, 
                                      low_lim=10, 
                                      high_lim=150
                                     )
non_henslow_win_vis

Save the visualisations to HTML files.

In [29]:
all_vis.write_html("images/all_letters/bertopic_noun_propn.html")
all_win_vis.write_html("images/all_letters/bertopic_noun_propn_window.html")
all_tree_vis.write_html("images/all_letters/bertopic_topic_tree.html")

sent_df_year_restrict_20_vis.write_html("images/year_window/bertopic_20s.html")
sent_df_year_restrict_late_vis.write_html("images/year_window/bertopic_1845_1857.html")

henslow_vis.write_html("images/from_henslow/from_henslow.html")
henslow_win_vis.write_html("images/from_henslow/from_henslow_window.html")

non_henslow_vis.write_html("images/to_henslow/to_henslow.html")
non_henslow_win_vis.write_html("images/to_henslow/to_henslow_window.html")

letter_nums.write_html("images/all_letters/year_count.html")
letter_nums_hen.write_html("images/from_henslow/year_count_from_henslow.html")
letter_nums_non_hen.write_html("images/to_henslow/year_count_to_henslow.html")
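
Writing to nested paths such as images/all_letters/ can fail if the directory does not exist yet. The sketch below shows one way to create the output folders first, using only the standard library. `ensure_output_dirs` is a hypothetical helper, and a temporary directory stands in for the project root so that the example has no side effects.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(base, subdirs):
    """Create each subdirectory under base if it is missing; return the paths."""
    created = []
    for name in subdirs:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Folder names taken from the write_html calls in the cell above.
subdirs = ["all_letters", "year_window", "from_henslow", "to_henslow"]

with tempfile.TemporaryDirectory() as root:
    dirs = ensure_output_dirs(Path(root) / "images", subdirs)
    all_exist = all(d.is_dir() for d in dirs)
    # Calling it again is harmless because exist_ok=True.
    ensure_output_dirs(Path(root) / "images", subdirs)

print(all_exist)
```

In the notebook itself the helper would be called with "images" as the base before the save cell runs.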

Save dataframes containing topic names for visualisations for use in social network analysis

In [30]:
doc_topics_df.to_pickle("pickle/documents_topics.pkl")
doc_topics_df_20s.to_pickle("pickle/documents_topics_20s.pkl")
doc_topics_df_late.to_pickle("pickle/documents_topics_late.pkl")