Import the libraries for Latent Dirichlet Allocation (LDA) topic modelling. Warnings are suppressed first because these libraries emit some unnecessary ones.

In [31]:
import warnings
warnings.filterwarnings('ignore')

In [32]:
import pickle
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe"

import gensim
import gensim.corpora as corpora
import pyLDAvis
import pyLDAvis.gensim_models

Load the dataframe of cleaned data created in notebook zero.

In [33]:
df = pd.read_pickle("pickle/henslow_texts.pkl")
df.head()

Unnamed: 0,letter,date,sender,recipient,text
0,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.","[mead, lambeth, april, fossils, remembrance, f..."
1,letters_2.xml,1821-11-15,"Clarke, E. D.","Henslow, J. S.","[november, analysis, grain, mineral, anglesea,..."
2,letters_3.xml,1821-07-02,"Cumming, James","Henslow, J. S.","[evening, result, specimen, goodness, insert, ..."
3,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard","[december, leonard, addenda, plant, wynch, boo..."
4,letters_5.xml,1822-11-11,"Brewster, David","Henslow, J. S.","[edinburgh, coates, crescent, november, prince..."


Send text column to list for later use.

In [34]:
df_list = df["text"].to_list()
df_list[:2]

[['mead',
  'lambeth',
  'april',
  'fossils',
  'remembrance',
  'favour',
  'amm',
  'sedgwickii',
  'thank',
  'clark',
  'lecture',
  'mot',
  'iron',
  'pupil',
  'trouble',
  'parcell',
  'help',
  'catalogue',
  'fossil',
  'isle',
  'punctatus',
  'martin',
  'sowerby',
  'lin',
  'tran',
  'pt',
  'page',
  'producti',
  'spirifer',
  'side',
  'productus',
  'scoticus',
  'spirifer',
  'cardium',
  'productus',
  'productus',
  'productus',
  'stria',
  'thready',
  'one',
  'productus',
  'trilobite',
  'amm',
  'henslowi',
  'nautilus',
  'complanatus',
  'pentacrinitis',
  'caryophyllea',
  'madriporite',
  'tubipore',
  'entrochi',
  'carypohyllea',
  'scoria'],
 ['november',
  'analysis',
  'grain',
  'mineral',
  'anglesea',
  'form',
  'gr',
  'silica',
  'alumina',
  'soda',
  'lime',
  'water',
  'absorption',
  'iron',
  'grain',
  'ch',
  'mineral',
  'gelatinize',
  'friction',
  'analcine',
  'variety',
  'clarke']]

Send date column to list for later use.

In [35]:
date_list = df["date"].to_list()
date_list[:5]

[Timestamp('1820-04-24 00:00:00'),
 Timestamp('1821-11-15 00:00:00'),
 Timestamp('1821-07-02 00:00:00'),
 Timestamp('1822-12-16 00:00:00'),
 Timestamp('1822-11-11 00:00:00')]

Give each word in the corpus an integer id using Gensim's corpora module.

In [36]:
id2word = corpora.Dictionary(df_list)

Convert each text into bag of words tuples with this structure: (integer id from above, freq in doc).

In [37]:
corpus = [id2word.doc2bow(text) for text in df_list]

Use the id2word dictionary to convert the integer ids in the "corpus" list back into words. The end result is a list of tuples for each text with this structure: (word, freq in doc).

In [38]:
id_words = [[(id2word[word_id], count) for word_id, count in text] for text in corpus]
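The three steps above (assign ids, count words per document, map ids back to words) can be sketched in plain Python. This is only an illustration of the idea on made-up tokens, not Gensim's implementation: the names `toy_docs`, `token2id` and `to_bow` are hypothetical, and Gensim may assign ids in a different order.

```python
from collections import Counter

# Toy tokenised documents (hypothetical, not taken from the letters).
toy_docs = [
    ["fossil", "iron", "fossil"],
    ["iron", "mineral"],
]

# Assign each distinct token an integer id in order of first appearance.
# Assumption: the id order is arbitrary; only the mapping idea matters here.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Bag-of-words version of every document: (integer id, freq in doc).
toy_corpus = [to_bow(doc) for doc in toy_docs]

# Reverse lookup to turn ids back into words: (word, freq in doc).
id2token = {token_id: token for token, token_id in token2id.items()}
toy_id_words = [[(id2token[i], n) for i, n in bow] for bow in toy_corpus]

print(toy_corpus)
print(toy_id_words)
```

Each word that occurs more than once in a text collapses into a single tuple with its count, which is why the bag-of-words form discards word order.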

Set the number of topics for the topic model; this can be adjusted.

Train an LDA model on the bag-of-words texts using Gensim. Both the id2word dictionary and the corpus list are required as parameters.

In [39]:
num_topics = 5

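lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=num_topics,
    random_state=100,  # fixed seed so the topics are reproducible
    update_every=1,    # update the model after every chunk
    chunksize=100,     # number of documents per training chunk
    passes=100,        # number of full passes over the corpus
    alpha="auto",      # learn an asymmetric document-topic prior from the data
)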

Use pyLDAvis, a library designed specifically for visualising LDA topic models, to create an interactive visualisation of the topics.

In [40]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis

Having done this, we can now begin preparing our data for a visualisation of LDA topics over time.

Extract topics and scores for each document from the LDA model initialised above.

Use the topic number to extract the topic's words, take the top 5 words for each topic and join them together as a string. Multiply the scores by 100 to get percentages: in LDA each document is divided into topic scores adding up to 100 percent (Gensim leaves out topics that score below its minimum_probability threshold, so the listed scores can total slightly less).

For each topic in each document, create a list holding the topic keywords and the topic's percentage score for that document. Append it to a list for the document, then append that document list to a list of scores for the whole corpus.

In [41]:
all_scores = []
for doc in lda_model[corpus]:
    doc_scores = []
    for topic_num, score in doc:
        top_5_words = lda_model.show_topic(topic_num, topn=5)
        topic_keywords = ", ".join(word for word, _ in top_5_words)
        score_perc = score * 100
        topic_ls = [topic_keywords, score_perc]
        doc_scores.append(topic_ls)
    all_scores.append(doc_scores)
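The percentages collected for a document can total slightly under 100, as with the 17.04 and 82.53 shown for the first letter further down. This is most likely because Gensim's LdaModel drops topics below its `minimum_probability` setting (0.01 by default) when a document is queried. The plain-Python sketch below imitates that filtering on a made-up distribution; `full_distribution` and `threshold` are hypothetical names and values, not output from the model.

```python
# Hypothetical full topic distribution for one document (sums to 1.0).
full_distribution = [0.825, 0.170, 0.003, 0.001, 0.001]

# Assumed cut-off, mirroring Gensim's default minimum_probability of 0.01.
threshold = 0.01

# Keep only (topic_num, score) pairs at or above the cut-off.
kept = [(n, p) for n, p in enumerate(full_distribution) if p >= threshold]

# Convert to percentages, as in the loop above.
percentages = [(n, p * 100) for n, p in kept]
total = sum(score for _, score in percentages)

print(percentages)
print(total)  # a little under 100, because three small topics were dropped
```

If every topic should be listed for every document, the threshold can be lowered when the model is created or queried.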

Add date to each topic/score list for each document and append to new list.

In [42]:
date_docs = []
for date, doc in zip(date_list, all_scores):
    date_doc = [[date] + item for item in doc]
    date_docs.append(date_doc)

Flatten the above list so that it consists of non-nested lists of timestamp, topic, score.

In [43]:
flat_list = [item for sublist in date_docs for item in sublist]
flat_list[:2]

[[Timestamp('1820-04-24 00:00:00'),
  'cambridge, london, paper, museum, copy',
  17.04263538122177],
 [Timestamp('1820-04-24 00:00:00'),
  'specimen, plant, specie, fossil, speciman',
  82.53111839294434]]

Convert the flat list into a dataframe, sort it by date, and replace the score column with the average score for each topic in each year.

In [44]:
date_df = pd.DataFrame.from_records(flat_list, columns=["date", "topic", "score"])
date_df = date_df.sort_values(by=["date"]).reset_index(drop=True)
date_df = date_df.groupby([date_df.date.dt.year, date_df.topic])["score"].mean()
date_df = date_df.rename_axis(["date", "topic"]).reset_index(name="score")
date_df

Unnamed: 0,date,topic,score
0,1818,"cambridge, london, paper, museum, copy",63.996774
1,1818,"specimen, plant, specie, fossil, speciman",35.317463
2,1819,"cambridge, london, paper, museum, copy",69.722456
3,1819,"specimen, plant, specie, fossil, speciman",29.769781
4,1820,"cambridge, london, paper, museum, copy",44.309765
...,...,...,...
190,1860,"plant, nodule, specimen, vulgaris, palustris",22.142188
191,1860,"specimen, plant, specie, fossil, speciman",12.353422
192,1861,"cambridge, london, paper, museum, copy",69.550557
193,1861,"candidate, water, bone, pit, tooth",18.646803
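
One thing to keep in mind when reading these averages: a topic that was left out of a document has no row for that document, so it is skipped rather than counted as zero. The sketch below reproduces the group-and-average step in plain Python on made-up rows to show the effect; it is an illustration of what the pandas groupby computes, not a replacement for it, and the `rows` data is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, topic, score) rows in the same shape as flat_list.
rows = [
    (1820, "topic A", 80.0),
    (1820, "topic B", 20.0),
    (1820, "topic A", 100.0),  # second 1820 letter: no row for "topic B"
    (1821, "topic B", 60.0),
]

# Collect the scores for each (year, topic) pair.
grouped = defaultdict(list)
for year, topic, score in rows:
    grouped[(year, topic)].append(score)

# Average each group, like groupby([...])["score"].mean().
yearly_mean = {key: mean(scores) for key, scores in sorted(grouped.items())}

for key, value in yearly_mean.items():
    print(key, value)
```

Here "topic B" averages 20.0 for 1820, not 10.0, because only the one letter that lists it contributes to the mean.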


Use this dataframe to create a Plotly line chart of the average document topic percentage score by year.

In [45]:
fig = px.line(
    data_frame=date_df,
    x="date",
    y="score",
    color="topic",
    title="LDA Average Document Topic Percentages",
    width=1200,
    labels={
        "score": "Average Document Percentage",
        "date": "Year",
    },
)

fig.update_layout(legend=dict(font=dict(size=10)))
fig.update_layout(legend_title_text="Top 5 Words in Cluster")
fig.show()