This notebook is relatively short because the visualisations are produced in Palladio, an online tool used here for social network analysis, rather than in the notebook itself.

Import the libraries needed to prepare the data for social network analysis.

In [1]:
import pickle
import matplotlib.pyplot as plt
import pandas as pd

from natsort import natsorted
from datetime import datetime

Load the doc_topics dataframe created in notebook one.

In [2]:
doc_topics_df = pd.read_pickle("pickle/documents_topics.pkl")

In [3]:
doc_topics_df

Unnamed: 0,letter,date,sender,recipient,text,topic
21,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",pentacrinitis caryophyllea,0_plant_botany_flora_herbarium
23,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",madriporite tubipore entrochi carypohyllea scoria,0_plant_botany_flora_herbarium
40,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",addenda plant wynch,0_plant_botany_flora_herbarium
43,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",alisma ranunculoide officinalis arenaria verna...,0_plant_botany_flora_herbarium
46,letters_4.xml,1822-12-16,"Henslow, J. S.","Jenyns, Leonard",arenarius draba dryas octopetala epilobium als...,0_plant_botany_flora_herbarium
...,...,...,...,...,...,...
8839,letters_1221.xml,1834-08-20,"Henslow, J. S.","Phillips, John",request edinburgh proposes council,14_edinburgh_glasgow_dublin_scotland
8849,letters_1222.xml,1834-02-10,"Henslow, J. S.","Phillips, John",edinburgh generality answer query,14_edinburgh_glasgow_dublin_scotland
8852,letters_1222.xml,1834-02-10,"Henslow, J. S.","Phillips, John",forego week edinburgh sd sportsman,14_edinburgh_glasgow_dublin_scotland
8865,letters_1223.xml,1836-08-08,"Henslow, J. S.","Phillips, John",trust stop dublin,14_edinburgh_glasgow_dublin_scotland


Create the network dataframe by converting the data into a format that Palladio accepts. Convert the date column to strings and add an "end date" column with the same content as the date column, because Palladio needs an end date for timespan visualisations.

Extract the topic names for use below.

In [4]:
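# Keep only the columns needed for the correspondence network.
network_df = doc_topics_df[["sender", "recipient", "date", "topic"]].copy()
# Format dates as YYYY-MM-DD strings for Palladio.
network_df["date"] = network_df["date"].dt.strftime("%Y-%m-%d")
# Each letter covers a single day, so the end date equals the date.
network_df["end date"] = network_df["date"]
# Unique topic names in natural order, so "2_..." sorts before "10_...".
topic_names = natsorted(network_df["topic"].unique())
topic_names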

['0_plant_botany_flora_herbarium',
 '1_ipswich_london_suffolk_feb',
 '2_committee_catalogue_candidate_name',
 '3_johns_graham_henry_coll',
 '4_lime_glass_acid_clay',
 '5_insect_entomology_wasp_larvae',
 '6_obd_diam_obedt_magd',
 '7_pencil_foot_ink_paper',
 '8_comment_certainty_reason_doubt',
 '9_lecture_course_professor_professorship',
 '10_fossil_crag_fossils_pleistocene',
 '11_hooker_greville_dr_sinclair',
 '12_specimen_duplicate_subscription_descr...',
 '13_election_vote_voter_poll',
 '14_edinburgh_glasgow_dublin_scotland']
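
The list above is in numeric order because `natsorted` compares the leading topic number as a number rather than character by character. The following is a minimal sketch of that difference using only the standard library; the `leading_number` helper is hypothetical and stands in for `natsorted` only for names that start with an integer followed by an underscore.

```python
# Three topic names taken from the list above, deliberately out of order.
names = [
    "10_fossil_crag_fossils_pleistocene",
    "2_committee_catalogue_candidate_name",
    "0_plant_botany_flora_herbarium",
]

def leading_number(name):
    """Hypothetical helper: return the integer before the first underscore."""
    return int(name.split("_", 1)[0])

# Plain string sorting compares characters, so "10_..." lands before "2_...".
plain = sorted(names)

# Sorting on the leading integer gives the order shown in the output above.
natural = sorted(names, key=leading_number)

print(plain)
print(natural)
```

With plain sorting, topic 10 would appear between topics 0 and 2, which makes the exported files harder to scan.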

Loop over the list of topic names to create a dataframe for each topic, then export each one to a folder of CSV files. CSV is the format Palladio accepts.

These can then be loaded in Palladio and used to generate visualisations of the network of correspondents around each topic. The timespan feature in Palladio allows for narrowing the time range to a specific period, so that the network can be seen within that time range.

In [5]:
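for name in topic_names:
    # Select this topic's rows using network_df's own topic column.
    topic_df = network_df.loc[network_df["topic"] == name]
    # Keep the Palladio columns and sort chronologically without
    # modifying a slice in place (avoids SettingWithCopyWarning).
    topic_df = topic_df[["sender", "recipient", "date", "end date"]].sort_values("date")
    # The topic_outputs folder must already exist.
    topic_df.to_csv("topic_outputs/%s_topic_model.csv" % name, index=False)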
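
The time-range narrowing described above happens inside Palladio, but a similar filter can be approximated in pandas to check what a given period should contain. This is a rough sketch, not Palladio's own logic: `letters_between` is a hypothetical helper, and the sample rows are copied from the dataframe preview earlier in the notebook (they mix two topics purely for illustration).

```python
import pandas as pd

# Rows copied from the dataframe preview; a real check would instead load
# one of the exported CSV files with pd.read_csv.
sample = pd.DataFrame(
    {
        "sender": ["Sowerby, James", "Henslow, J. S.", "Henslow, J. S.", "Henslow, J. S."],
        "recipient": ["Henslow, J. S.", "Jenyns, Leonard", "Phillips, John", "Phillips, John"],
        "date": ["1820-04-24", "1822-12-16", "1834-02-10", "1836-08-08"],
    }
)
sample["end date"] = sample["date"]

def letters_between(df, start, end):
    """Hypothetical helper: rows dated within [start, end], both inclusive."""
    # YYYY-MM-DD strings sort chronologically, so string comparison suffices.
    mask = (df["date"] >= start) & (df["end date"] <= end)
    return df.loc[mask]

in_1830s = letters_between(sample, "1830-01-01", "1839-12-31")
print(in_1830s[["sender", "recipient", "date"]])
```

The comparison works on strings only because the dates were exported in year-month-day order; other date formats would need converting with `pd.to_datetime` first.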