This notebook is relatively short because the visualisations are produced in Palladio, an online tool used here for social network analysis.

Import the libraries needed to prepare the data for social network analysis.

In [1]:
import pickle
import matplotlib.pyplot as plt
import pandas as pd

from natsort import natsorted
from datetime import datetime

Load the doc_topics dataframe created in notebook one.

In [2]:
doc_topics_df = pd.read_pickle("pickle/documents_topics.pkl")

In [3]:
doc_topics_df

Unnamed: 0,letter,date,sender,recipient,text,topic
8,letters_1.xml,1820-04-24,"Sowerby, James","Henslow, J. S.",martin,0_trin_henry_wilson_john_phillip
27,letters_2.xml,1821-11-15,"Clarke, E. D.","Henslow, J. S.",clarke,0_trin_henry_wilson_john_phillip
55,letters_6.xml,1822-07-02,"Curtis, John","Henslow, J. S.",bearer dr radius leipsic england man science s...,0_trin_henry_wilson_john_phillip
65,letters_8.xml,1823-04-02,"Henslow, J. S.","Jenyns, Leonard",suspect pera,0_trin_henry_wilson_john_phillip
108,letters_13.xml,1823-10-21,"Henslow, J. S.","Winch, N. J.",hodgson,0_trin_henry_wilson_john_phillip
...,...,...,...,...,...,...
6815,letters_947.xml,1826-07-02,"Palmerston, Lord","Henslow, J. S.",oclock voter,14_election_vote_voter_poll
6847,letters_956.xml,1826-06-19,"Loyd, S.","Henslow, J. S.",vote absence result opinion university,14_election_vote_voter_poll
8392,letters_1175.xml,1845-12-30,"Bright, John","Henslow, J. S.",west election probability walkover morpeth,14_election_vote_voter_poll
8663,letters_1205.xml,1826-02-14,"Henslow, J. S.","Dale, J. C.",cambridge feby dale election,14_election_vote_voter_poll


Create a network dataframe in a format that Palladio accepts. Convert the date column to strings and add an "end date" column with the same content as the date column, which is necessary for timespan visualisations.

Extract the topic names for use below.

In [4]:
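# Keep only the columns needed for the network; copy so doc_topics_df is left unchanged.
network_df = doc_topics_df[["sender", "recipient", "date", "topic"]].copy()
# Format dates as YYYY-MM-DD strings (this requires a datetime column).
network_df["date"] = network_df["date"].dt.strftime("%Y-%m-%d")
# Each letter has a single date, so the end date is the same as the date.
network_df["end date"] = network_df["date"]
# Unique topic names in natural order, so that "10_..." sorts after "9_...".
topic_names = natsorted(network_df["topic"].unique())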
topic_names

['0_trin_henry_wilson_john_phillip',
 '1_botany_palustris_flora_arvensis',
 '2_suffolk_hitcham_bildeston_ipswich_lond...',
 '3_lecture_examination_examiner_course',
 '4_letter_circumstance_opinion_notice',
 '5_money_salary_labour_payment',
 '6_season_summer_autumn_feb',
 '7_insect_entomology_wasp_nest',
 '8_fossil_clay_limestone_nodule',
 '9_palmerston_horse_sincerely_palmerston_...',
 '10_brother_sister_brother_law_nephew',
 '11_hooker_dr_lady_playfair',
 '12_book_sort_order_bookseller',
 '13_plate_daguerreotype_reference_stenoty...',
 '14_election_vote_voter_poll']
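
The natural sort matters because every topic name starts with a number. As a minimal sketch using only the standard library, the example below shows how a plain `sorted` misorders three of the names above, while sorting on the numeric prefix gives the order seen in the output. `numeric_prefix` is a hypothetical helper written for this illustration; it is not part of `natsort`, which handles the general case.

```python
# Three topic names copied from the list above.
names = [
    "10_brother_sister_brother_law_nephew",
    "9_palmerston_horse_sincerely_palmerston_...",
    "1_botany_palustris_flora_arvensis",
]


def numeric_prefix(name):
    """Hypothetical helper: return the integer before the first underscore."""
    return int(name.split("_", 1)[0])


# Plain string sorting compares character by character, so "10_" comes before "1_b".
plain = sorted(names)

# Sorting on the numeric prefix restores the intended 1, 9, 10 order.
by_number = sorted(names, key=numeric_prefix)

print(plain)
print(by_number)
```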

Use the topic names list in a for loop to create a dataframe for each topic, then export each topic dataframe to a folder of CSV files. CSV is the format that Palladio accepts.

These files can then be loaded into Palladio and used to generate visualisations of the network of correspondents around each topic. The timespan feature in Palladio narrows the view to a specific period, so that the network can be seen within that time range.

In [5]:
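for name in topic_names:
    # Select this topic's rows from network_df itself and keep only the columns Palladio needs.
    topic_df = network_df.loc[network_df["topic"] == name, ["sender", "recipient", "date", "end date"]]
    # Sort chronologically; YYYY-MM-DD strings sort correctly as text.
    # Reassigning (rather than inplace=True) avoids modifying a slice of network_df.
    topic_df = topic_df.sort_values("date")
    # Assumes the topic_outputs/ folder already exists.
    topic_df.to_csv("topic_outputs/%s_topic_model.csv" % name, index=False)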
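
The time-range narrowing described above can also be previewed in pandas before loading anything into Palladio. The sketch below is illustrative only: it rebuilds the four topic 14 rows shown in the dataframe preview earlier, and `letters_in_range` is a hypothetical helper, not something Palladio or pandas provides. It relies on the dates being YYYY-MM-DD strings, which compare correctly as text.

```python
import pandas as pd

# Rows copied from the topic 14 preview shown earlier in this notebook.
sample = pd.DataFrame({
    "sender": ["Palmerston, Lord", "Loyd, S.", "Bright, John", "Henslow, J. S."],
    "recipient": ["Henslow, J. S.", "Henslow, J. S.", "Henslow, J. S.", "Dale, J. C."],
    "date": ["1826-07-02", "1826-06-19", "1845-12-30", "1826-02-14"],
})
sample["end date"] = sample["date"]


def letters_in_range(df, start, end):
    """Hypothetical helper: keep rows whose date string lies in [start, end]."""
    mask = df["date"].between(start, end)  # inclusive on both ends by default
    return df.loc[mask].sort_values("date")


# Letters on this topic written during 1826.
subset = letters_in_range(sample, "1826-01-01", "1826-12-31")

# Count letters per sender/recipient pair, i.e. the edges of the network.
edge_counts = (
    subset.groupby(["sender", "recipient"]).size().reset_index(name="letters")
)
print(edge_counts)
```

The same helper could be applied to any `topic_df` before export to check how many letters fall in a period of interest.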