# Data preprocessing

Using the data gathered from [Founders Online](https://www.founders.archives.gov/), this notebook preprocesses the data for use in the Letters of the Revolution Project (LotRP). The preprocessing consists of removing irrelevant documents, removing non-letters, merging split nodes (name disambiguation), and preparing a directional nodelist and edgelist for use in social network analysis tools.

In [1]:
#importing relevant libraries
import pandas as pd
import numpy as np
import networkx as nx
import os

In [2]:
#loading the data and keeping only documents from the Revolutionary War period
os.path.normpath(os.getcwd() + os.sep + os.pardir)
df = pd.read_csv("../data/archive_data/founders_online_archive_raw.csv")
df = df.loc[df['period'] == 'Revolutionary War']

In [3]:
#df.info()
#df.sample()

## Filtering, dropping duplicates and name disambiguation

Here we will start manipulating the data to make it fit our needs. First of all, all duplicate texts and non-correspondence will be removed. The project is exclusively interested in correspondence, and a unique letter can appear more than once in the Founders dataset because of the way the dataset is structured. We therefore drop every entry without a recipient (a document needs at least one recipient to count as correspondence) and every entry whose text is an exact duplicate of another.
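As a small illustration of these two filters, the sketch below applies them to a toy table. The rows are invented for the example and only the column names `authors`, `recipients` and `text` are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the Founders Online table; the rows are invented for illustration.
toy = pd.DataFrame({
    "authors": ["Adams, John", "Adams, John", "Washington, George", "Franklin, Benjamin"],
    "recipients": ["Adams, Abigail", "Adams, Abigail", None, "Jay, John"],
    "text": ["Dear friend ...", "Dear friend ...", "General orders ...", "Sir ..."],
})

# Rows without a recipient are not correspondence; rows with identical text are duplicates.
letters = toy.dropna(subset=["recipients"]).drop_duplicates(subset=["text"])
print(letters[["authors", "recipients"]])
```

Note that `drop_duplicates` keeps the first occurrence of each text, so the order of the rows decides which copy survives.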

The rest of this section deals with name disambiguation, *i.e.* attributing names to specific people. Due to spelling mistakes and other problems with the encoding of the metadata in the Founders dataset, the names of certain individuals have been split up into multiple variants. By replacing (parts of) names I make the naming of these individuals consistent again across the database.
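One possible way to organise such replacements is a dictionary of variant spellings, sketched below on toy data. The dictionary `canonical` and the toy rows are assumptions made for this example; the two name pairs are taken from the replacements in this notebook. Because `replace(..., regex=True)` reads its first argument as a regular expression, the sketch passes each variant through `re.escape` so that characters such as parentheses and full stops are matched literally.

```python
import re
import pandas as pd

# Hypothetical mapping from variant spellings to one canonical name.
canonical = {
    "Hamilton, Alexander (Lieutenant Colonel)": "Hamilton, Alexander",
    "Greene, Nathaniel": "Greene, Nathanael",
}

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "authors": ["Hamilton, Alexander (Lieutenant Colonel)", "Greene, Nathaniel"],
    "recipients": ["Washington, George", "Hamilton, Alexander (Lieutenant Colonel)"],
})

# re.escape makes ( ) . literal, so each variant is matched exactly as written.
for variant, name in canonical.items():
    toy = toy.replace(re.escape(variant), name, regex=True)

print(toy)
```

Without the escaping, the parentheses would be read as a regex group and the variant with the rank in brackets would not be matched.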

The results of this section will be saved into a new csv file, which will be used for the data exploration in the next Jupyter Notebook.

In [4]:
#dropping any content without recipients and removing duplicate texts.
df = df.dropna(subset=['recipients']).drop_duplicates(subset=['text'])

In [5]:
#correcting some common inconsistencies.
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(', & Fins',' & Fins', regex=True)
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(', & Cie.',' & Cie.', regex=True)
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(', & ',' | ', regex=True)
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(', Jr.',' Jr.', regex=True)
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(', Sr.',' Sr.', regex=True)
df[['authors', 'recipients']] = df[['authors', 'recipients']].replace(' \(business\)','', regex=True)

In [6]:
#correcting the inconsistently encoded names.
df[['authors','recipients']] = df[['authors','recipients']].replace('Adams, Abigail Smith','Adams, Abigail', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Hamilton, Alexander \(Lieutenant Colonel\)','Hamilton, Alexander', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Steuben, Major General','Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, Baron [von]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, Baron von','Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, Baron [von]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, baron von','Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, Baron [von]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Steuben, Baron von','Steuben, Friedrich Wilhelm Ludolf Gerhard Augustin, Baron [von]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Vergennes, Charles Gravier, comte de','Vergennes, Charles Gravier, Comte de', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Dumas, Charles-Guillaume-Frédéric','Dumas, Charles William Frederic', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Dumas, C. W. F.','Dumas, Charles William Frederic', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Boudinot, Elias Jr.','Boudinot, Elias', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Bowdinot, Elias','Boudinot, Elias', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Schuyler, Philip John','Schuyler, Philip', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Lee, Henry Jr.','Lee, Henry', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Nelson, Thomas Jr.','Nelson, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Muhlenberg, John Peter Gabriel','Muhlenberg, Peter', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Harrison, Benjamin Sr.','Harrison, Benjamin', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Williams, Jonathan Jr.','Williams, Jonathan', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Wharton, Thomas Jr.','Wharton, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Virginia Delegates in Congress','Virginia Delegates', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Virginia Delegates, American Continental Congress','Virginia Delegates', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Thaxter, John Jr.','Thaxter, John', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Schweighauser, John Daniel','Schweighauser, Jean-Daniel', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Rendon, Francisco','Rendón, Francisco', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Neufville, Jean de','Neufville, Jean [de]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Neufville, Leendert de','Neufville, Leonard [de]', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('McDougall, Maj. Gen. Alexander','McDougall, Alexander', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Malcom, Colonel William','Malcom, William', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Létombe, Philippe-André-Joseph de','Létombe, Philippe André Joseph de', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Johnson, Thomas Jr.','Johnson, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Hamilton, Alexander \(Lieutenant Colonel\)','Hamilton, Alexander', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Greene, Nathaniel','Greene, Nathanael', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Gérard, Conrad-Alexandre','Gérard, Conrad Alexandre', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Genet, Edme-Jacques','Genet, Edmé Jacques', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Dana, Francis M.','Dana, Francis', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Cooke, Nicholas Sr.','Cooke, Nicholas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Chaumont, Jacques Donatien, Leray de','Chaumont, Jacques-Donatien Le Ray de', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Church, William Singleton','Digges, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Fitzpatrick, William','Digges, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Dundas, T.','Digges, Thomas', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Ross, Timothy D.','Digges, Thomas', regex=True) #Thomas Digges used pseudonyms
df[['authors','recipients']] = df[['authors','recipients']].replace('Beaumarchais, Pierre-Augustin Caron de','Beaumarchais, Pierre Augustin Caron de', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Rochambeau, Comte de','Rochambeau, Jean-Baptiste Donatien de Vimeur, comte de', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Rochambeau, Jean-Baptiste-Donatien de Vimeur, comte de','Rochambeau, Jean-Baptiste Donatien de Vimeur, comte de', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Barbé de Marbois, François','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Barbé-Marbois, François de','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Barbé-Marbois, François','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Barbé-Marbois, Marquis de','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Barbé-Marbois, Pierre-François','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Marbois, François Barbé de','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Marbois, François Marquis de Barbé-Marbois','Barbé-Marbois (Barbé de Marbois), François', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('La Vauguyon, Paul-François de Quélen de Stuer de Caussade, duc de','La Vauguyon, Paul François de Quélen de Stuer de Causade, Duc de', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Dubuysson des Aix, Charles-François, chevalier','Du Buysson des Aix, Charles-François, vicomte', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, —— \(f. 1779–80\)','Destouches, Charles-René-Dominique Sochet', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, —— \(f. 1780–1782\)','Destouches, Charles-René-Dominique Sochet', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, ——','Destouches, Charles-René-Dominique Sochet', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, Chevalier','Destouches, Charles-René-Dominique Sochet', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, Charles-René-Dominique Sochet \(f. 1779–80\)','Destouches, Charles-René-Dominique Sochet', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Destouches, Charles-René-Dominique Sochet \(f. 1780–1782\)','Destouches, Charles-René-Dominique Sochet', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Chaumont, Jacques-Donatien Le Ray de', 'Chaumont, Jacques Donatien, Leray de', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Sartine, Antoine-Raymond-Gualbert-Gabriel de', 'Sartine, Antoine Raymond Jean Gualbert Gabriel de', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Armand, Charles', 'Armand (Armand-Charles Tuffin, marquis de La Rouërie)', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Armand Tuffin, Charles, marquis de La Rouërie', 'Armand (Armand-Charles Tuffin, marquis de La Rouërie)', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Estaing, Charles-Hector Theodat, comte d’', 'Estaing, Charles-Hector Théodat, comte d’', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Estaing, Charles-Hector, comte d’', 'Estaing, Charles-Hector Théodat, comte d’', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Estaing, Charles-Henri, comte d’', 'Estaing, Charles-Hector Théodat, comte d’', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Caracciolo, Domenico, Marchese di Villamaina', 'Caracciolo, Domenico, Marchesse di Villa Marina', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Arendt, Henry Leonard Philip, baron d’', 'Arendt, Henry Leonard Philip', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Trumbull, Jonathan Sr.', 'Trumbull, Jonathan', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Berubé de Costentin, ——', 'Costentin, Berubé de', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Rocquette, Jacques, Elsevier, T. A.', 'Rocqùette, J.,Th. A. Elsevier, & P. Th. Rocqùette,', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Dubbeldemuts, Adrianus', 'Dubbeldemuts, F. & A.', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Dubbeldemuts, Franco', 'Dubbeldemuts, F. & A.', regex=True) 
df[['authors','recipients']] = df[['authors','recipients']].replace('Parsons, Samuel Holden', 'Parsons, Samuel H.', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Stirling, Lord \(née William Alexander\)','Stirling, Lord (né William Alexander)', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Alexander, William Lord Stirling','Stirling, Lord (né William Alexander)', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Stirling, Major General','Stirling, Lord (né William Alexander)', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Alexander, William','Stirling, Lord (né William Alexander)', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('La Lande & Fynje, de','La Lande & Fynje', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Horneca, Fizeaux & Cie.','Horneca, Fizeaux & Co.', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Staphorst, Nicholas & Jacob van','Staphorst, Nicolaas & Jacob van', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('Ambler, Jaquelin \(Jacquelin\)','Ambler, Jacquelin', regex=True)
df[['authors','recipients']] = df[['authors','recipients']].replace('La Luzerne, Anne César, Chevalier de','La Luzerne, Anne-César, chevalier de', regex=True)

In [7]:
#saving the results to a csv file
df.to_csv("../data/archive_data/founders_online_data_exploration_data.csv")

## Converting to network tables

In this section I will convert the data from Founders Online into a form usable for our social network analysis. This starts with unnesting the authors and recipients, followed by some further filtering. The unnesting is necessary because letters can (and do!) have multiple authors and/or recipients. Unnesting means that every combination of author and recipient gets its own entry in the dataset. Additionally, some filtering is done to remove unknown, anonymous and otherwise nonexistent authors. Entries with certain other authors or recipients are also removed for various reasons.
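The unnesting step can be illustrated on a toy table. The rows below are invented for the example; the `|` separator and the column names follow this notebook.

```python
import pandas as pd

# Toy letters, invented for illustration: multiple names in one cell are separated by '|'.
toy = pd.DataFrame({
    "id": [1, 2],
    "authors": ["Adams, John | Franklin, Benjamin", "Washington, George"],
    "recipients": ["Jay, John", "Greene, Nathanael | Knox, Henry"],
})

# Split every cell into a list of names, then give each author-recipient pair its own row.
pairs = toy.set_index("id").apply(lambda col: col.str.split("|"))
pairs = pairs.explode("authors").explode("recipients").reset_index()

# Remove the spaces left around the separator.
pairs["authors"] = pairs["authors"].str.strip()
pairs["recipients"] = pairs["recipients"].str.strip()
print(pairs)
```

Each of the two toy letters becomes two rows: letter 1 has two authors and one recipient, letter 2 has one author and two recipients.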

In the second part, functions are defined to build an edgelist for a directed graph. The unnested dataframe is fed into a networkx graph and then exported back to a dataframe. During this process each edge is assigned a weight equal to the number of letters sent by an author to that recipient. A nodelist is then generated from the new edgelist, and both lists are saved to their respective data directory.
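A minimal sketch of the weighting idea on toy data is given below. It counts the rows per author-recipient pair with `groupby(...).size()`, which is a compact variation for illustration and not the exact code of this notebook; the toy rows are invented.

```python
import networkx as nx
import pandas as pd

# Toy unnested table, invented for illustration: one row per author-recipient pair per letter.
pairs = pd.DataFrame({
    "authors": ["Adams, John", "Adams, John", "Adams, Abigail"],
    "recipients": ["Adams, Abigail", "Adams, Abigail", "Adams, John"],
})

# One row per directed pair; Weight is the number of letters sent in that direction.
edges = (
    pairs.groupby(["authors", "recipients"]).size()
    .reset_index(name="Weight")
    .rename(columns={"authors": "Source", "recipients": "Target"})
)

# A DiGraph keeps A -> B and B -> A as separate edges.
G = nx.from_pandas_edgelist(edges, source="Source", target="Target",
                            edge_attr="Weight", create_using=nx.DiGraph())

# The nodelist is every name that appears as a source or a target.
nodes = pd.DataFrame({"Node": sorted(G.nodes())})
print(edges)
print(nodes)
```

Because the graph is directed, the two letters from John to Abigail and the single reply are stored as two edges with different weights.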

The resulting edgelist and nodelist are mainly meant to be used in the network analysis tool Gephi. However, they should also be compatible with a tool such as Palladio. Other tools have not been tested.

In [8]:
#defining a function to remove nonexistent authors and unwanted authors and recipients.
def remove_author_recipient(df, names):
  for name in names:
    df = df[df['authors'].str.contains(name)==False]
    df = df[df['recipients'].str.contains(name)==False]
  df = df.dropna(subset=['authors'])
  return df

names = ['Unknown','UNKNOWN','Anonymous','First Joint Commission at Paris', 
         'American Commissioners','Son','Son ()','Sons ()','Cie.',' Cie.', 'fils'
         , 'et al.','Zoon ()', 'American Peace Commissioners','William Bradford'
         ,'Smith, William', 'Smith, John','Thornton, John','Rocquette, Pieter Th.']

In [9]:
#unnesting the authors and recipients and applying the previously defined function
df = df[['authors','recipients','id']].set_index(['id']).apply(lambda x: x.str.split('|')).explode('authors').explode('recipients').reset_index() 
df['authors'] = df['authors'].str.strip()
df['recipients'] = df['recipients'].str.strip()                                                                                                  
df = remove_author_recipient(df,names).reset_index(drop=True)

In [10]:
#defining the edgelist generator and generating a network graph
def edgelist_gen(df):
  #copy the slice so that adding a column does not trigger a SettingWithCopyWarning
  edgelist = df[['authors','recipients']].copy()
  #Weight is the number of rows (letters) for each author-recipient pair
  edgelist['Weight'] = edgelist.groupby(['authors', 'recipients'])['authors'].transform('size')
  edgelist = edgelist.rename(columns={"authors":"Source", "recipients":"Target"})
  return edgelist

def nxgraphing(edgelist):
  G = nx.from_pandas_edgelist(edgelist, source='Source',target='Target', edge_attr='Weight', create_using=nx.DiGraph())
  #remove = [node for node, degree in G.degree() if degree < 2] #here we specify that we only want nodes with a degree of at least 2 in our graph
  #G.remove_nodes_from(remove)
  return G

In [11]:
#applying the functions and describing the graph
G = nxgraphing(edgelist_gen(df))
#print(G) #nx.info(G) was removed in networkx 3.0; printing the graph gives a similar summary

In [12]:
#saving the edgelist and nodelist
df_edge = nx.to_pandas_edgelist(G, source='Source',target='Target')
df_edge.to_csv("../data/network_data/gephi_input/lotrp_edges.csv", sep=';', index=False)
print('edgelist saved')

df_nodes=pd.DataFrame(np.unique(df_edge[['Source', 'Target']].astype(str).values))
df_nodes.to_csv("../data/network_data/gephi_input/lotrp_nodes.csv", sep=';', header=['Node'], index_label=['Id'])
print('nodelist saved')

edgelist saved
nodelist saved
