# Networks in JQA Papers

Inspired by the _Six Degrees of Francis Bacon_ Project ([DHQ article](http://digitalhumanities.org/dhq/vol/10/3/000244/000244.html) & [site](http://www.sixdegreesoffrancisbacon.com/?ids=10000473&min_confidence=60&type=network)), this notebook explores statistically inferred networks in John Quincy Adams's papers.

The encoding of the JQA papers by the [Massachusetts Historical Society](https://www.masshist.org/) captures many historical people mentioned in JQA's diaries.

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg
from sklearn import preprocessing
from sklearn.covariance import graphical_lasso, ledoit_wolf, shrunk_covariance, empirical_covariance
# GraphicalLasso, GraphicalLassoCV,
    
# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/SemanticData/"

## Import and Clean Data

With network graphs, how should I understand the distribution of the data? Mentions are expected to be unbalanced, which reflects the document's priorities.

In [2]:
%%time

# Read in file; select columns; drop rows with NA values (entries without a named person).
df = pd.read_csv(abs_dir + 'Output/ParsedXML/JQA_dataframe.txt',
                 sep = '\t')[['entry', 'people']] \
    .dropna()

# Split string of people into individuals.
df['people'] = df['people'].str.split(r',|;')

# Explode list so that each list value becomes a row.
df = df.explode('people')

# Create entry-person matrix.
df = pd.crosstab(df['entry'], df['people'])

df.head()

CPU times: user 1.38 s, sys: 77.1 ms, total: 1.46 s
Wall time: 1.48 s


entry,Ishbosheth,Willis Alston,abbot-benjamin,abbot-joel,abbot-joel2,abbot-joseph,abbot-unknown,abbott-joel,abbott-joseph,abdon,...,young-unknown5,yriarte-unknown,yuan-ruan,zaeb-unknown,zaeb-unknown2,zea-francisco,zeabermudez-unknown,zekiel homespun,ziba,zozaya-jose
jqadiaries-v23-1821-05-07,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jqadiaries-v23-1821-05-08,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jqadiaries-v23-1821-05-09,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jqadiaries-v23-1821-05-10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
jqadiaries-v23-1821-05-12,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
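
The split, explode, and crosstab steps above can be hard to picture on a matrix this wide. The following is a minimal sketch on a made-up three-entry table; the entry ids, the names, and the variables `toy` and `toy_matrix` are hypothetical and not taken from the JQA data.

```python
import pandas as pd

# Hypothetical toy data; entry ids and names are invented for illustration.
toy = pd.DataFrame({
    "entry": ["e1", "e2", "e3"],
    "people": ["adams-john,clay-henry", "clay-henry;monroe-james", None],
})

# Drop entries without a named person, then split on commas or semicolons.
toy = toy.dropna().assign(people=lambda d: d["people"].str.split(r",|;"))

# One row per (entry, person) pair.
toy = toy.explode("people")

# Rows are entries, columns are people, cells count mentions.
toy_matrix = pd.crosstab(toy["entry"], toy["people"])
print(toy_matrix)
```

Entry `e3` disappears because it names no one, which mirrors the `dropna()` call in the cell above.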


## Throw Graphical Lasso at Dataframe

Explanation
1. Purpose of Graphical Lasso
2. Why it is useful for building an adjacency matrix, especially for co-occurrence networks.


Understanding sparse inverse covariance estimation ([W3cub](https://docs.w3cub.com/scikit_learn/auto_examples/covariance/plot_sparse_cov/)).
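
For reference, the objective that graphical lasso minimizes can be written as below. The symbols are my own labels rather than anything defined in this notebook: $S$ is the covariance matrix passed in, $\Theta$ is the estimated precision (inverse covariance) matrix, and $\alpha$ is the penalty weight, which scikit-learn applies to the off-diagonal entries.

```latex
\hat{\Theta} = \operatorname*{arg\,min}_{\Theta \succ 0}
  \left( \operatorname{tr}(S\Theta) - \log\det\Theta
         + \alpha \sum_{i \neq j} \lvert \Theta_{ij} \rvert \right)
```

A zero in $\hat{\Theta}_{ij}$ is read as "no direct link between $i$ and $j$ once everyone else is accounted for" (under a Gaussian assumption), and a larger $\alpha$ produces more zeros. A very small $\alpha$ therefore induces little sparsity.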


From ____:

    "The Lasso regression model is a type of penalized regression model, which 'shrinks' the size of the regression coefficients by a given factor (called a lambda parameter in the statistical world and an alpha parameter in the machine learning world). The goal of shrinking the size of the regression coefficients is to prevent over-fitting the model to the training data. By shrinking the size of the regression coefficients, we get a model that more poorly predicts our outcome (e.g. has increased bias), but which we hope will be more stable when applied to unseen data (e.g. has decreased variance). One of the nice things about the Lasso is that, given the way the penalties are applied to the regression coefficients, the size of certain regression coefficients can be shrunk all the way to zero, which effectively results in model-based feature selection. For problems that benefit from insight into the relationships between predictors and outcomes, Lasso regression is very handy because it identifies the important predictive variables *and* provides an estimation of the size and direction of the partial bi-variate relationships between the predictors and the outcome."

From Xang:
    
    "In other words, the goal of graphical lasso is to induce from your data an undirected graph with sparse connections. This fact will come handy later when we try to illustrate the ETF graph and identify possible clusters."

Xang, Jason X. "[Machine Learning in Action in Finance: Using Graphical Lasso to Identify Trading Pairs in International Stock ETFs](https://towardsdatascience.com/machine-learning-in-action-in-finance-using-graphical-lasso-to-identify-trading-pairs-in-fa00d29c71a7)," _Towards Data Science_, accessed 09/28/2020.

____. "[Porting Ideas to Math: A Step-by-Step Derivation of Graphical Lasso](https://towardsdatascience.com/porting-ideas-to-math-a-step-by-step-derivation-of-graphical-lasso-2e01f7165d95)," _Towards Data Science_, accessed 09/28/2020.

#### Test Example
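
A minimal sketch of what graphical lasso does, on synthetic data rather than the JQA matrix. The chain structure, the sample size, the `alpha` value, and every variable name here are assumptions chosen for illustration. Three variables are built so that `a` influences `b` and `b` influences `c`; `a` and `c` are correlated, but only through `b`.

```python
import numpy as np
from sklearn.covariance import empirical_covariance, graphical_lasso

# Synthetic chain a -> b -> c (illustrative assumption, not JQA data).
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
X_toy = np.column_stack([a, b, c])

# Standardize each column to zero mean and unit variance.
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Empirical covariance, then graphical lasso; returns (covariance, precision).
emp_cov_toy = empirical_covariance(X_toy)
cov_toy, prec_toy = graphical_lasso(emp_cov_toy, alpha=0.05)

print("covariance a-c:", round(emp_cov_toy[0, 2], 3))
print("precision  a-b:", round(prec_toy[0, 1], 3))
print("precision  a-c:", round(prec_toy[0, 2], 3))
```

The covariance between `a` and `c` is clearly non-zero, while the corresponding precision entry should be at or near zero. The precision matrix is what separates direct links from links that pass through a third party, which is the property of interest for a co-occurrence network.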

#### JQA Model

In [3]:
%%time

# Standardize scale of matrix values.
standardScaler = preprocessing.StandardScaler()
X = standardScaler.fit_transform(df)

# Estimate Empirical Covariance.
# Set shrinkage closer to 1 for poorly-conditioned data.
emp_cov = empirical_covariance(X)
shrunk_cov = shrunk_covariance(emp_cov, shrinkage=0.6)

# Create model of adjacency matrix.
model = graphical_lasso(shrunk_cov, alpha = 1e-6)

CPU times: user 8h 9min 10s, sys: 1min 57s, total: 8h 11min 7s
Wall time: 2h 8min 6s
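
The effect of `shrinkage=0.6` can be checked on a small matrix. This is a sketch with a hypothetical 3x3 covariance of standardized variables. It relies on the documented formula for `shrunk_covariance`: `(1 - shrinkage) * emp_cov + shrinkage * mu * I`, where `mu` is the trace divided by the number of features.

```python
import numpy as np
from sklearn.covariance import shrunk_covariance

# Hypothetical covariance of standardized variables (unit diagonal).
# Variables 0 and 1 are perfectly correlated.
emp_cov_toy = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

shrinkage = 0.6
shrunk_toy = shrunk_covariance(emp_cov_toy, shrinkage=shrinkage)

# Same result computed by hand from the formula.
mu = np.trace(emp_cov_toy) / emp_cov_toy.shape[0]
manual = (1 - shrinkage) * emp_cov_toy + shrinkage * mu * np.eye(3)

print(shrunk_toy)
print(np.allclose(shrunk_toy, manual))
```

With a unit diagonal, the diagonal stays at 1 while every off-diagonal value is multiplied by `1 - shrinkage`. A perfect correlation of 1.0 therefore becomes 0.4, which may be why the strongest weights further down sit at exactly 0.4.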


#### Convert & Reshape Covariance Array into Dataframe.

In [13]:
%%time

'''
model[0] = covariance array of graphical lasso

model[1] = precision array of graphical lasso
'''

# Convert array to dataframe.
cov_df = pd.DataFrame(data = np.around(model[0], decimals=3),
                      index = df.columns,
                      columns = df.columns)

# Create new 'source' column that corresponds to index (person).
cov_df['source'] = cov_df.index

# Reshape dataframe to focus on source, target, and weight.
# Keep only weights between 0 and 1 (exclusive): this removes same-person pairs (weight = 1)
# and zero or negative values (weight <= 0).
# Rename 'people' column name to 'target'.
cov_df = pd.melt(cov_df, id_vars = ['source'], value_name = 'weight') \
    .query('(weight < 1.00) & (weight > 0)') \
    .rename(columns = {'people':'target'})


cov_df

CPU times: user 2.03 s, sys: 194 ms, total: 2.22 s
Wall time: 2.13 s


,source,target,weight
10,abner,Ishbosheth,0.400
23,adams-charles2,Ishbosheth,0.042
416,biddle-nicholas,Ishbosheth,0.072
437,black-alexander,Ishbosheth,0.400
568,brent-daniel,Ishbosheth,0.027
...,...,...,...
26398824,whipple-thomas,zozaya-jose,0.036
26398839,white-unknown3,zozaya-jose,0.115
26398874,wilcocks-unknown2,zozaya-jose,0.019
26398914,williamson-william,zozaya-jose,0.027
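
Because the covariance matrix is symmetric, melting the full matrix lists every pair twice, once as (a, b) and once as (b, a). If an undirected edge list is wanted, one possible alternative is to read only the upper triangle. The sketch below uses a hypothetical three-person matrix; `labels`, `weights`, and `edges` are illustrative names standing in for `df.columns`, `model[0]`, and `cov_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and symmetric weights.
labels = pd.Index(["abner", "adams-charles2", "biddle-nicholas"], name="people")
weights = np.array([
    [1.0, 0.4, -0.1],
    [0.4, 1.0, 0.072],
    [-0.1, 0.072, 1.0],
])

# Indices of the strict upper triangle (k=1 skips the diagonal), so each
# undirected pair appears once and self-pairs are excluded.
rows, cols = np.triu_indices(len(labels), k=1)

edges = pd.DataFrame({
    "source": labels[rows],
    "target": labels[cols],
    "weight": weights[rows, cols],
})

# Keep positive weights only, as in the cell above.
edges = edges.query("weight > 0").reset_index(drop=True)
print(edges)
```

This also avoids relying on `weight < 1.00` to drop the diagonal, since the diagonal is never read.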


## Convert Dataframe to Network Data

In [22]:
%%time

# Create list of unique entities from source and target columns.
nodes = pd.concat([cov_df['source'], cov_df['target']], ignore_index = True) \
    .drop_duplicates() \
    .to_frame(name = 'label')

# Create identifying codes for labels.
nodes = nodes \
    .assign(source = nodes['label'].astype('category').cat.codes) \
    .sort_values(['source'], ascending = True) # Sorting matches labels with source codes.

# Create dictionary to map values to codes.
nodes_dictionary = nodes.set_index('label')['source'].to_dict()

# Create links dataframe and map links to nodes' codes.
links = cov_df \
    .assign(source = cov_df['source'].map(nodes_dictionary),
            target = cov_df['target'].map(nodes_dictionary))

print (links.shape)
links.head()

(343366, 3)
CPU times: user 161 ms, sys: 56.4 ms, total: 218 ms
Wall time: 218 ms


,source,target,weight
10,10,0,0.4
23,23,0,0.042
416,416,0,0.072
437,437,0,0.4
568,568,0,0.027
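
The label-to-code mapping can be sanity-checked on a small edge list. This is a sketch with hypothetical names and weights; `toy_edges`, `toy_nodes`, `code_of`, and `toy_links` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical edge list with string labels.
toy_edges = pd.DataFrame({
    "source": ["clay-henry", "adams-john", "monroe-james"],
    "target": ["adams-john", "monroe-james", "clay-henry"],
    "weight": [0.4, 0.072, 0.115],
})

# Unique labels from both columns, sorted so codes follow alphabetical order.
all_labels = pd.concat([toy_edges["source"], toy_edges["target"]], ignore_index=True)
toy_nodes = pd.DataFrame({"label": sorted(all_labels.unique())})
toy_nodes["source"] = toy_nodes.index  # integer code per label

# Map labels to codes in the edge list.
code_of = toy_nodes.set_index("label")["source"].to_dict()
toy_links = toy_edges.assign(
    source=toy_edges["source"].map(code_of),
    target=toy_edges["target"].map(code_of),
)

# Any label missing from the dictionary would show up as NaN here.
print(toy_links[["source", "target"]].isna().sum())
print(toy_links)
```

Checking for `NaN` after `.map()` is a cheap way to confirm that every person in the links also exists in the nodes table.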


#### Make Adjustments/Filter

In [19]:
%%time

# Subset data.
# A correlation coefficient (weight) of 0.4 is considered 'moderate' by Dancey & Reidy (psychology)
# and 'strong' by Quinnipiac University (politics).
links = links.query('weight >= 0.4')

links

CPU times: user 3.78 ms, sys: 1.88 ms, total: 5.66 ms
Wall time: 4.71 ms


,source,target,weight
10,10,0,0.4
437,437,0,0.4
1013,1013,0,0.4
1985,1985,0,0.4
1986,1986,0,0.4
...,...,...,...
26390531,1763,5133,0.4
26390906,2137,5133,0.4
26391863,3093,5133,0.4
26392660,3890,5133,0.4
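
Before settling on a cutoff, it can help to see how many edges survive at several thresholds. The sketch below uses a hypothetical five-edge table; the weights and the candidate thresholds are illustrative assumptions.

```python
import pandas as pd

# Hypothetical links table with made-up weights.
toy_links = pd.DataFrame({
    "source": [10, 23, 416, 437, 568],
    "target": [0, 0, 0, 0, 0],
    "weight": [0.4, 0.042, 0.115, 0.4, 0.27],
})

# Count how many edges remain at each candidate threshold.
thresholds = [0.1, 0.2, 0.4]
counts = {t: int((toy_links["weight"] >= t).sum()) for t in thresholds}
print(counts)
```

Running the same count on the real `links` dataframe would show how sensitive the final graph is to the choice of 0.4.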


## Write Data to File

In [23]:
%%time

nodes.to_csv(abs_dir + "Output/Dataframes/Graphs/JQA_Network_correlation/graphLasso_nodes.csv",
             sep = ',', index = False)

links.to_csv(abs_dir + "Output/Dataframes/Graphs/JQA_Network_correlation/graphLasso_links.csv",
             sep = ',', index = False)

CPU times: user 554 ms, sys: 20.8 ms, total: 575 ms
Wall time: 581 ms
