# Introduction to the Historical Data Digital Toolkit (HDDT) #

In [7]:
# First we call up the python packages we need to perform the analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(20, 10))
from operator import itemgetter
import networkx as nx
from networkx.algorithms import community #This part of networkx, for community detection, needs to be imported separately.
import nbconvert
import csv

# <img src="xxxx.png">

## What is a dataset? ##

A 'Complete' dataset would be one like this, where all of the data can be contained within a perfect rectangular block of cells ('containers'), every container holds exactly one data item, and every data item can be located by the coordinates 'Row n, Column n'.

<img src="data 1.png">

When historical data is used, some data is often missing (permanently lost), so a Historical Data Digital Toolkit must be able to accept 'Incomplete' datasets. The HDDT does not lose functionality because of the incomplete nature of much historical data.

<img src="data 2.png">
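As a minimal sketch of what accepting an 'Incomplete' dataset can look like in pandas (one of the packages imported above), the example below stores lost values as missing entries and still runs an aggregation over the data that survives. The names, columns and values are hypothetical and are not taken from the HDDT data.

```python
import numpy as np
import pandas as pd

# Hypothetical records: some values are permanently lost and stored as NaN/None.
records = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C"],
        "Birth year": [1801, np.nan, 1815],
        "Occupation": ["medical", None, None],
    }
)

# Count the missing items per column instead of discarding incomplete rows.
missing_per_column = records.isna().sum()
print(missing_per_column)

# Aggregations skip missing values by default, so the analysis still works.
mean_birth_year = records["Birth year"].mean()
print(mean_birth_year)
```

Keeping the incomplete rows, rather than dropping them, means a person with a lost birth year can still appear in any analysis that only needs the name.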

The HDDT has also been designed to accept 'Irregular' datasets. The surviving evidence of the past is often not only Incomplete but also Irregular: datasets frequently have different dimensions, either because the data is intrinsically different or because different data collectors use different cataloguing methods.

<img src="data 3.png">

For the HDDT, a dataset is a set of data of any dimensions, complete or incomplete. The only requirement is that all datasets share a single common data item: every dataset must contain PERSON (Name) as one of its fields.
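The role of the shared PERSON (Name) item can be illustrated with a small pandas sketch: two datasets of different dimensions are combined through the one item they have in common. The dataset names, column names and values here are hypothetical, and an outer join is only one possible way to combine them; it is not necessarily how the HDDT does it.

```python
import pandas as pd

# Hypothetical datasets with different dimensions; only "Name" is shared.
memberships = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C"],
        "Society": ["Society X", "Society Y", "Society X"],
    }
)
locations = pd.DataFrame(
    {
        "Name": ["Person A", "Person C", "Person D"],
        "Location": ["London", "country", "London"],
        "Source": ["catalogue 1", "catalogue 1", "catalogue 2"],
    }
)

# An outer join keeps every person found in either dataset;
# items missing on one side are filled with NaN.
combined = memberships.merge(locations, on="Name", how="outer")
print(combined)
```

The result has one row per person and leaves gaps where a dataset has nothing to say about that person, which is the Incomplete and Irregular situation described above.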

Conflicts between the Person (Name) values of different datasets are resolved by adopting the 'main dataset' as the Authority Index. In this case the main dataset is the one kindly provided by RAI; Person (Name) values found in other datasets are carefully matched against it, so the RAI naming rule applies throughout.
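A minimal sketch of how an Authority Index might be applied is given below. It assumes the matching has already been done by hand and recorded as a lookup from variant spellings to the form used in the main dataset. The lookup table, the helper `to_authority_name` and all names are hypothetical, and the whitespace tidying is an illustrative assumption rather than an HDDT rule.

```python
# Hypothetical authority index: variant spelling -> form used in the main dataset.
authority_index = {
    "J. Smith": "J Smith",
    "Smith, J": "J Smith",
}


def to_authority_name(name, index):
    """Return the main-dataset form of a name, or the tidied name if no match is recorded."""
    tidy = " ".join(name.split())  # collapse repeated and stray whitespace
    return index.get(tidy, tidy)


other_dataset_names = ["J. Smith", "  A   Jones ", "Smith, J"]
resolved = [to_authority_name(n, authority_index) for n in other_dataset_names]
print(resolved)
```

Names without a recorded match pass through unchanged apart from the tidying, so they can be reviewed by hand rather than silently merged.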

# The Entity Relationship Diagram of the HDDT #

<img src="ERD.png">

# Data Sources #

# Data Types #

## Static bipartite or bigraph data ##

## Dynamic bigraph data ##

## Dynamic social network data ##

In [8]:
with open('vw_3_all_bipartite_names_1_2_202108181938.csv', 'r') as nodecsv: # Open the file
    nodereader = csv.reader(nodecsv) # Read the csv
    nodes = [n for n in nodereader][1:]  # Retrieve the data (using a Python list comprehension, then list slicing to remove the header row)
    
node_names = [n[0] for n in nodes] # Get a list of only the node names    

with open('vw_2_all_bipartite_memberships_xid_202108181937.csv', 'r') as edgecsv: # Open the file
    edgereader = csv.reader(edgecsv) # Read the csv
    edges = [tuple(e) for e in edgereader][1:]  # Retrieve the data

In [9]:
nodes

[[' Joseph Storrs'],
 ['A  Mackintosh Shaw'],
 ['A  de Fullner'],
 ['A , jun Ramsay'],
 ['A A Stewart'],
 ['A Ambrose'],
 ['A B Stark'],
 ['A B Wright'],
 ['A Bell'],
 ['A C Brebner'],
 ['A Crowley'],
 ['A Dale'],
 ['A Ellis'],
 ['A F Forsell'],
 ['A Fitzjames'],
 ['A Friend'],
 ['A G Cross'],
 ['A H Russell'],
 ['A Heaviside'],
 ['A Hodgson'],
 ['A Ioannides'],
 ['A J Larking'],
 ['A J Lorking'],
 ['A Janson'],
 ['A L  (pere) Gosse'],
 ['A L Wigan'],
 ['A Milne'],
 ['A P Balkwill'],
 ['A Reid'],
 ['A Roberts'],
 ['A Schumann'],
 ['A Thistlethwaite'],
 ['A W Parsons'],
 ['A Wells'],
 ['AI'],
 ['APS'],
 ['ASL'],
 ['Abell Smith'],
 ['Aberdeen Horticultural Society'],
 ['Abraham Crowley'],
 ['Abraham Fisher'],
 ['Abraham Logan'],
 ['Abraham Sewell'],
 ['Abram Rawlinson Barclay'],
 ['Academia Quirurgia of Madrid'],
 ['Academie Hongroise de Pest'],
 ['Academy of Anatolia'],
 ['Academy of Medicine and Surgery of Madrid and Barcelona'],
 ['Academy of Natural Sciences Philadelphia'],
 ['Academ

In [10]:
edges

[('Arthur William A Beckett', 'ASL'),
 ('Arthur William A Beckett', 'London'),
 ('Arthur William A Beckett', 'literary'),
 ('Andrew Mercer Adam', 'ASL'),
 ('Andrew Mercer Adam', 'armed services'),
 ('Andrew Mercer Adam', 'country'),
 ('Andrew Mercer Adam', 'medical'),
 ('H R Adam', 'AI'),
 ('H R Adam', 'ASL'),
 ('H R Adam', 'Africa'),
 ('William Adam', 'ESL'),
 ('William Adam', 'political'),
 ('Henry John Adams', 'ASL'),
 ('Henry John Adams', 'London'),
 ('William (1) Adams', 'Athenaeum Club'),
 ('William (1) Adams', 'ESL'),
 ('William (2) Adams', 'AI'),
 ('William (2) Adams', 'ESL'),
 ('William (2) Adams', 'London'),
 ('William (2) Adams', 'Medical Society of London'),
 ('William (2) Adams', 'Medical and Chirurgical Society of London'),
 ('William (2) Adams', 'Pathological Society of London'),
 ('William (2) Adams', 'Royal College of Surgeons'),
 ('William (2) Adams', 'medical'),
 ('William Adlam', 'ASL'),
 ('William Adlam', 'Somersetshire Archaeological and Natural History Society'),

In [11]:
print(len(node_names))
print(len(edges))

3560
9992


In [12]:
G = nx.Graph()
G.add_nodes_from(node_names)
G.add_edges_from(edges)
print(nx.info(G)) # Note: nx.info() was removed in NetworkX 3.0; on newer versions use print(G) instead

Name: 
Type: Graph
Number of nodes: 3560
Number of edges: 9992
Average degree:   5.6135
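The average degree reported above follows from the node and edge counts: each edge contributes to the degree of two nodes, so the average is 2 × edges / nodes. The sketch below checks that arithmetic for the reported figures and repeats the calculation on a small hypothetical edge list using only the standard library, so it does not depend on the NetworkX version.

```python
from collections import Counter

# Figures reported above for the HDDT graph.
n_nodes, n_edges = 3560, 9992
reported_average = round(2 * n_edges / n_nodes, 4)
print(reported_average)

# The same calculation on a small hypothetical edge list.
toy_edges = [
    ("Person A", "Society X"),
    ("Person B", "Society X"),
    ("Person A", "London"),
]

# Each edge adds one to the degree of both of its endpoints.
degree = Counter()
for source, target in toy_edges:
    degree[source] += 1
    degree[target] += 1

toy_average = sum(degree.values()) / len(degree)
print(toy_average)
```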


In [13]:
nx.write_gexf(G, 'jnb_hddt_intro.gexf')

# This is BIG data! #

<img src="intro.png">

# END #