# Introduction to the 'Historical Data Digital Toolkit' (HDDT) # 

# The centres for the emergence of anthropology in Britain 1830-1870, #
# and the 600 Quakers amongst 3000 institution builders. #

| Jump to: |
|---|
|[My PhD project](#1)|
|[HDDT Objective](#2)|
|[Datasets](#3)|
|[Data Pipeline](#4)|
|[SQLite Database](#5)|
|[Architecture](#6)|
|[Big Data](#7)|

<a id='1'></a>

# 1. My PhD project #

## Subject: ##

My own research, in collaboration with others, has revealed extensive social connectivity among the roughly 600 members of a ‘Quaker Led Network’ (QLN) and their involvement in a community of roughly 3000 people, spread across four organisations active in Britain between 1830 and 1870, which the QLN helped to set up and staff. I call these organisations the ‘Centres for the Emergence of the Discipline of Anthropology in Britain’ (CEDA).
 
## Question 1: ##

What can be revealed if a historian uses data science to study a large historical community over a long period of time by bringing together and integrating metadata from catalogues, indexes, and genealogical data from different sources?
 
## Methodology: ##

I have designed and built, and am now using, a suite of open-source, reproducible relational database technologies and digital analytic tools to visualise and scrutinise the entire community of some 3000 activists over 40 years (1830-1870), picking out the Quakers amongst them so that the community can be explored at both group and individual level. I am able to model the ‘connected’ relationships between the individual members of the CEDA through time, including kinship, education, occupations, locations and organisations. I call my model the Historical Data Digital Toolkit (HDDT).
 
## Question 2: ##
What is the extent of Quaker involvement in the CEDA over the 40-year time span researched, and was Quaker kinship as socially cohesive as (say) education or occupation amongst the wider community?

<a id='2'></a>


# 2 HDDT Objective #

To enable historians to create a desktop data pipeline by integrating multiple archival data pipelines into one SQLite HDDT database. The resulting HDDT data can then be cleaned, viewed, analysed and visualised using compatible open-source software packages. The HDDT is designed to enable ordered historical data to be surveyed, much as an archaeologist might use a variety of technologies to survey a territorial site of historical interest. The HDDT is not a tool for finding one item, or even several items, in an archive.

The HDDT constitutes a new approach to digital and nineteenth-century history, applying data science techniques to the study of ordered historical data.
Exponential population growth in the nineteenth century, combined with a disciplined and extensive cultural interest in collecting, cataloguing and preserving historical artefacts and manuscripts, offers the historian an opportunity to study communities and whole societies en masse.
The historical archives of the nineteenth century are often very large (and immense when viewed collectively); they are too big to be surveyed by the naked eye or even with 'office' based technologies. Nonetheless they offer rich sources of ordered datasets (organised, catalogued and frequently cross-referenced to each other).
Furthermore, archived nineteenth-century manuscripts frequently include ordered data items such as surveys, lists and indexes. These are a rich source of historical data. Data science is now capable of handling ordered historical data on large scales, so it is time to devise a way for digital historical data to be studied in and of itself, free of narratology and capable of including mass data. In the past, historical research in the archive was largely limited to using search engines to 'narrow down' from the whole of a collection to a discrete set of manuscripts or artefacts, producing a manageable handful of items that could be studied in a few visits to an archive, usually by one historian.
This approach is still immensely valuable, but for the modern period it is insufficient: it leaves too much unaccounted for, leaves too many ignored, and uses a torch where a lighthouse is needed.
The HDDT addresses this need by providing a way of studying mass data.

### Archival data can now be surveyed much as an archaeological site can - by using technology ###

<img src="diggers.png">


<a id='3'></a>


# 3 Contributing datasets #

## The HDDT integrates ordered datasets from a variety of sources to create one SQLite HDDT database ##

Historians can create a bespoke database taking data from multiple sources using the HDDT. This project takes data from:

| Source |Records |
| --- |---|
| Royal Anthropological Institute (RAI)| 2260 |
| Quaker Family History Society (QFHS) |593|
| Independent research at RAI |1171|
| Independent research at Friends House Quaker Archive, London |30|
| total records | 3095 |

Component datasets can be 'Complete', 'Incomplete' or 'Irregular', as long as the contributing datasets consist of records in which at least one column is shared. In this project, person_name was common to all datasets.

### A 'complete' dataset ### 

A 'complete' dataset is one like this, where all of the data can be contained within a perfect rectangular block of cells ('containers'), every container holds exactly one data item, and every data item can be located by the coordinates 'Row n, Column n'.

<img src="data_1.png">

### An 'incomplete' dataset ###

When historical data is used, some data is often missing (permanently lost), so the HDDT is able to accept 'Incomplete' datasets. The HDDT does not lose functionality because of the incomplete nature of much historical data.

<img src="data_2.png">

### An 'irregular' dataset ###

The HDDT has been designed to accept Irregular datasets. The surviving evidence of the past is not only often Incomplete, it is also frequently Irregular: multiple datasets have different dimensions, either because the data is intrinsically different or because different data collectors use different cataloguing methods.

<img src="data_3.png">

For the HDDT, a qualifying dataset is a dataset of any dimensions, whether complete, incomplete or irregular. The only requirement is that all datasets must contain a single common column holding one universally shared data item. The HDDT requires every dataset to contain data tables in which each record can be referenced to a PERSON (Name).
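
The shared-column requirement can be illustrated with a minimal Python sketch. It is not the HDDT's own integration code: the two miniature datasets, their column names other than `person_name`, and the `integrate` helper are all invented for illustration, and the rule that the first non-empty value wins is an assumption.

```python
# Hypothetical miniature datasets; every name and value is invented.
dataset_a = [
    {"person_name": "Jane Example", "year_joined": "1843", "society": "Society A"},
    # Incomplete record: the year is missing (permanently lost).
    {"person_name": "John Sample", "year_joined": "", "society": "Society B"},
]
# Irregular relative to dataset_a: different columns, different number of rows.
dataset_b = [
    {"person_name": "Jane Example", "occupation": "Physician"},
    {"person_name": "Mary Placeholder", "occupation": "Banker"},
]


def integrate(*datasets, key="person_name"):
    """Merge rows from datasets of any shape into one record per person."""
    people = {}
    for rows in datasets:
        for row in rows:
            record = people.setdefault(row[key], {key: row[key]})
            for column, value in row.items():
                if value:  # empty cells are skipped, never stored
                    # Assumption: the first non-empty value seen is kept.
                    record.setdefault(column, value)
    return people


people = integrate(dataset_a, dataset_b)
for name, record in sorted(people.items()):
    print(name, record)
```

Because the only thing the datasets must agree on is the key column, missing cells and mismatched columns do not prevent integration; they simply leave a person's record with fewer attributes.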



<a id='4'></a>


# 4 Data pipeline #

<img src="pipeline.png">

## Relational Database options ##

###  Bigraph, Social Network - Static or Dynamic? ###

<img src="bipartite_diag.png">

## The HDDT can do all three (and even blend types) ##

Throughout the HDDT, data items are variously called 'columns', 'names', 'nodes' and 'source and target'. The links between data items are registered in 'many to many (m2m)' tables, all of which can be used to generate 'edges' tables (for data visualisation in Gephi).
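
As a rough illustration of how an m2m table can yield an edges table, the sketch below uses plain Python on a handful of invented rows. The table name `m2m_person_society`, the person and society names, and the choice to weight an edge by the number of shared societies are assumptions, not the HDDT's actual method. The `Source`, `Target` and `Weight` headers are the column names Gephi recognises when importing an edges CSV.

```python
import csv
import io
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical rows of a person-to-society m2m table (invented names).
m2m_person_society = [
    ("Person A", "Society X"),
    ("Person B", "Society X"),
    ("Person B", "Society Y"),
    ("Person C", "Society Y"),
]

# Bigraph: the m2m rows are already person-to-society edges.
bipartite_edges = sorted(m2m_person_society)

# Social network: link two persons once for every society they share.
members = defaultdict(set)
for person, society in m2m_person_society:
    members[society].add(person)

weights = Counter()
for group in members.values():
    for source, target in combinations(sorted(group), 2):
        weights[(source, target)] += 1

person_edges = [(s, t, w) for (s, t), w in sorted(weights.items())]

# Write the edges table as CSV text ready to be saved and imported into Gephi.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(person_edges)
print(buffer.getvalue())
```

A dynamic version would additionally need start and end dates on each m2m row, which this sketch leaves out.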

<a id='5'></a>


# 5 Integrated HDDT SQLite database - 3095 persons #
## Entity Relationship Diagram (ERD) ##

<img src="ERD.png">

At the heart of the HDDT (and the SQLite database) is the 'person table', which holds the data (attributes) unique to each person. All contributing datasets have a person column and often attributes, some of which may be shared across datasets. Conflicts between the Person (Name) values of different datasets are resolved by accepting the 'RAI dataset' as the 'Authority Index'. With careful matching of the Person (Name) values found in other datasets, the RAI naming rule therefore applies throughout.
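
How the matching against the Authority Index was carried out is not described here, so the following is only a sketch of one possible approach: normalise each name, look for an exact match, and fall back to a fuzzy match from Python's standard `difflib` module. The names, the `normalise` and `resolve` helpers, and the 0.85 cutoff are all hypothetical, and anything left unresolved would still need checking by hand.

```python
import difflib

# Hypothetical authority index: invented names standing in for the RAI forms.
authority_index = ["Example, Jane", "Sample, John Henry", "Placeholder, Mary"]


def normalise(name):
    """Lower-case, drop punctuation and collapse spaces (for comparison only)."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


normalised_authority = {normalise(name): name for name in authority_index}


def resolve(name, cutoff=0.85):
    """Return the authority form of a name, or None if it needs manual review."""
    key = normalise(name)
    if key in normalised_authority:
        return normalised_authority[key]
    # Assumption: a close fuzzy match is accepted as the same person.
    close = difflib.get_close_matches(
        key, list(normalised_authority), n=1, cutoff=cutoff
    )
    return normalised_authority[close[0]] if close else None


for variant in ["EXAMPLE,  Jane", "Sample, John Henri", "Unknown, Person"]:
    print(variant, "->", resolve(variant))
```

Whatever spelling a contributing dataset uses, the value stored is always the authority form, which is what lets one naming rule apply throughout.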

Tables of data items shared amongst persons (such as 'occupation', 'location', 'societies', 'clubs') are linked to the person table by m2m tables.

There are also person_person tables to capture family relationships.
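
The layout described above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions rather than the actual HDDT schema: every table and column name other than the person table idea itself (for example `m2m_person_occupation`, `birth_year`, `relationship`) is invented, and the real ERD contains more tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the REFERENCES clauses
conn.executescript("""
    -- Attributes unique to each person (hypothetical columns).
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        birth_year INTEGER              -- NULL where the data is lost
    );
    -- A data item shared amongst persons.
    CREATE TABLE occupation (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    -- m2m table linking persons to occupations.
    CREATE TABLE m2m_person_occupation (
        person_id INTEGER NOT NULL REFERENCES person(id),
        occupation_id INTEGER NOT NULL REFERENCES occupation(id),
        PRIMARY KEY (person_id, occupation_id)
    );
    -- person_person table capturing a family relationship.
    CREATE TABLE person_person (
        person1_id INTEGER NOT NULL REFERENCES person(id),
        person2_id INTEGER NOT NULL REFERENCES person(id),
        relationship TEXT NOT NULL,
        PRIMARY KEY (person1_id, person2_id, relationship)
    );
""")

conn.executemany(
    "INSERT INTO person (id, name, birth_year) VALUES (?, ?, ?)",
    [(1, "Person A", 1801), (2, "Person B", None)],
)
conn.executemany(
    "INSERT INTO occupation (id, name) VALUES (?, ?)",
    [(1, "Physician"), (2, "Banker")],
)
conn.executemany(
    "INSERT INTO m2m_person_occupation VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)]
)
conn.execute("INSERT INTO person_person VALUES (1, 2, 'sibling')")

# List every person with each of their occupations via the m2m table.
rows = conn.execute("""
    SELECT p.name, o.name
    FROM person AS p
    JOIN m2m_person_occupation AS m ON m.person_id = p.id
    JOIN occupation AS o ON o.id = m.occupation_id
    ORDER BY p.name, o.name
""").fetchall()
print(rows)
```

Keeping shared items in their own tables means an occupation or location is stored once and linked many times, so a correction made in one place applies to every person linked to it.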

<a id='6'></a>

# 6 HDDT Design Architecture #

The following packages are required to make the HDDT:

| package | Use | 
| --- | --- |
| GitHub | for version control and sharing | 
| SQLite | the database |
| VSC | building the database |
| VSC | version control interface to Git |
| DBeaver | data cleaning, data management and analysis |
| Jupyter Notebook | data analysis |
| Gephi | data visualisation | 


They were chosen because they are universal, popular, open-source and suitable for handling historical ordered data.

## Project containers contain all required resources ##


<img src="file_structure.png">

Data management needs careful consideration and design. The HDDT uses the concept of project containers, where every container is set up and initialised as a GitHub repo. A Jupyter Notebook is then created in the same container. All resources needed for a project are then copied from master containers (such as the template CSV files and dataframes in the ceda/database/views container).

GEXF graph files and GEXF project files are also set up and saved in each container.

Relative links can then be used and their integrity preserved.
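
The container idea can be sketched in a few lines of Python. Everything here is hypothetical apart from the `ceda/database/views` path mentioned above: the temporary workspace, the `template_nodes.csv` file, the `create_container` helper and the project name are invented, and the Git and notebook set-up steps are left out.

```python
import shutil
import tempfile
from pathlib import Path

# A temporary folder stands in for the directory that holds all containers.
workspace = Path(tempfile.mkdtemp())

# Hypothetical master container holding one template CSV file.
master = workspace / "ceda" / "database" / "views"
master.mkdir(parents=True)
(master / "template_nodes.csv").write_text("Id,Label\n")


def create_container(name):
    """Create a project container and copy the master templates into it."""
    container = workspace / name
    container.mkdir()
    for template in master.glob("*.csv"):
        shutil.copy(template, container / template.name)
    return container


container = create_container("example_project")

# Inside the container a notebook can refer to its resources by relative path,
# so the link stays valid wherever the container is moved or cloned.
relative_link = Path("template_nodes.csv")
print((container / relative_link).read_text())
```

Copying resources into each container costs some disk space, but it means a container can be shared or archived as a single self-contained unit.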

GitHub can also be used for version control, providing an audit trail of changes, additions and deletions to the HDDT system.

## 6a GitHub ##

The entire HDDT project, its description, structure, organisation and resources are contained in one GitHub account:

https://github.com/KelvinBeerJones


<img src="git.png">

## 6b SQLite ##

See 5. Entity Relationship Diagram

SQLite was chosen as the database engine.

SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day (https://www.sqlite.org/index.html). 

## 6c VSC (code for building the HDDT) ##

Visual Studio Code is a lightweight but powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux. It comes with built-in support for JavaScript, TypeScript and Node.js and has a rich ecosystem of extensions for other languages (such as C++, C#, Java, Python, PHP, Go) and runtimes (such as .NET and Unity)(https://code.visualstudio.com/).

<img src="vsc.png">

## 6d VSC (Version control interface from local to online Git repos) ##

VSC is also used to integrate the desktop version of the HDDT with GitHub. (https://code.visualstudio.com/docs/editor/github)

<img src="vsc_for_git _version_control.png">

## 6e DBeaver ##

 Universal Database Tool

DBeaver is a free multi-platform database tool for developers, database administrators, analysts and all people who need to work with databases. It supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, DB2, SQL Server, Sybase, MS Access, Teradata, Firebird, Apache Hive, Phoenix, Presto, etc. (https://dbeaver.io/)

<img src="DBeaver.png">

## 6f Jupyter Notebooks ##

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning. (https://jupyter.org/).

<img src="jnb.png">

## 6g Gephi ##



Gephi is a tool for data analysts and scientists keen to explore and understand graphs. Like Photoshop™ but for graph data, the user interacts with the representation and manipulates the structures, shapes and colors to reveal hidden patterns. The goal is to help data analysts to make hypotheses, intuitively discover patterns, and isolate structure singularities or faults during data sourcing. It is a complementary tool to traditional statistics, as visual thinking with interactive interfaces is now recognized to facilitate reasoning. This is software for Exploratory Data Analysis, a paradigm that appeared in the Visual Analytics field of research. (https://gephi.org/features/)

<img src="gephi.png">

<a id='7'></a>


# 7 600 Quakers amongst 3000 activists for 40 years. This is BIG data! #

<img src="big_data.png">

# END #