# Liberating Archives: Preparing Research-Ready Databases 

### Notebooks and tutorials for Digital Archivists, Humanists, and Social Scientists 


<img src="pics/book.jpg"/>

__by__
Nick Adams, 
Johannes Fritz, 
Evelyn Mwangi, 
Jake Ryland Williams, 
Aaron Culich,
Jeff Gordon, 
Rebecca Fan

__A project of:__
<table border=0px width=100%>
<tr>
<td border=0> <img src="pics/goodnotalias copy.png" alt="Drawing" style="width: 200px;"/> 
<td border=0> <img src="pics/DLab.png" alt="Drawing" style="width: 200px;"/> 
<td border=0> <img src="pics/ssrc.png" alt="Drawing" style="width: 200px;"/>
</tr>
</table>

__Acknowledgements:__
Cody Hennesy (for consultation)  
Berkeley’s Social Science Matrix (for ongoing support of a course based on these materials)  
Stefan van der Walt (for consultation)  
Aakriti Kaul (for consultation)




## Table of Contents

0.0 – Introduction  
0.1 – Begin by Wishing  
1.0 – Gathering your documents  
1.1 – Using selenium to collect a large number of pages  
2.0 – Parsing your Documents  
2.1 – Introduction to Regular Expressions  
3.0 – Organizing your Research-Ready Database  
3.1 – Evaluate and Iterate  
4.0 – Storing and Sharing your Research-Ready Database  
5.0 – Conclusion  
6.0 – Appendix – Additional Scripts


## Executive Summary

Billions of pages of textual data are stored in digital archives accessible to the public through the internet. But almost none of these archives are designed to facilitate research using powerful contemporary computational text analysis methods. Instead, people have to access and read each document one-at-a-time, just as they would physical documents at their local library. This is not ideal. And as computational text analysis techniques continue to mature – providing researchers with increasing capacities to explore and inquire into massive sets of documents – the unrealized potential of digital archives grows larger.

With so much important human history documented digitally but inaccessible to computer-aided analysis, it behooves librarians, archivists, digital humanists, and social scientists to re-organize digital archives for new computational approaches. However, most people who have actually converted a digital archive into a research-ready database can tell you: the process is rather daunting if you are doing it for the first time with no resource to guide you. So, in the spirit of training many people to fish, the GoodlyLabs have teamed up with the Social Science Research Council and the Computational Text Analysis Working Group (CTAWG) at UC Berkeley’s D-Lab to produce this set of tutorials. 

The tutorials walk beginners through a process that:
1.	collects text documents spread across a digital archive into a single corpus (i.e. document set) retaining document-level metadata (i.e. variables describing the document);
2.	identifies useful and meaningful structures within the documents (e.g. subheadings) that will provide traction for later queries into the textual data set;
3.	organizes the documents into a database structure that allows a computer to efficiently search for important elements in the data; 
4.	links the data from the documents to other data sources describing the people, events, or objects described in the documents; and 
5.	shares all this better-organized, research-ready data through a web-portal allowing anyone with an internet connection to ask questions about the data they never could before. 

We have taken special care to write these tutorials for absolute beginners. Though this executive summary assumes a readership with some prior knowledge of computational text analysis, our tutorials avoid jargon and relate each step of the process to something readers already do in their daily lives. Converting a digital archive into a research-ready database is certainly a complex process best done in (something like) the order we recommend, but each step is simple on its own, and draws on cognitive skills our readers already possess. 

With this publication, we provide more than just tutorials and exercises designed to teach skills. We provide the programming scripts we’ve written to collect, parse, re-organize, and publicly share transcripts of U.S. Congressional Hearings. We encourage readers to copy these scripts and re-purpose them for their own use.   

Finally, let us say how important it is to us that these tutorials be experienced as fun and empowering. If anything can be simplified or made clearer, please make a suggestion. We will do our best to incorporate all feedback into these continually improving documents. 
  
## Introduction
Hello! This is an exciting moment! You are about to embark on an adventure that will change the way you experience the world, and likely make history. Whether your name will appear in high school textbooks remains to be seen, but the work you do with the help of these tutorials will almost certainly be described to students by analogy to the Gutenberg printing press. Recall that in the centuries before Gutenberg, Europe was in a “dark age” of intellectual malaise. But in the decades after the printing press was created, Europe experienced a renaissance and enlightenment with profound social, political, and economic consequences. 

Your work will be no less consequential. For the first time in history, humans can enlist the help of computers to analyze thousands of documents in seconds, and find patterns in minutes that a dedicated team of human researchers could only find after years of work. Those patterns might describe the behaviors of our political leaders, the ways we are collectively constructing notions of race or gender, or how police and protesters interact during times of political tumult. By enlisting the help of computers to process all these documents, we can hold up a mirror to society allowing us to all see more clearly how we are creating our realities, and how we might recreate them for the better. 

But none of this important analysis can happen until the world’s documents are prepared for computers to ingest, parse, search, and analyze. And none of the patterns computers can help us identify will make any difference if they are not shared openly with the rest of society in a way the public can understand.

This is where you come in! You, a person who, walking down the street, appears like any other. A person who puts their pants on one leg at a time… who sometimes trips on the sidewalk, occasionally says the wrong thing, and too often feels underappreciated by colleagues, friends, or family. Little did they know that you were a hero all along. Little did they know that you would be among a small band of people who change human history by making human history so legible. Little did they know…

And you may not believe it yourself. But over the course of several days – whether you make your way through these tutorials in a couple of weeks or over a few months – you will learn how computers read language. You will learn how they do work, and how you can put them to work.

To your surprise, you will learn that programming a computer is not as hard as you thought – that you already know 90% of what you need to know to do it – because you already know how a flow chart works, how to write out task lists, and how to sequence them. You will also learn how you are smarter than a computer in many ways, and how you can teach it what you already know about language. And finally – as your work, and that of others like you, increasingly feeds back into society – you will learn that the world is not at all what you thought it was. Just as vast amounts of dark energy and matter permeate the universe without physicists quite being able to understand them, our social world includes vast latent potential waiting to be activated. All we need to do is show it to people.

### Where We’re Headed: A Research-Ready Database

You may already have a dataset that you want to unlock and make available to researchers and the public. And you may already have an idea of the positive impact that will result when those documents are shared. But, to give you a sense of the difference between our current reality and the reality you will create by using these tutorials, consider the database we have created along the way to producing these tutorials – a database of speeches occurring in U.S. Congressional Hearings.

We’ll start by having a look at the Congressional Hearings records made available by the Government Publishing Office (GPO). The GPO is doing a great service to our democracy by making these Hearings transcripts available, and they are beautifully organized for preservation purposes. If anyone wants to know what happened during a particular hearing and/or particular session of Congress, they can easily navigate to the record and read along as if they were there in the Capitol building watching our government in action.

But what if they wanted to do more than just virtually observe a single hearing? What if, for instance, they wanted to know what all the Democratic Representatives from rural districts were saying in the Agriculture Committee over the course of 30 years? They would have to read all the transcripts of all the hearings over that time period, carefully ignoring everything said by Republicans and by Democrats from urban districts. They’d have to take notes in a separate document about the relevant statements. Then, they’d have to read through all those notes and try to find patterns. That could be years of work. 

What if they could get that question answered in an afternoon? All the speeches are in those hearing transcripts. And it’s not hard to know who is a Democrat or a Republican, or whether they hail from an urban or rural district. That information is public knowledge. So finding all those particular speeches should be easy. … But it’s not. 

Over the course of these tutorials, you will learn how to organize a set of documents (like the hearings); link them to widely available data describing the people, places, or things appearing in those documents; and make a website allowing others to very easily search through all the documents to find just what they are looking for. By the end of your training, you will be able to create a website that works more like this:

<video width="800" height="550" controls>
  <source src="pics/CQ.mp4" type="video/mp4">
</video>

As you can see, the difference in a curious citizen’s experience is stark. Without the work we are embarking on today, digital archives like the GPO’s Congressional Hearings transcripts are only “open,” “transparent,” and “accessible” to the incredibly determined researcher with months of free time. But once we are finished with a digital archive, even a novice can ask and answer deep and powerful questions about the people, places, things, and ideas appearing in a set of documents. The databases you will be able to produce by following these tutorials are “research-ready” for both the public and scholars. They have the following characteristics: 

* Digital text (machine-readable)

* Query-able (token search across researcher-defined set of documents)

* Retaining original structure, formatting

* Supplemental search-able annotations (e.g. labeling speech acts, speakers)

* Linked to other data describing the same object     
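
To make these characteristics a little more concrete, here is a minimal sketch in Python of what "query-able" and "linked" could mean in practice. Everything in it is made up for illustration: the speaker IDs, names, speeches, and the `search` helper are hypothetical and are not the scripts or data used in these tutorials. Each speech carries a speaker ID, and that ID links it to a second set of records describing the speaker.

```python
import re

# Hypothetical linked data describing each speaker.
members = {
    "M001": {"name": "Rep. A", "party": "D", "district_type": "rural"},
    "M002": {"name": "Rep. B", "party": "R", "district_type": "rural"},
    "M003": {"name": "Rep. C", "party": "D", "district_type": "urban"},
}

# Hypothetical machine-readable speeches, each annotated with its speaker.
speeches = [
    {"speaker_id": "M001", "committee": "Agriculture",
     "text": "Crop insurance matters to family farms."},
    {"speaker_id": "M002", "committee": "Agriculture",
     "text": "Crop prices fell again this year."},
    {"speaker_id": "M003", "committee": "Agriculture",
     "text": "Urban gardens need crop support too."},
    {"speaker_id": "M001", "committee": "Budget",
     "text": "The deficit is growing."},
]


def search(token, committee, **speaker_filters):
    """Find speeches in one committee that contain `token`,
    keeping only speakers whose metadata matches every filter."""
    pattern = re.compile(rf"\b{re.escape(token)}\b", re.IGNORECASE)
    hits = []
    for speech in speeches:
        speaker = members[speech["speaker_id"]]  # follow the link
        if speech["committee"] != committee:
            continue
        if any(speaker.get(key) != value for key, value in speaker_filters.items()):
            continue
        if pattern.search(speech["text"]):
            hits.append((speaker["name"], speech["text"]))
    return hits


print(search("crop", "Agriculture", party="D", district_type="rural"))
```

A real research-ready database would hold far more records and would usually live in a database system rather than in Python lists, but the idea is the same: the text is searchable, and the annotations and linked data let a researcher define which documents to search.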


### You Can Do This!

The work we will be doing together centers around three major phases, each depending primarily on skills you already have: 
(1) collecting documents from web-based digital archives, 
(2) finding additional meaningful units of text in those documents, and 
(3) organizing all the data into a database format. 

__Phase 1:__ You know how to browse the web, and how to click through menus and folders to find files. Just click, click, scroll, click, download… and you have one of the files you need. Phase 1 will teach you how to command a computer to do that click, click, scroll, click, download 100 times per minute, so that you can collect all the documents of your archive together in several hours instead of several weeks. 
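
As a rough preview of what "commanding a computer to download" can look like, here is a small sketch that uses only Python's standard library. It is an assumption-laden illustration, not the Selenium approach covered in section 1.1: the function names (`fetch`, `collect`), the file naming scheme, and the pause length are all hypothetical choices.

```python
import time
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def collect(urls, out_dir, fetcher=fetch, pause=0.6):
    """Save every URL in `urls` as a numbered file inside `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, url in enumerate(urls, start=1):
        target = out_dir / f"doc_{number:05d}.html"
        target.write_bytes(fetcher(url))
        saved.append(target)
        # A 0.6 second pause caps the pace at about 100 pages per minute
        # and avoids hammering the archive's server.
        time.sleep(pause)
    return saved
```

Calling `collect(list_of_urls, "corpus")` with a list of document addresses would write one file per address into a folder named `corpus`. The `fetcher` argument exists so the loop can be tried out with a stand-in function before pointing it at a real archive.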

__Phase 2:__ You know how to skim through a newspaper by looking at the page headings for ‘Sports,’ ‘World News,’ ‘Weather,’ etc. And you know where to expect the title of an article and the author’s name. Phase 2 will teach you how to command a computer to recognize similarly reliable bits of information throughout your documents so that, for instance, people can search through just the ‘Weather’ reports while ignoring all the ‘Sports’ news. Here, again, you already know what to do, and you know much more than any computer. We’re just going to help you show a computer how to look for and keep track of the sorts of information you would look for. 
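
The newspaper example can be sketched in a few lines of Python using a regular expression (the topic of section 2.1). The sample text and the rule for spotting a heading (a line made up only of capital letters and spaces) are assumptions made for this illustration; real documents will need their own rules.

```python
import re

# A made-up miniature "newspaper" with three sections.
newspaper = """\
SPORTS
The home team won in extra innings.
WEATHER
Expect light rain through Thursday.
WORLD NEWS
Trade talks resumed on Monday.
"""

# Assumption: a heading is a whole line of capital letters and spaces.
heading = re.compile(r"^([A-Z][A-Z ]+)$", re.MULTILINE)


def split_sections(text):
    """Return a dict mapping each heading to the text that follows it."""
    parts = heading.split(text)
    # Because the pattern has a capturing group, `parts` alternates:
    # ['', 'SPORTS', '<body>', 'WEATHER', '<body>', ...]
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}


sections = split_sections(newspaper)
print(sections["WEATHER"])
```

Once the text is split this way, a reader (or a program) can look only at the 'WEATHER' entries and ignore everything filed under 'SPORTS'.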

__Phase 3:__ Many of us have used a database at some job that required us to enter information about customers. But chances are, you’ve created relational databases (like those we will use in our projects) before, too. A friend of mine, for instance, recently created a very simple relational database without even knowing it. She created a spreadsheet of addresses for all the friends and family to whom she writes holiday greeting cards. Then in a separate file she listed each person’s name and typed out a thoughtful message for that person. (Typing the notes in a word processor allowed her, later when she wrote the messages out by hand, to focus on her penmanship and not the content of the greeting.) Unbeknownst to my friend, she was creating a relational database when she created two separate files – one for addresses, and one for the message content – with overlapping information regarding the names of her greeting card recipients. 

Relational databases are so useful because we often want different sets of information stored in separate files (even though they share some information in common), and we want to be able to search across those multiple files to pull together information from them in some later moment. In my friend’s case, she built a relational database because she preferred not to type out the messages in a spreadsheet when she could use a word processor instead. But computers have their own reasons to prefer splitting data into separate files (which we will discuss). As long as those files have some item of data in common, we (and the computer) can always piece all the information back together. 

Querying a relational database (which we will also show you how to do with a computer) is something my friend did without knowing it, too. When she wrote me a holiday card, she manually queried her own little database to seek the information ‘address’ and ‘message content’ from all records (across all the files) where the ‘name’ of the record was ‘Nick Adams.’ In Phase 3, we will teach you how a computer prefers that we organize the separate files of a database so that it can store data efficiently, and quickly perform queries across all the files so it can give us the specific information we want.
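
Here is one way the greeting-card example might look if a computer did the work, sketched with Python's built-in `sqlite3` module. The table names, the addresses, the messages, and the second recipient are all invented for illustration; only the idea (two tables sharing a `name` column, queried together) comes from the story above.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two separate tables that share one item of data: the recipient's name.
conn.executescript("""
CREATE TABLE addresses (name TEXT PRIMARY KEY, address TEXT);
CREATE TABLE messages  (name TEXT PRIMARY KEY, message TEXT);
""")

# Made-up records standing in for the spreadsheet and the word-processor file.
conn.executemany("INSERT INTO addresses VALUES (?, ?)", [
    ("Nick Adams", "12 Example Street"),
    ("A. Friend", "34 Placeholder Avenue"),
])
conn.executemany("INSERT INTO messages VALUES (?, ?)", [
    ("Nick Adams", "Happy holidays, Nick!"),
    ("A. Friend", "Wishing you a wonderful new year."),
])

# The query: fetch 'address' and 'message' from both tables
# for the record whose 'name' is 'Nick Adams'.
row = conn.execute("""
    SELECT a.address, m.message
    FROM addresses AS a
    JOIN messages AS m ON m.name = a.name
    WHERE a.name = ?
""", ("Nick Adams",)).fetchone()

print(row)
```

The `JOIN ... ON m.name = a.name` clause is the step where the shared piece of information is used to stitch the two tables back together.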

__Bonus Phase 4:__ As a bonus to ensure that all our work is as useful as possible to as many people as possible, we will walk you through some considerations about how to share your data with the world, and show you how we’ve done it for our particular case. We know you know how to share! You’ve been doing it at least since kindergarten.   

### What to Expect Along the Way: An Iterative Sequence

None of these phases can be completed without giving some thought to the others. But trying to complete all of them at once would likely overwhelm you. (It would overwhelm us!) Though no single step in these tutorials is particularly difficult, each phase has enough complexity to challenge an intelligent beginner. And often, data from one phase needs to be carried forward into the next phase. So we recommend (especially to beginners) that you focus your attention on completing one phase at a time in the sequence we illustrate. 

Celebrate your arrival at checkpoints, too! As you will learn (hopefully without our help!), computers are quite finicky about the commands we give them. Sometimes a stray comma can confuse the machine and it can take even a veteran programmer many hours to find that little mistake. So, take your time. Expect bumps along the way. Don’t let yourself get too discouraged. And, we reiterate: enjoy your successes as you go. It’s exciting when the computer works for you. It saves you thousands of hours of tedium. So, spend a nice moment enjoying the dopamine rush that comes when the computer, quite magically, does what you hope it will do! 

__Iterate.__ As with most design and engineering processes, we should not begin until we have some vision of the outcome we are trying to accomplish. But we also cannot expect to have a perfect vision of that outcome before commencing. So, we have to prepare ourselves now for the likelihood that getting to our goal may require two or more iterations of our process. For instance, we might realize only once we (think we) are at the very end of our project and querying our database that we failed to collect information about some important variable way back in the beginning. Such a mistake would not be catastrophic. And it wouldn’t require you to redo all or even a majority of the work. But it might require refactoring your instructions to the computer, and/or reorganizing your database. We want you to start this project with the expectation that your first attempt will fall somewhat short of your vision, because you will never gain momentum if you are paralyzed by the fear of making a mistake. That said, we also attempt to share with you some of the insights we have learned through trial-and-error, and to point out those moments where extra forethought can save you trouble down the road.
For example, the next page, the very beginning of our tutorials, does not start with Phase 1. It starts with some forethought about what kind of data you will want to organize into a database in Phase 3. Let’s get started! 