The 5th project for the Metis Data Science Bootcamp. It is an interactive archive for the newly digitized documents of the League of Nations.

John-Keating/auto-archive

Auto-Archive

Dec 29, 2015: Sorry for the incomplete and messy state of this repo. Deadlines and holidays got the better of me. I am in the process of cleaning up the code I have and planning the next steps. In the meantime, if you would like some idea of what Auto-Archive can do, see these slides. -- John

Introduction

Auto-Archive is a suite of programs designed to preserve documents in an interactive, electronic archive with a minimal amount of human input.

In particular, it aims to correct the errors introduced when a document is read by Optical Character Recognition (OCR) software, a common problem with historical documents.
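To give a feel for what correcting OCR errors can look like, here is a minimal sketch. It is not Auto-Archive's actual method: it assumes a small, hypothetical word list (`VOCABULARY`) and uses fuzzy string matching from Python's standard `difflib` module to replace a misread word with its closest known word. The function names `correct_token` and `correct_text` are made up for this illustration.

```python
import difflib

# Hypothetical vocabulary for illustration only; a real archive would
# need a far larger word list built from trusted text.
VOCABULARY = ["league", "nations", "council", "treaty", "secretary", "general"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token unchanged.

    Note: a corrected token comes back lowercased, because matching is
    done against a lowercase vocabulary.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return token  # already a known word, leave it alone
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_token(token) for token in text.split())
```

With this vocabulary, `correct_text("Leaguc of Natlons")` returns `"league of nations"`. Words with no close match, such as "of", pass through untouched. The `cutoff` value is a trade-off: lowering it catches more misreadings but also risks "correcting" words that were right.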

Motivation

Archives are a uniquely vulnerable repository of the world's culture. Even when documents have been gathered to form an archive, they are vulnerable to flame, neglect, theft, and deliberate destruction. The destruction of a people's archive is a common tool of war. The loss of an archive is irrevocable. Storing documents electronically does not automatically prevent loss: spilled coffee, computer viruses, and other modern plagues routinely destroy data. Storing archives electronically does, however, mean that they can be ubiquitous, easily accessed, and easily shared. Auto-Archive saves archives from destruction by making them easy to copy, and saves them from neglect by making them easy to access.

One day, I would love to see a relative of yours using Auto-Archive to create a family archive. I want to see archives on small hard drives smuggled out of war zones.

Origin and History

Auto-Archive began on November 23, 2015 as the 5th and final project for the Metis Data Science Bootcamp. Its goal was to improve the OCR text of the PDF documents enough that they could be modeled with common Natural Language Processing (NLP) tools.

Reading PDF documents

A Flexible Database

Text Correction

To Do List
