Auto-Archive

Dec 29, 2015: Sorry for the incomplete and messy state of this repo. Deadlines and holidays got the better of me. I am in the process of cleaning up the code I have and planning the next steps. In the meantime, if you would like some idea of what Auto-Archive can do, see the slides (Auto-ArchiveSlides.pdf). -- John

Introduction

Auto-Archive is a suite of programs designed to preserve documents in an interactive, electronic archive with a minimal amount of human input.

In particular, it aims to correct the errors introduced when a document is read with Optical Character Recognition (OCR) software, a common problem for historical documents.

Motivation

Archives are a uniquely vulnerable repository of the world's culture. Even when documents have been gathered to form an archive, they remain exposed to flame, neglect, theft, and deliberate destruction. The destruction of a people's archive is a common tool of war. The loss of an archive is irrevocable. Storing documents electronically does not automatically prevent loss: spilled coffee, computer viruses, and other modern plagues routinely destroy data. Storing archives electronically does, however, mean that they can be ubiquitous, easily accessed, and easily shared. Auto-Archive saves archives from destruction by making them easy to copy, and saves them from neglect by making them easy to access.

One day, I would love to see a relative of yours using Auto-Archive to create a family archive. I want to see archives on small hard drives smuggled out of war zones.

Origin and History

Auto-Archive began on November 23, 2015 as the fifth and final project for the Metis Data Science Bootcamp. Its goal was to improve the OCR text of PDF documents enough that it could be modeled by common Natural Language Processing (NLP) tools.

