Dec 29, 2015: Sorry for the incomplete and messy state of this repo. Deadlines and holidays got the better of me. I am in the process of cleaning up the code I have and planning the next steps. In the meantime, if you would like some idea of what Auto-Archive can do, see these slides. -- John
Auto-Archive is a suite of programs designed to preserve documents in an interactive, electronic archive with minimal human input.
In particular, it aims to correct the errors introduced when a document is read with Optical Character Recognition (OCR) software, a common problem for historical documents.
Archives are a uniquely vulnerable repository of the world's culture. Even when documents have been gathered to form an archive, they are vulnerable to flame, neglect, theft, and deliberate destruction. The destruction of a people's archive is a common tool of war. The loss of an archive is irrevocable. Storing documents electronically does not automatically prevent loss: spilled coffee, computer viruses, and other modern plagues routinely destroy data. Storing archives electronically does, however, mean that they can be ubiquitous, easily accessed, and easily shared. Auto-Archive saves archives from destruction by making them easy to copy, and saves them from neglect by making them easy to access.
One day, I would love to see a relative of yours using Auto-Archive to create a family archive. I want to see archives on small hard drives smuggled out of war zones.
Auto-Archive began on November 23, 2015 as the fifth and final project for the Metis Data Science Bootcamp. Its goal was to improve the OCR text of PDF documents enough that they could be modeled with common Natural Language Processing (NLP) tools.
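
To give a rough feel for what "improving OCR text" can mean, here is a minimal sketch in Python. It is not the code in this repo: the function names `correct_token` and `correct_text`, the `cutoff` value, and the idea of matching against a small known vocabulary are all assumptions made for illustration. It uses only the standard library's `difflib` to swap a misread word for its closest known word.

```python
import difflib


def correct_token(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary word to `token`, or `token` unchanged.

    Hypothetical helper for illustration; `vocabulary` is any list of
    lowercase words assumed to be spelled correctly.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        # Already a known word: leave it as written.
        return token
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text, vocabulary, cutoff=0.8):
    """Apply correct_token to each whitespace-separated token in `text`."""
    return " ".join(correct_token(t, vocabulary, cutoff) for t in text.split())


if __name__ == "__main__":
    vocab = ["the", "archive", "was", "destroyed"]
    print(correct_text("the arcbive was destroyed", vocab))
```

A real pipeline would need a much larger vocabulary and some handling of punctuation and capitalization; this sketch only shows the basic shape of the idea.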