Henric Andersson edited this page Mar 24, 2015 · 5 revisions

Welcome to the LagerDox wiki!

LagerDox is a system that lets you store and search documents. It uses OCR to understand the contents, tries to categorize each document (according to your settings), and tries to figure out the date you received it. Finally, it lets you search all documents with various options.
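To illustrate the idea of dating and categorizing a document from its OCR text, here is a minimal Python sketch. It is not LagerDox's actual code (which is mostly PHP). The keyword rules, the MM/DD/YYYY date format, and the names `guess_date`, `guess_category`, and `CATEGORY_KEYWORDS` are all assumptions made for this example.

```python
import re
from datetime import date

# Hypothetical category rules: category name -> keywords to look for.
# In a real setup these would come from the user's settings.
CATEGORY_KEYWORDS = {
    "Insurance": ["policy", "premium"],
    "Utilities": ["electric", "water bill"],
}

# Assumption: dates appear in US style, MM/DD/YYYY.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def guess_date(text):
    """Return the first valid MM/DD/YYYY date found in the text, or None."""
    for month, day, year in DATE_PATTERN.findall(text):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            continue  # looks like a date but is not a real one
    return None


def guess_category(text, rules=CATEGORY_KEYWORDS):
    """Return the first category with a keyword present in the text, or None."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None
```

OCR output is noisy, so a real implementation would likely need several date formats and fuzzier matching than this.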

External dependencies

It does NOT rely on any cloud functionality. It scans a folder on your system, adds any documents it finds to a LOCAL database, and then moves each PDF into a folder structure on your system. It also generates thumbnails to aid visual recognition.
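The flow described above (scan a folder, record each document locally, move the PDF into a folder structure) can be sketched roughly as follows. This is an illustration under stated assumptions, not the real implementation: it uses SQLite instead of MySQL so that it runs without a server, it files documents under a year/month layout that I made up, and the function name `ingest` and the table layout are hypothetical. OCR and thumbnail generation are left out.

```python
import shutil
import sqlite3
from datetime import date
from pathlib import Path


def ingest(incoming, archive, db_path, today=None):
    """Move every PDF in `incoming` into `archive`/YYYY/MM and record it."""
    today = today or date.today()
    # Assumed layout: archive/<year>/<month>/<original file name>
    target_dir = Path(archive) / f"{today:%Y}" / f"{today:%m}"
    moved = []
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            "id INTEGER PRIMARY KEY, filename TEXT, path TEXT, added TEXT)"
        )
        for pdf in sorted(Path(incoming).glob("*.pdf")):
            target_dir.mkdir(parents=True, exist_ok=True)
            destination = target_dir / pdf.name
            # The PDF is only moved, never modified.
            # A real version would also handle name collisions here.
            shutil.move(str(pdf), str(destination))
            conn.execute(
                "INSERT INTO documents (filename, path, added) VALUES (?, ?, ?)",
                (pdf.name, str(destination), today.isoformat()),
            )
            moved.append(destination)
        conn.commit()
    finally:
        conn.close()
    return moved
```

Calling `ingest("incoming", "archive", "docs.db")` once per polling interval would approximate what a monitor script does.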

It's quite possible to tie this into solutions such as ownCloud, Dropbox, Google Drive, or various cool FUSE implementations that use a variety of back-end solutions.

Integrity

The PDFs will not be touched, unless you use the splitter page or manually ask LagerDox to split a PDF into several parts. Even then it's done by simply slicing the document.
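"Slicing" here means cutting the page sequence into consecutive ranges, without re-rendering anything. As a sketch of just that bookkeeping (the actual PDF writing is omitted, and the function name `slice_pages` and its parameters are hypothetical), the page ranges for a split could be computed like this:

```python
def slice_pages(page_count, split_after):
    """Return 1-based (first, last) page ranges, cutting after each listed page."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Ignore duplicates and cut points that fall outside the document.
    cuts = sorted({page for page in split_after if 1 <= page < page_count})
    ranges = []
    start = 1
    for cut in cuts:
        ranges.append((start, cut))
        start = cut + 1
    ranges.append((start, page_count))
    return ranges
```

For example, `slice_pages(10, [3, 7])` returns `[(1, 3), (4, 7), (8, 10)]`: three parts that together cover every original page exactly once.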

Should you ever lose the database, you can recreate it by moving ALL documents into the incoming folder, where they will be rescanned. So no information is ever lost (except for metadata created by LagerDox itself, such as categories).
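The recovery step could be scripted along these lines. This is a sketch, assuming the stored PDFs live in a folder tree separate from the incoming folder; the function name `requeue_all` and the `_1`, `_2` renaming scheme for duplicate file names are my own choices, not something LagerDox prescribes.

```python
import shutil
from pathlib import Path


def requeue_all(archive, incoming):
    """Move every PDF under `archive` back into `incoming` for rescanning."""
    incoming = Path(incoming)
    incoming.mkdir(parents=True, exist_ok=True)
    count = 0
    # sorted() collects the full list before any file is moved.
    for pdf in sorted(Path(archive).rglob("*.pdf")):
        destination = incoming / pdf.name
        suffix = 1
        while destination.exists():  # avoid overwriting same-named files
            destination = incoming / f"{pdf.stem}_{suffix}{pdf.suffix}"
            suffix += 1
        shutil.move(str(pdf), str(destination))
        count += 1
    return count
```

The return value is the number of documents queued, which is handy for checking that nothing was left behind.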

Typical setup

Recipe:

  • Scanner (preferably ADF with duplex scanning)
  • Linux machine
    • Samba
    • Apache
    • PHP
    • MySQL

Cooking:

  • Install the software on Linux (see Install) and set up your scanner so it points to the incoming folder of your setup (typically a shared folder on Linux, hence Samba).
  • Launch the monitor script (sorry, not automatic yet)
  • Scan away
  • Look in awe at the shiny, searchable, dated and categorized documents using your web browser.

Background

I found the US has a tendency to bury you in paperwork, which is quite annoying. Especially when you need to find something.

Given that you can buy multi-function printers with full-duplex automatic document feeders (that was a mouthful) for about $99, it shouldn't be difficult to solve that problem. Except it is: at best, they will just dump everything in a folder on your desktop with some light OCR applied to the text.

The alternative is (of course) to let Google Drive, Evernote, or similar cloud solutions handle the problem. Except I didn't feel comfortable uploading everything to a cloud service, given all the breaches that happen. I mean, the goal was basically to shred the paper that was scanned, keeping only the digital copy.

Thus, LagerDox was born. It's based on the Tesseract OCR engine, since it's open source and quite capable of dealing with obscure text and languages (even multi-language). The interface is written mostly in PHP, and even some of the shell scripts call the PHP code to minimize duplication.

Since early 2014 I haven't needed to make any code changes to LagerDox; right now it is in a "good enough" state for my own use. At the beginning of 2015 a colleague of mine talked about making a similar system, but when I showed him mine he wanted to use it as a base instead of reinventing the wheel. And given the number of times the same discussion has come up over the years, I figured maybe it was time to share what I had, even if it's not the most beautiful code on this earth.

Why the odd name?

While one could argue that the name sucks, I'll give you some background :-)

Lager = Storage in Swedish

Dox = Docs = Documents

So basically: storage for documents.