docs2csv

Scan a folder of document files of all types and extract the text into a CSV suitable for import into Overview. Currently supports TXT, PDF, JPG, HTML, MHTML, RTF, and Microsoft Word, PowerPoint and Excel.

PDFs are OCR'd if -o is set and they contain no text, or always if -f is set. JPGs are OCR'd if -o is set.
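The OCR rules above can be sketched as a small Ruby predicate. This is only an illustration of the stated behaviour, not code taken from docs2csv.rb: the method name needs_ocr? and its keyword arguments are hypothetical, and has_text is assumed to be the result of an earlier text-extraction attempt.

```ruby
# Hypothetical helper (not from docs2csv.rb): decide whether a file should be
# sent to OCR, following the -o / -f rules described in this README.
def needs_ocr?(ext, has_text:, ocr: false, force_ocr: false)
  case ext.downcase
  when ".pdf"
    # -f forces OCR on every PDF; -o only OCRs PDFs with no extractable text.
    force_ocr || (ocr && !has_text)
  when ".jpg"
    # JPGs are OCR'd whenever -o is set.
    ocr
  else
    # Every other supported type goes through plain text extraction.
    false
  end
end

puts needs_ocr?(".pdf", has_text: false, ocr: true)        # true
puts needs_ocr?(".pdf", has_text: true,  ocr: true)        # false
puts needs_ocr?(".pdf", has_text: true,  force_ocr: true)  # true
puts needs_ocr?(".jpg", has_text: false, ocr: true)        # true
```

Keeping the decision in one predicate makes it easy to see that -f only affects PDFs, as the option description states.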

First you will need to install:

Poppler, for pdfimages (and pdftotext on some systems). On Linux, use aptitude, apt-get or yum:

aptitude install poppler-utils poppler-data

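On the Mac, you can install from source, or use MacPorts or Homebrew:

sudo port install poppler
brew install poppler

Tesseract, for OCR. Use whichever package manager applies to your system; the package is named either tesseract or tesseract-ocr:

[aptitude | port | brew] install [tesseract | tesseract-ocr]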

Without Tesseract installed, you'll still be able to extract text from documents, but you won't be able to automatically OCR them.

Typical usage

ruby docs2csv.rb -r -o directory-to-scan [outputfile]

This scans the directory recursively and OCRs any JPGs, plus any PDFs that contain no text. If outputfile is omitted, docs2csv writes the CSV to stdout.

Available options:

-l, --list                       Only list files, do not process
-r, --recurse                    Scan directory recursively
-o, --ocr                        OCR jpgs and pdfs that do not contain text
-f, --force-ocr                  Force OCR on all pdfs
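
As a rough illustration of the optional outputfile argument, the Ruby sketch below writes rows to a named file when one is given and to stdout otherwise. It is not taken from docs2csv.rb: write_csv, its io parameter, and the sample column names are hypothetical and chosen only for the example.

```ruby
require "csv"

# Hypothetical helper (not from docs2csv.rb): write rows as CSV to the named
# file, or to the given IO (stdout by default) when no file name is supplied.
def write_csv(rows, outputfile = nil, io: $stdout)
  out = outputfile ? File.open(outputfile, "w") : io
  csv = CSV.new(out)
  rows.each { |row| csv << row }
ensure
  # Only close handles this method opened itself; never close stdout.
  out.close if outputfile && out
end

# Illustrative rows; the real column set written by docs2csv may differ.
rows = [
  ["text", "url"],
  ["Extracted text of the first document", "http://localhost:8000/a.txt"]
]

write_csv(rows)                 # no output file: CSV goes to stdout
# write_csv(rows, "docs.csv")   # with an output file: CSV goes to docs.csv
```

Writing to stdout by default lets the output be piped or redirected like that of any other command-line tool.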

Viewing the original files from Overview

The extracted text will be shown in the Overview document viewer, but not the original document pages. You can view the original files in your browser via Overview's "source file" links, if you start up a simple web server like this:

python -m SimpleHTTPServer

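The "source file" links use the URL column that docs2csv writes, which has addresses of the form http://localhost:8000/[filename]. Because these file URLs are relative to the directory where docs2csv was run, you need to start the server from that same directory. SimpleHTTPServer is the Python 2 module; on Python 3 the equivalent command is python3 -m http.server, which also serves the current directory on port 8000 by default.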
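
To make the URL form concrete, here is a minimal Ruby sketch that turns a path relative to the scan's working directory into a link of the form shown above. The helper name source_url and the BASE_URL constant are hypothetical, and the percent-encoding of each path segment is an assumption added for the example; docs2csv.rb may build its URLs differently.

```ruby
require "erb"

# Matches the address form described above; assumes the server's default port.
BASE_URL = "http://localhost:8000"

# Hypothetical helper (not from docs2csv.rb): build a "source file" URL from a
# path relative to the directory where docs2csv was run.
def source_url(relative_path, base_url: BASE_URL)
  # Drop empty and "." segments so "./docs/a.pdf" and "docs/a.pdf" agree.
  segments = relative_path.split("/").reject { |s| s.empty? || s == "." }
  # Percent-encode each segment so spaces and similar characters stay valid.
  escaped = segments.map { |segment| ERB::Util.url_encode(segment) }
  "#{base_url}/#{escaped.join('/')}"
end

puts source_url("docs/report.pdf")
# http://localhost:8000/docs/report.pdf
puts source_url("./docs/annual report.pdf")
# http://localhost:8000/docs/annual%20report.pdf
```

Because the path is relative, the same link only resolves if the web server's root is the directory the path was computed from.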
