Data + Narrative 2019 Liberating Data from Documents: Free and Reliable Methods

This repo contains information used in an introductory class on extracting data from PDFs taught at the Boston University Data + Narrative workshop on 6/4/19.

Our objective was to convert two PDF files into machine-readable data using Tabula, import that data into Excel, and then clean and analyze it.

The advanced level of this class can be found here.

Sublime Text

Sublime Text is a text editor that lets you see what your raw, converted text file looks like before you import it into Excel.
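If you would rather inspect the raw file from a script, the same idea can be sketched in Python. This is only an illustration, not part of the class materials: the helper name `preview_raw` and the sample text are made up. It prints each line in a form that makes tabs and trailing spaces visible, which is the kind of detail you would otherwise look for in a text editor.

```python
def preview_raw(text, max_lines=5):
    """Return the first few lines with invisible characters made visible."""
    # repr() shows tabs as \t and keeps trailing spaces inside the quotes.
    return [repr(line) for line in text.splitlines()[:max_lines]]


# Hypothetical converted text: note the tab and the trailing space.
sample = "Name,Amount\r\nAcme Corp,\t1200 \r\n"

for shown in preview_raw(sample):
    print(shown)
```

Stray tabs or trailing spaces that show up here are worth cleaning before the file goes into Excel.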

Tabula

Tabula is a browser-based tool for extracting tables within PDFs and converting them to other file formats.

Other tools

Cometdocs

Cometdocs is a web-based service for converting between many different file formats.

Excalibur

Similar to Tabula, Excalibur is a browser-based tool for extracting tables within PDFs and converting them to other file formats.

Xpdf

Xpdf is a command-line tool for converting PDFs to text files.

Handling misaligned columns in Excel

Find-and-replace
Text to columns
Delete and shift cells left and right
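The three Excel techniques above have rough scripted equivalents. The Python sketch below is an illustration under stated assumptions, not the method taught in the class: the function names and the sample row are hypothetical, and it assumes columns that were merged into one cell are separated by two or more spaces.

```python
import re


def find_and_replace(cell):
    """Find-and-replace: strip dollar signs and thousands separators."""
    return cell.replace("$", "").replace(",", "").strip()


def text_to_columns(cell):
    """Text to columns: split one cell on runs of two or more spaces."""
    return re.split(r"\s{2,}", cell.strip())


def shift_left(row, width):
    """Delete empty cells, shift the rest left, and pad back to `width`."""
    kept = [cell.strip() for cell in row if cell.strip()]
    if len(kept) > width:
        raise ValueError(f"{len(kept)} non-empty cells, expected at most {width}")
    return kept + [""] * (width - len(kept))


# Hypothetical misaligned row: two columns merged, plus stray empty cells.
raw_row = ["Acme Corp   Boston", "", "$1,200", ""]

cells = text_to_columns(raw_row[0]) + raw_row[1:]
fixed = shift_left(cells, width=3)
fixed[2] = find_and_replace(fixed[2])
print(fixed)
```

Shifting left is only safe when the empty cells are conversion artifacts. If a row is legitimately missing a value, this would move later values into the wrong column, so spot-check the result against the PDF.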

Two tips and a warning for checking the integrity of your newly converted data

Does your original PDF file contain a row with the sum of each column? Sum your converted data for each column to make sure it matches the figures in that row.
Use column sorting liberally. It'll help you quickly get rid of blank rows and other artifacts that tend to accompany data converted from PDFs.
Beware errant text (e.g. page numbers, header rows that repeat on every page, rows of summary data) in the PDF. That stuff can easily sneak into your data columns and seriously trip up your analysis.
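The checks above can also be run outside Excel. The following Python sketch is a minimal example under stated assumptions: the data is invented, the helper names `clean_rows` and `column_sum` are hypothetical, and it assumes a lone number on a row is a page number and that the last remaining row is the PDF's total row. It drops blank rows, repeated header rows, and page numbers, then compares the column sum with the reported total.

```python
import csv
import io

# Invented sample of converted data, including typical PDF artifacts.
converted = """Department,Employees
Police,"2,100"
Fire,"1,500"
,
2
Department,Employees
Schools,"4,400"
Total,"8,000"
"""


def clean_rows(rows, header):
    """Drop blank rows, repeated header rows, and lone page numbers."""
    cleaned = []
    for row in rows:
        cells = [c.strip() for c in row]
        non_empty = [c for c in cells if c]
        if not non_empty:
            continue  # blank row
        if cells == header:
            continue  # header row repeated from a later page
        if len(non_empty) == 1 and non_empty[0].isdigit():
            continue  # assumed to be a page number
        cleaned.append(cells)
    return cleaned


def column_sum(rows, index):
    """Sum one column, ignoring thousands separators."""
    return sum(int(row[index].replace(",", "")) for row in rows)


rows = list(csv.reader(io.StringIO(converted)))
header, body = rows[0], clean_rows(rows[1:], rows[0])

# Assumption: the last row is the summary row from the PDF.
data, total_row = body[:-1], body[-1]
reported_total = int(total_row[1].replace(",", ""))
matches = column_sum(data, 1) == reported_total
print("Sum matches reported total:", matches)
```

Keeping the total row out of `data` matters: if it stays in, every sum you compute afterward is doubled.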

JoeYerardi/data-plus-narrative-2019-data-from-documents

Folders and files

Latest commit

History

Repository files navigation

Data + Narrative 2019 Liberating Data from Documents: Free and Reliable Methods

Sublime Text

Tabula

Other tools

Cometdocs

Excalibur

Xpdf

Handling misaligned columns in Excel

Two tips and a warning for checking the integrity of your newly-converted data

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages