Automated data extraction from U.S. state Comprehensive Annual Financial Reports (CAFR).
- taxonomy: The XBRL Taxonomy
- templates: Where table templates (explained below) should be located. Any .txt files in this directory will be loaded automatically by miner.py.
- data: The state CAFR files we are using as inputs.
- example-output: Some pre-generated examples of the output this system produces, in XBRL-XML format, CSV format, and MS-Excel format.
- analysis: Various explorations performed around XBRL, CAFR, and PDF conversions. Everything in this folder is purely documented thought process. Nothing in here is used by the system.
- results: This folder should be empty in the repository. It is where miner.py places its output.
First-time setup (Linux)
You'll need a basic Python environment and the ability to check out this code repository, so you'll need at least the following packages:
- python (specifically python2)
Depending on your OS distribution of Python, you may need to manually
install setuptools. See the setuptools documentation for info; on Debian
GNU/Linux and probably Ubuntu, you can run
`sudo apt-get update && sudo apt-get install python-setuptools`.
- pip, to facilitate the rest of the dependency installation: `sudo easy_install pip`
(If your distribution uses Python 3 by default, you may need to adjust these commands accordingly.)
Now install some Python tools that will help you bootstrap the Python environment.
    sudo pip install -UI setuptools pip virtualenv
Pick a place to store the repo. I usually put projects in a Code
subdirectory in my home folder, but you can adjust this accordingly.
cd into that directory (e.g., `cd ~/Code`). Then:
    git clone https://github.com/OpenTechStrategies/cafr-parsing
    virtualenv cafr-parsing
(If your distribution uses Python 3 by default, you'll need to change the
virtualenv line to be
`virtualenv -p /usr/bin/python2 cafr-parsing` or something
along those lines.)
cd into the cafr-parsing repo and "activate" this environment. Then,
using pip, we'll install all the Python libraries defined in the
requirements.txt file. (This is sort of like a Ruby Gemfile.)

    cd cafr-parsing
    source bin/activate
    pip install -r requirements.txt
You can ensure that the virtual environment is using an isolated version of Python 2:
    which python
    python --version
miner.py parses CAFR files, which are PDF documents, and produces JSON files, which can be automatically translated to other formats easily (e.g., XBRL, CSV, .xlsx).
In order to know which tables to extract from the PDF files and what their fields mean,
miner.py must be supplied with a "template" for each table: a manually-constructed JSON file that tells
miner.py exactly how to recognize that table and how to map the data in the table to the desired output fields.
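The template schema itself isn't documented here; templates/AL_statewidenetassets.txt is a real example worth reading. Purely as an invented illustration of the general idea — none of the keys below are the actual schema — a template pairs text that identifies a table with a mapping from its rows to the desired output fields:

```python
# Invented illustration only: these keys are NOT the real template schema
# used by miner.py; see the actual files in templates/ for that.
example_template = {
    "title": "Statement of Net Assets",   # text used to recognize the table in the PDF
    "fields": {                           # maps row labels to desired output field names
        "Total assets": "TotalAssets",
        "Total liabilities": "TotalLiabilities",
    },
}
```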
For now, the invocation process is just to open up
miner.py in a text editor and add calls to the end like this:
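Purely as an illustrative pseudocode sketch — the function name and signature here are invented, so check the existing calls at the bottom of miner.py for the real pattern — each appended call pairs one input CAFR PDF with one table template:

```python
# Hypothetical sketch: "mine_table" is an invented name, not miner.py's real API.
mine_table("data/AL_cafr2011.pdf", "templates/AL_statewidenetassets.txt")
```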
Once you've set up as many calls as you want, run miner.py (assuming
you've already done the setup steps listed above in the "First-time
setup" section):
    $ python miner.py
It may take a while to run, possibly minutes. When it's done, the results will be in the
results/ directory. There will be one result file for each table for a given state CAFR in a given year.
- data/AL_cafr2011.pdf is an example CAFR input file
- templates/AL_statewidenetassets.txt is an example template file
- results/AL_cafr2011-statewide_net_assets.xml is an example output file (this won't exist until you run miner.py)
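Judging from the example file names above, each output name appears to combine the input PDF's base name with the table name. A small sketch of that apparent convention (this is an inference from the examples, not code taken from miner.py):

```python
import os

def result_path(pdf_path, table_name):
    """Illustrative guess at the naming convention suggested by the
    examples above: results/<pdf base name>-<table name>.xml"""
    base = os.path.splitext(os.path.basename(pdf_path))[0]  # e.g. "AL_cafr2011"
    return os.path.join("results", "%s-%s.xml" % (base, table_name))

print(result_path("data/AL_cafr2011.pdf", "statewide_net_assets"))
```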
XBRL is not particularly useful to humans without software to render the content. Example CSV output, created by exporting the XBRL output to CSV with an XBRL viewer, can be found in the
example-output/csv directory. Even more readable examples in XLSX format are located in the
example-output/xlsx directory. Note when reading the CSV and XLSX files that the columns may appear in a different order than they do in the original PDFs.
Alternatively, examples of XBRL output can be found in the example-output directory.
To view the XBRL directly:
- Download and install an XBRL viewer.
- Copy the taxonomy files (located in the taxonomy directory) into a working folder of your choosing.
- Copy the results files (located in the results directory) into that working folder.
- Open the results files in the XBRL viewer.
These are resources that were helpful while exploring:
- Basic information on using pdfminer.
- A more complete example of a pdfminer parser.
- Open Source XBRL Editor
XBRL Taxonomy Information
- 10 common mistakes in creating xbrl taxonomies
- Modeling Business Information Using XBRL
- High Level XBRL components
- XBRL Style Guide
- Taxonomy Documentation
- Taxonomy Examples
- XBRL in Plain English
TODO
- There are dozens of TODO flags scattered throughout the code base. Some are minor, some are major.
- Continued refinement of the taxonomy structure.
- Creation of additional templates for additional states.
- Tools to assist in template generation.
- Tools to run the miner from the command line.
- Validation tools to identify when a template no longer matches the schema.
- Think about whether this ties in with http://open-data-standards.github.io/ (and thence https://github.com/open-data-standards) in any useful way.