
Input Processing


Overview

Converting the raw data provided by a government organization into a form that can be used in the GBE platform is the biggest challenge in making it easy to set up a budget site.

Our basic approach here is to design a set of "canonical" CSV formats that we can import into the platform and then to build tools to convert raw files into those canonical formats.

The Tool

We have implemented our data conversion utilities by adding a gb:process command to Laravel's artisan command-line tool. The gb:process command takes an input file, an output file and a JSON instruction file. Examples of both the data files and the instruction files can be found in the sample_data directory in the repository.
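
Assuming the three arguments are positional and given in the order just described (the file names below are illustrative, not actual repository files), a typical invocation would look something like:

    php artisan gb:process data/avl_budget.csv canonical/avl_budget.csv avl_budget.json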

The instruction file consists of a header-specification (the exact header line to be written out to the final output file) and an array of "instructions" that are applied to the input file. The instructions form a pipeline, i.e., the output of one instruction becomes the input of the next.

All of the instructions are line-oriented, filtering or transforming the incoming lines into a set of output lines (although certain instructions can operate on multiple lines at once).

Each instruction consists of the name of a command to apply, a specification to be used by the command, and a run-flag that allows the command's operation to be turned off or on (handy primarily for debugging). Most of the commands make use of regular expressions to match lines or parts of lines in order to select or transform them.
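
As a concrete illustration, a minimal instruction file might look like the sketch below. The field names used here (header, instructions, command, spec, run) are assumptions for the sake of illustration; the authoritative schema is whatever the example files in sample_data use.

    {
      "header": "account,amount,category",
      "instructions": [
        {
          "command": "select",
          "spec": "/^\\d/",
          "run": true
        },
        {
          "command": "pluckColumns",
          "spec": [0, 2, 1],
          "run": true
        }
      ]
    }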

Commands

The following commands have been implemented:

Select

Select uses the given regular expression to select lines of input that are to be passed through to the output.
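
For instance, reusing the assumed field names from the sketch above, a Select instruction that keeps only lines starting with a digit (e.g. lines that begin with an account code) might be:

    {
      "command": "select",
      "spec": "/^\\d/",
      "run": true
    }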

Discard

Discard uses the given regular expression to select lines of input that are to be dropped from the output.
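
In the same assumed notation, a Discard instruction that drops blank lines and page-header lines might look like (the pattern is illustrative):

    {
      "command": "discard",
      "spec": "/^\\s*$|^Page \\d+/",
      "run": true
    }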

Transform

Transform uses the given regular expression pattern to select lines of input and then applies the replacement pattern to generate the transformed output.
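
A Transform presumably needs both a match pattern and a replacement; the exact shape of its specification is an assumption here. A sketch that splits a hyphenated account string into two comma-separated columns:

    {
      "command": "transform",
      "spec": {
        "pattern": "/^(\\d{3})-(\\d{4}),/",
        "replacement": "$1,$2,"
      },
      "run": true
    }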

Merge

Merge uses the given regular expression pattern to select a line together with a specified number of lines before and after it. A select specification (not the same as the Select command) then details which of those lines are to be merged and in what order, joined using a specified "glue" string. Merge is most useful when combined with transforms, as described next.
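
Sketching what such a specification might contain (every field name below is an assumption): a rule that matches a "Total" line, pulls in the one line before it, and joins the two with a comma could look like:

    {
      "command": "merge",
      "spec": {
        "pattern": "/^Total/",
        "before": 1,
        "after": 0,
        "select": [0, 1],
        "glue": ","
      },
      "run": true
    }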

MergeTransform

MergeTransform combines the merge command with a set of transformations to be applied to each of the selected lines before they are merged.
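
Continuing the same assumed notation, a MergeTransform might attach a list of per-line transformations to the merge specification, for example stripping a "Fund:" label and trailing whitespace before joining:

    {
      "command": "mergeTransform",
      "spec": {
        "pattern": "/^Fund:/",
        "before": 0,
        "after": 1,
        "select": [0, 1],
        "glue": ",",
        "transforms": [
          { "pattern": "/^Fund:\\s*/", "replacement": "" },
          { "pattern": "/\\s+$/", "replacement": "" }
        ]
      },
      "run": true
    }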

PluckColumns

PluckColumns specifies which columns of each incoming CSV line are to be retained and in what order they are to be output.
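
For example (whether columns are zero-indexed is an assumption), an instruction that keeps columns 2, 0, and 1 of each CSV line, in that order:

    {
      "command": "pluckColumns",
      "spec": [2, 0, 1],
      "run": true
    }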

Examples

There are several example instruction files in the Asheville sample_data directory. The processAll.sh Bash script shows all the commands used to transform the various raw data files in the data directory into their canonical formats in the canonical directory. The instruction files also contain comments that further document how the processing commands work.

The avl_actual.json instruction file is probably the easiest to understand. It selects lines containing budget entries, performs a transformation on them to break up the account-string into a set of account and category columns, and then plucks and reorders the columns into the needed form.
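
The overall shape of such a pipeline, with illustrative patterns and column indices rather than the literal contents of avl_actual.json, might be:

    {
      "header": "account,amount,department,fund",
      "instructions": [
        { "command": "select", "spec": "/^\\d/", "run": true },
        {
          "command": "transform",
          "spec": {
            "pattern": "/^(\\d{3})-(\\d{2})-(\\d{4}),/",
            "replacement": "$1,$2,$3,"
          },
          "run": true
        },
        { "command": "pluckColumns", "spec": [2, 3, 0, 1], "run": true }
      ]
    }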

The extract_accounts.json file is similar, except that it performs a series of transformations to label accounts by type (expense, revenue, etc.).

The avl_budget.json file makes use of MergeTransform since the information we need is actually spread across multiple lines.

Finally, a number of files are processed using the null.json instruction file, which contains no instructions. This simply runs the original file through the processor, doing nothing to the lines except replacing the end-of-line characters (which helps when viewing files that were created on other operating systems, e.g. if you're on a Mac).

The Canonical Formats

For now we will keep the canonical formats very simple. Most are obvious from viewing the files in the Asheville canonical directory. The budget and actual spending data files deserve a brief comment. They consist of a header row (which is ignored) followed by an arbitrary number of data rows; a sketch follows the list below. Each data row consists of:

  • Account Code
  • Amount
  • One or more category code columns, in the order established by the dataset definition
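
A minimal, hypothetical file in this format (the category columns and all values are invented for illustration) might look like:

    Account Code,Amount,Department,Fund
    1001,125000.00,PARKS,GENERAL
    1002,38000.50,POLICE,GENERAL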

Possible Future Format

The canonical CSV format for a set of data items consists of a required header row, optional parameter rows, and an arbitrary number of data rows. Any row after the header can contain a new set of parameters; in particular, the second row can set parameters for the rest of the file when they differ from the defaults.

No parameter row is required; if none is present, the default format above is assumed. If a parameter row exists, then the following more general format is allowed:

  • Account Code
  • nPeriods Data item columns (column labels specify period)
  • nCategories Category Code columns (column labels specify category type)
  • Arbitrary number of additional-properties columns that will be stored as name:value pairs, with the name given by the column header.

Parameter rows have two columns. The first contains the string "#GBE#"; the second contains a JSON string with name:value pairs setting various parameters used in processing the file. For now the main parameters are nCategories and nPeriods. Later we will probably add parameters such as whether unknown accounts or categories should be created, ignored, or treated as errors.

A parameter row may occur at any point in the file and will change the settings for all subsequent rows.
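
A hypothetical file in this more general format might begin as follows. Here a parameter row on the second line declares two period columns and two category columns, leaving one additional-properties column; all names and values are invented, and note the doubled quotes needed to embed a JSON string in a CSV field:

    Account Code,FY2014,FY2015,Department,Fund,Notes
    #GBE#,"{""nPeriods"": 2, ""nCategories"": 2}"
    1001,120000.00,125000.00,PARKS,GENERAL,Includes one-time grant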