
Input Checker


Background

Activity-based travel models rely on data from a variety of sources (zonal data, highway networks, transit networks, synthetic population, etc.). A problem in any of these inputs can affect the accuracy of model outputs and/or can result in run-time errors during the model run. It is very important that the analyst carefully prepare and review all inputs prior to running the model. However, even with the best of efforts, errors in input data sometimes remain undetected. In order to aid the analyst in the input checking process, an automated Input Checker Tool was developed for use with the AB Model. The following sections describe the setup and application of this tool:

Input Checker Tool Implementation

The Input Checker Tool was implemented in Python and makes heavy use of the pandas and numpy packages. The input checker was integrated into the overall SANDAG AB model as an Emme tool. Specifically, the input_checker.py Python script is called by the master_run.py Python script. The main inputs to the input checker are a list of ABM inputs, a list of QA/QC checks to be performed on these inputs, and the actual AB Model input files. All inputs are read and loaded into memory as pandas DataFrames (2-dimensional data tables). The input checks are specified by the user as pandas expressions, which the input checker evaluates against the input DataFrames. The input checker generates a summary of checks file summarizing the results of all of the input checks.

When creating a scenario folder, an input checker (input_checker) directory is created within the scenario folder. The input_checker directory initially contains only a config directory with two configuration files: a list of inputs and a list of checks. After the first successful input checker run, a summary of checks text file is created within the input_checker directory.

Process Overview

The input checker executes the following steps:

1. Read Inputs:

First, the input checker reads all the inputs specified in the list of inputs and loads them into memory as pandas DataFrames.
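As a rough illustration, the loading step might be sketched as follows (resolve_path is a hypothetical stand-in for the properties-file lookup, and the real tool also handles DBF files and Emme network exports):

import pandas as pd

# hypothetical helper standing in for the properties-file lookup
def resolve_path(property_token):
    return property_token + '.csv'

# Read the list of inputs; skip commented-out rows (names starting with '#')
inputs_list = pd.read_csv('config/inputs_list.csv')
inputs_list = inputs_list[~inputs_list['Input_Table'].str.startswith('#')]

# Load each CSV input as a DataFrame keyed by its table name
tables = {}
for row in inputs_list.itertuples():
    tables[row.Input_Table] = pd.read_csv(resolve_path(row.Property_Token))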

2. Run Checks:

Then, the list of input checks is read. The input checker loops through the list of input checks and evaluates each check as either True (passed) or False (failed). The result of each check is sent to the summary module. The user must specify the severity level of each check as Fatal, Logical, or Warning.
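A minimal sketch of this loop, assuming the inputs have already been loaded into a dictionary of DataFrames and the checks into a DataFrame (all names and values are illustrative):

import pandas as pd

tables = {'households': pd.DataFrame({'hhid': [1, 2], 'persons': [2, 0]})}
checks = pd.DataFrame([{'Test': 'persons_positive', 'Type': 'Test', 'Severity': 'Logical',
                        'Expression': 'households.persons > 0'}])

for check in checks.itertuples():
    # Evaluate the pandas expression with the input DataFrames in scope
    result = eval(check.Expression, {}, tables)
    if check.Type == 'Calculation':
        tables[check.Test] = result  # store for use by subsequent expressions
    else:
        passed = result.all() if hasattr(result, 'all') else bool(result)
        print(check.Test, check.Severity, 'PASSED' if passed else 'FAILED')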

3. Generate Summary of Checks File(s):

Next, a summary of checks file is generated. The summary file reports the number of failed checks per severity level and a summary of each check's result. Failed checks are listed first, ordered by the severity level specified for the test. In addition to the general summary file, an additional summary file is created for failed checks of severity level Logical that resulted in more than 25 failed records. This additional file contains the complete list of failed records and is written to the same input_checker directory. If there are no Logical failures, or none with more than 25 failed records, the additional file is not created.

4. Check for Logical Errors:

The next step is to check for any logical errors. When generating the summary file, the input checker keeps track of the number of failed checks per severity level. If any check with a severity level of Logical failed, a prompt window appears notifying the user which logical check failed and advising that the summary file be consulted for more details. The user must then decide whether to continue (by clicking OK) or terminate the input checker (by clicking Cancel). When choosing OK, the prompt window reappears for each remaining logical check failure. When choosing Cancel, if the input checker was run via the stand-alone tool, the input checker terminates; if it was run via the Master Run GUI, the entire model run terminates.

5. Check for Fatal Errors:

The final step is to check for any fatal errors. When generating the summary file, the input checker keeps track of the number of failed checks per severity level. If any check with a severity level of Fatal failed, the model run is terminated and the user is directed to the summary file for further details.

Configuring the Input Checker

Configuring the input checker involves specifying both the inputs and the checks to be performed on them. This section describes the configuration details of the two settings files: config/inputs_list.csv and config/checks_list.csv.

Specifying Inputs

Inputs on which QA/QC checks are to be performed are specified in the config/inputs_list.csv file. Each row in inputs_list.csv represents an ABM input. The attributes that the user must specify for each input are described in the table below:

Attribute Description
Input_Table The name of the input table. The inputs are loaded into the input checker's memory as DataFrames under this name.
Property_Token The input's property token as listed in the SANDAG ABM properties file. For each input's property token, the input checker looks up the corresponding file path within the properties file. For Emme objects, this field should be specified as 'NA'.
Emme_Object The name of the Emme network object whose attributes must be exported. Must be specified as 'NA' for non-Emme network objects (i.e. CSV or DBF). The input checker is currently capable of reading the following Emme network objects: NODE, CENTROID, LINK, TRANSIT_LINE, and TRANSIT_SEGMENT.
Fields The list of attributes to be exported from the Emme network object. Refer to the Network Object Attributes page for a complete list of attributes per Emme network object. If all Emme network object attributes are desired, the user must specify 'All' for this field. All fields are read for CSV and DBF inputs.
Column_Map A mapping of new column names can be specified if some columns must be renamed for easier reference.
Input_Description The description of the input file.
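For illustration, an inputs_list.csv row for a CSV input and one for an Emme network object might look like the following (the values are hypothetical; consult the shipped configuration file for the actual entries):

Input_Table,Property_Token,Emme_Object,Fields,Column_Map,Input_Description
households,PopulationSynthesizer.InputToCTRAMP.HouseholdFile,NA,NA,NA,Synthetic population household file
hwylinks,NA,LINK,All,NA,Highway network link attributes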

All non-Emme network object inputs must be in either CSV or DBF format. For the Emme-based SANDAG ABM network inputs, an Emme database (i.e. Emmebank) contains the combined traffic and transit networks along with their associated attributes (refer to the Setup and Configuration page for more information on the Emme database). Since the input checker is called from an open Emme Modeller instance, the input checker tool has full access to the Emme database, scenarios, and network. The input checker loads the Emme database, base scenario, and network, and then obtains the attributes of the specified Emme network objects. The user must specify each input either as an Emme object (e.g. LINK) or as a CSV or DBF file in the inputs or uec sub-directories. The CSV inputs are read into memory from the specified sub-directory. If defined, columns are renamed as per the user's specifications in the Column_Map column.
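As a simplified illustration of the Emme side, the export conceptually resembles the following sketch using the Emme Modeller and Network APIs (the scenario number and attribute selection are hypothetical, and the tool's actual code is more involved):

import pandas as pd
import inro.modeller as _m

# Access the Emme database through the active Modeller instance
modeller = _m.Modeller()
scenario = modeller.emmebank.scenario(100)  # hypothetical base scenario number
network = scenario.get_network()

# Collect selected link attributes into a DataFrame
rows = [{'i_node': link.i_node.number, 'j_node': link.j_node.number, 'length': link.length}
        for link in network.links()]
links = pd.DataFrame(rows)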

The user has the option to comment out inputs that should not be loaded. To comment out a line in inputs_list.csv, add a "#" in front of the table name. All inputs whose table name starts with a "#" are ignored by the input checker.

Specifying Checks

The QA/QC checks to be performed on the ABM inputs are specified in the config/checks_list.csv file. Each row in checks_list.csv represents a specific operation to be performed on a specific input listed in inputs_list.csv. The listed operations are evaluated from top to bottom. Each operation is classified as either a Test or a Calculation. For Test operations, the pandas expression is evaluated and the result is sent to the summary module of the input checker. For Calculation operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. Additionally, if a Calculation operation results in a new pandas DataFrame object (e.g. a subset of an input DataFrame), the new/modified DataFrame is stored in memory. Users may use this new DataFrame to conduct input checks; however, there is a limitation in that the new DataFrame may not itself be further modified. The table below describes the various tokens that users must specify for each Test or Calculation operation:

Attribute Description
Test The name of the QA/QC check. The check results are referenced using this name in the summary file. For Calculation operations, this becomes the name of the resulting object.
Input_Table The name of the input table on which the check is to be performed. The name must match the name specified under the Input_Table field in inputs_list.csv.
Input_ID_Column The name of the unique ID column. This serves as the input table's (i.e. DataFrame) row index by which checks and results are carried out and stored, respectively.
Severity The severity level of the test: Fatal, Logical, or Warning.
Type The type of operation: Test or Calculation.
Expression The pandas expression to be evaluated.
Test_Vals A list of values on which the test needs to be repeated. The list must be comma-separated. The test for each value is summarized separately.
Report_Statistics Any additional statistics from the test that must be reported to the summary file.
Test_Description The description of the check that is being performed.
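For illustration, the boundary check shown later under Example Expressions could be specified as a row like this (the values are hypothetical; consult the shipped checks_list.csv for the actual entries):

Test,Input_Table,Input_ID_Column,Severity,Type,Expression,Test_Vals,Report_Statistics,Test_Description
persons_positive,households,hhid,Logical,Test,households.persons > 0,,,Household size must be greater than zero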

Severity Levels

An important step in specifying checks is assigning a severity level to each check. The input checker allows the user to assign one of three severity levels to each QA/QC check: Fatal, Logical, or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:

Fatal

If a Fatal check fails, the input checker returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, the analyst should reserve the Fatal severity level for checks that must pass in order for a model run to proceed.
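In effect, the fatal-error handling reduces to something like the following sketch (the tally variable is illustrative):

import sys

fatal_failures = 1  # illustrative tally of failed Fatal checks
if fatal_failures > 0:
    sys.exit(2)  # non-zero exit code signals the main ABM procedure to halt the run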

Logical

The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in inputs, the ABM outputs may not be very meaningful.

Warning

The failure of a Warning check indicates an issue in the input data that is not significant enough to cause a run-time error or to affect model outputs. However, these checks might reveal other problems related to data processing or data quality.

Expressions

At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test expression must evaluate to a single logical value (TRUE or FALSE) or a vector of logical values. Therefore, the Test expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.). The length of the result vector must be equal to the length of the input on which the check was performed. The result of a Calculation expression can be any Python data type to be used by a subsequent expression.

The success or failure of a check is decided based on the test result. In case of a single value result, the check fails if the result is FALSE. In case of a vector result, the test is declared as failed if any value in the vector is FALSE. Therefore, the expression must be designed to evaluate to TRUE if there are no problems in the input data.
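To illustrate the two result shapes, consider a minimal persons table (the field names follow the Predefined Value Checks example below; perid is illustrative):

import pandas as pd

persons = pd.DataFrame({'perid': [1, 2], 'pemploy': [1, 9]})

scalar_result = set(persons.pemploy) == {1, 2, 3, 4}  # single logical value: FALSE
vector_result = persons.pemploy.isin([1, 2, 3, 4])    # one logical value per record
check_passed = vector_result.all()                    # FAILED, since record 2 is False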

Conventions for Writing Expressions

Rules and conventions for writing the input checker expressions are summarized below:

  • Each expression must be a valid Python/pandas expression
  • Expressions must be designed to evaluate to FALSE to indicate any errors in the data
  • Each expression must evaluate to logical value(s) (i.e. TRUE or FALSE)
  • Each expression must be applied to a valid input table specified in inputs_list.csv or make use of intermediate tables created by preceding Calculation expressions
  • Expressions must use the same table names as specified in inputs_list.csv or the Test name of the Calculation object
  • Expressions must use the same field names as specified in inputs_list.csv. If a column map was specified, then the new names must be used
  • Expressions can be looped over a list of Test_Vals to reduce the number of expressions
  • The Report_Statistics expression must also be a valid Python/pandas expression and must evaluate to a single numeric value
  • Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with a "#" are ignored by the input checker

Example Expressions

Below are some example expressions for different types of checks:

Data Completeness Checks

Check if the household income field exists in the input synthetic population.
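A plausible expression (the income field name hinc is an assumption about the synthetic population schema) is:

'hinc' in households.columns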

To perform this check for multiple fields, the expression can be written once and the list of field names specified under the Test_Vals token (separated by commas).
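As an assumed illustration, if the checker substitutes each comma-separated value from Test_Vals into the expression via a placeholder, the check might look like:

'{}' in households.columns

with Test_Vals specified as, e.g., hinc,hworkers (both field names are assumptions; the exact placeholder convention is defined by the tool's implementation, so consult the shipped checks_list.csv for the syntax actually used).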

Boundary Checks

Check if number of household inhabitants ('persons') is greater than zero for each household:

households.persons > 0

Predefined Value Checks

Check if each person's employment status ('pemploy') matches the predefined employment status categories (1,2,3,4):

persons.pemploy.apply(lambda x: True if x in [1,2,3,4] else False)

It is possible that all person records pass the above test but one of the employment status categories may not have a single person record. To check for such cases, the following expression can be used:

set(persons.pemploy)=={1,2,3,4}

Consistency Check

Check if total employment across occupation categories sums to total employment for each Master Geographic Reference Area (MGRA). Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a Calculation expression:

mgra_data[[col for col in mgra_data if (col.startswith('emp')) and not (col.endswith('total'))]].sum(axis=1)

The result of the above expression is an MGRA-level vector stored under the Calculation's Test name, mgra_total_employment. Next, the total employment field can be compared against mgra_total_employment:

mgra_data.emp_total==mgra_total_employment

Other Checks

Check if household IDs start from 1 and are sequential (assuming IDs are unique, this is equivalent to the minimum ID being 1 and the maximum ID equaling the record count):

(min(households.hhid)==1) & (max(households.hhid)==len(households.hhid))

Logical Checks

To ensure that ABM outputs are meaningful, it is important to perform logical checks on input data. One such check is to compare the summation of the employment land-use category fields per MGRA with the total employment field per MGRA. These values should match exactly. For this check, first the summation of workers per land-use category is calculated per MGRA. Then, the summation value is compared against the emp_total (i.e. Total Employment for MGRA) field. This can be achieved with a Calculation operation and a subsequent Test operation whose expression utilizes the calculated value, as shown below:
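For example, reusing the expressions from the Consistency Check above, the Calculation operation (with Test name mgra_total_employment) would be:

mgra_data[[col for col in mgra_data if (col.startswith('emp')) and not (col.endswith('total'))]].sum(axis=1)

followed by the Test operation:

mgra_data.emp_total==mgra_total_employment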

Network Checks

While most of the above checks apply to link and node level attributes, some checks might be unique to other network objects such as transit routes. In Emme, transit line segment names must be unique. This requires performing a check on transit line segment data as follows:

len(set(transit_segments.id)) == len(transit_segments.id)

The design of network level checks will depend on the transportation modeling software being used.

Running the Input Checker

Since the input checker is integrated as an Emme tool, users may decide to run or skip the input checker prior to running the core SANDAG AB model via the Master run interface (as described in Run the Model). Alternatively, users may launch the input checker independently by running it via its stand-alone Emme tool (SANDAG toolbox > Import > Input checker). To launch the tool, users simply click Run on the input checker GUI.

It should be noted that if users wish to run network-related checks, the import_network.py AB model step must be run first. Under the Master run interface, this model step corresponds to the Skip build of highway and transit network checkbox.

Analyzing the Input Checker Summary File(s)

The final outputs from the input checker are a summary file and, potentially, a logical failures summary file, both written directly to the input checker directory. The summary file is named inputCheckerSummary[YEAR-MONTH-DAY].txt and the logical failures summary file is named completeLogicalFails_[YEAR-MONTH-DAY].txt. The summary files can be opened using any text editor. The results of all checks are summarized in the inputCheckerSummary file, while the completeLogicalFails file contains the complete list of failed records for failed checks of severity level Logical that resulted in more than 25 failed records. The following sections describe the organization and details of the summary files.

Organization

The inputCheckerSummary file summarizes the results of all checks. However, the order in which they are presented depends upon the severity level and the outcome of each check. The completeLogicalFails file summarizes all failed checks of severity level Logical. Both files provide essentially the same information for each check, except that the completeLogicalFails file focuses on providing the complete list of failed records for failed logical checks.

The input checker organizes the check results under the following headings:

  • TALLY OF FAILED CHECKS: A tally of all failed checks per severity level
  • IMMEDIATE ACTION REQUIRED: All failed FATAL checks are summarized under this heading
  • ACTION REQUIRED: All failed LOGICAL checks are summarized under this heading
  • WARNINGS: All failed WARNING checks are summarized under this heading
  • SUMMARY OF ALL PASSED CHECKS: A complete summary of all passed checks

Check Summary

A check summary is generated for each check. The table below shows the elements of a check summary:

Attribute Description
Input File Name The name of the input file on which the check was evaluated
Input File Location Path to the location of the input file. Not applicable to Emme Objects.
Emme Object The name of the Emme object, if applicable
Input Description The description of the input as specified in inputs_list.csv
Test Name The name of the test as specified in checks_list.csv
Test Description The description of the test
Test Severity The severity level of the test
TEST RESULT The result of the test: PASSED or FAILED
Test result for each test value The test result for each value in Test_Vals on which the test was repeated
Test Statistics The value of the expression specified under the Report_Statistics token of checks_list.csv. For the inputCheckerSummary file, the first 25 values are printed in case of vector results. For the completeLogicalFails file, all values are printed in case of vector results.
ID Column The name of the unique ID column of the input data table
List of failed IDs For the inputCheckerSummary file, the first 25 IDs for which the test failed. For the completeLogicalFails file, the complete list of IDs for which the logical test failed. This is generated in case of vector results.
Number of failures The total number of failures in case of a vector result


Next Section: Properties File
