Metric Calculation

Criminal Justice Metrics

As discussed previously, our ingest system is capable of transforming raw, arbitrary data about the criminal justice system into organized, normalized information in a common format. However, that is of little use if we don't then analyze that information to understand the performance and health of our criminal justice system.

There is an enormous wealth of high quality research in criminal justice, but the space has severe challenges. Virtually every bit of research behind reform requires enormous levels of effort to acquire, prepare, and analyze data. Even having done so, the output of the research is all too often limited in terms of potential impact. Just as there has historically been no common schema for justice data, there has been little standardization of metric definitions and methodologies. As a result, reproducing results remains a challenge, and comparisons across jurisdictions or agencies can be waved away, reducing accountability.

Combined with the barriers to data access and collection, these analytical barriers have made it difficult to produce analysis that has depth, breadth, and timeliness.

Our calculation system strives to reduce those barriers by providing common baseline metrics and performance indicators and the tooling for sophisticated analysis, built atop our common schema.

Calculation Channels

There are currently three types of channels for calculation: batch processing, query processing, and manual exploration. These are described in their respective wiki pages. The results of all of these calculations are stored in our data warehouse.

  • Batch processing - via Cloud Dataflow, batch jobs can be executed which perform complex logic on entities and entity graphs to identify and measure certain events within the justice system. For example, recidivism measurement requires looking at the full history of a person's interactions with the correctional and supervision systems. These jobs read individual-level data exported from our database to our data warehouse, and write metrics back into the data warehouse where they can be directly consumed, or joined with query results.
  • Query processing - BigQuery, our data warehouse, supports the registration of views, which are virtual tables defined by saved SQL queries. For many classes of metrics, a SQL query that joins across some number of tables is sufficient to produce the desired calculations. Views can reference other views and the tables produced by batch processing jobs, providing ample flexibility to share common query logic, produce methodological variants, and build up a common calculation language.
    • Querying a view is much like querying a standard table: it simply executes a SQL query and returns the corresponding result set. These result sets must then be provided to consumers who want to report on the information. This is described in further detail in Data Warehouse.
  • Manual exploration - BigQuery permissions can be granted on specific datasets to specific users, allowing both internal staff and authorized partners to explore the data warehouse with all of the tooling available to BigQuery users, including the BigQuery console, Python/Jupyter notebooks, scripting via R or Python, and more. Virtually all of our calculations that eventually end up in consuming applications begin as manual exploration, and some projects involve a significant initial exploration effort while the trail is still being blazed.
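
To make the batch processing channel more concrete, here is a minimal Python sketch of the kind of per-person logic such a job might apply. It is not our Dataflow pipeline. The IncarcerationPeriod and ReleaseEvent classes, the find_release_events function, and the three-year follow-up window are all assumptions made for illustration, and a real recidivism methodology has to handle many more cases (transfers, supervision violations, and so on).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class IncarcerationPeriod:
    """Hypothetical, simplified entity: one stay in custody."""
    admission_date: date
    release_date: Optional[date]  # None while the person is still in custody


@dataclass(frozen=True)
class ReleaseEvent:
    """Hypothetical output record: one release and whether a return followed."""
    release_date: date
    returned_within_window: bool
    days_to_return: Optional[int]  # only set when a return fell inside the window


def find_release_events(
    periods: List[IncarcerationPeriod],
    follow_up_days: int = 3 * 365,  # assumed follow-up window, not a standard
) -> List[ReleaseEvent]:
    """Walk one person's full history and flag releases followed by a return."""
    ordered = sorted(periods, key=lambda p: p.admission_date)
    events = []
    for i, period in enumerate(ordered):
        if period.release_date is None:
            continue  # still in custody, so there is no release to evaluate
        # The first later admission on or after this release, if there is one.
        next_admission = next(
            (p.admission_date for p in ordered[i + 1:]
             if p.admission_date >= period.release_date),
            None,
        )
        days = None
        if next_admission is not None:
            days = (next_admission - period.release_date).days
        returned = days is not None and days <= follow_up_days
        events.append(
            ReleaseEvent(period.release_date, returned, days if returned else None)
        )
    return events


history = [
    IncarcerationPeriod(date(2010, 1, 1), date(2012, 1, 1)),
    IncarcerationPeriod(date(2013, 1, 1), date(2014, 1, 1)),
    IncarcerationPeriod(date(2020, 1, 1), None),
]
events = find_release_events(history)
```

Each release can only be classified by looking at what came after it in the same person's history, which is why this kind of metric is a better fit for batch logic over entity graphs than for a single SQL join.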

Access Patterns

A wide variety of access patterns is available for the calculations in our data warehouse, given that BigQuery has a wide API surface. At present, we have established a few main access patterns, but expansion is on the way.

  • Programmatic - the BigQuery API has client bindings in most popular languages, including Java, R, and Python. We have used the API directly to power our batch processing jobs, and any authorized partner can read from the warehouse in the same fashion once they have been granted access to a desired dataset. It is likely that at a future point we will host a shim API in front of this for domain-specific calculation retrieval.
  • Direct SQL querying - the BigQuery console provides authenticated users the ability to execute SQL directly in the browser. This is useful for manual exploration and troubleshooting, but is also sufficient for a good number of users whose informational needs are quantitatively simpler.
  • Data exchange - on a case-by-case basis, we may directly transfer data out of BigQuery to a partner through authorized exports of some subset of a desired dataset. This tends to be a one-off operation, but it is conceivable that some future exchange would be built off of periodic exports of datasets to an available data portal.
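
As an illustration of the programmatic pattern, the following Python sketch reads from a view using the google-cloud-bigquery client library. The project, dataset, and view names, the jurisdiction column, and the build_query and fetch_metrics helpers are all hypothetical and do not describe our actual warehouse layout. Running fetch_metrics also assumes that the client library is installed and that the caller has been granted access to the dataset.

```python
import re
from typing import Any, Dict, List

# Permissive identifier check for this sketch only. Table names cannot be
# passed as query parameters, so they are validated before interpolation.
_IDENTIFIER = re.compile(r"[A-Za-z0-9_\-]+")


def build_query(project: str, dataset: str, view: str) -> str:
    """Build a SELECT over a view, filtered by a named query parameter."""
    for part in (project, dataset, view):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    return (
        f"SELECT * FROM `{project}.{dataset}.{view}` "
        "WHERE jurisdiction = @jurisdiction"  # hypothetical column name
    )


def fetch_metrics(
    project: str, dataset: str, view: str, jurisdiction: str
) -> List[Dict[str, Any]]:
    """Run the query and return each result row as a plain dict."""
    # Imported lazily so build_query works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("jurisdiction", "STRING", jurisdiction)
        ]
    )
    sql = build_query(project, dataset, view)
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row) for row in rows]
```

Passing the filter value as a query parameter rather than formatting it into the SQL string keeps user-supplied values out of the query text. The same SQL, with the parameter replaced by a literal, could be pasted into the BigQuery console for direct querying.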