
Harvest O&M and Debugging

Susan Valente edited this page Apr 17, 2026 · 5 revisions

Harvest O&M

To manage Data.gov's harvest system, we maintain the following procedures, tools, guidelines, and cadences. This is a living document; please edit it as new discoveries or updates are made.

Error Handling

Most reported errors (validation, transformation, source/job errors, etc.) are "expected" errors. If there are recurring problems getting data from a source, or if most of a source's records are invalid, raise the issue with the data provider via the team lead.

However, some jobs fail for infrastructure reasons, often related to long run times. These jobs should be examined more carefully. To investigate:

  • Go to the cloud.gov log dashboard. Zoom in to the relevant day and filter the query by the job ID. This shows the time frame in which the job was running. Note that the end time reported on the job may be later than when the task actually stopped doing work.
  • Check the Harvest API (see below) for job counts and record errors for the relevant source and time window.
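
The two steps above can be tied together with a small helper. This is a minimal sketch, not an established tool: it takes the first and last log timestamps observed for the job in the log dashboard and turns them into a facets string for the Harvest API. The function name window_facets and the 5-minute padding are hypothetical, and it assumes the log timestamps and the date_created field use the same time zone.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the facets examples on this page.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def window_facets(first_log, last_log, padding_minutes=5):
    """Build a date_created facets string around a job's observed log activity.

    first_log / last_log are the earliest and latest log timestamps seen for
    the job. The padding (an arbitrary, hypothetical default) widens the
    window slightly so records written at the edges are not missed.
    """
    start = first_log - timedelta(minutes=padding_minutes)
    end = last_log + timedelta(minutes=padding_minutes)
    return (
        f"date_created gt {start.strftime(TIME_FORMAT)}, "
        f"date_created lt {end.strftime(TIME_FORMAT)}"
    )


facets = window_facets(
    datetime(2025, 8, 21, 22, 5, 0),
    datetime(2025, 8, 21, 23, 54, 59),
)
print(facets)
```

The resulting string can be passed as the facets parameter of a harvest_record_errors query. Using the log timestamps rather than the job's reported end time keeps the window aligned with when the task was actually doing work.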

If there are discrepancies in job counts, the sync job should be run immediately (details TBD). After any sync process is complete, evaluate whether a re-harvest is needed based on how far out the next scheduled harvest is and how many changes are expected.

Harvest API

The Harvest API gives us a lot of power to find, classify, and debug various issues. The routes are defined here. Each of our 6 object types (organizations, harvest_sources, harvest_jobs, harvest_records, harvest_job_errors, and harvest_record_errors) is queryable.

Simple options:

  • page: the page number to extract
  • per_page: number of records per page
  • paginate: whether to use pagination (defaults to true)
  • count: return just the count of results, not the results themselves
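
As an illustration of the simple options, the sketch below builds query URLs without sending any requests. The helper name build_list_url is hypothetical, and it assumes the API accepts boolean options as lowercase "true"/"false"; check the route definitions before relying on that.

```python
from urllib.parse import urlencode

# Host taken from the example URLs on this page.
BASE_URL = "https://harvest.data.gov"


def build_list_url(object_type, page=None, per_page=None, paginate=None, count=None):
    """Build a list URL for one of the queryable object types.

    Only options that are explicitly set are added to the query string.
    """
    params = {}
    if page is not None:
        params["page"] = page
    if per_page is not None:
        params["per_page"] = per_page
    # Assumption: booleans are sent as lowercase "true"/"false".
    if paginate is not None:
        params["paginate"] = str(paginate).lower()
    if count is not None:
        params["count"] = str(count).lower()
    query = urlencode(params)
    return f"{BASE_URL}/{object_type}/" + (f"?{query}" if query else "")


print(build_list_url("harvest_jobs", page=2, per_page=50))
print(build_list_url("harvest_record_errors", count=True))
```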

More advanced filtering and ordering:

  • order_by: field(s) to order by
  • facets: custom "where"-style filtering, safe from SQL injection. Supports multiple evaluations, e.g. finding error records created between a start and end time: https://harvest.data.gov/harvest_record_errors/?facets=date_created%20gt%202025-08-21T22:00:00,%20date_created%20lt%202025-08-21T23:59:59
  • Use ilike_op to search for strings: https://harvest.data.gov/harvest_record_errors/?facets=message%20ilike_op%20failed%25

The full list of operators is here.
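
Hand-encoding facets is error-prone, so a small helper can assemble the URL instead. This is a minimal sketch that reproduces the two example URLs above; the helper name build_facets_url and the (field, operator, value) tuple layout are hypothetical conveniences, not part of the API.

```python
from urllib.parse import quote

# Host taken from the example URLs on this page.
BASE_URL = "https://harvest.data.gov"


def build_facets_url(object_type, conditions):
    """Build a facets query URL from (field, operator, value) tuples.

    Conditions are joined with ", " and percent-encoded. Colons and commas
    are left as-is to match the examples on this page; spaces become %20
    and a literal % (the ilike_op wildcard) becomes %25.
    """
    facets = ", ".join(f"{field} {op} {value}" for field, op, value in conditions)
    return f"{BASE_URL}/{object_type}/?facets={quote(facets, safe=':,')}"


# Error records created between a start and end time.
print(build_facets_url("harvest_record_errors", [
    ("date_created", "gt", "2025-08-21T22:00:00"),
    ("date_created", "lt", "2025-08-21T23:59:59"),
]))

# Error records whose message starts with "failed".
print(build_facets_url("harvest_record_errors", [("message", "ilike_op", "failed%")]))
```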

Please add useful query examples as they are discovered or used:
