This repo provides the datasets.json
file, used as "ground truth"
for the knowledge graph work in ADRF and Rich Context.
For a diagram of how this dataset list fits within the overall ETL
workflow used to update the knowledge graph, see the OmniGraffle
source at docs/kg_etl_workflow.graffle
in this repo.
Having a separate repo helps us manage changes carefully. This is metadata, not data, and it serves as the basis for linking. That requires auditing any changes, to avoid breaking links in the graph downstream from an update.
Consequently, each update must be handled through a pull request and audited in a code review.
- work in a separate branch and update from master
- look for other PRs (work in progress) and note the IDs used
- request a range of up to 5 IDs on the rich_context channel on Slack
- make edits in your branch
- confirm through unit tests: python test.py
At that point, create a PR and have someone else on the team review it.
Also, don't commit code here except for consistency checks used on the dataset list itself.
At a minimum, each record in the datasets.json file must have these required fields:
- provider -- name of the data provider in providers.json
- title -- name of the dataset
- id -- a unique sequential identifier
For the names, use what the data provider shows on their web page and try to be as concise as possible.
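The required-field rules above could be checked with something like the following. This is a minimal sketch, not the contents of this repo's test.py. It assumes that datasets.json holds a list of records, that the provider value matches a provider name from providers.json exactly, and that id values are hashable. The function name find_problems and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("provider", "title", "id")


def find_problems(records, provider_names):
    """Return human-readable problems found in a list of dataset records.

    Assumption: `records` is a list of dicts and `provider_names` is a set
    of provider names taken from providers.json.
    """
    problems = []
    seen_ids = set()
    for index, record in enumerate(records):
        # Every record needs all of the required fields.
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {index}: missing required field '{field}'")
        # The provider must be one that providers.json knows about.
        if "provider" in record and record["provider"] not in provider_names:
            problems.append(f"record {index}: unknown provider {record['provider']!r}")
        # Each id may be used only once.
        if "id" in record:
            if record["id"] in seen_ids:
                problems.append(f"record {index}: duplicate id {record['id']!r}")
            seen_ids.add(record["id"])
    return problems


# Hypothetical sample data, for illustration only.
sample_providers = {"Example Provider"}
sample_records = [
    {"provider": "Example Provider", "title": "Example Dataset A", "id": 1},
    {"provider": "Unknown Org", "title": "Example Dataset B", "id": 1},
    {"title": "Example Dataset C", "id": 2},
]

for problem in find_problems(sample_records, sample_providers):
    print(problem)
```

Returning a list of problems, rather than stopping at the first one, lets a reviewer see everything wrong with a PR in a single pass.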
When adding records:
- first, make sure the providers.json entry is correct
- add to the bottom of the file
- increment the id number manually
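The steps for adding a record can be sketched as below. This is an illustration under stated assumptions, not a tool provided by this repo: it assumes the records are a list of dicts and that id values are integers, which the text above does not confirm. The helper names next_id and add_record are hypothetical. IDs are still coordinated by hand as described earlier, so the suggested value is only a starting point.

```python
def next_id(records):
    """Suggest the next sequential id. Assumption: ids are integers."""
    return max((record["id"] for record in records), default=0) + 1


def add_record(records, provider, title, new_id):
    """Return a new list with the record appended at the bottom.

    Raises ValueError if `new_id` is already taken, so that a clash with
    an existing record is caught before the change reaches a PR.
    """
    used_ids = {record["id"] for record in records}
    if new_id in used_ids:
        raise ValueError(f"id {new_id!r} is already in use")
    return records + [{"provider": provider, "title": title, "id": new_id}]


# Hypothetical existing records, for illustration only.
existing = [
    {"provider": "Example Provider", "title": "Example Dataset A", "id": 1},
    {"provider": "Example Provider", "title": "Example Dataset B", "id": 2},
]

updated = add_record(
    existing, "Example Provider", "Example Dataset C", next_id(existing)
)
print(updated[-1])
```

The function returns a new list instead of modifying its input, which keeps the original records available for comparison during review.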
Other fields that may be included:
- alt_title -- list of alternative titles or abbreviations, aka "mentions"
- url -- URL for the main page describing the dataset
- doi -- a unique persistent identifier assigned by the data provider
- alt_ids -- stored as a list, other unique identifiers (alternative DOIs, etc.)
- description -- a brief (tweet sized) text description of the dataset
- date -- date of publication, which may help resolve conflicting identifiers
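As a rough illustration of the optional fields, the sketch below shows a record that uses several of them, together with a simple type check. The list types for alt_title and alt_ids follow the descriptions above. Treating url, doi, description, and date as plain strings is an assumption, as is the date format shown. The record contents and the function name check_optional_fields are hypothetical.

```python
LIST_FIELDS = ("alt_title", "alt_ids")
# Assumption: these optional fields are stored as plain strings.
TEXT_FIELDS = ("url", "doi", "description", "date")


def check_optional_fields(record):
    """Return a list of type problems found in a record's optional fields."""
    problems = []
    for field in LIST_FIELDS:
        if field in record and not isinstance(record[field], list):
            problems.append(f"'{field}' should be a list")
    for field in TEXT_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"'{field}' should be a string")
    return problems


# Hypothetical record, for illustration only.
example_record = {
    "provider": "Example Provider",
    "title": "Example Longitudinal Survey",
    "id": 1,
    "alt_title": ["ELS"],
    "url": "https://example.org/els",
    "description": "Hypothetical survey used only to illustrate the record layout.",
    "date": "2019-01-01",
}

print(check_optional_fields(example_record))
print(check_optional_fields({"alt_title": "ELS"}))
```

Fields that are absent are skipped, since all of them are optional.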
- spot checks on URLs, titles, etc.
- unify naming conventions
- is 'program data' a dataset? revisit after November workshop
- add check for commas within entries
The datasets enumerated in datasets.json
may have additional
metadata, which would be given to us by the data provider or client
using the dataset.
These fields might include (but are not limited to):
- keywords and categories - list of terms associated with the dataset
- geographical coverage - geography that the dataset covers, e.g. New York State, Germany
- temporal coverage - time period of the dataset. If the dataset is regularly released, e.g. the U.S. Census, the value could be 'decennial'
- data steward - person responsible for protecting and sharing the dataset - id should come from data_stewards.json (not yet in existence)
- customer - client or partner who requested that the dataset be entered into our knowledge graph - id should come from customers.json (not yet in existence)
- long_description - longer form description of dataset
- in_adrf - boolean value indicating whether or not the dataset is in the ADRF
- funder - organization (could be the agency) that funded creation or dissemination of the dataset