The name and description fields are required for data sources, but neither is currently captured during labeling and annotation.
Some collectors also pull in additional metadata that can be used to enrich the URL's metadata when it is submitted to data sources.
The logic should therefore be updated to include subtasks that, depending on the batch strategy of the URL in question, extract the relevant metadata for submission to data sources.
The mapping for each strategy is as follows:
- ckan
  - submitted_name -> name
  - description
  - record_format
  - data_portal_type
  - supplying_entity
- auto_googler
  - title -> name
  - snippet -> description
- muckrock_simple_search/county_search/all_search
Ideally, the logic that maps collector URL metadata to data source metadata tolerates missing keys.
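As a rough illustration of what a missing-key-tolerant mapping could look like, here is a minimal sketch. The names `STRATEGY_FIELD_MAP` and `extract_data_source_metadata` are hypothetical and not taken from the existing codebase. It assumes collector metadata arrives as a plain dict, and it leaves out the muckrock strategies because no field mapping is listed for them above.

```python
from typing import Any, Mapping

# Hypothetical lookup: batch strategy -> {collector key: data source field}.
# Only the strategies with a mapping listed above are included; fields without
# an explicit "->" are assumed to keep their own name.
STRATEGY_FIELD_MAP: dict[str, dict[str, str]] = {
    "ckan": {
        "submitted_name": "name",
        "description": "description",
        "record_format": "record_format",
        "data_portal_type": "data_portal_type",
        "supplying_entity": "supplying_entity",
    },
    "auto_googler": {
        "title": "name",
        "snippet": "description",
    },
}


def extract_data_source_metadata(
    strategy: str, collector_metadata: Mapping[str, Any]
) -> dict[str, Any]:
    """Map collector metadata to data source fields, skipping absent keys.

    Strategies with no known mapping yield an empty dict rather than an error.
    Keys that are missing or set to None are left out of the result.
    """
    field_map = STRATEGY_FIELD_MAP.get(strategy, {})
    return {
        target: collector_metadata[source]
        for source, target in field_map.items()
        if collector_metadata.get(source) is not None
    }
```

With this shape, a ckan URL that only has `submitted_name` and `record_format` would produce just `name` and `record_format`, and any keys not in the mapping are ignored.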
In the database, I'm thinking of adding a URL_optional_data_source_metadata table that has a 1:1 relationship with the URL in question.
- This will hold all metadata not required for the data source.
- Each column will represent a different value, complete with validation.
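To make the table idea concrete, here is a small sketch using Python's built-in `sqlite3` module. Everything in it is an assumption for illustration: the `urls` parent table, the lowercase table name, the choice of columns (the three ckan fields that are not name or description), and the use of `CHECK` constraints as one possible form of validation. The real schema and database engine may differ.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical urls table and its optional-metadata companion."""
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- Stand-in for the existing URL table (assumed shape).
        CREATE TABLE urls (
            id  INTEGER PRIMARY KEY,
            url TEXT NOT NULL
        );

        -- url_id is both primary key and foreign key, so each URL can have
        -- at most one metadata row (the 1:1 relationship).
        CREATE TABLE url_optional_data_source_metadata (
            url_id INTEGER PRIMARY KEY
                REFERENCES urls(id) ON DELETE CASCADE,
            -- Every column is nullable; if present, it must not be blank.
            record_format TEXT
                CHECK (record_format IS NULL OR length(trim(record_format)) > 0),
            data_portal_type TEXT
                CHECK (data_portal_type IS NULL OR length(trim(data_portal_type)) > 0),
            supplying_entity TEXT
                CHECK (supplying_entity IS NULL OR length(trim(supplying_entity)) > 0)
        );
        """
    )
```

Using the URL's id as the primary key of the metadata table is one simple way to get the 1:1 constraint without a separate unique index; stricter per-column validation (for example, an allowed list of record formats) could replace the blank-string checks.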