[Tamandua] Describe models for configuring sources #588
Merged: aaxelb merged 6 commits into CenterForOpenScience:feature/project-tamandua from aaxelb:feature/project-tamandua on Feb 16, 2017
6 commits:
8f5dd5a Describe models for configuring sources (aaxelb)
62916df SourceConfig => IngestConfig (aaxelb)
db3ad38 Consistent table definitions (aaxelb)
93547a9 Define tables using tables. (aaxelb)
9e2b8c3 Fix some table stuff. (aaxelb)
7ca98c4 Updates (aaxelb)
@@ -1,5 +1,115 @@

# SQL Tables

## Template

### {ModelName}
{Description}

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| | | ✓ | ✓ | ✓ | | |

#### Other indices
* `{column_name}`, `{column_name}`, ... [(unique)]
* ...

## Data

### SourceUniqueIdentifier (SUID)
Identifier for a specific document from a specific source.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `identifier` | text | | | | | Identifier given to the document by the source |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document |

#### Other indices
* `source_doc_id`, `ingest_config_id` (unique)

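As an illustration of the SUID table's composite unique index, here is a minimal sketch using Python's built-in `sqlite3`. It assumes the index is meant to cover the `identifier` column listed in the table (the index list calls it `source_doc_id`) together with `ingest_config_id`. The table name `suid`, the index name, and the helper `get_or_create_suid` are hypothetical, and the foreign key to IngestConfig is left out to keep the example self-contained.

```python
import sqlite3

# In-memory database; table and index names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE suid (
        id INTEGER PRIMARY KEY,
        identifier TEXT NOT NULL,
        ingest_config_id INTEGER NOT NULL
    );
    CREATE UNIQUE INDEX suid_identifier_ingest_config
        ON suid (identifier, ingest_config_id);
""")


def get_or_create_suid(conn, identifier, ingest_config_id):
    """Return the row id for this (identifier, ingest_config_id) pair.

    The unique index makes the insert a no-op when the pair already exists.
    """
    conn.execute(
        "INSERT OR IGNORE INTO suid (identifier, ingest_config_id) VALUES (?, ?)",
        (identifier, ingest_config_id),
    )
    row = conn.execute(
        "SELECT id FROM suid WHERE identifier = ? AND ingest_config_id = ?",
        (identifier, ingest_config_id),
    ).fetchone()
    return row[0]
```

With this index, the same source identifier can appear once per ingest config, so two configs that harvest the same document get separate SUIDs.
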
### RawData
Raw data, exactly as it was given to SHARE.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `suid_id` | int | | | ✓ | | SUID for this datum |
| `data` | text | | | | | The raw data itself (typically JSON or XML string) |
| `sha256` | text | unique | | | | SHA-256 hash of `data` |
| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum |

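The `sha256` column could be filled along these lines. This is a sketch, not SHARE's implementation: the UTF-8 encoding, the hex digest format, and the `store_raw` helper (which uses a plain dict in place of the table) are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: str) -> str:
    """Hex SHA-256 digest of a raw datum (UTF-8 encoding is an assumption)."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def store_raw(store: dict, suid_id: int, data: str):
    """Store a datum keyed by its hash; return (row, created).

    `store` stands in for the RawData table, with the dict key playing
    the role of the unique `sha256` column.
    """
    digest = sha256_hex(data)
    created = digest not in store
    if created:
        store[digest] = {"suid_id": suid_id, "data": data, "sha256": digest}
    return store[digest], created
```

Because `sha256` is unique, a harvester run that finds a byte-identical datum can reuse the existing row and only add its HarvestLog to `harvest_logs`.
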
## Ingest Configuration

### IngestConfig
Describes one way to harvest metadata from a Source, and how to transform the result.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `source_id` | int | | | ✓ | | Source to harvest from |
| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available |
| `earliest_date` | date | | ✓ | | | Earliest date with available data |
| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds |
| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests |
| `harvester_id` | int | | | ✓ | | Harvester to use |
| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs |
| `transformer_id` | int | | | ✓ | | Transformer to use |
| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data |
| `disabled` | bool | | | | False | True if this ingest config should not be run automatically |

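One way to read `rate_limit_allowance` and `rate_limit_period` is as the parameters of a token bucket. The sketch below is an assumption about the semantics, not SHARE's existing rate limiting: the `RateLimiter` class and `burst_size` helper are hypothetical names. It shows why two integer columns carry more information than a single requests-per-second float: 1 request per 5 seconds and 100 requests per 500 seconds have the same average rate but allow very different bursts.

```python
import time


class RateLimiter:
    """Token bucket: up to `allowance` requests at once, refilled over `period` seconds."""

    def __init__(self, allowance=5, period=1, clock=time.monotonic):
        self.allowance = allowance
        self.period = period
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(allowance)
        self._last = clock()

    def try_acquire(self):
        """Consume one token and return True if a request may be sent now."""
        now = self._clock()
        refill = (now - self._last) * self.allowance / self.period
        self._tokens = min(self.allowance, self._tokens + refill)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def burst_size(allowance, period):
    """Count how many back-to-back requests succeed while no time passes."""
    limiter = RateLimiter(allowance, period, clock=lambda: 0.0)
    count = 0
    while limiter.try_acquire():
        count += 1
    return count
```

Here `burst_size(1, 5)` and `burst_size(100, 500)` differ by a factor of 100 even though both average 0.2 requests per second.
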
### Source
A Source is a place metadata comes from.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `name` | text | unique | | | | Short name |
| `long_title` | text | unique | | | | Full, human-friendly name |
| `home_page` | text | | ✓ | | | URL |
| `icon` | image | | ✓ | | | Recognizable icon for the source |
| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) |

### Harvester
Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |

### Transformer
Each row corresponds to a Transformer implementation in python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |

## Logs

### HarvestLog
Log entries to track the status of a specific harvester run.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} |

#### Other indices
* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique)

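The padded version format described for `harvester_version` can be produced with a small helper. This is a sketch: `pad_version` is a hypothetical name, and it only handles plain dotted numeric versions (no pre-release or build suffixes). The padding presumably exists so that versions compare correctly as text, which the second function illustrates.

```python
def pad_version(version: str, width: int = 3) -> str:
    """Zero-pad each dot-separated segment, e.g. '1.2.10' -> '001.002.010'."""
    return ".".join(segment.zfill(width) for segment in version.split("."))


def is_newer(a: str, b: str) -> bool:
    """Compare two versions as padded strings, as a text column would."""
    return pad_version(a) > pad_version(b)
```

Unpadded, `'1.2.10'` sorts before `'1.2.9'` as a string; padded, `'001.002.010'` sorts after `'001.002.009'`.
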
### TransformLog
Log entries to track the status of a transform task

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |

#### Other indices
* `raw_id`, `transformer_version` (unique)

Could also use a float here. Probably less understandable... Not sure how much I like it, just punting as an option.
1 req / 1 sec = 1.0
1 req / 5 sec = 0.2

Yeah, I thought about that, but two columns seemed like less work than reimplementing our rate limiting. Putting it in a float would lose the difference between 1 req/5 sec and 100 req/500 sec (the latter allowing larger bursts), though I don't know how much that matters.
I did not think of that. Let's keep it this way in that case 👍 nice catch