From 8f5dd5aed39f47e56f2a595ea3ea3e715542219a Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 11:12:42 -0500 Subject: [PATCH 1/6] Describe models for configuring sources --- whitepapers/Tables.md | 35 +++++++++++++++++++++++++++++++++++ whitepapers/tasks/Harvest.md | 12 ++++++------ 2 files changed, 41 insertions(+), 6 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 1a489f90e..b3601d7d3 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -2,4 +2,39 @@ +## Pipeline configuration +### Source +A Source is a place metadata comes from. + +#### Columns +* `name` -- Short, unique string +* `long_title` -- full, human-friendly name +* `home_page` -- URL +* `favicon` -- +* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) + +### Harvester + +#### Columns +* `key` -- Unique key that can be used to get the corresponding Harvester subclass +* `version` -- + +### Transformer + +#### Columns +* `key` -- Unique key that can be used to get the corresponding Transformer subclass +* `version` -- + +### SourceConfig(?) +Describes one way to harvest metadata from a Source, and how to transform the result. 
+ +#### Columns +* `source_id` -- PK of the source +* `base_url` -- URL of the API/endpoint where the metadata is available +* `harvester_id` -- PK of the harvester to use +* `harvester_kwargs` -- JSON object passed to the harvester as kwargs +* `transformer_id` -- PK of the transformer to use +* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data +* `earliest_date` -- Earliest date with available data (nullable) +* `disabled` -- Boolean diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index 155573b5e..3da567511 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -17,7 +17,7 @@ ## Parameters -* `source_id` -- The PK of the source to harvest from +* `source_config_id` -- The PK of the source to harvest from * `start_date` -- * `end_date` -- * `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimitted) @@ -32,14 +32,14 @@ ## Steps ### Preventative measures -* If the specified `source` is disabled and `force` or `ignore_disabled` is not set, crash -* For the given `source` find up to the last 5 harvest jobs with the same versions +* If the specified `source_config` is disabled and `force` or `ignore_disabled` is not set, crash +* For the given `source_config` find up to the last 5 harvest jobs with the same versions * If they are all failed, throw an exception (Refuse to run) ### Setup -* Lock the `source` (NOWAIT) +* Lock the `source_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create HarvestJob(source_id, version, harvester, date ranges...) +* Get or create `HarvestJob(source_config_id, version, harvester, date ranges...)` * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. * STARTED: Log a warning (Should not have been able to lock the source) and update timestamps and/or counts. 
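The "chunk the date range and spawn a harvest task for each chunk" step in Harvest.md might look roughly like the sketch below. The helper name and the 30-day threshold are assumptions; the whitepaper leaves [SOME LENGTH OF TIME] unspecified.

```python
from datetime import date, timedelta

# Rough sketch of the date-range splitting step in Harvest.md. The 30-day
# threshold stands in for the unspecified [SOME LENGTH OF TIME].
MAX_HARVEST_SPAN = timedelta(days=30)

def chunk_date_range(start, end, span=MAX_HARVEST_SPAN):
    """Return contiguous (start, end) pairs covering [start, end]."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + span, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks
```

A parent task would set its own job status to `SPLIT` and spawn one harvest task per returned pair.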
@@ -47,7 +47,7 @@ * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit -* Load the harvester for the given source +* Load the harvester for the given `source_config` ### Actually Harvest * Harvest data between the specified datetimes, respecting `limit` and `rate_limit` From 62916dfa7aa8598bb8d7ccad2a9acd1b05f20ebc Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 12:58:38 -0500 Subject: [PATCH 2/6] SourceConfig => IngestConfig --- whitepapers/Tables.md | 11 ++++++----- whitepapers/tasks/Harvest.md | 28 ++++++++++++++-------------- 2 files changed, 20 insertions(+), 19 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index b3601d7d3..7becb2384 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -11,30 +11,31 @@ A Source is a place metadata comes from. * `name` -- Short, unique string * `long_title` -- full, human-friendly name * `home_page` -- URL -* `favicon` -- +* `icon` -- * `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) ### Harvester #### Columns * `key` -- Unique key that can be used to get the corresponding Harvester subclass -* `version` -- +* `date_created` -- ### Transformer #### Columns * `key` -- Unique key that can be used to get the corresponding Transformer subclass -* `version` -- +* `date_created` -- -### SourceConfig(?) +### IngestConfig Describes one way to harvest metadata from a Source, and how to transform the result. #### Columns * `source_id` -- PK of the source * `base_url` -- URL of the API/endpoint where the metadata is available +* `earliest_date` -- Earliest date with available data (nullable) +* `rate_limit` -- Rate limit for network requests. 
Defaults to `None` (Unlimited) * `harvester_id` -- PK of the harvester to use * `harvester_kwargs` -- JSON object passed to the harvester as kwargs * `transformer_id` -- PK of the transformer to use * `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data -* `earliest_date` -- Earliest date with available data (nullable) * `disabled` -- Boolean diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index 3da567511..cec1c8464 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -17,50 +17,50 @@ ## Parameters -* `source_config_id` -- The PK of the source to harvest from +* `ingest_config_id` -- The PK of the IngestConfig to use * `start_date` -- * `end_date` -- -* `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimitted) -* `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimitted) +* `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimited) * `superfluous` -- Take certain actions that have previously succeeded * `transform` -- Should TransformJobs be launched for collected data. Defaults to `True` * `no_split` -- Should harvest jobs be split into multiple?
Defaults to `False` -* `ignore_disabled` -- Run the task, even with disabled sources +* `ignore_disabled` -- Run the task, even with disabled ingest configs * `force` -- Force the task to run, against all odds ## Steps ### Preventative measures -* If the specified `source_config` is disabled and `force` or `ignore_disabled` is not set, crash -* For the given `source_config` find up to the last 5 harvest jobs with the same versions +* If the specified `ingest_config` is disabled and `force` or `ignore_disabled` is not set, crash +* For the given `ingest_config` find up to the last 5 harvest jobs with the same harvester versions * If they are all failed, throw an exception (Refuse to run) ### Setup -* Lock the `source_config` (NOWAIT) +* Lock the `ingest_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create `HarvestJob(source_config_id, version, harvester, date ranges...)` +* Get or create `HarvestJob(ingest_config_id, harvester_version, date ranges...)` * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. - * STARTED: Log a warning (Should not have been able to lock the source) and update timestamps and/or counts. + * STARTED: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts.
* Set HarvestJob status to `STARTED` * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit -* Load the harvester for the given `source_config` +* Load the harvester for the given `ingest_config` ### Actually Harvest -* Harvest data between the specified datetimes, respecting `limit` and `rate_limit` +* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit` ### Pass the data along * Begin catching any exceptions * For each piece of data received (Preferably in bulk/chunks) - * Get or create SourceUniqueIdentifier(suid, source_id) + * Get or create `SourceUniqueIdentifier(suid, source_id)` + * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate. * Get or create RawData(hash, suid) * For each piece of data (After saving to keep as transactional as possible) - * Get or create TransformLogs(raw_id, version) + * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)` * if the log already exists and superfluous is not set, exit - * Start the transform task(raw_id, version) unless `transform` is `False` + * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False` ### Clean up * If an exception was caught, set status to `FAILED` and insert the exception/traceback From db3ad3827fbb8f3b375cbd61744cd2d012f4c855 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 17:00:07 -0500 Subject: [PATCH 3/6] Consistent table definitions --- whitepapers/Tables.md | 107 ++++++++++++++++++++++++++------- whitepapers/tasks/Harvest.md | 10 +-- whitepapers/tasks/Transform.md | 32 +++++----- 3 files changed, 107 insertions(+), 42 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index
7becb2384..7ed63c4a4 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -1,41 +1,106 @@ # SQL Tables +## Template +### {ModelName} +{Description} -## Pipeline configuration +#### Columns +* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],]) +* ... + +#### Multi-column indices +* `{column_name}`, `{column_name}`, ... [(unique)] +* ... + +## Data + +### SourceUniqueIdentifier (SUID) +Identifier for a specific document from a specific source. + +#### Columns +* `source_doc_id` -- Identifier given to the document by the source (text) +* `ingest_config_id` -- PK of the IngestConfig used to ingest the document (int) + +#### Multi-column indices +* `source_doc_id`, `ingest_config_id` (unique) + +### RawData +Raw data, exactly as it was given to SHARE. + +#### Columns +* `suid_id` -- PK of the SUID for this datum (int) +* `data` -- The raw data itself (text) +* `sha256` -- SHA-256 hash of `data` (text) +* `date_seen` -- The last time this exact data was harvested (datetime) +* `date_harvested` -- The first time this exact data was harvested (datetime) + +## Ingest Configuration + +### IngestConfig +Describes one way to harvest metadata from a Source, and how to transform the result. 
+ +#### Columns +* `source_id` -- PK of the source (int) +* `base_url` -- URL of the API/endpoint where the metadata is available (text) +* `earliest_date` -- Earliest date with available data (date, nullable) +* `rate_limit_allowance` -- Number of requests allowed every `rate_limit_period` seconds (positive int, default=5) +* `rate_limit_period` -- Number of seconds for every `rate_limit_allowance` requests (positive int, default=1) +* `harvester_id` -- PK of the harvester to use (int) +* `harvester_kwargs` -- JSON object passed to the harvester as kwargs (json, nullable) +* `transformer_id` -- PK of the transformer to use (int) +* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data (json, nullable) +* `disabled` -- True if this ingest config should not be run automatically (boolean) ### Source A Source is a place metadata comes from. #### Columns -* `name` -- Short, unique string -* `long_title` -- full, human-friendly name -* `home_page` -- URL -* `icon` -- -* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) +* `name` -- Short name (text, unique) +* `long_title` -- Full, human-friendly name (text, unique) +* `home_page` -- URL (text, nullable) +* `icon` -- Icon for the source (image, nullable) +* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) (int) ### Harvester +Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) #### Columns -* `key` -- Unique key that can be used to get the corresponding Harvester subclass -* `date_created` -- +* `key` -- Key that can be used to get the corresponding Harvester subclass (text, unique) +* `date_created` -- Date created (datetime) ### Transformer +Each row corresponds to a Transformer implementation in python. 
(TODO: describe those somewhere) #### Columns -* `key` -- Unique key that can be used to get the corresponding Transformer subclass -* `date_created` -- +* `key` -- Key that can be used to get the corresponding Transformer subclass (text, unique) +* `date_created` -- Date created (datetime) -### IngestConfig -Describes one way to harvest metadata from a Source, and how to transform the result. +## Logs + +### HarvestLog +Log entries to track the status of a specific harvester run. + +#### Columns +* `ingest_config_id` -- PK of the IngestConfig for this harvester run (int) +* `harvester_version` -- Current version of the harvester in format 'x.x.x' (text) +* `start_date` -- Beginning of the date range to harvest (datetime) +* `end_date` -- End of the date range to harvest (datetime) +* `started` -- Time this harvester run began (datetime) +* `status` -- Status of the harvester run (string, choices={INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED}, default=INITIAL) + +#### Multi-column indices +* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique) + +### TransformLog +Log entries to track the status of a specific harvester run. #### Columns -* `source_id` -- PK of the source -* `base_url` -- URL of the API/endpoint where the metadata is available -* `earliest_date` -- Earliest date with available data (nullable) -* `rate_limit` -- Rate limit for network requests. 
Defaults to `None` (Unlimited) -* `harvester_id` -- PK of the harvester to use -* `harvester_kwargs` -- JSON object passed to the harvester as kwargs -* `transformer_id` -- PK of the transformer to use -* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data -* `disabled` -- Boolean +* `raw_id` -- PK of the RawData to be transformed (int) +* `ingest_config_id` -- PK of the IngestConfig (int) +* `transformer_version` -- Current version of the transformer in format 'x.x.x' (text) +* `started` -- Time this transform task began (datetime) +* `status` -- Status of the transform task (string, choices={INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED}, default=INITIAL) + +#### Multi-column indices +* `raw_id`, `transformer_version` (unique) diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index cec1c8464..d5c2db00a 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -38,11 +38,11 @@ ### Setup * Lock the `ingest_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create `HarvestJob(ingest_config_id, harvester_version, date ranges...)` +* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`) * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. - * STARTED: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts. -* Set HarvestJob status to `STARTED` + * `STARTED`: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts. 
+* Set HarvestLog status to `STARTED` * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit @@ -58,9 +58,9 @@ * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate. * Get or create RawData(hash, suid) * For each piece of data (After saving to keep as transactional as possible) - * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)` + * Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)` * if the log already exists and superfluous is not set, exit - * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False` + * Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False` ### Clean up * If an exception was caught, set status to `FAILED` and insert the exception/traceback diff --git a/whitepapers/tasks/Transform.md b/whitepapers/tasks/Transform.md index bb5076a47..4a3429469 100644 --- a/whitepapers/tasks/Transform.md +++ b/whitepapers/tasks/Transform.md @@ -2,15 +2,15 @@ ## Responsibilities -* Parsing data using source specific parsers +* Parsing data using source-specific parsers * Applying global cleaners to the data -* Catching any extranious exceptions and storing them in the ProcessLog and marking the ProcessLog as failed +* Catching any extraneous exceptions, storing them in the TransformLog, and marking the TransformLog `FAILED` ## Parameters * `raw_id` -- -* `processor_version` -- -* `cleaner_version` -- +* `transformer_version` -- +* `regulator_version` -- * `superfluous` -- @@ -19,23 +19,23 @@ ### Setup * Load RawData by id. * Crash, if not found. -* If not defined set `processor_version` to the latest. -* If not defined set `cleaner_version` to the latest. 
-* Find and lock ProcessLog(`raw_id`, `processor_version`) (SELECT FOR UPDATE NOWAIT) +* If not defined set `transformer_version` to the latest. +* If not defined set `regulator_version` to the latest. +* Find and lock TransformLog(`raw_id`, `transformer_version`) (SELECT FOR UPDATE NOWAIT) * If not found, log an error. Create, Commit, Lock. - * If the create fails, Log an error and exit. + * If the create fails, log an error and exit. * If the lock times out/isn't granted, log an error and exit. -* If the found ProcessLog's status is finished/done and `superfluous` is `False` exit. -* Set the status of the ProcessLog to in-progress +* If the found TransformLog's status is `SUCCEEDED` and `superfluous` is `False`, exit. +* Set the status of the TransformLog to `STARTED` ### Check for racing -* Search for any equivilent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished processing - * If found set status to rescheduled and exit +* Search for any equivalent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished transforming + * If found set status to `RESCHEDULED` and exit -### Actually process the data +### Actually transform the data * Start a transaction -* Load the processor -* Process data +* Load the transformer +* Transform data * Load the cleaning suite * Clean data @@ -52,4 +52,4 @@ * Commit transaction * Release all locks * Start disambiguation tasks for updated states -* Set ProcessLog status to Done +* Set TransformLog status to `SUCCEEDED` From 93547a9b33cb76e70f0d141b4742f72cb914a39c Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 09:55:40 -0500 Subject: [PATCH 4/6] Define tables using tables.
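The "Get or create RawData(hash, suid)" step in Harvest.md, combined with the unique `sha256` column on RawData in Tables.md, amounts to content-addressed deduplication: harvesting the exact same bytes twice yields the same row. A minimal sketch, where a dict stands in for the database table and its unique index, and the function name is an assumption:

```python
import hashlib

# Illustrative get-or-create keyed on the SHA-256 of the raw bytes, as
# described for RawData. The dict plays the role of the unique `sha256`
# index; a real implementation would use the database instead.
def get_or_create_raw_datum(table, suid_id, data):
    sha256 = hashlib.sha256(data.encode('utf-8')).hexdigest()
    created = sha256 not in table
    if created:
        table[sha256] = {'suid_id': suid_id, 'data': data, 'sha256': sha256}
    return table[sha256], created
```

Re-harvesting identical data returns the existing row, so only genuinely new documents produce new RawData.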
--- whitepapers/Tables.md | 110 ++++++++++++++++++++++++------------------ 1 file changed, 63 insertions(+), 47 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 7ed63c4a4..53c2518bc 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -5,11 +5,11 @@ ### {ModelName} {Description} -#### Columns -* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],]) -* ... +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| | | ✓ | ✓ | ✓ | | | -#### Multi-column indices +#### Other indices * `{column_name}`, `{column_name}`, ... [(unique)] * ... @@ -19,21 +19,25 @@ Identifier for a specific document from a specific source. #### Columns -* `source_doc_id` -- Identifier given to the document by the source (text) -* `ingest_config_id` -- PK of the IngestConfig used to ingest the document (int) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `source_doc_id` | text | | | | | Identifier given to the document by the source | +| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | -#### Multi-column indices +#### Other indices * `source_doc_id`, `ingest_config_id` (unique) ### RawData Raw data, exactly as it was given to SHARE. 
#### Columns -* `suid_id` -- PK of the SUID for this datum (int) -* `data` -- The raw data itself (text) -* `sha256` -- SHA-256 hash of `data` (text) -* `date_seen` -- The last time this exact data was harvested (datetime) -* `date_harvested` -- The first time this exact data was harvested (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `suid_id` | int | | | ✓ | | SUID for this datum | +| `data` | text | | | | | The raw data itself (typically JSON or XML string) | +| `sha256` | text | unique | | | | SHA-256 hash of `data` | +| `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested | +| `date_harvested` | datetime | | | | now (on insert) | The first time this exact data was harvested | ## Ingest Configuration @@ -41,40 +45,48 @@ Raw data, exactly as it was given to SHARE. Describes one way to harvest metadata from a Source, and how to transform the result. 
#### Columns -* `source_id` -- PK of the source (int) -* `base_url` -- URL of the API/endpoint where the metadata is available (text) -* `earliest_date` -- Earliest date with available data (date, nullable) -* `rate_limit_allowance` -- Number of requests allowed every `rate_limit_period` seconds (positive int, default=5) -* `rate_limit_period` -- Number of seconds for every `rate_limit_allowance` requests (positive int, default=1) -* `harvester_id` -- PK of the harvester to use (int) -* `harvester_kwargs` -- JSON object passed to the harvester as kwargs (json, nullable) -* `transformer_id` -- PK of the transformer to use (int) -* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data (json, nullable) -* `disabled` -- True if this ingest config should not be run automatically (boolean) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `source_id` | int | | | ✓ | | Source to harvest from | +| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available | +| `earliest_date` | date | | ✓ | | | Earliest date with available data | +| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds | +| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests | +| `harvester_id` | int | | | ✓ | | Harvester to use | +| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs | +| `transformer_id` | int | | | ✓ | | Transformer to use | +| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data | +| `disabled` | bool | | | | False | True if this ingest config should not be run automatically | ### Source A Source is a place metadata comes from. 
#### Columns -* `name` -- Short name (text, unique) -* `long_title` -- Full, human-friendly name (text, unique) -* `home_page` -- URL (text, nullable) -* `icon` -- Icon for the source (image, nullable) -* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) (int) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `name` | text | unique | | | | Short name | +| `long_title` | text | unique | | | | Full, human-friendly name | +| `home_page` | text | | ✓ | | | URL | +| `icon` | image | | ✓ | | | Recognizable icon for the source | +| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) | ### Harvester Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) #### Columns -* `key` -- Key that can be used to get the corresponding Harvester subclass (text, unique) -* `date_created` -- Date created (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | +| `date_created` | datetime | | | | now (on insert) | | ### Transformer Each row corresponds to a Transformer implementation in python. 
(TODO: describe those somewhere) #### Columns -* `key` -- Key that can be used to get the corresponding Transformer subclass (text, unique) -* `date_created` -- Date created (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | +| `date_created` | datetime | | | | now (on insert) | | ## Logs @@ -82,25 +94,29 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe Log entries to track the status of a specific harvester run. #### Columns -* `ingest_config_id` -- PK of the IngestConfig for this harvester run (int) -* `harvester_version` -- Current version of the harvester in format 'x.x.x' (text) -* `start_date` -- Beginning of the date range to harvest (datetime) -* `end_date` -- End of the date range to harvest (datetime) -* `started` -- Time this harvester run began (datetime) -* `status` -- Status of the harvester run (string, choices={INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED}, default=INITIAL) - -#### Multi-column indices +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run | +| `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' | +| `start_date` | datetime | | | | | Beginning of the date range to harvest | +| `end_date` | datetime | | | | | End of the date range to harvest | +| `started` | datetime | | | | | Time `status` was set to STARTED | +| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} | + +#### Other indices * `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique) ### TransformLog -Log entries to track the status of a specific harvester run. 
+Log entries to track the status of a transform task #### Columns -* `raw_id` -- PK of the RawData to be transformed (int) -* `ingest_config_id` -- PK of the IngestConfig (int) -* `transformer_version` -- Current version of the transformer in format 'x.x.x' (text) -* `started` -- Time this transform task began (datetime) -* `status` -- Status of the transform task (string, choices={INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED}, default=INITIAL) - -#### Multi-column indices +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `raw_id` | int | | | ✓ | | RawData to be transformed | +| `ingest_config_id` | int | | | ✓ | | IngestConfig used | +| `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' | +| `started` | datetime | | | | | Time `status` was set to STARTED | +| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} | + +#### Other indices * `raw_id`, `transformer_version` (unique) From 9e2b8c33d9258337e2901522a868a97562d1bdf8 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 10:05:06 -0500 Subject: [PATCH 5/6] Fix some table stuff. --- whitepapers/Tables.md | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 53c2518bc..73d3ff55d 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -6,7 +6,7 @@ {Description} | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | | | ✓ | ✓ | ✓ | | | #### Other indices @@ -18,9 +18,8 @@ ### SourceUniqueIdentifier (SUID) Identifier for a specific document from a specific source. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `source_doc_id` | text | | | | | Identifier given to the document by the source | | `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | @@ -30,9 +29,8 @@ Identifier for a specific document from a specific source. ### RawData Raw data, exactly as it was given to SHARE. -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `suid_id` | int | | | ✓ | | SUID for this datum | | `data` | text | | | | | The raw data itself (typically JSON or XML string) | | `sha256` | text | unique | | | | SHA-256 hash of `data` | @@ -44,9 +42,8 @@ Raw data, exactly as it was given to SHARE. ### IngestConfig Describes one way to harvest metadata from a Source, and how to transform the result. -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `source_id` | int | | | ✓ | | Source to harvest from | | `base_url` | text | | | | | URL of the API or endpoint where the metadata is available | | `earliest_date` | date | | ✓ | | | Earliest date with available data | @@ -61,9 +58,8 @@ Describes one way to harvest metadata from a Source, and how to transform the re ### Source A Source is a place metadata comes from. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `name` | text | unique | | | | Short name | | `long_title` | text | unique | | | | Full, human-friendly name | | `home_page` | text | | ✓ | | | URL | @@ -73,18 +69,16 @@ A Source is a place metadata comes from. ### Harvester Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | | `date_created` | datetime | | | | now (on insert) | | ### Transformer Each row corresponds to a Transformer implementation in python. (TODO: describe those somewhere) -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | | `date_created` | datetime | | | | now (on insert) | | @@ -93,9 +87,8 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe ### HarvestLog Log entries to track the status of a specific harvester run. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run | | `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' | | `start_date` | datetime | | | | | Beginning of the date range to harvest | @@ -109,9 +102,8 @@ Log entries to track the status of a specific harvester run. ### TransformLog Log entries to track the status of a transform task -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `raw_id` | int | | | ✓ | | RawData to be transformed | | `ingest_config_id` | int | | | ✓ | | IngestConfig used | | `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' | From 7ca98c4b2bcf629b989cd19d25c638fbf79d7845 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 13:49:16 -0500 Subject: [PATCH 6/6] Updates --- whitepapers/Tables.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 73d3ff55d..36f8f2c81 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -20,7 +20,7 @@ Identifier for a specific document from a specific source. | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| -| `source_doc_id` | text | | | | | Identifier given to the document by the source | +| `identifier` | text | | | | | Identifier given to the document by the source | | `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | #### Other indices @@ -34,8 +34,7 @@ Raw data, exactly as it was given to SHARE. 
| `suid_id` | int | | | ✓ | | SUID for this datum | | `data` | text | | | | | The raw data itself (typically JSON or XML string) | | `sha256` | text | unique | | | | SHA-256 hash of `data` | -| `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested | -| `date_harvested` | datetime | | | | now (on insert) | The first time this exact data was harvested | +| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum | ## Ingest Configuration @@ -72,7 +71,8 @@ Each row corresponds to a Harvester implementation in python. (TODO: describe th | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | -| `date_created` | datetime | | | | now (on insert) | | +| `date_created` | datetime | | | | now | | +| `date_modified` | datetime | | | | now (on update) | | ### Transformer Each row corresponds to a Transformer implementation in python. (TODO: describe those somewhere) @@ -80,7 +80,8 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | -| `date_created` | datetime | | | | now (on insert) | | +| `date_created` | datetime | | | | now | | +| `date_modified` | datetime | | | | now (on update) | | ## Logs @@ -90,7 +91,7 @@ Log entries to track the status of a specific harvester run. 
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
-| `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' |
+| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
@@ -106,7 +107,7 @@ Log entries to track the status of a transform task
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
-| `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' |
+| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |
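
The zero-padded version encoding introduced for `harvester_version` and `transformer_version` ('1.2.10' => '001.002.010') makes lexicographic ordering of the text column agree with numeric version ordering. A minimal sketch in Python (the function name is illustrative, not from the SHARE codebase):

```python
def pad_version(version):
    """Zero-pad each dotted segment to 3 digits, e.g. '1.2.10' -> '001.002.010'.

    With this encoding, plain text comparison (including in a SQL text
    column) sorts versions correctly: '001.002.010' > '001.002.002',
    whereas unpadded '1.2.10' < '1.2.2' lexicographically.
    """
    return ".".join(segment.zfill(3) for segment in version.split("."))


print(pad_version("1.2.10"))  # prints 001.002.010
```

This is why the padded form is preferable to storing 'x.x.x' directly in a text column that may be filtered or sorted on.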
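
The unique index on `RawData.sha256` implies content-addressed deduplication: re-harvesting a byte-identical datum does not create a new row (patch 6/6 records the repeat sighting via the `harvest_logs` m2m instead). A rough sketch of that invariant in plain Python, with an in-memory dict standing in for the table (names are illustrative, not the SHARE implementation):

```python
import hashlib


def store_raw(datum, suid_id, store):
    """Insert raw data keyed by its SHA-256 content hash.

    Mirrors the unique index on RawData.sha256: storing a byte-identical
    datum a second time is a no-op and returns the same key.
    """
    digest = hashlib.sha256(datum.encode("utf-8")).hexdigest()
    if digest not in store:
        store[digest] = {"suid_id": suid_id, "data": datum}
    return digest
```

Under this scheme two harvester runs that fetch identical data converge on one RawData row, and only the log tables grow.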