From 8f5dd5aed39f47e56f2a595ea3ea3e715542219a Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 11:12:42 -0500 Subject: [PATCH 1/6] Describe models for configuring sources --- whitepapers/Tables.md | 35 +++++++++++++++++++++++++++++++++++ whitepapers/tasks/Harvest.md | 12 ++++++------ 2 files changed, 41 insertions(+), 6 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 1a489f90e..b3601d7d3 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -2,4 +2,39 @@ +## Pipeline configuration +### Source +A Source is a place metadata comes from. + +#### Columns +* `name` -- Short, unique string +* `long_title` -- full, human-friendly name +* `home_page` -- URL +* `favicon` -- +* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) + +### Harvester + +#### Columns +* `key` -- Unique key that can be used to get the corresponding Harvester subclass +* `version` -- + +### Transformer + +#### Columns +* `key` -- Unique key that can be used to get the corresponding Transformer subclass +* `version` -- + +### SourceConfig(?) +Describes one way to harvest metadata from a Source, and how to transform the result. 
+ +#### Columns +* `source_id` -- PK of the source +* `base_url` -- URL of the API/endpoint where the metadata is available +* `harvester_id` -- PK of the harvester to use +* `harvester_kwargs` -- JSON object passed to the harvester as kwargs +* `transformer_id` -- PK of the transformer to use +* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data +* `earliest_date` -- Earliest date with available data (nullable) +* `disabled` -- Boolean diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index 155573b5e..3da567511 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -17,7 +17,7 @@ ## Parameters -* `source_id` -- The PK of the source to harvest from +* `source_config_id` -- The PK of the source to harvest from * `start_date` -- * `end_date` -- * `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimitted) @@ -32,14 +32,14 @@ ## Steps ### Preventative measures -* If the specified `source` is disabled and `force` or `ignore_disabled` is not set, crash -* For the given `source` find up to the last 5 harvest jobs with the same versions +* If the specified `source_config` is disabled and `force` or `ignore_disabled` is not set, crash +* For the given `source_config` find up to the last 5 harvest jobs with the same versions * If they are all failed, throw an exception (Refuse to run) ### Setup -* Lock the `source` (NOWAIT) +* Lock the `source_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create HarvestJob(source_id, version, harvester, date ranges...) +* Get or create `HarvestJob(source_config_id, version, harvester, date ranges...)` * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. * STARTED: Log a warning (Should not have been able to lock the source) and update timestamps and/or counts. 
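The "chunk the date range and spawn a harvest task for each chunk" step in Harvest.md might look roughly like the sketch below. The helper name and the 30-day threshold are assumptions; the whitepaper leaves [SOME LENGTH OF TIME] unspecified.

```python
from datetime import date, timedelta

# Rough sketch of the date-range splitting step in Harvest.md. The 30-day
# threshold stands in for the unspecified [SOME LENGTH OF TIME].
MAX_HARVEST_SPAN = timedelta(days=30)

def chunk_date_range(start, end, span=MAX_HARVEST_SPAN):
    """Return contiguous (start, end) pairs covering [start, end]."""
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + span, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks
```

A parent task would set its own job status to `SPLIT` and spawn one harvest task per returned pair.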
@@ -47,7 +47,7 @@ * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit -* Load the harvester for the given source +* Load the harvester for the given `source_config` ### Actually Harvest * Harvest data between the specified datetimes, respecting `limit` and `rate_limit` From 62916dfa7aa8598bb8d7ccad2a9acd1b05f20ebc Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 12:58:38 -0500 Subject: [PATCH 2/6] SourceConfig => IngestConfig --- whitepapers/Tables.md | 11 ++++++----- whitepapers/tasks/Harvest.md | 28 ++++++++++++++-------------- 2 files changed, 20 insertions(+), 19 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index b3601d7d3..7becb2384 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -11,30 +11,31 @@ A Source is a place metadata comes from. * `name` -- Short, unique string * `long_title` -- full, human-friendly name * `home_page` -- URL -* `favicon` -- +* `icon` -- * `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) ### Harvester #### Columns * `key` -- Unique key that can be used to get the corresponding Harvester subclass -* `version` -- +* `date_created` -- ### Transformer #### Columns * `key` -- Unique key that can be used to get the corresponding Transformer subclass -* `version` -- +* `date_created` -- -### SourceConfig(?) +### IngestConfig Describes one way to harvest metadata from a Source, and how to transform the result. #### Columns * `source_id` -- PK of the source * `base_url` -- URL of the API/endpoint where the metadata is available +* `earliest_date` -- Earliest date with available data (nullable) +* `rate_limit` -- Rate limit for network requests. 
Defaults to `None` (Unlimited) * `harvester_id` -- PK of the harvester to use * `harvester_kwargs` -- JSON object passed to the harvester as kwargs * `transformer_id` -- PK of the transformer to use * `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data -* `earliest_date` -- Earliest date with available data (nullable) * `disabled` -- Boolean diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index 3da567511..cec1c8464 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -17,50 +17,50 @@ ## Parameters -* `source_config_id` -- The PK of the source to harvest from +* `ingest_config_id` -- The PK of the IngestConfig to use * `start_date` -- * `end_date` -- -* `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimitted) -* `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimitted) +* `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimited) * `superfluous` -- Take certain actions that have previously succeeded * `transform` -- Should TransformJobs be launched for collected data. Defaults to `True` * `no_split` -- Should harvest jobs be split into multiple?
Defaults to `False` -* `ignore_disabled` -- Run the task, even with disabled sources +* `ignore_disabled` -- Run the task, even with disabled ingest configs * `force` -- Force the task to run, against all odds ## Steps ### Preventative measures -* If the specified `source_config` is disabled and `force` or `ignore_disabled` is not set, crash -* For the given `source_config` find up to the last 5 harvest jobs with the same versions +* If the specified `ingest_config` is disabled and `force` or `ignore_disabled` is not set, crash +* For the given `ingest_config` find up to the last 5 harvest jobs with the same harvester versions * If they are all failed, throw an exception (Refuse to run) ### Setup -* Lock the `source_config` (NOWAIT) +* Lock the `ingest_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create `HarvestJob(source_config_id, version, harvester, date ranges...)` +* Get or create `HarvestJob(ingest_config_id, harvester_version, date ranges...)` * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. - * STARTED: Log a warning (Should not have been able to lock the source) and update timestamps and/or counts. + * STARTED: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts.
* Set HarvestJob status to `STARTED` * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit -* Load the harvester for the given `source_config` +* Load the harvester for the given `ingest_config` ### Actually Harvest -* Harvest data between the specified datetimes, respecting `limit` and `rate_limit` +* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit` ### Pass the data along * Begin catching any exceptions * For each piece of data received (Preferably in bulk/chunks) - * Get or create SourceUniqueIdentifier(suid, source_id) + * Get or create `SourceUniqueIdentifier(suid, source_id)` + * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate. * Get or create RawData(hash, suid) * For each piece of data (After saving to keep as transactional as possible) - * Get or create TransformLogs(raw_id, version) + * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)` * if the log already exists and superfluous is not set, exit - * Start the transform task(raw_id, version) unless `transform` is `False` + * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False` ### Clean up * If an exception was caught, set status to `FAILED` and insert the exception/traceback From db3ad3827fbb8f3b375cbd61744cd2d012f4c855 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Wed, 15 Feb 2017 17:00:07 -0500 Subject: [PATCH 3/6] Consistent table definitions --- whitepapers/Tables.md | 107 ++++++++++++++++++++++++++------- whitepapers/tasks/Harvest.md | 10 +-- whitepapers/tasks/Transform.md | 32 +++++----- 3 files changed, 107 insertions(+), 42 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index
7becb2384..7ed63c4a4 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -1,41 +1,106 @@ # SQL Tables +## Template +### {ModelName} +{Description} -## Pipeline configuration +#### Columns +* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],]) +* ... + +#### Multi-column indices +* `{column_name}`, `{column_name}`, ... [(unique)] +* ... + +## Data + +### SourceUniqueIdentifier (SUID) +Identifier for a specific document from a specific source. + +#### Columns +* `source_doc_id` -- Identifier given to the document by the source (text) +* `ingest_config_id` -- PK of the IngestConfig used to ingest the document (int) + +#### Multi-column indices +* `source_doc_id`, `ingest_config_id` (unique) + +### RawData +Raw data, exactly as it was given to SHARE. + +#### Columns +* `suid_id` -- PK of the SUID for this datum (int) +* `data` -- The raw data itself (text) +* `sha256` -- SHA-256 hash of `data` (text) +* `date_seen` -- The last time this exact data was harvested (datetime) +* `date_harvested` -- The first time this exact data was harvested (datetime) + +## Ingest Configuration + +### IngestConfig +Describes one way to harvest metadata from a Source, and how to transform the result. 
+ +#### Columns +* `source_id` -- PK of the source (int) +* `base_url` -- URL of the API/endpoint where the metadata is available (text) +* `earliest_date` -- Earliest date with available data (date, nullable) +* `rate_limit_allowance` -- Number of requests allowed every `rate_limit_period` seconds (positive int, default=5) +* `rate_limit_period` -- Number of seconds for every `rate_limit_allowance` requests (positive int, default=1) +* `harvester_id` -- PK of the harvester to use (int) +* `harvester_kwargs` -- JSON object passed to the harvester as kwargs (json, nullable) +* `transformer_id` -- PK of the transformer to use (int) +* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data (json, nullable) +* `disabled` -- True if this ingest config should not be run automatically (boolean) ### Source A Source is a place metadata comes from. #### Columns -* `name` -- Short, unique string -* `long_title` -- full, human-friendly name -* `home_page` -- URL -* `icon` -- -* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) +* `name` -- Short name (text, unique) +* `long_title` -- Full, human-friendly name (text, unique) +* `home_page` -- URL (text, nullable) +* `icon` -- Icon for the source (image, nullable) +* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) (int) ### Harvester +Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) #### Columns -* `key` -- Unique key that can be used to get the corresponding Harvester subclass -* `date_created` -- +* `key` -- Key that can be used to get the corresponding Harvester subclass (text, unique) +* `date_created` -- Date created (datetime) ### Transformer +Each row corresponds to a Transformer implementation in python. 
(TODO: describe those somewhere) #### Columns -* `key` -- Unique key that can be used to get the corresponding Transformer subclass -* `date_created` -- +* `key` -- Key that can be used to get the corresponding Transformer subclass (text, unique) +* `date_created` -- Date created (datetime) -### IngestConfig -Describes one way to harvest metadata from a Source, and how to transform the result. +## Logs + +### HarvestLog +Log entries to track the status of a specific harvester run. + +#### Columns +* `ingest_config_id` -- PK of the IngestConfig for this harvester run (int) +* `harvester_version` -- Current version of the harvester in format 'x.x.x' (text) +* `start_date` -- Beginning of the date range to harvest (datetime) +* `end_date` -- End of the date range to harvest (datetime) +* `started` -- Time this harvester run began (datetime) +* `status` -- Status of the harvester run (string, choices={INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED}, default=INITIAL) + +#### Multi-column indices +* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique) + +### TransformLog +Log entries to track the status of a specific harvester run. #### Columns -* `source_id` -- PK of the source -* `base_url` -- URL of the API/endpoint where the metadata is available -* `earliest_date` -- Earliest date with available data (nullable) -* `rate_limit` -- Rate limit for network requests. 
Defaults to `None` (Unlimited) -* `harvester_id` -- PK of the harvester to use -* `harvester_kwargs` -- JSON object passed to the harvester as kwargs -* `transformer_id` -- PK of the transformer to use -* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data -* `disabled` -- Boolean +* `raw_id` -- PK of the RawData to be transformed (int) +* `ingest_config_id` -- PK of the IngestConfig (int) +* `transformer_version` -- Current version of the transformer in format 'x.x.x' (text) +* `started` -- Time this transform task began (datetime) +* `status` -- Status of the transform task (string, choices={INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED}, default=INITIAL) + +#### Multi-column indices +* `raw_id`, `transformer_version` (unique) diff --git a/whitepapers/tasks/Harvest.md b/whitepapers/tasks/Harvest.md index cec1c8464..d5c2db00a 100644 --- a/whitepapers/tasks/Harvest.md +++ b/whitepapers/tasks/Harvest.md @@ -38,11 +38,11 @@ ### Setup * Lock the `ingest_config` (NOWAIT) * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing) -* Get or create `HarvestJob(ingest_config_id, harvester_version, date ranges...)` +* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`) * if found and status is: * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts. - * STARTED: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts. -* Set HarvestJob status to `STARTED` + * `STARTED`: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts. 
+* Set HarvestLog status to `STARTED` * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False * Chunk the date range and spawn a harvest task for each chunk * Set status to `SPLIT` and exit @@ -58,9 +58,9 @@ * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate. * Get or create RawData(hash, suid) * For each piece of data (After saving to keep as transactional as possible) - * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)` + * Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)` * if the log already exists and superfluous is not set, exit - * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False` + * Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False` ### Clean up * If an exception was caught, set status to `FAILED` and insert the exception/traceback diff --git a/whitepapers/tasks/Transform.md b/whitepapers/tasks/Transform.md index bb5076a47..4a3429469 100644 --- a/whitepapers/tasks/Transform.md +++ b/whitepapers/tasks/Transform.md @@ -2,15 +2,15 @@ ## Responsibilities -* Parsing data using source specific parsers +* Parsing data using source-specific parsers * Applying global cleaners to the data -* Catching any extranious exceptions and storing them in the ProcessLog and marking the ProcessLog as failed +* Catching any extraneous exceptions, storing them in the TransformLog, and marking the TransformLog `FAILED` ## Parameters * `raw_id` -- -* `processor_version` -- -* `cleaner_version` -- +* `transformer_version` -- +* `regulator_version` -- * `superfluous` -- @@ -19,23 +19,23 @@ ### Setup * Load RawData by id. * Crash, if not found. -* If not defined set `processor_version` to the latest. -* If not defined set `cleaner_version` to the latest. 
-* Find and lock ProcessLog(`raw_id`, `processor_version`) (SELECT FOR UPDATE NOWAIT) +* If not defined set `transformer_version` to the latest. +* If not defined set `regulator_version` to the latest. +* Find and lock TransformLog(`raw_id`, `transformer_version`) (SELECT FOR UPDATE NOWAIT) * If not found, log an error. Create, Commit, Lock. - * If the create fails, Log an error and exit. + * If the create fails, log an error and exit. * If the lock times out/isn't granted, log an error and exit. -* If the found ProcessLog's status is finished/done and `superfluous` is `False` exit. -* Set the status of the ProcessLog to in-progress +* If the found TransformLog's status is `SUCCEEDED` and `superfluous` is `False`, exit. +* Set the status of the TransformLog to `STARTED` ### Check for racing -* Search for any equivilent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished processing - * If found set status to rescheduled and exit +* Search for any equivalent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished transforming + * If found set status to `RESCHEDULED` and exit -### Actually process the data +### Actually transform the data * Start a transaction -* Load the processor -* Process data +* Load the transformer +* Transform data * Load the cleaning suite * Clean data @@ -52,4 +52,4 @@ * Commit transaction * Release all locks * Start disambiguation tasks for updated states -* Set ProcessLog status to Done +* Set TransformLog status to `SUCCEEDED` From 93547a9b33cb76e70f0d141b4742f72cb914a39c Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 09:55:40 -0500 Subject: [PATCH 4/6] Define tables using tables.
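The "Get or create RawData(hash, suid)" step in Harvest.md, combined with the unique `sha256` column on RawData in Tables.md, amounts to content-addressed deduplication: harvesting the exact same bytes twice yields the same row. A minimal sketch, where a dict stands in for the database table and its unique index, and the function name is an assumption:

```python
import hashlib

# Illustrative get-or-create keyed on the SHA-256 of the raw bytes, as
# described for RawData. The dict plays the role of the unique `sha256`
# index; a real implementation would use the database instead.
def get_or_create_raw_datum(table, suid_id, data):
    sha256 = hashlib.sha256(data.encode('utf-8')).hexdigest()
    created = sha256 not in table
    if created:
        table[sha256] = {'suid_id': suid_id, 'data': data, 'sha256': sha256}
    return table[sha256], created
```

Re-harvesting identical data returns the existing row, so only genuinely new documents produce new RawData.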
--- whitepapers/Tables.md | 110 ++++++++++++++++++++++++------------------ 1 file changed, 63 insertions(+), 47 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 7ed63c4a4..53c2518bc 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -5,11 +5,11 @@ ### {ModelName} {Description} -#### Columns -* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],]) -* ... +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| | | ✓ | ✓ | ✓ | | | -#### Multi-column indices +#### Other indices * `{column_name}`, `{column_name}`, ... [(unique)] * ... @@ -19,21 +19,25 @@ Identifier for a specific document from a specific source. #### Columns -* `source_doc_id` -- Identifier given to the document by the source (text) -* `ingest_config_id` -- PK of the IngestConfig used to ingest the document (int) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `source_doc_id` | text | | | | | Identifier given to the document by the source | +| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | -#### Multi-column indices +#### Other indices * `source_doc_id`, `ingest_config_id` (unique) ### RawData Raw data, exactly as it was given to SHARE. 
#### Columns -* `suid_id` -- PK of the SUID for this datum (int) -* `data` -- The raw data itself (text) -* `sha256` -- SHA-256 hash of `data` (text) -* `date_seen` -- The last time this exact data was harvested (datetime) -* `date_harvested` -- The first time this exact data was harvested (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `suid_id` | int | | | ✓ | | SUID for this datum | +| `data` | text | | | | | The raw data itself (typically JSON or XML string) | +| `sha256` | text | unique | | | | SHA-256 hash of `data` | +| `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested | +| `date_harvested` | datetime | | | | now (on insert) | The first time this exact data was harvested | ## Ingest Configuration @@ -41,40 +45,48 @@ Raw data, exactly as it was given to SHARE. Describes one way to harvest metadata from a Source, and how to transform the result. 
#### Columns -* `source_id` -- PK of the source (int) -* `base_url` -- URL of the API/endpoint where the metadata is available (text) -* `earliest_date` -- Earliest date with available data (date, nullable) -* `rate_limit_allowance` -- Number of requests allowed every `rate_limit_period` seconds (positive int, default=5) -* `rate_limit_period` -- Number of seconds for every `rate_limit_allowance` requests (positive int, default=1) -* `harvester_id` -- PK of the harvester to use (int) -* `harvester_kwargs` -- JSON object passed to the harvester as kwargs (json, nullable) -* `transformer_id` -- PK of the transformer to use (int) -* `transformer_kwargs` -- JSON object passed to the transformer as kwargs, along with the harvested raw data (json, nullable) -* `disabled` -- True if this ingest config should not be run automatically (boolean) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `source_id` | int | | | ✓ | | Source to harvest from | +| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available | +| `earliest_date` | date | | ✓ | | | Earliest date with available data | +| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds | +| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests | +| `harvester_id` | int | | | ✓ | | Harvester to use | +| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs | +| `transformer_id` | int | | | ✓ | | Transformer to use | +| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data | +| `disabled` | bool | | | | False | True if this ingest config should not be run automatically | ### Source A Source is a place metadata comes from. 
#### Columns -* `name` -- Short name (text, unique) -* `long_title` -- Full, human-friendly name (text, unique) -* `home_page` -- URL (text, nullable) -* `icon` -- Icon for the source (image, nullable) -* `user_id` -- PK of the user with permission to submit data as this source (TODO: replace with django permissions stuff) (int) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `name` | text | unique | | | | Short name | +| `long_title` | text | unique | | | | Full, human-friendly name | +| `home_page` | text | | ✓ | | | URL | +| `icon` | image | | ✓ | | | Recognizable icon for the source | +| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) | ### Harvester Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) #### Columns -* `key` -- Key that can be used to get the corresponding Harvester subclass (text, unique) -* `date_created` -- Date created (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | +| `date_created` | datetime | | | | now (on insert) | | ### Transformer Each row corresponds to a Transformer implementation in python. 
(TODO: describe those somewhere) #### Columns -* `key` -- Key that can be used to get the corresponding Transformer subclass (text, unique) -* `date_created` -- Date created (datetime) +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | +| `date_created` | datetime | | | | now (on insert) | | ## Logs @@ -82,25 +94,29 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe Log entries to track the status of a specific harvester run. #### Columns -* `ingest_config_id` -- PK of the IngestConfig for this harvester run (int) -* `harvester_version` -- Current version of the harvester in format 'x.x.x' (text) -* `start_date` -- Beginning of the date range to harvest (datetime) -* `end_date` -- End of the date range to harvest (datetime) -* `started` -- Time this harvester run began (datetime) -* `status` -- Status of the harvester run (string, choices={INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED}, default=INITIAL) - -#### Multi-column indices +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run | +| `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' | +| `start_date` | datetime | | | | | Beginning of the date range to harvest | +| `end_date` | datetime | | | | | End of the date range to harvest | +| `started` | datetime | | | | | Time `status` was set to STARTED | +| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} | + +#### Other indices * `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique) ### TransformLog -Log entries to track the status of a specific harvester run. 
+Log entries to track the status of a transform task #### Columns -* `raw_id` -- PK of the RawData to be transformed (int) -* `ingest_config_id` -- PK of the IngestConfig (int) -* `transformer_version` -- Current version of the transformer in format 'x.x.x' (text) -* `started` -- Time this transform task began (datetime) -* `status` -- Status of the transform task (string, choices={INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED}, default=INITIAL) - -#### Multi-column indices +| Column | Type | Indexed | Nullable | FK | Default | Description | +|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +| `raw_id` | int | | | ✓ | | RawData to be transformed | +| `ingest_config_id` | int | | | ✓ | | IngestConfig used | +| `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' | +| `started` | datetime | | | | | Time `status` was set to STARTED | +| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} | + +#### Other indices * `raw_id`, `transformer_version` (unique) From 9e2b8c33d9258337e2901522a868a97562d1bdf8 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 10:05:06 -0500 Subject: [PATCH 5/6] Fix some table stuff. --- whitepapers/Tables.md | 26 +++++++++----------------- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 53c2518bc..73d3ff55d 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -6,7 +6,7 @@ {Description} | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | | | ✓ | ✓ | ✓ | | | #### Other indices @@ -18,9 +18,8 @@ ### SourceUniqueIdentifier (SUID) Identifier for a specific document from a specific source. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `source_doc_id` | text | | | | | Identifier given to the document by the source | | `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | @@ -30,9 +29,8 @@ Identifier for a specific document from a specific source. ### RawData Raw data, exactly as it was given to SHARE. -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `suid_id` | int | | | ✓ | | SUID for this datum | | `data` | text | | | | | The raw data itself (typically JSON or XML string) | | `sha256` | text | unique | | | | SHA-256 hash of `data` | @@ -44,9 +42,8 @@ Raw data, exactly as it was given to SHARE. ### IngestConfig Describes one way to harvest metadata from a Source, and how to transform the result. -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `source_id` | int | | | ✓ | | Source to harvest from | | `base_url` | text | | | | | URL of the API or endpoint where the metadata is available | | `earliest_date` | date | | ✓ | | | Earliest date with available data | @@ -61,9 +58,8 @@ Describes one way to harvest metadata from a Source, and how to transform the re ### Source A Source is a place metadata comes from. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `name` | text | unique | | | | Short name | | `long_title` | text | unique | | | | Full, human-friendly name | | `home_page` | text | | ✓ | | | URL | @@ -73,18 +69,16 @@ A Source is a place metadata comes from. ### Harvester Each row corresponds to a Harvester implementation in python. (TODO: describe those somewhere) -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | | `date_created` | datetime | | | | now (on insert) | | ### Transformer Each row corresponds to a Transformer implementation in python. (TODO: describe those somewhere) -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | | `date_created` | datetime | | | | now (on insert) | | @@ -93,9 +87,8 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe ### HarvestLog Log entries to track the status of a specific harvester run. 
-#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run | | `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' | | `start_date` | datetime | | | | | Beginning of the date range to harvest | @@ -109,9 +102,8 @@ Log entries to track the status of a specific harvester run. ### TransformLog Log entries to track the status of a transform task -#### Columns | Column | Type | Indexed | Nullable | FK | Default | Description | -|:-------|:----:|:-------:|:---------|:--:|:-------:|:------------| +|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `raw_id` | int | | | ✓ | | RawData to be transformed | | `ingest_config_id` | int | | | ✓ | | IngestConfig used | | `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' | From 7ca98c4b2bcf629b989cd19d25c638fbf79d7845 Mon Sep 17 00:00:00 2001 From: Abram Booth Date: Thu, 16 Feb 2017 13:49:16 -0500 Subject: [PATCH 6/6] Updates --- whitepapers/Tables.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/whitepapers/Tables.md b/whitepapers/Tables.md index 73d3ff55d..36f8f2c81 100644 --- a/whitepapers/Tables.md +++ b/whitepapers/Tables.md @@ -20,7 +20,7 @@ Identifier for a specific document from a specific source. | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| -| `source_doc_id` | text | | | | | Identifier given to the document by the source | +| `identifier` | text | | | | | Identifier given to the document by the source | | `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | #### Other indices @@ -34,8 +34,7 @@ Raw data, exactly as it was given to SHARE. 
| `suid_id` | int | | | ✓ | | SUID for this datum | | `data` | text | | | | | The raw data itself (typically JSON or XML string) | | `sha256` | text | unique | | | | SHA-256 hash of `data` | -| `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested | -| `date_harvested` | datetime | | | | now (on insert) | The first time this exact data was harvested | +| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum | ## Ingest Configuration @@ -72,7 +71,8 @@ Each row corresponds to a Harvester implementation in python. (TODO: describe th | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | -| `date_created` | datetime | | | | now (on insert) | | +| `date_created` | datetime | | | | now | | +| `date_modified` | datetime | | | | now (on update) | | ### Transformer Each row corresponds to a Transformer implementation in python. (TODO: describe those somewhere) @@ -80,7 +80,8 @@ Each row corresponds to a Transformer implementation in python. (TODO: describe | Column | Type | Indexed | Nullable | FK | Default | Description | |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | -| `date_created` | datetime | | | | now (on insert) | | +| `date_created` | datetime | | | | now | | +| `date_modified` | datetime | | | | now (on update) | | ## Logs @@ -90,7 +91,7 @@ Log entries to track the status of a specific harvester run. 
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
-| `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' |
+| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
@@ -106,7 +107,7 @@ Log entries to track the status of a transform task
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
-| `transformer_version` | text | | | | | Current version of the transformer in format 'x.x.x' |
+| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |
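
The zero-padded version encoding introduced for `harvester_version` and `transformer_version` ('1.2.10' => '001.002.010') makes lexicographic ordering of the text column agree with numeric version ordering. A minimal sketch in Python (the function name is illustrative, not from the SHARE codebase):

```python
def pad_version(version):
    """Zero-pad each dotted segment to 3 digits, e.g. '1.2.10' -> '001.002.010'.

    With this encoding, plain text comparison (including in a SQL text
    column) sorts versions correctly: '001.002.010' > '001.002.002',
    whereas unpadded '1.2.10' < '1.2.2' lexicographically.
    """
    return ".".join(segment.zfill(3) for segment in version.split("."))


print(pad_version("1.2.10"))  # prints 001.002.010
```

This is why the padded form is preferable to storing 'x.x.x' directly in a text column that may be filtered or sorted on.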
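
The unique index on `RawData.sha256` implies content-addressed deduplication: re-harvesting a byte-identical datum does not create a new row (patch 6/6 records the repeat sighting via the `harvest_logs` m2m instead). A rough sketch of that invariant in plain Python, with an in-memory dict standing in for the table (names are illustrative, not the SHARE implementation):

```python
import hashlib


def store_raw(datum, suid_id, store):
    """Insert raw data keyed by its SHA-256 content hash.

    Mirrors the unique index on RawData.sha256: storing a byte-identical
    datum a second time is a no-op and returns the same key.
    """
    digest = hashlib.sha256(datum.encode("utf-8")).hexdigest()
    if digest not in store:
        store[digest] = {"suid_id": suid_id, "data": datum}
    return digest
```

Under this scheme two harvester runs that fetch identical data converge on one RawData row, and only the log tables grow.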