110 changes: 110 additions & 0 deletions whitepapers/Tables.md
@@ -1,5 +1,115 @@
# SQL Tables

## Template

### {ModelName}
{Description}

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| | | ✓ | ✓ | ✓ | | |

#### Other indices
* `{column_name}`, `{column_name}`, ... [(unique)]
* ...

## Data

### SourceUniqueIdentifier (SUID)
Identifier for a specific document from a specific source.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `identifier` | text | | | | | Identifier given to the document by the source |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document |

#### Other indices
* `identifier`, `ingest_config_id` (unique)
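
A minimal Django model sketch of this table (SHARE is a Django project; the `on_delete` choice and model wiring here are assumptions, not part of this spec):

```python
from django.db import models

class SourceUniqueIdentifier(models.Model):
    """Identifier for a specific document from a specific source."""
    identifier = models.TextField()  # identifier given to the document by the source
    ingest_config = models.ForeignKey('IngestConfig', on_delete=models.CASCADE)

    class Meta:
        # the unique "other index" on (identifier, ingest_config_id)
        unique_together = ('identifier', 'ingest_config')
```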

### RawData
Raw data, exactly as it was given to SHARE.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `suid_id` | int | | | ✓ | | SUID for this datum |
| `data` | text | | | | | The raw data itself (typically JSON or XML string) |
| `sha256` | text | unique | | | | SHA-256 hash of `data` |
| `harvest_logs` | m2m | | | | | List of HarvestLogs for harvester runs that found this exact datum |
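
One plausible way to keep `sha256` consistent with `data`, sketched as an overridden `save()` (the hash-on-save behavior is an assumption; the spec only requires the unique hash column):

```python
import hashlib

from django.db import models

class RawData(models.Model):
    """Raw data, exactly as it was given to SHARE."""
    suid = models.ForeignKey('SourceUniqueIdentifier', on_delete=models.CASCADE)
    data = models.TextField()
    sha256 = models.TextField(unique=True)
    harvest_logs = models.ManyToManyField('HarvestLog')

    def save(self, *args, **kwargs):
        # derive the hash from the payload so identical data collides on the unique index
        self.sha256 = hashlib.sha256(self.data.encode('utf-8')).hexdigest()
        super().save(*args, **kwargs)
```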

## Ingest Configuration

### IngestConfig
Describes one way to harvest metadata from a Source, and how to transform the result.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `source_id` | int | | | ✓ | | Source to harvest from |
| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available |
| `earliest_date` | date | | ✓ | | | Earliest date with available data |
| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds |
| `rate_limit_period` | int | | | | 1 | Number of seconds for every `rate_limit_allowance` requests |
| `harvester_id` | int | | | ✓ | | Harvester to use |
| `harvester_kwargs` | jsonb | | ✓ | | | JSON object passed to the harvester as kwargs |
| `transformer_id` | int | | | ✓ | | Transformer to use |
| `transformer_kwargs` | jsonb | | ✓ | | | JSON object passed to the transformer as kwargs, along with the harvested raw data |
| `disabled` | bool | | | | False | True if this ingest config should not be run automatically |

Review thread on `rate_limit_allowance`:

> **Member:** Could also use a float here. Probably less understandable... Not sure how much I like it, just punting as an option.
> `1 req / 1 sec = 1.0`
> `1 req / 5 sec = 0.2`

> **Contributor (author):** Yeah, I thought about that, but two columns seemed like less work than reimplementing our rate limiting. Putting it in a float would lose the difference between 1 req/5 sec and 100 req/500 sec (the latter allowing larger bursts), though I don't know how much that matters.

> **Member:** I did not think of that. Let's keep it this way in that case 👍 nice catch
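
To make the trade-off in the thread above concrete, here is a sketch of a sliding-window limiter that keeps `rate_limit_allowance` and `rate_limit_period` as separate integers: `100 req / 500 sec` permits a burst of 100, while the equivalent float rate (`0.2`) cannot express that. The class is illustrative, not SHARE's actual limiter.

```python
import time

class RateLimiter:
    """Allow up to `allowance` requests per rolling `period` seconds."""

    def __init__(self, allowance: int, period: int):
        self.allowance = allowance
        self.period = period
        self.request_times = []  # timestamps of requests still inside the window

    def wait(self):
        now = time.monotonic()
        # drop requests that have aged out of the window
        self.request_times = [t for t in self.request_times if now - t < self.period]
        if len(self.request_times) >= self.allowance:
            # sleep until the oldest request leaves the window
            time.sleep(self.period - (now - self.request_times[0]))
        self.request_times.append(time.monotonic())
```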

### Source
A Source is a place metadata comes from.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `name` | text | unique | | | | Short name |
| `long_title` | text | unique | | | | Full, human-friendly name |
| `home_page` | text | | ✓ | | | URL |
| `icon` | image | | ✓ | | | Recognizable icon for the source |
| `user_id` | int | | | ✓ | | User with permission to submit data as this source (TODO: replace with django permissions stuff) |

### Harvester
Each row corresponds to a Harvester implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |

### Transformer
Each row corresponds to a Transformer implementation in Python. (TODO: describe those somewhere)

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass |
| `date_created` | datetime | | | | now | |
| `date_modified` | datetime | | | | now (on update) | |
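
Both the Harvester and Transformer tables resolve `key` to a Python subclass; a sketch of one possible registry (the decorator, registry dict, and example key are all assumptions — the document only says the key "can be used to get the corresponding subclass"):

```python
HARVESTER_REGISTRY = {}

def register_harvester(key):
    """Class decorator that files a Harvester implementation under its key."""
    def decorator(cls):
        HARVESTER_REGISTRY[key] = cls
        return cls
    return decorator

@register_harvester('oai.example')  # hypothetical key
class ExampleOAIHarvester:
    """Illustrative Harvester subclass."""

def get_harvester(key):
    # `key` comes from the Harvester table's unique `key` column
    return HARVESTER_REGISTRY[key]
```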

## Logs

### HarvestLog
Log entries to track the status of a specific harvester run.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
| `harvester_version` | text | | | | | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `start_date` | datetime | | | | | Beginning of the date range to harvest |
| `end_date` | datetime | | | | | End of the date range to harvest |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} |

#### Other indices
* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique)
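
The padding scheme for `harvester_version` (and `transformer_version` below) makes text comparison agree with semantic-version ordering, which keeps the unique index and any sorting on the column well-behaved. A sketch:

```python
def pad_version(version: str) -> str:
    """'1.2.10' -> '001.002.010', so text comparison matches semver order."""
    return '.'.join(segment.zfill(3) for segment in version.split('.'))

assert pad_version('1.2.10') == '001.002.010'
assert pad_version('1.2.9') < pad_version('1.2.10')  # true as text, unlike '1.2.9' < '1.2.10'
```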

### TransformLog
Log entries to track the status of a transform task.

| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `raw_id` | int | | | ✓ | | RawData to be transformed |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used |
| `transformer_version` | text | | | | | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
| `started` | datetime | | | | | Time `status` was set to STARTED |
| `status` | text | | | | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |

#### Other indices
* `raw_id`, `transformer_version` (unique)
30 changes: 15 additions & 15 deletions whitepapers/tasks/Harvest.md
@@ -17,50 +17,50 @@


## Parameters
* `ingest_config_id` -- The PK of the IngestConfig to use
* `start_date` -- Beginning of the date range to harvest
* `end_date` -- End of the date range to harvest
* `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimited)
* `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimited)
* `superfluous` -- Take certain actions even if they have previously succeeded
* `transform` -- Should transform tasks be launched for collected data? Defaults to `True`
* `no_split` -- Should harvest jobs be split into multiple tasks? Defaults to `False`
* `ignore_disabled` -- Run the task, even with disabled ingest configs
* `force` -- Force the task to run, against all odds


## Steps

### Preventative measures
* If the specified `ingest_config` is disabled and `force` or `ignore_disabled` is not set, crash
* For the given `ingest_config` find up to the last 5 HarvestLogs with the same harvester version
* If they all failed, throw an exception (refuse to run; see the sketch below)
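
A sketch of this guard with the Django ORM (ordering by `started` and the exception type are assumptions):

```python
def refuse_if_recently_failing(ingest_config, harvester_version):
    # Up to the last 5 runs of this config with the same harvester version
    recent = list(HarvestLog.objects.filter(
        ingest_config=ingest_config,
        harvester_version=harvester_version,
    ).order_by('-started')[:5])
    if recent and all(log.status == 'FAILED' for log in recent):
        raise RuntimeError('All recent harvests failed; refusing to run')
```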

### Setup
* Lock the `ingest_config` (NOWAIT)
* On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`)
* If found and status is:
* `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
* `STARTED`: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts.
* Set HarvestLog status to `STARTED`
* If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False
* Chunk the date range and spawn a harvest task for each chunk
* Set status to `SPLIT` and exit
* Load the harvester for the given `ingest_config`
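
The lock-and-get-or-create steps might look like the following (a sketch; rescheduling on lock failure and the date-range split are omitted):

```python
import logging

from django.db import transaction
from django.utils import timezone

logger = logging.getLogger(__name__)

with transaction.atomic():
    # NOWAIT: raise immediately instead of blocking if another run holds the lock
    ingest_config = (IngestConfig.objects
                     .select_for_update(nowait=True)
                     .get(pk=ingest_config_id))

    log, created = HarvestLog.objects.get_or_create(
        ingest_config=ingest_config,
        harvester_version=harvester_version,
        start_date=start_date,
        end_date=end_date,
    )
    if not created and log.status == 'STARTED':
        # Should not have been able to lock the ingest_config
        logger.warning('HarvestLog %s was already STARTED', log.id)
    log.status = 'STARTED'
    log.started = timezone.now()
    log.save()
```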

### Actually Harvest
* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit`

### Pass the data along
* Begin catching any exceptions
* For each piece of data received (preferably in bulk/chunks)
* Get or create `SourceUniqueIdentifier(suid, source_id)`
* Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
* Get or create `RawData(hash, suid)`
* For each piece of data (After saving to keep as transactional as possible)
* Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)`
* If the log already exists and `superfluous` is not set, exit
* Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False`
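
Sketched end to end (bulk handling and the surrounding exception catching are elided; `transform_task` is a stand-in name for the transform task, and the SUID is keyed on `ingest_config` per the open question above):

```python
import hashlib

for datum in harvested:
    suid, _ = SourceUniqueIdentifier.objects.get_or_create(
        identifier=datum['identifier'],
        ingest_config=ingest_config,
    )
    raw, _ = RawData.objects.get_or_create(
        sha256=hashlib.sha256(datum['data'].encode('utf-8')).hexdigest(),
        defaults={'suid': suid, 'data': datum['data']},
    )
    log, created = TransformLog.objects.get_or_create(
        raw=raw,
        ingest_config=ingest_config,
        transformer_version=transformer_version,
    )
    if not created and not superfluous:
        continue  # already queued for this transformer version
    if transform:
        transform_task.apply_async((raw.id, ingest_config.id))
```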

### Clean up
* If an exception was caught, set status to `FAILED` and insert the exception/traceback
32 changes: 16 additions & 16 deletions whitepapers/tasks/Transform.md
@@ -2,15 +2,15 @@


## Responsibilities
* Parsing data using source-specific parsers
* Applying global cleaners to the data
* Catching any extraneous exceptions, storing them in the TransformLog, and marking the TransformLog `FAILED`


## Parameters
* `raw_id` -- The PK of the RawData to transform
* `transformer_version` -- Version of the transformer to use. Defaults to the latest.
* `regulator_version` -- Version of the regulator (cleaning suite) to use. Defaults to the latest.
* `superfluous` -- Redo steps that have previously succeeded


@@ -19,23 +19,23 @@
### Setup
* Load RawData by id.
* Crash if not found.
* If not defined, set `transformer_version` to the latest.
* If not defined, set `regulator_version` to the latest.
* Find and lock TransformLog(`raw_id`, `transformer_version`) (SELECT FOR UPDATE NOWAIT)
* If not found, log an error, then create, commit, and lock.
* If the create fails, log an error and exit.
* If the lock times out or isn't granted, log an error and exit.
* If the found TransformLog's status is `SUCCEEDED` and `superfluous` is `False`, exit.
* Set the status of the TransformLog to `STARTED`

### Check for racing
* Search for any equivalent RawData (`document_id`, `source_id`) with an earlier timestamp that has not finished transforming
* If found, set status to `RESCHEDULED` and exit
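
A sketch of the racing check (assumes RawData carries a creation timestamp; the "not finished" statuses follow the TransformLog table above):

```python
racing = RawData.objects.filter(
    suid=raw.suid,                      # an equivalent document from the same source
    date_created__lt=raw.date_created,  # with an earlier timestamp
    transformlog__status__in=['INITIAL', 'STARTED', 'RESCHEDULED'],
)
if racing.exists():
    transform_log.status = 'RESCHEDULED'
    transform_log.save()
    return
```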

### Actually transform the data
* Start a transaction
* Load the transformer
* Transform data
* Load the cleaning suite
* Clean data
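
The core of this step, sketched (`get_transformer` and `get_regulator` are hypothetical loaders keyed the way the Harvester/Transformer tables describe):

```python
from django.db import transaction

with transaction.atomic():
    Transformer = get_transformer(ingest_config.transformer.key)
    transformer = Transformer(**(ingest_config.transformer_kwargs or {}))
    transformed = transformer.transform(raw.data)

    regulator = get_regulator(regulator_version)  # the global cleaning suite
    cleaned = regulator.clean(transformed)
```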

@@ -52,4 +52,4 @@
* Commit transaction
* Release all locks
* Start disambiguation tasks for updated states
* Set TransformLog status to `SUCCEEDED`