[Tamandua] Describe models for configuring sources #588
| * `name` -- Short, unique string | ||
| * `long_title` -- full, human-friendly name | ||
| * `home_page` -- URL | ||
| * `favicon` -- |
There was a problem hiding this comment.
Can we change this to just be icon?
| * `key` -- Unique key that can be used to get the corresponding Transformer subclass | ||
| * `version` -- | ||
| ### SourceConfig(?) |
| #### Columns | ||
| * `key` -- Unique key that can be used to get the corresponding Harvester subclass | ||
| * `version` -- |
There was a problem hiding this comment.
Does it make sense to keep version here?
It'll get copied to the HarvestLog or whichever log is relevant.
There was a problem hiding this comment.
Yeah, it makes more sense to keep version only on the class, so they can't get out of sync. It felt weird having a one-column table, but I guess there's nothing wrong with that.
There was a problem hiding this comment.
We can throw in some time stamps if that would make you feel better?
There was a problem hiding this comment.
Actually date_created wouldn't be a bad idea at all.
There was a problem hiding this comment.
Yay, no sad tables for one 💔
| * `source_id` -- PK of the source | ||
| * `base_url` -- URL of the API/endpoint where the metadata is available | ||
| * `earliest_date` -- Earliest date with available data (nullable) | ||
| * `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimited) |
There was a problem hiding this comment.
Type and format?
How would I express 5 reqs / sec vs. 1 req / 5 sec?
There was a problem hiding this comment.
I split it into rate_limit_allowance and rate_limit_period.
| * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)` | ||
| * if the log already exists and superfluous is not set, exit | ||
| * Start the transform task(raw_id, version) unless `transform` is `False` | ||
| * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False` |
There was a problem hiding this comment.
Should version be passed here? I've been back and forth on it.
There was a problem hiding this comment.
If it's in TransformLog, I feel like it shouldn't be in the task.
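For reference, the get-or-create flow in the diff lines above could be sketched roughly as follows. This is a minimal sketch, not the actual implementation: the in-memory `transform_logs` dict, the `started_tasks` list, the `ingest` function name, and the returned status strings are all hypothetical stand-ins. It passes `version` to the task as the diff line does, even though this thread leans toward dropping it.

```python
# Hypothetical in-memory stand-ins for the TransformLog table and the task queue.
transform_logs = {}
started_tasks = []


def ingest(raw_id, ingest_config_id, transformer_version,
           superfluous=False, transform=True):
    """Get or create a log entry, then start the transform task if appropriate."""
    key = (raw_id, ingest_config_id, transformer_version)
    already_exists = key in transform_logs
    transform_logs.setdefault(key, {"status": "created"})

    # If the log already exists and superfluous is not set, exit.
    if already_exists and not superfluous:
        return "exists"

    # Start the task unless transform is False.
    if not transform:
        return "logged"
    started_tasks.append((raw_id, ingest_config_id, transformer_version))
    return "started"


first = ingest(1, 7, "0.0.1")                      # new log, task started
second = ingest(1, 7, "0.0.1")                     # log exists, nothing to do
forced = ingest(1, 7, "0.0.1", superfluous=True)   # log exists, but forced
log_only = ingest(2, 7, "0.0.1", transform=False)  # log created, no task
```

If `version` is dropped from the task arguments, the task would presumably read it from the log entry instead, which keeps the two from getting out of sync.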
| ## Pipeline configuration |
There was a problem hiding this comment.
Could you throw in HarvestLog here?
Or I can if you don't want to 👍
| #### Columns | ||
| * `name` -- Short, unique string | ||
| * `long_title` -- full, human-friendly name |
There was a problem hiding this comment.
As things stand long_title should be unique as well.
Side note: we should come up with a slightly better format for expressing SQL rows.
Each should convey whether the column is unique, indexed, or unique together with another value, plus its datatype, nullability, default, etc.
Should we just make a table template that we copy for everything?
There was a problem hiding this comment.
I added a template and changed the rest to follow it, comments/criticisms welcome.
| ## Pipeline configuration | ||
| ### Source |
There was a problem hiding this comment.
Spitballing some ideas: for the admin we should throw in some aggregations.
Last harvest: `SELECT * FROM harvest_log WHERE source_id = me ORDER BY started DESC LIMIT 1`
Total harvests: `SELECT COUNT(*) FROM harvest_log WHERE source_id = me`
Success % ...
Etc.
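A quick way to try out the two aggregations suggested above is an in-memory SQLite table. This is only a sketch under assumptions: the `harvest_log` columns are reduced to `source_id` and `started` as named in the comment, the sample rows are made up, and the helper names `last_harvest` and `total_harvests` are hypothetical. The "Success %" aggregation is left out because the comment does not spell it out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harvest_log ("
    "id INTEGER PRIMARY KEY, source_id INTEGER, started TEXT)"
)
# Made-up sample rows; ISO 8601 strings sort chronologically as text.
conn.executemany(
    "INSERT INTO harvest_log (source_id, started) VALUES (?, ?)",
    [
        (1, "2017-01-01T00:00:00"),
        (1, "2017-01-03T00:00:00"),
        (2, "2017-01-02T00:00:00"),
        (1, "2017-01-02T00:00:00"),
    ],
)


def last_harvest(conn, source_id):
    """Return the most recently started harvest row for a source, or None."""
    return conn.execute(
        "SELECT id, source_id, started FROM harvest_log "
        "WHERE source_id = ? ORDER BY started DESC LIMIT 1",
        (source_id,),
    ).fetchone()


def total_harvests(conn, source_id):
    """Return how many harvests have been logged for a source."""
    return conn.execute(
        "SELECT COUNT(*) FROM harvest_log WHERE source_id = ?",
        (source_id,),
    ).fetchone()[0]
```

Note that the "last harvest" query needs a descending sort; without it, `LIMIT 1` would return the earliest harvest instead.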
| {Description} | ||
| #### Columns | ||
| * `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices},]) |
There was a problem hiding this comment.
| Column | Type | Indexed | Nullable | FK | Default | Description |
|---|---|---|---|---|---|---|
| suid_id | int | X | | X | | Description at the end because it may be very very long |
| suid_id | int | Yes | No | Yes | No | Description at the end because it may be very very long |
| suid_id | int | ✅ | ❌ | ✅ | ❌ | Description at the end because it may be very very long |
| suid_id | int | ✓ | ⨉ | ✓ | ⨉ | Description at the end because it may be very very long |
?
There was a problem hiding this comment.
Oh right, tables are a thing! That's way better. I like blank for false and an icon for true (maybe ✓), so the eye is drawn to the true values.
There was a problem hiding this comment.
👍 I'm good with any of them. Well the red X is a bit too much...
| | Column | Type | Indexed | Nullable | FK | Default | Description | | ||
| |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | ||
| | `source_doc_id` | text | | | | | Identifier given to the document by the source | | ||
| | `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document | |
There was a problem hiding this comment.
Could source_doc_id just be identifier? Having non-FKs end in _id feels a bit hairy.
I think it makes sense to link to source here. If we extract the same identifier they will be the same record(s).
And many times when the harvester changes the identifier changes as well.
Thoughts?
There was a problem hiding this comment.
👍 identifier
I think linking suid to source makes sense if we only have one ingest config for that source enabled at a time. If we want to support multiple per source, though, we could harvest the same document in different formats (maybe from two complementary APIs). If we don't keep them separate, there could be a back-and-forth, deleting each other's states where the formats don't overlap.
There was a problem hiding this comment.
The transformers will keep everything in the correct format and a source should never have 2 documents with the same identifier out of sync.
There was a problem hiding this comment.
Stop me if I'm worrying too much about an absurdly unlikely case (or am just confused), but what if a source has two APIs (or their OAI endpoint supports two metadata prefixes, or whatever) which largely overlap, but each has information the other lacks:
- A has good contributor affiliations
- B has good funding information
If we harvest from both, the Consolidator will look at freshly transformed data from A and think "oh, that funding information must have been deleted", then look at data from B and think, "oh, these affiliations must have been removed", and we'll never have both at the same time.
There was a problem hiding this comment.
That's a very good point but I would want to fix that at the consolidator level. This revolves more around the issue of how to tell partial data vs subtractive data.
I think it really comes down to detecting omission of fields vs empty fields.
This may be a bit of a hack. Let's say we have the following for an agent:
{
  "@id": "_:122",
  "@type": "agent",
  "name": "Bill",
  "affiliations": []
}
vs
{
  "@id": "_:122",
  "@type": "agent",
  "name": "Bill"
}
The first explicitly states that there are no affiliations, while the second simply doesn't have affiliation data.
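The omission-versus-empty distinction described above could be sketched like this. It is not the Consolidator's actual logic, just a minimal illustration: `apply_update` is a hypothetical helper, and the stored affiliation value is made up.

```python
def apply_update(current: dict, incoming: dict) -> dict:
    """Merge freshly transformed data onto the stored state.

    A field that is present in `incoming` (even as an empty list) overwrites
    the stored value; a field that is omitted leaves the stored value alone.
    """
    merged = dict(current)
    for field, value in incoming.items():
        merged[field] = value
    return merged


# Stored state with a made-up affiliation.
stored = {
    "@id": "_:122",
    "@type": "agent",
    "name": "Bill",
    "affiliations": ["Example University"],
}

# Explicitly empty: the source says Bill has no affiliations.
explicit_empty = {"@id": "_:122", "@type": "agent", "name": "Bill", "affiliations": []}

# Omitted: the source simply has nothing to say about affiliations.
omitted = {"@id": "_:122", "@type": "agent", "name": "Bill"}

after_empty = apply_update(stored, explicit_empty)
after_omitted = apply_update(stored, omitted)
```

Under this rule the explicit empty list clears the affiliations, while the omitted field preserves them, so a source that lacks a field cannot wipe out data contributed by another.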
There was a problem hiding this comment.
That makes sense... just need to make sure the Regulator is in on it and preserves empty m2ms.
It could still be an issue if (stretching my example even deeper into unlikelihood) A has simple affiliation data (to a university, say), while B has affiliations to university, department, school, and lab. A's short list would overwrite B's long one.
There was a problem hiding this comment.
Wouldn't we just disable A at that point?
There was a problem hiding this comment.
Oops, I flipped them.
- A has thorough affiliation data (university, school, department, lab...)
- B has good funding info and some basic affiliation data
How could we combine them without losing info?
There was a problem hiding this comment.
An IngestConfig sort of defines a particular view for a document, a lens which refracts arbitrary information into focus in a way SHARE can understand. SUIDs seem more useful if they denote a particular document through a particular lens, and everyone before the Deduplicator need only worry about looking through one lens at a time.
There was a problem hiding this comment.
Yep, you got me. 👍 IngestConfig makes the most sense.
| | Column | Type | Indexed | Nullable | FK | Default | Description | | ||
| |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | ||
| | `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass | | ||
| | `date_created` | datetime | | | | now (on insert) | | |
There was a problem hiding this comment.
NOW() is a postgres function.
I think you can remove the "on insert"; isn't it implied by being the default?
There was a problem hiding this comment.
If we implement this with django's auto_now_add, NOW() isn't used and there's no actual default in postgres. Does that matter here?
Or in another case, Source.icon is technically a text field, with a value which corresponds to where the image is stored, because that's how django's ImageField works. That seems like irrelevant/extraneous detail for a white paper, and all that actually matters is that each Source has an (optional) image named icon.
How much should this file be considered exact specs for SQL tables, vs. a more abstract view of the objects and their attributes, with implementation details left up to the reader? I feel like we need to find a balance where these papers accurately describe a system, and our code is but one possible implementation of that system. But I'm not quite sure where that balance lies.
| | `source_id` | int | | | ✓ | | Source to harvest from | | ||
| | `base_url` | text | | | | | URL of the API or endpoint where the metadata is available | | ||
| | `earliest_date` | date | | ✓ | | | Earliest date with available data | | ||
| | `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds | |
There was a problem hiding this comment.
Could also use a float here. Probably less understandable... Not sure how much I like it, just punting as an option.
1 req / 1 sec = 1.0
1 req / 5 sec = 0.2
There was a problem hiding this comment.
Yeah, I thought about that, but two columns seemed like less work than reimplementing our rate limiting. Putting it in a float would lose the difference between 1 req/5 sec and 100 req/500 sec (the latter allowing larger bursts), though I don't know how much that matters.
There was a problem hiding this comment.
I did not think of that. Let's keep it this way in that case 👍 nice catch
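To make the burst argument concrete, here is a small sketch of a fixed-window limiter driven by the two columns. It is an assumption-laden illustration, not the project's rate limiting code: the `RateLimit` and `FixedWindowLimiter` classes are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class RateLimit:
    """Hypothetical stand-in for the two rate limit columns."""
    allowance: int  # rate_limit_allowance: requests allowed per period
    period: int     # rate_limit_period: length of the period in seconds

    @property
    def average_rate(self) -> float:
        """Requests per second; all that a single float column could store."""
        return self.allowance / self.period


class FixedWindowLimiter:
    """Allow at most `allowance` requests in each window of `period` seconds."""

    def __init__(self, limit: RateLimit):
        self.limit = limit
        self.window_start = None
        self.used = 0

    def allow(self, now: float) -> bool:
        """Return True if a request at time `now` (in seconds) may proceed."""
        if self.window_start is None or now - self.window_start >= self.limit.period:
            # Start a fresh window and reset the counter.
            self.window_start = now
            self.used = 0
        if self.used < self.limit.allowance:
            self.used += 1
            return True
        return False


slow = FixedWindowLimiter(RateLimit(allowance=1, period=5))
bursty = FixedWindowLimiter(RateLimit(allowance=100, period=500))

# Fire ten requests at the same instant (t = 0).
slow_allowed = sum(slow.allow(0.0) for _ in range(10))
bursty_allowed = sum(bursty.allow(0.0) for _ in range(10))
```

Both limits average 0.2 requests per second, yet only one of the ten simultaneous requests gets through the 1 req / 5 sec limiter while all ten pass the 100 req / 500 sec one. That difference is exactly what a single float would lose.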
| | `suid_id` | int | | | ✓ | | SUID for this datum | | ||
| | `data` | text | | | | | The raw data itself (typically JSON or XML string) | | ||
| | `sha256` | text | unique | | | | SHA-256 hash of `data` | | ||
| | `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested | |
There was a problem hiding this comment.
If we FK to HarvestLog, do we even need the date fields here?
There was a problem hiding this comment.
Probably not. If it's only pointing to the HarvestLog that created it, we'd lose date_seen, but I don't know if anyone cares about that.
There was a problem hiding this comment.
If we make it a m2m (which I think we should?) there's no information lost.
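For context, the behavior implied by the unique `sha256` column and the `date_seen` default in the diff could look roughly like the sketch below. Everything here is an assumption for illustration: the `raw_data` dict stands in for the table, and `store_raw` is a hypothetical helper, not an existing function.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical stand-in for the table: sha256 hex digest -> row.
raw_data = {}


def store_raw(suid_id, data, now=None):
    """Insert a raw datum, or refresh date_seen if this exact data is known.

    Returns a (row, created) pair.
    """
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(data.encode("utf-8")).hexdigest()
    row = raw_data.get(digest)
    if row is None:
        row = {"suid_id": suid_id, "data": data, "sha256": digest, "date_seen": now}
        raw_data[digest] = row
        return row, True
    # Same bytes harvested again: no new row, only the timestamp moves.
    row["date_seen"] = now
    return row, False


payload = '{"title": "example"}'
row1, created1 = store_raw(1, payload, now=datetime(2017, 1, 1, tzinfo=timezone.utc))
row2, created2 = store_raw(1, payload, now=datetime(2017, 1, 2, tzinfo=timezone.utc))
```

With a many-to-many link to HarvestLog, the same information would come from the set of linked logs instead of a timestamp overwritten on every harvest.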
| | Column | Type | Indexed | Nullable | FK | Default | Description | | ||
| |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | ||
| | `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass | | ||
| | `date_created` | datetime | | | | now (on insert) | | |
There was a problem hiding this comment.
Let's go ahead and throw date_modified on here as well, might be handy for debugging syncing of harvesters/transformers and seems like a good practice?
| | Column | Type | Indexed | Nullable | FK | Default | Description | | ||
| |:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------| | ||
| | `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run | | ||
| | `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' | |
There was a problem hiding this comment.
Version might need to have a different format to it. I'd like it to be sortable. It fails to order 0.0.10 and 0.0.9 correctly. We could just pad the version up to 3 digits or something and not worry about it?
There was a problem hiding this comment.
Yeah, padding each segment seems like the best way to go.
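A minimal sketch of the padding idea, assuming purely numeric, dot-separated segments and a width of three digits as suggested above. The `pad_version` helper is hypothetical.

```python
def pad_version(version: str, width: int = 3) -> str:
    """Zero-pad each dot-separated segment so versions sort as plain text."""
    segments = version.split(".")
    for segment in segments:
        if not segment.isdigit():
            raise ValueError(f"non-numeric segment in version: {version!r}")
        if len(segment) > width:
            raise ValueError(f"segment too long for width {width}: {version!r}")
    return ".".join(segment.zfill(width) for segment in segments)


versions = ["0.0.10", "0.0.9", "0.1.0"]

# Plain text ordering puts 0.0.10 before 0.0.9.
plain_sorted = sorted(versions)

# Padded ordering matches the intended version order.
padded_sorted = sorted(pad_version(v) for v in versions)
```

The trade-off is a hard cap of 999 per segment at width three; anything larger raises an error here rather than silently sorting wrong.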
3c3fa78 to 7ca98c4
* Describe models for configuring sources
* SourceConfig => IngestConfig
* Consistent table definitions
* Define tables using tables.
* Fix some table stuff.
* Updates
No description provided.