Skip to content

[Tamandua] Describe models for configuring sources#588

Merged
aaxelb merged 6 commits intoCenterForOpenScience:feature/project-tamanduafrom
aaxelb:feature/project-tamandua
Feb 16, 2017
Merged

[Tamandua] Describe models for configuring sources#588
aaxelb merged 6 commits intoCenterForOpenScience:feature/project-tamanduafrom
aaxelb:feature/project-tamandua

Conversation

@aaxelb
Copy link
Copy Markdown
Contributor

@aaxelb aaxelb commented Feb 15, 2017

No description provided.

Comment thread whitepapers/Tables.md Outdated
* `name` -- Short, unique string
* `long_title` -- full, human-friendly name
* `home_page` -- URL
* `favicon` --
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this to just be icon?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread whitepapers/Tables.md Outdated
* `key` -- Unique key that can be used to get the corresponding Transformer subclass
* `version` --

### SourceConfig(?)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe IngestConfig?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, good call.

Comment thread whitepapers/Tables.md Outdated

#### Columns
* `key` -- Unique key that can be used to get the corresponding Harvester subclass
* `version` --
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to be keeping version here?
It'll get copied to the HarvestLog or which ever relevant log.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it makes more sense to keep version only on the class, so they can't get out of sync. It felt weird having a one-column table, but I guess there's nothing wrong with that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can throw in some time stamps if that would make you feel better?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually date_created wouldn't be a bad idea at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, no sad tables for one 💔

Comment thread whitepapers/Tables.md Outdated
* `source_id` -- PK of the source
* `base_url` -- URL of the API/endpoint where the metadata is available
* `earliest_date` -- Earliest date with available data (nullable)
* `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimited)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type and format?

How would I express 5 reqs / sec vs 1 reqs / 5 sec

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split it into rate_limit_allowance and rate_limit_period.

Comment thread whitepapers/tasks/Harvest.md Outdated
* Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)`
* if the log already exists and superfluous is not set, exit
* Start the transform task(raw_id, version) unless `transform` is `False`
* Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False`
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should version be passed here? I've been back and forth on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's in TransformLog, I feel like it shouldn't be in the task.

Comment thread whitepapers/Tables.md Outdated



## Pipeline configuration
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you throw in HarvestLog here?
Or I can if you don't want to 👍

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread whitepapers/Tables.md Outdated

#### Columns
* `name` -- Short, unique string
* `long_title` -- full, human-friendly name
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As things stand long_title should be unique as well.

Side note: we should come up with a slightly better format for expressing sql rows.
Each should convey, if it's unique, indexed, unique with another value, datatype, nullable, default, etc.

Should we just make a table template that we copy for everything?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a template and changed the rest to follow it, comments/criticisms welcome.

Comment thread whitepapers/Tables.md

## Pipeline configuration

### Source
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spit balling some ideas, for the admin we should throw in some aggregations.

Last harvest SELECT * FROM harvest_log WHERE source_id = me SORT BY started LIMIT 1
Total harvests SELECT COUNT(*) FROM harvest_log WHERE source_id = me
Success % ...
Etc.

Comment thread whitepapers/Tables.md Outdated
{Description}

#### Columns
* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column Type Indexed Nullable FK Default Description
suid_id int X X Description as the end because it may be very very long
suid_id int Yes No Yes No Description as the end because it may be very very long
suid_id int ✅  Description as the end because it may be very very long
suid_id int Description as the end because it may be very very long

?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, tables are a thing! That's way better. I like blank for false and an icon for true (maybe ✓), so the eye is drawn to the true values.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'm good with any of them. Well the red X is a bit too much...

Comment thread whitepapers/Tables.md
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `source_doc_id` | text | | | | | Identifier given to the document by the source |
| `ingest_config_id` | int | | | ✓ | | IngestConfig used to ingest the document |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could source_doc_id just be identifier, having non-FK's ending _id feels a bit hairy?

I think it makes sense to link to source here. If we extract the same identifier they will be the same record(s).
And many times when the harvester changes the identifier changes as well.

Thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 identifier

I think linking suid to source makes sense if we only have one ingest config for that source enabled at a time. If we want to support multiple per source, though, we could harvest the same document in different formats (maybe from two complementary APIs). If we don't keep them separate, there could be a back-and-forth, deleting each other's states where the formats don't overlap.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The transformers will keep everything in the correct format and a source should never have 2 documents with the same identifier out of sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop me if I'm worrying too much about an absurdly unlikely case (or am just confused), but what if a source has two APIs (or their OAI endpoint supports two metadata prefixes, or whatever) which largely overlap, but each has information the other lacks:

  • A has good contributor affiliations
  • B has good funding information

If we harvest from both, the Consolidator will look at freshly transformed data from A and think "oh, that funding information must have been deleted", then look at data from B and think, "oh, these affiliations must have been removed", and we'll never have both at the same time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point but I would want to fix that at the consolidator level. This revolves more around the issue of how to tell partial data vs subtractive data.

I think it really comes down to detecting omission of fields vs empty fields.

This may be a bit of a hack. Let's say we have the following for an agent:

{
    "@id": "_:122",
    "@type": "agent",
    "name": "Bill",
    "affiliations": []
}

vs

{
    "@id": "_:122",
    "@type": "agent",
    "name": "Bill",
}

The first is explicitly stating there is a lack of affiliations while the second just doesn't have affiliation data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense... just need to make sure the Regulator is in on it and preserves empty m2ms.

It could still be an issue if (stretching my example even deeper into unlikelihood) A has simple affiliation data (to a university, say), while B has affiliations to university, department, school, and lab. A's short list would overwrite B's long one.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we just disable A at that point?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I flipped them.

  • A has thorough affiliation data (university, school, department, lab...)
  • B has good funding info and some basic affiliation data

How could we combine them without losing info?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An IngestConfig sort of defines a particular view for a document, a lens which refracts arbitrary information into focus in a way SHARE can understand. SUIDs seem more useful if they denote a particular document through a particular lens, and everyone before the Deduplicator need only worry about looking through one lens at a time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you got me. 👍 IngestConfig makes the most sense.

Comment thread whitepapers/Tables.md Outdated
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Harvester subclass |
| `date_created` | datetime | | | | now (on insert) | |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOW() is a postgres function.
You can remove the "on insert" (I think) it's implied as the default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we implement this with django's auto_now_add, NOW() isn't used and there's no actual default in postgres. Does that matter here?

Or in another case, Source.icon is technically a text field, with a value which corresponds to where the image is stored, because that's how django's ImageField works. That seems like irrelevant/extraneous detail for a white paper, and all that actually matters is that each Source has an (optional) image named icon.

How much should this file be considered exact specs for SQL tables, vs. a more abstract view of the objects and their attributes, with implementation details left up to the reader? I feel like we need to find a balance where these papers accurately describe a system, and our code is but one possible implementation of that system. But I'm not quite sure where that balance lies.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abstract view > tedium.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread whitepapers/Tables.md
| `source_id` | int | | | ✓ | | Source to harvest from |
| `base_url` | text | | | | | URL of the API or endpoint where the metadata is available |
| `earliest_date` | date | | ✓ | | | Earliest date with available data |
| `rate_limit_allowance` | int | | | | 5 | Number of requests allowed every `rate_limit_period` seconds |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also use a float here. Probably less understandable... Not sure how much I like it, just punting as an option.
1 req / 1 sec = 1.0
1 req / 5 sec = 0.2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that, but two columns seemed like less work than reimplementing our rate limiting. Putting it in a float would lose the difference between 1 req/5 sec and 100 req/500 sec (the latter allowing larger bursts), though I don't know how much that matters.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not think of that. Let's keep it this way in that case 👍 nice catch

Comment thread whitepapers/Tables.md Outdated
| `suid_id` | int | | | ✓ | | SUID for this datum |
| `data` | text | | | | | The raw data itself (typically JSON or XML string) |
| `sha256` | text | unique | | | | SHA-256 hash of `data` |
| `date_seen` | datetime | | | | now (every update) | The last time this exact data was harvested |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If weFK to HarvestLog do we even need the date fields here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. If it's only pointing to the HarvestLog that created it, we'd lose date_seen, but I don't know if anyone cares about that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make it a m2m (which I think we should?) there's no information lost.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call 👍

Comment thread whitepapers/Tables.md Outdated
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `key` | text | unique | | | | Key that can be used to get the corresponding Transformer subclass |
| `date_created` | datetime | | | | now (on insert) | |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go ahead and throw date_modified on here as well, might be handy for debugging syncing of harvesters/transformers and seems like a good practice?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread whitepapers/Tables.md Outdated
| Column | Type | Indexed | Nullable | FK | Default | Description |
|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
| `ingest_config_id` | int | | | ✓ | | IngestConfig for this harvester run |
| `harvester_version` | text | | | | | Current version of the harvester in format 'x.x.x' |
Copy link
Copy Markdown
Member

@chrisseto chrisseto Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version might need to have a different format to it. I'd like it to be sortable. It fails to order 0.0.10 and 0.0.9 correctly. We could just pad the version up to 3 digits or something and not worry about it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, padding each segment seems like the best way to go.

@aaxelb aaxelb force-pushed the feature/project-tamandua branch from 3c3fa78 to 7ca98c4 Compare February 16, 2017 18:50
@aaxelb aaxelb merged commit 504085f into CenterForOpenScience:feature/project-tamandua Feb 16, 2017
@aaxelb aaxelb deleted the feature/project-tamandua branch February 16, 2017 18:50
chrisseto pushed a commit that referenced this pull request Feb 23, 2017
* Describe models for configuring sources

* SourceConfig => IngestConfig

* Consistent table definitions

* Define tables using tables.

* Fix some table stuff.

* Updates
chrisseto pushed a commit that referenced this pull request Mar 6, 2017
* Describe models for configuring sources

* SourceConfig => IngestConfig

* Consistent table definitions

* Define tables using tables.

* Fix some table stuff.

* Updates
chrisseto pushed a commit that referenced this pull request Mar 8, 2017
* Describe models for configuring sources

* SourceConfig => IngestConfig

* Consistent table definitions

* Define tables using tables.

* Fix some table stuff.

* Updates
chrisseto pushed a commit that referenced this pull request Mar 10, 2017
* Describe models for configuring sources

* SourceConfig => IngestConfig

* Consistent table definitions

* Define tables using tables.

* Fix some table stuff.

* Updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants