[Tamandua] Describe models for configuring sources by aaxelb · Pull Request #588 · CenterForOpenScience/SHARE

aaxelb · 2017-02-15T16:18:06Z

No description provided.

chrisseto · 2017-02-15T16:30:35Z

+* `name` -- Short, unique string
+* `long_title` -- full, human-friendly name
+* `home_page` -- URL
+* `favicon` -- 


Can we change this to just be icon?

chrisseto · 2017-02-15T16:40:26Z

+* `key` -- Unique key that can be used to get the corresponding Transformer subclass
+* `version` --
+
+### SourceConfig(?)


Maybe IngestConfig?

Ooh, good call.

chrisseto · 2017-02-15T17:12:41Z

+
+#### Columns
+* `key` -- Unique key that can be used to get the corresponding Harvester subclass
+* `version` --


Does it make sense to be keeping version here?
It'll get copied to the HarvestLog or which ever relevant log.

Yeah, it makes more sense to keep version only on the class, so they can't get out of sync. It felt weird having a one-column table, but I guess there's nothing wrong with that.

We can throw in some time stamps if that would make you feel better?

Actually date_created wouldn't be a bad idea at all.

Yay, no sad tables for one 💔

chrisseto · 2017-02-15T18:46:58Z

+* `source_id` -- PK of the source
+* `base_url` -- URL of the API/endpoint where the metadata is available
+* `earliest_date` -- Earliest date with available data (nullable)
+* `rate_limit` -- Rate limit for network requests. Defaults to `None` (Unlimited)


Type and format?

How would I express 5 reqs / sec vs 1 reqs / 5 sec

I split it into rate_limit_allowance and rate_limit_period.

chrisseto · 2017-02-15T18:47:58Z

+  * Get or create `TransformLogs(raw_id, ingest_config_id, transformer_version)`
  * if the log already exists and superfluous is not set, exit
-  * Start the transform task(raw_id, version) unless `transform` is `False`
+  * Start the `TransformTask(raw_id, ingest_config_id, version)` unless `transform` is `False`


Should version be passed here? I've been back and forth on it.

If it's in TransformLog, I feel like it shouldn't be in the task.

chrisseto · 2017-02-15T18:49:14Z




+## Pipeline configuration


Could you throw in HarvestLog here?
Or I can if you don't want to 👍

chrisseto · 2017-02-15T18:51:06Z

+
+#### Columns
+* `name` -- Short, unique string
+* `long_title` -- full, human-friendly name


As things stand long_title should be unique as well.

Side note: we should come up with a slightly better format for expressing sql rows.
Each should convey, if it's unique, indexed, unique with another value, datatype, nullable, default, etc.

Should we just make a table template that we copy for everything?

I added a template and changed the rest to follow it, comments/criticisms welcome.

chrisseto · 2017-02-15T18:55:51Z


+## Pipeline configuration

+### Source


Spit balling some ideas, for the admin we should throw in some aggregations.

Last harvest SELECT * FROM harvest_log WHERE source_id = me SORT BY started LIMIT 1
Total harvests SELECT COUNT(*) FROM harvest_log WHERE source_id = me
Success % ...
Etc.

chrisseto · 2017-02-15T22:14:55Z

+{Description}

+#### Columns
+* `{column_name}` -- {description} ({datatype}, [unique,] [indexed,] [nullable,] [default={value},] [choices={choices],])


Column Type Indexed Nullable FK Default Description

suid_id int X X Description as the end because it may be very very long

suid_id int Yes No Yes No Description as the end because it may be very very long

suid_id int ✅ ❌ ✅ ❌ Description as the end because it may be very very long

suid_id int ✓ ⨉ ✓ ⨉ Description as the end because it may be very very long

?

Oh right, tables are a thing! That's way better. I like blank for false and an icon for true (maybe ✓), so the eye is drawn to the true values.

👍 I'm good with any of them. Well the red X is a bit too much...

chrisseto · 2017-02-16T15:42:02Z

+| Column | Type | Indexed | Nullable | FK | Default | Description |
+|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
+| `source_doc_id` | text |  |  |  |  | Identifier given to the document by the source |
+| `ingest_config_id` | int |  |  | ✓ |  | IngestConfig used to ingest the document |


Could source_doc_id just be identifier, having non-FK's ending _id feels a bit hairy?

I think it makes sense to link to source here. If we extract the same identifier they will be the same record(s).
And many times when the harvester changes the identifier changes as well.

Thoughts?

👍 identifier

I think linking suid to source makes sense if we only have one ingest config for that source enabled at a time. If we want to support multiple per source, though, we could harvest the same document in different formats (maybe from two complementary APIs). If we don't keep them separate, there could be a back-and-forth, deleting each other's states where the formats don't overlap.

The transformers will keep everything in the correct format and a source should never have 2 documents with the same identifier out of sync.

Stop me if I'm worrying too much about an absurdly unlikely case (or am just confused), but what if a source has two APIs (or their OAI endpoint supports two metadata prefixes, or whatever) which largely overlap, but each has information the other lacks:

A has good contributor affiliations

B has good funding information

If we harvest from both, the Consolidator will look at freshly transformed data from A and think "oh, that funding information must have been deleted", then look at data from B and think, "oh, these affiliations must have been removed", and we'll never have both at the same time.

That's a very good point but I would want to fix that at the consolidator level. This revolves more around the issue of how to tell partial data vs subtractive data.

I think it really comes down to detecting omission of fields vs empty fields.

This may be a bit of a hack. Let's say we have the following for an agent:

{ "@id": "_:122", "@type": "agent", "name": "Bill", "affiliations": [] }

vs

{ "@id": "_:122", "@type": "agent", "name": "Bill", }

The first is explicitly stating there is a lack of affiliations while the second just doesn't have affiliation data.

That makes sense... just need to make sure the Regulator is in on it and preserves empty m2ms.

It could still be an issue if (stretching my example even deeper into unlikelihood) A has simple affiliation data (to a university, say), while B has affiliations to university, department, school, and lab. A's short list would overwrite B's long one.

Wouldn't we just disable A at that point?

Oops, I flipped them.

A has thorough affiliation data (university, school, department, lab...)

B has good funding info and some basic affiliation data

How could we combine them without losing info?

An IngestConfig sort of defines a particular view for a document, a lens which refracts arbitrary information into focus in a way SHARE can understand. SUIDs seem more useful if they denote a particular document through a particular lens, and everyone before the Deduplicator need only worry about looking through one lens at a time.

Yep, you got me. 👍 IngestConfig makes the most sense.

chrisseto · 2017-02-16T15:44:48Z

+| Column | Type | Indexed | Nullable | FK | Default | Description |
+|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
+| `key` | text | unique |  |  | | Key that can be used to get the corresponding Harvester subclass |
+| `date_created` | datetime |  |  |  | now (on insert) | |


NOW() is a postgres function.
You can remove the "on insert" (I think) it's implied as the default?

If we implement this with django's auto_now_add, NOW() isn't used and there's no actual default in postgres. Does that matter here?

Or in another case, Source.icon is technically a text field, with a value which corresponds to where the image is stored, because that's how django's ImageField works. That seems like irrelevant/extraneous detail for a white paper, and all that actually matters is that each Source has an (optional) image named icon.

How much should this file be considered exact specs for SQL tables, vs. a more abstract view of the objects and their attributes, with implementation details left up to the reader? I feel like we need to find a balance where these papers accurately describe a system, and our code is but one possible implementation of that system. But I'm not quite sure where that balance lies.

Abstract view > tedium.

chrisseto · 2017-02-16T15:48:05Z

+| `source_id` | int |  |  | ✓ | | Source to harvest from |
+| `base_url` | text |  |  |  | | URL of the API or endpoint where the metadata is available |
+| `earliest_date` | date |  | ✓ |  | | Earliest date with available data |
+| `rate_limit_allowance` | int |  |  |  | 5 | Number of requests allowed every `rate_limit_period` seconds |


Could also use a float here. Probably less understandable... Not sure how much I like it, just punting as an option.
1 req / 1 sec = 1.0
1 req / 5 sec = 0.2

Yeah, I thought about that, but two columns seemed like less work than reimplementing our rate limiting. Putting it in a float would lose the difference between 1 req/5 sec and 100 req/500 sec (the latter allowing larger bursts), though I don't know how much that matters.

I did not think of that. Let's keep it this way in that case 👍 nice catch

chrisseto · 2017-02-16T15:50:05Z

+| `suid_id` | int |  |  | ✓ | | SUID for this datum |
+| `data` | text |  |  |  | | The raw data itself (typically JSON or XML string) |
+| `sha256` | text | unique |  |  | | SHA-256 hash of `data` |
+| `date_seen` | datetime |  |  |  | now (every update) | The last time this exact data was harvested |


If weFK to HarvestLog do we even need the date fields here?

Probably not. If it's only pointing to the HarvestLog that created it, we'd lose date_seen, but I don't know if anyone cares about that.

If we make it a m2m (which I think we should?) there's no information lost.

Good call 👍

chrisseto · 2017-02-16T18:33:57Z

+| Column | Type | Indexed | Nullable | FK | Default | Description |
+|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
+| `key` | text | unique |  |  | | Key that can be used to get the corresponding Transformer subclass |
+| `date_created` | datetime |  |  |  | now (on insert) | |


Let's go ahead and throw date_modified on here as well, might be handy for debugging syncing of harvesters/transformers and seems like a good practice?

chrisseto · 2017-02-16T18:42:45Z

+| Column | Type | Indexed | Nullable | FK | Default | Description |
+|:-------|:----:|:-------:|:--------:|:--:|:-------:|:------------|
+| `ingest_config_id` | int |  |  | ✓ | | IngestConfig for this harvester run |
+| `harvester_version` | text |  |  |  | | Current version of the harvester in format 'x.x.x' |


Version might need to have a different format to it. I'd like it to be sortable. It fails to order 0.0.10 and 0.0.9 correctly. We could just pad the version up to 3 digits or something and not worry about it?

Yeah, padding each segment seems like the best way to go.

* Describe models for configuring sources * SourceConfig => IngestConfig * Consistent table definitions * Define tables using tables. * Fix some table stuff. * Updates

chrisseto reviewed Feb 15, 2017

View reviewed changes

chrisseto reviewed Feb 16, 2017

View reviewed changes

aaxelb added 5 commits February 16, 2017 13:33

Describe models for configuring sources

8f5dd5a

SourceConfig => IngestConfig

62916df

Consistent table definitions

db3ad38

Define tables using tables.

93547a9

Fix some table stuff.

9e2b8c3

chrisseto reviewed Feb 16, 2017

View reviewed changes

Updates

7ca98c4

aaxelb force-pushed the feature/project-tamandua branch from 3c3fa78 to 7ca98c4 Compare February 16, 2017 18:50

aaxelb merged commit 504085f into CenterForOpenScience:feature/project-tamandua Feb 16, 2017

aaxelb deleted the feature/project-tamandua branch February 16, 2017 18:50

Column	Type	Indexed	Nullable	FK	Default	Description
suid_id	int	X		X		Description as the end because it may be very very long
suid_id	int	Yes	No	Yes	No	Description as the end because it may be very very long
suid_id	int	✅	❌	✅	❌	Description as the end because it may be very very long
suid_id	int	✓	⨉	✓	⨉	Description as the end because it may be very very long

Conversation

aaxelb commented Feb 15, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aaxelb commented Feb 15, 2017 •

edited

Loading

chrisseto Feb 16, 2017 •

edited

Loading