[SHARE-686][Feature] IngestJob and Regulator #684

aaxelb · 2017-06-16T21:07:17Z

In broad strokes:

Jobs
- Rename HarvestLog to HarvestJob (lots of this PR is just replacing log with job)
- Add IngestJob, which keeps track of the ingestion status of a given raw datum
- Make harvest task logic reusable by refactoring into JobConsumer, with children HarvestJobConsumer, IngestJobConsumer
- Replace transform task with ingest, which grabs an available (or given) IngestJob, then transforms, regulates, and (in a future PR) saves the result as a single-source graph
  - For now, preserves legacy behavior by saving the regulated graph as a NormalizedData and spawning a disambiguate task for it

Regulator
- Add a basic framework for regulation, allowing new regulation steps that operate on a MutableGraph from a transformer
- Right now, there is just one validation step, which runs JSONLDValidator

sharectl ingest
- Create IngestJobs for specific raw data by ID, or for all data from one or more source configs.
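A minimal sketch of the regulation idea described above, not the code in this PR: a regulator runs an ordered list of steps over a graph, and a validation step raises if the graph is unacceptable. `BaseStep` and `RegulatorError` are names that appear in the diff; `RequireTypeStep`, the `run`/`regulate` method names, and the list-of-dicts graph standing in for `MutableGraph` are assumptions made for illustration.

```python
class RegulatorError(Exception):
    """Raised when a step rejects the graph."""


class BaseStep:
    """A regulation step; subclasses inspect or modify the graph in place."""

    def run(self, graph):
        raise NotImplementedError


class RequireTypeStep(BaseStep):
    """Hypothetical validation step: every node must declare an '@type'."""

    def run(self, graph):
        for node in graph:
            if '@type' not in node:
                raise RegulatorError('node {!r} has no @type'.format(node.get('@id')))


class Regulator:
    """Runs each step in order and returns the (possibly modified) graph."""

    def __init__(self, steps):
        self.steps = list(steps)

    def regulate(self, graph):
        for step in self.steps:
            step.run(graph)
        return graph
```

New steps would be added by subclassing `BaseStep` and appending an instance to the list passed to `Regulator`.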

chrisseto

Stopping here, I'll need to be a bit more awake.

chrisseto · 2017-07-20T05:33:31Z

setup.py

@@ -93,6 +93,13 @@
            'org.neurovault = share.harvesters.org_neurovault:NeuroVaultHarvester',
            'org.plos = share.harvesters.org_plos:PLOSHarvester',
            'org.swbiodiversity = share.harvesters.org_swbiodiversity:SWHarvester',
-        ]
+        ],
+        'share.regulate.node_steps': [


share.regulate.steps.node
share.regulate.steps.graph
share.regulate.steps.validation
?

Yeah, that's better.

chrisseto · 2017-07-20T05:35:30Z

share/bin/harvest.py

@@ -107,18 +107,18 @@ def harvest(args, argv):
        print(datum)


-@command('Create harvestlogs for the specified SourceConfig')
+@command('Create harvestjobs for the specified SourceConfig')


HarvestJobs

chrisseto · 2017-07-20T05:36:20Z

share/bin/ingest.py

@@ -33,3 +34,49 @@ def transform(args, argv):
            print('Parsed raw data "{}" into'.format(name))
            pprint(transformer.transform(data))
            print('\n')
+
+
+@command('Create IngestJobs for the specified raw data/um')


s/raw data\/um/RawDatum(s)/?

chrisseto · 2017-07-20T05:37:29Z

share/bin/ingest.py

+
+    Options:
+        -i, --ids           Provide RawDatum IDs to ingest specifically
+        -s, --superfluous   Reingest datums that already have an IngestJob


RawDatums or data?

chrisseto · 2017-07-20T05:39:42Z

share/harvest/base.py

@@ -115,7 +114,7 @@ def fetch_date(self, date: datetime.date, **kwargs):
        """
        return self.fetch_date_range(date - datetime.timedelta(days=1), date, **kwargs)

-    def fetch_date_range(self, start, end, limit=None, **kwargs):
+    def fetch_date_range(self, start, end, limit=None):


The kwargs are there to allow passing arbitrary keyword arguments to _do_fetch without having to override any other functions.
If additional kwargs are passed in, they should eventually get rejected by _do_fetch.
Is there any reason to remove them?
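To illustrate the pass-through being described, here is a rough sketch with simplified signatures (the real `BaseHarvester` takes more arguments, and this `OAIHarvester` is a stand-in, not the class in `share/harvesters/oai.py`). Unknown keyword arguments travel untouched through `fetch_date_range` and are rejected only when the concrete `_do_fetch` does not accept them.

```python
class BaseHarvester:
    def fetch_date_range(self, start, end, limit=None, **kwargs):
        # Forward anything not recognised here straight to _do_fetch.
        return list(self._do_fetch(start, end, **kwargs))

    def _do_fetch(self, start, end, **kwargs):
        raise NotImplementedError


class OAIHarvester(BaseHarvester):
    # Stand-in subclass: set_spec is the only extra keyword it accepts.
    def _do_fetch(self, start, end, set_spec=None):
        yield (start, end, set_spec)


harvester = OAIHarvester()
harvester.fetch_date_range(1, 2, set_spec='my-crazy-setspec')  # accepted

try:
    harvester.fetch_date_range(1, 2, bogus=True)
except TypeError as exc:
    # Python rejects the unexpected keyword when _do_fetch is called.
    rejected = str(exc)
```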

Right now this is only used for set_spec, which is already in harvester_kwargs, so the OAI harvester needs extra logic to manually handle the override (set_spec = set_spec or self.set_spec). I think the BaseHarvester API would be better and easier to work with if harvester-specific kwargs enter in only one place (__init__, stored in self.kwargs by default), not two.

When using a harvester, it's just a change from

    harvester = config.get_harvester()
    harvester.fetch(set_spec='my-crazy-setspec')

to

    harvester = config.get_harvester(set_spec='my-crazy-setspec')
    harvester.fetch()

which seems like no sacrifice. It'd also be nice to guarantee that any kwargs you can provide manually can also be specified in harvester_kwargs, and vice versa.

chrisseto · 2017-07-20T05:44:58Z

share/models/ingest.py

+
+        Passed kwargs override or add to harvester_kwargs.
+        """
+        return self.harvester.get_class()(self, **{**(self.harvester_kwargs or {}), **kwargs})


I'm not sure how I feel about passing options here rather than to the fetch/harvest methods. I think it makes sense to have both.
set_spec feels like it would be more appropriate as a per call parameter rather than a per instance parameter.

I think I found a good compromise -- preserve passing kwargs to fetch/harvest methods, but also pass the config's harvester_kwargs into _do_fetch the same way (and let passed kwargs override those in the config). It unifies the kwargs, lets all of them be per-call, and gets rid of self.kwargs, which made me feel dirty and full of regret.
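The compromise could look roughly like the sketch below. It assumes a simplified `Harvester` and `SourceConfig` (both stand-ins for the real classes) and reuses the dict-merge idiom from the diff above: the config's `harvester_kwargs` act as defaults and per-call kwargs win on conflict.

```python
class SourceConfig:
    # Stand-in for the real model; only harvester_kwargs matters here.
    def __init__(self, harvester_kwargs=None):
        self.harvester_kwargs = harvester_kwargs


class Harvester:
    def __init__(self, config):
        self.config = config

    def fetch(self, **kwargs):
        # Config values are the defaults; per-call values override them.
        merged = {**(self.config.harvester_kwargs or {}), **kwargs}
        return self._do_fetch(**merged)

    def _do_fetch(self, set_spec=None, metadata_prefix=None):
        # A real harvester would hit the network; this just echoes its input.
        return {'set_spec': set_spec, 'metadata_prefix': metadata_prefix}


config = SourceConfig({'set_spec': 'default-set', 'metadata_prefix': 'oai_dc'})
harvester = Harvester(config)
from_config = harvester.fetch()
overridden = harvester.fetch(set_spec='my-crazy-setspec')
```

With this shape there is no per-instance `self.kwargs` to keep in sync, and anything accepted per call can also be set in the config.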

chrisseto · 2017-07-20T05:45:51Z

share/models/ingest.py

-        'This may be, but is not limitted to, a deletion, modification, publication, or creation datestamp. '
-        'Ideally, this datetime should be appropriate for determining the chronological order it\'s data will be applied.'
+        'This may be, but is not limited to, a deletion, modification, publication, or creation datestamp. '
+        'Ideally, this datetime should be appropriate for determining the chronological order its data will be applied. '


Trailing space?

I have no idea how that happened.

chrisseto · 2017-07-20T05:47:38Z

share/models/ingest.py

@@ -372,10 +380,10 @@ class RawDatum(models.Model):
    # The sha256 of the datum
    sha256 = models.TextField(validators=[validators.MaxLengthValidator(64)])

-    datestamp = models.DateTimeField(null=True, help_text=(
+    datestamp = models.DateTimeField(default=pendulum.now, help_text=(


Should this just be auto_now=True or auto_now_add=True? Will it take the system's timezone into account, or is that being ignored on purpose? pendulum.now defaults to UTC, correct?

auto_now* will override values set before saving, which is bad. The default here is just a fallback if it somehow got this far without a sensible datestamp.

pendulum.now defaults to UTC, yes, but does that matter? 08:00-05 is the same as 13:00+00 is the same as 17:00+04
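The point about offsets can be checked with the standard library alone. This sketch uses `datetime` rather than pendulum, and the date is arbitrary; timezone-aware datetimes compare by the instant they denote, not by their wall-clock fields.

```python
from datetime import datetime, timedelta, timezone

# The same instant written with three different UTC offsets.
a = datetime(2017, 7, 20, 8, 0, tzinfo=timezone(timedelta(hours=-5)))
b = datetime(2017, 7, 20, 13, 0, tzinfo=timezone.utc)
c = datetime(2017, 7, 20, 17, 0, tzinfo=timezone(timedelta(hours=4)))

# Aware datetimes are compared after normalising to UTC.
same_instant = (a == b == c)
```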

chrisseto · 2017-07-20T05:48:36Z

share/models/jobs.py

@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum):
    date_created = models.DateTimeField(auto_now_add=True, editable=False)
    date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True)

-    source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE)
+    source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE)


:( no way to make this jobtype_jobs?

If you can figure out a way, please let me know. harvestjobs makes me sad.

(a custom foreign key field could probably do it, but that feels gross too)

Move this to subclasses for naming consistency?
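For context on why `jobtype_jobs` is out of reach with the placeholder alone: Django fills `%(class)s` in `related_name` with the lowercased name of the concrete model class. The sketch below imitates that substitution with plain string formatting; `expand_related_name` is a hypothetical helper, not a Django API, and the two classes are empty stand-ins for the models.

```python
def expand_related_name(template, model_cls):
    # Imitates Django's substitution of the lowercased class name.
    return template % {'class': model_cls.__name__.lower()}


class HarvestJob:
    pass


class IngestJob:
    pass


names = [expand_related_name('%(class)ss', cls) for cls in (HarvestJob, IngestJob)]
# A template such as '%(class)s_jobs' would give 'harvestjob_jobs',
# not 'harvest_jobs', because the placeholder is the whole class name.
awkward = expand_related_name('%(class)s_jobs', HarvestJob)
```

Declaring the foreign key on each subclass, as suggested, would allow an explicit `related_name` per model instead.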

chrisseto · 2017-07-20T05:52:58Z

share/regulate/steps/__init__.py

+from share.regulate.errors import RegulatorError
+
+
+class BaseStep:


aaxelb · 2017-11-03T20:02:05Z

api/normalizeddata/views.py

@@ -61,6 +61,7 @@ def create(self, request, *args, **kwargs):
        serializer = self.get_serializer_class()(data=request.data, context={'request': request})
        if serializer.is_valid(raise_exception=True):
            nm_instance = serializer.save()
+            # TODO create an IngestJob, respond with a link to a job detail endpoint


Add a ticket for this

aaxelb · 2017-11-03T20:14:38Z

share/bin/ingest.py

+
+    Options:
+        -i, --ids           Provide RawDatum IDs to ingest specifically
+        -s, --superfluous   Reingest RawDatums that already have an IngestJob


Add a ticket to get rid of superfluous. It's superfluous (with force)

aaxelb · 2017-11-10T20:20:08Z

share/harvest/base.py


        if not isinstance(data_gen, types.GeneratorType) and len(data_gen) != 0:
            raise TypeError('{!r}._do_fetch must return a GeneratorType for optimal performance and memory usage'.format(self))

        for i, blob in enumerate(data_gen):
            result = FetchResult(blob[0], self.serializer.serialize(blob[1]), *blob[2:])

-            if result.datestamp and (result.datestamp.date() < start.date() or result.datestamp.date() > end.date()):
+            if result.datestamp is None:
+                result.datestamp = end


aaxelb · 2017-11-10T20:30:54Z

share/harvesters/oai.py

@@ -114,7 +100,7 @@ def fetch_page(self, url: furl, token: str=None) -> (list, str):
    def fetch_by_id(self, provider_id):
        url = furl(self.config.base_url)
        url.args['verb'] = 'GetRecord'
-        url.args['metadataPrefix'] = self.metadata_prefix
+        url.args['metadataPrefix'] = self.config.harvester_kwargs['metadata_prefix']


Look at this, should it use a passed-in kwarg?

aaxelb · 2017-11-10T20:48:48Z

share/models/jobs.py

@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum):
    date_created = models.DateTimeField(auto_now_add=True, editable=False)
    date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True)

-    source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE)
+    source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE)


Move this to subclasses for naming consistency?

coveralls · 2017-11-13T21:15:52Z

Coverage increased (+0.1%) to 78.737% when pulling d6e82da on aaxelb:feature/ingregulate into ec0eee8 on CenterForOpenScience:develop.

coveralls · 2017-11-22T19:11:01Z

Coverage increased (+0.1%) to 78.782% when pulling 581e442 on aaxelb:feature/ingregulate into b6bf925 on CenterForOpenScience:develop.

* IngestJob and Regulator
* Fix tests
* Fix some api errors
* Avoid a stack overflow.
* Explain myself.
* Every raw datum gets an ingest job!
* Job admin
* Bulk IngestJob creation
* Update raw data janitor
* Move migration to the end.
* Fix weird migration thing.
* sharectl ingest
* Better things
* Regulator tests
* Plac8 flake8
* Fix migration
* Fix failing tests
* Tasks package
* Tests and stuff
* Responding to review
* Single source of harvester kwargs
* Single source of transformer kwargs
* Add merge migration
* Fix failing tests
* Respond to review
* [SHARE-1000][Feature] Add lineage to works in the search index (#723)
* Add lineage to works in the search index
* Add some explanatory comments
* Set `source_config_version` without assumptions

aaxelb force-pushed the feature/ingregulate branch 7 times, most recently from 0ba0536 to c0fbbed Compare June 23, 2017 19:11

aaxelb force-pushed the feature/ingregulate branch 4 times, most recently from 9e1143c to 6a3ba1b Compare July 17, 2017 18:13

aaxelb changed the title ~~[WIP] IngestJob and Regulator~~ [Feature] IngestJob and Regulator Jul 17, 2017

aaxelb changed the title ~~[Feature] IngestJob and Regulator~~ [SHARE-686][Feature] IngestJob and Regulator Jul 17, 2017

chrisseto reviewed Jul 20, 2017

aaxelb force-pushed the feature/ingregulate branch from 09a7bed to 4b00a6e Compare July 20, 2017 16:58

aaxelb force-pushed the feature/ingregulate branch 6 times, most recently from 869d946 to a1f33d3 Compare August 3, 2017 18:16

aaxelb added 9 commits October 4, 2017 11:12

IngestJob and Regulator

0a72023

Fix tests

542cb2c

Fix some api errors

e42d9f4

Avoid a stack overflow.

1c91705

Explain myself.

bb599bf

Every raw datum gets an ingest job!

9263112

Job admin

8bfe532

Bulk IngestJob creation

46b69ce

Update raw data janitor

b48797e

aaxelb added 10 commits October 4, 2017 11:12

Better things

fa42d2d

Regulator tests

88675c0

Plac8 flake8

84eb5e3

Fix migration

601768b

Fix failing tests

810e3ae

Tasks package

52d2a5c

Tests and stuff

1128459

Responding to review

eddfb36

Single source of harvester kwargs

ecb04b4

Single source of transformer kwargs

29dcf93

aaxelb force-pushed the feature/ingregulate branch from a1f33d3 to 29dcf93 Compare October 4, 2017 15:13

aaxelb added 4 commits November 3, 2017 15:47

Merge branch 'develop' into feature/ingregulate

5202c49

Add merge migration

19b684e

Fix failing tests

9529581

Merge remote-tracking branch 'upstream/develop' into feature/ingregulate

6f427d5

aaxelb commented Nov 10, 2017

aaxelb and others added 2 commits November 13, 2017 12:01

Respond to review

7acba06

[SHARE-1000][Feature] Add lineage to works in the search index (Cente…

c45de42

…rForOpenScience#723) * Add lineage to works in the search index * Add some explanatory comments

aaxelb force-pushed the feature/ingregulate branch from 2fc39fc to c45de42 Compare November 13, 2017 17:01

aaxelb added 2 commits November 13, 2017 13:37

Merge remote-tracking branch 'upstream/develop' into feature/ingregulate

f598215

Set source_config_version without assumptions

d6e82da

Merge branch 'develop' into feature/ingregulate

581e442

aaxelb force-pushed the feature/ingregulate branch from 1f722dc to 581e442 Compare November 22, 2017 18:58

aaxelb changed the base branch from develop to future-release/2.14.0 December 14, 2017 21:43

aaxelb merged commit 5625efd into CenterForOpenScience:future-release/2.14.0 Dec 15, 2017

aaxelb deleted the feature/ingregulate branch December 15, 2017 16:52

aaxelb added a commit that referenced this pull request Dec 21, 2017

Revert "[SHARE-686][Feature] IngestJob and Regulator (#684)"

ac158d0

This reverts commit 5625efd.
