Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SHARE-686][Feature] IngestJob and Regulator #684

Merged

Conversation

aaxelb
Copy link
Contributor

@aaxelb aaxelb commented Jun 16, 2017

In broad strokes:

  • Jobs
    • Rename HarvestLog to HarvestJob (lots of this PR is just replacing log with job)
    • Add IngestJob, which keeps track of the ingestion status of a given raw datum
    • Make harvest task logic reusable by refactoring into JobConsumer, with children HarvestJobConsumer, IngestJobConsumer
    • Replace transform task with ingest, which grabs an available (or given) IngestJob, then transforms, regulates, and (in a future PR) saves the result as a single-source graph
      • For now, preserves legacy behavior by saving the regulated graph as a NormalizedData and spawning a disambiguate task for it
  • Regulator
    • Add a basic framework for regulation, allowing new regulation steps that operate on a MutableGraph from a transformer
    • Right now, just one validation step which runs JSONLDValidator
  • sharectl ingest
    • Create IngestJobs for specific raw data by ID, or for all data from one or more source configs.

@aaxelb aaxelb force-pushed the feature/ingregulate branch 7 times, most recently from 0ba0536 to c0fbbed Compare June 23, 2017 19:11
@aaxelb aaxelb force-pushed the feature/ingregulate branch 4 times, most recently from 9e1143c to 6a3ba1b Compare July 17, 2017 18:13
@aaxelb aaxelb changed the title [WIP] IngestJob and Regulator [Feature] IngestJob and Regulator Jul 17, 2017
@aaxelb aaxelb changed the title [Feature] IngestJob and Regulator [SHARE-686][Feature] IngestJob and Regulator Jul 17, 2017
Copy link
Member

@chrisseto chrisseto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopping here, I'll need to be a bit more awake.

setup.py Outdated
@@ -93,6 +93,13 @@
'org.neurovault = share.harvesters.org_neurovault:NeuroVaultHarvester',
'org.plos = share.harvesters.org_plos:PLOSHarvester',
'org.swbiodiversity = share.harvesters.org_swbiodiversity:SWHarvester',
]
],
'share.regulate.node_steps': [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

share.regulate.steps.node
share.regulate.steps.graph
share.regulate.steps.validation
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's better.

@@ -107,18 +107,18 @@ def harvest(args, argv):
print(datum)


@command('Create harvestlogs for the specified SourceConfig')
@command('Create harvestjobs for the specified SourceConfig')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HarvestJobs

@@ -33,3 +34,49 @@ def transform(args, argv):
print('Parsed raw data "{}" into'.format(name))
pprint(transformer.transform(data))
print('\n')


@command('Create IngestJobs for the specified raw data/um')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/raw data\/um/RawDatum(s)/?


Options:
-i, --ids Provide RawDatum IDs to ingest specifically
-s, --superfluous Reingest datums that already have an IngestJob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawDatums or data?

@@ -115,7 +114,7 @@ def fetch_date(self, date: datetime.date, **kwargs):
"""
return self.fetch_date_range(date - datetime.timedelta(days=1), date, **kwargs)

def fetch_date_range(self, start, end, limit=None, **kwargs):
def fetch_date_range(self, start, end, limit=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kwargs are there to allow passing arbitrary keyword arguments to _do_fetch without having to override any other functions.
If additional kwargs are passed in, they should eventually get rejected by _do_fetch.
Is there any reason to remove them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now this is only used for set_spec, which is already in harvester_kwargs, so the OAI harvester needs extra logic to manually handle the override (set_spec = set_spec or self.set_spec). I think the BaseHarvester API would be better and easier to work with if harvester-specific kwargs enter in only one place (__init__, stored in self.kwargs by default), not two.

When using a harvester, it's just a change from

harvester = config.get_harvester()
harvester.fetch(set_spec='my-crazy-setspec')

to

harvester = config.get_harvester(set_spec='my-crazy-setspec')
harvester.fetch()

which seems like no sacrifice. It'd also be nice to guarantee that any kwargs you can provide manually can also be specified in harvester_kwargs, and vice versa.


Passed kwargs override or add to harvester_kwargs.
"""
return self.harvester.get_class()(self, **{**(self.harvester_kwargs or {}), **kwargs})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about passing options here rather to the fetch/harvest methods. I think it make sense to have both.
set_spec feels like it would be more appropriate as a per call parameter rather than a per instance parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a good compromise -- preserve passing kwargs to fetch/harvest methods, but also pass the config's harvester_kwargs into _do_fetch the same way (and let passed kwargs override those in the config). It unifies the kwargs, lets all of them be per-call, and gets rid of self.kwargs, which made me feel dirty and full of regret.

'This may be, but is not limitted to, a deletion, modification, publication, or creation datestamp. '
'Ideally, this datetime should be appropriate for determining the chronological order it\'s data will be applied.'
'This may be, but is not limited to, a deletion, modification, publication, or creation datestamp. '
'Ideally, this datetime should be appropriate for determining the chronological order its data will be applied. '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how that happened.

@@ -372,10 +380,10 @@ class RawDatum(models.Model):
# The sha256 of the datum
sha256 = models.TextField(validators=[validators.MaxLengthValidator(64)])

datestamp = models.DateTimeField(null=True, help_text=(
datestamp = models.DateTimeField(default=pendulum.now, help_text=(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be auto_now=True or auto_add_now=True? It will take into account the system's timezone, or is that being ignored on purpose? pendulum.now defaults to UTC, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto_now* will override values set before saving, which is bad. The default here is just a fallback if it somehow got this far without a sensible datestamp.

pendulum.now defaults to UTC, yes, but does that matter? 08:00-05 is the same as 13:00+00 is the same as 17:00+04

@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum):
date_created = models.DateTimeField(auto_now_add=True, editable=False)
date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True)

source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE)
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( no way to make this jobtype_jobs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can figure out a way, please let me know. harvestjobs makes me sad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(a custom foreign key field could probably do it, but that feels gross too)

Copy link
Contributor Author

@aaxelb aaxelb Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to subclasses for naming consistency?

from share.regulate.errors import RegulatorError


class BaseStep:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+9001

@aaxelb aaxelb force-pushed the feature/ingregulate branch 6 times, most recently from 869d946 to a1f33d3 Compare August 3, 2017 18:16
@@ -61,6 +61,7 @@ def create(self, request, *args, **kwargs):
serializer = self.get_serializer_class()(data=request.data, context={'request': request})
if serializer.is_valid(raise_exception=True):
nm_instance = serializer.save()
# TODO create an IngestJob, respond with a link to a job detail endpoint
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a ticket for this


Options:
-i, --ids Provide RawDatum IDs to ingest specifically
-s, --superfluous Reingest RawDatums that already have an IngestJob
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a ticket to get rid of superfluous. It's superfluous (with force)


if not isinstance(data_gen, types.GeneratorType) and len(data_gen) != 0:
raise TypeError('{!r}._do_fetch must return a GeneratorType for optimal performance and memory usage'.format(self))

for i, blob in enumerate(data_gen):
result = FetchResult(blob[0], self.serializer.serialize(blob[1]), *blob[2:])

if result.datestamp and (result.datestamp.date() < start.date() or result.datestamp.date() > end.date()):
if result.datestamp is None:
result.datestamp = end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes start

@@ -114,7 +100,7 @@ def fetch_page(self, url: furl, token: str=None) -> (list, str):
def fetch_by_id(self, provider_id):
url = furl(self.config.base_url)
url.args['verb'] = 'GetRecord'
url.args['metadataPrefix'] = self.metadata_prefix
url.args['metadataPrefix'] = self.config.harvester_kwargs['metadata_prefix']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at this, should it use a passed-in kwarg?

@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum):
date_created = models.DateTimeField(auto_now_add=True, editable=False)
date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True)

source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE)
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE)
Copy link
Contributor Author

@aaxelb aaxelb Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to subclasses for naming consistency?

aaxelb and others added 2 commits November 13, 2017 12:01
@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.737% when pulling d6e82da on aaxelb:feature/ingregulate into ec0eee8 on CenterForOpenScience:develop.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.1%) to 78.782% when pulling 581e442 on aaxelb:feature/ingregulate into b6bf925 on CenterForOpenScience:develop.

@aaxelb aaxelb changed the base branch from develop to future-release/2.14.0 December 14, 2017 21:43
@aaxelb aaxelb merged commit 5625efd into CenterForOpenScience:future-release/2.14.0 Dec 15, 2017
@aaxelb aaxelb deleted the feature/ingregulate branch December 15, 2017 16:52
aaxelb added a commit that referenced this pull request Dec 21, 2017
aaxelb added a commit that referenced this pull request Dec 21, 2017
* IngestJob and Regulator

* Fix tests

* Fix some api errors

* Avoid a stack overflow.

* Explain myself.

* Every raw datum gets an ingest job!

* Job admin

* Bulk IngestJob creation

* Update raw data janitor

* Move migration to the end.

* Fix weird migration thing.

* sharectl ingest

* Better things

* Regulator tests

* Plac8 flake8

* Fix migration

* Fix failing tests

* Tasks package

* Tests and stuff

* Responding to review

* Single source of harvester kwargs

* Single source of transformer kwargs

* Add merge migration

* Fix failing tests

* Respond to review

* [SHARE-1000][Feature] Add lineage to works in the search index (#723)

* Add lineage to works in the search index

* Add some explanatory comments

* Set `source_config_version` without assumptions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants