-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SHARE-686][Feature] IngestJob and Regulator #684
[SHARE-686][Feature] IngestJob and Regulator #684
Conversation
0ba0536
to
c0fbbed
Compare
9e1143c
to
6a3ba1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stopping here, I'll need to be a bit more awake.
setup.py
Outdated
@@ -93,6 +93,13 @@ | |||
'org.neurovault = share.harvesters.org_neurovault:NeuroVaultHarvester', | |||
'org.plos = share.harvesters.org_plos:PLOSHarvester', | |||
'org.swbiodiversity = share.harvesters.org_swbiodiversity:SWHarvester', | |||
] | |||
], | |||
'share.regulate.node_steps': [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
share.regulate.steps.node
share.regulate.steps.graph
share.regulate.steps.validation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's better.
share/bin/harvest.py
Outdated
@@ -107,18 +107,18 @@ def harvest(args, argv): | |||
print(datum) | |||
|
|||
|
|||
@command('Create harvestlogs for the specified SourceConfig') | |||
@command('Create harvestjobs for the specified SourceConfig') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HarvestJobs
share/bin/ingest.py
Outdated
@@ -33,3 +34,49 @@ def transform(args, argv): | |||
print('Parsed raw data "{}" into'.format(name)) | |||
pprint(transformer.transform(data)) | |||
print('\n') | |||
|
|||
|
|||
@command('Create IngestJobs for the specified raw data/um') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/raw data\/um/RawDatum(s)/
?
share/bin/ingest.py
Outdated
|
||
Options: | ||
-i, --ids Provide RawDatum IDs to ingest specifically | ||
-s, --superfluous Reingest datums that already have an IngestJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RawDatums or data?
share/harvest/base.py
Outdated
@@ -115,7 +114,7 @@ def fetch_date(self, date: datetime.date, **kwargs): | |||
""" | |||
return self.fetch_date_range(date - datetime.timedelta(days=1), date, **kwargs) | |||
|
|||
def fetch_date_range(self, start, end, limit=None, **kwargs): | |||
def fetch_date_range(self, start, end, limit=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kwargs are there to allow passing arbitrary keyword arguments to _do_fetch
without having to override any other functions.
If additional kwargs are passed in, they should eventually get rejected by _do_fetch
.
Is there any reason to remove them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this is only used for set_spec
, which is already in harvester_kwargs
, so the OAI harvester needs extra logic to manually handle the override (set_spec = set_spec or self.set_spec
). I think the BaseHarvester API would be better and easier to work with if harvester-specific kwargs enter in only one place (__init__
, stored in self.kwargs
by default), not two.
When using a harvester, it's just a change from
harvester = config.get_harvester()
harvester.fetch(set_spec='my-crazy-setspec')
to
harvester = config.get_harvester(set_spec='my-crazy-setspec')
harvester.fetch()
which seems like no sacrifice. It'd also be nice to guarantee that any kwargs you can provide manually can also be specified in harvester_kwargs
, and vice versa.
share/models/ingest.py
Outdated
|
||
Passed kwargs override or add to harvester_kwargs. | ||
""" | ||
return self.harvester.get_class()(self, **{**(self.harvester_kwargs or {}), **kwargs}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how I feel about passing options here rather to the fetch/harvest methods. I think it make sense to have both.
set_spec
feels like it would be more appropriate as a per call parameter rather than a per instance parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found a good compromise -- preserve passing kwargs to fetch/harvest methods, but also pass the config's harvester_kwargs into _do_fetch
the same way (and let passed kwargs override those in the config). It unifies the kwargs, lets all of them be per-call, and gets rid of self.kwargs
, which made me feel dirty and full of regret.
share/models/ingest.py
Outdated
'This may be, but is not limitted to, a deletion, modification, publication, or creation datestamp. ' | ||
'Ideally, this datetime should be appropriate for determining the chronological order it\'s data will be applied.' | ||
'This may be, but is not limited to, a deletion, modification, publication, or creation datestamp. ' | ||
'Ideally, this datetime should be appropriate for determining the chronological order its data will be applied. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trailing space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea how that happened.
@@ -372,10 +380,10 @@ class RawDatum(models.Model): | |||
# The sha256 of the datum | |||
sha256 = models.TextField(validators=[validators.MaxLengthValidator(64)]) | |||
|
|||
datestamp = models.DateTimeField(null=True, help_text=( | |||
datestamp = models.DateTimeField(default=pendulum.now, help_text=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be auto_now=True
or auto_add_now=True
? It will take into account the system's timezone, or is that being ignored on purpose? pendulum.now
defaults to UTC, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto_now*
will override values set before saving, which is bad. The default here is just a fallback if it somehow got this far without a sensible datestamp.
pendulum.now
defaults to UTC, yes, but does that matter? 08:00-05 is the same as 13:00+00 is the same as 17:00+04
share/models/jobs.py
Outdated
@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum): | |||
date_created = models.DateTimeField(auto_now_add=True, editable=False) | |||
date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True) | |||
|
|||
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE) | |||
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( no way to make this jobtype_jobs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can figure out a way, please let me know. harvestjobs
makes me sad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(a custom foreign key field could probably do it, but that feels gross too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to subclasses for naming consistency?
from share.regulate.errors import RegulatorError | ||
|
||
|
||
class BaseStep: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+9001
09a7bed
to
4b00a6e
Compare
869d946
to
a1f33d3
Compare
a1f33d3
to
29dcf93
Compare
api/normalizeddata/views.py
Outdated
@@ -61,6 +61,7 @@ def create(self, request, *args, **kwargs): | |||
serializer = self.get_serializer_class()(data=request.data, context={'request': request}) | |||
if serializer.is_valid(raise_exception=True): | |||
nm_instance = serializer.save() | |||
# TODO create an IngestJob, respond with a link to a job detail endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a ticket for this
|
||
Options: | ||
-i, --ids Provide RawDatum IDs to ingest specifically | ||
-s, --superfluous Reingest RawDatums that already have an IngestJob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a ticket to get rid of superfluous. It's superfluous (with force
)
share/harvest/base.py
Outdated
|
||
if not isinstance(data_gen, types.GeneratorType) and len(data_gen) != 0: | ||
raise TypeError('{!r}._do_fetch must return a GeneratorType for optimal performance and memory usage'.format(self)) | ||
|
||
for i, blob in enumerate(data_gen): | ||
result = FetchResult(blob[0], self.serializer.serialize(blob[1]), *blob[2:]) | ||
|
||
if result.datestamp and (result.datestamp.date() < start.date() or result.datestamp.date() > end.date()): | ||
if result.datestamp is None: | ||
result.datestamp = end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes start
share/harvesters/oai.py
Outdated
@@ -114,7 +100,7 @@ def fetch_page(self, url: furl, token: str=None) -> (list, str): | |||
def fetch_by_id(self, provider_id): | |||
url = furl(self.config.base_url) | |||
url.args['verb'] = 'GetRecord' | |||
url.args['metadataPrefix'] = self.metadata_prefix | |||
url.args['metadataPrefix'] = self.config.harvester_kwargs['metadata_prefix'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at this, should it use a passed-in kwarg?
share/models/jobs.py
Outdated
@@ -198,12 +205,12 @@ class SkipReasons(enum.Enum): | |||
date_created = models.DateTimeField(auto_now_add=True, editable=False) | |||
date_modified = models.DateTimeField(auto_now=True, editable=False, db_index=True) | |||
|
|||
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='harvest_logs', on_delete=models.CASCADE) | |||
source_config = models.ForeignKey('SourceConfig', editable=False, related_name='%(class)ss', on_delete=models.CASCADE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to subclasses for naming consistency?
…rForOpenScience#723) * Add lineage to works in the search index * Add some explanatory comments
2fc39fc
to
c45de42
Compare
1f722dc
to
581e442
Compare
* IngestJob and Regulator * Fix tests * Fix some api errors * Avoid a stack overflow. * Explain myself. * Every raw datum gets an ingest job! * Job admin * Bulk IngestJob creation * Update raw data janitor * Move migration to the end. * Fix weird migration thing. * sharectl ingest * Better things * Regulator tests * Plac8 flake8 * Fix migration * Fix failing tests * Tasks package * Tests and stuff * Responding to review * Single source of harvester kwargs * Single source of transformer kwargs * Add merge migration * Fix failing tests * Respond to review * [SHARE-1000][Feature] Add lineage to works in the search index (#723) * Add lineage to works in the search index * Add some explanatory comments * Set `source_config_version` without assumptions
In broad strokes:
HarvestLog
toHarvestJob
(lots of this PR is just replacinglog
withjob
)IngestJob
, which keeps track of the ingestion status of a given raw datumharvest
task logic reusable by refactoring intoJobConsumer
, with childrenHarvestJobConsumer
,IngestJobConsumer
transform
task withingest
, which grabs an available (or given) IngestJob, then transforms, regulates, and (in a future PR) saves the result as a single-source graphdisambiguate
task for itJSONLDValidator
sharectl ingest