TIMX 406 - add provenance data to transformed records #233
Why these changes are being introduced:
Transitioning to a parquet dataset architecture for TIMDEX ETL provides additional data related to each transformed record as part of that record's row in the dataset. But this data is only helpful if you can tether the record you encounter in Opensearch to a row in the dataset. Certainly related, but not dependent on the parquet dataset change, was the desire for more information about a record in TIMDEX, e.g. when it was transformed and indexed. We might consider this information "provenance" about the TIMDEX record as encountered in Opensearch and/or the TIMDEX API.

How this addresses that need:
A new "timdex_provenance" field is added to the TIMDEX data model that includes information about the origins of the TIMDEX record. As it pertains to the parquet dataset, this provenance data includes fields like "run_id" and "run_record_offset" which help pinpoint the row in the parquet dataset for this record. With this linkage, it becomes possible to very quickly retrieve the original source record for a transformed record. In addition to supporting random-access reads of the dataset, this provenance data provides some metadata about the TIMDEX record that is immediately informative, like "run_date".

Side effects of this change:
* None, really. TIM will need to be updated to include this new field in the Opensearch mapping, but until then, it's just extra data in the transformed record.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-406
transformed_record.timdex_provenance = timdex.TimdexProvenance(
    source=self.run_data["source"],
    run_date=self.run_data["run_date"],
    run_id=self.run_data["run_id"],
    run_record_offset=self.run_record_offset,
)
I had considered locating this in the Transformer.transform() method, where the TimdexRecord instance is built, but opted not to for two reasons:
- the transform method is still v1 and v2 compatible, whereas this would introduce feature flagging
- the better of the two reasons, I think: it's more associated with the orchestration of the run, the __next__ iterator, etc., and makes sense in this part of the code vs. the transform method, which feels primarily focused on the metadata fields
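The separation described in this comment can be sketched as follows. This is a minimal illustration, not Transmogrifier's actual code: the class names `SketchTransformer`, `Record`, and `Provenance` are hypothetical, the field types and the zero-based offset are assumptions, and the run values are made up. Only the four provenance field names come from the diff.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Provenance:
    # Field names mirror the diff; the types are assumptions.
    source: str
    run_date: str
    run_id: str
    run_record_offset: int


@dataclass
class Record:
    title: str
    timdex_provenance: Optional[Provenance] = None


class SketchTransformer:
    """Hypothetical stand-in for a transformer that iterates source records."""

    def __init__(self, source_records: list, run_data: dict) -> None:
        self.source_records = iter(source_records)
        self.run_data = run_data
        self.run_record_offset = -1  # assumption: offsets start at 0

    def __iter__(self) -> Iterator[Record]:
        return self

    def transform(self, source_record: dict) -> Record:
        # Concerned only with metadata fields; knows nothing about the run.
        return Record(title=source_record["title"])

    def __next__(self) -> Record:
        source_record = next(self.source_records)  # StopIteration ends the run
        self.run_record_offset += 1
        record = self.transform(source_record)
        # Run-level orchestration: provenance is attached after the transform.
        record.timdex_provenance = Provenance(
            source=self.run_data["source"],
            run_date=self.run_data["run_date"],
            run_id=self.run_data["run_id"],
            run_record_offset=self.run_record_offset,
        )
        return record


# Illustrative run values only.
run_data = {"source": "example-source", "run_date": "2024-12-01", "run_id": "run-abc-123"}
records = list(SketchTransformer([{"title": "A"}, {"title": "B"}], run_data))
offsets = [r.timdex_provenance.run_record_offset for r in records]
```

Keeping `transform()` free of run state means it can stay shared between the v1 and v2 code paths, while the iterator owns everything that depends on the run.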
Not quite - the model is updated to allow it (see this related response), but only the v2 feature flagged method _etl_v2_next_iter_method() will actually add it to the record.
Looks good!
@ghukill Looks good to me! Just one clarification question.
timdex_provenance: TimdexProvenance | None = field(
    default=None, validator=optional(instance_of(TimdexProvenance))
)
Why is this field optional? 🤔 In what cases is this field set to None?
Bad answer: if it were a "v1" record, it wouldn't have it. So this simplifies backwards compatibility.
Better answer: it allows for creating the TimdexRecord incrementally. You don't have to initialize a record with it set, which allows for setting that property/field more dynamically as the object moves through its init and handling. Even better would be to somehow validate required fields during serialization, but that doesn't feel that critical.
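As a rough illustration of both answers, the sketch below creates a record without provenance, sets it later, and adds a serialization-time check of the kind floated above. It is an assumption-laden sketch: it uses stdlib dataclasses rather than attrs so it runs without dependencies, the `Record` and `Provenance` classes are simplified stand-ins, and the `require_provenance` check is hypothetical and not part of this PR.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Provenance:
    run_id: str
    run_record_offset: int


@dataclass
class Record:
    timdex_record_id: str
    # Optional so the record can be created first and completed later.
    timdex_provenance: Optional[Provenance] = None

    def to_dict(self, require_provenance: bool = False) -> dict:
        # Hypothetical serialization-time check; not part of this PR.
        if require_provenance and self.timdex_provenance is None:
            raise ValueError("timdex_provenance must be set before serialization")
        return asdict(self)


record = Record(timdex_record_id="abc:123")
v1_style = record.to_dict()  # provenance stays None, as for a "v1" record

try:
    record.to_dict(require_provenance=True)
    strict_failed = False
except ValueError:
    strict_failed = True

# Set the field later, once run information is known.
record.timdex_provenance = Provenance(run_id="run-abc-123", run_record_offset=0)
v2_style = record.to_dict(require_provenance=True)
```

Deferring the requirement to serialization keeps construction flexible while still catching records that reach output without provenance.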
Purpose and background context
This PR adds the field timdex_provenance to the TIMDEX data model. As outlined in the original engineering plan, this addition to TIMDEX records has multiple uses:
- linking a TIMDEX record back to its row in the parquet dataset
- providing immediately informative metadata about the record (e.g. run_date)
As outlined in a recent timdex-dataset-api PR, we need to know a few things to quickly retrieve a record from the dataset: [run_date, run_type, timdex_record_id, run_record_offset], where timdex_record_id is actually kind of optional, but helpful for confirmation. This new field contains this information, which is known briefly during write to the dataset in Transmogrifier.
How can a reviewer manually see the effects of these changes?
Run a couple of transformations to establish a toy dataset at /tmp/dataset-provenance:
Load the dataset:
Look at a couple of timdex_provenance sections from run run-abc-123 (3 total records):
Note the absence of run_record_offset=1. The method read_transformed_records_iter() only returns records with a transformed record, where we're expecting to see this provenance object, so it didn't show up.
But run_record_offset=1 actually does exist in the dataset, just for a record that was action="delete":
This kind of nicely demonstrates how the offset is calculated relative to the source record encountered during iteration in Transmogrifier. If records are skipped, they "consume" an offset number, and it keeps incrementing. In this way, the offset number is wholly arbitrary, but by being present in the dataset and the provenance object, it provides a link.
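The linkage can be sketched with plain Python. This is only an illustration under stated assumptions: the rows are a toy in-memory stand-in for the parquet dataset, the zero-based offsets and the "index" action value are assumptions, and `find_row` is a hypothetical helper, not part of timdex-dataset-api.

```python
from typing import Optional

# Toy rows standing in for one run in the dataset; values are illustrative.
rows = [
    {"run_id": "run-abc-123", "run_record_offset": 0, "action": "index",
     "transformed_record": {"title": "A"}},
    {"run_id": "run-abc-123", "run_record_offset": 1, "action": "delete",
     "transformed_record": None},
    {"run_id": "run-abc-123", "run_record_offset": 2, "action": "index",
     "transformed_record": {"title": "C"}},
]

# Index rows by the two values carried in the provenance object.
row_index = {(row["run_id"], row["run_record_offset"]): row for row in rows}


def find_row(run_id: str, run_record_offset: int) -> Optional[dict]:
    """Return the dataset row for a provenance pair, or None if absent."""
    return row_index.get((run_id, run_record_offset))


# Only rows with a transformed record carry provenance, so offset 1 is skipped.
offsets_with_provenance = [
    row["run_record_offset"] for row in rows if row["transformed_record"] is not None
]

# The deleted record still consumed offset 1 and is present in the dataset.
deleted_row = find_row("run-abc-123", 1)
```

The offset has no meaning beyond the run, but because the same number is stored in both places, the pair (run_id, run_record_offset) acts as a key from a transformed record back to its row.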
Observe the way run_record_offset is written and incremented per run:
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)