TIMX 406 - add provenance data to transformed records #233

ghukill · 2025-01-16T21:30:09Z

Purpose and background context

This PR adds the field timdex_provenance to the TIMDEX data model. As outlined in the original engineering plan, this addition to TIMDEX records has multiple uses:

providing quick, human readable information about the record's origins (e.g. run_date)
embed enough information in the indexed Opensearch record to efficiently find the record in the parquet dataset

As outlined in a recent timdex-dataset-api PR, we need to know a few things to quickly retrieve a record from the dataset: [run_date, run_type, timdex_record_id, run_record_offset] where timdex_record_id is actually kind of optional, but helpful for confirmation. This new field contains this information, which is known briefly during write to the dataset in Transmogrifier.

How can a reviewer manually see the effects of these changes?

Run a couple of transformations to establish a toy dataset at /tmp/dataset-provenance:

pipenv run transform \
 -s libguides \
 -i tests/fixtures/dataset/libguides-2024-06-03-full-extracted-records-to-index.xml \
 -o /tmp/dataset-provenance \
 --run-id run-abc-123

 pipenv run transform \
 -s libguides \
 -i tests/fixtures/dataset/libguides-2025-01-09-full-extracted-records-to-index.xml \
 -o /tmp/dataset-provenance \
 --run-id run-def-4456

Load the dataset:

import json
from timdex_dataset_api import  TIMDEXDataset

td = TIMDEXDataset('/tmp/dataset-provenance')
td.load()

Look at a couple of timdex_provenance sections from run run-abc-123 (3 total records):

for record in td.read_transformed_records_iter(run_id='run-abc-123'):
    print(record['timdex_provenance'])
"""
{'source': 'libguides', 'run_date': '2024-06-03', 'run_id': 'run-abc-123', 'run_record_offset': 0}
{'source': 'libguides', 'run_date': '2024-06-03', 'run_id': 'run-abc-123', 'run_record_offset': 2}
{'source': 'libguides', 'run_date': '2024-06-03', 'run_id': 'run-abc-123', 'run_record_offset': 3}
"""

Note the absence of run_record_offset=1. The method read_transformed_records_iter() only returns records with a transformed record, where we're expecting to see this provenance object, so it didn't show up.

But run_record_offset=1 actually does exist in the dataset, just for a record that was action="delete":

td.read_dataframe(columns=['timdex_record_id','action','run_record_offset'], run_id='run-abc-123')
"""
Out[14]: 
          timdex_record_id  action  run_record_offset
0  libguides:guides-175846   index                  0
1  libguides:guides-175849  delete                  1              <-------------
2  libguides:guides-175853   index                  2
3  libguides:guides-175855   index                  3
"""

This kind of a nicely demonstrates how the offset is calculated relative to the source record encountered during iteration in Transmogrifier. If records are skipped, they "consume" an offset number, and it keeps incrementing. In this way, the offset number is wholly artibrary, but by being present in the dataset and the provenance object, it provides a link.

Observe the way run_record_offset is written and incremented per run:

# get dataframe of all records in dataset
df = td.read_dataframe()

# get counts of column run_record_offset
# note that offsets 0-3 are repeated twice, this is because the first run only had
# 4 records written, and so offsets 0-3 were repeated for BOTH runs
df.run_record_offset.value_counts()
"""
Out[11]: 
run_record_offset
0      2   <-----
2      2   <-----
3      2   <-----
1      2   <-----
665    1
      ..
338    1
339    1
340    1
341    1
999    1
Name: count, Length: 1001, dtype: int64
"""

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-406

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: Transitioning to a parquet dataset architecture for TIMDEX ETL provides additional data related to each transformed record as part of that record's row in the dataset. But this data is only helpful if you tether the record you encounter in Opensearch with a row in the dataset. Certainly related, but not dependent on the parquet dataset change, was the desire for more information about a record in TIMDEX, e.g. when was it transformed and indexed. We might consider this information "provenance" about the TIMDEX record as encountered in Opensearch and/or the TIMDEX API. How this addresses that need: A new "timdex_provenance" field is added to the TIMDEX data model that includes information about the origins of the TIMDEX record. As it pertains to the parquet dataset, this provenance data includes fields like "run_id" and "run_record_offset" which help pinpoint the row in the parquet dataset for this record. With this linkage, it becomes possible to very quickly retrieve the original source record for a transformed record. In addition to support random access reads of the dataset, this provenance data provides some metadata about the TIMDEX record that is immediately informative like "run_date". Side effects of this change: * None, really. TIM will need to be updated to include this new field in the Opensearch mapping, but until then, it's just extra data in the transformed record. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-406

ghukill · 2025-01-16T21:46:40Z

transmogrifier/sources/transformer.py

+                transformed_record.timdex_provenance = timdex.TimdexProvenance(
+                    source=self.run_data["source"],
+                    run_date=self.run_data["run_date"],
+                    run_id=self.run_data["run_id"],
+                    run_record_offset=self.run_record_offset,
+                )


I had considered locating this in the Transformer.transform() method, where the TimdexRecord instance is built, but opted not to for two reasons:

the transform method is still v1 and v2 compatible, whereas this would introduce feature flagging

the better of the two reasons, I think it's more associated with the orchestration of the run, the __next__ iterator, etc., and makes sense in this part of the code vs the transform method which feels primarily focused on the metadata fields

Not quite - the model is updated to allow it (see this related response), but only the v2 feature flagged method _etl_v2_next_iter_method() will actually add it to the record.

ehanson8

Looks good!

jonavellecuerdo

@ghukill Looks good to me! Just one clarification question.

jonavellecuerdo · 2025-01-24T17:36:44Z

transmogrifier/models.py

+    timdex_provenance: TimdexProvenance | None = field(
+        default=None, validator=optional(instance_of(TimdexProvenance))
+    )


Why is this field optional? 🤔 In what cases is this field set to None?

Bad answer: if it were a "v1" record, it wouldn't have it. So this simplifies backwards cmpatibility.

Better answer: it allows for creating the TimdexRecord incrementally. You don't have to initialize a record with it set, which allows for setting that property/field more dynamically as the object moves through it's init and handling. Even better, we somehow validate required fields during serialization. But doesn't feel that critical.

ghukill requested review from ehanson8 and jonavellecuerdo January 16, 2025 21:38

ghukill commented Jan 16, 2025

View reviewed changes

ehanson8 approved these changes Jan 21, 2025

View reviewed changes

jonavellecuerdo approved these changes Jan 24, 2025

View reviewed changes

ghukill merged commit 419a468 into main Jan 24, 2025
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TIMX 406 - add provenance data to transformed records #233

TIMX 406 - add provenance data to transformed records #233

Uh oh!

ghukill commented Jan 16, 2025 •

edited

Loading

Uh oh!

ghukill Jan 16, 2025

Uh oh!

ghukill Jan 24, 2025

Uh oh!

ehanson8 left a comment

Uh oh!

jonavellecuerdo left a comment

Uh oh!

jonavellecuerdo Jan 24, 2025

Uh oh!

ghukill Jan 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

TIMX 406 - add provenance data to transformed records #233

TIMX 406 - add provenance data to transformed records #233

Uh oh!

Conversation

ghukill commented Jan 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

Uh oh!

ghukill Jan 16, 2025

Choose a reason for hiding this comment

Uh oh!

ghukill Jan 24, 2025

Choose a reason for hiding this comment

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo left a comment

Choose a reason for hiding this comment

Uh oh!

jonavellecuerdo Jan 24, 2025

Choose a reason for hiding this comment

Uh oh!

ghukill Jan 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ghukill commented Jan 16, 2025 •

edited

Loading