Content-hash-based change tracking for data imports #3199
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #3199      +/-   ##
==========================================
- Coverage   93.31%   93.31%    -0.01%
==========================================
  Files         502      502
  Lines       46178    46211       +33
  Branches     6315     6314        -1
==========================================
+ Hits        43093    43123       +30
  Misses       2001     2001
- Partials     1084     1087        +3
```
Force-pushed f58ea44 to 8135dd0
Force-pushed f5a7e26 to d17bf96
Claude finished @dbernstein's task in 7m 6s

PR Review: Content-hash-based change tracking for data imports
Summary: All bugs flagged in previous review rounds have been correctly addressed. The two inline comments below are minor observations only; no blocking issues remain. The implementation is correct, the test coverage is thorough (including the equal-timestamp/changed-content and stale-force-apply edge cases), and the migration is clearly documented with the expected first-import performance caveat.
```python
# The data we have is strictly older than what is stored, no update needed.
return False
...
return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.
This issue will be addressed: fix by moving the hash update inside the same conditional as the timestamp update, so both fields are always kept in sync.
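For readers following along, here is a minimal sketch of the `should_apply_to` decision point under discussion, reconstructed from the quoted hunk and the surrounding comments; the `None`-handling details and the bootstrap branch's exact placement are assumptions:

```python
def should_apply_to(self, db_object) -> bool:
    # Bootstrap case: no hash recorded yet, always apply (assumed branch,
    # referenced as the "null-hash bootstrap case" in the test summary).
    if db_object.updated_at_data_hash is None:
        return True
    if (
        self.as_of_timestamp is not None
        and db_object.updated_at is not None
        and self.as_of_timestamp < db_object.updated_at
    ):
        # The data we have is strictly older than what is stored, no update needed.
        return False
    # Timestamps are inconclusive; the content hash is the authority.
    return self.calculate_hash() != db_object.updated_at_data_hash
```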
```python
    as_of=self.as_of_timestamp,
)
...
pool.updated_at = self.as_of_timestamp
```
Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.
```python
# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
```
Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but it means [1, 1.0] and [1.0, 1] canonicalise to different outputs: since 1 == 1.0 in Python, the stable sort preserves input order, so the two lists serialise to different JSON strings. If that distinction ever matters, a separate precedence for float would be needed.
```python
if edition.updated_at is None or edition.updated_at < updated_at:
    edition.updated_at = updated_at
...
edition.updated_at_data_hash = self.calculate_hash()
```
Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.
```python
    self.last_checked if self.last_checked else utc_now()
)
license_pool.last_checked = None
license_pool.created_at = self.created_at
```
Minor: Setting license_pool.created_at = self.created_at means the "created at" timestamp for a new pool is the time the CirculationData Python object was constructed during this import run, not the data source's own updated_at. For Boundless, the previous code set availability_time to title.availability.update_date (via last_checked), allowing the "Added to Collection" feed sort to reflect the source's timestamp. With this change, availability_time always comes from utc_now() in LicensePool.for_foreign_id and created_at records import-time. The behaviour change is arguably more correct semantically, but it will alter the "Added to Collection" ordering for any existing data source that previously set last_checked.
```python
and apply_circulation is not None
and (
    import_even_if_unchanged
    or bibliographic.circulation.needs_apply(session, collection)
```
For ODL pools (self.licenses is not None), needs_apply always returns True, so a circulation task is queued on every feed import for every ODL title whose bibliographic data is unchanged. This is intentional (license expiry is time-dependent and can't be detected by hashing), but it means the elif branch effectively opts ODL pools out of the hash-skipping optimisation entirely. Worth a brief comment here so the next reader understands the ODL path is not a missed skip opportunity.
```python
# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
```
When as_of_timestamp == edition.updated_at but the hash has changed (i.e. a publisher re-publishes with the same timestamp but different content), should_apply_to correctly returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never updated. Every subsequent import with the same timestamp and the now-stable content will still see H_new != H_old and re-apply indefinitely. Fix: also update the hash when the timestamps are equal (edition.updated_at <= updated_at instead of <).
```diff
  results[identifier] = PublicationImportResult(
      bibliographic=bibliographic,
-     changed=has_changed,
+     changed=needs_apply,
```
For ODL pools, bibliographic.circulation.needs_apply() always returns True (licence expiry is time-dependent), so called_circulation_apply is set to True whenever bib data is stable. However, changed is still False here, making FeedImportResult.found_unchanged_publication return True after the first page. opds_import_task then stops paginating, so ODL titles on pages 2+ never have their licenses refreshed. Fix: changed=needs_apply or called_circulation_apply so that queuing a circulation task counts as "the publication needed work".
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead (see the sketch after this list).
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and invalidate on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
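A rough sketch of that sentinel fix; the precedence value for `None`, the `str` entry, and the exact key shape are assumptions, not the PR's code:

```python
# sorted([None, None]) raises TypeError because None does not support "<".
# Giving every None the same (precedence, sentinel) key sidesteps the comparison.

# Minimal stand-in for the module constant quoted later in the review thread;
# the str entry is an assumption.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {bool: 0, int: 1, float: 1, str: 2}
_NONE_PRECEDENCE = -1  # illustrative; the actual value in the PR may differ

def _canonicalize_sort_key(value: object) -> tuple[int, object]:
    if value is None:
        # Stable sentinel: all Nones get the same comparable key, so the raw
        # None values are never compared against each other.
        return (_NONE_PRECEDENCE, "")
    return (_CANONICALIZE_TYPE_ORDER[type(value)], value)

# Nones sort first, stably, with no TypeError.
assert sorted([None, 1, None, "a"], key=_canonicalize_sort_key) == [None, None, 1, "a"]
```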
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
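Sketched, that guard looks roughly like this; only the `self.licenses is not None` early return is confirmed by the commit, and the pool lookup is a hypothetical placeholder:

```python
def needs_apply(self, session, collection) -> bool:
    if self.licenses is not None:
        # ODL-style pools: license expiry is time-dependent and invisible to
        # content hashing, so always reprocess.
        return True
    # _find_license_pool is a hypothetical stand-in for however the target
    # LicensePool is actually resolved in the real method.
    pool = self._find_license_pool(session, collection)
    return pool is None or self.should_apply_to(pool)
```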
…erate vastly fewer apply queue tasks, we can afford to pull more books per page.
When stale data was force-applied via even_if_not_apparently_updated=True, _update_edition_timestamp would skip advancing edition.updated_at (correct) but still overwrite updated_at_data_hash with the stale content hash (incorrect). This broke the invariant that updated_at_data_hash reflects the content as of updated_at, and could cause future imports with timestamps between the stale and stored timestamps to compare against the wrong hash.

Fix by moving the hash update inside the same conditional as the timestamp update, so both fields are always kept in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pool.updated_at is set unconditionally (even when stale data is force-applied via even_if_not_apparently_updated=True), rolling both updated_at and updated_at_data_hash together. This ensures future imports at any timestamp after as_of_timestamp are not blocked by the strict less-than check in should_apply_to() and can reach the hash comparison.

This is intentionally different from the Edition path, which leaves both fields unchanged when as_of_timestamp does not advance updated_at. Add a comment to prevent a well-meaning refactor from "fixing" this asymmetry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
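In code, the commit's intent is roughly the two assignments below, rolled together; the surrounding context is assumed:

```python
# Intentionally unconditional, unlike the Edition path: rolling updated_at and
# updated_at_data_hash together ensures future imports at any timestamp after
# as_of_timestamp reach the hash comparison instead of the strict < rejection.
pool.updated_at = self.as_of_timestamp
pool.updated_at_data_hash = self.calculate_hash()
```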
self.created_at is the import-time timestamp of the CirculationData object itself, not the data source availability date. Using it for license_pool.created_at changes the Added to Collection sort order for sources like Boundless that populate updated_at from the feed update_date.

as_of_timestamp resolves to updated_at when set (the data source's own availability timestamp) and falls back to created_at (import time) when not, mirroring the previous behavior of:

    availability_time = self.last_checked if self.last_checked else utc_now()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
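That fallback, sketched as a property; the shape is inferred from the commit's description, not copied from the PR:

```python
from datetime import datetime

class BaseMutableData:  # sketch only; the real class is a Pydantic model
    updated_at: datetime | None
    created_at: datetime

    @property
    def as_of_timestamp(self) -> datetime:
        # Prefer the data source's own timestamp; fall back to import time.
        return self.updated_at if self.updated_at is not None else self.created_at
```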
ODL pools (licenses is not None) always return True from needs_apply, so they are exempt from the hash-based skip optimization. License expiry is time-dependent and cannot be detected by content hashing alone. Add a call-site comment so this is visible to readers of the importer loop without having to trace into needs_apply.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed 784c6e7 to 361c999
```python
# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
```
When as_of_timestamp == edition.updated_at but the hash has changed (publisher re-published identical timestamp, different content), should_apply_to returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never written. Every subsequent import with the same timestamp and the now-stable content will still see a hash mismatch and re-apply indefinitely.
Fix: update the hash when timestamps are equal — either edition.updated_at <= updated_at for the outer guard (advancing updated_at only when strictly newer), or write the hash in a separate branch for the equality case.
```diff
  results[identifier] = PublicationImportResult(
      bibliographic=bibliographic,
-     changed=has_changed,
+     changed=needs_apply,
```
For an ODL title where bibliographic data is stable (needs_apply = False) but called_circulation_apply = True (because CirculationData.needs_apply always returns True for ODL pools), changed is still False. FeedImportResult.found_unchanged_publication therefore returns True after the first page, and opds_import_task stops paginating — so ODL titles on pages 2+ never get their licenses refreshed.
Fix: changed=needs_apply or called_circulation_apply.
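The suggested change, as a sketch; the surrounding keyword arguments are assumed from the quoted hunk:

```python
results[identifier] = PublicationImportResult(
    bibliographic=bibliographic,
    # Queuing a circulation task counts as work done for this publication,
    # so unchanged-bib ODL titles no longer halt feed pagination.
    changed=needs_apply or called_circulation_apply,
)
```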
When as_of_timestamp equals edition.updated_at but content has changed, should_apply_to correctly returns True (hash differs) and the update runs, but the old < comparison in _update_edition_timestamp skipped storing the new hash. On every subsequent re-import the hash still mismatched, causing the update to re-apply indefinitely. Changing to <= ensures equal-timestamp imports with changed content still record the new hash, breaking the loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
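Putting both review fixes together, the method plausibly ends up shaped like this; a sketch, since the final code is not quoted in the thread:

```python
def _update_edition_timestamp(self, edition, updated_at) -> None:
    # <= (not <) so an equal-timestamp import with changed content still
    # records the new hash; both fields move together so updated_at_data_hash
    # always reflects the content as of updated_at.
    if edition.updated_at is None or edition.updated_at <= updated_at:
        edition.updated_at = updated_at
        edition.updated_at_data_hash = self.calculate_hash()
```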
When bibliographic data is unchanged but an ODL circulation task is queued (needs_apply always returns True for ODL pools due to license expiry), PublicationImportResult.changed was set to needs_apply (False). This caused found_unchanged_publication to return True, halting pagination after the first page.

Fix by setting changed=needs_apply or called_circulation_apply so that queuing a circulation task counts as work done for the publication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit changed license_pool.created_at from self.created_at to self.as_of_timestamp. test_apply_wont_overwrite_if_its_data_is_stale was asserting pool.created_at == circulation.created_at, but for a CirculationData with updated_at set, as_of_timestamp resolves to updated_at (not created_at). Update all three assertions and their comments to use circulation.as_of_timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The method and its call site were made obsolete by the lastUpdateTime parameter already being passed to Overdrive's API, which performs server-side filtering so only relevant books are returned in the first place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
# The data we have is strictly older than what is stored, no update needed.
return False
...
return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When as_of_timestamp > db_object.updated_at but the hash matches (content unchanged, timestamp advanced), should_apply_to returns False and db_object.updated_at is never advanced. So updated_at semantically means "timestamp of last content change", not "most recent timestamp seen". This is consistent with the Edition path (_update_edition_timestamp is never called when we skip) and the non-ODL pool path (early return, no fields written), but differs from the ODL pool path (unconditionally advances pool.updated_at). The asymmetry is harmless for correctness — the hash is the authority — but may be surprising if updated_at is ever queried as an activity metric.
```python
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
    float: 1,
```
Nit: int and float share precedence 1. When a sequence contains 1 (int) and 1.0 (float), 1 == 1.0 in Python so the stable sort preserves input order, and json.dumps([1, 1.0]) vs json.dumps([1.0, 1]) differ as strings. In practice this is benign — model_dump(mode="json") produces list elements of uniform types — but it means json_canonical is not fully independent of the int/float distinction within the same list position.
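A quick demonstration of the behaviour described, in plain Python with no project code:

```python
import json

# 1 == 1.0, so a stable sort treats them as equal and preserves input order...
print(sorted([1, 1.0]))      # [1, 1.0]
print(sorted([1.0, 1]))      # [1.0, 1]

# ...which means the serialized JSON strings differ by position.
print(json.dumps([1, 1.0]))  # [1, 1.0]
print(json.dumps([1.0, 1]))  # [1.0, 1]
```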
dbernstein left a comment:
I think this one is good to go. @jonathangreen I thoroughly tested our combined changes. It's working as expected. Claude made a bunch of suggestions which I thought were useful, so those went in. I also was able to remove some effectively dead and confusing code (i.e. the removal of the early-exit routine; last commit). Since we both worked on it, you may want to take a look at it before I merge it?
Sounds good @dbernstein. I'll take a run through it today before you merge.
jonathangreen left a comment:
Looks good @dbernstein. I'm okay with this getting merged. I am concerned about the queue management we will have to do when this goes in though, as a full re-import of everything will probably fill all of our queues.
```python
_primary_identifier: Identifier | None = PrivateAttr(default=None)
# Lazily computed and cached for the lifetime of this object. Cleared by
# __setattr__ whenever a public field is mutated so the hash stays consistent.
_hash_cache: str | None = PrivateAttr(default=None)
```
This cache is a nice addition.
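For readers outside the codebase, the pattern reduces to something like this standalone sketch, assuming Pydantic v2; the model, fields, and hashing details here are illustrative, not the project's exact code:

```python
import hashlib
import json

from pydantic import BaseModel, PrivateAttr

class Example(BaseModel):
    title: str | None = None

    # Lazily computed; cleared whenever a public field is mutated.
    _hash_cache: str | None = PrivateAttr(default=None)

    def __setattr__(self, name: str, value: object) -> None:
        if not name.startswith("_"):
            # Any public-field mutation invalidates the cached hash.
            self._hash_cache = None
        super().__setattr__(name, value)

    def calculate_hash(self) -> str:
        if self._hash_cache is None:
            canonical = json.dumps(self.model_dump(mode="json"), sort_keys=True)
            self._hash_cache = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return self._hash_cache

m = Example(title="A")
h1 = m.calculate_hash()   # computed once, then cached
m.title = "B"             # mutation clears the cache
assert m.calculate_hash() != h1
```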
```python
# Skip circulation data update if the content hasn't changed, UNLESS we
# have individual license objects that may have expired since the last
# import (ODL-style pools). License expiry is time-dependent and cannot
# be detected by content hashing alone.
```
I'd really like to see the expiration be handled in its own process instead of tangled up in change tracking. I think this carve-out makes it kind of hard to reason about ODL-style pools, and means that they cannot participate in change tracking, which I think would be beneficial for them.
I'm okay with this going in, but I'd like to make a follow-up to have an async task to process license expiry, so it's not tied together here.
That seems reasonable. Here's the ticket: https://ebce-lyrasis.atlassian.net/browse/PP-4177
Description
Replaces the timestamp-only change-detection logic in the data import pipeline with a content-hash-based system. Previously, `BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems: a source re-publishing identical content with a newer timestamp triggered a redundant re-import, and changed content arriving with an unchanged timestamp was silently skipped.

This branch reverts the original timestamp-throttling PR (#3198) and replaces it with a proper content-hash approach. A SHA-256 hash of the canonical, serialized form of the incoming data is stored on the database record after each import. Subsequent imports compare both the timestamp and the hash before deciding whether to apply an update.
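As a rough picture of what "hash of the canonical, serialized form" means in practice, here is a simplified sketch; the PR's `json_canonical()` in `util/json.py` also normalises list ordering and float precision, which this toy version does not:

```python
import hashlib
import json

def json_canonical(value: object) -> str:
    # Stable serialization: sorted keys, no insignificant whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))

def json_hash(value: object) -> str:
    return hashlib.sha256(json_canonical(value).encode("utf-8")).hexdigest()

# Key order no longer affects the fingerprint.
assert json_hash({"a": 1, "b": 2}) == json_hash({"b": 2, "a": 1})
```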
Key changes:
- `json_hash()` / `json_canonical()` utilities (`util/json.py`) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- `LicensePool.created_at` is set to `as_of_timestamp` (the data source's own availability date) rather than the import-time timestamp, preserving the "Added to Collection" feed sort order.
- `f98e4049c87d` adds all four new columns.

Bug fixes (addressed during review):
- `_update_edition_timestamp` now updates `updated_at` and `updated_at_data_hash` together (or not at all), preserving the invariant that the hash always reflects the content as of `updated_at`. Previously, force-applying stale data via `even_if_not_apparently_updated=True` overwrote the hash without advancing the timestamp, leaving the two fields inconsistent.
- `_update_edition_timestamp` now uses `<=` instead of `<` when comparing timestamps. The previous strict less-than caused an infinite re-apply loop: when incoming data had the same timestamp as the stored record but different content, the update ran on every import but the new hash was never persisted.
- `PublicationImportResult.changed` is now `needs_apply or called_circulation_apply`. Previously, when bibliographic data was unchanged but an ODL circulation task was queued, `changed` was `False`, causing `found_unchanged_publication` to return `True` and halt feed pagination after the first page.
- `opds_import_task` now passes `apply_circulation` to `importer.import_feed`, restoring the fallback path for "bibliographic unchanged, circulation changed". Previously this path was dead code, causing circulation-only updates to be silently skipped.

Motivation and Context
https://ebce-lyrasis.atlassian.net/browse/PP-3997
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.

How Has This Been Tested?
- Tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, the hash-match skip, the equal-timestamp/changed-content case, and the stale force-apply invariant.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- `found_unchanged_publication` stays `False` when only a circulation task is queued.
- Existing tests updated (`updated_at` in place of `data_source_last_updated`).
- Full test suite run via `tox -e py312-docker -- --no-cov`.

Manually tested OverDrive locally by me (@dbernstein) with a non-advantage collection. It seems to be working as advertised. Before merging I will confirm that it is working with a parent + two child advantage accounts.
Checklist