
Content-hash-based change tracking for data imports #3199

Merged
dbernstein merged 20 commits into main from feature/change-tracking
Apr 23, 2026

Conversation

@jonathangreen
Member

@jonathangreen jonathangreen commented Apr 2, 2026

Description

Replaces the timestamp-only change-detection logic in the data import pipeline with a content-hash-based system. Previously, BibliographicData and CirculationData used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an Edition or LicensePool. This caused two problems:

  • Re-publishing the same content with a newer timestamp triggered unnecessary writes and work creation.
  • A real change arriving with the same (or missing) timestamp was silently skipped.

This branch reverts the original timestamp-throttling PR (#3198) and replaces it with a proper content-hash approach. A SHA-256 hash of the canonical, serialized form of the incoming data is stored on the database record after each import. Subsequent imports compare both the timestamp and the hash before deciding whether to apply an update.

Key changes:

  • New json_hash() / json_canonical() utilities (util/json.py) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
  • BaseMutableData gains updated_at, created_at, as_of_timestamp, calculate_hash(), and should_apply_to(). The should_apply_to() method is now the single decision point for both bibliographic and circulation data.
  • BibliographicData.has_changed() and CirculationData.has_changed() are removed and replaced by the shared should_apply_to() logic.
  • Edition and LicensePool each gain an updated_at_data_hash column. LicensePool also gains created_at and updated_at columns to track when its CirculationData was first and most recently imported.
  • Individual-license pools (e.g. ODL) always re-apply availability even when the hash matches, because license availability can change as licenses expire independently of feed content.
  • LicensePool.created_at is set to as_of_timestamp (the data source's own availability date) rather than the import-time timestamp, preserving the "Added to Collection" feed sort order.
  • Database migration f98e4049c87d adds all four new columns.
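The json_canonical() / json_hash() pair described above can be sketched as follows. This is a minimal illustration only, assuming canonicalisation via sorted dict keys and compact separators; the merged implementation additionally canonicalises list ordering and float precision:

```python
import hashlib
import json


def json_canonical(data) -> str:
    """Deterministic serialization: sorted keys, no insignificant whitespace."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def json_hash(data) -> str:
    """Stable SHA-256 fingerprint of the canonical serialization."""
    return hashlib.sha256(json_canonical(data).encode("utf-8")).hexdigest()
```

Two structurally equal payloads hash identically regardless of key order, so a re-published feed entry with an advanced timestamp but unchanged content produces the same fingerprint.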

Bug fixes (addressed during review):

  • _update_edition_timestamp now updates updated_at and updated_at_data_hash together (or not at all), preserving the invariant that the hash always reflects the content as of updated_at. Previously, force-applying stale data via even_if_not_apparently_updated=True overwrote the hash without advancing the timestamp, leaving the two fields inconsistent.
  • _update_edition_timestamp now uses <= instead of < when comparing timestamps. The previous strict less-than caused an infinite re-apply loop: when incoming data had the same timestamp as the stored record but different content, the update ran on every import but the new hash was never persisted.
  • PublicationImportResult.changed is now needs_apply or called_circulation_apply. Previously, when bibliographic data was unchanged but an ODL circulation task was queued, changed was False, causing found_unchanged_publication to return True and halt feed pagination after the first page.
  • OPDS import task now passes apply_circulation to importer.import_feed, restoring the fallback path for "bibliographic unchanged, circulation changed". Previously this path was dead code, causing circulation-only updates to be silently skipped.

Motivation and Context

https://ebce-lyrasis.atlassian.net/browse/PP-3997

The original has_changed() implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.

How Has This Been Tested?

  • Updated unit tests for BibliographicData and CirculationData cover the new should_apply_to() logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, the hash-match skip, the equal-timestamp/changed-content case, and the stale force-apply invariant.
  • New unit tests for json_canonical() and json_hash() verify ordering stability across dict keys, list items, and float precision.
  • New test for the ODL pagination fix verifies that found_unchanged_publication stays False when only a circulation task is queued.
  • All existing integration tests for Boundless, OPDS, ODL, and Overdrive importers pass with the updated field names (updated_at in place of data_source_last_updated).
  • Full test suite run via tox -e py312-docker -- --no-cov.

Manually tested Overdrive locally by @dbernstein with a non-advantage collection. It seems to be working as advertised. Before merging I will confirm that it is working with a parent + two child advantage accounts.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@jonathangreen jonathangreen added the feature New feature label Apr 2, 2026
@codecov

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 99.02913% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 93.31%. Comparing base (45176e0) to head (07765ca).
⚠️ Report is 19 commits behind head on main.

Files with missing lines Patch % Lines
.../manager/integration/license/overdrive/importer.py 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3199      +/-   ##
==========================================
- Coverage   93.31%   93.31%   -0.01%     
==========================================
  Files         502      502              
  Lines       46178    46211      +33     
  Branches     6315     6314       -1     
==========================================
+ Hits        43093    43123      +30     
  Misses       2001     2001              
- Partials     1084     1087       +3     


@dbernstein dbernstein changed the title from "WIP: Content-hash-based change tracking for data imports" to "Content-hash-based change tracking for data imports" Apr 6, 2026
@dbernstein dbernstein force-pushed the feature/change-tracking branch 6 times, most recently from f58ea44 to 8135dd0 on April 13, 2026 16:50
@dbernstein dbernstein force-pushed the feature/change-tracking branch 2 times, most recently from f5a7e26 to d17bf96 on April 18, 2026 01:39
@claude
Contributor

claude Bot commented Apr 18, 2026

Claude finished @dbernstein's task in 7m 6s


PR Review: Content-hash-based change tracking for data imports

  • Gather context and understand the request
  • Read core changed files (mutable.py, bibliographic.py, circulation.py, json.py)
  • Read importer and model files
  • Read test files
  • Post inline comments for specific findings
  • Write final summary

Summary

All bugs flagged in previous review rounds have been correctly addressed:

  • _update_edition_timestamp hash/timestamp invariant (f269229): Both edition.updated_at and edition.updated_at_data_hash are now updated together inside a single <= guard, preserving the invariant that the hash always reflects content as of the stored timestamp. The test at line 829–834 of test_bibliographic.py confirms stale force-applies leave the hash untouched.

  • <= instead of < (65214a6): Equal-timestamp imports with changed content now store the new hash, preventing the infinite re-apply loop where a hash mismatch was seen on every import but never persisted.

  • ODL pagination fix (02d141d): PublicationImportResult.changed = needs_apply or called_circulation_apply ensures that queuing a circulation-only task counts as "changed", so found_unchanged_publication stays False and pagination continues past page 1 for ODL feeds.

  • apply_circulation wired up in opds_import_task: The fallback path for "bibliographic unchanged, circulation changed" is no longer dead code. Verified in celery/opds.py and celery/tasks/opds_odl.py.

  • ODL needs_apply always True (361c999): CirculationData.needs_apply short-circuits to True when self.licenses is not None, mirroring the existing guard in apply().

The two inline comments above are minor observations only — no blocking issues remain. The implementation is correct, the test coverage is thorough (including the equal-timestamp/changed-content and stale-force-apply edge cases), and the migration is clearly documented with the expected first-import performance caveat.

# The data we have is strictly older than what is stored, no update needed.
return False

return self.calculate_hash() != db_object.updated_at_data_hash
Contributor


Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.

Contributor


This issue will be addressed: Fix by moving the hash update inside the same conditional as the timestamp
update, so both fields are always kept in sync.

as_of=self.as_of_timestamp,
)

pool.updated_at = self.as_of_timestamp
Contributor


Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.

# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
bool: 0,
int: 1,
Contributor


Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but it means [1, 1.0] and [1.0, 1] canonicalise to the same sorted output [1, 1.0] (since 1 == 1.0 in Python, the sort is stable and preserves input order). If that distinction ever matters, a separate precedence for float would be needed.
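The tie can be demonstrated directly. Since 1 == 1.0, Python's stable sort preserves the input order of the tied elements, and the serialized strings differ:

```python
import json

# 1 == 1.0, so sorted() treats them as equal and keeps input order (stable sort).
a = json.dumps(sorted([1, 1.0]))   # "[1, 1.0]"
b = json.dumps(sorted([1.0, 1]))   # "[1.0, 1]"
```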

if edition.updated_at is None or edition.updated_at < updated_at:
edition.updated_at = updated_at

edition.updated_at_data_hash = self.calculate_hash()
Contributor


Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.

self.last_checked if self.last_checked else utc_now()
)
license_pool.last_checked = None
license_pool.created_at = self.created_at
Contributor


Minor: Setting license_pool.created_at = self.created_at means the "created at" timestamp for a new pool is the time the CirculationData Python object was constructed during this import run, not the data source's own updated_at. For Boundless, the previous code set availability_time to title.availability.update_date (via last_checked), allowing the "Added to Collection" feed sort to reflect the source's timestamp. With this change, availability_time always comes from utc_now() in LicensePool.for_foreign_id and created_at records import-time. The behaviour change is arguably more correct semantically, but it will alter the "Added to Collection" ordering for any existing data source that previously set last_checked.

and apply_circulation is not None
and (
import_even_if_unchanged
or bibliographic.circulation.needs_apply(session, collection)
Contributor


For ODL pools (self.licenses is not None), needs_apply always returns True, so a circulation task is queued on every feed import for every ODL title whose bibliographic data is unchanged. This is intentional (license expiry is time-dependent and can't be detected by hashing), but it means the elif branch effectively opts ODL pools out of the hash-skipping optimisation entirely. Worth a brief comment here so the next reader understands the ODL path is not a missed skip opportunity.

# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
Contributor


When as_of_timestamp == edition.updated_at but the hash has changed (i.e. a publisher re-publishes with the same timestamp but different content), should_apply_to correctly returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never updated. Every subsequent import with the same timestamp and the now-stable content will still see H_new != H_old and re-apply indefinitely. Fix: also update the hash when the timestamps are equal (edition.updated_at <= updated_at instead of <).

results[identifier] = PublicationImportResult(
bibliographic=bibliographic,
changed=has_changed,
changed=needs_apply,
Contributor


For ODL pools, bibliographic.circulation.needs_apply() always returns True (licence expiry is time-dependent), so called_circulation_apply is set to True whenever bib data is stable. However, changed is still False here, making FeedImportResult.found_unchanged_publication return True after the first page. opds_import_task then stops paginating, so ODL titles on pages 2+ never have their licenses refreshed. Fix: changed=needs_apply or called_circulation_apply so that queuing a circulation task counts as "the publication needed work".

dbernstein and others added 10 commits April 20, 2026 12:38
Fixes all broken tests, mypy errors, and incomplete source changes from
the initial WIP commit (bde0829).

This commit contains all Claude authored work.

- LicensePool model was missing `updated_at` and `created_at` columns
  referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py,
  and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls
  in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py,
  two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when
  hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because
  license expiry is time-dependent, not detectable by content hash
src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool
src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps
  still trigger a hash check rather than an unconditional skip
src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode
  binary bytes in the `bytes | str | None` union field
src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() /
  apply_edition_only(); `_update_edition_timestamp` now also stores
  `updated_at_data_hash` on the edition
src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None`
  (ODL-style pools) so time-expired licenses are always reprocessed;
  inner availability block gets the same treatment
src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters
src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`
src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`
alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools,
  `created_at` / `updated_at` on licensepools
tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite
  test_apply_no_changes_needed for hash-based semantics; rename
  test_data_source_last_updated_updates_timestamp
tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None
tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked
  to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash`
  so that identical content with different timestamps does not trigger
  spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing
  multiple `None` values (`None < None` raises TypeError in Python).
  Use a stable sentinel `""` as the second element of the sort key instead.
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid
  rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and
  invalidate on field mutation, avoiding a redundant SHA-256 computation
  per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`;
  the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and
  `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported
  type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy
  performance impact.
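The multiple-None crash fixed above can be reproduced, and the sentinel sort key sketched. Names follow the commit message, but the body (including the None precedence entry) is an illustrative assumption:

```python
# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    type(None): -1,  # assumed precedence for None; not confirmed by the PR
    bool: 0,
    int: 1,
    float: 1,
    str: 2,
}


def _canonicalize_sort_key(value):
    # None gets a stable "" sentinel as the comparable component,
    # since None < None raises TypeError in Python 3.
    if value is None:
        return (_CANONICALIZE_TYPE_ORDER[type(None)], "")
    return (_CANONICALIZE_TYPE_ORDER[type(value)], value)


items = sorted([None, "b", None, 1], key=_canonicalize_sort_key)
```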
…ction

The `opds_import_task` was not passing `apply_circulation` to
`importer.import_feed`, making the fallback path for "bibliographic
unchanged, circulation changed" completely dead code. Pass
`apply.circulation_apply.delay` to restore that path.
Add a `needs_apply` guard to the `elif` branch in
`import_feed_from_response` so `apply_circulation` is only queued when
the circulation data has actually changed, preventing redundant tasks on
every re-import of unchanged content.
Fix `CirculationData.needs_apply` to always return `True` when
`self.licenses` is not None (ODL-style pools). License expiry is
time-dependent and cannot be detected by content hashing alone; this
mirrors the existing exception already present in the `apply()` early-
return guard.
dbernstein and others added 6 commits April 20, 2026 12:38
…erate vastly few apply queue tasks, we can afford to pull more books per page.
When stale data was force-applied via even_if_not_apparently_updated=True,
_update_edition_timestamp would skip advancing edition.updated_at (correct)
but still overwrite updated_at_data_hash with the stale content hash
(incorrect). This broke the invariant that updated_at_data_hash reflects
the content as of updated_at, and could cause future imports with
timestamps between the stale and stored timestamps to compare against the
wrong hash.

Fix by moving the hash update inside the same conditional as the timestamp
update, so both fields are always kept in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pool.updated_at is set unconditionally (even when stale data is
force-applied via even_if_not_apparently_updated=True), rolling both
updated_at and updated_at_data_hash together. This ensures future imports
at any timestamp after as_of_timestamp are not blocked by the strict
less-than check in should_apply_to() and can reach the hash comparison.

This is intentionally different from the Edition path, which leaves both
fields unchanged when as_of_timestamp does not advance updated_at. Add a
comment to prevent a well-meaning refactor from "fixing" this asymmetry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
self.created_at is the import-time timestamp of the CirculationData
object itself, not the data source availability date. Using it for
license_pool.created_at changes the Added to Collection sort order
for sources like Boundless that populate updated_at from the feed
update_date.

as_of_timestamp resolves to updated_at when set (the data source's own
availability timestamp) and falls back to created_at (import time) when
not, mirroring the previous behavior of:
  availability_time = self.last_checked if self.last_checked else utc_now()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ODL pools (licenses is not None) always return True from needs_apply,
so they are exempt from the hash-based skip optimization. License
expiry is time-dependent and cannot be detected by content hashing
alone. Add a call-site comment so this is visible to readers of the
importer loop without having to trace into needs_apply.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbernstein dbernstein force-pushed the feature/change-tracking branch from 784c6e7 to 361c999 on April 20, 2026 19:39
# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
Contributor


When as_of_timestamp == edition.updated_at but the hash has changed (publisher re-published identical timestamp, different content), should_apply_to returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never written. Every subsequent import with the same timestamp and the now-stable content will still see a hash mismatch and re-apply indefinitely.

Fix: update the hash when timestamps are equal — either edition.updated_at <= updated_at for the outer guard (advancing updated_at only when strictly newer), or write the hash in a separate branch for the equality case.

results[identifier] = PublicationImportResult(
bibliographic=bibliographic,
changed=has_changed,
changed=needs_apply,
Contributor


For an ODL title where bibliographic data is stable (needs_apply = False) but called_circulation_apply = True (because CirculationData.needs_apply always returns True for ODL pools), changed is still False. FeedImportResult.found_unchanged_publication therefore returns True after the first page, and opds_import_task stops paginating — so ODL titles on pages 2+ never get their licenses refreshed.

Fix: changed=needs_apply or called_circulation_apply.

dbernstein and others added 3 commits April 20, 2026 12:51
When as_of_timestamp equals edition.updated_at but content has changed,
should_apply_to correctly returns True (hash differs) and the update
runs — but the old < comparison in _update_edition_timestamp skipped
storing the new hash. On every subsequent re-import the hash still
mismatched, causing the update to re-apply indefinitely.

Changing to <= ensures equal-timestamp imports with changed content
still record the new hash, breaking the loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When bibliographic data is unchanged but an ODL circulation task is
queued (needs_apply always returns True for ODL pools due to license
expiry), PublicationImportResult.changed was set to needs_apply (False).
This caused found_unchanged_publication to return True, halting
pagination after the first page.

Fix by setting changed=needs_apply or called_circulation_apply so that
queuing a circulation task counts as work done for the publication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit changed license_pool.created_at from self.created_at
to self.as_of_timestamp. test_apply_wont_overwrite_if_its_data_is_stale
was asserting pool.created_at == circulation.created_at, but for a
CirculationData with updated_at set, as_of_timestamp resolves to
updated_at (not created_at). Update all three assertions and their
comments to use circulation.as_of_timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbernstein dbernstein marked this pull request as ready for review April 20, 2026 20:24
@dbernstein dbernstein requested a review from a team April 20, 2026 21:23
@dbernstein dbernstein enabled auto-merge (squash) April 20, 2026 21:26
The method and its call site were made obsolete by the lastUpdateTime
parameter already being passed to Overdrive's API, which performs
server-side filtering so only relevant books are returned in the
first place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# The data we have is strictly older than what is stored, no update needed.
return False

return self.calculate_hash() != db_object.updated_at_data_hash
Contributor


Minor: When as_of_timestamp > db_object.updated_at but the hash matches (content unchanged, timestamp advanced), should_apply_to returns False and db_object.updated_at is never advanced. So updated_at semantically means "timestamp of last content change", not "most recent timestamp seen". This is consistent with the Edition path (_update_edition_timestamp is never called when we skip) and the non-ODL pool path (early return, no fields written), but differs from the ODL pool path (unconditionally advances pool.updated_at). The asymmetry is harmless for correctness — the hash is the authority — but may be surprising if updated_at is ever queried as an activity metric.

_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
bool: 0,
int: 1,
float: 1,
Contributor


Nit: int and float share precedence 1. When a sequence contains 1 (int) and 1.0 (float), 1 == 1.0 in Python so the stable sort preserves input order, and json.dumps([1, 1.0]) vs json.dumps([1.0, 1]) differ as strings. In practice this is benign — model_dump(mode="json") produces list elements of uniform types — but it means json_canonical is not fully independent of the int/float distinction within the same list position.

@dbernstein dbernstein disabled auto-merge April 21, 2026 05:30
Contributor

@dbernstein dbernstein left a comment


I think this one is good to go. @jonathangreen I thoroughly tested our combined changes. It's working as expected. Claude made a bunch of suggestions which I thought were useful so those went in. I also was able to remove some effectively dead and confusing code (ie the removal of the early exit routine - last commit). Since we both worked on it, you may want to take a look at it before I merge it?

@dbernstein dbernstein requested a review from a team April 21, 2026 21:26
@jonathangreen
Member Author

Sounds good @dbernstein. I'll take a run through it today before you merge.

Member Author

@jonathangreen jonathangreen left a comment


Looks good @dbernstein. I'm okay with this getting merged. I am concerned about the queue management we will have to do when this goes in though, as a full re-import of everything will probably fill all of our queues.

_primary_identifier: Identifier | None = PrivateAttr(default=None)
# Lazily computed and cached for the lifetime of this object. Cleared by
# __setattr__ whenever a public field is mutated so the hash stays consistent.
_hash_cache: str | None = PrivateAttr(default=None)
Member Author


This cache is a nice addition.

# Skip circulation data update if the content hasn't changed, UNLESS we
# have individual license objects that may have expired since the last
# import (ODL-style pools). License expiry is time-dependent and cannot
# be detected by content hashing alone.
Member Author


I'd really like to see the expiration be handled in its own process instead of tangled up in change tracking. I think this carve-out makes it kind of hard to reason about ODL-style pools, and means that they cannot participate in change tracking, which I think would be beneficial for them.

I'm okay with this going in, but I'd like to make a follow up to have an async task to process license expiry, so its not tied together here.

Contributor


That seems reasonable. Here's the ticket: https://ebce-lyrasis.atlassian.net/browse/PP-4177

@dbernstein dbernstein merged commit 7fb2e0c into main Apr 23, 2026
21 checks passed
@dbernstein dbernstein deleted the feature/change-tracking branch April 23, 2026 16:57