Content-hash-based change tracking for data imports #3199
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #3199      +/-   ##
==========================================
- Coverage   93.31%   93.31%    -0.01%
==========================================
  Files         502      502
  Lines       46178    46211       +33
  Branches     6315     6314        -1
==========================================
+ Hits        43093    43123       +30
  Misses       2001     2001
- Partials     1084     1087        +3
```
Force-pushed f58ea44 to 8135dd0
Force-pushed f5a7e26 to d17bf96
Claude finished @dbernstein's task in 7m 6s

PR Review: Content-hash-based change tracking for data imports
Summary: All bugs flagged in previous review rounds have been correctly addressed. The two inline comments below are minor observations only; no blocking issues remain. The implementation is correct, the test coverage is thorough (including the equal-timestamp/changed-content and stale-force-apply edge cases), and the migration is clearly documented with the expected first-import performance caveat.
```python
# The data we have is strictly older than what is stored, no update needed.
return False
...
return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When even_if_not_apparently_updated=True is used with data whose as_of_timestamp is older than db_object.updated_at (i.e. force-applying stale bibliographic data), _update_edition_timestamp will not advance updated_at but will still overwrite updated_at_data_hash with the stale hash. Subsequent imports with a timestamp in the range (as_of_timestamp, db_object.updated_at) will be rejected by the < check here even though their hash would differ. The LicensePool path in circulation.apply does not have this issue because it always overwrites pool.updated_at. The inconsistency is harmless in practice (data sources advance timestamps monotonically) but worth documenting.
This issue will be addressed: fix by moving the hash update inside the same conditional as the timestamp update, so both fields are always kept in sync.
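For readers following along, here is a minimal sketch of the `should_apply_to` decision point under discussion, reconstructed from the quoted hunk and the surrounding comments; the `None`-handling details and the bootstrap branch's exact placement are assumptions:

```python
def should_apply_to(self, db_object) -> bool:
    # Bootstrap case: no hash recorded yet, always apply (assumed branch,
    # referenced as the "null-hash bootstrap case" in the test summary).
    if db_object.updated_at_data_hash is None:
        return True
    if (
        self.as_of_timestamp is not None
        and db_object.updated_at is not None
        and self.as_of_timestamp < db_object.updated_at
    ):
        # The data we have is strictly older than what is stored, no update needed.
        return False
    # Timestamps are inconclusive; the content hash is the authority.
    return self.calculate_hash() != db_object.updated_at_data_hash
```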
```python
    as_of=self.as_of_timestamp,
)
...
pool.updated_at = self.as_of_timestamp
```
Minor: pool.updated_at is unconditionally overwritten with self.as_of_timestamp, which can regress the stored timestamp when even_if_not_apparently_updated=True is combined with stale data (verified by the test). This differs from _update_edition_timestamp which only advances updated_at. The regression is actually beneficial here — it ensures subsequent imports with intermediate timestamps pass the < check — but the asymmetry with the Edition path is worth documenting as intentional.
```python
# bool must come before int because bool is a subclass of int in Python.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
```
Nit: int and float share the same precedence (1). When a list contains both integers and floats, their relative order is determined by Python's native int/float comparison, which is well-defined. This is fine for the current use case, but it means [1, 1.0] and [1.0, 1] canonicalise to different outputs: since 1 == 1.0 in Python, the stable sort preserves input order, so the two lists serialise to different JSON strings. If that distinction ever matters, a separate precedence for float would be needed.
```python
if edition.updated_at is None or edition.updated_at < updated_at:
    edition.updated_at = updated_at
...
edition.updated_at_data_hash = self.calculate_hash()
```
Minor: updated_at_data_hash is unconditionally overwritten here, even when edition.updated_at was not advanced (i.e. as_of_timestamp <= edition.updated_at). This is correct — the hash should always reflect the data most recently applied — but it means the (updated_at, updated_at_data_hash) pair does not represent a coherent snapshot from a single import when old data is force-applied. Consider a brief comment to make this intentional decoupling explicit.
```python
    self.last_checked if self.last_checked else utc_now()
)
license_pool.last_checked = None
license_pool.created_at = self.created_at
```
Minor: Setting license_pool.created_at = self.created_at means the "created at" timestamp for a new pool is the time the CirculationData Python object was constructed during this import run, not the data source's own updated_at. For Boundless, the previous code set availability_time to title.availability.update_date (via last_checked), allowing the "Added to Collection" feed sort to reflect the source's timestamp. With this change, availability_time always comes from utc_now() in LicensePool.for_foreign_id and created_at records import-time. The behaviour change is arguably more correct semantically, but it will alter the "Added to Collection" ordering for any existing data source that previously set last_checked.
```python
and apply_circulation is not None
and (
    import_even_if_unchanged
    or bibliographic.circulation.needs_apply(session, collection)
```
For ODL pools (self.licenses is not None), needs_apply always returns True, so a circulation task is queued on every feed import for every ODL title whose bibliographic data is unchanged. This is intentional (license expiry is time-dependent and can't be detected by hashing), but it means the elif branch effectively opts ODL pools out of the hash-skipping optimisation entirely. Worth a brief comment here so the next reader understands the ODL path is not a missed skip opportunity.
```python
# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
```
When as_of_timestamp == edition.updated_at but the hash has changed (i.e. a publisher re-publishes with the same timestamp but different content), should_apply_to correctly returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never updated. Every subsequent import with the same timestamp and the now-stable content will still see H_new != H_old and re-apply indefinitely. Fix: also update the hash when the timestamps are equal (edition.updated_at <= updated_at instead of <).
```diff
  results[identifier] = PublicationImportResult(
      bibliographic=bibliographic,
-     changed=has_changed,
+     changed=needs_apply,
```
For ODL pools, bibliographic.circulation.needs_apply() always returns True (licence expiry is time-dependent), so called_circulation_apply is set to True whenever bib data is stable. However, changed is still False here, making FeedImportResult.found_unchanged_publication return True after the first page. opds_import_task then stops paginating, so ODL titles on pages 2+ never have their licenses refreshed. Fix: changed=needs_apply or called_circulation_apply so that queuing a circulation task counts as "the publication needed work".
Fixes all broken tests, mypy errors, and incomplete source changes from the initial WIP commit (bde0829). This commit contains all Claude-authored work.

- LicensePool model was missing `updated_at` and `created_at` columns referenced by new circulation code, causing 49 test failures
- 31 mypy errors across json.py, bibliographic.py, circulation.py, and integration importers
- Incomplete rename of `has_changed` → `needs_apply` left stale calls in bibliographic.py, circulation.py, and three integration importers
- `data_source_last_updated` still referenced in bibliographic.py, two OPDS extractors, and the Boundless parser/conftest
- Missing alembic migration for all new DB columns
- `LinkData.content` (bytes | str field) caused UnicodeDecodeError when hashing bibliographic data containing embedded binary images
- `_canonicalize` / `_canonicalize_sort_key` lacked type annotations
- ODL reimport of expired licenses was incorrectly skipped because license expiry is time-dependent, not detectable by content hash

src/palace/manager/sqlalchemy/model/licensing.py
- Add `created_at` and `updated_at` columns to LicensePool

src/palace/manager/data_layer/base/mutable.py
- Fix `should_apply_to` condition: `<=` → `<` so equal timestamps still trigger a hash check rather than an unconditional skip

src/palace/manager/data_layer/link.py
- Add `@field_serializer("content", when_used="json")` to base64-encode binary bytes in the `bytes | str | None` union field

src/palace/manager/data_layer/bibliographic.py
- Replace `data_source_last_updated` with `updated_at` throughout
- Replace `has_changed` calls with `should_apply_to` in apply() / apply_edition_only(); `_update_edition_timestamp` now also stores `updated_at_data_hash` on the edition

src/palace/manager/data_layer/circulation.py
- Replace remaining `has_changed` / `last_checked` references
- Set `pool.updated_at` alongside `pool.updated_at_data_hash` after apply
- Early-return skip is bypassed when `self.licenses is not None` (ODL-style pools) so time-expired licenses are always reprocessed; inner availability block gets the same treatment

src/palace/manager/util/json.py
- Add `int` type annotations to all `float_precision` parameters

src/palace/manager/integration/license/{opds,boundless,overdrive}/importer.py
- `has_changed` → `needs_apply`

src/palace/manager/integration/license/{opds1,odl}/extractor.py
src/palace/manager/integration/license/boundless/parser.py
- `data_source_last_updated=` → `updated_at=`

alembic/versions/20260402_57d824b34167_add_change_tracking_hash_columns.py
- New migration: `updated_at_data_hash` on editions and licensepools, `created_at` / `updated_at` on licensepools

tests/manager/data_layer/test_bibliographic.py
- Replace `data_source_last_updated` with `updated_at`; rewrite test_apply_no_changes_needed for hash-based semantics; rename test_data_source_last_updated_updates_timestamp

tests/manager/data_layer/test_measurement.py
- Update test_taken_at: taken_at now defaults to None

tests/manager/integration/license/{opds,overdrive}/test_importer.py
tests/manager/integration/license/boundless/conftest.py
- Update mock/fixture references from has_changed / last_checked to needs_apply / updated_at
- Exclude `updated_at` from hash calculation in `fields_excluded_from_hash` so that identical content with different timestamps does not trigger spurious re-imports.
- Fix `_canonicalize_sort_key` crash when sorting sequences containing multiple `None` values (`None < None` raises TypeError in Python). Use a stable sentinel `""` as the second element of the sort key instead (see the sketch after this list).
- Move `_CANONICALIZE_TYPE_ORDER` to a module-level constant to avoid rebuilding the dict on every recursive call.
- Cache `calculate_hash()` result on the instance via `PrivateAttr` and invalidate on field mutation, avoiding a redundant SHA-256 computation per `apply()` cycle.
- Remove redundant `should_apply_to` guard inside `CirculationData.apply`; the early-return path already handles all the same conditions.
- Fix misleading log message when skipping a circulation data update.
- Add docstrings to `json_hash`, `BibliographicData.needs_apply`, and `CirculationData.needs_apply`.
- Add tests for `json_hash`, multiple-None sequence sorting, and unsupported type errors in `_canonicalize_sort_key`.
- Add a note to the migration explaining the first-import-after-deploy performance impact.
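A rough sketch of that sentinel fix; the precedence value for `None`, the `str` entry, and the exact key shape are assumptions, not the PR's code:

```python
# sorted([None, None]) raises TypeError because None does not support "<".
# Giving every None the same (precedence, sentinel) key sidesteps the comparison.

# Minimal stand-in for the module constant quoted later in the review thread;
# the str entry is an assumption.
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {bool: 0, int: 1, float: 1, str: 2}
_NONE_PRECEDENCE = -1  # illustrative; the actual value in the PR may differ

def _canonicalize_sort_key(value: object) -> tuple[int, object]:
    if value is None:
        # Stable sentinel: all Nones get the same comparable key, so the raw
        # None values are never compared against each other.
        return (_NONE_PRECEDENCE, "")
    return (_CANONICALIZE_TYPE_ORDER[type(value)], value)

# Nones sort first, stably, with no TypeError.
assert sorted([None, 1, None, "a"], key=_canonicalize_sort_key) == [None, None, 1, "a"]
```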
…ction

The `opds_import_task` was not passing `apply_circulation` to `importer.import_feed`, making the fallback path for "bibliographic unchanged, circulation changed" completely dead code. Pass `apply.circulation_apply.delay` to restore that path.

Add a `needs_apply` guard to the `elif` branch in `import_feed_from_response` so `apply_circulation` is only queued when the circulation data has actually changed, preventing redundant tasks on every re-import of unchanged content.

Fix `CirculationData.needs_apply` to always return `True` when `self.licenses` is not None (ODL-style pools). License expiry is time-dependent and cannot be detected by content hashing alone; this mirrors the existing exception already present in the `apply()` early-return guard.
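Sketched, that guard looks roughly like this; only the `self.licenses is not None` early return is confirmed by the commit, and the pool lookup is a hypothetical placeholder:

```python
def needs_apply(self, session, collection) -> bool:
    if self.licenses is not None:
        # ODL-style pools: license expiry is time-dependent and invisible to
        # content hashing, so always reprocess.
        return True
    # _find_license_pool is a hypothetical stand-in for however the target
    # LicensePool is actually resolved in the real method.
    pool = self._find_license_pool(session, collection)
    return pool is None or self.should_apply_to(pool)
```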
…erate vastly fewer apply queue tasks, we can afford to pull more books per page.
When stale data was force-applied via even_if_not_apparently_updated=True, _update_edition_timestamp would skip advancing edition.updated_at (correct) but still overwrite updated_at_data_hash with the stale content hash (incorrect). This broke the invariant that updated_at_data_hash reflects the content as of updated_at, and could cause future imports with timestamps between the stale and stored timestamps to compare against the wrong hash.

Fix by moving the hash update inside the same conditional as the timestamp update, so both fields are always kept in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pool.updated_at is set unconditionally (even when stale data is force-applied via even_if_not_apparently_updated=True), rolling both updated_at and updated_at_data_hash together. This ensures future imports at any timestamp after as_of_timestamp are not blocked by the strict less-than check in should_apply_to() and can reach the hash comparison.

This is intentionally different from the Edition path, which leaves both fields unchanged when as_of_timestamp does not advance updated_at. Add a comment to prevent a well-meaning refactor from "fixing" this asymmetry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
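In code, the commit's intent is roughly the two assignments below, rolled together; the surrounding context is assumed:

```python
# Intentionally unconditional, unlike the Edition path: rolling updated_at and
# updated_at_data_hash together ensures future imports at any timestamp after
# as_of_timestamp reach the hash comparison instead of the strict < rejection.
pool.updated_at = self.as_of_timestamp
pool.updated_at_data_hash = self.calculate_hash()
```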
self.created_at is the import-time timestamp of the CirculationData object itself, not the data source availability date. Using it for license_pool.created_at changes the Added to Collection sort order for sources like Boundless that populate updated_at from the feed update_date.

as_of_timestamp resolves to updated_at when set (the data source's own availability timestamp) and falls back to created_at (import time) when not, mirroring the previous behavior of:

    availability_time = self.last_checked if self.last_checked else utc_now()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
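That fallback, sketched as a property; the shape is inferred from the commit's description, not copied from the PR:

```python
from datetime import datetime

class BaseMutableData:  # sketch only; the real class is a Pydantic model
    updated_at: datetime | None
    created_at: datetime

    @property
    def as_of_timestamp(self) -> datetime:
        # Prefer the data source's own timestamp; fall back to import time.
        return self.updated_at if self.updated_at is not None else self.created_at
```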
ODL pools (licenses is not None) always return True from needs_apply, so they are exempt from the hash-based skip optimization. License expiry is time-dependent and cannot be detected by content hashing alone. Add a call-site comment so this is visible to readers of the importer loop without having to trace into needs_apply.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed 784c6e7 to 361c999
```python
# we set the edition's updated_at to the data source's last updated time.
# Both fields are updated together so updated_at_data_hash always reflects
# the content as of updated_at.
if edition.updated_at is None or edition.updated_at < updated_at:
```
When as_of_timestamp == edition.updated_at but the hash has changed (publisher re-published identical timestamp, different content), should_apply_to returns True and the data is applied, but this branch is skipped so updated_at_data_hash is never written. Every subsequent import with the same timestamp and the now-stable content will still see a hash mismatch and re-apply indefinitely.
Fix: update the hash when timestamps are equal — either edition.updated_at <= updated_at for the outer guard (advancing updated_at only when strictly newer), or write the hash in a separate branch for the equality case.
```diff
  results[identifier] = PublicationImportResult(
      bibliographic=bibliographic,
-     changed=has_changed,
+     changed=needs_apply,
```
For an ODL title where bibliographic data is stable (needs_apply = False) but called_circulation_apply = True (because CirculationData.needs_apply always returns True for ODL pools), changed is still False. FeedImportResult.found_unchanged_publication therefore returns True after the first page, and opds_import_task stops paginating — so ODL titles on pages 2+ never get their licenses refreshed.
Fix: changed=needs_apply or called_circulation_apply.
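The suggested change, as a sketch; the surrounding keyword arguments are assumed from the quoted hunk:

```python
results[identifier] = PublicationImportResult(
    bibliographic=bibliographic,
    # Queuing a circulation task counts as work done for this publication,
    # so unchanged-bib ODL titles no longer halt feed pagination.
    changed=needs_apply or called_circulation_apply,
)
```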
When as_of_timestamp equals edition.updated_at but content has changed, should_apply_to correctly returns True (hash differs) and the update runs, but the old < comparison in _update_edition_timestamp skipped storing the new hash. On every subsequent re-import the hash still mismatched, causing the update to re-apply indefinitely. Changing to <= ensures equal-timestamp imports with changed content still record the new hash, breaking the loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
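Putting both review fixes together, the method plausibly ends up shaped like this; a sketch, since the final code is not quoted in the thread:

```python
def _update_edition_timestamp(self, edition, updated_at) -> None:
    # <= (not <) so an equal-timestamp import with changed content still
    # records the new hash; both fields move together so updated_at_data_hash
    # always reflects the content as of updated_at.
    if edition.updated_at is None or edition.updated_at <= updated_at:
        edition.updated_at = updated_at
        edition.updated_at_data_hash = self.calculate_hash()
```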
When bibliographic data is unchanged but an ODL circulation task is queued (needs_apply always returns True for ODL pools due to license expiry), PublicationImportResult.changed was set to needs_apply (False). This caused found_unchanged_publication to return True, halting pagination after the first page.

Fix by setting changed=needs_apply or called_circulation_apply so that queuing a circulation task counts as work done for the publication.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit changed license_pool.created_at from self.created_at to self.as_of_timestamp. test_apply_wont_overwrite_if_its_data_is_stale was asserting pool.created_at == circulation.created_at, but for a CirculationData with updated_at set, as_of_timestamp resolves to updated_at (not created_at). Update all three assertions and their comments to use circulation.as_of_timestamp.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The method and its call site were made obsolete by the lastUpdateTime parameter already being passed to Overdrive's API, which performs server-side filtering so only relevant books are returned in the first place.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```python
# The data we have is strictly older than what is stored, no update needed.
return False
...
return self.calculate_hash() != db_object.updated_at_data_hash
```
Minor: When as_of_timestamp > db_object.updated_at but the hash matches (content unchanged, timestamp advanced), should_apply_to returns False and db_object.updated_at is never advanced. So updated_at semantically means "timestamp of last content change", not "most recent timestamp seen". This is consistent with the Edition path (_update_edition_timestamp is never called when we skip) and the non-ODL pool path (early return, no fields written), but differs from the ODL pool path (unconditionally advances pool.updated_at). The asymmetry is harmless for correctness — the hash is the authority — but may be surprising if updated_at is ever queried as an activity metric.
```python
_CANONICALIZE_TYPE_ORDER: dict[type, int] = {
    bool: 0,
    int: 1,
    float: 1,
```
Nit: int and float share precedence 1. When a sequence contains 1 (int) and 1.0 (float), 1 == 1.0 in Python so the stable sort preserves input order, and json.dumps([1, 1.0]) vs json.dumps([1.0, 1]) differ as strings. In practice this is benign — model_dump(mode="json") produces list elements of uniform types — but it means json_canonical is not fully independent of the int/float distinction within the same list position.
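A quick demonstration of the behaviour described, in plain Python with no project code:

```python
import json

# 1 == 1.0, so a stable sort treats them as equal and preserves input order...
print(sorted([1, 1.0]))      # [1, 1.0]
print(sorted([1.0, 1]))      # [1.0, 1]

# ...which means the serialized JSON strings differ by position.
print(json.dumps([1, 1.0]))  # [1, 1.0]
print(json.dumps([1.0, 1]))  # [1.0, 1]
```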
dbernstein left a comment:
I think this one is good to go. @jonathangreen I thoroughly tested our combined changes. It's working as expected. Claude made a bunch of suggestions which I thought were useful, so those went in. I also was able to remove some effectively dead and confusing code (i.e. the removal of the early-exit routine; last commit). Since we both worked on it, you may want to take a look at it before I merge it?
Sounds good @dbernstein. I'll take a run through it today before you merge.
jonathangreen left a comment:
Looks good @dbernstein. I'm okay with this getting merged. I am concerned about the queue management we will have to do when this goes in though, as a full re-import of everything will probably fill all of our queues.
```python
_primary_identifier: Identifier | None = PrivateAttr(default=None)
# Lazily computed and cached for the lifetime of this object. Cleared by
# __setattr__ whenever a public field is mutated so the hash stays consistent.
_hash_cache: str | None = PrivateAttr(default=None)
```
This cache is a nice addition.
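For readers outside the codebase, the pattern reduces to something like this standalone sketch, assuming Pydantic v2; the model, fields, and hashing details here are illustrative, not the project's exact code:

```python
import hashlib
import json

from pydantic import BaseModel, PrivateAttr

class Example(BaseModel):
    title: str | None = None

    # Lazily computed; cleared whenever a public field is mutated.
    _hash_cache: str | None = PrivateAttr(default=None)

    def __setattr__(self, name: str, value: object) -> None:
        if not name.startswith("_"):
            # Any public-field mutation invalidates the cached hash.
            self._hash_cache = None
        super().__setattr__(name, value)

    def calculate_hash(self) -> str:
        if self._hash_cache is None:
            canonical = json.dumps(self.model_dump(mode="json"), sort_keys=True)
            self._hash_cache = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return self._hash_cache

m = Example(title="A")
h1 = m.calculate_hash()   # computed once, then cached
m.title = "B"             # mutation clears the cache
assert m.calculate_hash() != h1
```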
```python
# Skip circulation data update if the content hasn't changed, UNLESS we
# have individual license objects that may have expired since the last
# import (ODL-style pools). License expiry is time-dependent and cannot
# be detected by content hashing alone.
```
I'd really like to see the expiration be handled in its own process instead of tangled up in change tracking. I think this carve-out makes it kind of hard to reason about ODL-style pools, and means that they cannot participate in change tracking, which I think would be beneficial for them.
I'm okay with this going in, but I'd like to make a follow-up to have an async task to process license expiry, so it's not tied together here.
That seems reasonable. Here's the ticket: https://ebce-lyrasis.atlassian.net/browse/PP-4177
Description
Replaces the timestamp-only change-detection logic in the data import pipeline with a content-hash-based system. Previously, `BibliographicData` and `CirculationData` used only the data source's "last updated" timestamp to decide whether to re-apply incoming data to an `Edition` or `LicensePool`. This caused two problems: a source re-publishing identical content with a newer timestamp triggered a redundant re-import, and changed content arriving with an unchanged timestamp was silently skipped.

This branch reverts the original timestamp-throttling PR (#3198) and replaces it with a proper content-hash approach. A SHA-256 hash of the canonical, serialized form of the incoming data is stored on the database record after each import. Subsequent imports compare both the timestamp and the hash before deciding whether to apply an update.
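As a rough picture of what "hash of the canonical, serialized form" means in practice, here is a simplified sketch; the PR's `json_canonical()` in `util/json.py` also normalises list ordering and float precision, which this toy version does not:

```python
import hashlib
import json

def json_canonical(value: object) -> str:
    # Stable serialization: sorted keys, no insignificant whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))

def json_hash(value: object) -> str:
    return hashlib.sha256(json_canonical(value).encode("utf-8")).hexdigest()

# Key order no longer affects the fingerprint.
assert json_hash({"a": 1, "b": 2}) == json_hash({"b": 2, "a": 1})
```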
Key changes:
- `json_hash()` / `json_canonical()` utilities (`util/json.py`) produce a stable, order-independent SHA-256 fingerprint of any JSON-serializable structure.
- `BaseMutableData` gains `updated_at`, `created_at`, `as_of_timestamp`, `calculate_hash()`, and `should_apply_to()`. The `should_apply_to()` method is now the single decision point for both bibliographic and circulation data.
- `BibliographicData.has_changed()` and `CirculationData.has_changed()` are removed and replaced by the shared `should_apply_to()` logic.
- `Edition` and `LicensePool` each gain an `updated_at_data_hash` column. `LicensePool` also gains `created_at` and `updated_at` columns to track when its `CirculationData` was first and most recently imported.
- `LicensePool.created_at` is set to `as_of_timestamp` (the data source's own availability date) rather than the import-time timestamp, preserving the "Added to Collection" feed sort order.
- `f98e4049c87d` adds all four new columns.

Bug fixes (addressed during review):
- `_update_edition_timestamp` now updates `updated_at` and `updated_at_data_hash` together (or not at all), preserving the invariant that the hash always reflects the content as of `updated_at`. Previously, force-applying stale data via `even_if_not_apparently_updated=True` overwrote the hash without advancing the timestamp, leaving the two fields inconsistent.
- `_update_edition_timestamp` now uses `<=` instead of `<` when comparing timestamps. The previous strict less-than caused an infinite re-apply loop: when incoming data had the same timestamp as the stored record but different content, the update ran on every import but the new hash was never persisted.
- `PublicationImportResult.changed` is now `needs_apply or called_circulation_apply`. Previously, when bibliographic data was unchanged but an ODL circulation task was queued, `changed` was `False`, causing `found_unchanged_publication` to return `True` and halt feed pagination after the first page.
- `opds_import_task` now passes `apply_circulation` to `importer.import_feed`, restoring the fallback path for "bibliographic unchanged, circulation changed". Previously this path was dead code, causing circulation-only updates to be silently skipped.

Motivation and Context
https://ebce-lyrasis.atlassian.net/browse/PP-3997
The original `has_changed()` implementation only compared timestamps, which is insufficient: a data source can re-publish identical content with a newer timestamp, or publish changed content with the same timestamp. Content hashing is the correct primitive for detecting genuine data changes and avoiding redundant imports.

How Has This Been Tested?
- Tests for `BibliographicData` and `CirculationData` cover the new `should_apply_to()` logic, including the null-hash bootstrap case, the timestamp-is-older short-circuit, the hash-match skip, the equal-timestamp/changed-content case, and the stale force-apply invariant.
- Tests for `json_canonical()` and `json_hash()` verify ordering stability across dict keys, list items, and float precision.
- `found_unchanged_publication` stays `False` when only a circulation task is queued.
- Existing tests updated (`updated_at` in place of `data_source_last_updated`).
- Full test suite run via `tox -e py312-docker -- --no-cov`.

Manually tested OverDrive locally by me (@dbernstein) with a non-advantage collection. It seems to be working as advertised. Before merging I will confirm that it is working with a parent + two child advantage accounts.
Checklist