Discrepancy between total works / pages as reported by database and solr #567
As part of testing the new option for the page indexing script (#565), I have a list of db/solr page count mismatches from QA. Hopefully it's a helpful starting point for investigation. It looks like these are all Hathi ids, so there must be some discrepancy between how we count pages for the db and how we actually get pages for indexing.
I did a little spot-checking based on this list. For every record I tried, the page count value stored in the database differs from the number I get when I recalculate the page count from the hathitrust data. I'm wondering if this is another place where we're not accounting for updates to the hathitrust data, so things are getting out of sync. I'm going to create a quick utility script we can run to update the page counts in the database, but we should keep this in mind along with the other related items (rsync, excerpt page ranges) - maybe we can consolidate the updates somehow.
Currently the page count method saves the record if the count has changed - but only for non-excerpted works. This was surprising behavior to me when I was incorporating it into the script. I think we should refactor so it does not save, and adjust the calling code to save changes where needed. It also seems that there may be some cases where saving a record clears out a page count that was previously set - see #591 (comment) and #596
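A minimal sketch of the refactor proposed here, under stated assumptions: `Work`, `count_pages`, `update_page_count`, and the `fetch_page_count` callable are hypothetical stand-ins, not the project's actual model or methods. The idea is only that the count method computes and returns a value, while the calling code decides whether to save.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Work:
    """Hypothetical stand-in for a digitized work record (not the real model)."""

    source_id: str
    page_count: Optional[int] = None
    saved: int = 0  # number of save() calls, so the example shows who saves

    def count_pages(self, fetch_page_count: Callable[[str], int]) -> int:
        """Recalculate the page count from source data without saving."""
        return fetch_page_count(self.source_id)

    def save(self) -> None:
        """Stand-in for persisting the record to the database."""
        self.saved += 1


def update_page_count(work: Work, fetch_page_count: Callable[[str], int]) -> bool:
    """Calling code: recalculate, then save only if the value changed."""
    new_count = work.count_pages(fetch_page_count)
    if new_count == work.page_count:
        return False  # nothing to update, so nothing is saved
    work.page_count = new_count
    work.save()
    return True
```

With this split, a script can run the update twice and the second pass makes no changes, because the save happens only in the caller and only when the recalculated value differs.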
Ran the new script in staging; it output the following summary:
When I run it a second time, it reports that it didn't have to make any updates. @mnaydan I set the script up to create log entries documenting the page count change; you can see how they look in the log entry section of the admin site https://test-prosody.cdh.princeton.edu/admin/admin/logentry/
Now that page counts have been updated, I ran the page index script in the mode where it just updates records where page count doesn't match between solr and the database. The first time it reported 1 work and 159 pages not indexed in solr; the second time it reported 1 work and 52 pages not indexed. When I run it in verbose mode, these are the two that still have discrepancies:
I'm guessing that one with missing data is one of the excerpt cases we know about already...
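For illustration, the kind of comparison the mismatch mode performs could be sketched as below. This is not the indexing script's actual code: `page_count_mismatches` is a hypothetical helper, and it assumes the per-work page counts from the database and from solr have already been gathered into plain dictionaries keyed by source id.

```python
from typing import Dict, List, Tuple


def page_count_mismatches(
    db_counts: Dict[str, int], solr_counts: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (source_id, db_count, solr_count) for every work whose
    database page count differs from the number of pages in solr.

    A work with no pages in solr at all is treated as having 0 pages.
    """
    mismatches = []
    for source_id, db_count in sorted(db_counts.items()):
        solr_count = solr_counts.get(source_id, 0)
        if solr_count != db_count:
            mismatches.append((source_id, db_count, solr_count))
    return mismatches
```

Reindexing only the works returned by a check like this keeps the run short, and rerunning it afterwards shows which works still disagree.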
@rlskoeser I think
I tested adjusting the excerpt range for existing excerpt coo.31924051399685 and it successfully changed the page count from 30 to 31 upon save. I also tested changing newly added hvd.32044106208028 from a full work to an excerpt and it successfully recalculated the page count upon save. Log entries appear in the link posted above as well as on individual history pages and provide the information I would expect. While I could imagine a scenario where an excerpt would be converted into a full work, that hasn't happened yet, and I think it would be easier to just suppress+delete and re-add if that was what was needed, rather than building functionality to support it.
Notes pulled from our testing notes:
Checked totals via django console.
Database reports: 6751 public works, 1938565 pages (based on digitized work page count)
Solr reports: 6751 works, 1950698 pages
Export is based on Solr, so we expect it to match the solr count
% wc ppa_pages.jsonl
1950698 597452820 4145091420 ppa_pages.jsonl
Discrepancy between database page count and solr page count is a concern but separate from testing the corpus export script, which is pulling page content from solr as expected
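To make the size of the gap concrete, here is a small check using only the totals quoted in these notes. The variable names are illustrative, and the `wc` output is pasted in as a string rather than read from the export file.

```python
# Totals copied from the testing notes above
db_pages = 1938565    # sum of digitized work page counts in the database
solr_pages = 1950698  # pages indexed in solr

# Output of `wc ppa_pages.jsonl`: lines, words, bytes, filename
wc_output = "1950698 597452820 4145091420 ppa_pages.jsonl"
export_lines = int(wc_output.split()[0])  # one JSON line per exported page

# The export is built from solr, so its line count should equal the solr total
export_matches_solr = export_lines == solr_pages

# Pages indexed in solr beyond what the database page counts account for
difference = solr_pages - db_pages

print(f"export matches solr: {export_matches_solr}")
print(f"solr has {difference} more pages than the database reports")
```

The export line count agrees with solr, and solr holds 12133 more pages than the database page counts add up to.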
testing notes
(Currently, based on notes in the code, we don't expect the page count to update if you remove a page range; we could revisit that in the future if it's a possible scenario.)