Discrepancy between total works / pages as reported by database and solr #567

jerielizabeth · 2023-12-20T20:36:43Z

Notes pulled from our testing notes:

Checked totals via django console.

Database reports: 6751 public works, 1938565 pages (based on digitized work page count)
Solr reports: 6751 works, 1950698 pages

>>> from ppa.archive.models import DigitizedWork, Page
>>> DigitizedWork.objects.filter(status=DigitizedWork.PUBLIC).count()
6751
>>> Page.total_to_index()
1938565
>>> from parasolr.django import SolrQuerySet
>>> SolrQuerySet().all().facet("item_type").get_facets().facet_fields.item_type
OrderedDict([('page', 1950698), ('work', 6751)])

The export is based on Solr, so we expect its line count to match the Solr page count:
% wc ppa_pages.jsonl
1950698 597452820 4145091420 ppa_pages.jsonl

The discrepancy between the database page count and the Solr page count is a concern, but it is separate from testing the corpus export script, which is pulling page content from Solr as expected.
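For reference, the size of the gap follows directly from the two totals reported above. This is only an arithmetic check; the variable names are illustrative and not taken from the codebase.

```python
# Totals copied from the console session above.
db_pages = 1938565    # Page.total_to_index()
solr_pages = 1950698  # Solr facet count for item_type "page"

# Solr holds more pages than the database count predicts.
difference = solr_pages - db_pages
print(difference)  # 12133
```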

testing notes

check the admin log entries for works with updated page counts; are they present? do they provide the right information?
try editing an excerpt and changing the digital page range; confirm the page count is updated when you save

(Based on notes in the code, we currently don't expect the page count to update if you remove a page range; we could revisit that in the future if it's a possible scenario.)

rlskoeser · 2024-02-02T21:41:39Z

As part of testing the new option for the page indexing script (#565), I have a list of db/solr page count mismatches from QA. Hopefully it's a helpful starting point for investigation. It looks like these are all Hathi ids, so there must be some discrepancy between how we count pages for the db and how we actually get pages for indexing.
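A minimal sketch of how a mismatch list like this could be produced, assuming the per-work page counts from the database and from Solr have already been collected into two plain dicts keyed by source id. The function name, the dict inputs, and the sample ids are all hypothetical; this is not the logic used by the indexing script.

```python
def find_page_count_mismatches(db_counts, solr_counts):
    """Return {source_id: (db_count, solr_count)} for ids whose counts differ.

    Both arguments are plain dicts mapping a source id to a page count.
    How they are populated (ORM query, Solr facet, etc.) is left out here.
    An id missing from one side is treated as having a count of 0.
    """
    mismatches = {}
    for source_id in sorted(set(db_counts) | set(solr_counts)):
        db_count = db_counts.get(source_id, 0)
        solr_count = solr_counts.get(source_id, 0)
        if db_count != solr_count:
            mismatches[source_id] = (db_count, solr_count)
    return mismatches


# Made-up ids and counts, for illustration only.
db = {"example.001": 120, "example.002": 54}
solr = {"example.001": 120}
print(find_page_count_mismatches(db, solr))
```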

ppa-pagecount-mismatches.txt

rlskoeser · 2024-02-29T20:08:03Z

I did a little spot-checking based on this list, and what I found for every record I tried is a discrepancy between the page count value stored in the database and the number I get when I recalculate the page count based on the hathitrust data. I'm wondering if this is another place where we're not accounting for updates to the hathitrust data and so things are getting out of sync.

I'm going to create a quick utility script we can run to update the page counts in the database, but we should keep this in mind along with the other related items (rsync, excerpt page ranges) - maybe we can consolidate the updates somehow.

rlskoeser · 2024-03-01T14:02:49Z

Currently the page count method saves the record if the count has changed - but only for non-excerpted works. This was surprising behavior to me when I was incorporating it into the script. I think we should refactor so it does not save, and adjust the calling code to save changes where needed.
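A rough sketch of the refactor described here, using a plain Python stand-in rather than the real model. The class, method, and function names are hypothetical; the point is only that the counting method returns a value without side effects and the caller decides whether to save.

```python
class Work:
    """Stand-in for a database record; not the real DigitizedWork model."""

    def __init__(self, source_id, page_count=None):
        self.source_id = source_id
        self.page_count = page_count
        self.save_calls = 0

    def save(self):
        # A real model would write to the database here.
        self.save_calls += 1

    def count_pages(self, page_files):
        """Compute the page count without saving anything."""
        return len(page_files)


def update_page_count(work, page_files):
    """Calling code saves, and only when the count has actually changed."""
    new_count = work.count_pages(page_files)
    if new_count == work.page_count:
        return False
    work.page_count = new_count
    work.save()
    return True


work = Work("example.001", page_count=2)
first = update_page_count(work, ["p1", "p2", "p3"])   # count changes, saved
second = update_page_count(work, ["p1", "p2", "p3"])  # no change, not saved
print(first, second, work.page_count, work.save_calls)
```

With this split, excerpted and non-excerpted works go through the same counting path, and any difference in when to save lives in the calling code.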

It also seems that there may be some cases where saving a record clears out a page count that was previously set - see #591 (comment) and #596

rlskoeser · 2024-03-05T20:39:41Z

Ran the new script in staging; it output the following summary:

Volumes with updated page count: 1,347
	Page count unchanged: 3,408
	Missing pairtree data: 0

When I run it a second time, it reports that it didn't have to make any updates.
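To illustrate why a second run reports no updates, here is a small sketch of an update loop that tallies outcomes in the same three categories as the summary above. It works on plain dicts with made-up data; the function name and inputs are hypothetical and this is not the manage command itself.

```python
from collections import Counter


def update_counts(stored, recalculated):
    """Update stored page counts in place and return a tally of outcomes.

    stored:       dict of source id -> page count currently saved
    recalculated: dict of source id -> freshly computed count,
                  or None when the source data could not be found
    """
    stats = Counter(updated=0, unchanged=0, missing=0)
    for source_id, old_count in stored.items():
        new_count = recalculated.get(source_id)
        if new_count is None:
            stats["missing"] += 1
        elif new_count != old_count:
            stored[source_id] = new_count
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
    return stats


# Made-up data, for illustration only.
stored = {"example.001": 100, "example.002": 54}
fresh = {"example.001": 102, "example.002": 54}
first_run = update_counts(stored, fresh)
second_run = update_counts(stored, fresh)
print(first_run["updated"], second_run["updated"])
```

Because the first pass writes the recalculated values back, a second pass over the same data finds nothing left to change.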

@mnaydan I set the script up to create log entries documenting the page count change, you can see how they look in the log entry section of the admin site https://test-prosody.cdh.princeton.edu/admin/admin/logentry/

rlskoeser · 2024-03-05T20:47:18Z

Now that page counts have been updated, I ran the page index script in the mode where it just updates records where page count doesn't match between solr and the database. The first time it reported 1 work and 159 pages not indexed in solr; the second time it reported 1 work and 52 pages not indexed.

When I run it in verbose mode, these are the two that still have discrepancies:

coo1.ark:/13960/t3st84m4q (201-254) : missing 54 (db: 54, solr: 0)
Indexing pages for 2 works with page count mismatches
ERROR:ppa.archive.models:Pairtree data for coo1.ark:/13960/t3st84m4q not found but status is Public

I'm guessing that one with missing data is one of the excerpt cases we know about already...

mnaydan · 2024-03-06T16:45:25Z

@rlskoeser I think coo1.ark:/13960/t3st84m4q is actually a newly identified case, but the same problem we caught on #591 -- me trying to add two excerpts from the same source, and then deleting the second one once I realized it didn't index, not realizing it would delete the pairtree data for the other excerpt as well.

mnaydan · 2024-03-06T18:57:39Z

I tested adjusting the excerpt range for existing excerpt coo.31924051399685 and it successfully changed the page count from 30 to 31 upon save. I also tested changing newly added hvd.32044106208028 from a full work to an excerpt and it successfully recalculated the page count upon save. Log entries appear in the link posted above as well as on individual history pages and provide the information I would expect.
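As a rough illustration of how a page count can be derived from a digital page range, the sketch below counts pages in an inclusive range string. It assumes a comma-separated format of single pages and start-end spans; the function name and that format are assumptions, not the project's actual parsing code. The range 201-254 from the verbose output earlier in this thread gives 54, which matches the db count shown there.

```python
def count_pages_in_range(page_range):
    """Count pages in a range string such as "201-254" or "1-3, 7".

    Spans are inclusive, so "201-254" covers both page 201 and page 254.
    """
    total = 0
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
            total += end - start + 1
        else:
            int(part)  # raises ValueError if this is not a page number
            total += 1
    return total


print(count_pages_in_range("201-254"))  # 54
```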

While I could imagine a scenario where an excerpt would be converted into a full work, that hasn't happened yet, and I think it would be easier to just suppress+delete and re-add if that was what was needed, rather than building functionality to support it.

jerielizabeth added the bug label Dec 20, 2023

jerielizabeth assigned rlskoeser Dec 20, 2023

rlskoeser mentioned this issue Feb 2, 2024

add an option to make page indexing more efficient #565

Closed

rlskoeser changed the title ~~Discrepancy between total works / pages as reported by database and solr:~~ Discrepancy between total works / pages as reported by database and solr Feb 29, 2024

This was referenced Feb 29, 2024

Manage command to update hathitrust page counts #594

Merged

Newly added excerpts not indexing when there are multiple excerpts from a single source #591

Closed

rlskoeser added the awaiting testing label Mar 5, 2024

mnaydan closed this as completed Mar 6, 2024

mnaydan removed the awaiting testing label Mar 6, 2024
