Discrepancy between total works / pages as reported by database and solr #567

Closed
2 tasks done
jerielizabeth opened this issue Dec 20, 2023 · 7 comments


jerielizabeth commented Dec 20, 2023

Notes pulled from our testing notes:

Checked totals via django console.

Database reports: 6751 public works, 1938565 pages (based on digitized work page count)
Solr reports: 6751 works, 1950698 pages

>>> from ppa.archive.models import DigitizedWork, Page
>>> DigitizedWork.objects.filter(status=DigitizedWork.PUBLIC).count()
6751
>>> Page.total_to_index()
1938565
>>> from parasolr.django import SolrQuerySet
>>> SolrQuerySet().all().facet("item_type").get_facets().facet_fields.item_type
OrderedDict([('page', 1950698), ('work', 6751)])

The export is based on Solr, so we expect its line count to match the Solr page count:
% wc ppa_pages.jsonl
1950698 597452820 4145091420 ppa_pages.jsonl
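To put a number on the gap, here is a minimal sketch that compares the two page totals quoted above and counts records in a JSON Lines export. The function names are made up for illustration and are not part of the project code; the only figures used are the ones reported in this comment.

```python
from pathlib import Path


def count_jsonl_records(path):
    """Count records in a JSON Lines file (one record per line), similar to `wc -l`."""
    with Path(path).open("rb") as handle:
        return sum(1 for _ in handle)


def page_count_gap(db_pages, solr_pages):
    """Return (absolute difference, difference as a percentage of the db total)."""
    diff = solr_pages - db_pages
    return diff, 100 * diff / db_pages


# Totals reported above: database vs. Solr
diff, pct = page_count_gap(db_pages=1_938_565, solr_pages=1_950_698)
print(f"Solr has {diff:,} more pages than the database ({pct:.2f}%)")
```

Running `count_jsonl_records("ppa_pages.jsonl")` against the export should give the same figure as the first column of the `wc` output.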

The discrepancy between the database page count and the Solr page count is a concern, but it is separate from testing the corpus export script, which is pulling page content from Solr as expected.

testing notes

  • check the admin log entries for works with updated page counts; are they present? do they provide the right information?
  • try editing an excerpt and changing the digital page range; confirm the page count is updated when you save

(Based on notes in the code, we currently don't expect the page count to update if you remove a page range; we could revisit that in the future if it's a possible scenario.)


rlskoeser commented Feb 2, 2024

As part of testing the new option for the page indexing script (#565), I have a list of db/solr page count mismatches from QA. Hopefully it's a helpful starting point for investigation. It looks like these are all HathiTrust ids, so there must be some discrepancy between how we count pages for the db and how we actually get pages for indexing.

ppa-pagecount-mismatches.txt
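For anyone wanting to reproduce a list like this, a rough sketch of the comparison follows. It assumes you already have two mappings of source id to page count, one from the database and one from Solr; how those are gathered is not shown, and the ids and numbers below are hypothetical sample data, not values from the attached file.

```python
def find_page_count_mismatches(db_counts, solr_counts):
    """Yield (source_id, db_count, solr_count) where the two counts differ.

    Works present in the database but absent from Solr are treated as
    having zero indexed pages.
    """
    for source_id, db_count in sorted(db_counts.items()):
        solr_count = solr_counts.get(source_id, 0)
        if db_count != solr_count:
            yield source_id, db_count, solr_count


# Hypothetical sample data, for illustration only
db_counts = {"example.001": 120, "example.002": 54, "example.003": 300}
solr_counts = {"example.001": 120, "example.002": 0, "example.003": 312}

mismatches = list(find_page_count_mismatches(db_counts, solr_counts))
for source_id, db_count, solr_count in mismatches:
    print(f"{source_id}: db={db_count}, solr={solr_count}")
```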

@rlskoeser rlskoeser changed the title Discrepancy between total works / pages as reported by database and solr: Discrepancy between total works / pages as reported by database and solr Feb 29, 2024
@rlskoeser

I did a little spot-checking based on this list, and what I found for every record I tried is a discrepancy between the page count value stored in the database and the number I get when I recalculate the page count based on the hathitrust data. I'm wondering if this is another place where we're not accounting for updates to the hathitrust data and so things are getting out of sync.

I'm going to create a quick utility script we can run to update the page counts in the database, but we should keep this in mind along with the other related items (rsync, excerpt page ranges); maybe we can consolidate the updates somehow.

@rlskoeser

Currently the page count method saves the record if the count has changed - but only for non-excerpted works. This was surprising behavior to me when I was incorporating it into the script. I think we should refactor so it does not save, and adjust the calling code to save changes where needed.
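To make the proposed refactor concrete, here is a minimal sketch of the separation: one function that only computes the count, and calling code that decides whether to save. The `Work` class is a hypothetical stand-in for a database record (it just counts `save()` calls), and none of these names are taken from the actual codebase.

```python
from dataclasses import dataclass


@dataclass
class Work:
    """Hypothetical stand-in for a database record; not the real model."""

    source_id: str
    page_count: int = 0
    saves: int = 0  # counts save() calls so side effects are visible

    def save(self):
        self.saves += 1


def count_pages(available_pages):
    """Compute the page count only; never writes anything."""
    return len(available_pages)


def update_page_count(work, available_pages):
    """Calling code owns the save: persist only when the value changed."""
    new_count = count_pages(available_pages)
    if new_count == work.page_count:
        return False
    work.page_count = new_count
    work.save()
    return True


work = Work("example.001", page_count=3)
pages = ["p1", "p2", "p3", "p4"]
changed_first = update_page_count(work, pages)
changed_second = update_page_count(work, pages)
```

With this split, a script can call `count_pages` to report differences without touching the record, and the same rule applies to excerpts and full works alike.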

It also seems that there may be some cases where saving a record clears out a page count that was previously set - see #591 (comment) and #596

@rlskoeser

Ran the new script in staging; it output the following summary:

Volumes with updated page count: 1,347
	Page count unchanged: 3,408
	Missing pairtree data: 0

When I run it a second time, it reports that it didn't have to make any updates.
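The run-twice check above is an idempotency test. As a sketch of the general shape (not the actual script), the loop below tallies updated, unchanged, and missing-data cases, and a second pass over the same data reports no updates. The function name, the use of `None` to signal missing source data, and the sample values are all assumptions for illustration.

```python
from collections import Counter


def update_page_counts(stored_counts, recalculated_counts):
    """Update stored counts in place and return a tally of what happened.

    A value of None (or a missing key) in recalculated_counts stands in
    for a volume whose source data could not be found.
    """
    stats = Counter(updated=0, unchanged=0, missing_data=0)
    for source_id, stored in stored_counts.items():
        new_count = recalculated_counts.get(source_id)
        if new_count is None:
            stats["missing_data"] += 1
        elif new_count != stored:
            # Reassigning an existing key is safe while iterating
            stored_counts[source_id] = new_count
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
    return stats


# Hypothetical sample data
stored = {"a": 10, "b": 20, "c": 30}
recalculated = {"a": 12, "b": 20, "c": None}

first_run = update_page_counts(stored, recalculated)
second_run = update_page_counts(stored, recalculated)
```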

@mnaydan I set the script up to create log entries documenting the page count change, you can see how they look in the log entry section of the admin site https://test-prosody.cdh.princeton.edu/admin/admin/logentry/

@rlskoeser

Now that page counts have been updated, I ran the page index script in the mode where it just updates records where page count doesn't match between solr and the database. The first time it reported 1 work and 159 pages not indexed in solr; the second time it reported 1 work and 52 pages not indexed.

When I run it in verbose mode, these are the two that still have discrepancies:

coo1.ark:/13960/t3st84m4q (201-254) : missing 54 (db: 54, solr: 0)
Indexing pages for 2 works with page count mismatches
ERROR:ppa.archive.models:Pairtree data for coo1.ark:/13960/t3st84m4q not found but status is Public

I'm guessing that the one with missing data is one of the excerpt cases we know about already...


mnaydan commented Mar 6, 2024

@rlskoeser I think coo1.ark:/13960/t3st84m4q is actually a newly identified case, but the same problem we caught on #591 -- me trying to add two excerpts from the same source, and then deleting the second one once I realized it didn't index, not realizing it would delete the pairtree data for the other excerpt as well.


mnaydan commented Mar 6, 2024

I tested adjusting the excerpt range for existing excerpt coo.31924051399685 and it successfully changed the page count from 30 to 31 upon save. I also tested changing newly added hvd.32044106208028 from a full work to an excerpt and it successfully recalculated the page count upon save. Log entries appear in the link posted above as well as on individual history pages and provide the information I would expect.
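For reference, the arithmetic behind an excerpt page count can be sketched as below. The range quoted earlier in the thread (201-254, with a database count of 54) shows that ranges are inclusive; the comma-separated multi-range syntax and the function name are assumptions, and the real code may well parse ranges differently.

```python
def pages_in_range(page_range):
    """Count pages in a range string such as "201-254" or "10-12, 20".

    Ranges are inclusive; pages covered by overlapping ranges are only
    counted once.
    """
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if last < first:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return len(pages)


excerpt_count = pages_in_range("201-254")
```

Extending a range by one page on either end raises the count by one, which matches the kind of 30 to 31 change described above.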

While I could imagine a scenario where an excerpt would be converted into a full work, that hasn't happened yet, and I think it would be easier to just suppress+delete and re-add if that was what was needed, rather than building functionality to support it.
