add an option to make page indexing more efficient #565

Closed
rlskoeser opened this issue Dec 19, 2023 · 3 comments

@rlskoeser
Contributor

we often need to reindex all page content in ppa, but when we hit an error that causes the script to crash we have to start over and reindex everything

would be helpful for developer sanity to have an opt-in / non-default way to index only the content that needs it - maybe we can determine by comparing page counts in the database and solr? or maybe we can use modification time to figure out which works need pages indexed? or some combination of the two?

@quadrismegistus
Contributor

quadrismegistus commented Dec 19, 2023

Would this general pattern work?

  1. Get a map of {work_id} to its [{page_id}, {page_id}, ...] from django db. That gives us # pages per work too.
  2. Query for # of pages per work_id in solr.
  3. For those with mismatches, reindex. Or, if --force or something similar is set, reindex all.
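
The three steps above could be sketched roughly as follows. This is only an illustration, not the ppa implementation: the function name, the plain dicts standing in for the database and Solr results, and the `force` parameter (mirroring the "--force or something" idea) are all assumptions.

```python
def works_needing_reindex(db_pages, solr_counts, force=False):
    """Return the work ids whose pages should be reindexed.

    db_pages: dict of work id -> list of page ids (hypothetical shape of the db map)
    solr_counts: dict of work id -> number of pages currently indexed in Solr
    force: when True, skip the comparison and return every work
    """
    if force:
        return sorted(db_pages)
    return sorted(
        work_id
        for work_id, page_ids in db_pages.items()
        # a work that is absent from Solr counts as having zero indexed pages
        if len(page_ids) != solr_counts.get(work_id, 0)
    )


# Made-up example data: w2 and w3 have fewer pages in Solr than in the db.
db_pages = {"w1": ["p1", "p2"], "w2": ["p1"], "w3": ["p1", "p2", "p3"]}
solr_counts = {"w1": 2, "w2": 0}
print(works_needing_reindex(db_pages, solr_counts))
```

Comparing counts is cheap, but it cannot detect a work whose Solr pages are the right number yet the wrong ids.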

If it's fast from solr to get a list of all page ids per work id, then you could do an even more careful comparison than # of pages by comparing sets of page ids to make sure they perfectly overlap.
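
A stricter check along those lines might look like the sketch below. It assumes the page ids per work have already been fetched from both sides into plain dicts; the function name and the "missing"/"extra" keys are hypothetical.

```python
def page_id_mismatches(db_pages, solr_pages):
    """Compare sets of page ids per work and report any differences.

    db_pages / solr_pages: dict of work id -> iterable of page ids
    Returns a dict of work id -> {"missing": ..., "extra": ...} for works
    where the two sets do not overlap perfectly.
    """
    mismatches = {}
    for work_id, page_ids in db_pages.items():
        expected = set(page_ids)
        indexed = set(solr_pages.get(work_id, ()))
        if expected != indexed:
            mismatches[work_id] = {
                "missing": expected - indexed,  # in the db but not in Solr
                "extra": indexed - expected,    # in Solr but not in the db
            }
    return mismatches


# Made-up example: same page count for w1, but one id differs.
result = page_id_mismatches(
    {"w1": ["p1", "p2"], "w2": ["p1"]},
    {"w1": ["p1", "p9"], "w2": ["p1"]},
)
print(result)
```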

@rlskoeser
Contributor Author

We can easily get a count of total pages / total works; should be able to do a facet query in solr to get the number of page items per source id (group id?). Could filter by modification time as well, so we only count recently indexed pages and reindex works that don't have enough recent pages.
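
As a rough sketch of that facet query, the parameters and response handling might look something like this. It uses standard Solr field-faceting parameters, but the field names `item_type`, `group_id`, and `last_modified` are assumptions (the thread itself is unsure whether the grouping field is source id or group id), and the helper names are hypothetical. Sending the request to Solr is left out.

```python
def page_count_facet_params(since=None):
    """Build Solr params for a count of page documents per work.

    since: optional Solr date expression (e.g. "NOW-1DAY") used to count
    only pages indexed at or after that time.
    """
    params = {
        "q": "*:*",
        "fq": ["item_type:page"],   # assumed field/value marking page documents
        "rows": 0,                  # only the facet counts are needed
        "facet": "true",
        "facet.field": "group_id",  # assumed field holding the work identifier
        "facet.limit": -1,          # return every facet value, not just the top ones
        "facet.mincount": 1,
    }
    if since:
        # assumed modification-time field; range is open-ended on the upper side
        params["fq"].append(f"last_modified:[{since} TO *]")
    return params


def facet_counts_to_dict(response, field="group_id"):
    """Convert Solr's default flat [value, count, value, count, ...] facet list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Made-up response in Solr's default flat facet format.
sample = {"facet_counts": {"facet_fields": {"group_id": ["w1", 2, "w2", 5]}}}
print(facet_counts_to_dict(sample))
```

The resulting dict of counts per work can then be compared against the page counts from the database.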

@rlskoeser rlskoeser self-assigned this Dec 20, 2023
@mnaydan mnaydan added the chore label Jan 16, 2024
rlskoeser added a commit that referenced this issue Feb 1, 2024
…569)

* Add option to expedite page indexing based on page count mismatches

ref #565

* Adjust verbosity for work/page mismatch output; add minimal test
@rlskoeser
Contributor Author

Was able to use this script in qa to do a partial reindex after the full page index crashed partway through. We still need to figure out why there's a discrepancy in page counts between db and solr, but we're tracking that on #567
