add an option to make page indexing more efficient #565
Comments
Would this general pattern work?
If it's fast to get a list of all page ids per work id from Solr, then you could do an even more careful comparison than page counts, by comparing sets of page ids to make sure they perfectly overlap.
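The set comparison suggested above could be sketched like this. The function and argument names are hypothetical, and it assumes the page ids have already been pulled from the database and Solr into plain mappings of work id to set of page ids:

```python
def works_with_mismatched_pages(db_page_ids, solr_page_ids):
    """Return work ids whose Solr page-id set doesn't perfectly
    match the database page-id set.

    Both arguments are assumed to be dicts mapping work id -> set of
    page ids; works missing from Solr entirely are also reported.
    """
    stale = []
    for work_id, db_ids in db_page_ids.items():
        # a work needs reindexing if Solr's set differs in any way
        if solr_page_ids.get(work_id, set()) != db_ids:
            stale.append(work_id)
    return stale
```

Comparing full id sets catches cases a simple count would miss, such as a work where one stale page was indexed and one new page was skipped, leaving the counts equal but the contents wrong.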
We can easily get a count of total pages / total works; we should be able to do a facet query in Solr to get the number of page items per source id (group id?). We could filter by modification time as well, so we only count recently indexed pages and reindex works that don't have enough recent pages.
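A rough sketch of what that facet query and comparison might look like. The Solr field names (`item_type`, `source_id`, `last_modified`) are guesses at the schema, not PPA's actual fields, and the functions here are hypothetical helpers:

```python
def page_facet_params(modified_since=None):
    """Build Solr query params faceting page items by source id.

    If modified_since is given (a Solr datetime string), only pages
    indexed since then are counted, per the filtering idea above.
    """
    params = {
        "q": "item_type:page",   # assumed field distinguishing pages from works
        "rows": 0,               # we only want facet counts, not documents
        "facet": "true",
        "facet.field": "source_id",
        "facet.limit": -1,       # return all facet values, not the default cap
    }
    if modified_since:
        params["fq"] = f"last_modified:[{modified_since} TO *]"
    return params


def works_missing_pages(db_counts, solr_facet_counts):
    """Return source ids whose Solr page count falls short of the
    database page count; those works need their pages reindexed."""
    return [
        source_id
        for source_id, expected in db_counts.items()
        if solr_facet_counts.get(source_id, 0) < expected
    ]
```

The facet response would then be flattened into a `{source_id: count}` dict and passed to `works_missing_pages` alongside per-work page counts from the database.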
Was able to use this script in QA to do a partial reindex after the full page index crashed partway through. We still need to figure out why there's a discrepancy in page counts between the db and Solr, but we're tracking that on #567
we often need to reindex all page content in ppa, but when we hit an error that causes the script to crash, we have to reindex everything from the beginning
it would be helpful for developer sanity to have an opt-in / non-default way to index only the content that needs it - maybe we can determine that by comparing page counts in the database and solr? or maybe we can use modification time to figure out which works need pages indexed? or some combination of the two?
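Combining the two signals floated here (count comparison and modification time) might look something like this minimal sketch. The record shapes and field names are assumptions for illustration, not PPA's data model:

```python
def needs_reindex(work, solr_info):
    """Decide whether a work's pages should be reindexed, using both
    signals from the issue: page counts and modification times.

    `work` is a hypothetical dict with id, page_count, and modified;
    `solr_info` maps work id -> {"page_count": ..., "indexed_at": ...}.
    """
    indexed = solr_info.get(work["id"])
    if indexed is None:
        return True  # never indexed at all
    if indexed["page_count"] != work["page_count"]:
        return True  # counts disagree between db and Solr
    # modified in the database after it was last indexed
    return work["modified"] > indexed["indexed_at"]
```

An opt-in flag on the index script could then filter the full work list through a check like this, instead of unconditionally reindexing everything.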