-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle non-ascii chars in search #7378
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a config file so it's fine to go to QA but a release note should be added to explain what to do, how to put the file into place, if the Solr index needs to be cleared or not, etc.
To start by answering the questions: this should require both the standard 'copy schema.xml' to the solr config dir (same as with any release dealing with solr). The other solr files don't need to be updated. To get the fix, one then has to run a solr re-index - might recommend the incremental one to avoid down-time. |
@qqmyers @pdurbin @scolapasta ... Do the changes to Solr in 7373 solr upgrade #7645 make the debate about the release notes and schema and re-index here moot? |
Good call - since copying the schema.xml from the Dataverse sw repo and reindexing are part of a fresh install, the instructions for that PR will cover what's needed here. (We could warn not to copy the local ones from solr 7.x (since there's a change in schema.xml due to this PR), but there are lots of things one shouldn't do...) |
|
||
## Upgrade Instructions | ||
|
||
1. You will need to replace or modify your `schema.xml` and restart solr. Re-indexing is required to get full-functionality from this change - the standard instructions for an incremental reindex could be added here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When PR #7645 gets merged (in QA), this PR will need to be updated, as these release notes should be merged with the release notes included in that Solr update PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the release notes from the Solr upgrade PR.
Solr Update
With this release we upgrade to the latest available stable release in the Solr 8.x branch. We recommend a fresh installation of Solr 8.8.1 (the index will be empty)
followed by an "index all".Before you start the "index all", Dataverse will appear to be empty because
the search results come from Solr. As indexing progresses, partial results will
appear until indexing is complete.See http://guides.dataverse.org/installation/prerequisites.html#installing-solr
[for the additional upgrade steps section]
Run the script updateSchemaMDB.sh to generate updated solr schema files and preserve any other custom fields in your Solr configuration.
For example: (modify the path names as needed)
cd /usr/local/solr-8.8.1/server/solr/collection1/conf
wget https://github.com/IQSS/dataverse/releases/download/v5.4/updateSchemaMDB.sh
chmod +x updateSchemaMDB.sh
./updateSchemaMDB.sh -t .See http://guides.dataverse.org/en/5.4/admin/metadatacustomization.html?highlight=updateschemamdb for more information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@djbrooke in tech hours, @scolapasta suggested moving this PR forward with the note to REMOVE this "Upload Instructions" section, as a duplicate of the steps already included in the Solr upgrade PR #7645.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mheppler great, thanks, I'll leave a note in this PR at the top of the notes so I don't forget, in case something slips and we don't get this into the release we expect to.
Please be aware of #7662 for this (maybe we can address this in the same go here?) |
What this PR does / why we need it: adds the ASCIIFoldingFilterFactory which should allow for matches when accented characters are used - in the text (the index side) or when a search term is added (the filter itself means an unaccented char in a search query works while the preserveOriginal == true means that search for the accented char also works).
Which issue(s) this PR closes:
Closes #820
Special notes for your reviewer: This doesn't handle all the cases in the title of #820, but there's a comment suggesting that things like & and negative numbers are/were addressed in other issues. So - guessing this closes #820 but not sure.
Also note the concern in the issue that this approach can cause false positives: e.g. "cañon" will be found searching for "canon" which isn't right if you intended to search only for the English word canon.
Suggestions on how to test this:
There are 2 new search cases that should work with this change:
an accented char in some text field should be matched by an ascii search .e.g. "cañon" will be found searching for "canon"
an accented search will match an ascii field.,e.g. "canon" will be found searching for "cañon"
Should work for words with all such chars, eg. the list in the issue: (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…).
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: