Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle non-ascii chars in search #7378

Merged
merged 9 commits into from
Mar 11, 2021

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Nov 2, 2020

What this PR does / why we need it: adds the ASCIIFoldingFilterFactory which should allow for matches when accented characters are used - in the text (the index side) or when a search term is added (the filter itself means an unaccented char in a search query works while the preserveOriginal == true means that search for the accented char also works).

Which issue(s) this PR closes:

Closes #820

Special notes for your reviewer: This doesn't handle all the cases in the title of #820, but there's a comment suggesting that things like & and negative numbers are/were addressed in other issues. So - guessing this closes #820 but not sure.
Also note the concern in the issue that this approach can cause false positives: e.g. "cañon" will be found searching for "canon" which isn't right if you intended to search only for the English word canon.

Suggestions on how to test this:
There are 2 new search cases that should work with this change:
an accented char in some text field should be matched by an ascii search .e.g. "cañon" will be found searching for "canon"
an accented search will match an ascii field.,e.g. "canon" will be found searching for "cañon"

Should work for words with all such chars, eg. the list in the issue: (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…).

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a config file so it's fine to go to QA but a release note should be added to explain what to do, how to put the file into place, if the Solr index needs to be cleared or not, etc.

@qqmyers
Copy link
Member Author

qqmyers commented Nov 3, 2020

To start by answering the questions: this should require both the standard 'copy schema.xml' to the solr config dir (same as with any release dealing with solr). The other solr files don't need to be updated. To get the fix, one then has to run a solr re-index - might recommend the incremental one to avoid down-time.

@coveralls
Copy link

coveralls commented Nov 3, 2020

Coverage Status

Coverage remained the same at 19.481% when pulling 5efe3f0 on QualitativeDataRepository:IQSS/820 into d167418 on IQSS:develop.

@djbrooke djbrooke moved this from Review 🦁 to Community Dev 💻❤️ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 4, 2020
@mheppler
Copy link
Contributor

mheppler commented Mar 5, 2021

@qqmyers @pdurbin @scolapasta ...

Do the changes to Solr in 7373 solr upgrade #7645 make the debate about the release notes and schema and re-index here moot?

@qqmyers
Copy link
Member Author

qqmyers commented Mar 5, 2021

Good call - since copying the schema.xml from the Dataverse sw repo and reindexing are part of a fresh install, the instructions for that PR will cover what's needed here. (We could warn not to copy the local ones from solr 7.x (since there's a change in schema.xml due to this PR), but there are lots of things one shouldn't do...)

conf/solr/7.7.2/schema.xml Outdated Show resolved Hide resolved

## Upgrade Instructions

1. You will need to replace or modify your `schema.xml` and restart solr. Re-indexing is required to get full-functionality from this change - the standard instructions for an incremental reindex could be added here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When PR #7645 gets merged (in QA), this PR will need to be updated, as these release notes should be merged with the release notes included in that Solr update PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the release notes from the Solr upgrade PR.

Solr Update

With this release we upgrade to the latest available stable release in the Solr 8.x branch. We recommend a fresh installation of Solr 8.8.1 (the index will be empty)
followed by an "index all".

Before you start the "index all", Dataverse will appear to be empty because
the search results come from Solr. As indexing progresses, partial results will
appear until indexing is complete.

See http://guides.dataverse.org/installation/prerequisites.html#installing-solr

[for the additional upgrade steps section]

Run the script updateSchemaMDB.sh to generate updated solr schema files and preserve any other custom fields in your Solr configuration.
For example: (modify the path names as needed)
cd /usr/local/solr-8.8.1/server/solr/collection1/conf
wget https://github.com/IQSS/dataverse/releases/download/v5.4/updateSchemaMDB.sh
chmod +x updateSchemaMDB.sh
./updateSchemaMDB.sh -t .

See http://guides.dataverse.org/en/5.4/admin/metadatacustomization.html?highlight=updateschemamdb for more information.

Copy link
Contributor

@mheppler mheppler Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@djbrooke in tech hours, @scolapasta suggested moving this PR forward with the note to REMOVE this "Upload Instructions" section, as a duplicate of the steps already included in the Solr upgrade PR #7645.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mheppler great, thanks, I'll leave a note in this PR at the top of the notes so I don't forget, in case something slips and we don't get this into the release we expect to.

@poikilotherm
Copy link
Contributor

Please be aware of #7662 for this (maybe we can address this in the same go here?)

@kcondon kcondon merged commit bae37ca into IQSS:develop Mar 11, 2021
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Mar 11, 2021
@djbrooke djbrooke added this to the 5.4 milestone Mar 11, 2021
@kcondon kcondon self-assigned this Mar 12, 2021
@BPeuch BPeuch added this to Solved (thank you!) in Dataverse SODHA (Belgium) Mar 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Dataverse SODHA (Belgium)
Solved (thank you!)
Development

Successfully merging this pull request may close these issues.

Search: Accented characters, ampersands, negative numbers and other special characters
8 participants