Search: Accented characters, ampersands, negative numbers and other special characters #820

kcondon · 2014-08-12T16:28:41Z

This was suggested by Eleni during discussions around search issues with accented characters.

The suggestion is to allow searching both with the original accented characters and without them, for cases where a user may not have access to a keyboard with accented characters.

pdurbin · 2014-08-14T13:45:42Z

See also #818 (comment) and the search internationalization ticket at #326.

pdurbin · 2015-11-12T16:28:50Z

@mheppler pointed out an interesting answer at http://stackoverflow.com/questions/16627062/not-able-to-search-spanish-word-with-accent-in-solr/20657529#20657529

Here it is in full (note downsides though):

You can try using the ASCIIFoldingFilterFactory filter.

It converts accented characters into their unaccented counterparts.
Put this in your schema.xml:

<filter class="solr.ASCIIFoldingFilterFactory"/>

Note: The downside is that words like "cañon" and "canon" are now equivalent and both hit the same documents IIRC.
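To make the trade-off concrete, here is a minimal sketch of accent folding in Python. It only approximates what solr.ASCIIFoldingFilterFactory does: it strips combining marks after Unicode decomposition, whereas the Solr filter uses its own mapping table. The function name `fold_accents` is made up for illustration.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks.

    Illustration only; this is not the mapping Solr itself uses.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("cañon"))                   # canon
print(fold_accents("População Estrangeira"))   # Populacao Estrangeira

# The downside mentioned above: after folding, the two words are identical.
print(fold_accents("cañon") == fold_accents("canon"))  # True
```

After folding, "cañon" and "canon" are the same token, so a search for either one matches documents containing the other.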

mheppler · 2020-02-03T19:46:39Z

Dusted off this oldie but goodie to be a representative issue for many other search bugs/feature requests. I have closed the following issues, to consolidate them here, and moved over any pertinent information in their comments.

Search: Support searching on accented words by typing unaccented characters instead (original title of this issue)
Search: Searching on a negative number returns a lot of false results. #819 (-160 in Geographic Bounding Box for West Longitude and -65 for North Latitude, returns unexpected results, but 143 for East Longitude target dataset found)
Search: characters in (dataset names) return zero results #2156 (search População Estrangeira na Europa e no Mundo Nôvo (%) - 1870 - 2001 is broken in Harvard Dataverse, remove "%" and it returns results)
Search: Text after and including Ampersand (&) is removed from Search Query #2702 (search Economics & Politics in Harvard Dataverse, see Stack Overflow answer for potential fix)
Parameters issue in display on production #3993 (example on Harvard Dataverse)
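A rough sketch of two mechanisms that could explain some of the symptoms above. Both are assumptions, not confirmed root causes for these issues. First, in the Lucene/Solr standard query syntax a leading "-" means "exclude this term", so a negative number may need escaping. Second, an unencoded "&" in a URL query string starts a new parameter, which would cut the search text short. The helper `escape_query_term` is hypothetical and not part of Dataverse or Solr.

```python
import re
from urllib.parse import parse_qs, urlencode

# Characters and operators with special meaning in the Lucene/Solr
# standard query syntax; each match gets a backslash in front of it.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query_term(term: str) -> str:
    """Hypothetical helper: backslash-escape query syntax characters."""
    return _SPECIAL.sub(r"\\\1", term)


# A leading minus is otherwise read as the "prohibit" operator.
print(escape_query_term("-160"))  # \-160

# Unencoded, everything from the ampersand onward is split off as a
# separate (and here nameless) parameter, so "q" loses "& Politics".
print(parse_qs("q=Economics & Politics"))

# Encoded, the ampersand survives as part of the value of "q".
encoded = urlencode({"q": "Economics & Politics"})
print(encoded)  # q=Economics+%26+Politics
print(parse_qs(encoded)["q"][0])  # Economics & Politics
```

Whether escaping or encoding belongs in the UI, the API layer, or the Solr configuration is a separate design question that this sketch does not settle.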

pdurbin · 2020-02-04T16:32:38Z

@mheppler I'm still haunted by the bug report at #1928 (comment) that a search for "Experiment" found datasets with "Experience" but I haven't tested lately. That was nearly five years ago. 😄

BPeuch · 2020-10-12T14:31:13Z

We here in Belgium second this, as our three official languages – Dutch, French and German – all contain special characters (á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü…).

Here is an illustration of how accented characters can hinder the search for / the discovery of datasets:

[illustration not preserved]

pdurbin · 2020-10-13T15:06:28Z

@BPeuch in a report from a French installation, switching from text_en to text_fr seems to have helped: https://groups.google.com/g/dataverse-community/c/9sjpBpPRuFk/m/uxH2KKJnAQAJ

Since you have three official languages, however, I'm not sure if this will work for you.

BPeuch · 2020-10-13T15:11:55Z

That's valuable information still. Thank you, @pdurbin!

I fear that indeed it might not work because some characters are specific to some of these languages (e.g. á and ó for Dutch). It could be worth a try though.

qqmyers · 2020-10-13T20:09:16Z

FWIW: For QDR, we addressed some of this with changes in the solr schema.xml leveraging some filters:
solr.WordDelimiterGraphFilterFactory
solr.ASCIIFoldingFilterFactory
solr.PatternReplaceFilterFactory

I can dig up that code if it's helpful. I don't know much about solr, so what we did was mostly cut/paste from sources I found on the web; you might be better off searching for the latest on this issue, perhaps including some of those filter names. (Our interest was primarily in handling characters from other languages and contractions, so it may not be as general as others might want.)

One thing I think is helpful to convey though: I think this can be solved or significantly improved just by making solr changes, versus requiring Dataverse code changes. So looking for answers related to solr, or for solr expertise at our institutions, might be a good approach.

qqmyers · 2020-10-29T20:57:02Z

FWIW: QDR's solution can be seen in https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml - see the qqmyers changes starting from https://github.com/QualitativeDataRepository/dataverse/blame/develop/conf/solr/7.7.2/schema.xml#L573.
The main thing for non-ASCII characters was to enable the solr.ASCIIFoldingFilterFactory filter during indexing and querying. We also used the solr.PatternReplaceFilterFactory filter to try to recognize contractions (e.g. qu'est que) - you can see from the pattern there that this is fairly limited. The use of the solr.ASCIIFoldingFilterFactory is probably generic and could be pulled into a separate PR if there's interest in having it as the default for the community.

(You'll also see some modifications to the solr.WordDelimiterGraphFilterFactory params - I think those are related to file names containing numbers - basically an unrelated but also potentially useful change).
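As a toy illustration of why the folding filter is enabled during both indexing and querying, here is a tiny inverted index in Python. The documents, analyzer functions and index structure are all made up for the example and are unrelated to how Solr stores data internally.

```python
import unicodedata
from collections import defaultdict


def fold(token: str) -> str:
    """Approximate accent folding (not Solr's actual mapping)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lowercase_only(token: str) -> str:
    return token.lower()


def lowercase_and_fold(token: str) -> str:
    return fold(token.lower())


def build_index(docs, analyze):
    """Map each analyzed token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[analyze(token)].add(doc_id)
    return index


# Hypothetical documents: one typed with accents, one without.
docs = {1: "População Estrangeira", 2: "Populacao Mundial"}

plain = build_index(docs, lowercase_only)
folded = build_index(docs, lowercase_and_fold)

# No folding anywhere: the unaccented query misses document 1.
print(plain.get(lowercase_only("populacao"), set()))        # {2}

# Folding at index time only: the accented query finds nothing at all.
print(folded.get(lowercase_only("População"), set()))       # set()

# Folding at both index and query time: either spelling finds both.
print(folded.get(lowercase_and_fold("População"), set()))   # {1, 2}
```

The middle case shows the failure mode of applying the filter on only one side: the index holds folded tokens, so an unfolded query token never matches.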

poikilotherm · 2020-11-02T10:16:51Z

This might be linked to #6675 and #7375.

kcondon added this to the In Review - Dataverse 4.0 milestone Aug 12, 2014

kcondon added Type: Feature labels Aug 12, 2014

kcondon assigned pdurbin Aug 12, 2014

pdurbin modified the milestones: Beta 10 - Dataverse 4.0, In Review - Dataverse 4.0 Nov 4, 2014

eaquigley modified the milestones: Beta 10 - Dataverse 4.0, Beta 9 - Dataverse 4.0, 4.1 Nov 10, 2014

scolapasta modified the milestones: In Review - Long Term, In Review - Short Term May 8, 2015

pdurbin removed their assignment Jan 21, 2016

scolapasta added Status: Triaged and removed Status: Dev labels Jan 28, 2016

scolapasta removed this from the Not Assigned to a Release milestone Jan 28, 2016

kcondon added Priority 2: Moderate labels May 16, 2016

pdurbin removed the zTriaged label Jun 30, 2017

pdurbin added User Role: Guest Anyone using the system, even without an account and removed zEffort 1: Small labels Jul 12, 2017

djbrooke added this to Inbox 🗄 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) May 8, 2019

mheppler added Type: Bug a defect and removed UX & UI: Design This issue needs input on the design of the UI and from the product owner Type: Feature a feature request User Role: Guest Anyone using the system, even without an account labels Feb 3, 2020

mheppler changed the title ~~Search: Support searching on accented words by typing unaccented characters instead.~~ Search: Accented characters, ampersands, negative numbers and other special characters Feb 3, 2020

mheppler added this to Needs Discussion Before Ready in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Feb 3, 2020

BPeuch added this to Pretty please in Dataverse SODHA (Belgium) Oct 12, 2020

qqmyers mentioned this issue Oct 13, 2020

Check solr config re: handling foreign chars and contractions QualitativeDataRepository/dataverse#57

Closed

poikilotherm mentioned this issue Nov 2, 2020

Gibberish characters in the data citation box in the UI #7375

Open

qqmyers mentioned this issue Nov 2, 2020

Handle non-ascii chars in search #7378

Merged

djbrooke removed this from Needs Discussion/Definition 💬 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 4, 2020

kcondon closed this as completed in #7378 Mar 11, 2021

BPeuch moved this from Pretty please to Solved (thank you!) in Dataverse SODHA (Belgium) Mar 30, 2021
