Search using ASCIIFoldingFilter does not work with searches containing a diacritic #424

tfrancart · 2016-01-04T14:27:56Z

Hello

Using this analyzer configuration:

       text:analyzer [
         a text:ConfigurableAnalyzer ;
         text:tokenizer text:LetterTokenizer ;
         text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
       ]

Searches for "education" (without diacritic) correctly find labels containing "éducation", but searches for "éducation" (with diacritic) don't find labels containing "éducation"!

The same analyzer configuration as the one used for the text index should be applied to the search string.


osma · 2016-01-04T14:32:45Z

Whoops!

Can you check what the corresponding REST API call returns? I.e. what does this URL give you in JSON?
/rest/v1/search?query=éducation*
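
When that URL is built programmatically, the non-ASCII character has to be percent-encoded as UTF-8. A minimal Python sketch, assuming a hypothetical base URL (`http://localhost/skosmos`; only the `/rest/v1/search` path is given here) and a hypothetical helper name `search_url`:

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL: only the /rest/v1/search path comes from the thread.
BASE_URL = "http://localhost/skosmos"


def search_url(term, base_url=BASE_URL):
    """Build the REST search URL for a prefix (wildcard) query on `term`."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters;
    # safe="*" keeps the trailing wildcard readable in the URL.
    params = urlencode({"query": term + "*"}, quote_via=quote, safe="*")
    return f"{base_url}/rest/v1/search?{params}"


print(search_url("éducation"))
# http://localhost/skosmos/rest/v1/search?query=%C3%A9ducation*
```

Pasting the unencoded form into a browser normally gives the same request, since browsers apply this encoding themselves.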

tfrancart · 2016-01-04T14:37:40Z

Looks like an empty result:

{"@context":{"skos":"http://www.w3.org/2004/02/skos/core#","onki":"http://schema.onki.fi/onki#","uri":"@id","type":"@type","results":{"@id":"onki:results","@container":"@list"},"prefLabel":"skos:prefLabel","altLabel":"skos:altLabel","hiddenLabel":"skos:hiddenLabel","broader":"skos:broader"},"uri":"","results":[]}

osma · 2016-01-04T14:38:53Z

OK thanks. Then it's not an issue with the UI / JavaScript code. Need to investigate further.

tfrancart · 2016-01-04T14:43:52Z

Jena-text allows setting the analyzer for the query (https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer, section "Analyzer for query"). I tried setting it like this:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/tmp/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    text:storeValues true ; ## required for Skosmos 1.4
    text:queryAnalyzer [
        a text:ConfigurableAnalyzer ;
        text:tokenizer text:KeywordTokenizer ;
        text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    ]    
    .

(Note the use of KeywordTokenizer to avoid splitting the query string.)

But this does not seem to help, unfortunately...
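
To make the tokenizer difference concrete, here is a rough Python illustration. It is not Lucene code: `letter_tokenize` and `keyword_tokenize` are hypothetical helpers that only approximate what LetterTokenizer and KeywordTokenizer do (`str.isalpha` stands in for Lucene's letter test).

```python
from itertools import groupby


def letter_tokenize(text):
    """Split at non-letter characters, roughly like LetterTokenizer."""
    return [
        "".join(chars)
        for is_letter, chars in groupby(text, key=str.isalpha)
        if is_letter
    ]


def keyword_tokenize(text):
    """Emit the whole input as a single token, like KeywordTokenizer."""
    return [text]


query = "éducation des adultes"
print(letter_tokenize(query))   # ['éducation', 'des', 'adultes']
print(keyword_tokenize(query))  # ['éducation des adultes']
```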

tfrancart · 2016-01-04T15:09:20Z

The following works: add an extra index field containing the "unfolded" string:

         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ] 
         ]
         # skos:prefLabel
         [ text:field "pref-unfolded" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:LowerCaseFilter)
           ] 
         ]

I can't tell what other consequences such a configuration might have for Skosmos. Do you see any problem with this approach?

tfrancart · 2016-01-04T15:20:03Z

Further tests show that the approach above does not, in fact, work correctly.

osma · 2016-01-04T15:24:22Z

Thanks for testing several options. This may be a problem in jena-text itself. The ability to specify different analyzers is a fairly recent feature, especially the configurable analyzer, and probably not extensively tested.

osma · 2016-01-08T07:56:58Z

I have tested this and it indeed looks like a bug in jena-text. The text query for "éducation" gives no results, while "education" matches both "education" and "éducation". Apparently the configured analyzer is not used for the query.

According to jena-text documentation:

"There is an ability to specify an analyzer to be used for the query string itself. It will find terms in the query text. If not set, then the analyzer used for the document will be used."

This is somewhat unclear, but I take "the analyzer used for the document" to mean the same analyzer that was used for indexing; no other interpretation would make sense here.

osma · 2016-01-08T11:23:02Z

This turns out to be a bug/feature (depending on how you look at it) in Lucene. Its standard QueryParser implementation doesn't process wildcard queries via the Analyzer no matter what you do, and this appears to be by design. This affects Skosmos immensely, because almost all of its text queries are wildcard queries (the only exception is if you do a non-wildcard query via the REST API).

See:
http://stackoverflow.com/questions/28650774/lucene-query-parser-to-use-filters-for-wildcard-queries
http://www.gossamer-threads.com/lists/lucene/java-user/14224

Solr and Elasticsearch (both based on Lucene) seem to have evolved some solutions in this area but they may not be directly relevant for plain Lucene:
https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
http://stackoverflow.com/questions/13804514/elasticsearch-wildcard-query-with-ascii-folding

A possible solution would be to use AnalyzingQueryParser in jena-text.
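
The effect can be reproduced outside Lucene with a small simulation. The sketch below is plain Python, not Lucene or jena-text code: `fold` is only a rough stand-in for ASCIIFoldingFilter plus LowerCaseFilter (the real filter covers more characters), and `wildcard_search` with its `analyze_pattern` flag is a hypothetical helper mimicking a query parser that does or does not run wildcard terms through the analyzer.

```python
import unicodedata
from fnmatch import fnmatchcase


def fold(text):
    """Rough stand-in for ASCIIFoldingFilter + LowerCaseFilter."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


# Index side: labels go through the analyzer, so only folded terms are stored.
labels = ["Éducation", "education", "Santé"]
index_terms = {fold(label) for label in labels}


def wildcard_search(pattern, analyze_pattern):
    """Match a wildcard pattern against the indexed (folded) terms."""
    if analyze_pattern:
        # What an analyzing query parser would do before matching.
        pattern = fold(pattern)
    return sorted(term for term in index_terms if fnmatchcase(term, pattern))


print(wildcard_search("education*", analyze_pattern=False))  # ['education']
print(wildcard_search("éducation*", analyze_pattern=False))  # []
print(wildcard_search("éducation*", analyze_pattern=True))   # ['education']
```

The second call is the situation reported in this issue: the index only contains "education", and the unanalyzed pattern still starts with "é", so nothing matches.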

tfrancart · 2016-01-08T12:02:30Z

Thanks for the analysis!
Here is another approach I can think of: stop using wildcard queries in Skosmos, and instead use EdgeNGramTokenizer in Lucene/jena-text (https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html) to compute all the n-grams, so that autocompletion can still work?

Downside: indexes may grow very large with n-grams.
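
A rough Python sketch of the idea, to show both the lookup and the index growth. This is a simplification and not Lucene code: `fold` approximates ASCIIFoldingFilter plus LowerCaseFilter, `edge_ngrams` and `prefix_lookup` are hypothetical helpers, and the folding is applied before the n-grams are computed, whereas in a Lucene analyzer chain the tokenizer would run first.

```python
import unicodedata


def fold(text):
    """Rough stand-in for ASCIIFoldingFilter + LowerCaseFilter."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def edge_ngrams(token, min_gram=1, max_gram=None):
    """Return the leading substrings of `token`, min_gram to max_gram characters long."""
    longest = len(token) if max_gram is None else min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Index side: every prefix of the folded label becomes its own term.
index_terms = set(edge_ngrams(fold("Éducation")))


def prefix_lookup(user_input):
    """Exact term lookup of the folded input; no wildcard query involved."""
    return fold(user_input) in index_terms


print(prefix_lookup("éduc"))  # True
print(len(index_terms))       # 9 terms stored for a single 9-letter label
```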

osma · 2016-01-08T12:07:54Z

Using EdgeNGramTokenizer is an interesting thought, but I don't think it would work for all the use cases we want to support. For example, wildcards may appear in the middle of the search query. We already allow users to enter wildcards in the search terms, and I don't think they would like it if we took that away.

Anyway I think it should be fairly simple to add a configuration flag such as useAnalyzingQueryParser to jena-text. There are some caveats in using that query parser with some analyzers (see javadoc linked above), so I don't think it can be made the default in jena-text.

osma · 2016-02-09T13:13:01Z

This isn't really tied to Skosmos development cycles as the work would need to be done in Jena, not Skosmos code.

Anyway I've created a Jena issue for this: https://issues.apache.org/jira/browse/JENA-1134

osma · 2016-02-18T12:24:47Z

Moving to the 1.6 milestone in order to not delay the 1.5 release further. This doesn't affect Skosmos code anyway.

osma · 2016-04-12T13:56:24Z

I've tested with a fresh Fuseki snapshot (from today) that includes the JENA-1134 feature. I used the new setting in my Fuseki configuration file:

<#skosmosDemoIndexLucene> a text:TextIndexLucene ;
    text:queryParser text:AnalyzingQueryParser ;
    [...]

I also set the analyzers to use ConfigurableAnalyzer with ASCIIFoldingFilter, as above.

Now I get the same results with both education and éducation:
http://skosmos.dev.finto.fi/unesco/en/search?clang=fr&q=education
http://skosmos.dev.finto.fi/unesco/en/search?clang=fr&q=%C3%A9ducation

osma · 2016-04-12T14:03:54Z

I updated the TextAnalysisConfiguration wiki page to reflect this.

tfrancart changed the title ~~Search using ASCIIFoldingFilter does not work when one search with a diacritic~~ Search using ASCIIFoldingFilter does not work with searches containing a diacritic Jan 4, 2016

osma added the bug label Jan 4, 2016

osma added this to the 1.5 milestone Jan 4, 2016

osma added the may slip label Feb 9, 2016

osma self-assigned this Feb 9, 2016

osma modified the milestones: 1.6, 1.5 Feb 18, 2016

osma removed the may slip label Feb 18, 2016

osma added the needs documentation label Apr 12, 2016

osma closed this as completed Apr 12, 2016
