Search using ASCIIFoldingFilter does not work with searches containing a diacritic #424

Closed
tfrancart opened this issue Jan 4, 2016 · 15 comments

@tfrancart
Contributor

Hello

Using this analyzer configuration:

       text:analyzer [
         a text:ConfigurableAnalyzer ;
         text:tokenizer text:LetterTokenizer ;
         text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
       ] 

Searches for "education" (without diacritic) correctly find labels containing "éducation", but searches for "éducation" (with diacritic) do not find labels containing "éducation"!

The search string should be analyzed with the same analyzer configuration as the one used for the text index.
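
To illustrate what the filter chain in the configuration above is expected to do, here is a minimal Python sketch that approximates `LetterTokenizer` followed by `ASCIIFoldingFilter` and `LowerCaseFilter`. It is not Lucene code: the function names are hypothetical, and stripping combining marks after NFKD decomposition is only an approximation of Lucene's folding, which uses its own mapping table.

```python
import unicodedata


def fold_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_tokens(text: str) -> list[str]:
    """Approximate LetterTokenizer: split on any non-letter character."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def analyze(text: str) -> list[str]:
    """Tokenize, fold diacritics, then lowercase (same order as the config)."""
    return [fold_ascii(tok).lower() for tok in letter_tokens(text)]


# Both spellings reduce to the same term, so a query that goes through
# the same chain should match either label.
print(analyze("Éducation de base"))
print(analyze("education"))
```

If the query string went through the same chain, "éducation" and "education" would be indistinguishable at search time, which is the behaviour requested here.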

@tfrancart tfrancart changed the title from "Search using ASCIIFoldingFilter does not work when one search with a diacritic" to "Search using ASCIIFoldingFilter does not work with searches containing a diacritic" Jan 4, 2016
@osma osma added the bug label Jan 4, 2016
@osma
Member

osma commented Jan 4, 2016

Whoops!

Can you check what the corresponding REST API call returns? I.e. what does this URL give you in JSON?
/rest/v1/search?query=éducation*
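
When calling this endpoint from a script rather than a browser, the non-ASCII character has to be percent-encoded as UTF-8. A minimal Python sketch follows; the base URL is a hypothetical placeholder and `search_url` is a made-up helper name, while the `/rest/v1/search` path and `query` parameter are taken from the URL above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with your own Skosmos installation.
BASE_URL = "http://localhost/skosmos"


def search_url(term: str, base_url: str = BASE_URL) -> str:
    """Build the REST search URL with the query percent-encoded as UTF-8."""
    query_string = urlencode({"query": term})
    return f"{base_url}/rest/v1/search?{query_string}"


print(search_url("éducation*"))
```

Note that `urlencode` also escapes the trailing `*`; the server decodes it back to a literal asterisk.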

@tfrancart
Contributor Author

It looks like an empty result:

{"@context":{"skos":"http://www.w3.org/2004/02/skos/core#","onki":"http://schema.onki.fi/onki#","uri":"@id","type":"@type","results":{"@id":"onki:results","@container":"@list"},"prefLabel":"skos:prefLabel","altLabel":"skos:altLabel","hiddenLabel":"skos:hiddenLabel","broader":"skos:broader"},"uri":"","results":[]}

@osma
Member

osma commented Jan 4, 2016

OK thanks. Then it's not an issue with the UI / JavaScript code. Need to investigate further.

@osma osma added this to the 1.5 milestone Jan 4, 2016
@tfrancart
Contributor Author

Jena-text allows setting a separate analyzer for the query (https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer, section "Analyzer for query"). I tried to set it like this:

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/tmp/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    text:storeValues true ; ## required for Skosmos 1.4
    text:queryAnalyzer [
        a text:ConfigurableAnalyzer ;
        text:tokenizer text:KeywordTokenizer ;
        text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    ]    
    .

(Note the use of KeywordTokenizer to avoid splitting the query string.)

But this does not seem to help, unfortunately...

@tfrancart
Contributor Author

The following works: add an extra index field containing the "unfolded" string:

         # skos:prefLabel
         [ text:field "pref" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
           ] 
         ]
         # skos:prefLabel again, indexed without ASCII folding
         [ text:field "pref-unfolded" ;
           text:predicate skos:prefLabel ;
           text:analyzer [
             a text:ConfigurableAnalyzer ;
             text:tokenizer text:LetterTokenizer ;
             text:filters (text:LowerCaseFilter)
           ] 
         ]

I am not sure what other consequences such a configuration could have for Skosmos. Do you see any problem with this approach?

@tfrancart
Contributor Author

Further tests show that the approach above does not, in fact, work correctly.

@osma
Member

osma commented Jan 4, 2016

Thanks for testing several options. This may be a problem in jena-text itself. The ability to specify different analyzers is a fairly recent feature, especially the configurable analyzer, and probably not extensively tested.

@osma
Member

osma commented Jan 8, 2016

I have tested this and it indeed looks like a bug in jena-text. The text query for "éducation" gives no results, while "education" matches both "education" and "éducation". Apparently the configured analyzer is not used for the query.

According to jena-text documentation:

There is an ability to specify an analyzer to be used for the query string itself. It will find terms in the query text. If not set, then the analyzer used for the document will be used.

This is somewhat unclear, but I take "the analyzer used for the document" to mean the same analyzer that was used for indexing; no other interpretation would make sense here.

@osma
Member

osma commented Jan 8, 2016

This turns out to be a bug/feature (depending on how you look at it) in Lucene. Its standard QueryParser implementation doesn't process wildcard queries via the Analyzer no matter what you do, and this appears to be by design. This affects Skosmos immensely, because almost all of its text queries are wildcard queries (the only exception is if you do a non-wildcard query via the REST API).

See:
http://stackoverflow.com/questions/28650774/lucene-query-parser-to-use-filters-for-wildcard-queries
http://www.gossamer-threads.com/lists/lucene/java-user/14224

Solr and Elasticsearch (both based on Lucene) seem to have evolved some solutions in this area but they may not be directly relevant for plain Lucene:
https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
http://stackoverflow.com/questions/13804514/elasticsearch-wildcard-query-with-ascii-folding

A possible solution would be to use AnalyzingQueryParser in jena-text.
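
The behaviour described above can be modelled with a small sketch. This is a deliberately simplified Python model, not Lucene code: `analyze`, `classic_parse`, `analyzing_parse`, and `search` are hypothetical names, the index is just a set of already folded terms, and only a trailing `*` is handled.

```python
import unicodedata


def analyze(term: str) -> str:
    """Approximate ASCIIFoldingFilter + LowerCaseFilter on a single term."""
    decomposed = unicodedata.normalize("NFKD", term)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()


def classic_parse(query: str) -> str:
    """Model of the default behaviour: wildcard terms skip the analyzer."""
    return query if "*" in query else analyze(query)


def analyzing_parse(query: str) -> str:
    """Model of an analyzing parser: analyze the text, keep a trailing '*'."""
    if query.endswith("*"):
        return analyze(query[:-1]) + "*"
    return analyze(query)


def search(index: set[str], parsed: str) -> list[str]:
    """Match a parsed query against indexed terms (prefix match on '*')."""
    if parsed.endswith("*"):
        return sorted(t for t in index if t.startswith(parsed[:-1]))
    return sorted(t for t in index if t == parsed)


# Index-time analysis has already folded "éducation" to "education".
index = {analyze("éducation"), analyze("école")}

print(search(index, classic_parse("éducation*")))    # unanalyzed prefix: no match
print(search(index, analyzing_parse("éducation*")))  # folded prefix: match
print(search(index, classic_parse("education*")))    # plain ASCII already matches
```

In this model the only difference between the two parsers is whether the wildcard term passes through the analyzer, which is enough to reproduce the symptom reported in this issue.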

@tfrancart
Contributor Author

Thanks for the analysis!
Here is another approach I can think of:

  1. Not using wildcard queries in Skosmos
  2. Using EdgeNGramTokenizer in Lucene/jena-text (https://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html) to compute all the n-grams, so that autocompletion can still work

Downside: indexes may grow very large with n-grams.
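
As a rough illustration of this trade-off, the following Python sketch computes edge n-grams for an already folded term. It is not Lucene's `EdgeNGramTokenizer`; `edge_ngrams` and its `min_gram`/`max_gram` parameters are hypothetical names chosen for this example.

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `term` with lengths min_gram..max_gram."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("education", min_gram=2)
print(grams)

# A prefix typed by the user becomes an exact term lookup, no wildcard needed.
print("educ" in grams)

# Index growth: this single 9-letter term produces 8 index entries.
print(len(grams))
```

Because every prefix is stored as its own term, autocompletion works with exact matches, at the cost of several index entries per original term.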

@osma
Member

osma commented Jan 8, 2016

Using EdgeNGramTokenizer is an interesting thought, but I don't think it would work for all the use cases we want to support. For example, wildcards may appear in the middle of the search query. We already allow users to enter wildcards in their search terms, and I don't think they would like it if we took that away.

Anyway I think it should be fairly simple to add a configuration flag such as useAnalyzingQueryParser to jena-text. There are some caveats in using that query parser with some analyzers (see javadoc linked above), so I don't think it can be made the default in jena-text.

@osma
Member

osma commented Feb 9, 2016

This isn't really tied to Skosmos development cycles as the work would need to be done in Jena, not Skosmos code.

Anyway I've created a Jena issue for this: https://issues.apache.org/jira/browse/JENA-1134

@osma osma added the may slip label Feb 9, 2016
@osma osma self-assigned this Feb 9, 2016
@osma osma modified the milestones: 1.6, 1.5 Feb 18, 2016
@osma osma removed the may slip label Feb 18, 2016
@osma
Member

osma commented Feb 18, 2016

Moving to the 1.6 milestone in order to not delay the 1.5 release further. This doesn't affect Skosmos code anyway.

@osma
Member

osma commented Apr 12, 2016

I've tested with a fresh Fuseki snapshot (from today) that includes the JENA-1134 feature. I used the new setting in my Fuseki configuration file:

<#skosmosDemoIndexLucene> a text:TextIndexLucene ;
    text:queryParser text:AnalyzingQueryParser ;
    [...]

I also set the analyzers to use ConfigurableAnalyzer with ASCIIFoldingFilter, as above.

Now I get the same results with both education and éducation:
http://skosmos.dev.finto.fi/unesco/en/search?clang=fr&q=education
http://skosmos.dev.finto.fi/unesco/en/search?clang=fr&q=%C3%A9ducation

@osma
Member

osma commented Apr 12, 2016

I updated the TextAnalysisConfiguration wiki page to reflect this.

@osma osma closed this as completed Apr 12, 2016