Search using ASCIIFoldingFilter does not work with searches containing a diacritic #424
Comments
Whoops! Can you check what the corresponding REST API call returns? I.e. what does this URL give you in JSON? |
Looks like an empty result : {"@context":{"skos":"http://www.w3.org/2004/02/skos/core#","onki":"http://schema.onki.fi/onki#","uri":"@id","type":"@type","results":{"@id":"onki:results","@container":"@list"},"prefLabel":"skos:prefLabel","altLabel":"skos:altLabel","hiddenLabel":"skos:hiddenLabel","broader":"skos:broader"},"uri":"","results":[]} |
OK thanks. Then it's not an issue with the UI / JavaScript code. Need to investigate further. |
jena-text allows setting the analyzer used for the query (https://jena.apache.org/documentation/query/text-query.html#configuring-an-analyzer, section "Analyzer for query"). I tried to set it like this:
(note the use of KeywordTokenizer to avoid splitting the query string). Unfortunately, this does not seem to help. |
The following works: add an extra index field containing the "unfolded" string:
I am not sure what other consequences such a configuration could have for Skosmos. Do you see any problem with this approach? |
Further tests show that the approach above does in fact not work correctly. |
Thanks for testing several options. This may be a problem in jena-text itself. The ability to specify different analyzers is a fairly recent feature, especially the configurable analyzer, and probably not extensively tested. |
I have tested this and it indeed looks like a bug in jena-text. The text query for "éducation" gives no results, while "education" matches both "education" and "éducation". Apparently the configured analyzer is not used for the query. According to the jena-text documentation:
This is somewhat unclear, but I take "the analyzer used for the document" to mean the same analyzer that was used for indexing; no other interpretation would make sense here. |
This turns out to be a bug/feature (depending on how you look at it) in Lucene. Its standard QueryParser implementation doesn't process wildcard queries via the Analyzer no matter what you do, and this appears to be by design. This affects Skosmos immensely, because almost all of its text queries are wildcard queries (the only exception is if you do a non-wildcard query via the REST API).
See:
Solr and Elasticsearch (both based on Lucene) seem to have evolved some solutions in this area but they may not be directly relevant for plain Lucene:
A possible solution would be to use AnalyzingQueryParser in jena-text. |
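To make the mechanism concrete, here is a minimal Python sketch of the behaviour described above. It is a simulation, not Lucene or jena-text code: `fold` is a rough stand-in for ASCIIFoldingFilter, the dictionary stands in for the text index, and the `analyze_wildcards` flag is a hypothetical name used only to contrast the two behaviours.

```python
import fnmatch
import unicodedata


def fold(text):
    """Rough stand-in for lowercasing plus ASCII folding: strip combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Index time: every label goes through the analyzer, so only folded terms are stored.
labels = ["éducation", "education", "économie"]
index = {}
for label in labels:
    index.setdefault(fold(label), []).append(label)


def search(query, analyze_wildcards=False):
    """Plain term queries are always analyzed; wildcard queries only if the
    (hypothetical) analyze_wildcards flag is set."""
    is_wildcard = "*" in query or "?" in query
    if not is_wildcard or analyze_wildcards:
        query = fold(query)
    hits = []
    for term in sorted(index):
        # The stored terms are folded, so an unfolded pattern cannot match them.
        if fnmatch.fnmatchcase(term, query):
            hits.extend(index[term])
    return hits


print(search("éducation"))                           # plain term query, analyzed
print(search("éducation*"))                          # wildcard query, not analyzed
print(search("éducation*", analyze_wildcards=True))  # wildcard query, analyzed
```

In this sketch the wildcard query with a diacritic finds nothing because the index only holds folded terms, while the same query without the diacritic, or with analysis enabled, finds both labels.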
Thanks for the analysis!
Downside: indexes may grow very large with n-grams. |
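As a rough illustration of the edge n-gram idea and of why the index grows, here is a small Python sketch. It is not Lucene's EdgeNGramTokenizer; the function names and the sample labels are assumptions made for the example.

```python
def edge_ngrams(term, min_len=1):
    """Return every leading substring of term with at least min_len characters."""
    return [term[:n] for n in range(min_len, len(term) + 1)]


def build_index(labels):
    """Map each edge n-gram to the set of labels that start with it."""
    index = {}
    for label in labels:
        for gram in edge_ngrams(label):
            index.setdefault(gram, set()).add(label)
    return index


labels = ["education", "economie"]
index = build_index(labels)


def prefix_search(prefix):
    """Prefix search becomes an exact lookup, so no wildcard query is needed."""
    return sorted(index.get(prefix, set()))


print(prefix_search("edu"))
print(len(labels), "labels produce", len(index), "index keys")
```

Each label contributes roughly one index entry per character, which is where the growth comes from. The lookup also only answers prefix searches: a pattern with a wildcard in the middle of the term has no matching key.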
Using EdgeNGramTokenizer is an interesting thought, but I don't think it would work for all the use cases we want to support. For example, wildcards may appear in the middle of the search query. We already allow users to enter wildcards in the search terms, and I don't think they would like it if we took that away. Anyway I think it should be fairly simple to add a configuration flag such as |
This isn't really tied to Skosmos development cycles as the work would need to be done in Jena, not Skosmos code. Anyway I've created a Jena issue for this: https://issues.apache.org/jira/browse/JENA-1134 |
Moving to the 1.6 milestone in order to not delay the 1.5 release further. This doesn't affect Skosmos code anyway. |
I've tested with a fresh Fuseki snapshot (from today) that includes the JENA-1134 feature. I used the new setting in my Fuseki configuration file:
I also set the analyzers to use ConfigurableAnalyzer with ASCIIFoldingFilter, as above. Now I get the same results with both |
I updated the TextAnalysisConfiguration wiki page to reflect this. |
Hello
Using this analyzer configuration:
Searches for "education" (without diacritic) correctly find labels containing "éducation", but searches for "éducation" (with diacritic) don't find labels containing "éducation"!
The same analyzer configuration as the one used in the text index should be applied to the search string.