How to search on fields without diacritics / accent? #317
Comments
Hi @smiklosovic: You are using the English analyzer for non-English text. The caron is a diacritic that is not part of the English alphabet. Use the correct analyzer for your language. |
I do not know in advance which letters will be used in our application. If I choose an analyzer that supports the caron, I would miss letters from other languages that it does not support. Is there any way to support "everything"? How would you solve this? |
Hi @smiklosovic: You are using a string mapper, which does not analyze the input text. You should use the text mapper instead. There is no analyzer that can do everything. Each language has its own characteristics (alphabet, stopwords, delimiters...) and these sometimes conflict with each other (the term 'a' could be a stopword in English but a valid word in another language). You can create a mapper for every different language and run searches against all of those mappers.
With this approach you will get false positives. Alternatively, you can code your own analyzer, include it in the classpath, and reference it with a classpath analyzer type. Hope this helps. |
Thanks! Two questions. a) Why would I have false positives? b) What about computational efficiency? Don't I waste a lot of resources by specifying multiple analyzers? What is the overhead? All I am really asking for is to match results regardless of accents, so that when I search for "Stefan Miklosovic" I get back a record with "Štefan Miklošovič". Would that be possible? |
Hi @smiklosovic: There is one way you can get this running. You need to develop a custom Analyzer made of a MappingCharFilter and a WhitespaceTokenizer. With the MappingCharFilter you can replace any character with another. For example, the mapping table for diacritics in Spanish is: á => a, é => e, í => i, ó => o, ú => u. With the WhitespaceTokenizer, text is split into tokens on whitespace characters. You can read further instructions about how to build and use a custom analyzer at #231. For now, you need to write code to use this custom analyzer feature. We are working on updating it so that custom analyzers can be defined in the index creation query. This feature also exists in ElasticSearch, so reading its documentation about custom analyzers may help you understand the Lucene analysis pipeline better. Also, you can always ask for consultancy services by writing to contact@stratio.com. Regards |
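The two steps described in the comment above (map characters first, then split on whitespace) can be sketched in plain Java. This is only an illustration of the idea, not Lucene's MappingCharFilter or WhitespaceTokenizer API; the class and method names are hypothetical, and the table holds only the five Spanish mappings listed in the comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for "char mapping + whitespace tokenization".
// It does not use or reproduce any Lucene class.
final class MappingWhitespaceSketch {
    // Spanish mapping table from the comment: a-acute => a, e-acute => e, ...
    private static final Map<Character, Character> MAPPING = new HashMap<>();
    static {
        MAPPING.put('\u00e1', 'a');
        MAPPING.put('\u00e9', 'e');
        MAPPING.put('\u00ed', 'i');
        MAPPING.put('\u00f3', 'o');
        MAPPING.put('\u00fa', 'u');
    }

    private MappingWhitespaceSketch() {}

    // Step 1: replace every mapped character, leave all others untouched.
    static String mapChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = MAPPING.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    // Step 2: split the mapped text into tokens on runs of whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : mapChars(text).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Because the same mapping would be applied both to the indexed text and to the query text, "cancion" and "canción" end up as the same token. Characters outside the table (for example š or č) pass through unchanged, which is why a hand-written table has to be extended for every language you expect.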
Any progress on this, or am I still forced to follow the path described in @ealonsodb's comment? |
This analyzer did the trick for me. ASCIIFoldingFilter converts characters to their ASCII equivalents, where one exists.
|
Hi! Any news about this problem? Is it still impossible to do this without creating a custom analyzer? |
@gjabelAmbitas In the end, we created one more field in which we store, for example, the last name without diacritics, and we run searches only against this "dumb" field. Whatever you enter is dumbed down by removing all diacritics (there are libraries for that, and core Java has it too), and we search for that string in the dumb column. We return the "with diacritics" column to the user, so even if you search without diacritics, you get them back in the results. With this approach you can also run wildcard queries and use ?, *, . and so on. The only case where "it does not work" is when you want the search itself to respect diacritics. For example, I am "Miklošovič" and I want to return ONLY people named "Miklošovic": I would get my full name back instead. But you have to ask yourself whether that scenario is ever going to happen. Everybody types names without diacritics when searching anyway. To transform all existing names into this dumb column, we used Apache Spark and its Cassandra connector. |
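A minimal sketch of the "dumbing down" step mentioned above, assuming the core Java facility referred to is `java.text.Normalizer`. The class and method names are hypothetical. The same function would be applied when writing the extra column and when preparing the search string.

```java
import java.text.Normalizer;

// Hypothetical helper: strips diacritics so that accented and
// unaccented spellings compare equal.
final class DiacriticsFolder {
    private DiacriticsFolder() {}

    static String fold(String input) {
        // NFD decomposition splits an accented letter into its base
        // letter followed by one or more combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the Unicode "Mark" category, i.e. the combining
        // marks produced by the decomposition; removing them leaves the
        // base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The name from this thread, written with Unicode escapes so the
        // source file stays pure ASCII.
        System.out.println(fold("\u0160tefan Miklo\u0161ovi\u010d"));
    }
}
```

Running `main` prints `Stefan Miklosovic`. Letters that have no canonical decomposition (for example ł or ø) are left as they are by this approach, so they would need an explicit mapping if they matter for your data.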
Hey @ealonsodb
Let's say that my name is saved in Cassandra with some diacritics / accents in it: Štefan Miklošovič
I want to write a query that finds me in the DB even if I type "Stefan Miklosovic". It seems to me that the only way to get myself back is to type these diacritic characters, but that is very cumbersome in an application that is used internationally.