case insensitivity for unicode characters #291

Open
tuananh opened this Issue Feb 21, 2018 · 6 comments

tuananh commented Feb 21, 2018

How to reproduce

FT.CREATE testIndex SCHEMA name TEXT
FT.ADD testIndex doc1 1.0 FIELDS name "Đà Nẵng"
FT.SEARCH testIndex "đà" # no result
FT.SEARCH testIndex "Đà" # ok

I was expecting FT.SEARCH testIndex "đà" to work too, because the following works:

FT.ADD testIndex doc2 1.0 FIELDS name "Melia"
FT.SEARCH testIndex "melia"
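
One plausible explanation for the difference, sketched in Python: this assumes the tokenizer lowercases only ASCII letters, which is a guess from the observed behavior and is not confirmed by the RediSearch source in this thread. The helper names `ascii_lower` and `matches` are hypothetical and exist only for illustration.

```python
def ascii_lower(text):
    """Lowercase only A-Z; every non-ASCII character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def tokens(text, lower):
    """Split on whitespace and apply the given lowercasing function."""
    return {lower(word) for word in text.split()}


def matches(doc, query, lower):
    """True if every query token appears among the document tokens."""
    return tokens(query, lower) <= tokens(doc, lower)


# ASCII-only folding (the assumed behavior): "Đ" is never folded to "đ".
print(matches("Đà Nẵng", "đà", ascii_lower))   # False
print(matches("Đà Nẵng", "Đà", ascii_lower))   # True
print(matches("Melia", "melia", ascii_lower))  # True

# Unicode-aware folding makes the first query match as well.
print(matches("Đà Nẵng", "đà", str.casefold))  # True
```

Under that assumption, "Melia" matches "melia" only because all of its letters are ASCII.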

dvirsky commented Feb 21, 2018

This is a known issue, but we won't get to it in the immediate future. If this is urgent, I suggest normalizing the input on the client side.

I'm keeping this open because we do intend to fix it; it just won't happen in the next few weeks.
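
A minimal sketch of what client-side normalization could look like, assuming Python on the client. The helpers `normalize`, `add_args`, and `search_args` are hypothetical; they only build argument lists for the FT.ADD and FT.SEARCH commands shown in the report, so sending them to Redis is left to whatever client library is in use.

```python
import unicodedata


def normalize(text):
    """Return a case-folded, NFC-composed form of the text."""
    composed = unicodedata.normalize("NFC", text)
    # Case folding can occasionally yield non-NFC output, so compose again.
    return unicodedata.normalize("NFC", composed.casefold())


def add_args(index, doc_id, name):
    """Arguments for FT.ADD with the indexed value normalized."""
    return ["FT.ADD", index, doc_id, "1.0", "FIELDS", "name", normalize(name)]


def search_args(index, query):
    """Arguments for FT.SEARCH with the query normalized the same way."""
    return ["FT.SEARCH", index, normalize(query)]


print(add_args("testIndex", "doc1", "Đà Nẵng"))
# Both spellings of the query now produce the same command.
print(search_args("testIndex", "đà") == search_args("testIndex", "Đà"))  # True
```

The same function has to be applied to both the indexed value and the query. One trade-off of this approach is that the stored field is the folded text, so the original casing must be kept elsewhere if it is needed for display.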

tuananh commented Feb 21, 2018

Yes, I'm normalizing the data before inserting it into Redis right now; I was just wondering whether this is expected behavior.

Can you point me to the section of the code where I would need to make changes to fix this?

dvirsky commented Feb 21, 2018

It's not a trivial fix, but if you want to explore it, start here: https://github.com/RedisLabsModules/RediSearch/blob/master/src/tokenize.c#L39

tw-bert commented Feb 22, 2018

I guess that, to address this properly, an open-source collation module should be incorporated into RediSearch.

dvirsky commented Feb 22, 2018

@tw-bert it's actually already there: an excellent but little-known tiny Unicode library called libnu. It's just not used in the tokenizer; it is used for other things.

tw-bert commented Feb 22, 2018

@dvirsky: Great info, and libnu does indeed have collation functions. We need collation as well, though not immediately. Good to hear it's on the roadmap.
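
To illustrate why collation goes beyond case folding, here is a rough Python approximation of an accent- and case-insensitive comparison key. This is neither libnu's API nor the Unicode Collation Algorithm; `loose_key` is a hypothetical helper that simply strips combining marks after decomposition.

```python
import unicodedata


def loose_key(text):
    """Approximate comparison key: drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the nonspacing marks produced by decomposition.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold()


print(loose_key("Đà Nẵng"))  # đa nang
print(sorted(["Zebra", "Élan", "apple"], key=loose_key))  # ['apple', 'Élan', 'Zebra']
```

Note that "Đ" folds to "đ" but is not reduced to "d", because it has no canonical decomposition. Language-specific equivalences like that are what a real collation library handles and this sketch does not.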

gkorland added this to Backlog in 1.4.0 + 1.4.1 on Aug 8, 2018

gkorland removed this from Backlog in 1.4.0 + 1.4.1 on Aug 23, 2018