
FT.SEARCH output incorrect (FT.ADD not creating tokens correctly) #168

Closed
arora-kushal opened this issue Sep 20, 2017 · 14 comments

@arora-kushal

arora-kushal commented Sep 20, 2017

Hi,

We are facing an issue with RediSearch 0.21.0: the FT.SEARCH command returns incorrect output.

Could anyone please help with this? The reproduction steps are below.

Thanks!

## Commands to set up:

"FT.CREATE" "VVV" "SCHEMA" "VVId_s" "TEXT" "SORTABLE" "PrimaryBIN_s" "TEXT" "SORTABLE" "StreetAddress_s" "TEXT" "SORTABLE" "PrimaryAddress_s" "TEXT" "SORTABLE" "CreationDate_l" "NUMERIC" "SORTABLE" "NormalizedSeverity_s" "TEXT" "SORTABLE" "SourceSeverity_s" "TEXT" "SORTABLE" "ClosedDate_l" "NUMERIC" "SORTABLE" "Description_s" "TEXT" "SORTABLE" "RemediationAgency_s" "TEXT" "SORTABLE" "Group_s" "TEXT" "SORTABLE" "SiteCompliId_l" "NUMERIC" "SORTABLE" "Bin_s" "TEXT" "SORTABLE" "Conditions_s" "TEXT" "SORTABLE" "Owned_s" "TEXT" "SORTABLE" "IsCondition_s" "TEXT" "SORTABLE" "VVCost_s" "TEXT" "SORTABLE" "CAPTracker_s" "TEXT" "SORTABLE" "Fixer_s" "TEXT" "SORTABLE" "OverallStatus_s" "TEXT" "SORTABLE" "CombinedStatus_s" "TEXT" "SORTABLE" "IsExcluded_s" "TEXT" "SORTABLE" "UnitNumber_s" "TEXT" "SORTABLE" "Borough_s" "TEXT" "SORTABLE" "DateAddedtoBCS_l" "NUMERIC" "SORTABLE" "UpdatedBy_s" "TEXT" "SORTABLE" "UpdatedAt_l" "NUMERIC" "SORTABLE" "Created_At_l" "NUMERIC" "SORTABLE" "Created_At_Date_l" "NUMERIC" "SORTABLE" "RemediationOwner_s" "TEXT" "SORTABLE" "VVType_s" "TEXT" "SORTABLE" "IsActive_l" "NUMERIC" "SORTABLE" "IsActiveText_s" "TEXT" "SORTABLE"

"FT.ADD" "VVV" "ss:d:{#ClientList}50" "1" "FIELDS" "VVId_s" "123456789K" "PrimaryBIN_s" "2" "StreetAddress_s" "1800" "PrimaryAddress_s" "1800" "CreationDate_l" "6" "NormalizedSeverity_s" "days" "SourceSeverity_s" "NO" "ClosedDate_l" "222" "Description_s" "Work" "RemediationAgency_s" "U" "Group_s" "2" "SiteCompliId_l" "0" "Bin_s" "2" "Conditions_s" "2" "Owned_s" "Private" "IsCondition_s" "No" "VVCost_s" "2" "CAPTracker_s" "2" "Fixer_s" "2" "OverallStatus_s" "O" "CombinedStatus_s" "Open" "IsExcluded_s" "No" "UnitNumber_s" "2" "Borough_s" "BBBB" "DateAddedtoBCS_l" "232" "UpdatedBy_s" "2" "UpdatedAt_l" "32" "Created_At_l" "6" "Created_At_Date_l" "6" "RemediationOwner_s" "U" "VVType_s" "C" "IsActive_l" "2" "IsActiveText_s" "Yes"

Search Command : "FT.SEARCH" "VVV" "@IsActiveText_s:(Yes)"
Expected output : "ss:d:{#ClientList}50"
Actual Output (INCORRECT): 0

However, if the above commands use fewer fields, the search gives the correct output. For example:

"FT.CREATE" "VVV" "SCHEMA" "VVId_s" "TEXT" "SORTABLE" "IsActiveText_s" "TEXT" "SORTABLE"

"FT.ADD" "VVV" "ss:d:{#ClientList}50" "1" "FIELDS" "VVId_s" "123456789K" "IsActiveText_s" "Yes"

Search Command : "FT.SEARCH" "VVV" "@IsActiveText_s:(Yes)"
Expected output : "ss:d:{#ClientList}50"
Actual Output (CORRECT): "ss:d:{#ClientList}50"

@dvirsky
Contributor

dvirsky commented Sep 20, 2017

The index is limited to 32 text fields and unlimited numeric fields. However, if some of them are used only for sorting, I can release a fix that makes them not count toward those 32. Will that work for you?

@dvirsky
Contributor

dvirsky commented Sep 20, 2017

You can maybe solve it by indexing the yes/no fields as numeric rather than text, for example.
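As a sketch of that workaround (assuming a hypothetical `IsActive_l` field holding 0/1 in place of `IsActiveText_s` holding "No"/"Yes"), the schema and query might look like:

```
FT.CREATE VVV SCHEMA "VVId_s" TEXT SORTABLE "IsActive_l" NUMERIC SORTABLE
FT.ADD VVV "ss:d:{#ClientList}50" 1 FIELDS "VVId_s" "123456789K" "IsActive_l" 1
FT.SEARCH VVV "@IsActive_l:[1 1]"
```

Numeric fields don't count against the text-field limit, and the range filter `[1 1]` matches documents where the value is exactly 1.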

@arora-kushal
Author

Thanks, Dvir, for your prompt response. Yes, it would be great if you could release a fix that makes them not count toward those 32.

Also, in the above example there are 25 text fields and 8 numeric fields, so ideally it should work (since numeric fields would not be counted)?

@dvirsky
Contributor

dvirsky commented Sep 20, 2017

The numeric fields might be mistakenly counted in those 32 as well; let me check. In any case, it should not fail silently. I'll try to release a fix for this ASAP.

@dvirsky
Contributor

dvirsky commented Sep 20, 2017

Yeah, there was a bug and they were counted together. I've fixed it and am now writing a test for a huge schema. It should be pushed soon so you can try it.

@dvirsky
Contributor

dvirsky commented Sep 20, 2017

OK, it should work now. Please pull master and try.

@arora-kushal
Author

Yes, it's working fine now. Thanks!

@dvirsky
Contributor

dvirsky commented Sep 21, 2017

@arora-kushal nice! There are tests for a big schema now (64 fields), and the actual limit is 1024 fields (I should document it somewhere). Also, it no longer fails silently: it tells you if there are more than 32 text fields.

I would have allowed more, but I'm marking the field ids for each term in each document with a bitmask that is either 8, 16, 24, or 32 bits, to allow filtering on many fields at constant speed. The encoding scheme doesn't allow more than 32 bits for the field mask at the moment, and most people will probably not use more than 8.
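The field-mask idea above can be illustrated with a small Python sketch. This is only an illustration of the principle, not RediSearch's actual encoding: each text field gets one bit, a term's entry stores the OR of the bits of the fields it appeared in, and a field filter is a single AND.

```python
# Illustration only: one bit per text field, so a 32-bit mask
# can distinguish at most 32 text fields.
FIELDS = ["Borough_s", "CombinedStatus_s", "IsActiveText_s"]
field_bit = {name: 1 << i for i, name in enumerate(FIELDS)}

# Suppose the term "Open" appears only in CombinedStatus_s for some doc:
term_mask = field_bit["CombinedStatus_s"]

def term_in_field(mask, field):
    """Constant-time check: is this field's bit set in the term's mask?"""
    return bool(mask & field_bit[field])

print(term_in_field(term_mask, "CombinedStatus_s"))  # True
print(term_in_field(term_mask, "Borough_s"))         # False
```

With this scheme, widening the mask past 32 bits would change the on-disk encoding of every term entry, which is why the limit sits where it does.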

I do, however, want to add a new type of field, which I call a "tag field", much like indexing a VARCHAR field in SQL. It will be just like a text field, but not tokenized, and it cannot be referred to without the field name, so it doesn't need a bit in the mask. This will allow unlimited fields for things like (in your schema) Borough_s, RemediationAgency_s, etc., where you don't just search for the value but filter on it.

@arora-kushal
Author

Thanks for explaining it in detail. Looking forward to this solution, as it will be very useful for us.

@arora-kushal
Author

Hi Dvir,

Do you have any update on this?

Actually, we are building a solution with a schema of more than 32 text fields that requires filtering, so the solution you mentioned above would be very useful for us.

@dvirsky
Contributor

dvirsky commented Oct 16, 2017

@arora-kushal Hi, it's not ready yet; I've started implementing it but have been tied up with other stuff. I don't want to promise anything, but it's coming soon.

@gauravgoel151

We are building a similar solution with around 60 text fields on which searching/filtering can be applied.
Until the above solution is ready, a possible alternative could be the following:

We will create two indexes for the schema:
1) Index 1: containing 32 text fields, of which 31 are data fields and 1 acts as a pointer to the key in Index 2.
2) Index 2: for the remaining fields.
3) Collating the search results inside a higher-level module and returning them to the client.

Any suggestions will be highly appreciated.

Any suggestions will be highly appreciated.

@gauravgoel151

@dvirsky Could you please tell us whether we are thinking in the right direction, or whether there is a better workaround?

@dvirsky
Contributor

dvirsky commented Oct 25, 2017

It's a hack, indeed, but it will work.
I'd do the following:

  1. In one of the indexes, use NOSAVE so it only saves ids, to avoid duplicate documents etc.
  2. When searching, first query the index with no documents in it, using NOCONTENT and a relatively large pagination limit. You can only do that from the client, because it is a concurrent command and won't work from Lua or another module. This will give you a long list of ids.
  3. Then take the ids you received from the first query and use INKEYS when querying the second index.

As long as there aren't too many results to load on the first query (a few thousand should be fine), it will work, and it can be a good temporary solution.
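Concretely, the steps above might look something like this sketch (hypothetical index, document, and field names; `NOSAVE` is an option of FT.ADD in this RediSearch version):

```
# Index 2: NOSAVE indexes the fields but stores no document content
FT.ADD idx2 "doc:1" 1 NOSAVE FIELDS "ExtraField_s" "foo"

# Step 1: ids only, with a relatively large pagination limit
FT.SEARCH idx2 "@ExtraField_s:(foo)" NOCONTENT LIMIT 0 10000

# Step 2: restrict the query on the first index to those ids
FT.SEARCH idx1 "@CombinedStatus_s:(Open)" INKEYS 2 "doc:1" "doc:7"
```

The `INKEYS {num} {key} ...` clause limits the second query to the id list returned by the first, which is what makes the two indexes behave roughly like one.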
