MOD-5756: Modify FT.SEARCH to avoid unnecessary escaping (#4433)
* Add query_parser/v3

* Add more tag tests

* Add test for TEXT testExact

* autoescaping single tag between brackets

* Fix tests COORD=1

* Fix wildcard support

* wildcard + prefix/infix/suffix is invalid

* Test backward compatibility

* Test some uncovered cases

* Test prefix/infix/suffix with TEXT field

* Remove temporary debug messages

* Test: escape 'w' single_tag

* Add more tag tests

* WIP: Tests to increase coverage parser v3

* Add more tests

* Fix lexer.c v3

* Fix test invalid syntax

* Test punct and cntrl characters

* split unescaped_tag rule

* Add one more test UNESCAPED_TAG

* Fix expected test format in cluster

* Update src/query_parser/v3/lexer.rl - Fix description

Co-authored-by: Omer Shadmi <76992134+oshadmi@users.noreply.github.com>

* Test pipe with dialect < 5, add comment about backslash escaping

* Test escaping $

* One more test escaping $

* lexer v3 - remove leading and trailing spaces

* Test short tags

* Add JSON tests

* Use comma separator for JSON tests

* Add test testTagUNF()

* Test tag autoescaping using DEFAULT_DIALECT 5

* More tests using DEFAULT_DIALECT 5 in test_search_params: test_geo, test_attr, test_binary_data.

* Revert changes to test_search_params:test_geo

* testTagUNF: Create index before hashes

* Revert change in QueryNode_DumpSds()

* Test aggregate with TAG autoescaping

* Fix testTagAutoescaping, remove additional right curly brace

* Update testDialect5InvalidSyntax()

* update parser/v3 taking latest parser/v2

* Create parser v3

* Test dialect: DEFAULT_DIALECT as module arg

* More tests for text queries

* Improve invalid syntax text

* Fix parser v3, unary op after field name

* Add missing test with modifierlist

* WIP: Test isempty() with DIALECT 5

* test_v1_vs_v2_vs_v5()

* Add tests to improve codecov using dialect 5

* One more test to improve codecov

* Add WITHCOUNT to fix test with DIALECT 5

* Fix testEmptyValueTags() for DIALECT > 2

* Fix test_tags

* Test float without leading zero

* Fix wrong float number test

* Test ragel minimization at the end of compilation

* Revert "Test ragel minization at the end of compilation"

This reverts commit 0dae78b.

* Fix make-parser.mk

* cpp-test parser v3

* Update parser v3

* Update lexer

* Fix tests

* Fix float number syntax, leading zero is optional

* Test number format

* Support numbers with multiple signs

* Create macros in parser.y v3

* Use set_max_dialect

* Fix testEmptyValueTags()

* MOD-6750 Fix numeric range syntax (#4505)

Fix numeric range syntax

* MOD-6749: Querying numeric fields using simple operators (#4516)

* Simplify single_tag

* Remove unescaped_tag2

* lexer v3 - remove colon from tag expressions

* Create unescaped_tag2 to create UNESCAPED_TAG without escape

* Fix leading/trailing spaces deletion

* Validate tok.len

* Remove debugging code

* Fix lexer.rl v3 format

* Temp: Try to fix sanitizer

* Revert "Temp: Try to fix sanitizer"

This reverts commit 66002f1.

* Test tag with * as literal

* Fix parser: tag rules

* Update tests/pytests/test_tags.py - minor typo

---------

Co-authored-by: Omer Shadmi <76992134+oshadmi@users.noreply.github.com>
(cherry picked from commit e907ece)
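
For context, here is a minimal sketch of the behavior the commit message above describes: under DIALECT 5, a single tag between curly braces is taken literally, so punctuation in the tag no longer needs escaping. This is an illustrative example, not part of the commit; it assumes redis-py and a local Redis running a RediSearch build that includes this change, and the index, key, and field names are made up here.

```python
# Illustrative sketch only (not from the commit): tag autoescaping under DIALECT 5.
import redis

r = redis.Redis(decode_responses=True)

# A TAG field whose values contain punctuation such as '?' and '-'.
r.execute_command("FT.CREATE", "idx", "ON", "HASH", "SCHEMA", "code", "TAG")
r.execute_command("HSET", "doc:1", "code", "ca?33-22")

# Dialect 2: punctuation inside {...} has to be escaped to be matched literally.
escaped = r.execute_command("FT.SEARCH", "idx", r"@code:{ca\?33\-22}", "DIALECT", "2")

# Dialect 5: a single tag between braces is read as-is, no escaping required.
plain = r.execute_command("FT.SEARCH", "idx", "@code:{ca?33-22}", "DIALECT", "5")

print(escaped[0], plain[0])  # both queries should report the same single match
```

The new tests added in tests/pytests/test_aggregate_params.py below exercise the same idea through FT.AGGREGATE with GROUPBY/REDUCE.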
nafraf committed May 27, 2024
1 parent e803dd3 commit c6dd396
Showing 9 changed files with 3,016 additions and 1,400 deletions.
1,138 changes: 808 additions & 330 deletions src/query_parser/v3/lexer.c

Large diffs are not rendered by default.

363 changes: 331 additions & 32 deletions src/query_parser/v3/lexer.rl

Large diffs are not rendered by default.

1,958 changes: 1,001 additions & 957 deletions src/query_parser/v3/parser.c

Large diffs are not rendered by default.

63 changes: 33 additions & 30 deletions src/query_parser/v3/parser.h
@@ -8,35 +8,38 @@
#define RP 8
#define RB 9
#define RSQB 10
#define TERM 11
#define QUOTE 12
#define LP 13
#define LB 14
#define LSQB 15
#define TILDE 16
#define MINUS 17
#define PLUS 18
#define AND 19
#define EQUAL 20
#define NOT_EQUAL 21
#define EQUAL_EQUAL 22
#define GE 23
#define GT 24
#define LE 25
#define LT 26
#define ARROW 27
#define COLON 28
#define NUMBER 29
#define SIZE 30
#define STAR 31
#define TAGLIST 32
#define UNESCAPED_TAG 11
#define TERM 12
#define QUOTE 13
#define LP 14
#define LB 15
#define LSQB 16
#define TILDE 17
#define MINUS 18
#define PLUS 19
#define AND 20
#define EQUAL 21
#define NOT_EQUAL 22
#define EQUAL_EQUAL 23
#define GE 24
#define GT 25
#define LE 26
#define LT 27
#define ARROW 28
#define COLON 29
#define NUMBER 30
#define SIZE 31
#define STAR 32
#define TERMLIST 33
#define PREFIX 34
#define SUFFIX 35
#define CONTAINS 36
#define PERCENT 37
#define ATTRIBUTE 38
#define VERBATIM 39
#define WILDCARD 40
#define AS_T 41
#define SEMICOLON 42
#define PREFIX_TAG 35
#define SUFFIX 36
#define SUFFIX_TAG 37
#define CONTAINS 38
#define CONTAINS_TAG 39
#define PERCENT 40
#define ATTRIBUTE 41
#define VERBATIM 42
#define WILDCARD 43
#define AS_T 44
#define SEMICOLON 45
79 changes: 42 additions & 37 deletions src/query_parser/v3/parser.y
@@ -20,6 +20,7 @@

%left RP RB RSQB.

%left UNESCAPED_TAG.
%left TERM.
%left QUOTE.
%left LP LB LSQB.
@@ -38,9 +39,8 @@
%left SIZE.
%left STAR.

%left TAGLIST.
%left TERMLIST.
%left PREFIX SUFFIX CONTAINS.
%left PREFIX PREFIX_TAG SUFFIX SUFFIX_TAG CONTAINS CONTAINS_TAG.
%left PERCENT.
%left ATTRIBUTE.
%left VERBATIM WILDCARD.
@@ -197,6 +197,9 @@ static void reportSyntaxError(QueryError *status, QueryToken* tok, const char *m
%type affix { QueryNode * }
%destructor affix { QueryNode_Free($$); }

%type affix_tag { QueryNode * }
%destructor affix_tag { QueryNode_Free($$); }

%type suffix { QueryNode * }
%destructor suffix { QueryNode_Free($$); }

@@ -224,8 +227,8 @@ static void reportSyntaxError(QueryError *status, QueryToken* tok, const char *m
%type fuzzy { QueryNode *}
%destructor fuzzy { QueryNode_Free($$); }

%type tag_list { QueryNode *}
%destructor tag_list { QueryNode_Free($$); }
%type single_tag { QueryNode *}
%destructor single_tag { QueryNode_Free($$); }

%type geo_filter { QueryParam *}
%destructor geo_filter { QueryParam_Free($$); }
@@ -647,6 +650,22 @@ affix(A) ::= CONTAINS(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, true);
}

/////////////////////////////////////////////////////////////////
// Prefix expressions based on tags
/////////////////////////////////////////////////////////////////

affix_tag(A) ::= PREFIX_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, false);
}

affix_tag(A) ::= SUFFIX_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, false, true);
}

affix_tag(A) ::= CONTAINS_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, true);
}

// verbatim(A) ::= VERBATIM(B) . {
// A = NewVerbatimNode_WithParams(ctx, &B);
// }
@@ -731,10 +750,10 @@ expr(A) ::= ISEMPTY LP modifier(B) RP . {
}

/////////////////////////////////////////////////////////////////
// Tag Lists - curly braces separated lists of words
// Single Tag - tag enclosed in curly braces
/////////////////////////////////////////////////////////////////

expr(A) ::= modifier(B) COLON LB tag_list(C) RB . {
expr(A) ::= modifier(B) COLON LB single_tag(C) RB . {
if (!C) {
A = NULL;
} else {
@@ -751,44 +770,26 @@ expr(A) ::= modifier(B) COLON LB single_tag(C) RB . {
}
}

tag_list(A) ::= param_term_case(B) . [TAGLIST] {
single_tag(A) ::= ATTRIBUTE(B) . {
A = NewPhraseNode(0);
B.type = QT_PARAM_TERM_CASE;
QueryNode_AddChild(A, NewTokenNode_WithParams(ctx, &B));
}

tag_list(A) ::= affix(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= verbatim(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= termlist(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= tag_list(B) OR param_term_case(C) . [TAGLIST] {
QueryNode_AddChild(B, NewTokenNode_WithParams(ctx, &C));
A = B;
}

tag_list(A) ::= tag_list(B) OR affix(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= UNESCAPED_TAG(B) . {
A = NewPhraseNode(0);
B.type = QT_TERM_CASE;
QueryNode_AddChild(A, NewTokenNode_WithParams(ctx, &B));
}

tag_list(A) ::= tag_list(B) OR verbatim(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= affix_tag(B) . {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= tag_list(B) OR termlist(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= verbatim(B) . {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

/////////////////////////////////////////////////////////////////
@@ -918,7 +919,7 @@ geo_filter(A) ::= LSQB param_num(B) param_num(C) param_num(D) param_term(E) RSQB
}

/////////////////////////////////////////////////////////////////
// Geomtriy Queries
// Geometry Queries
/////////////////////////////////////////////////////////////////
expr(A) ::= modifier(B) COLON geometry_query(C). {
if (C) {
@@ -1153,6 +1154,10 @@ term(A) ::= SIZE(B). {
A = B;
}

term(A) ::= UNESCAPED_TAG(B) . {
A = B;
}

///////////////////////////////////////////////////////////////////////////////////
// Parameterized Primitives (actual numeric or string, or a parameter/placeholder)
///////////////////////////////////////////////////////////////////////////////////
37 changes: 28 additions & 9 deletions tests/cpptests/test_cpp_query.cpp
@@ -102,7 +102,7 @@ TEST_F(QueryTest, testParser_delta) {
assertValidQuery_v(2, "(*)");
assertValidQuery_v(5, "(*)");

// params are avalible from version 2.
// params are available from version 2.
assertInvalidQuery_v(1, "$hello");
assertValidQuery_v(2, "$hello");
assertValidQuery_v(5, "$hello");
@@ -121,7 +121,6 @@
assertInvalidQuery_v(5, "@title:(@num:[0 10])");
assertInvalidQuery_v(5, "@t1:@t2:@t3:hello");


// minor bug in v1
assertValidQuery_v(1, "@title:{foo}}}}}");
assertInvalidQuery_v(2, "@title:{foo}}}}}");
@@ -145,6 +144,7 @@
assertValidQuery_v(5, "-@foo:as");

// From version 5, punctuation chars are not ignored, they should be escaped
// to be part of the term
assertValidQuery_v(1,"hello world&good");
assertValidQuery_v(2,"hello world&good");
assertInvalidQuery_v(5,"hello world&good");
@@ -447,15 +447,33 @@ TEST_F(QueryTest, testParser_v3) {
// Tag queries
assertValidQuery("@tags:{foo}", ctx);
assertValidQuery("@tags:{foo|bar baz|boo}", ctx);
assertValidQuery("@tags:{foo|bar\\ baz|boo}", ctx);
// Invalid: only the 'tag_invalid_punct' characters should be escaped, ' ' is not one of them
assertInvalidQuery("@tags:{foo|bar\\ baz|boo}", ctx);
assertValidQuery("@tags:{foo*}", ctx);
assertValidQuery("@tags:{foo\\-*}", ctx);
// Invalid: only the 'tag_invalid_punct' characters should be escaped, '-' is not one of them
assertInvalidQuery("@tags:{foo\\-*}", ctx);
assertInvalidQuery("@tags:{*foo\\-}", ctx);
assertInvalidQuery("@tags:{*foo\\-*}", ctx);
assertValidQuery("@tags:{foo-*}", ctx);
assertValidQuery("@tags:{*foo-}", ctx);
assertValidQuery("@tags:{*foo-*}", ctx);
assertValidQuery("@tags:{bar | foo*}", ctx);
assertValidQuery("@tags:{bar* | foo}", ctx);
assertValidQuery("@tags:{bar* | foo*}", ctx);

// Invalid: the '*' in the middle of the tag should be escaped.
// See 'tag_invalid_punct' in parser.rl
assertInvalidQuery("@tags:{bar* | foo}", ctx);
// Valid: the '*' in the middle was escaped
assertValidQuery("@tags:{bar\\* | foo}", ctx);
assertValidQuery("@tags:{bar*} | @tags:{foo}", ctx);
// Invalid: the '*' in the middle of the tag should be escaped
assertInvalidQuery("@tags:{bar* | foo*}", ctx);
assertValidQuery("@tags:{bar*} | @tags:{foo*}", ctx);

// Invalid: the '}'s should be escaped. See 'tag_invalid_punct' in parser.rl
assertInvalidQuery("@title:{foo}}}}}", ctx);
assertInvalidQuery("@title:{{{{{foo}", ctx);
// Valid: the '}' except the last one were escaped
assertValidQuery("@title:{foo\\}\\}\\}\\}}", ctx);
// Valid: the '{' does not need to be escaped, it is not part of 'tag_invalid_punct'
assertValidQuery("@title:{{{{{foo}", ctx);
assertInvalidQuery("@tags:{foo|bar\\ baz|}", ctx);
assertInvalidQuery("@tags:{foo|bar\\ baz|", ctx);
assertInvalidQuery("{foo|bar\\ baz}", ctx);
@@ -549,7 +567,8 @@ TEST_F(QueryTest, testParser_v3) {
assertValidQuery("@hello:[0 10]=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(@tit_le|bo_dy:barack @body|title|url|something_else:obama)=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(-hello ~world ~war)=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("@tags:{bar* | foo}=>[KNN 10 @vec_field $BLOB]", ctx);
assertInvalidQuery("@tags:{bar* | foo}=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(@tags:{bar*} | @tags:{foo})=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(no -as) => {$weight: 0.5} => [KNN 10 @vec_field $BLOB]", ctx);

// Invalid complex queries with hybrid vector
72 changes: 67 additions & 5 deletions tests/pytests/test_aggregate_params.py
@@ -72,11 +72,11 @@ def test_apply(env):
conn.execute_command('HSET', 'dkey:3', 'name', 'Perrito', 'breed', 'poodle', 'code', 'ca?33-22')
conn.execute_command('HSET', 'dkey:4', 'name', 'Lou Dog', 'breed', 'Dalmatian', 'code', 'ca:99-##')
conn.execute_command('HSET', 'dkey:5', 'name', 'dipper', 'breed', 'dalmatian', 'code', 'ca:99-##')
conn.execute_command('HSET', 'dkey:6', 'name', 'Duff', 'breed', 'Dalmatian', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:7', 'name', 'Triumph', 'breed', 'Mountain Hound', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:8', 'name', 'Chuck', 'breed', 'Saluki', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:9', 'name', 'Tuk', 'breed', 'Husky', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:10', 'name', 'Jul', 'breed', 'St. Bernard', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:6', 'name', 'Duff', 'breed', 'Dalmatian', 'code', 'gg-*33-22')
conn.execute_command('HSET', 'dkey:7', 'name', 'Triumph', 'breed', 'Mountain Hound', 'code', 'gg-*33-22')
conn.execute_command('HSET', 'dkey:8', 'name', 'Chuck', 'breed', 'Saluki', 'code', '*gg-33-22')
conn.execute_command('HSET', 'dkey:9', 'name', 'Tuk', 'breed', 'Husky', 'code', '*gg-33-22')
conn.execute_command('HSET', 'dkey:10', 'name', 'Jul', 'breed', 'St. Bernard', 'code', 'gg-33-22')

res1 = env.cmd('ft.aggregate', 'idx', '@breed:(Dal*|Poo*|Ru*|Mo*)', 'LOAD', '2', '@name', '@breed', 'FILTER', 'exists(@breed)', 'APPLY', 'upper(@name)', 'AS', 'n', 'APPLY', 'upper(@breed)', 'AS', 'b', 'SORTBY', '4', '@b', 'ASC', '@n', 'ASC')
res2 = env.cmd('ft.aggregate', 'idx', '@breed:($p1*|$p2*|$p3*|$p4*)', 'LOAD', '2', '@name', '@breed', 'FILTER', 'exists(@breed)', 'APPLY', 'upper(@name)', 'AS', 'n', 'APPLY', 'upper(@breed)', 'AS', 'b', 'SORTBY', '4', '@b', 'ASC', '@n', 'ASC', 'PARAMS', '8', 'p1', 'Dal', 'p2', 'Poo', 'p3', 'Ru', 'p4', 'Mo')
@@ -85,3 +85,65 @@ def test_apply(env):
res2 = env.cmd('ft.aggregate', 'idx', '@breed:($p1*|$p2*|$p3*|$p4*)', 'PARAMS', '8', 'p1', 'Dal', 'p2', 'Poo', 'p3', 'Ru', 'p4', 'Mo', 'SORTBY', '1', '@name')
env.assertEqual(res2, res1)

# Tag autoescaping
if dialect == 5:
res1 = env.cmd('ft.aggregate', 'idx', '@code:{ca?33-22}',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'ca?33-22', 'total', '3']])
res2 = env.cmd('ft.aggregate', 'idx', '@code:{$p1}',
'PARAMS', '2', 'p1', 'ca?33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

res1 = env.cmd('ft.aggregate', 'idx', "@code:{*:99-##}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'ca:99-##', 'total', '2']])
res2 = env.cmd('ft.aggregate', 'idx', "@code:{*$p1}",
'PARAMS', '2', 'p1', ':99-##',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# ------------------------------------------------------------------
# Tests with literal '*' being a part of the tag
# ------------------------------------------------------------------
# full match, the '*' is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{\\*gg-33-22}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', '*gg-33-22', 'total', '2']])
# full match, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{$p1}",
'PARAMS', '2', 'p1', '*gg-33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# prefix, the '*' in the middle is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{gg-\\*33-*}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'gg-*33-22', 'total', '2']])
# prefix, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{$p1*}",
'PARAMS', '2', 'p1', 'gg-*33-',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# suffix, the second '*' is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{*\\*33-22}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'gg-*33-22', 'total', '2']])
# suffix, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{*$p1}",
'PARAMS', '2', 'p1', '*33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

