MOD-5756: Modify FT.SEARCH to avoid unnecessary escaping (#4433)
* Add query_parser/v3

* Add more tag tests

* Add test for TEXT testExact

* autoescaping single tag between brackets

* Fix tests COORD=1

* Fix wildcard support

* wildcard + prefix/infix/suffix is invalid

* Test backward compatibility

* Test some uncovered cases

* Test prefix/infix/suffix with TEXT field

* Remove temporary debug messages

* Test: escape 'w' single_tag

* Add more tag tests

* WIP: Tests to increase coverage parser v3

* Add more tests

* Fix lexer.c v3

* Fix test invalid syntax

* Test punct and cntrl characters

* split unescaped_tag rule

* Add one more test UNESCAPED_TAG

* Fix expected test format in cluster

* Update src/query_parser/v3/lexer.rl - Fix description

Co-authored-by: Omer Shadmi <76992134+oshadmi@users.noreply.github.com>

* Test pipe with dialect < 5, add comment about backslash escaping

* Test escaping $

* One more test escaping $

* lexer v3 - remove leading and trailing spaces

* Test short tags

* Add JSON tests

* Use comma separator for JSON tests

* Add test testTagUNF()

* Test tag autoescaping using DEFAULT_DIALECT 5

* More tests using DEFAULT_DIALECT 5 in test_search_params: test_geo, test_attr, test_binary_data.

* Revert changes to test_search_params:test_geo

* testTagUNF: Create index before hashes

* Revert change in QueryNode_DumpSds()

* Test aggregate with TAG autoescaping

* Fix testTagAutoescaping, remove additional right curly brace

* Update testDialect5InvalidSyntax()

* update parser/v3 taking latest parser/v2

* Create parser v3

* Test dialect: DEFAULT_DIALECT as module arg

* More tests for text queries

* Improve invalid syntax text

* Fix parser v3, unary op after field name

* Add missing test with modifierlist

* WIP: Test isempty() with DIALECT 5

* test_v1_vs_v2_vs_v5()

* Add tests to improve codecov using dialect 5

* One more test to improve codecov

* Add WITHCOUNT to fix test with DIALECT 5

* Fix testEmptyValueTags() for DIALECT > 2

* Fix test_tags

* Test float without leading zero

* Fix wrong float number test

* Test ragel minimization at the end of compilation

* Revert "Test ragel minization at the end of compilation"

This reverts commit 0dae78b.

* Fix make-parser.mk

* cpp-test parser v3

* Update parser v3

* Update lexer

* Fix tests

* Fix float number syntax, leading zero is optional

* Test number format

* Support numbers with multiple signs

* Create macros in parser.y v3

* Use set_max_dialect

* Fix testEmptyValueTags()

* MOD-6750 Fix numeric range syntax (#4505)

Fix numeric range syntax

* MOD-6749: Querying numeric fields using simple operators (#4516)

* Simplify single_tag

* Remove unescaped_tag2

* lexer v3 - remove colon from tag expressions

* Create unescaped_tag2 to create UNESCAPED_TAG without escape

* Fix leading/trailing spaces deletion

* Validate tok.len

* Remove debugging code

* Fix lexer.rl v3 format

* Temp: Try to fix sanitizer

* Revert "Temp: Try to fix sanitizer"

This reverts commit 66002f1.

* Test tag with * as literal

* Fix parser: tag rules

* Update tests/pytests/test_tags.py - minor typo

---------

Co-authored-by: Omer Shadmi <76992134+oshadmi@users.noreply.github.com>
(cherry picked from commit e907ece)
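
For context, here is a minimal sketch of the behavior the commit message above describes: under DIALECT 5, a single tag between curly braces is taken literally, so punctuation in the tag no longer needs escaping. This is an illustrative example, not part of the commit; it assumes redis-py and a local Redis running a RediSearch build that includes this change, and the index, key, and field names are made up here.

```python
# Illustrative sketch only (not from the commit): tag autoescaping under DIALECT 5.
import redis

r = redis.Redis(decode_responses=True)

# A TAG field whose values contain punctuation such as '?' and '-'.
r.execute_command("FT.CREATE", "idx", "ON", "HASH", "SCHEMA", "code", "TAG")
r.execute_command("HSET", "doc:1", "code", "ca?33-22")

# Dialect 2: punctuation inside {...} has to be escaped to be matched literally.
escaped = r.execute_command("FT.SEARCH", "idx", r"@code:{ca\?33\-22}", "DIALECT", "2")

# Dialect 5: a single tag between braces is read as-is, no escaping required.
plain = r.execute_command("FT.SEARCH", "idx", "@code:{ca?33-22}", "DIALECT", "5")

print(escaped[0], plain[0])  # both queries should report the same single match
```

The new tests added in tests/pytests/test_aggregate_params.py below exercise the same idea through FT.AGGREGATE with GROUPBY/REDUCE.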
nafraf committed May 27, 2024
1 parent e803dd3 commit c6dd396
Showing 9 changed files with 3,016 additions and 1,400 deletions.
1,138 changes: 808 additions & 330 deletions src/query_parser/v3/lexer.c

Large diffs are not rendered by default.

363 changes: 331 additions & 32 deletions src/query_parser/v3/lexer.rl

Large diffs are not rendered by default.

1,958 changes: 1,001 additions & 957 deletions src/query_parser/v3/parser.c

Large diffs are not rendered by default.

63 changes: 33 additions & 30 deletions src/query_parser/v3/parser.h
@@ -8,35 +8,38 @@
#define RP 8
#define RB 9
#define RSQB 10
#define TERM 11
#define QUOTE 12
#define LP 13
#define LB 14
#define LSQB 15
#define TILDE 16
#define MINUS 17
#define PLUS 18
#define AND 19
#define EQUAL 20
#define NOT_EQUAL 21
#define EQUAL_EQUAL 22
#define GE 23
#define GT 24
#define LE 25
#define LT 26
#define ARROW 27
#define COLON 28
#define NUMBER 29
#define SIZE 30
#define STAR 31
#define TAGLIST 32
#define UNESCAPED_TAG 11
#define TERM 12
#define QUOTE 13
#define LP 14
#define LB 15
#define LSQB 16
#define TILDE 17
#define MINUS 18
#define PLUS 19
#define AND 20
#define EQUAL 21
#define NOT_EQUAL 22
#define EQUAL_EQUAL 23
#define GE 24
#define GT 25
#define LE 26
#define LT 27
#define ARROW 28
#define COLON 29
#define NUMBER 30
#define SIZE 31
#define STAR 32
#define TERMLIST 33
#define PREFIX 34
#define SUFFIX 35
#define CONTAINS 36
#define PERCENT 37
#define ATTRIBUTE 38
#define VERBATIM 39
#define WILDCARD 40
#define AS_T 41
#define SEMICOLON 42
#define PREFIX_TAG 35
#define SUFFIX 36
#define SUFFIX_TAG 37
#define CONTAINS 38
#define CONTAINS_TAG 39
#define PERCENT 40
#define ATTRIBUTE 41
#define VERBATIM 42
#define WILDCARD 43
#define AS_T 44
#define SEMICOLON 45
79 changes: 42 additions & 37 deletions src/query_parser/v3/parser.y
@@ -20,6 +20,7 @@

%left RP RB RSQB.

%left UNESCAPED_TAG.
%left TERM.
%left QUOTE.
%left LP LB LSQB.
@@ -38,9 +39,8 @@
%left SIZE.
%left STAR.

%left TAGLIST.
%left TERMLIST.
%left PREFIX SUFFIX CONTAINS.
%left PREFIX PREFIX_TAG SUFFIX SUFFIX_TAG CONTAINS CONTAINS_TAG.
%left PERCENT.
%left ATTRIBUTE.
%left VERBATIM WILDCARD.
@@ -197,6 +197,9 @@ static void reportSyntaxError(QueryError *status, QueryToken* tok, const char *m
%type affix { QueryNode * }
%destructor affix { QueryNode_Free($$); }

%type affix_tag { QueryNode * }
%destructor affix_tag { QueryNode_Free($$); }

%type suffix { QueryNode * }
%destructor suffix { QueryNode_Free($$); }

@@ -224,8 +227,8 @@ static void reportSyntaxError(QueryError *status, QueryToken* tok, const char *m
%type fuzzy { QueryNode *}
%destructor fuzzy { QueryNode_Free($$); }

%type tag_list { QueryNode *}
%destructor tag_list { QueryNode_Free($$); }
%type single_tag { QueryNode *}
%destructor single_tag { QueryNode_Free($$); }

%type geo_filter { QueryParam *}
%destructor geo_filter { QueryParam_Free($$); }
@@ -647,6 +650,22 @@ affix(A) ::= CONTAINS(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, true);
}

/////////////////////////////////////////////////////////////////
// Prefix expressions based on tags
/////////////////////////////////////////////////////////////////

affix_tag(A) ::= PREFIX_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, false);
}

affix_tag(A) ::= SUFFIX_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, false, true);
}

affix_tag(A) ::= CONTAINS_TAG(B) . {
A = NewPrefixNode_WithParams(ctx, &B, true, true);
}

// verbatim(A) ::= VERBATIM(B) . {
// A = NewVerbatimNode_WithParams(ctx, &B);
// }
@@ -731,10 +750,10 @@ expr(A) ::= ISEMPTY LP modifier(B) RP . {
}

/////////////////////////////////////////////////////////////////
// Tag Lists - curly braces separated lists of words
// Single Tag - tag enclosed in curly braces
/////////////////////////////////////////////////////////////////

expr(A) ::= modifier(B) COLON LB tag_list(C) RB . {
expr(A) ::= modifier(B) COLON LB single_tag(C) RB . {
if (!C) {
A = NULL;
} else {
@@ -751,44 +770,26 @@ expr(A) ::= modifier(B) COLON LB single_tag(C) RB . {
}
}

tag_list(A) ::= param_term_case(B) . [TAGLIST] {
single_tag(A) ::= ATTRIBUTE(B) . {
A = NewPhraseNode(0);
B.type = QT_PARAM_TERM_CASE;
QueryNode_AddChild(A, NewTokenNode_WithParams(ctx, &B));
}

tag_list(A) ::= affix(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= verbatim(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= termlist(B) . [TAGLIST] {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= tag_list(B) OR param_term_case(C) . [TAGLIST] {
QueryNode_AddChild(B, NewTokenNode_WithParams(ctx, &C));
A = B;
}

tag_list(A) ::= tag_list(B) OR affix(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= UNESCAPED_TAG(B) . {
A = NewPhraseNode(0);
B.type = QT_TERM_CASE;
QueryNode_AddChild(A, NewTokenNode_WithParams(ctx, &B));
}

tag_list(A) ::= tag_list(B) OR verbatim(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= affix_tag(B) . {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

tag_list(A) ::= tag_list(B) OR termlist(C) . [TAGLIST] {
QueryNode_AddChild(B, C);
A = B;
single_tag(A) ::= verbatim(B) . {
A = NewPhraseNode(0);
QueryNode_AddChild(A, B);
}

/////////////////////////////////////////////////////////////////
@@ -918,7 +919,7 @@ geo_filter(A) ::= LSQB param_num(B) param_num(C) param_num(D) param_term(E) RSQB
}

/////////////////////////////////////////////////////////////////
// Geomtriy Queries
// Geometry Queries
/////////////////////////////////////////////////////////////////
expr(A) ::= modifier(B) COLON geometry_query(C). {
if (C) {
@@ -1153,6 +1154,10 @@ term(A) ::= SIZE(B). {
A = B;
}

term(A) ::= UNESCAPED_TAG(B) . {
A = B;
}

///////////////////////////////////////////////////////////////////////////////////
// Parameterized Primitives (actual numeric or string, or a parameter/placeholder)
///////////////////////////////////////////////////////////////////////////////////
37 changes: 28 additions & 9 deletions tests/cpptests/test_cpp_query.cpp
@@ -102,7 +102,7 @@ TEST_F(QueryTest, testParser_delta) {
assertValidQuery_v(2, "(*)");
assertValidQuery_v(5, "(*)");

// params are avalible from version 2.
// params are available from version 2.
assertInvalidQuery_v(1, "$hello");
assertValidQuery_v(2, "$hello");
assertValidQuery_v(5, "$hello");
@@ -121,7 +121,6 @@
assertInvalidQuery_v(5, "@title:(@num:[0 10])");
assertInvalidQuery_v(5, "@t1:@t2:@t3:hello");


// minor bug in v1
assertValidQuery_v(1, "@title:{foo}}}}}");
assertInvalidQuery_v(2, "@title:{foo}}}}}");
@@ -145,6 +144,7 @@
assertValidQuery_v(5, "-@foo:as");

// From version 5, punctuation chars are not ignored, they should be escaped
// to be part of the term
assertValidQuery_v(1,"hello world&good");
assertValidQuery_v(2,"hello world&good");
assertInvalidQuery_v(5,"hello world&good");
@@ -447,15 +447,33 @@ TEST_F(QueryTest, testParser_v3) {
// Tag queries
assertValidQuery("@tags:{foo}", ctx);
assertValidQuery("@tags:{foo|bar baz|boo}", ctx);
assertValidQuery("@tags:{foo|bar\\ baz|boo}", ctx);
// Invalid: only the 'tag_invalid_punct' characters should be escaped, ' ' is not one of them
assertInvalidQuery("@tags:{foo|bar\\ baz|boo}", ctx);
assertValidQuery("@tags:{foo*}", ctx);
assertValidQuery("@tags:{foo\\-*}", ctx);
// Invalid: only the 'tag_invalid_punct' characters should be escaped, '-' is not one of them
assertInvalidQuery("@tags:{foo\\-*}", ctx);
assertInvalidQuery("@tags:{*foo\\-}", ctx);
assertInvalidQuery("@tags:{*foo\\-*}", ctx);
assertValidQuery("@tags:{foo-*}", ctx);
assertValidQuery("@tags:{*foo-}", ctx);
assertValidQuery("@tags:{*foo-*}", ctx);
assertValidQuery("@tags:{bar | foo*}", ctx);
assertValidQuery("@tags:{bar* | foo}", ctx);
assertValidQuery("@tags:{bar* | foo*}", ctx);

// Invalid: the '*' in the middle of the tag should be escaped.
// See 'tag_invalid_punct' in parser.rl
assertInvalidQuery("@tags:{bar* | foo}", ctx);
// Valid: the '*' in the middle was escaped
assertValidQuery("@tags:{bar\\* | foo}", ctx);
assertValidQuery("@tags:{bar*} | @tags:{foo}", ctx);
// Invalid: the '*' in the middle of the tag should be escaped
assertInvalidQuery("@tags:{bar* | foo*}", ctx);
assertValidQuery("@tags:{bar*} | @tags:{foo*}", ctx);

// Invalid: the '}'s should be escaped. See 'tag_invalid_punct' in parser.rl
assertInvalidQuery("@title:{foo}}}}}", ctx);
assertInvalidQuery("@title:{{{{{foo}", ctx);
// Valid: the '}' except the last one were escaped
assertValidQuery("@title:{foo\\}\\}\\}\\}}", ctx);
// Valid: the '{' does not need to be escaped, it is not part of 'tag_invalid_punct'
assertValidQuery("@title:{{{{{foo}", ctx);
assertInvalidQuery("@tags:{foo|bar\\ baz|}", ctx);
assertInvalidQuery("@tags:{foo|bar\\ baz|", ctx);
assertInvalidQuery("{foo|bar\\ baz}", ctx);
@@ -549,7 +567,8 @@ TEST_F(QueryTest, testParser_v3) {
assertValidQuery("@hello:[0 10]=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(@tit_le|bo_dy:barack @body|title|url|something_else:obama)=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(-hello ~world ~war)=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("@tags:{bar* | foo}=>[KNN 10 @vec_field $BLOB]", ctx);
assertInvalidQuery("@tags:{bar* | foo}=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(@tags:{bar*} | @tags:{foo})=>[KNN 10 @vec_field $BLOB]", ctx);
assertValidQuery("(no -as) => {$weight: 0.5} => [KNN 10 @vec_field $BLOB]", ctx);

// Invalid complex queries with hybrid vector
72 changes: 67 additions & 5 deletions tests/pytests/test_aggregate_params.py
@@ -72,11 +72,11 @@ def test_apply(env):
conn.execute_command('HSET', 'dkey:3', 'name', 'Perrito', 'breed', 'poodle', 'code', 'ca?33-22')
conn.execute_command('HSET', 'dkey:4', 'name', 'Lou Dog', 'breed', 'Dalmatian', 'code', 'ca:99-##')
conn.execute_command('HSET', 'dkey:5', 'name', 'dipper', 'breed', 'dalmatian', 'code', 'ca:99-##')
conn.execute_command('HSET', 'dkey:6', 'name', 'Duff', 'breed', 'Dalmatian', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:7', 'name', 'Triumph', 'breed', 'Mountain Hound', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:8', 'name', 'Chuck', 'breed', 'Saluki', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:9', 'name', 'Tuk', 'breed', 'Husky', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:10', 'name', 'Jul', 'breed', 'St. Bernard', 'code', 'gp-33-22')
conn.execute_command('HSET', 'dkey:6', 'name', 'Duff', 'breed', 'Dalmatian', 'code', 'gg-*33-22')
conn.execute_command('HSET', 'dkey:7', 'name', 'Triumph', 'breed', 'Mountain Hound', 'code', 'gg-*33-22')
conn.execute_command('HSET', 'dkey:8', 'name', 'Chuck', 'breed', 'Saluki', 'code', '*gg-33-22')
conn.execute_command('HSET', 'dkey:9', 'name', 'Tuk', 'breed', 'Husky', 'code', '*gg-33-22')
conn.execute_command('HSET', 'dkey:10', 'name', 'Jul', 'breed', 'St. Bernard', 'code', 'gg-33-22')

res1 = env.cmd('ft.aggregate', 'idx', '@breed:(Dal*|Poo*|Ru*|Mo*)', 'LOAD', '2', '@name', '@breed', 'FILTER', 'exists(@breed)', 'APPLY', 'upper(@name)', 'AS', 'n', 'APPLY', 'upper(@breed)', 'AS', 'b', 'SORTBY', '4', '@b', 'ASC', '@n', 'ASC')
res2 = env.cmd('ft.aggregate', 'idx', '@breed:($p1*|$p2*|$p3*|$p4*)', 'LOAD', '2', '@name', '@breed', 'FILTER', 'exists(@breed)', 'APPLY', 'upper(@name)', 'AS', 'n', 'APPLY', 'upper(@breed)', 'AS', 'b', 'SORTBY', '4', '@b', 'ASC', '@n', 'ASC', 'PARAMS', '8', 'p1', 'Dal', 'p2', 'Poo', 'p3', 'Ru', 'p4', 'Mo')
@@ -85,3 +85,65 @@ def test_apply(env):
res2 = env.cmd('ft.aggregate', 'idx', '@breed:($p1*|$p2*|$p3*|$p4*)', 'PARAMS', '8', 'p1', 'Dal', 'p2', 'Poo', 'p3', 'Ru', 'p4', 'Mo', 'SORTBY', '1', '@name')
env.assertEqual(res2, res1)

# Tag autoescaping
if dialect == 5:
res1 = env.cmd('ft.aggregate', 'idx', '@code:{ca?33-22}',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'ca?33-22', 'total', '3']])
res2 = env.cmd('ft.aggregate', 'idx', '@code:{$p1}',
'PARAMS', '2', 'p1', 'ca?33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

res1 = env.cmd('ft.aggregate', 'idx', "@code:{*:99-##}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'ca:99-##', 'total', '2']])
res2 = env.cmd('ft.aggregate', 'idx', "@code:{*$p1}",
'PARAMS', '2', 'p1', ':99-##',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# ------------------------------------------------------------------
# Tests with literal '*' being a part of the tag
# ------------------------------------------------------------------
# full match, the '*' is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{\\*gg-33-22}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', '*gg-33-22', 'total', '2']])
# full match, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{$p1}",
'PARAMS', '2', 'p1', '*gg-33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# prefix, the '*' in the middle is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{gg-\\*33-*}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'gg-*33-22', 'total', '2']])
# prefix, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{$p1*}",
'PARAMS', '2', 'p1', 'gg-*33-',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

# suffix, the second '*' is escaped to be literal
res1 = env.cmd('ft.aggregate', 'idx', "@code:{*\\*33-22}",
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res1, [1, ['code', 'gg-*33-22', 'total', '2']])
# suffix, the '*' is part of the PARAM, so it's not escaped
res2 = env.cmd('ft.aggregate', 'idx', "@code:{*$p1}",
'PARAMS', '2', 'p1', '*33-22',
'GROUPBY', '1', '@code',
'REDUCE', 'COUNT', 0, 'AS', 'total')
env.assertEqual(res2, res1)

