Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Starrocks regexp function support CASE_INSENSITIVE #14439

Open
jxzdoing opened this issue Dec 1, 2022 · 26 comments
Open

[Feature] Starrocks regexp function support CASE_INSENSITIVE #14439

jxzdoing opened this issue Dec 1, 2022 · 26 comments
Assignees

Comments

@jxzdoing
Copy link

jxzdoing commented Dec 1, 2022

Feature request

Now,Starrocks function "regexp" match content CASE_SENSITIVE,But we need to ignore case when matching in our scenario
should we add a new function such as named "regexp_ignore" or rewrite the function "regexp" add a default parameter "case_sensitive",if we don't pass the parameter,default is CASE_INSENSITIVE,if we pass value mean CASE_INSENSITIVE,sunch as:
select regexp("mac","MAC",1);
return 1
select regexp("mac","MAC");
return 0

@wangsimo0
Copy link
Contributor

wangsimo0 commented Dec 5, 2022

Cause regexp in SR support regular expression, so you can write sth like select regexp('mac','[MACmac]'); that match string that either is mac or MAC.
And I know this is very limited and not easy to use, what you recommend is great, we'll do some further research.

@wangsimo0
Copy link
Contributor

also like function has the same problem. Better to add another ilike function to support case-insensitive scenario.

@wangsimo0
Copy link
Contributor

wangsimo0 commented Dec 5, 2022

As jxzdoing mentioned, the regexp better add new optional parameter:

  • c: Case-sensitive matching (the default).
  • i: Case-insensitive matching.

that means:

select regexp('mac','MAC','i');
+----------------------+
| regexp('mac', 'MAC','i') |
+----------------------+
|                    1 |
+----------------------+

select regexp('mac','MAC','c');
+----------------------+
| regexp('mac', 'MAC','c') |
+----------------------+
|                    0 |
+----------------------+

select regexp('mac','MAC');
+----------------------+
| regexp('mac', 'MAC','c') |
+----------------------+
|                    0 |
+----------------------+

@numbernumberone
Copy link

Hi @wangsimo0 , please assign this to me. Thank you.

@numbernumberone
Copy link

As @jxzdoing said, or the ilike function that @wangsimo0 referred to should we add a new function such as named "regexp_ignore" to support case-insensitive scenario? I feel that this will look more intuitive and easier for users to understand.

such as

select regexp_ignore('mac','MAC');
+----------------------+
|                    1 |
+----------------------+

select regexp_ignore('max','MAC');
+----------------------+
|                    0 |
+----------------------+

select regexp('mac','MAC');
+----------------------+
|                    0 |
+----------------------+

@numbernumberone
Copy link

Then I will implement it according to the following method.

As jxzdoing mentioned, the regexp better add new optional parameter:

c: Case-sensitive matching (the default).
i: Case-insensitive matching.
that means:

select regexp('mac','MAC','i');
+----------------------+
| regexp('mac', 'MAC','i') |
+----------------------+
| 1 |
+----------------------+

select regexp('mac','MAC','c');
+----------------------+
| regexp('mac', 'MAC','c') |
+----------------------+
| 0 |
+----------------------+

select regexp('mac','MAC');
+----------------------+
| regexp('mac', 'MAC','c') |
+----------------------+
| 0 |
+----------------------+
ok ?

@wangsimo0
Copy link
Contributor

@numbernumberone That's great! Personally, I prefer the parameter design. There are other examples in other databases:
snowflake: https://docs.snowflake.com/en/sql-reference/functions/rlike
teradata:https://docs.teradata.com/r/Teradata-VantageTM-SQL-Functions-Expressions-and-Predicates/March-2019/Regular-Expression-Functions/Pattern-Matching-Limitation
oracle:https://docs.oracle.com/database/121/SQLRF/functions162.htm#SQLRF06300

does this make sense? or do you have any other recommendations?

@numbernumberone
Copy link

Thank you very much for the clear description. I followed the parameter mode.

@numbernumberone
Copy link

@wangsimo0 Teacher Wang, I am currently facing some difficulties and need to ask for your help. This is the code I submitted for this function. Can you help me identify where the problem is? I have been trying to debug the program for a long time but couldn't find a solution. Do you have any good suggestions for me? Thank you very much.

main...numbernumberone:starrocks:dev_20230422

@numbernumberone
Copy link

image

@numbernumberone
Copy link

2023-04-24 10:22:50,312 WARN (starrocks-mysql-nio-pool-0|120) [ConnectProcessor.handleQuery():369] Process one query failed. SQL: select regexp('mac','MAC','i'), because.
com.starrocks.common.AnalysisException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ',' at line 1
at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:325) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:462) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:728) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_362]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_362]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_362]

@packy92
Copy link
Contributor

packy92 commented Apr 24, 2023

You have done a good job at the function execution level.
For a new sql grammar, the first thing SR does is to require our grammar parser to support it. SR currently uses antlr to write grammar rules to automatically generate parser code. You can refer to these rules in the Starrocks.g4 file to see the difference between the definition of REGEXP fucntion and the other functions. In addition, when you say you have problem when debug, do you mean you cannot debug the FE code?

@numbernumberone
Copy link

Thank you very much director, the error message I provided is the error of the FE node. Logically, if I modify functions.py, will fe automatically generate the code for antlr syntax parsing?

    [60021, 'REGEXP', 'BOOLEAN', ['VARCHAR', 'VARCHAR', 'VARCHAR'], 'LikePredicate::regex', 'LikePredicate::regex_prepare',
     'LikePredicate::regex_close'],

@packy92
Copy link
Contributor

packy92 commented Apr 24, 2023

functions.py just help us to generate VectorizedBuiltinFunctions.java code which used to do some register fucntion job. The parser code generated by Starrocks.g4 help us to covert a sql string to an AST object.

@numbernumberone
Copy link

Thank you so much. Could I ask if there are any related reference materials or explanations? Because I didn't see any instructions for Starrocks.g4 in the official documentation.

such as:

https://github.com/StarRocks/starrocks/pull/1264/files

and

https://mp.weixin.qq.com/s/T8K6ci_sx51oGn8H5fSKTg

1 similar comment
@numbernumberone
Copy link

Thank you so much. Could I ask if there are any related reference materials or explanations? Because I didn't see any instructions for Starrocks.g4 in the official documentation.

such as:

https://github.com/StarRocks/starrocks/pull/1264/files

and

https://mp.weixin.qq.com/s/T8K6ci_sx51oGn8H5fSKTg

@packy92
Copy link
Contributor

packy92 commented Apr 24, 2023

For most common function, we have defined a common rule to parse them, but REGEXP function is a little different because its name is a keyword. You can get some information from this article

@numbernumberone
Copy link

I would like to ask for advice,Constant strings are not using regular expressions, but rather using string matching algorithms. So, in a case-insensitive scenario, should string matching be used instead of regular expression matching?

@packy92
Copy link
Contributor

packy92 commented Apr 25, 2023

Do you mean the second paramter is a constant string?

@numbernumberone
Copy link

yeah

 // The following four conditionals check if the pattern is a constant string,
    // starts with a constant string and is followed by any number of wildcard characters,
    // ends with a constant string and is preceded by any number of wildcard characters or
    // has a constant substring surrounded on both sides by any number of wildcard
    // characters. In any of these conditions, we can search for the pattern more
    // efficiently by using our own string match functions rather than regex matching.
    if (RE2::FullMatch(pattern_str, EQUALS_RE, &search_string)) {
        state->set_search_string(search_string);
        state->function = &constant_equals_fn;
    } else if (RE2::FullMatch(pattern_str, STARTS_WITH_RE, &search_string)) {
        state->set_search_string(search_string);
        state->function = &constant_starts_with_fn;
    } else if (RE2::FullMatch(pattern_str, ENDS_WITH_RE, &search_string)) {
        state->set_search_string(search_string);
        state->function = &constant_ends_with_fn;
    } else if (RE2::FullMatch(pattern_str, SUBSTRING_RE, &search_string)) {
        state->set_search_string(search_string);
        state->function = &constant_substring_fn;
    } else {
        RETURN_IF_ERROR(compile_with_hyperscan_or_re2<false>(pattern_str, state, context, pattern));
    }

@packy92
Copy link
Contributor

packy92 commented Apr 25, 2023

You can try it and test if your string case-insensitive matching algorithm is better than common regular expression matching algorithm.

@numbernumberone
Copy link

Ok, thank you very much

@wangsimo0 wangsimo0 added the good first issue Good for newcomers label May 12, 2023
Copy link

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

@numbernumberone
Copy link

Sorry, I just took some time out recently to continue working on it. It will be done soon

@RickWong
Copy link

RickWong commented Jan 11, 2024

Question, is a non-regexp version of ILIKE going to be made available?

Edit: Answering this for others, ILIKE is actually already available as an SQL function bool ilike(column, pattern), just not documented.

@xyan3880
Copy link

xyan3880 commented May 12, 2024

Sorry, I just took some time out recently to continue working on it. It will be done soon

It will be appreciated.

Also it seems regexp in starrocks handles "new line" differently in matching ".*" or ".*?" . Usually .* should not match new line.

In starrocks, "abc\nd" matches "abc.*d":

select regexp('abc\nd', 'abc.*d') a;
| a    |
|    1 |

Not in Mariadb:

MariaDB > select 'abc\nd' regexp 'abc.*d' a;
| a |
| 0 |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants