Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added regexpMathScalar #15522

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

Lyra126
Copy link

@Lyra126 Lyra126 commented Feb 6, 2024

Summary of the changes / Why this improves CrateDB

Added RegexpMatchScalar

Submitting this pull request as a solution to the following issue: #15117

  • Created a class, implementing it similar to the existing RegexpMatchOperator.
  • Instead inherited it from Scalar instead of Operator
  • Had it support the use of flags.
  • Updated ScalarFunctionModule.java to register it as a function

Checklist

  • Added an entry in the latest docs/appendices/release-notes/<x.y.0>.rst for user facing changes
  • Updated documentation & sql_features table for user facing changes
  • Touched code is covered by tests
  • [✓] CLA is signed
  • [✓] This does not contain breaking changes, or if it does:
    • It is released within a major release
    • It is recorded in the latest docs/appendices/release-notes/<x.y.0>.rst
    • It was marked as deprecated in an earlier release if possible
    • You've thought about the consequences and other components are adapted
      (E.g. AdminUI)

@crate-jenkins
Copy link
Collaborator

Can one of the admins verify this patch?

@matriv matriv requested a review from seut February 8, 2024 11:47
@seut
Copy link
Member

seut commented Feb 8, 2024

@Lyra126
Thank you for this contribution.
Probably I confused you by my comment #15117 (comment) that the scalar is almost equal to the existing operator. But the scalar should return the matched values, so the return type should be a array of strings ->DataTypes.STRING_ARRAY.
See also the example in the original issue request, this example must work.

Copy link
Member

@seut seut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add related tests, best by extending from ScalarTestCase please? thanks!


public RegexpMatchScalar(Signature signature, BoundSignature boundSignature) {
super(signature, boundSignature);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a compile implementation? By this, any literal pattern will be compiled and flags will be parsed only once and not on each row, resulting in performance improvements.

Comment on lines 96 to 101
int flagSeparatorIndex = pattern.lastIndexOf('/');
if (flagSeparatorIndex != -1 && flagSeparatorIndex < pattern.length() - 1) {
pattern = pattern.substring(0, flagSeparatorIndex);
flagsStr = pattern.substring(flagSeparatorIndex + 1);
flags = parseFlags(flagsStr);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any optional flag would be the 3rd argument and not part of the pattern, see https://www.postgresql.org/docs/current/functions-matching.html.
To support this, a custom toQuery(Function function, LuceneQueryBuilder.Context context) would be needed to take the 3rd argument into account.

@Lyra126
Copy link
Author

Lyra126 commented Feb 24, 2024

Thank you for your continuous review and help with this issue! This is my first time contributing, and I appreciate the help and clarification. Let me know if I'm missing anything else.


@Test
public void testNormalize() throws Exception {
assertNormalize("'foo bar' ~ '([A-Z][^ ]+ ?){2}'", isLiteral(true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are testing the existing operator and not your new scalar.
Not exactly sure what you want to test here but this expression will fail in anyway as the regex won't match.
I highly recommend to run your tests yourself ;)

Suggested change
assertNormalize("'foo bar' ~ '([A-Z][^ ]+ ?){2}'", isLiteral(true));
assertNormalize("regexp_match('foo bar', '([A-Z][^ ]+ ?){2}')", isLiteral());

compilePattern(patternString, flags);
}

private void compilePattern(String patternString, int flags) throws PatternSyntaxException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not implementing the interface's compile method.. Please check existing implementations to understand how it is used.


List<String> matches = new ArrayList<>();
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile(pattern).matcher(source);
while (matcher.find()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the PG documentation, it should return null if no match is found.
Additionally, the results of the all groups in the current match should be returned, not the first group only but for all global matches (the global flag 'g' must no be supported by this method).


// Check if the pattern contains flags
String flagsStr = null;
int flagSeparatorIndex = pattern.lastIndexOf('/');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder where the flags are added to the pattern by the / separator? I think the existing toQuery implementations won't work as they only support 2 arguments, a reference and a literal, but not 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants