refactor: try upgrade regex-automata #3575

Merged
5 commits merged into GreptimeTeam:main from upgrade-regex-automata on Mar 26, 2024

tisonkun
Contributor

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This refers to #3043.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun requested a review from zhongzc March 25, 2024 04:03
@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Mar 25, 2024
@tisonkun
Contributor Author

tisonkun commented Mar 25, 2024

There seem to be some internal implementation issues; the following test cases failed:

index inverted_index::search::fst_apply::intersection_apply::tests::test_intersection_fst_applier_with_valid_pattern
mito2 sst::index::creator::tests::test_create_and_query_regex

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>

codecov bot commented Mar 25, 2024

Codecov Report

Attention: Patch coverage is 96.87500%, with 1 line in your changes missing coverage. Please review.

Project coverage is 84.88%. Comparing base (2b2fd80) to head (8b9d6f8).
Report is 4 commits behind head on main.

❗ Current head 8b9d6f8 differs from pull request most recent head d46b413. Consider uploading reports for the commit d46b413 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3575      +/-   ##
==========================================
- Coverage   85.41%   84.88%   -0.53%     
==========================================
  Files         911      917       +6     
  Lines      152425   152890     +465     
==========================================
- Hits       130195   129784     -411     
- Misses      22230    23106     +876     

Contributor

@zhongzc zhongzc left a comment

Great job!

@tisonkun
Contributor Author

tisonkun commented Mar 25, 2024

@zhongzc So far it only compiles and passes the tests. I suggest you run some end-to-end tests wherever the index functionality depends on it.

Also, gluing regex-automata and fst together involves "extra" checks. I have not tested the performance after this change.

@zhongzc
Contributor

zhongzc commented Mar 25, 2024

> Also, gluing regex-automata and fst together involves "extra" checks. I have not tested the performance after this change.

I'm not currently motivated to add automated tests for the performance of fst matching against regex patterns. In my manual end-to-end benchmark, it did not have a critical impact.

> I suggest you run some end-to-end tests wherever the index functionality depends on it.

Once we have e2e testing, I will add enough test cases for the index.

@tisonkun
Contributor Author

tisonkun commented Mar 25, 2024

@zhongzc Thank you! This makes sense to me.

Could you add some other reviewers to this PR? I don't know who else is familiar with this logic.

@zhongzc
Contributor

zhongzc commented Mar 25, 2024

I feel it would be more correct to write it like this; for reference, see https://github.com/BurntSushi/regex-automata/blob/0ba880134d649866fa15809dec9c6eae89cd7591/src/dfa/transducer.rs#L41-L74

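// Imports assume regex-automata 0.4.
use regex_automata::dfa::Automaton;
use regex_automata::util::{primitives::StateID, start::Config};

impl fst::Automaton for DfaFstAutomaton {
    type State = StateID;

    #[inline]
    fn start(&self) -> Self::State {
        // Default config: unanchored search, no look-behind byte.
        let config = Config::new();
        self.0.start_state(&config).unwrap()
    }

    #[inline]
    fn is_match(&self, state: &Self::State) -> bool {
        self.0.is_match_state(*state)
    }

    #[inline]
    fn can_match(&self, state: &Self::State) -> bool {
        // Once the DFA is dead, no further input can produce a match.
        !self.0.is_dead_state(*state)
    }

    #[inline]
    fn accept_eof(&self, state: &Self::State) -> Option<Self::State> {
        // A match that was already found stays a match at end of input.
        if self.0.is_match_state(*state) {
            return Some(*state);
        }
        Some(self.0.next_eoi_state(*state))
    }

    #[inline]
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State {
        // Keep a match state sticky so later bytes cannot leave it.
        if self.0.is_match_state(*state) {
            return *state;
        }
        self.0.next_state(*state, byte)
    }
}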
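
To see why the early return on a match state matters, here is a minimal std-only sketch. It does not use the real fst or regex-automata crates: `KeyAutomaton`, `ToyDfa`, `ToyState`, `StickyMatch`, `key_matches`, and `raw_matches` are all hypothetical names made up for illustration. The toy DFA searches for `ab` anywhere in a key and, like a regex-automata DFA, reports a match one transition late. The driver is only a simplified approximation of how a key is walked byte by byte.

```rust
/// Hypothetical stand-in for the parts of `fst::Automaton` used above.
trait KeyAutomaton {
    type State: Copy;
    fn start(&self) -> Self::State;
    fn is_match(&self, state: &Self::State) -> bool;
    fn accept(&self, state: &Self::State, byte: u8) -> Self::State;
    fn accept_eof(&self, state: &Self::State) -> Option<Self::State>;
}

/// States of a toy unanchored DFA for the literal pattern `ab`.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ToyState {
    Start,
    SeenA,
    Pending, // "ab" was seen; the match is reported on the next transition
    Match,
}

struct ToyDfa;

impl ToyDfa {
    /// Raw transition. Note that it leaves `Match` again on further input.
    fn next_state(&self, s: ToyState, byte: u8) -> ToyState {
        match (s, byte) {
            (ToyState::Pending, _) => ToyState::Match,
            (ToyState::SeenA, b'b') => ToyState::Pending,
            (_, b'a') => ToyState::SeenA,
            _ => ToyState::Start,
        }
    }

    /// Raw end-of-input transition: flushes a pending match.
    fn next_eoi_state(&self, s: ToyState) -> ToyState {
        if s == ToyState::Pending { ToyState::Match } else { s }
    }

    fn is_match_state(&self, s: ToyState) -> bool {
        s == ToyState::Match
    }
}

/// Wrapper with the same "sticky match" checks as the impl above.
struct StickyMatch(ToyDfa);

impl KeyAutomaton for StickyMatch {
    type State = ToyState;

    fn start(&self) -> ToyState {
        ToyState::Start
    }

    fn is_match(&self, s: &ToyState) -> bool {
        self.0.is_match_state(*s)
    }

    fn accept(&self, s: &ToyState, byte: u8) -> ToyState {
        if self.0.is_match_state(*s) {
            return *s; // stay in the match state
        }
        self.0.next_state(*s, byte)
    }

    fn accept_eof(&self, s: &ToyState) -> Option<ToyState> {
        if self.0.is_match_state(*s) {
            return Some(*s);
        }
        Some(self.0.next_eoi_state(*s))
    }
}

/// Simplified driver: feed every byte, then the end-of-input transition.
fn key_matches<A: KeyAutomaton>(aut: &A, key: &[u8]) -> bool {
    let mut state = aut.start();
    for &b in key {
        state = aut.accept(&state, b);
    }
    let end = aut.accept_eof(&state).unwrap_or(state);
    aut.is_match(&end)
}

/// Same walk on the raw DFA, without the sticky checks, for contrast.
fn raw_matches(dfa: &ToyDfa, key: &[u8]) -> bool {
    let mut state = ToyState::Start;
    for &b in key {
        state = dfa.next_state(state, b);
    }
    dfa.is_match_state(dfa.next_eoi_state(state))
}
```

With this sketch, `key_matches(&StickyMatch(ToyDfa), b"xabyz")` is true, while `raw_matches(&ToyDfa, b"xabyz")` is false because the raw DFA moves out of the match state when it reads `z`.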

@tisonkun
Contributor Author

@zhongzc Cool. Let me try it out.

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Contributor Author

It works. BurntSushi gives a more complete implementation in https://github.com/BurntSushi/aho-corasick/blob/56256dca1bcd2365fd1dc987c1c06195429a2e2c/src/transducer.rs, which handles the anchored and unanchored cases separately.

In our case, it should always be unanchored, meaning we don't require the pattern to match the whole string; matching part of the string is enough.
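
As a rough illustration of the distinction, here is a std-only sketch that uses plain byte literals instead of regexes. `Mode` and `literal_matches` are hypothetical names, not part of regex-automata, fst, or aho-corasick. It assumes "anchored" means the match must begin at the first byte of the key, while "unanchored" lets it begin anywhere.

```rust
/// Hypothetical search modes, for illustration only.
#[derive(Clone, Copy, Debug)]
enum Mode {
    /// The match must begin at the first byte of the key.
    Anchored,
    /// The match may begin at any position in the key.
    Unanchored,
}

/// Reports whether the literal `pattern` matches `key` under `mode`.
fn literal_matches(pattern: &[u8], key: &[u8], mode: Mode) -> bool {
    if pattern.is_empty() {
        // The empty pattern matches everywhere; this also avoids
        // calling `windows(0)`, which would panic.
        return true;
    }
    match mode {
        Mode::Anchored => key.starts_with(pattern),
        Mode::Unanchored => key.windows(pattern.len()).any(|w| w == pattern),
    }
}
```

Under these assumptions, the key `xabc` matches the pattern `ab` only in `Mode::Unanchored`, which is the behavior described above for index lookups.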

Contributor

@evenyag evenyag left a comment

Mostly LGTM

@tisonkun tisonkun added this pull request to the merge queue Mar 26, 2024
Merged via the queue into GreptimeTeam:main with commit 7c1c6e8 Mar 26, 2024
17 checks passed
@tisonkun tisonkun deleted the upgrade-regex-automata branch March 26, 2024 04:42
Labels
docs-not-required This change does not impact docs.
3 participants