New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POSIX submatch semantics is broken in two (rare) cases. #12

Open
skvadrik opened this Issue Aug 12, 2017 · 2 comments

Comments

Projects
None yet
3 participants
@skvadrik

skvadrik commented Aug 12, 2017

As I've been working on the implementation of POSIX-compliant capturing groups in RE2C (http://re2c.org), I discovered a couple of bugs in Regex-TDFA. I found them when fuzz-testing my implementation against Regex-TDFA-1.2.2. The algorithm used in RE2C is described in detail in the following paper:

http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

It is a slightly modified version of Laurikari algorithm. POSIX submatch semantics is due to Kuklewicz: https://wiki.haskell.org/index.php?title=Regular_expressions/Bounded_space_proposal&oldid=11475 ,
but I made an attempt on formalizing Kuklewicz algorithm (also described in the paper). The reported bugs are rare (fuzzer found them approximately once in 50000 runs), so they are probably caused by some mis-optimization.

The first bug can be triggered by regular expression (((a*)|b)|b)+ and input string ab: Regex-TDFA returns incorrect submatch result for second capturing group ((a*)|b) (no match instead of b at offset 1). Some alternative regular expressions that cause the same error: (((a*)|b)|b){1,2}, ((b|(a*))|b)+.

$ ghci
GHCi, version 8.0.2: http://www.haskell.org/ghc/  :? for help
Prelude> import Text.Regex.TDFA as T

The error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "ab"  T.=~ "^(((a*)|b)|b)*" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(-1,0))]]

The same error:

Prelude T> "ab"  T.=~ "^((b|(a*))|b)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

Again, the same error:

Prelude T> "ab"  T.=~ "^(((a*)|b)|b){1,2}" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]

The second bug can be triggered by regular expression ((a?)(())*|a)+ and input string aa. Incorrect result is for second group (a?) (no match instead of a at offset 1), third group (()) and fourth group () (no match instead of empty match at offset 2). Alternative variant that also fails: ((a?()?)|a)+.

The error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a)+" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0)),(4,(-1,0))]]

But not with * (the example below works correctly!):

Prelude T> "aa"  T.=~ "^((a?)(())*|a)*" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?)(())*|a){1,2}" :: [MatchArray]
[array (0,4) [(0,(0,2)),(1,(1,1)),(2,(1,1)),(3,(2,0)),(4,(2,0))]]

The same error:

Prelude T> "aa"  T.=~ "^((a?()?)|a)+" :: [MatchArray]
[array (0,3) [(0,(0,2)),(1,(1,1)),(2,(-1,0)),(3,(-1,0))]]
@neongreen

This comment has been minimized.

Show comment
Hide comment
@neongreen

neongreen Mar 10, 2018

Collaborator

(NB. I'm not the author of the algorithm and unfortunately I don't have the time to delve into this. PRs are welcome.)

Collaborator

neongreen commented Mar 10, 2018

(NB. I'm not the author of the algorithm and unfortunately I don't have the time to delve into this. PRs are welcome.)

@ChrisKuklewicz

This comment has been minimized.

Show comment
Hide comment
@ChrisKuklewicz

ChrisKuklewicz Mar 11, 2018

Owner
Owner

ChrisKuklewicz commented Mar 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment