Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DFAs can take wrong actions due to ambiguities #48

Closed
jakobnissen opened this issue May 22, 2020 · 3 comments
Closed

DFAs can take wrong actions due to ambiguities #48

jakobnissen opened this issue May 22, 2020 · 3 comments

Comments

@jakobnissen
Copy link
Member

Consider the following, highly simplified FASTA format:

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

machine = (function ()
    header = re">[A-Z]+"
    seqline = re"[A-Z]+"
    sequence = seqline * re.rep(re"\n" * seqline)
    record = header * re"\n" * sequence
    fasta = re.rep1(record * re"\n")

    header.actions[:exit] = [:header]
    seqline.actions[:exit] = [:seqline]
    record.actions[:exit] = [:record]

    Automa.compile(fasta)
end)()

By definition of record, a record MUST contain a header, which contains a >. However see the flowchart produced for this machine by:

write("fasta.dot", Automa.machine2dot(machine))
run(`dot -Tsvg -o fasta.svg fasta.dot`)

The state 5 corresponds to seqline, and a following newline brings it to state 6. In this state transition, the action record is executed. But clearly, from the defition, we did not necessarily exit record! This creates a loop between state 5 and 6, such that record is executed at every newline.

This means that record can be repeatedly exited, without it being entered, and allows for e.g. a record only containing seqline, against the definition above.

@jakobnissen
Copy link
Member Author

jakobnissen commented May 25, 2020

Simplified to

machine = let
    C = re.rep(re"A" * re"B")
    D = C * re"A"
    C.actions[:enter] = [:enter]
    C.actions[:exit] = [:exit]
    Automa.compile(D)
end

Note in particular that the C pattern is entered once, and then can be exited an arbitrary amount of times with an input like "ABABAB".

@jakobnissen
Copy link
Member Author

jakobnissen commented May 25, 2020

Right, so I think I've figured out the problem. Let's have a simple regex C:

A = re"XZ"
B = re"XY"
C = A | B
A.actions[:enter] = [:entering_A]
B.actions[:enter] = [:entering_B]
A.actions[:exit] = [:exiting_A]
B.actions[:exit] = [:exiting_B]

Now, create an NFA using Automa.re2nfa(C) and visualize it. This NFA represents those regexes and their actions perfectly. Note that this NFA is truly nondeterministic; when encountering an "X" byte, it is impossible to know if we are entering the A regex or the B regex. In the NFA, this is visualized by two different epsilon edges, one that leads to A and one that leads to B. Only the one that leads to A have the action entering_A associated with it, and vice versa for B.

Now convert the NFA to a DFA. The resulting DFA is equivalent to the NFA only in the sense that it recognizes the same inputs. The edges are NOT the same, and consequently, the two epsilon edges, labeled with entering_A and entering_B are collapsed, and important information is lost. The result is that when we encounter an "X" byte, the regex enters BOTH A and B. Notably, at end of file, only ONE of these regexes are exited (which is clearly absurd)

This is impossible to solve, I think, but it would really be nice if there was some sort of warning or even an error if attempting to convert an NFA that had actions associated to ambiguous transitions to a DFA.

Edit: This can be done by modifying the epsilon_closure function. If you can go multiple paths from one node to another node only moving through epsilon edges, and not all these paths have the same labeled edges, then the machine is ambiguous and should not be compiled.

@jakobnissen jakobnissen changed the title Possible bug in state transitions DFAs can take wrong actions due to ambiguities May 25, 2020
@jakobnissen
Copy link
Member Author

Solved by #49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant