GitHub - ChrisKuklewicz/ocaml-regex-tfa: OCaml implementation of POSIX regular expressions with efficient tagged engine

To compile: run ./compile

This is my "learn OCaml the hard way" project. The project is to take the great and awesome Haskell regex-tdfa and achieve a similar goal in OCaml.

What does it do? POSIX ERE

It implements POSIX extended regular expressions. By following a proper specification: Without dark corners, without weird results with ambiguous patterns, and with a good starting test suite.

Why was my Haskell library great? Correctness

I have tested against a few other POSIX implementations and the AT&T one, including an expanded collection of hand written tests and via random fuzz testing against them. All of them failed at least some tests (results) -- except for my regex-tdfa which has fixed.

What was Haskell library awesome? Efficiency

By following the efficient tagged single-pass design behind the TRE library I was able to get space usage dependent on only the pattern, not the length of searched text, even while capturing the proper substring.

But then I discovered that TRE was and is quite buggy. Some serious bugs and and TRE does not follow the great spec for "ambiguous" patterns. I had to invent a new and more complicated algorithm to handle ambiguous patterns and improve it with a new kind of comparison pass to recover the constant space usage.

What is the new OCaml version? Redesigned

I am not porting the Haskell code, but rather creating a very different implementation of the tagged automata.

One big change is that the old Haskell design handled "(p){10,100}" by creating 100 copies of the pattern "(p)" in the automata while the new OCaml design keeps a single copy. The Haskell code builds an NFA into a DFA while the OCaml code uses an internal syntax tree as the automata state. This is a much clearer approach.

Is it done? Almost

The latest engine, simFlush.ml, does not do the optimizing comparison pass and so is not yet in true constant space usage. The potted history:

simulate.ml

"Depth first" implementation that generated all possible matches and then picks the correct one. Much happiness in that this passed the same unit tests as the Haskell version.

simStep.ml

Improved so that there is no bracktacking, possibilities are pruned at each step and complete patterns are reported promptly. A big map of saved contexts is used to keep track of possibilities. Selecting best winner at end passes unit tests. This really shows that the new handling of "(p){10,100}" patterns worked.

saveContext.ml

Unfinished and abandoned step. Took break and came back later.

simCont.ml

Worked the design out (see CODE-GUIDE.md) for improving simStep.ml, replacing the context map with "continue-to" fields in the tree. Each character leads to "two and a half" passes of the tree. Tests pass.

simFlush.ml

Store the state in mutable tree instead of a map -- this improves the traversal of the state. Each character leads to two passes of the tree. The tree traversal allows merging possibilities to be compared earlier which improves efficiency. ( For pattern size 'm' I think a simCont pass had O(m^2) behavior and simFlush has O(m) ).

to be written

Add the comparison pass to compress the state, this will change the worst case space usage for text of length 'n' from O(n) to O(1).

The compression of loop-histories needs to group the stored histories into the right cohorts. The Haskell chorts are quite broad but I think starting from simflush.ml I can make the new cohorts more narrow. The loop-history only matters when (the Repeat node has a variable length sub-pattern) and (histories have the same Repeat start) and (either (histories are looping back at the same position) or (histories have same Repeat stop)).

I think the flushUp from the sub-pattern of the Repeat catches (either (histories are looping back at the same position) or (histories have same Repeat stop)). Thus the cohort is already there and ready for compression.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
test		test
.gitignore		.gitignore
.ocamlinit		.ocamlinit
CODE-GUIDE.md		CODE-GUIDE.md
Common.ml		Common.ml
Common.mli		Common.mli
CommonRank.ml		CommonRank.ml
CommonRank.mli		CommonRank.mli
CorePattern.ml		CorePattern.ml
CorePattern.mli		CorePattern.mli
LICENSE		LICENSE
Main.ml		Main.ml
Monoid.ml		Monoid.ml
Monoid.mli		Monoid.mli
Pattern.ml		Pattern.ml
Pattern.mli		Pattern.mli
README.md		README.md
ReadPattern.ml		ReadPattern.ml
ReadPattern.mli		ReadPattern.mli
SimCont.ml		SimCont.ml
SimCont.mli		SimCont.mli
SimFlush.ml		SimFlush.ml
SimFlush.mli		SimFlush.mli
SimRank.ml		SimRank.ml
SimStep.ml		SimStep.ml
SimStep.mli		SimStep.mli
Simulate.ml		Simulate.ml
Simulate.mli		Simulate.mli
Testab.ml		Testab.ml
Testab.mli		Testab.mli
Testnull.ml		Testnull.ml
Testnull.mli		Testnull.mli
UnitTest.ml		UnitTest.ml
UnitTest.mli		UnitTest.mli
UnitTestCont.ml		UnitTestCont.ml
UnitTestCont.mli		UnitTestCont.mli
UnitTestFlush.ml		UnitTestFlush.ml
UnitTestFlush.mli		UnitTestFlush.mli
UnitTestRank.ml		UnitTestRank.ml
UnitTestRank.mli		UnitTestRank.mli
UnitTestStep.ml		UnitTestStep.ml
UnitTestStep.mli		UnitTestStep.mli
WhichTest.ml		WhichTest.ml
WhichTest.mli		WhichTest.mli
compile		compile
make_dep.sh		make_dep.sh
maker		maker
ocamlbrowser.sh		ocamlbrowser.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

License

ChrisKuklewicz/ocaml-regex-tfa

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages