Add a tokenize function. #1479

nbp · 2017-07-22T15:28:54Z

Previously we could build a tokenizer by abusing the construction of a large regexp, but since the Nix 1.12 changes we are limited in the number of states a regex can create. Because std::regex_match only performs full matches, that leaves us with quadratic algorithms.

Tokenizers look for a finite vocabulary, and thus need only a limited set of states. We could add a tokenize function which, given a regex, iterates over contiguous token matches and either returns a list of the matches or offers a way to fold over the matches as they appear.

Doing so would help solve the problem seen at mozilla/nixpkgs-mozilla#40, which currently has no good solution.

One way to implement it would be to use std::regex_iterator (http://en.cppreference.com/w/cpp/regex/regex_iterator/regex_iterator), possibly passing the std::regex_constants::match_continuous flag.
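To make the suggestion concrete, here is a minimal sketch of the list-returning variant, written as plain C++ outside of Nix. The function name `tokenize` and its signature are assumptions for illustration only, not an existing Nix primitive; only `std::sregex_iterator` and `std::regex_constants::match_continuous` come from the standard library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Nix): collect the tokens that match
// `tokenRe` back to back, starting at the beginning of `input`.
std::vector<std::string> tokenize(const std::regex & tokenRe, const std::string & input)
{
    std::vector<std::string> tokens;

    // match_continuous forces every match to begin exactly where the
    // search starts, i.e. right after the previous token.
    const auto flags = std::regex_constants::match_continuous;

    std::sregex_iterator it(input.begin(), input.end(), tokenRe, flags);
    const std::sregex_iterator end;

    for (; it != end; ++it) {
        // Stop on an empty match so that no empty tokens are emitted.
        if (it->length() == 0) break;
        tokens.push_back(it->str());
    }

    // Iteration ends at the first position where no token matches, so the
    // caller can detect leftover input by summing the token lengths.
    return tokens;
}
```

With the regex `[a-z]+|[0-9]+|=| +`, the input `x = 42` would yield the five tokens `x`, a space, `=`, a space, and `42`, while `ab#cd` would yield only `ab` because nothing matches at `#`. A folding variant could take a callback in place of the result vector.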


taktoa · 2017-08-02T20:35:15Z

Relevant: #1491

nbp mentioned this issue Jul 22, 2017

parseTOML regex too large for Nix 1.12 mozilla/nixpkgs-mozilla#40

Closed

This was referenced Aug 15, 2017

Add an Earley parser builtin #1491

Open

Add builtins.split function. #1516

Merged

edolstra closed this as completed in #1516 Aug 16, 2017
