Regexp as an Algebraic Data Type

Greg Lawson edited this page May 22, 2014 · 38 revisions

The Regexp data type in Ruby seems to be stuck at the level of abstraction of its ancestors in Perl 5. To be both easier to use and more powerful, it needs to support a higher level of abstraction and become more Ruby-like. I call this an Algebraic Data Type. In computer science, a data type (see Wikipedia) is a conceptually related set of data structures and functions. The Ruby core Regexp class (see its documentation) provides the minimal data structure and functions. In mathematics, an algebra (see Wikipedia) is a formalization of operators and predicates that draws on the intuitions we learned in high school algebra. The mathematical axioms for regular expressions are those of Kleene algebra (see Wikipedia).
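To make the algebra concrete, here is a minimal sketch of the three Kleene operators (alternation, concatenation, star) built on core Ruby only. The method names `kleene_alt`, `kleene_seq`, and `kleene_star` are hypothetical and chosen for illustration; they are not part of Ruby core or of the library described on this page.

```ruby
# Hypothetical helpers for illustration only; not Ruby core, not this project's API.

# Alternation (a + b in Kleene algebra): matches whatever a or b matches.
def kleene_alt(a, b)
  Regexp.union(a, b)
end

# Concatenation (a . b): a match of a followed immediately by a match of b.
# Interpolating a Regexp uses Regexp#to_s, which wraps it in (?-mix:...)
# so each operand keeps its own options and grouping.
def kleene_seq(a, b)
  /#{a}#{b}/
end

# Kleene star (a*): zero or more repetitions of a.
def kleene_star(a)
  /(?:#{a})*/
end

digit  = /[0-9]/
letter = /[a-z]/

# A letter followed by any number of letters or digits.
identifier = kleene_seq(letter, kleene_star(kleene_alt(letter, digit)))

# Anchors are added here only so that the whole string must match.
whole = /\A#{identifier}\z/
puts whole.match?("x42")
puts whole.match?("42x")
```

Because every operand is a Regexp rather than a String, no manual escaping or parenthesizing is needed when combining them, which is the kind of error the String-based approach invites.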

Additions to Regexp include:

  1. Syntactic sugar (see Wikipedia) for creating a Regexp from other Regexps, hiding the error-prone String implementation. See Regexp and RegexpTest.

  2. Regexp parsing to allow decomposition of a Regexp into its components. See regexp_parse, which is intended to replace class RegexpParse.

  3. Pretty-print Regexps so that metacharacters and escapes are unambiguous (Regexp#to_s and Regexp#inspect do not do this to my satisfaction).

  4. Pretty-print Regexps using the syntactic sugar of #1 above.

  5. Storing useful Regexps in databases for reuse. (See Stack Overflow for a discussion of security.)

  6. Allow Regexps to be composed of stored Regexps that can be improved independently. Aliases provide one mechanism. Grok does a simple version of this, using the syntax %{regexp:semantic_string}. Ruby Regexp POSIX character classes provide a mechanism that could be hijacked for macro expansion.

  7. Ordering Regexps by generalization / specialization or by probability of a random match.

  8. Finding the most specific Regexps that match one or more strings. See Generic_Type and GenericTypeTest

  9. Determining the data type of a column of data (most specific match that matches all the data).

  10. Automatically create a Hash for Regexp named captures. See Parse and ParseTest.

  11. Manage single matches (see Regexp#match) and multiple matches (see String#split). This is currently done by the instance methods of Parse. Since I'd rather use instances of Parse to store Regexps with specific options, I plan to split off a new Capture class from the Parse class. Also see Replication Expansion.

  12. Regexp debugging:

    1. given a failed match, suggest Regexp generalizations that do match,

    2. given a failed match, report the matching and non-matching components and substrings. See RegexpMatch and RegexpMatchTest.
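Two of the items above (the named-capture Hash of #10 and the most-specific-match search of #8 and #9) can be sketched with core Ruby alone. This is only an illustration under stated assumptions: `capture_hash`, `candidate_types`, and `most_specific_type` are hypothetical names, not the Parse or Generic_Type classes, and the candidate list and its ordering are invented for the example.

```ruby
# Hypothetical helpers for illustration; not the Parse or Generic_Type classes.

# Item 10: return the named captures as a Hash with symbol keys,
# or nil when the Regexp does not match.
def capture_hash(regexp, string)
  match = regexp.match(string)
  return nil unless match
  match.names.map { |name| [name.to_sym, match[name]] }.to_h
end

# Items 8 and 9: candidate types listed from most specific to most general.
# Assumption: a simple linear list stands in for the real partial order
# (for example, decimal and word do not actually contain one another).
def candidate_types
  {
    integer: /\A[0-9]+\z/,
    decimal: /\A[0-9]+(?:\.[0-9]+)?\z/,
    word:    /\A\w+\z/,
    text:    /\A.*\z/m
  }
end

# Return the name of the first (most specific) candidate that matches
# every string in the column, or nil if none does.
def most_specific_type(strings)
  found = candidate_types.find do |_name, regexp|
    strings.all? { |s| regexp.match?(s) }
  end
  found && found.first
end

p capture_hash(/(?<year>\d{4})-(?<month>\d{2})/, "2014-05")
p most_specific_type(%w[12 7])
p most_specific_type(%w[12 3.5])
p most_specific_type(%w[12 abc])
```

A fuller version would order candidates by actual generalization / specialization (item 7) rather than by a hand-written list.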
