first commit

commit 8749ebd17faf03034ed4450925c2f37209dab364 0 parents
kapec authored
1,181 doc.html
@@ -0,0 +1,1181 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html>
+<head>
+ <title>LPeg - Parsing Expression Grammars For Lua</title>
+ <link rel="stylesheet" href="http://www.keplerproject.org/doc.css" type="text/css"/>
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+</head>
+<body>
+
+<div id="container">
+
+<div id="product">
+ <div id="product_logo">
+ <a href="http://www.inf.puc-rio.br/~roberto/lpeg.html">
+ <img alt="LPeg logo" src="lpeg-128.gif"/>
+ </a>
+ </div>
+ <div id="product_name"><big><strong>LPeg</strong></big></div>
+ <div id="product_description">
+ Parsing Expression Grammars For Lua, version 0.6
+ </div>
+</div> <!-- id="product" -->
+
+<div id="main">
+
+<div id="navigation">
+<h1>LPeg</h1>
+
+<ul>
+ <li><strong>Home</strong>
+ <ul>
+ <li><a href="#intro">Introduction</a></li>
+ <li><a href="#basic">Basic Constructions</a></li>
+ <li><a href="#grammar">Grammars</a></li>
+ <li><a href="#captures">Captures</a></li>
+ <li><a href="#ex">Some Examples</a></li>
+ <li><a href="#re">The <code>re</code> Module</a></li>
+ <li><a href="#download">Download</a></li>
+ <li><a href="#license">License</a></li>
+ </ul>
+ </li>
+</ul>
+</div> <!-- id="navigation" -->
+
+<div id="content">
+
+
+<h2><a name="intro">Introduction</a></h2>
+
+<p>
+<em>LPeg</em> is a new pattern-matching library for Lua,
+based on
+<a href="http://pdos.csail.mit.edu/%7Ebaford/packrat/">
+Parsing Expression Grammars</a> (PEGs).
+In this text, I assume you are familiar with PEGs.
+If you are not, you can get a quick start reading
+the
+<a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">
+Wikipedia Entry for PEGs</a>
+or Section 2 of
+<a href="http://pdos.csail.mit.edu/%7Ebaford/packrat/popl04/peg-popl04.pdf">
+Parsing Expression Grammars: A Recognition-Based Syntactic Foundation</a>
+(the section has only one page).
+The nice thing about PEGs is that they have a formal basis
+(instead of being an ad-hoc set of features),
+allow an <em>efficient and simple implementation</em>,
+and do most things we expect from a pattern-matching library
+(and more, as we can define entire grammars).
+</p>
+
+<p>
+Following the Snobol tradition,
+LPeg defines patterns as first-class objects.
+That is, patterns are regular Lua values
+(represented by userdata).
+The library offers several functions to create
+and compose patterns.
+With the use of metamethods,
+several of these functions are provided as infix or prefix
+operators.
+On the one hand,
+the result is usually much more verbose than the typical
+encoding of patterns using the so-called
+<em>regular expressions</em>
+(which typically are not regular expressions in the formal sense).
+On the other hand,
+first-class patterns allow much better documentation
+(as it is easy to comment the code,
+to use auxiliary variables to break complex definitions, etc.)
+and are extensible,
+as we can define new functions to create and compose patterns.
+</p>
+
+<p>
+For a quick glance at the library,
+the following table summarizes its basic operations
+for creating patterns:
+</p>
+<table border="1">
+<tbody><tr><td><b>Operator</b></td><td><b>Description</b></td></tr>
+<tr><td><code>lpeg.P(string)</code></td>
+ <td>Matches <code>string</code> literally</td></tr>
+<tr><td><code>lpeg.P(number)</code></td>
+ <td>Matches exactly <code>number</code> characters</td></tr>
+<tr><td><code>lpeg.S(string)</code></td>
+ <td>Matches any character in <code>string</code> (set)</td></tr>
+<tr><td><code>lpeg.R("<em>xy</em>")</code></td>
+ <td>Matches any character between <em>x</em> and <em>y</em> (range)</td></tr>
+<tr><td><code>patt^n</code></td>
+ <td>Matches at least <code>n</code> repetitions of <code>patt</code></td></tr>
+<tr><td><code>patt^-n</code></td>
+ <td>Matches at most <code>n</code> repetitions of <code>patt</code></td></tr>
+<tr><td><code>patt1 * patt2</code></td>
+ <td>Matches <code>patt1</code> followed by <code>patt2</code></td></tr>
+<tr><td><code>patt1 + patt2</code></td>
+ <td>Matches <code>patt1</code> or <code>patt2</code>
+ (ordered choice)</td></tr>
+<tr><td><code>patt1 - patt2</code></td>
+ <td>Matches <code>patt1</code> if <code>patt2</code> does not match</td></tr>
+<tr><td><code>-patt</code></td>
+ <td>Equivalent to <code>"" - patt</code></td></tr>
+</tbody></table>
+
+<p>As a very simple example, <code>lpeg.R("09")^1</code> matches
+a non-empty sequence of digits.
+As a not so simple example,
+<code>-lpeg.P(1)</code>
+(which can be written as <code>lpeg.P(-1)</code>
+or simply <code>-1</code> for operations expecting a pattern)
+matches an empty string only if it cannot match a single character;
+so, it succeeds only at the subject's end.
+</p>
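+<p>
+The two examples above can be tried with a minimal sketch such as the following.
+It assumes the library is loaded with <code>require "lpeg"</code>;
+the names <code>digits</code> and <code>only_digits</code> are illustrative only.
+</p>

```lua
local lpeg = require "lpeg"

-- a non-empty run of decimal digits
local digits = lpeg.R("09")^1

-- the same run, but it must be followed by the end of the subject
local only_digits = digits * -1

print(lpeg.match(digits, "2007 rest"))       --> 5 (index just past "2007")
print(lpeg.match(only_digits, "2007"))       --> 5
print(lpeg.match(only_digits, "2007 rest"))  --> nil
```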
+
+<p>
+Those not convinced by the previous syntax
+can try the <a href="#re"><code>re</code> module</a>,
+which implements patterns following a regular-expression style
+(e.g., <code>[0-9]+</code>).
+(This module is less than 100 lines of Lua code,
+and of course uses LPeg to parse regular expressions.)
+This module currently supports only very basic captures,
+but it is a good source for those looking for
+extended examples of LPeg definitions.
+</p>
+
+
+
+<h2><a name="basic">Basic Constructions</a></h2>
+
+<p>
+Most of the following operations build patterns.
+All operations that expect a pattern as an argument
+may also receive strings, tables, numbers, booleans, or functions,
+which are translated to patterns according to
+the rules of function <a href="#lpeg"><code>lpeg.P</code></a>.
+</p>
+
+
+<h3><code>lpeg.match (pattern, subject [, init])</code></h3>
+<p>
+The matching function.
+It attempts to match the given pattern against the subject string.
+If the match succeeds,
+it returns the index in the subject of the first character after the match,
+or the <a href="#captures">captured values</a>
+(if the pattern captured any value).
+</p>
+
+<p>
+An optional numeric argument <code>init</code> makes the match
+start at that position in the subject string.
+As usual in Lua libraries,
+a negative value counts from the end.
+</p>
+
+<p>
+Unlike typical pattern-matching functions,
+<code>match</code> works only in <em>anchored</em> mode;
+that is, it tries to match the pattern with a prefix of
+the given subject string (at position <code>init</code>),
+not with an arbitrary substring of the subject.
+So, if we want to find a pattern anywhere in a string,
+we must either write a loop in Lua or write a pattern that
+matches anywhere.
+This second approach is easy and quite efficient;
+see <a href="#ex">examples</a>.
+</p>
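+<p>
+As a small illustration of anchored matching and of the <code>init</code>
+argument, consider the sketch below.
+The subject string and the name <code>world</code> are made up for the example.
+</p>

```lua
local lpeg = require "lpeg"

local world = lpeg.P("world")
local subject = "hello world"

-- anchored: the pattern is tried only at position 1
print(lpeg.match(world, subject))      --> nil

-- start the match at position 7, where "world" begins
print(lpeg.match(world, subject, 7))   --> 12

-- a negative init counts from the end of the subject
print(lpeg.match(world, subject, -5))  --> 12
```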
+
+
+<h3><a name="lpeg"><code>lpeg.P (value)</code></a></h3>
+<p>
+Converts the given value into a proper pattern,
+according to the following rules:
+</p>
+<ul>
+
+<li><p>
+If the argument is a pattern,
+it is returned unmodified.
+</p></li>
+
+<li><p>
+If the argument is a string,
+it is translated to a pattern that matches the string literally.
+</p></li>
+
+<li><p>
+If the argument is a number,
+it is translated as follows.
+A non-negative number <em>n</em> gives a pattern that
+matches exactly <em>n</em> characters;
+a negative number <em>-n</em> gives a pattern that
+succeeds only if the input string has fewer than <em>n</em> characters left.
+That is, <code>lpeg.P(-n)</code> is equivalent to the unary minus operation
+(see below) applied to <code>lpeg.P(n)</code>.
+</p></li>
+
+<li><p>
+If the argument is a boolean,
+the result is a pattern that always succeeds or always fails
+(according to the boolean value),
+without consuming any input.
+</p></li>
+
+<li><p>
+If the argument is a table,
+it is interpreted as a grammar
+(see <a href="#grammar">Grammars</a>).
+</p></li>
+
+<li><p>
+If the argument is a function,
+returns a pattern that matches as follows.
+Each time there is an attempt for a match against this pattern,
+the function is called,
+always with two arguments:
+the original subject string,
+and the current position in the subject.
+If the call returns a <em>valid</em> number,
+the match succeeds
+and the returned number becomes the new current position.
+If the call returns <b>false</b>, <b>nil</b>, or an invalid number,
+the match fails.
+</p>
+
+<p>
+If the function is called with parameters <em>s</em> and <em>i</em>,
+its result is valid if
+<em>i &lt;= result &lt;= len(s) + 1</em>.
+</p></li>
+
+</ul>
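+<p>
+The conversion rules above can be sketched as follows.
+The function pattern <code>skip3</code> is a hypothetical example
+(it is not part of the library): it advances three characters when
+that many are available and fails otherwise.
+</p>

```lua
local lpeg = require "lpeg"
local match = lpeg.match

print(match(lpeg.P("ab"), "abc"))    --> 3   (literal string)
print(match(lpeg.P(2), "abc"))       --> 3   (exactly two characters)
print(match(lpeg.P(-2), "a"))        --> 1   (fewer than two left; consumes nothing)
print(match(lpeg.P(-2), "ab"))       --> nil
print(match(lpeg.P(true), "abc"))    --> 1   (always succeeds)
print(match(lpeg.P(false), "abc"))   --> nil (always fails)

-- hypothetical function pattern: skip three characters if possible
local skip3 = lpeg.P(function (s, i)
  local j = i + 3
  if j <= #s + 1 then return j end   -- valid: i <= j <= len(s) + 1
  return false                       -- not enough input: fail
end)

print(match(skip3, "abcdef"))        --> 4
print(match(skip3, "ab"))            --> nil
```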
+
+
+
+<h3><code>lpeg.R ({range})</code></h3>
+<p>
+Returns a pattern that matches any single character
+belonging to one of the given <em>ranges</em>.
+Each <code>range</code> is a string <em>xy</em> of length 2,
+representing all characters with code
+between the codes of <em>x</em> and <em>y</em>
+(both inclusive).
+</p>
+
+<p>
+As an example, the pattern
+<code>lpeg.R("09")</code> matches any digit,
+and <code>lpeg.R("az", "AZ")</code> matches any ASCII letter.
+</p>
+
+
+<h3><code>lpeg.S (string)</code></h3>
+<p>
+Returns a pattern that matches any single character that
+appears in the given string.
+(The <code>S</code> stands for <em>Set</em>.)
+</p>
+
+<p>
+As an example, the pattern
+<code>lpeg.S("+-*/")</code> matches any arithmetic operator.
+</p>
+
+<p>
+Note that, if <code>s</code> is a character
+(that is, a string of length 1),
+then <code>lpeg.P(s)</code> is equivalent to <code>lpeg.S(s)</code>
+which is equivalent to <code>lpeg.R(s..s)</code>.
+Note also that both <code>lpeg.S("")</code> and <code>lpeg.R()</code>
+are patterns that always fail.
+</p>
+
+
+<h3><code>lpeg.V (v)</code></h3>
+<p>
+This operation creates a non-terminal (a <em>variable</em>)
+for a grammar.
+The created non-terminal refers to the rule indexed by <code>v</code>
+in the enclosing grammar.
+(See <a href="#grammar">Grammars</a> for details.)
+</p>
+
+
+<h3><code>#patt</code></h3>
+<p>
+Returns a pattern equivalent to <em>&amp;patt</em> in the original
+PEG notation.
+This is a pattern that matches only if the input string
+does match <code>patt</code>,
+but without consuming any input,
+independently of success or failure.
+</p>
+
+
+<h3><code>-patt</code></h3>
+<p>
+Returns a pattern equivalent to <em>!patt</em> in the original
+PEG notation.
+This pattern matches only if the input string
+does not match <code>patt</code>.
+It does not consume any input,
+independently of success or failure.
+</p>
+
+<p>
+As an example, the pattern
+<code>-1</code> matches only the end of string.
+</p>
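+<p>
+A short sketch of both predicates follows;
+the name <code>digit</code> is illustrative.
+In every successful case the returned position is 1,
+because predicates consume no input.
+</p>

```lua
local lpeg = require "lpeg"

local digit = lpeg.R("09")

-- and-predicate: succeeds if a digit comes next
print(lpeg.match(#digit, "7up"))   --> 1
print(lpeg.match(#digit, "up"))    --> nil

-- not-predicate: succeeds if a digit does not come next
print(lpeg.match(-digit, "up"))    --> 1
print(lpeg.match(-digit, "7up"))   --> nil
```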
+
+
+<h3><code>patt1 + patt2</code></h3>
+<p>
+Returns a pattern equivalent to an <em>ordered choice</em>
+of <code>patt1</code> and <code>patt2</code>.
+(This is denoted by <em>patt1 / patt2</em> in the original PEG notation,
+not to be confused with the <code>/</code> operation in LPeg.)
+It matches either <code>patt1</code> or <code>patt2</code>
+(with no backtracking once one of them succeeds).
+The identity element for this operation is the pattern
+<code>lpeg.P(false)</code>,
+which always fails.
+</p>
+
+<p>
+If both <code>patt1</code> and <code>patt2</code> are
+character sets,
+this operation is equivalent to set union:
+</p>
+<pre class="example">
+lower = lpeg.R("az")
+upper = lpeg.R("AZ")
+letter = lower + upper
+</pre>
+
+
+<h3><code>patt1 - patt2</code></h3>
+<p>
+Returns a pattern equivalent to <em>!patt2 patt1</em>.
+This pattern asserts that the input does not match
+<code>patt2</code> and then matches <code>patt1</code>.
+</p>
+
+<p>
+If both <code>patt1</code> and <code>patt2</code> are
+character sets,
+this operation is equivalent to set difference.
+Note that <code>-patt</code> is equivalent to <code>"" - patt</code>
+(or <code>0 - patt</code>).
+If <code>patt</code> is a character set,
+<code>1 - patt</code> is its complement.
+</p>
+
+
+<h3><code>patt1 * patt2</code></h3>
+<p>
+Returns a pattern that matches <code>patt1</code>
+and then matches <code>patt2</code>,
+starting where <code>patt1</code> finished.
+The identity element for this operation is the
+pattern <code>lpeg.P(true)</code>,
+which always succeeds.
+</p>
+
+<p>
+(LPeg uses the <code>*</code> operator
+[instead of the more obvious <code>..</code>]
+both because it has
+the right priority and because in formal languages it is
+common to use a dot for denoting concatenation.)
+</p>
+
+
+<h3><code>patt^n</code></h3>
+<p>
+If <code>n</code> is nonnegative,
+this pattern is
+equivalent to <em>patt<sup>n</sup> patt*</em>.
+It matches at least <code>n</code> occurrences of <code>patt</code>.
+</p>
+
+<p>
+Otherwise, when <code>n</code> is negative,
+this pattern is equivalent to <em>(patt?)<sup>-n</sup></em>.
+That is, it matches at most <code>-n</code>
+occurrences of <code>patt</code>.
+</p>
+
+<p>
+In particular, <code>patt^0</code> is equivalent to <em>patt*</em>,
+<code>patt^1</code> is equivalent to <em>patt+</em>,
+and <code>patt^-1</code> is equivalent to <em>patt?</em>
+in the original PEG notation.
+</p>
+
+<p>
+In all cases,
+the resulting pattern is greedy with no backtracking.
+That is, it matches only the longest possible sequence
+of matches for <code>patt</code>.
+</p>
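+<p>
+The repetition rules, including the absence of backtracking,
+can be checked with a sketch like this one
+(the pattern <code>a</code> is just an example):
+</p>

```lua
local lpeg = require "lpeg"

local a = lpeg.P("a")

print(lpeg.match(a^2, "aaab"))       --> 4   (at least two; greedy, takes all three)
print(lpeg.match(a^2, "ab"))         --> nil (only one "a")
print(lpeg.match(a^-2, "aaab"))      --> 3   (at most two)
print(lpeg.match(a^-2, "b"))         --> 1   (zero occurrences is fine)

-- no backtracking: a^0 consumes every "a", so the final "a" cannot match
print(lpeg.match(a^0 * "a", "aaa"))  --> nil
```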
+
+
+
+<h2><a name="grammar">Grammars</a></h2>
+
+<p>
+With the use of Lua variables,
+it is possible to define patterns incrementally,
+with each new pattern using previously defined ones.
+However, this technique does not allow the definition of
+recursive patterns.
+For recursive patterns,
+we need real grammars.
+</p>
+
+<p>
+LPeg represents grammars with tables,
+where each entry is a rule.
+</p>
+
+<p>
+The call <code>lpeg.V(v)</code>
+creates a pattern that represents the nonterminal
+(or <em>variable</em>) with index <code>v</code> in a grammar.
+Because the grammar still does not exist when
+this function is evaluated,
+the result is an <em>open reference</em> to the respective rule.
+</p>
+
+<p>
+A table is <em>fixed</em> when it is converted to a pattern
+(either by calling <code>lpeg.P</code> or by using it where a
+pattern is expected).
+Then every open reference created by <code>lpeg.V(v)</code>
+is corrected to refer to the rule indexed by <code>v</code> in the table.
+</p>
+
+<p>
+When a table is fixed,
+the result is a pattern that matches its <em>initial rule</em>.
+The entry with index 1 in the table defines its initial rule.
+If that entry is a string,
+it is assumed to be the name of the initial rule.
+Otherwise, LPeg assumes that the entry 1 itself is the initial rule.
+</p>
+
+<p>
+As an example,
+the following grammar matches strings of a's and b's that
+have the same number of a's and b's:
+</p>
+<pre class="example">
+equalcount = lpeg.P{
+ "S"; -- initial rule name
+ S = "a" * lpeg.V"B" + "b" * lpeg.V"A" + "",
+ A = "a" * lpeg.V"S" + "b" * lpeg.V"A" * lpeg.V"A",
+ B = "b" * lpeg.V"S" + "a" * lpeg.V"B" * lpeg.V"B",
+} * -1
+</pre>
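+<p>
+A possible way to exercise this grammar is sketched below.
+The grammar is repeated so that the block is self-contained;
+the subject strings are arbitrary test inputs.
+</p>

```lua
local lpeg = require "lpeg"

local equalcount = lpeg.P{
  "S";   -- initial rule name
  S = "a" * lpeg.V"B" + "b" * lpeg.V"A" + "",
  A = "a" * lpeg.V"S" + "b" * lpeg.V"A" * lpeg.V"A",
  B = "b" * lpeg.V"S" + "a" * lpeg.V"B" * lpeg.V"B",
} * -1

print(lpeg.match(equalcount, "abab"))  --> 5
print(lpeg.match(equalcount, "abba"))  --> 5
print(lpeg.match(equalcount, "aab"))   --> nil (two a's, one b)
print(lpeg.match(equalcount, ""))      --> 1   (zero of each)
```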
+
+<h2><a name="captures">Captures</a></h2>
+
+<p>
+Captures specify what a match operation should return
+(the so-called <em>semantic information</em>).
+LPeg offers several kinds of captures,
+which build values based on matches and combine them to
+create new values.
+</p>
+
+<p>
+The following table summarizes the basic captures:
+</p>
+<table border="1">
+<tbody><tr><td><b>Operation</b></td><td><b>What is Captured</b></td></tr>
+<tr><td><code>lpeg.C(patt)</code></td>
+ <td>the match for <code>patt</code></td></tr>
+<tr><td><code>lpeg.Cc(value)</code></td>
+ <td>the given value (matches the empty string)</td></tr>
+<tr><td><code>lpeg.Cp()</code></td>
+ <td>the current position (matches the empty string)</td></tr>
+<tr><td><code>lpeg.Cs(patt)</code></td>
+ <td>the match for <code>patt</code>
+ with nested captures replacing their matches</td></tr>
+<tr><td><code>lpeg.Ct(patt)</code></td>
+ <td>a table with all captures from <code>patt</code></td></tr>
+<tr><td><code>lpeg.Ca(patt)</code></td>
+ <td>an <em>accumulation</em> (or <em>folding</em>) of the
+ captures from <code>patt</code></td></tr>
+<tr><td><code>patt / function</code></td>
+ <td>the returns of <code>function</code> applied to the captures
+ of <code>patt</code></td></tr>
+<tr><td><code>patt / table</code></td>
+ <td><code>table[c]</code>, where <code>c</code> is the (first)
+ capture of <code>patt</code></td></tr>
+<tr><td><code>patt / string</code></td>
+ <td><code>string</code>, with some marks replaced by captures
+ of <code>patt</code></td></tr>
+</tbody></table>
+
+<p>
+A capture pattern captures a value every time it succeeds.
+For instance,
+a capture inside a loop generates as many values as matched by the loop.
+A capture generates a value only when it succeeds.
+For instance,
+the pattern <code>lpeg.C(lpeg.P'a'^-1)</code>
+captures the empty string when there is no <code>'a'</code>
+(because the pattern <code>'a'?</code> succeeds),
+while the pattern <code>lpeg.C('a')^-1</code>
+does not capture any value when there is no <code>'a'</code>
+(because the pattern <code>'a'</code> fails).
+</p>
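+<p>
+The difference between the two patterns just described can be observed
+directly; the variable names below are illustrative.
+When no value is captured, <code>lpeg.match</code> falls back to
+returning the position after the match.
+</p>

```lua
local lpeg = require "lpeg"

-- the capture wraps the optional pattern: it always succeeds
local optional_inside = lpeg.C(lpeg.P("a")^-1)

-- the capture itself is optional: it may produce no value
local optional_outside = lpeg.C("a")^-1

print(lpeg.match(optional_inside, "ab"))   --> a
print(lpeg.match(optional_inside, "b"))    --> (an empty string)
print(lpeg.match(optional_outside, "ab"))  --> a
print(lpeg.match(optional_outside, "b"))   --> 1 (no capture, so the position)
```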
+
+<h3><code>lpeg.C (patt)</code></h3>
+<p>
+Creates a <em>simple capture</em>,
+which captures the substring of the subject that matches <code>patt</code>.
+The captured value is a string.
+If <code>patt</code> has other captures,
+their values are returned after this one.
+</p>
+
+
+<h3><code>lpeg.Ca (patt)</code></h3>
+<p>
+Creates an <em>accumulator capture</em>.
+This capture assumes that <code>patt</code> should produce
+at least one captured value of any kind,
+which becomes the initial value of an <em>accumulator</em>.
+Pattern <code>patt</code> then may produce
+zero or more <em>function captures</em>.
+The function in each of these captures is called with the
+accumulator as its first argument
+(followed by any other arguments provided by its own pattern),
+and the value returned by the function becomes the new value
+of the accumulator.
+The final value of this accumulator is the sole result of
+the whole capture.
+</p>
+
+<p>
+As an example,
+the following pattern matches a list of numbers separated
+by commas and returns their addition:
+</p>
+<pre class="example">
+-- matches a numeral and captures its value
+local number = lpeg.R"09"^1 / tonumber
+
+-- auxiliary function to add two numbers
+local function add (acc, newvalue) return acc + newvalue end
+
+list = lpeg.Ca(number * ("," * number / add)^0)
+
+-- example of use
+print(list:match("10,30,43")) --&gt; 83
+</pre>
+
+
+<h3><code>lpeg.Cc (value)</code></h3>
+<p>
+Creates a <em>constant capture</em>.
+This pattern matches the empty string and
+produces <code>value</code> as its captured value.
+</p>
+
+
+<h3><code>lpeg.Cs (patt)</code></h3>
+<p>
+Creates a <em>substitution capture</em>,
+which captures the substring of the subject that matches <code>patt</code>,
+with <em>substitutions</em>.
+For any capture inside <code>patt</code>,
+the substring that matched the capture is replaced by the capture value
+(which should be a string).
+The capture values from <code>patt</code> are not returned independently
+(only as substrings in the resulting string).
+</p>
+
+
+<h3><code>lpeg.Cp ()</code></h3>
+<p>
+Creates a <em>position capture</em>.
+It matches the empty string and
+captures the position in the subject where the match occurs.
+The captured value is a number.
+</p>
+
+
+<h3><code>lpeg.Ct (patt)</code></h3>
+<p>
+Creates a <em>table capture</em>.
+This capture creates a table and puts all captures made by
+<code>patt</code> inside this table in successive integer keys,
+starting at 1.
+</p>
+
+<p>
+The captured value is only this table.
+The captures made by <code>patt</code> are not
+returned independently (only as table elements).
+</p>
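+<p>
+As a sketch of a table capture, the following collects the numerals of a
+comma-separated list as strings.
+The names <code>number</code> and <code>list</code> are chosen for the example.
+</p>

```lua
local lpeg = require "lpeg"

-- a numeral, captured as a string
local number = lpeg.C(lpeg.R("09")^1)

-- all numerals of the list, gathered in one table
local list = lpeg.Ct(number * ("," * number)^0)

local t = lpeg.match(list, "10,30,43")
print(#t)                  --> 3
print(t[1], t[2], t[3])    --> 10  30  43
```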
+
+
+<h3><code>patt / function</code></h3>
+<p>
+Creates a <em>function capture</em>.
+It calls the given function passing all captures made by
+<code>patt</code> as arguments,
+or the whole match if <code>patt</code> made no capture.
+The values returned by the function
+are the final values of the capture.
+(This capture may create multiple values.)
+In particular,
+if <code>function</code> returns no value,
+there is no captured value;
+everything works as if there was no capture.
+</p>
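+<p>
+Both cases (no capture in <code>patt</code>, and several captures)
+are sketched below.
+The patterns <code>upper</code> and <code>swapped</code> are illustrative names.
+</p>

```lua
local lpeg = require "lpeg"

local word = lpeg.R("az")^1

-- no capture inside: the function receives the whole match
local upper = word / string.upper
print(lpeg.match(upper, "lpeg rocks"))  --> LPEG

-- two captures inside: the function receives both, and returns two values
local pair = lpeg.C(word) * "=" * lpeg.C(lpeg.R("09")^1)
local swapped = pair / function (k, v) return tonumber(v), k end
print(lpeg.match(swapped, "x=42"))      --> 42  x
```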
+
+
+<h3><code>patt / string</code></h3>
+<p>
+Creates a <em>string capture</em>.
+It creates a capture string based on <code>string</code>.
+The captured value is a copy of <code>string</code>,
+except that the character <code>%</code> works as an escape character:
+any sequence in <code>string</code> of the form <code>%<em>n</em></code>,
+with <em>n</em> between 1 and 9,
+stands for the match of the <em>n</em>-th capture in <code>patt</code>.
+(Currently these nested captures can be only simple captures.)
+The sequence <code>%0</code> stands for the whole match.
+The sequence <code>%%</code> stands for a single&nbsp;<code>%</code>.
+</p>
+
+
+<h3><code>patt / table</code></h3>
+<p>
+Creates a <em>query capture</em>.
+It indexes the given table using as key the value of the first capture
+of <code>patt</code>,
+or the whole match if <code>patt</code> made no capture.
+The value at that index is the final value of the capture.
+</p>
+
+<p>
+If the table does not have that key,
+there is no captured value.
+Everything works as if there was no capture.
+</p>
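+<p>
+A minimal sketch of a query capture follows;
+the table <code>names</code> and its contents are invented for the example.
+</p>

```lua
local lpeg = require "lpeg"

local names = { one = 1, two = 2 }
local word = lpeg.R("az")^1

-- the whole match is used as the key, since word makes no capture
local lookup = word / names

print(lpeg.match(lookup, "two"))    --> 2
-- key absent: no captured value, so match returns the end position
print(lpeg.match(lookup, "three"))  --> 6
```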
+
+
+
+
+<h2><a name="ex">Some Examples</a></h2>
+
+<h3>Splitting a String</h3>
+<p>
+The following code splits a string using a given pattern
+<code>sep</code> as a separator:
+</p>
+<pre class="example">
+function split (s, sep)
+ sep = lpeg.P(sep)
+ local elem = lpeg.C((1 - sep)^0)
+ local p = elem * (sep * elem)^0
+ return lpeg.match(p, s)
+end
+</pre>
+<p>
+First the function ensures that <code>sep</code> is a proper pattern.
+The pattern <code>elem</code> is a repetition of zero or more
+arbitrary characters as long as there is not a match against
+the separator. It also captures its result.
+The pattern <code>p</code> matches a list of elements separated
+by <code>sep</code>.
+</p>
+
+<p>
+If the split results in too many values,
+it may overflow the maximum number of values
+that can be returned by a Lua function.
+In this case,
+we should collect these values in a table:
+</p>
+<pre class="example">
+function split (s, sep)
+ sep = lpeg.P(sep)
+ local elem = lpeg.C((1 - sep)^0)
+ local p = lpeg.Ct(elem * (sep * elem)^0) -- make a table capture
+ return lpeg.match(p, s)
+end
+</pre>
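+<p>
+The table-returning version might be used as in this sketch.
+The function is repeated (as a local) so the block runs on its own;
+the input string is arbitrary.
+Note that an empty field between two separators yields an empty string.
+</p>

```lua
local lpeg = require "lpeg"

local function split (s, sep)
  sep = lpeg.P(sep)
  local elem = lpeg.C((1 - sep)^0)
  local p = lpeg.Ct(elem * (sep * elem)^0)   -- make a table capture
  return lpeg.match(p, s)
end

local parts = split("a,b,,c", ",")
print(#parts)                    --> 4
print(table.concat(parts, "|"))  --> a|b||c
```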
+
+
+<h3>Searching for a Pattern</h3>
+<p>
+The primitive <code>match</code> works only in anchored mode.
+If we want to find a pattern anywhere in a string,
+we must write a pattern that matches anywhere.
+</p>
+
+<p>
+Because patterns are composable,
+we can write a function that,
+given any arbitrary pattern <code>p</code>,
+returns a new pattern that searches for <code>p</code>
+anywhere in a string.
+There are several ways to do the search.
+One way is like this:
+</p>
+<pre class="example">
+function anywhere (p)
+ return lpeg.P{ p + 1 * lpeg.V(1) }
+end
+</pre>
+<p>
+This grammar has a straightforward reading:
+it matches <code>p</code> or skips one character and tries again.
+</p>
+
+<p>
+If we want to know where the pattern is in the string
+(instead of knowing only that it is there somewhere),
+we can add position captures to the pattern:
+</p>
+<pre class="example">
+local I = lpeg.Cp()
+function anywhere (p)
+ return lpeg.P{ I * p * I + 1 * lpeg.V(1) }
+end
+</pre>
+
+<p>
+Another option for the search is like this:
+</p>
+<pre class="example">
+local I = lpeg.Cp()
+function anywhere (p)
+ return (1 - lpeg.P(p))^0 * I * p * I
+end
+</pre>
+<p>
+Again the pattern has a straightforward reading:
+it skips as many characters as possible while not matching <code>p</code>,
+and then matches <code>p</code> (plus appropriate captures).
+</p>
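+<p>
+A usage sketch for this last version is shown below.
+The only change is that <code>p</code> is converted with <code>lpeg.P</code>
+once, up front; the subject string is an arbitrary example.
+</p>

```lua
local lpeg = require "lpeg"

local I = lpeg.Cp()

local function anywhere (p)
  p = lpeg.P(p)
  -- skip characters while p does not match, then match p between two
  -- position captures
  return (1 - p)^0 * I * p * I
end

-- start position of the match and the position just after it
print(lpeg.match(anywhere("world"), "hello world"))  --> 7  12
print(lpeg.match(anywhere("xyz"), "hello world"))    --> nil
```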
+
+<p>
+If we want to look for a pattern only at word boundaries,
+we can use the following transformer:
+</p>
+
+<pre class="example">
+local wordletter = lpeg.R("AZ", "az")
+
+function atwordboundary (p)
+ return lpeg.P{
+ [1] = p + wordletter^0 * (1 - wordletter)^1 * lpeg.V(1)
+ }
+end
+</pre>
+
+
+<h3><a name="balanced"></a>Balanced Parentheses</h3>
+<p>
+The following pattern matches only strings with balanced parentheses:
+</p>
+<pre class="example">
+b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }
+</pre>
+<p>
+Reading the first (and only) rule of the given grammar,
+we have that a balanced string is
+an open parenthesis,
+followed by zero or more repetitions of either
+a non-parenthesis character or
+a balanced string (<code>lpeg.V(1)</code>),
+followed by a closing parenthesis.
+</p>
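+<p>
+The pattern can be tried on a few inputs, as in the sketch below
+(the subjects are arbitrary).
+Because matching is anchored but not required to reach the end of the
+subject, a balanced prefix is enough for success.
+</p>

```lua
local lpeg = require "lpeg"

local b = lpeg.P{ "(" * ((1 - lpeg.S"()") + lpeg.V(1))^0 * ")" }

print(lpeg.match(b, "(a(b)c)"))  --> 8
print(lpeg.match(b, "(a)(b)"))   --> 4   (only the balanced prefix "(a)")
print(lpeg.match(b, "(a(b)c"))   --> nil (missing closing parenthesis)
print(lpeg.match(b, "abc"))      --> nil (does not start with "(")
```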
+
+
+<h3>Global Substitution</h3>
+<p>
+The next example does a job somewhat similar to <code>string.gsub</code>.
+It receives a pattern and a replacement value,
+and substitutes the replacement value for all occurrences of the pattern
+in a given string:
+</p>
+<pre class="example">
+function gsub (s, patt, repl)
+ patt = lpeg.P(patt)
+ patt = lpeg.Cs((patt / repl + 1)^0)
+ return lpeg.match(patt, s)
+end
+</pre>
+<p>
+As in <code>string.gsub</code>,
+the replacement value can be a string,
+a function, or a table.
+</p>
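+<p>
+A sketch of this function in use, with a string and with a function as the
+replacement value.
+The function is declared <code>local</code> here only to keep the block
+self-contained; the inputs are arbitrary.
+</p>

```lua
local lpeg = require "lpeg"

local function gsub (s, patt, repl)
  patt = lpeg.P(patt)
  patt = lpeg.Cs((patt / repl + 1)^0)
  return lpeg.match(patt, s)
end

-- string replacement
print(gsub("hello world", "o", "0"))                      --> hell0 w0rld

-- function replacement: each run of lowercase letters is upper-cased
print(gsub("hello world", lpeg.R("az")^1, string.upper))  --> HELLO WORLD
```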
+
+
+<h3>Comma-Separated Values (CSV)</h3>
+<p>
+This example breaks a string into comma-separated values,
+returning all fields:
+</p>
+<pre class="example">
+local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
+ lpeg.C((1 - lpeg.S',\n"')^0)
+
+local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)
+
+function csv (s)
+ return lpeg.match(record, s)
+end
+</pre>
+<p>
+A field is either a quoted field
+(which may contain any character except an individual quote,
+which may be written as two quotes that are replaced by one)
+or an unquoted field
+(which cannot contain commas, newlines, or quotes).
+A record is a list of fields separated by commas,
+ending with a newline or the string end (-1).
+</p>
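+<p>
+The following sketch runs the definitions above on one record that mixes
+quoted and unquoted fields.
+The sample line is invented; the definitions are repeated so that the block
+runs on its own.
+</p>

```lua
local lpeg = require "lpeg"

local field = '"' * lpeg.Cs(((lpeg.P(1) - '"') + lpeg.P'""' / '"')^0) * '"' +
              lpeg.C((1 - lpeg.S',\n"')^0)

local record = field * (',' * field)^0 * (lpeg.P'\n' + -1)

local function csv (s)
  return lpeg.match(record, s)
end

-- the doubled quotes inside the second field collapse to single quotes
print(csv('a,"b ""q"" c",d'))   --> a  b "q" c  d
```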
+
+
+<h3>Lua's Long Strings</h3>
+<p>
+This example matches long strings in Lua.
+Probably there is no pure PEG grammar that matches that syntax
+(as there is no Context-Free Grammar for it),
+so we resort to Lua:
+</p>
+<pre class="example">
+local start = "[" * lpeg.P"="^0 * "["
+longstring = lpeg.P(function (s, i)
+ local l = lpeg.match(start, s, i)
+ if not l then return nil end
+ local p = lpeg.P("]" .. string.rep("=", l - i - 2) .. "]")
+ p = (1 - p)^0 * p
+ return lpeg.match(p, s, l)
+end)
+</pre>
+<p>
+The function first checks whether there is a long string starting
+at position <code>i</code>, using the <code>start</code> pattern.
+If so, it builds a pattern <code>p</code>
+that matches an ending bracket with the correct number of equal signs
+and then looks for this pattern.
+</p>
+
+<p>
+Patterns built from Lua functions are not efficient,
+so it may be worth using an and-predicate
+to pre-analyze the subject before calling Lua:
+</p>
+<pre class="example">
+longstring = #("[" * lpeg.S"[=") * longstring
+</pre>
+<p>
+In this example, the pattern only calls Lua if the current position
+matches an open bracket followed by either another open bracket or
+an equal sign (so that the only valid token there is a long string).
+</p>
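+<p>
+Putting the two pieces together, a usage sketch might look like this.
+The sample subjects are invented for illustration.
+</p>

```lua
local lpeg = require "lpeg"

local start = "[" * lpeg.P"="^0 * "["

local longstring = lpeg.P(function (s, i)
  local l = lpeg.match(start, s, i)
  if not l then return nil end
  -- l - i - 2 is the number of equal signs in the opening bracket
  local p = lpeg.P("]" .. string.rep("=", l - i - 2) .. "]")
  p = (1 - p)^0 * p
  return lpeg.match(p, s, l)
end)

-- call Lua only when the subject starts like a long string
longstring = #("[" * lpeg.S"[=") * longstring

print(lpeg.match(longstring, "[[x]]"))           --> 6
print(lpeg.match(longstring, "[==[ab]]cd]==]"))  --> 15 ("]]" does not close level 2)
print(lpeg.match(longstring, "[x"))              --> nil
```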
+
+
+<h3>UTF-8 and Latin 1</h3>
+<p>
+It is not difficult to use LPeg to convert a string from
+utf-8 encoding to Latin 1 (ISO 8859-1):
+</p>
+
+<pre class="example">
+-- convert a two-byte utf8 sequence to a Latin 1 character
+local function f2 (s)
+ local c1, c2 = string.byte(s, 1, 2)
+ return string.char(c1 * 64 + c2 - 12416)
+end
+
+local utf8 = lpeg.R("\0\127")
+ + lpeg.R("\194\195") * lpeg.R("\128\191") / f2
+
+local decode_pattern = lpeg.Cs(utf8^0) * -1
+</pre>
+<p>
+In this code,
+the definition of utf-8 is already restricted to the
+Latin 1 range (from 0 to 255).
+Any encoding outside this range (as well as any invalid encoding)
+will not match that pattern.
+</p>
+
+<p>
+As the definition of <code>decode_pattern</code> demands that
+the pattern matches the whole input (because of the -1 at its end),
+any invalid string will simply fail to match,
+without any useful information about the problem.
+We can improve this situation redefining <code>decode_pattern</code>
+as follows:
+</p>
+<pre class="example">
+local function er (_, i) error("invalid encoding at position " .. i) end
+
+local decode_pattern = lpeg.Cs(utf8^0) * (-1 + lpeg.P(er))
+</pre>
+<p>
+Now, if the pattern <code>utf8^0</code> stops
+before the end of the string,
+an appropriate error function is called.
+</p>
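+<p>
+A sketch of the first (non-raising) version in use is given below.
+The input is the UTF-8 encoding of "caf&eacute;";
+the expected Latin 1 byte for "&eacute;" is 233.
+The definitions are repeated so that the block is self-contained.
+</p>

```lua
local lpeg = require "lpeg"

-- convert a two-byte utf8 sequence to a Latin 1 character
local function f2 (s)
  local c1, c2 = string.byte(s, 1, 2)
  return string.char(c1 * 64 + c2 - 12416)
end

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\195") * lpeg.R("\128\191") / f2

local decode_pattern = lpeg.Cs(utf8^0) * -1

local latin1 = lpeg.match(decode_pattern, "caf\195\169")
print(#latin1, string.byte(latin1, 4))                  --> 4  233

-- a three-byte sequence is outside the Latin 1 range: no match
print(lpeg.match(decode_pattern, "\226\130\172"))       --> nil
```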
+
+
+<h3>UTF-8 and Unicode</h3>
+<p>
+We can extend the previous patterns to handle all Unicode code points.
+Of course,
+we cannot translate them to Latin 1 or any other one-byte encoding.
+Instead, our translation results in an array with the code points
+represented as numbers.
+The full code is here:
+</p>
+<pre class="example">
+-- decode a two-byte utf8 sequence
+local function f2 (s)
+ local c1, c2 = string.byte(s, 1, 2)
+ return c1 * 64 + c2 - 12416
+end
+
+-- decode a three-byte utf8 sequence
+local function f3 (s)
+ local c1, c2, c3 = string.byte(s, 1, 3)
+ return (c1 * 64 + c2) * 64 + c3 - 925824
+end
+
+-- decode a four-byte utf8 sequence
+local function f4 (s)
+ local c1, c2, c3, c4 = string.byte(s, 1, 4)
+ return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
+end
+
+local cont = lpeg.R("\128\191") -- continuation byte
+
+local utf8 = lpeg.R("\0\127") / string.byte
+ + lpeg.R("\194\223") * cont / f2
+ + lpeg.R("\224\239") * cont * cont / f3
+ + lpeg.R("\240\244") * cont * cont * cont / f4
+
+local decode_pattern = lpeg.Ct(utf8^0) * -1
+</pre>
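+<p>
+To see the decoder at work, the sketch below feeds it a one-byte,
+a three-byte, and a four-byte sequence.
+The sample input (the letter "a", U+20AC, and U+1F600) is chosen for the example;
+the definitions are repeated so that the block runs on its own.
+</p>

```lua
local lpeg = require "lpeg"

local function f2 (s)   -- two-byte sequence
  local c1, c2 = string.byte(s, 1, 2)
  return c1 * 64 + c2 - 12416
end

local function f3 (s)   -- three-byte sequence
  local c1, c2, c3 = string.byte(s, 1, 3)
  return (c1 * 64 + c2) * 64 + c3 - 925824
end

local function f4 (s)   -- four-byte sequence
  local c1, c2, c3, c4 = string.byte(s, 1, 4)
  return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168
end

local cont = lpeg.R("\128\191")   -- continuation byte

local utf8 = lpeg.R("\0\127") / string.byte
           + lpeg.R("\194\223") * cont / f2
           + lpeg.R("\224\239") * cont * cont / f3
           + lpeg.R("\240\244") * cont * cont * cont / f4

local decode_pattern = lpeg.Ct(utf8^0) * -1

local codes = lpeg.match(decode_pattern, "a\226\130\172\240\159\152\128")
print(codes[1], codes[2], codes[3])   --> 97  8364  128512
```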
+
+
+<h3>Arithmetic Expressions</h3>
+<p>
+This example is a complete parser and evaluator for simple
+arithmetic expressions.
+We write it in two styles.
+The first approach first builds a syntax tree and then
+traverses this tree to compute the expression value:
+</p>
+<pre class="example">
+-- Lexical Elements
+local Space = lpeg.S(" \n\t")^0
+local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
+local FactorOp = lpeg.C(lpeg.S("+-")) * Space
+local TermOp = lpeg.C(lpeg.S("*/")) * Space
+local Open = "(" * Space
+local Close = ")" * Space
+
+-- Grammar
+local V = lpeg.V
+local Exp, Term, Factor = 1, 2, 3
+G = lpeg.P{ "Exp",
+ Exp = lpeg.Ct(V"Factor" * (FactorOp * V"Factor")^0);
+ Factor = lpeg.Ct(V"Term" * (TermOp * V"Term")^0);
+ Term = Number + Open * V"Exp" * Close;
+}
+
+G = Space * G * -1
+
+-- Evaluator
+function eval (x)
+ if type(x) == "string" then
+ return tonumber(x)
+ else
+ local op1 = eval(x[1])
+ for i = 2, #x, 2 do
+ local op = x[i]
+ local op2 = eval(x[i + 1])
+ if (op == "+") then op1 = op1 + op2
+ elseif (op == "-") then op1 = op1 - op2
+ elseif (op == "*") then op1 = op1 * op2
+ elseif (op == "/") then op1 = op1 / op2
+ end
+ end
+ return op1
+ end
+end
+
+-- Parser/Evaluator
+function evalExp (s)
+ local t = lpeg.match(G, s)
+ if not t then error("syntax error", 2) end
+ return eval(t)
+end
+
+-- small example
+print(evalExp"3 + 5*9 / (1+1) - 12")
+</pre>
+
+<p>
+The second style computes the expression value on the fly,
+without building the syntax tree.
+The following grammar takes this approach.
+(It assumes the same lexical elements as before.)
+</p>
+<pre class="example">
+-- Auxiliary function
+function eval (v1, op, v2)
+ if (op == "+") then return v1 + v2
+ elseif (op == "-") then return v1 - v2
+ elseif (op == "*") then return v1 * v2
+ elseif (op == "/") then return v1 / v2
+ end
+end
+
+-- Grammar
+local V = lpeg.V
+G = lpeg.P{ "Exp",
+ Exp = lpeg.Ca(V"Factor" * (FactorOp * V"Factor" / eval)^0);
+ Factor = lpeg.Ca(V"Term" * (TermOp * V"Term" / eval)^0);
+ Term = Number / tonumber + Open * V"Exp" * Close;
+}
+
+-- small example
+print(lpeg.match(G, "3 + 5*9 / (1+1) - 12"))
+</pre>
+<p>
+Note the use of the accumulator capture.
+To compute the value of an expression,
+the accumulator starts with the value of the first factor,
+and then applies <code>eval</code> over
+the accumulator, the operator,
+and the new factor for each repetition.
+</p>
+
+
+<h2><a name="re"></a>The <code>re</code> Module</h2>
+
+<p>
+The <code>re</code> Module
+(provided by file <code>re.lua</code> in the distribution)
+supports a conventional regular-expression syntax for pattern construction.
+</p>
+
+<p>
+It offers two functions:
+</p>
+<ul>
+<li><p><code>re.compile(string)</code> compiles the given string and
+returns an equivalent LPeg pattern.
+The given string may define either an expression or a grammar.
+(It memoizes its results,
+so that there is no penalty in calling it multiple times
+with the same string.)
+</p></li>
+<li><p><code>re.match(subject,pattern)</code> compiles the given pattern
+(a string) and matches it against the given subject.
+</p></li>
+</ul>
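+
+<p>
+The memoization mentioned for <code>re.compile</code>
+can be pictured as a table indexed by the pattern string.
+The following sketch is plain Lua and shows only the general technique;
+it is an assumption for illustration, not the code in <code>re.lua</code>.
+The names <code>compile</code>, <code>cache</code>,
+<code>cached_compile</code>, and <code>compile_calls</code>
+are all hypothetical.
+</p>
+<pre class="example">
+-- Hypothetical stand-in for an expensive compilation step;
+-- it counts how many times it is actually called
+compile_calls = 0
+local function compile (s)
+  compile_calls = compile_calls + 1
+  return { source = s }
+end
+
+-- Cache indexed by the source string
+local cache = {}
+function cached_compile (s)
+  local p = cache[s]
+  if not p then          -- first time this string is seen?
+    p = compile(s)
+    cache[s] = p
+  end
+  return p
+end
+
+local a = cached_compile("[0-9]+")
+local b = cached_compile("[0-9]+")
+print(a == b, compile_calls)   -- same object; compiled only once
+</pre>
+<p>
+With a cache of this kind,
+a second call with the same string costs only a table lookup.
+</p>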
+
+<p>
+The pattern syntax of this module
+closely follows the original PEG notation.
+As in the original PEG notation,
+spaces have no meaning,
+and literal strings must be enclosed in quotes
+(but <code>re</code> allows both double and single quotes).
+Any special character may be escaped anywhere
+by prefixing it with <code>%</code>
+(not <code>\</code>, as in the original PEG notation).
+</p>
+
+<p>
+Like traditional regular expressions (but unlike original PEG),
+a character class may start with a <code>~</code> to complement
+its meaning.
+Curly brackets around a pattern capture its corresponding match.
+Empty curly brackets capture the current position.
+(Other captures are not supported in this version.)
+</p>
+
+<p>
+As a simple example,
+the following call will produce the same pattern produced by the
+Lua expression in the <a href="#balanced">balanced parentheses</a> example:
+</p>
+<pre>
+b = re.compile[[ balanced &lt;- '(' ([^()] / balanced)* ')' ]]
+</pre>
+
+
+
+<h2><a name="download"></a>Download</h2>
+
+<p>LPeg
+<a href="http://www.inf.puc-rio.br/~roberto/lpeg-0.6.tar.gz">source code</a>.</p>
+
+
+<h2><a name="license">License</a></h2>
+
+<p>
+Copyright &copy; 2007 Lua.org, PUC-Rio.
+</p>
+<p>
+Permission is hereby granted, free of charge,
+to any person obtaining a copy of this software and
+associated documentation files (the "Software"),
+to deal in the Software without restriction,
+including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software,
+and to permit persons to whom the Software is
+furnished to do so,
+subject to the following conditions:
+</p>
+
+<p>
+The above copyright notice and this permission notice
+shall be included in all copies or substantial portions of the Software.
+</p>
+
+<p>
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED,
+INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
+DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+</p>
+
+</div> <!-- id="content" -->
+
+</div> <!-- id="main" -->
+
+<div id="about">
+<p><small>
+$Id: doc.html,v 1.33 2007/04/12 14:38:05 roberto Exp $
+</small></p>
+</div> <!-- id="about" -->
+
+</div> <!-- id="container" -->
+
+</body>
+</html>
BIN  lpeg-128.gif
1,913 lpeg.c
@@ -0,0 +1,1913 @@
+/*
+** $Id: lpeg.c,v 1.67 2007/04/12 14:16:51 roberto Exp $
+** LPeg - PEG pattern matching for Lua
+** Copyright 2007, Lua.org & PUC-Rio (see documentation for license)
+** written by Roberto Ierusalimschy
+*/
+
+/*
+ PEG rules:
+
+ e1 | e2 -> choice L1; e1; commit L2; L1: e2; L2:
+ e* -> L2: choice L1; e; commit L2; L1:
+ or e* -> choice L1; L2: e; partialcommit L2; L1:
+ e? -> choice L1; e; commit L1; L1:
+ !e -> choice L1; e; commit L2; L2: fail; L1:
+ or !e -> choice L1; e; failtwice; L1:
+ &e -> choice L1; choice L2; e; L2: commit L3; L3: fail; L1:
+ or &e -> choice L1; e; backcommit L2; L1: fail; L2:
+*/
+
+
+#include <assert.h>
+#include <limits.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "lua.h"
+#include "lauxlib.h"
+
+
+/* maximum call/backtrack levels */
+#define MAXBACK 400
+
+/* initial size for capture's list */
+#define IMAXCAPTURES 600
+
+
+/* index, on Lua stack, for subject */
+#define SUBJIDX 2
+
+/* index, on Lua stack, for substitution value cache */
+#define SUBSCACHE 3
+
+/* index, on Lua stack, for capture list */
+#define CAPLISTIDX (SUBSCACHE + 1)
+
+/* index, on Lua stack, for pattern's fenv */
+#define PENVIDX (CAPLISTIDX + 1)
+
+
+
+typedef unsigned char byte;
+
+
+#define CHARSETSIZE ((UCHAR_MAX/CHAR_BIT) + 1)
+
+
+typedef byte Charset[CHARSETSIZE];
+
+
+typedef const char *(*PattFunc) (const void *ud,
+ const char *o, /* string start */
+ const char *s, /* current position */
+ const char *e); /* string end */
+
+
+/* Virtual Machine's instructions */
+typedef enum Opcode {
+ IAny, IChar, ISet, IZSet,
+ ITestAny, ITestChar, ITestSet, ITestZSet,
+ IRet, IEnd,
+ IChoice, IJmp, ICall, IOpenCall,
+ ICommit, IPartialCommit, IBackCommit, IFailTwice, IFail, IGiveup,
+ IFunc, ILFunc,
+ IFullCapture, IEmptyCapture, IOpenCapture, ICloseCapture
+} Opcode;
+
+
+#define ISJMP 1
+#define ISCHECK 2
+#define ISTEST 4
+#define ISNOFAIL 8
+#define ISCAPTURE 16
+#define ISMOVABLE 32
+#define ISFENVOFF 64
+
+static const byte opproperties[] = {
+ /* IAny */ ISCHECK,
+ /* IChar */ ISCHECK,
+ /* ISet */ ISCHECK,
+ /* IZSet */ ISCHECK,
+ /* ITestAny */ ISJMP | ISTEST | ISNOFAIL,
+ /* ITestChar */ ISJMP | ISTEST | ISNOFAIL,
+ /* ITestSet */ ISJMP | ISTEST | ISNOFAIL,
+ /* ITestZSet */ ISJMP | ISTEST | ISNOFAIL,
+ /* IRet */ 0,
+ /* IEnd */ 0,
+ /* IChoice */ ISJMP,
+ /* IJmp */ ISJMP | ISNOFAIL,
+ /* ICall */ ISJMP,
+ /* IOpenCall */ ISFENVOFF,
+ /* ICommit */ ISJMP,
+ /* IPartialCommit */ ISJMP,
+ /* IBackCommit */ ISJMP,
+ /* IFailTwice */ 0,
+ /* IFail */ 0,
+ /* IGiveup */ 0,
+ /* IFunc */ 0,
+ /* ILFunc */ ISFENVOFF,
+ /* IFullCapture */ ISCAPTURE | ISNOFAIL | ISFENVOFF,
+ /* IEmptyCapture */ ISCAPTURE | ISNOFAIL | ISMOVABLE | ISFENVOFF,
+ /* IOpenCapture */ ISCAPTURE | ISNOFAIL | ISMOVABLE | ISFENVOFF,
+ /* ICloseCapture */ ISCAPTURE | ISNOFAIL | ISMOVABLE | ISFENVOFF
+};
+
+
+typedef union Instruction {
+ struct Inst {
+ byte code;
+ byte aux;
+ short offset;
+ } i;
+ PattFunc f;
+ byte buff[1];
+} Instruction;
+
+static const Instruction giveup = {{IGiveup, 0, 0}};
+
+#define getkind(op) ((op)->i.aux & 0xF)
+#define getoff(op) (((op)->i.aux >> 4) & 0xF)
+
+#define dest(p,x) ((x) + ((p)+(x))->i.offset)
+
+#define MAXOFF 0xF
+
+#define isprop(op,p) (opproperties[(op)->i.code] & (p))
+#define isjmp(op) isprop(op, ISJMP)
+#define iscapture(op) isprop(op, ISCAPTURE)
+#define ischeck(op) isprop(op, ISCHECK)
+#define istest(op) isprop(op, ISTEST)
+#define isnofail(op) isprop(op, ISNOFAIL)
+#define ismovable(op) isprop(op, ISMOVABLE)
+#define isfenvoff(op) isprop(op, ISFENVOFF)
+
+/* kinds of captures */
+typedef enum CapKind {
+ Cclose, Cposition, Cconst, Csimple, Ctable, Cfunction,
+ Cquery, Cstring, Csubst, Caccum
+} CapKind;
+
+#define iscapnosize(k) ((k) == Cposition || (k) == Cconst)
+
+
+typedef struct Capture {
+ const char *s; /* position */
+ short idx;
+ byte kind;
+ byte siz;
+} Capture;
+
+
+/* maximum size (in elements) for a pattern */
+#define MAXPATTSIZE (SHRT_MAX - 10)
+
+
+/* size (in elements) for an instruction plus extra l bytes */
+#define instsize(l) (((l) - 1)/sizeof(Instruction) + 2)
+
+
+/* size (in elements) for a ISet instruction */
+#define CHARSETINSTSIZE instsize(CHARSETSIZE)
+
+
+
+#define loopset(v,b) { int v; for (v = 0; v < CHARSETSIZE; v++) b; }
+
+
+#define testchar(st,c) (((int)(st)[((c) >> 3)] & (1 << ((c) & 7))))
+#define setchar(st,c) ((st)[(c) >> 3] |= (1 << ((c) & 7)))
+
+
+
+static int sizei (Instruction *i) {
+ switch(i->i.code) {
+ case ISet: case IZSet: case ITestSet: case ITestZSet:
+ return CHARSETINSTSIZE;
+ case IFunc:
+ return i->i.offset;
+ default:
+ return 1;
+ }
+}
+
+
+static const char *val2str (lua_State *L, int idx) {
+ const char *k = lua_tostring(L, idx);
+ if (k != NULL)
+ return lua_pushfstring(L, "rule '%s'", k);
+ else
+ return lua_pushfstring(L, "rule <a %s>", luaL_typename(L, -1));
+}
+
+
+static int getposition (lua_State *L, int t, int i) {
+ int res;
+ lua_getfenv(L, -1);
+ lua_rawgeti(L, -1, i); /* get key from pattern's environment */
+ lua_gettable(L, t); /* get position from positions table */
+ res = lua_tointeger(L, -1);
+ if (res == 0) { /* key has no registered position? */
+ lua_rawgeti(L, -2, i); /* get key again */
+ luaL_error(L, "%s is not defined in given grammar", val2str(L, -1));
+ }
+ lua_pop(L, 2); /* remove environment and position */
+ return res;
+}
+
+
+
+/*
+** {======================================================
+** Printing patterns
+** =======================================================
+*/
+
+
+static void printcharset (const Charset st) {
+ int i;
+ printf("[");
+ for (i = 0; i <= UCHAR_MAX; i++) {
+ int first = i;
+ while (i <= UCHAR_MAX && testchar(st, i)) i++;
+ if (i - 1 == first)
+ printf("(%02x)", first);
+ else if (i - 1 > first)
+ printf("(%02x-%02x)", first, i - 1);
+ }
+ printf("]");
+}
+
+
+static void printcapkind (int kind) {
+ const char *const modes[] = {
+ "close", "position", "constant", "simple", "table", "function",
+ "query", "string", "substitution", "accumulator"};
+ printf("%s", modes[kind]);
+}
+
+
+static void printinst (const Instruction *op, const Instruction *p) {
+ const char *const names[] = {
+ "any", "char", "set", "zset",
+ "testany", "testchar", "testset", "testzset",
+ "ret", "end",
+ "choice", "jmp", "call", "open_call",
+ "commit", "partial_commit", "back_commit", "failtwice", "fail", "giveup",
+ "func", "Luafunc",
+ "fullcapture", "emptycapture", "opencapture", "closecapture"
+ };
+ printf("%02d: %s ", p - op, names[p->i.code]);
+ switch ((Opcode)p->i.code) {
+ case IChar: {
+ printf("'%c'", p->i.aux);
+ break;
+ }
+ case ITestChar: {
+ printf("'%c'", p->i.aux);
+ printf("-> %d", dest(0, p) - op);
+ break;
+ }
+ case IAny: {
+ printf("* %d", p->i.aux);
+ break;
+ }
+ case ITestAny: {
+ printf("* %d", p->i.aux);
+ printf("-> %d", dest(0, p) - op);
+ break;
+ }
+ case IFullCapture: case IOpenCapture:
+ case IEmptyCapture: case ICloseCapture: {
+ printcapkind(getkind(p));
+ printf("(n = %d)", getoff(p));
+ /* go through */
+ }
+ case ILFunc: {
+ printf(" (%d)", p->i.offset);
+ break;
+ }
+ case ISet: case IZSet: {
+ printcharset((p+1)->buff);
+ break;
+ }
+ case ITestSet: case ITestZSet: {
+ printcharset((p+1)->buff);
+ printf("-> %d", dest(0, p) - op);
+ break;
+ }
+ case IOpenCall: {
+ printf("-> %d", p->i.offset);
+ break;
+ }
+ case IChoice: {
+ printf("-> %d (%d)", dest(0, p) - op, p->i.aux);
+ break;
+ }
+ case IJmp: case ICall: case ICommit:
+ case IPartialCommit: case IBackCommit: {
+ printf("-> %d", dest(0, p) - op);
+ break;
+ }
+ default: break;
+ }
+ printf("\n");
+}
+
+
+static void printpatt (Instruction *p) {
+ Instruction *op = p;
+ for (;;) {
+ printinst(op, p);
+ if (p->i.code == IEnd) break;
+ p += sizei(p);
+ }
+}
+
+
+static void printcap (Capture *cap) {
+ for (; cap->s; cap++) {
+ printcapkind(cap->kind);
+ printf(" (idx: %d - size: %d) -> %p\n", cap->idx, cap->siz, cap->s);
+ }
+}
+
+/* }====================================================== */
+
+
+
+
+/*
+** {======================================================
+** Virtual Machine
+** =======================================================
+*/
+
+
+typedef struct Stack {
+ const char *s;
+ const Instruction *p;
+ int caplevel;
+} Stack;
+
+
+static Capture *doublecap (lua_State *L, Capture *cap, int captop) {
+ Capture *newc;
+ if (captop >= INT_MAX/((int)sizeof(Capture) * 2))
+ luaL_error(L, "too many captures");
+ newc = (Capture *)lua_newuserdata(L, captop * 2 * sizeof(Capture));
+ memcpy(newc, cap, captop * sizeof(Capture));
+ lua_replace(L, CAPLISTIDX);
+ return newc;
+}
+
+
+static const char *match (lua_State *L, const char *o, const char *s,
+ const char *e, Instruction *op, Capture *capture) {
+ Stack stackbase[MAXBACK];
+ Stack *stacklimit = stackbase + MAXBACK;
+ Stack *stack = stackbase; /* point to first empty slot in stack */
+ int capsize = IMAXCAPTURES;
+ int captop = 0; /* point to first empty slot in captures */
+ const Instruction *p = op;
+ stack->p = &giveup; stack->s = s; stack->caplevel = 0; stack++;
+ for (;;) {
+#if defined(DEBUG)
+ printf("s: |%s| stck: %d c: %d ", s, stack - stackbase, captop);
+ printinst(op, p);
+#endif
+ switch ((Opcode)p->i.code) {
+ case IEnd: {
+ assert(stack == stackbase + 1);
+ capture[captop].kind = Cclose;
+ capture[captop].s = NULL;
+ return s;
+ }
+ case IGiveup: {
+ assert(stack == stackbase);
+ return NULL;
+ }
+ case IRet: {
+ assert(stack > stackbase && (stack - 1)->s == NULL);
+ p = (--stack)->p;
+ continue;
+ }
+ case IAny: {
+ int n = p->i.aux;
+ if (n > e - s) goto fail;
+ else { p++; s += n; }
+ continue;
+ }
+ case ITestAny: {
+ int n = p->i.aux;
+ if (n > e - s) p += p->i.offset;
+ else { p++; s += n; }
+ continue;
+ }
+ case IChar: {
+ if ((byte)*s != p->i.aux || s >= e) goto fail;
+ else { p++; s++; }
+ continue;
+ }
+ case ITestChar: {
+ if ((byte)*s != p->i.aux || s >= e) p += p->i.offset;
+ else { p++; s++; }
+ continue;
+ }
+ case ISet: {
+ int c = (unsigned char)*s;
+ if (!testchar((p+1)->buff, c)) goto fail;
+ else { p += CHARSETINSTSIZE; s++; }
+ continue;
+ }
+ case ITestSet: {
+ int c = (unsigned char)*s;
+ if (!testchar((p+1)->buff, c)) p += p->i.offset;
+ else { p += CHARSETINSTSIZE; s++; }
+ continue;
+ }
+ case IZSet: {
+ int c = (unsigned char)*s;
+ if (!testchar((p+1)->buff, c) || s >= e) goto fail;
+ else { p += CHARSETINSTSIZE; s++; }
+ continue;
+ }
+ case ITestZSet: {
+ int c = (unsigned char)*s;
+ if (!testchar((p+1)->buff, c) || s >= e) p += p->i.offset;
+ else { p += CHARSETINSTSIZE; s++; }
+ continue;
+ }
+ case IFunc: {
+ const char *r = (p+1)->f((p+2)->buff, o, s, e);
+ if (r == NULL) goto fail;
+ s = r;
+ p += p->i.offset;
+ continue;
+ }
+ case ILFunc: {
+ lua_Integer res;
+ lua_rawgeti(L, PENVIDX, p->i.offset); /* push function */
+ lua_pushvalue(L, SUBJIDX); /* push original subject */
+ lua_pushinteger(L, s - o + 1); /* current position */
+ lua_call(L, 2, 1);
+ res = lua_tointeger(L, -1) - 1;
+ lua_pop(L, 1);
+ if (res < s - o || res > e - o) goto fail;
+ s = o + res;
+ p++;
+ continue;
+ }
+ case IJmp: {
+ p += p->i.offset;
+ continue;
+ }
+ case IChoice: {
+ if (stack >= stacklimit)
+ return (luaL_error(L, "too many pending calls/choices"), (char *)0);
+ stack->p = dest(0, p);
+ stack->s = s - p->i.aux;
+ stack->caplevel = captop;
+ stack++;
+ p++;
+ continue;
+ }
+ case ICall: {
+ if (stack >= stacklimit)
+ return (luaL_error(L, "too many pending calls/choices"), (char *)0);
+ stack->s = NULL;
+ stack->p = p + 1; /* save return address */
+ stack++;
+ p += p->i.offset;
+ continue;
+ }
+ case ICommit: {
+ assert(stack > stackbase && (stack - 1)->s != NULL);
+ stack--;
+ p += p->i.offset;
+ continue;
+ }
+ case IPartialCommit: {
+ assert(stack > stackbase && (stack - 1)->s != NULL);
+ (stack - 1)->s = s;
+ (stack - 1)->caplevel = captop;
+ p += p->i.offset;
+ continue;
+ }
+ case IBackCommit: {
+ assert(stack > stackbase && (stack - 1)->s != NULL);
+ s = (--stack)->s;
+ captop = stack->caplevel;
+ p += p->i.offset;
+ continue;
+ }
+ case IFailTwice:
+ assert(stack > stackbase);
+ stack--;
+ /* go through */
+ case IFail:
+ fail: { /* pattern failed: try to backtrack */
+ do { /* remove pending calls */
+ assert(stack > stackbase);
+ s = (--stack)->s;
+ } while (s == NULL);
+ captop = stack->caplevel;
+ p = stack->p;
+ continue;
+ }
+ case ICloseCapture: {
+ const char *s1 = s - getoff(p);
+ assert(captop > 0);
+ if (capture[captop - 1].siz == 0 &&
+ s1 - capture[captop - 1].s < UCHAR_MAX) {
+ capture[captop - 1].siz = s1 - capture[captop - 1].s + 1;
+ p++;
+ continue;
+ }
+ /* else go through */
+ }
+ case IEmptyCapture:
+ capture[captop].siz = 1; /* mark entry as closed */
+ goto capture;
+ case IOpenCapture:
+ capture[captop].siz = 0; /* mark entry as open */
+ goto capture;
+ case IFullCapture:
+ capture[captop].siz = getoff(p) + 1; /* save capture size */
+ capture: {
+ capture[captop].s = s - getoff(p);
+ capture[captop].idx = p->i.offset;
+ capture[captop].kind = getkind(p);
+ if (++captop >= capsize) {
+ capture = doublecap(L, capture, captop);
+ capsize = 2 * captop;
+ }
+ p++;
+ continue;
+ }
+ case IOpenCall:
+ luaL_error(L, "reference to unknown rule #%d", p->i.offset);
+ default: assert(0); return NULL;
+ }
+ }
+}
+
+/* }====================================================== */
+
+
+/*
+** {======================================================
+** Verifier
+** =======================================================
+*/
+
+
+static int verify (lua_State *L, Instruction *op, const Instruction *p,
+ Instruction *e, int postable, int rule) {
+ static const char dummy[] = "";
+ Stack back[MAXBACK];
+ int backtop = 0; /* point to first empty slot in back */
+ while (p != e) {
+ switch ((Opcode)p->i.code) {
+ case IRet: {
+ p = back[--backtop].p;
+ continue;
+ }
+ case IChoice: {
+ if (backtop >= MAXBACK)
+ return luaL_error(L, "too many pending calls/choices");
+ back[backtop].p = dest(0, p);
+ back[backtop++].s = dummy;
+ p++;
+ continue;
+ }
+ case ICall: {
+ assert((p + 1)->i.code != IRet); /* no tail call */
+ if (backtop >= MAXBACK)
+ return luaL_error(L, "too many pending calls/choices");
+ back[backtop].s = NULL;
+ back[backtop++].p = p + 1;
+ goto dojmp;
+ }
+ case IOpenCall: {
+ int i;
+ if (postable == 0) /* grammar still not fixed? */
+ goto fail; /* to be verified later */
+ for (i = 0; i < backtop; i++) {
+ if (back[i].s == NULL && back[i].p == p + 1)
+ return luaL_error(L, "%s is left recursive", val2str(L, rule));
+ }
+ if (backtop >= MAXBACK)
+ return luaL_error(L, "too many pending calls/choices");
+ back[backtop].s = NULL;
+ back[backtop++].p = p + 1;
+ p = op + getposition(L, postable, p->i.offset);
+ continue;
+ }
+ case IBackCommit:
+ case ICommit: {
+ assert(backtop > 0 && p->i.offset > 0);
+ backtop--;
+ goto dojmp;
+ }
+ case IPartialCommit: {
+ assert(backtop > 0 && p->i.offset > 0);
+ goto dojmp;
+ }
+ case ITestAny:
+ case ITestChar: /* all these cases jump for empty subject */
+ case ITestSet:
+ case ITestZSet:
+ case IJmp:
+ dojmp: {
+ p += p->i.offset;
+ continue;
+ }
+ case IAny:
+ case IChar:
+ case ISet:
+ case IZSet:
+ case IFailTwice: /* assume that first level failed; try to backtrack */
+ goto fail;
+ case IFail: {
+ if (p->i.aux) { /* is an 'and' predicate? */
+ assert((p - 1)->i.code == IBackCommit && (p - 1)->i.offset == 2);
+ p++; /* pretend it succeeded and go ahead */
+ continue;
+ }
+ /* else go through */
+ }
+ fail: { /* pattern failed: try to backtrack */
+ do {
+ if (backtop-- == 0)
+ return 1; /* no more backtracking */
+ } while (back[backtop].s == NULL);
+ p = back[backtop].p;
+ continue;
+ }
+ case IOpenCapture:
+ case ICloseCapture:
+ case IEmptyCapture:
+ case IFullCapture: {
+ p++;
+ continue;
+ }
+ case IFunc: {
+ const char *r = (p+1)->f((p+2)->buff, dummy, dummy, dummy);
+ if (r == NULL) goto fail;
+ p += p->i.offset;
+ continue;
+ }
+ case ILFunc: {
+ goto fail; /* be liberal in this case */
+ }
+ case IEnd: /* cannot happen (should stop before it) */
+ default: assert(0); return 0;
+ }
+ }
+ assert(backtop == 0);
+ return 0;
+}
+
+
+static void checkrule (lua_State *L, Instruction *op, int from, int to,
+ int postable, int rule) {
+ int i;
+ int lastopen = 0; /* most recent OpenCall seen in the code */
+ for (i = from; i < to; i += sizei(op + i)) {
+ if (op[i].i.code == IPartialCommit && op[i].i.offset < 0) { /* loop? */
+ int start = dest(op, i);
+ assert(op[start - 1].i.code == IChoice && dest(op, start - 1) == i + 1);
+ if (start <= lastopen) { /* loop does contain an open call? */
+ if (!verify(L, op, op + start, op + i, postable, rule)) /* check body */
+ luaL_error(L, "possible infinite loop in %s", val2str(L, rule));
+ }
+ }
+ else if (op[i].i.code == IOpenCall)
+ lastopen = i;
+ }
+ assert(op[i - 1].i.code == IRet);
+ verify(L, op, op + from, op + to - 1, postable, rule);
+}
+
+
+
+
+/* }====================================================== */
+
+
+
+
+/*
+** {======================================================
+** Building Patterns
+** =======================================================
+*/
+
+enum charsetanswer { NOINFO, ISCHARSET, VALIDSTARTS };
+
+typedef struct CharsetTag {
+ enum charsetanswer tag;
+ Charset cs;
+} CharsetTag;
+
+
+static void check2test (Instruction *p, int n) {
+ assert(ischeck(p));
+ p->i.code += ITestAny - IAny;
+ p->i.offset = n;
+}
+
+
+/*
+** invert array slice p[0]-p[e] (both inclusive)
+*/
+static void invert (Instruction *p, int e) {
+ int i;
+ for (i = 0; i < e; i++, e--) {
+ Instruction temp = p[i];
+ p[i] = p[e];
+ p[e] = temp;
+ }
+}
+
+
+/*
+** rotate array slice p[0]-p[e] (both inclusive) 'n' steps
+** to the 'left'
+*/
+static void rotate (Instruction *p, int e, int n) {
+ invert(p, n - 1);
+ invert(p + n, e - n);
+ invert(p, e);
+}
+
+
+#define op_step(p) ((p)->i.code == IAny ? (p)->i.aux : 1)
+
+
+static int skipchecks (Instruction *p, int up, int *pn) {
+ int i, n = 0;
+ for (i = 0; ischeck(p + i); i += sizei(p + i)) {
+ int st = op_step(p + i);
+ if (n + st > MAXOFF - up) break;
+ n += st;
+ }
+ *pn = n;
+ return i;
+}
+
+
+#define ismovablecap(op) (ismovable(op) && getoff(op) < MAXOFF)
+
+static void optimizecaptures (Instruction *p) {
+ int i;
+ int limit = 0;
+ for (i = 0; p[i].i.code != IEnd; i += sizei(p + i)) {
+ if (isjmp(p + i) && dest(p, i) >= limit)
+ limit = dest(p, i) + 1; /* do not optimize jump targets */
+ else if (i >= limit && ismovablecap(p + i) && ischeck(p + i + 1)) {
+ int end, n, j; /* found a border capture|check */
+ int maxoff = getoff(p + i);
+ int start = i;
+ /* find first capture in the group */
+ while (start > limit && ismovablecap(p + start - 1)) {
+ start--;
+ if (getoff(p + start) > maxoff) maxoff = getoff(p + start);
+ }
+ end = skipchecks(p + i + 1, maxoff, &n) + i; /* find last check */
+ if (n == 0) continue; /* first check is too big to move across */
+ assert(n <= MAXOFF && start <= i && i < end);
+ for (j = start; j <= i; j++)
+ p[j].i.aux += (n << 4); /* correct offset of captures to be moved */
+ rotate(p + start, end - start, i - start + 1); /* move them up */
+ i = end;
+ assert(ischeck(p + start) && iscapture(p + i));
+ }
+ }
+}
+
+
+static int target (Instruction *p, int i) {
+ while (p[i].i.code == IJmp) i += p[i].i.offset;
+ return i;
+}
+
+
+static void optimizejumps (Instruction *p) {
+ int i;
+ for (i = 0; p[i].i.code != IEnd; i += sizei(p + i)) {
+ if (isjmp(p + i))
+ p[i].i.offset = target(p, dest(p, i)) - i;
+ }
+}
+
+
+static void optimizechoice (Instruction *p) {
+ assert(p->i.code == IChoice);
+ if (ischeck(p + 1)) {
+ int lc = sizei(p + 1);
+ rotate(p, lc, 1);
+ assert(ischeck(p) && (p + lc)->i.code == IChoice);
+ (p + lc)->i.aux = op_step(p);
+ check2test(p, (p + lc)->i.offset);
+ (p + lc)->i.offset -= lc;
+ }
+}
+
+
+/*
+** A 'headfail' pattern is a pattern that can only fail in its first
+** instruction, which must be a check.
+*/
+static int isheadfail (Instruction *p) {
+ if (!ischeck(p)) return 0;
+ /* check that other operations cannot fail */
+ for (p += sizei(p); p->i.code != IEnd; p += sizei(p))
+ if (!isnofail(p)) return 0;
+ return 1;
+}
+
+
+#define checkpattern(L, idx) ((Instruction *)luaL_checkudata(L, idx, "pattern"))
+
+
+static int jointable (lua_State *L, int p1) {
+ int n, n1, i;
+ lua_getfenv(L, p1);
+ n1 = lua_objlen(L, -1); /* number of elements in p1's env */
+ lua_getfenv(L, -2);
+ if (n1 == 0 || lua_equal(L, -2, -1)) {
+ lua_pop(L, 2);
+ return 0; /* no need to change anything */
+ }
+ n = lua_objlen(L, -1); /* number of elements in p's env */
+ if (n == 0) {
+ lua_pop(L, 1); /* removes p env */
+ lua_setfenv(L, -2); /* p now shares p1's env */
+ return 0; /* no need to correct anything */
+ }
+ lua_createtable(L, n + n1, 0);
+ /* stack: p; p1 env; p env; new p env */
+ for (i = 1; i <= n; i++) {
+ lua_rawgeti(L, -2, i);
+ lua_rawseti(L, -2, i);
+ }
+ for (i = 1; i <= n1; i++) {
+ lua_rawgeti(L, -3, i);
+ lua_rawseti(L, -2, n + i);
+ }
+ lua_setfenv(L, -4); /* new table becomes p env */
+ lua_pop(L, 2); /* remove p1 env and old p env */
+ return n;
+}
+
+
+#define copypatt(p1,p2,sz) memcpy(p1, p2, (sz) * sizeof(Instruction));
+
+#define pattsize(L,idx) (lua_objlen(L, idx)/sizeof(Instruction) - 1)
+
+
+static int addpatt (lua_State *L, Instruction *p, int p1idx) {
+ Instruction *p1 = (Instruction *)lua_touserdata(L, p1idx);
+ int sz = pattsize(L, p1idx);
+ int corr = jointable(L, p1idx);
+ copypatt(p, p1, sz + 1);
+ if (corr != 0) {
+ Instruction *px;
+ for (px = p; px < p + sz; px += sizei(px)) {
+ if (isfenvoff(px) && px->i.offset != 0)
+ px->i.offset += corr;
+ }
+ }
+ return sz;
+}
+
+
+static void setinstaux (Instruction *i, Opcode op, int offset, int aux) {
+ i->i.code = op;
+ i->i.offset = offset;
+ i->i.aux = aux;
+}
+
+#define setinst(i,op,off) setinstaux(i,op,off,0)
+
+static int value2fenv (lua_State *L, int vidx) {
+ lua_createtable(L, 1, 0);
+ lua_pushvalue(L, vidx);
+ lua_rawseti(L, -2, 1);
+ lua_setfenv(L, -2);
+ return 1;
+}
+
+
+static Instruction *newpatt (lua_State *L, size_t n) {
+ Instruction *p;
+ if (n >= MAXPATTSIZE - 1)
+ luaL_error(L, "pattern too big");
+ p = (Instruction *)lua_newuserdata(L, (n + 1) * sizeof(Instruction));
+ luaL_getmetatable(L, "pattern");
+ lua_setmetatable(L, -2);
+ setinst(p + n, IEnd, 0);
+ return p;
+}
+
+
+static void fillcharset (Instruction *p, Charset cs) {
+ switch (p[0].i.code) {
+ case IZSet: case ITestZSet:
+ assert(testchar(p[1].buff, '\0'));
+ /* go through */
+ case ISet: case ITestSet: {
+ loopset(i, cs[i] = p[1].buff[i]);
+ break;
+ }
+ case IChar: case ITestChar: {
+ loopset(i, cs[i] = 0);
+ setchar(cs, p[0].i.aux);
+ break;
+ }
+ default: { /* any char may start unhandled instructions */
+ loopset(i, cs[i] = 0xff);
+ break;
+ }
+ }
+}
+
+
+/*
+** Function 'tocharset' gets information about which chars may be a
+** valid start for a pattern.
+*/
+
+static enum charsetanswer tocharset (Instruction *p, CharsetTag *c) {
+ if (ischeck(p)) {
+ fillcharset(p, c->cs);
+ if ((p + sizei(p))->i.code == IEnd && op_step(p) == 1)
+ c->tag = ISCHARSET;
+ else
+ c->tag = VALIDSTARTS;
+ }
+ else
+ c->tag = NOINFO;
+ return c->tag;
+}
+
+
+static int exclusiveset (Charset c1, Charset c2) {
+ /* non-empty intersection? */
+ loopset(i, {if ((c1[i] & c2[i]) != 0) return 0;});
+ return 1; /* no intersection */
+}
+
+
+static int exclusive (CharsetTag *c1, CharsetTag *c2) {
+ if (c1->tag == NOINFO || c2->tag == NOINFO)
+ return 0; /* one of them is not filled */
+ else return exclusiveset(c1->cs, c2->cs);
+}
+
+
+#define correctset(p) { if (testchar(p[1].buff, '\0')) p->i.code = IZSet; }
+
+static Instruction *newcharset (lua_State *L) {
+ Instruction *p = newpatt(L, CHARSETINSTSIZE);
+ p[0].i.code = ISet;
+ loopset(i, p[1].buff[i] = 0);
+ return p;
+}
+
+
+static int set_l (lua_State *L) {
+ size_t l;
+ const char *s = luaL_checklstring(L, 1, &l);
+ Instruction *p = newcharset(L);
+ while (l--) {
+ setchar(p[1].buff, (unsigned char)(*s));
+ s++;
+ }
+ correctset(p);
+ return 1;
+}
+
+
+static int range_l (lua_State *L) {
+ int arg;
+ int top = lua_gettop(L);
+ Instruction *p = newcharset(L);
+ for (arg = 1; arg <= top; arg++) {
+ int c;
+ size_t l;
+ const char *r = luaL_checklstring(L, arg, &l);
+ luaL_argcheck(L, l == 2, arg, "range must have two characters");
+ for (c = (byte)r[0]; c <= (byte)r[1]; c++)
+ setchar(p[1].buff, c);
+ }
+ correctset(p);
+ return 1;
+}
+
+
+static int nter_l (lua_State *L) {
+ Instruction *p = newpatt(L, 1);
+ luaL_checkany(L, 1);
+ setinst(p, IOpenCall, value2fenv(L, 1));
+ return 1;
+}
+
+
+
+static void checkfield (lua_State *L) {
+ Instruction *p = (Instruction *)lua_touserdata(L, -1);
+ if (p != NULL) { /* value is a userdata? */
+ if (lua_getmetatable(L, -1)) { /* does it have a metatable? */
+ lua_getfield(L, LUA_REGISTRYINDEX, "pattern");
+ if (lua_rawequal(L, -1, -2)) { /* does it have the correct mt? */
+ lua_pop(L, 2); /* remove both metatables */
+ return;
+ }
+ }
+ }
+ luaL_error(L, "invalid field in grammar");
+}
+
+
+static Instruction *fix_l (lua_State *L, int t) {
+ Instruction *p;
+ int i;
+ int totalsize = 2; /* include initial call and jump */
+ int n = 0; /* number of rules */
+ int base = lua_gettop(L);
+ lua_newtable(L); /* to store relative positions of each rule */
+ lua_pushinteger(L, 1); /* default initial rule */
+ /* collect patterns and compute sizes */
+ lua_pushnil(L);
+ while (lua_next(L, t) != 0) {
+ int l;
+ if (lua_tonumber(L, -2) == 1 && lua_isstring(L, -1)) {
+ lua_replace(L, base + 2); /* use this value as initial rule */
+ continue;
+ }
+ checkfield(L);
+ l = pattsize(L, -1) + 1; /* space for pattern + ret */
+ if (totalsize >= MAXPATTSIZE - l)
+ luaL_error(L, "grammar too large");
+ luaL_checkstack(L, LUA_MINSTACK, "grammar has too many rules");
+ lua_insert(L, -2); /* put key on top */
+ lua_pushvalue(L, -1); /* duplicate key (for lua_next) */
+ lua_pushvalue(L, -1); /* duplicate key (to index positions table) */
+ lua_pushinteger(L, totalsize); /* position for this rule */
+ lua_settable(L, base + 1); /* store key=>position in positions table */
+ totalsize += l;
+ n++;
+ }
+ luaL_argcheck(L, n > 0, t, "empty grammar");
+ p = newpatt(L, totalsize); /* create new pattern */
+ p++; /* save space for call */
+ setinst(p++, IJmp, totalsize - 1); /* after call, jumps to the end */
+ for (i = 1; i <= n; i++) { /* copy all rules into new pattern */
+ p += addpatt(L, p, base + 1 + i*2);
+ setinst(p++, IRet, 0);
+ }
+ p -= totalsize; /* back to first position */
+ totalsize = 2; /* go through each rule's position */
+ for (i = 1; i <= n; i++) { /* check all rules */
+ int l = pattsize(L, base + 1 + i*2) + 1;
+ checkrule(L, p, totalsize, totalsize + l, base + 1, base + 2 + i*2);
+ totalsize += l;
+ }
+ lua_pushvalue(L, base + 2); /* get initial rule */
+ lua_gettable(L, base + 1); /* get its position in positions table */
+ i = lua_tonumber(L, -1); /* convert to number */
+ lua_pop(L, 1);
+ if (i == 0) /* is it defined? */
+ luaL_error(L, "initial rule not defined in given grammar");
+ setinst(p, ICall, i); /* first instruction calls initial rule */
+ /* correct calls */
+ for (i = 0; i < totalsize; i += sizei(p + i)) {
+ if (p[i].i.code == IOpenCall) {
+ int pos = getposition(L, base + 1, p[i].i.offset);
+ p[i].i.code = (p[target(p, i + 1)].i.code == IRet) ? IJmp : ICall;
+ p[i].i.offset = pos - i;
+ }
+ }
+ optimizejumps(p);
+ lua_replace(L, t); /* put new pattern in old's position */
+ lua_settop(L, base); /* remove rules and positions table */
+ return p;
+}
+
+
+static Instruction *getpatt (lua_State *L, int idx, int *size) {
+ Instruction *p;
+ switch (lua_type(L, idx)) {
+ case LUA_TSTRING: {
+ size_t i, len;
+ const char *s = lua_tolstring(L, idx, &len);
+ p = newpatt(L, len);
+ for (i = 0; i < len; i++)
+ setinstaux(p + i, IChar, 0, (unsigned char)s[i]);
+ lua_replace(L, idx);
+ break;
+ }
+ case LUA_TNUMBER: {
+ int n = lua_tointeger(L, idx);
+ if (n == 0) /* empty pattern? */
+ p = newpatt(L, 0);
+ else if (n > 0) {
+ Instruction *p1 = p = newpatt(L, (n - 1)/UCHAR_MAX + 1);
+ for (; n > UCHAR_MAX; n -= UCHAR_MAX)
+ setinstaux(p1++, IAny, 0, UCHAR_MAX);
+ setinstaux(p1, IAny, 0, n);
+ }
+ else if (-n <= UCHAR_MAX) {
+ p = newpatt(L, 2);
+ setinstaux(p, ITestAny, 2, -n);
+ setinst(p + 1, IFail, 0);
+ }
+ else {
+ int na = (-n - 1)/UCHAR_MAX;
+ Instruction *p1 = p = newpatt(L, 2 + na + 1);
+ setinstaux(p1++, ITestAny, na + 3, UCHAR_MAX);
+ setinstaux(p1++, IChoice, na + 2, UCHAR_MAX);
+ for (n += UCHAR_MAX; -n > UCHAR_MAX; n += UCHAR_MAX)
+ setinstaux(p1++, IAny, 0, UCHAR_MAX);
+ setinstaux(p1++, IAny, 0, -n);
+ setinst(p1, IFailTwice, 0);
+ }
+ lua_replace(L, idx);
+ break;
+ }
+ case LUA_TBOOLEAN: {
+ if (lua_toboolean(L, idx)) /* true? */
+ p = newpatt(L, 0); /* empty pattern (always succeeds) */
+ else {
+ p = newpatt(L, 1);
+ setinst(p, IFail, 0);
+ }
+ lua_replace(L, idx);
+ break;
+ }
+ case LUA_TTABLE: {
+ p = fix_l(L, idx);
+ break;
+ }
+ case LUA_TFUNCTION: {
+ p = newpatt(L, 1);
+ setinst(p, ILFunc, value2fenv(L, idx));
+ lua_replace(L, idx);
+ break;
+ }
+ default: {
+ p = checkpattern(L, idx);
+ break;
+ }
+ }
+ if (size) *size = pattsize(L, idx);
+ return p;
+}
+
+
+static int getpattl (lua_State *L, int idx) {
+ int size;
+ getpatt(L, idx, &size);
+ return size;
+}
+
+
+static int pattern_l (lua_State *L) {
+ lua_settop(L, 1);
+ getpatt(L, 1, NULL);
+ return 1;
+}
+
+
+static int concat_l (lua_State *L) {
+ /* p1; p2; */
+ Instruction *p;
+ int l1 = getpattl(L, 1);
+ int l2 = getpattl(L, 2);
+ Instruction *op = newpatt(L, l1 + l2);
+ p = op + addpatt(L, op, 1);
+ addpatt(L, p, 2);
+ optimizecaptures(op);
+ return 1;
+}
+
+
+/* p1 - p2: matches p1 only where p2 does not match */
+static int diff_l (lua_State *L) {
+ int l1, l2;
+ Instruction *p1 = getpatt(L, 1, &l1);
+ Instruction *p2 = getpatt(L, 2, &l2);
+ CharsetTag st1, st2;
+ if (tocharset(p1, &st1) == ISCHARSET && tocharset(p2, &st2) == ISCHARSET) {
+ Instruction *p = newcharset(L);
+ loopset(i, p[1].buff[i] = st1.cs[i] & ~st2.cs[i]);
+ correctset(p);
+ }
+ else if (isheadfail(p2)) {
+ Instruction *p = newpatt(L, l2 + 1 + l1);
+ p += addpatt(L, p, 2);
+ check2test(p - l2, l2 + 1);
+ setinst(p++, IFail, 0);
+ addpatt(L, p, 1);
+ }
+ else { /* !e2 . e1 */
+ /* !e -> choice L1; e; failtwice; L1: ... */
+ Instruction *p = newpatt(L, 1 + l2 + 1 + l1);
+ Instruction *pi = p;
+ setinst(p++, IChoice, 1 + l2 + 1);
+ p += addpatt(L, p, 2);
+ setinst(p++, IFailTwice, 0);
+ addpatt(L, p, 1);
+ optimizechoice(pi);
+ }
+ return 1;
+}
+
+
+/* -e is implemented as "" - e */
+static int unm_l (lua_State *L) {
+ lua_pushliteral(L, "");
+ lua_insert(L, 1);
+ return diff_l(L);
+}
+
+
+static int pattand_l (lua_State *L) {
+ /* &e -> choice L1; e; backcommit L2; L1: fail; L2: ... */
+ int l1 = getpattl(L, 1);
+ Instruction *p = newpatt(L, 1 + l1 + 2);
+ setinst(p++, IChoice, 1 + l1 + 1);
+ p += addpatt(L, p, 1);
+ setinst(p++, IBackCommit, 2);
+ setinstaux(p, IFail, 0, 1);
+ return 1;
+}
+
+
+static int firstpart (Instruction *p, int l) {
+ if (istest(p)) {
+ int e = p[0].i.offset - 1;
+ if ((p[e].i.code == IJmp || p[e].i.code == ICommit) &&
+ e + p[e].i.offset == l)
+ return e + 1;
+ }
+ else if (p[0].i.code == IChoice) {
+ int e = p[0].i.offset - 1;
+ if (p[e].i.code == ICommit && e + p[e].i.offset == l)
+ return e + 1;
+ }
+ return 0;
+}
+
+
+static Instruction *auxnew (lua_State *L, Instruction **op, int *size,
+ int extra) {
+ *op = newpatt(L, *size + extra);
+ jointable(L, 1);
+ *size += extra;
+ return *op + *size - extra;
+}
+
+
+static int nofail (Instruction *p, int l) {
+ int i;
+ for (i = 0; i < l; i += sizei(p + i)) {
+ if (!isnofail(p + i)) return 0;
+ }
+ return 1;
+}
+
+
+static int interfere (Instruction *p1, int l1, CharsetTag *st2) {
+ if (nofail(p1, l1)) /* p1 cannot fail? */
+ return 0; /* nothing can interfere with it */
+ if (st2->tag == NOINFO) return 1;
+ switch (p1->i.code) {
+ case ITestChar: return testchar(st2->cs, p1->i.aux);
+ case ITestSet: return !exclusiveset(st2->cs, (p1 + 1)->buff);
+ default: assert(p1->i.code == ITestAny); return 1;
+ }
+}
+
+
+static Instruction *basicUnion (lua_State *L, Instruction *p1, int l1,
+ int l2, int *size, CharsetTag *st2) {
+ Instruction *op;
+ CharsetTag st1;
+ tocharset(p1, &st1);
+ if (st1.tag == ISCHARSET && st2->tag == ISCHARSET) {
+ Instruction *p = auxnew(L, &op, size, CHARSETINSTSIZE);
+ setinst(p, ISet, 0);
+ loopset(i, p[1].buff[i] = st1.cs[i] | st2->cs[i]);
+ correctset(p);
+ }
+ else if (exclusive(&st1, st2) || isheadfail(p1)) {
+ Instruction *p = auxnew(L, &op, size, l1 + 1 + l2);
+ copypatt(p, p1, l1);
+ check2test(p, l1 + 1);
+ p += l1;
+ setinst(p++, IJmp, l2 + 1);
+ addpatt(L, p, 2);
+ }
+ else {
+ /* choice L1; e1; commit L2; L1: e2; L2: ... */
+ Instruction *p = auxnew(L, &op, size, 1 + l1 + 1 + l2);
+ setinst(p++, IChoice, 1 + l1 + 1);
+ copypatt(p, p1, l1); p += l1;
+ setinst(p++, ICommit, 1 + l2);
+ addpatt(L, p, 2);
+ optimizechoice(p - (1 + l1 + 1));
+ }
+ return op;
+}
+
+
+static Instruction *separateparts (lua_State *L, Instruction *p1, int l1,
+ int l2, int *size, CharsetTag *st2) {
+ int sp = firstpart(p1, l1);
+ if (sp == 0) /* first part is entire p1? */
+ return basicUnion(L, p1, l1, l2, size, st2);
+ else if ((p1 + sp - 1)->i.code == ICommit || !interfere(p1, sp, st2)) {
+ Instruction *p;
+ int init = *size;
+ int end = init + sp;
+ *size = end;
+ p = separateparts(L, p1 + sp, l1 - sp, l2, size, st2);
+ copypatt(p + init, p1, sp);
+ (p + end - 1)->i.offset = *size - (end - 1);
+ return p;
+ }
+ else { /* must change back to non-optimized choice */
+ Instruction *p;
+ int init = *size;
+ int end = init + sp + 1; /* needs one extra instruction (choice) */
+ int sizefirst = sizei(p1); /* size of p1's first instruction (the test) */
+ *size = end;
+ p = separateparts(L, p1 + sp, l1 - sp, l2, size, st2);
+ copypatt(p + init, p1, sizefirst); /* copy the test */
+ (p + init)->i.offset++; /* correct jump (because of new instruction) */
+ init += sizefirst;
+ setinstaux(p + init, IChoice, sp - sizefirst + 1, 1); init++;
+ copypatt(p + init, p1 + sizefirst, sp - sizefirst - 1);
+ init += sp - sizefirst - 1;
+ setinst(p + init, ICommit, *size - (end - 1));
+ return p;
+ }
+}
+
+
+static int union_l (lua_State *L) {
+ int l1, l2;
+ int size = 0;
+ Instruction *p1 = getpatt(L, 1, &l1);
+ Instruction *p2 = getpatt(L, 2, &l2);
+ CharsetTag st2;
+ if (p1->i.code == IFail) /* check for identity element */
+ lua_pushvalue(L, 2);
+ else if (p2->i.code == IFail)
+ lua_pushvalue(L, 1);
+ else {
+ tocharset(p2, &st2);
+ separateparts(L, p1, l1, l2, &size, &st2);
+ }
+ return 1;
+}
+
+
+static Instruction *repeatheadfail (lua_State *L, int l1, int n) {
+ /* e; ...; e; L2: e'(L1); jump L2; L1: ... */
+ int i;
+ Instruction *p = newpatt(L, (n + 1)*l1 + 1);