---

# 8. Generalized Parsing
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, March 2024**

---

#### General Context-free Parsing

[Earley's parser](https://doi-org.libaccess.lib.mcmaster.ca/10.1145/362007.362035) works with an arbitrary context-free grammar without backtracking. If the grammar is unambiguous, it produces a parse tree in quadratic time; if the grammar is ambiguous, it produces all parse trees in cubic time (in the length of the input). For most “practical” grammars, it produces a parse tree in linear time.

We assume that the start symbol `S` appears only on the left-hand side of one rule, `S → π`; if that is not the case, a rule `S' → S` with a new start symbol `S'` can be added. Earley's parser is a top-down parser that constructs all possible derivations simultaneously: starting with `S`, nonterminals are eagerly expanded according to all possible productions, rather than just a single production.

Let `P` be the set of productions and let the input be given by `x₁, …, xₙ`. Assume `xₙ₊₁ = $`, where `$` is a symbol that does not occur anywhere in the grammar. For each position of the input, a set `sᵢ` of *Earley items* is maintained. An (Earley) item is a grammar rule with the right-hand side split, visualized by `•`, together with an index into the input string. An item `(A → σ • ω, j)` at position `i` means that recognition of `A` was started at input position `j + 1` and that, up to position `i`, the input `xⱼ₊₁…xᵢ` can be derived from `σ`, formally `σ ⇒* xⱼ₊₁…xᵢ`. At each position `i`, the algorithm adds items to `sᵢ` in *predict* and *complete* steps and to `sᵢ₊₁` in *match* steps. The algorithm iterates over all items at one position. Since items are being added during the iteration, a set, `v`, of visited items is maintained.

```
s₀ := {(S → • π, 0)}; for i = 1 to n do sᵢ := {}
for i = 0 to n do
    v := {}
    while v ≠ sᵢ do
        e :∈ sᵢ - v; v := v ∪ {e}
        case e of
            (A → σ • a ω, j) and a = xᵢ₊₁:        -- match (M)
                sᵢ₊₁ := sᵢ₊₁ ∪ {(A → σ a • ω, j)} 
            (A → σ • B ω, j):                            -- predict (P)
                for B → μ ∈ P do
                    sᵢ := sᵢ ∪ {(B → • μ, i)} 
            (A → σ •, j):                                   -- complete (C)
                for (B → μ • A ξ, k) ∈ sⱼ do
                    sᵢ := sᵢ ∪ {(B → μ A • ξ, k)}
accept := (S → π •, 0) ∈ sₙ
```

<div style="float:left">

Consider the grammar:

    E → T | E + T
    T → F | T × F
    F → a

The input `a + a × a` is accepted as `(S → E •, 0) ∈ s₅`.

Lines in bold correspond to the derivation.
</div>

<div style="float:right">

|            | item                | step      |
|:-----------|:--------------------|:----------|
| `s₀`:      | `S → • E, 0`        |           |
| `(x₁ = a)` | `E → • T, 0`        | P         |
|            | `E → • E + T, 0`    | P         |
|            | `T → • F, 0`        | P         |
|            | `T → • T × F, 0`    | P         | 
|            | `F → • a, 0`        | P         |
| `s₁`:      | **`F → a •, 0`**        | **M at `0`**  |
| `(x₂ = +)` | **`T → F •, 0`**        | **C**         |
|            | **`E → T •, 0`**        | **C**         |
|            | `T → T • × F, 0`    | C         |
|            | `S → E •, 0`        | C         |
|            | `E → E • + T, 0`    | C         |
| `s₂`:      | `E → E + • T, 0`    | M at `1`  |
| `(x₃ = a)` | `T → • T × F, 2`    | P         |
|            | `T → • F, 2`        | P         |
|            | `F → • a, 2`        | P         |
| `s₃`:      | **`F → a •, 2`**        | **M at `2`**  |
| `(x₄ = ×)` | **`T → F •, 2`**        | **C**         |
|            | `E → E + T •, 0`    | C         |
|            | `T → T • × F, 2`    | C         |
|            | `S → E •, 0`        | C         |
|            | `E → E • + T, 0`    | C         |
| `s₄`:      | `T → T × • F, 2`    | M at `3`  |
| `(x₅ = a)` | `F → • a, 4`        | P         |
| `s₅`:      | **`F → a •, 4`**        | **M at `4`**  |
| `(x₆ = $)` | **`T → T × F •, 2`**    | **C**         |
|            | **`E → E + T •, 0`**    | **C**         |
|            | `T → T • × F, 2`    | C         |
|            | **`S → E •, 0`**        | **C**         |
|            | `E → E • + T, 0`    | C         |

</div>

The Python implementation below assumes that each terminal and nonterminal is a single character, the grammar is represented by a tuple of productions, and each production is a string of the form `A→τ` where `A` is a nonterminal. The first production, `g[0]` in the implementation, defines the start symbol. Since in Python strings are indexed starting from `0`, an extra character, `^`, is prepended to the input. The sequence `a ω` in the algorithm corresponds to `τ` in the implementation and `A ξ` corresponds to `ν`. 

In [7]:
def parse(g: "grammar", x: "input"):
    global s
    n = len(x); x = '^' + x + '$'; S, π = g[0][0], g[0][2:]
    s = [{(S, '', π, 0)}] + [set() for _ in range(n)]; print('   s[ 0 ]:', S, '→ •', π, ', 0')
    for i in range(n + 1):
        v = set() # visited items
        while v != s[i]:
            e = (s[i] - v).pop(); v.add(e) # pick an arbitrary unvisited item
            A, σ, τ, j = e
            if len(τ) > 0 and τ[0] == x[i + 1]: # match, a == τ[0]
                f = (A, σ + τ[0], τ[1:], j)
                s[i + 1].add(f); print('M  s[', i + 1, ']:', f[0], '→', f[1], '•', f[2], ',', f[3])
            elif len(τ) > 0: # predict, B == τ[0]
                for f in ((r[0], '', r[2:], i) for r in g if r[0] == τ[0]):
                    s[i].add(f); print('P  s[', i, ']:', f[0], '→', f[1], '•', f[2], ',', f[3])
            else: # complete, len(τ) == 0
                for f in ((B, μ + ν[0], ν[1:], k) for (B, μ, ν, k) in s[j] if len(ν) > 0 and ν[0] == A):
                    s[i].add(f); print('C  s[', i, ']:', f[0], '→', f[1], '•', f[2], ',', f[3])
    return (S, π, '', 0) in s[n]

In [8]:
G1 = ("S→E", "E→a", "E→E+E")

In [9]:
parse(G1, "a+a+a")

   s[ 0 ]: S → • E , 0
P  s[ 0 ]: E →  • a , 0
P  s[ 0 ]: E →  • E+E , 0
M  s[ 1 ]: E → a •  , 0
P  s[ 0 ]: E →  • a , 0
P  s[ 0 ]: E →  • E+E , 0
C  s[ 1 ]: E → E • +E , 0
C  s[ 1 ]: S → E •  , 0
M  s[ 2 ]: E → E+ • E , 0
P  s[ 2 ]: E →  • a , 2
P  s[ 2 ]: E →  • E+E , 2
P  s[ 2 ]: E →  • a , 2
P  s[ 2 ]: E →  • E+E , 2
M  s[ 3 ]: E → a •  , 2
C  s[ 3 ]: E → E+E •  , 0
C  s[ 3 ]: E → E • +E , 2
C  s[ 3 ]: E → E • +E , 0
C  s[ 3 ]: S → E •  , 0
M  s[ 4 ]: E → E+ • E , 0
M  s[ 4 ]: E → E+ • E , 2
P  s[ 4 ]: E →  • a , 4
P  s[ 4 ]: E →  • E+E , 4
P  s[ 4 ]: E →  • a , 4
P  s[ 4 ]: E →  • E+E , 4
P  s[ 4 ]: E →  • a , 4
P  s[ 4 ]: E →  • E+E , 4
M  s[ 5 ]: E → a •  , 4
C  s[ 5 ]: E → E • +E , 4
C  s[ 5 ]: E → E+E •  , 0
C  s[ 5 ]: E → E+E •  , 2
C  s[ 5 ]: E → E • +E , 0
C  s[ 5 ]: S → E •  , 0
C  s[ 5 ]: E → E+E •  , 0
C  s[ 5 ]: E → E • +E , 2


True

In [4]:
grammar = ("S→E", "E→T", "E→E+T", "T→F", "T→T×F", "F→a")

In [5]:
parse(grammar, "a+a×a")

True

The `print` statements in `parse` "animate" the algorithm; they can be commented out to suppress the trace. The resulting sets of items can also be observed through the global variable `s`:

In [6]:
s

[{('E', '', 'E+T', 0),
  ('E', '', 'T', 0),
  ('F', '', 'a', 0),
  ('S', '', 'E', 0),
  ('T', '', 'F', 0),
  ('T', '', 'T×F', 0)},
 {('E', 'E', '+T', 0),
  ('E', 'T', '', 0),
  ('F', 'a', '', 0),
  ('S', 'E', '', 0),
  ('T', 'F', '', 0),
  ('T', 'T', '×F', 0)},
 {('E', 'E+', 'T', 0),
  ('F', '', 'a', 2),
  ('T', '', 'F', 2),
  ('T', '', 'T×F', 2)},
 {('E', 'E', '+T', 0),
  ('E', 'E+T', '', 0),
  ('F', 'a', '', 2),
  ('S', 'E', '', 0),
  ('T', 'F', '', 2),
  ('T', 'T', '×F', 2)},
 {('F', '', 'a', 4), ('T', 'T×', 'F', 2)},
 {('E', 'E', '+T', 0),
  ('E', 'E+T', '', 0),
  ('F', 'a', '', 4),
  ('S', 'E', '', 0),
  ('T', 'T', '×F', 2),
  ('T', 'T×F', '', 2)}]
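
The tuples in `s` can be rendered in the dotted notation used by the algorithm. The following is a minimal sketch, assuming items are tuples `(A, σ, τ, j)` as produced by `parse`; the helper name `show` is hypothetical and not part of the implementation above:

```python
def show(item):
    """Render an item tuple (A, σ, τ, j) as 'A → σ • τ, j'."""
    A, σ, τ, j = item
    # Empty σ or τ is left out so that the dot sits directly next to the arrow or the comma
    parts = [A, "→"] + ([σ] if σ else []) + ["•"] + ([τ] if τ else [])
    return " ".join(parts) + f", {j}"

# Example: the state s[4] from the run above
s4 = {('F', '', 'a', 4), ('T', 'T×', 'F', 2)}
for item in sorted(s4):
    print(show(item))
```

Sorting the set only serves to make the printed order deterministic.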

For efficiency, instead of using a set of items for an Earley state, a list can be used, with a marker separating the items that have been visited from those that still need to be visited.
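
The list-based representation can be sketched as follows. This is an illustration under the same assumptions as `parse` (single-character symbols, productions of the form `A→τ`) and additionally assumes that there are no empty productions; the names `recognize`, `add`, and `m` are hypothetical. The index `m` is the marker: `s[i][:m]` has been visited and `s[i][m:]` is still to be visited.

```python
def recognize(g, x):
    """Earley recognizer in which each state s[i] is a list with a marker m."""
    n = len(x); x = '^' + x + '$'
    S, π = g[0][0], g[0][2:]
    s = [[(S, '', π, 0)]] + [[] for _ in range(n)]

    def add(i, f):
        # Append only new items; the membership test is a linear scan here,
        # a faster variant might keep an auxiliary set next to each list
        if f not in s[i]:
            s[i].append(f)

    for i in range(n + 1):
        m = 0  # marker: items before m are visited
        while m < len(s[i]):
            A, σ, τ, j = s[i][m]; m += 1
            if τ and τ[0] == x[i + 1]:  # match
                add(i + 1, (A, σ + τ[0], τ[1:], j))
            elif τ:  # predict
                for r in g:
                    if r[0] == τ[0]:
                        add(i, (r[0], '', r[2:], i))
            else:  # complete
                for (B, μ, ν, k) in list(s[j]):
                    if ν and ν[0] == A:
                        add(i, (B, μ + ν[0], ν[1:], k))
    return (S, π, '', 0) in s[n]

grammar = ("S→E", "E→T", "E→E+T", "T→F", "T→T×F", "F→a")
print(recognize(grammar, "a+a×a"), recognize(grammar, "a+"))
```

No separate set of visited items is needed, since the loop ends when the marker reaches the end of the list.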

The number of items in `sᵢ` is proportional to `i` in the worst case. Matching and predicting need at most `i` steps for `sᵢ`, but completing may need `i²` steps, as adding an item may cause a previous set to be revisited. Summing `i²` for `i` from `0` to `n` gives a value proportional to `n³`, thus Earley's parser needs on the order of `n³` steps in the worst case.
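
The cubic growth of the sum can be checked numerically. The sketch below compares the sum with the well-known closed form `n(n + 1)(2n + 1) / 6` and prints its ratio to `n³`; the function name `sum_of_squares` is chosen only for this illustration.

```python
def sum_of_squares(n: int) -> int:
    """Sum of i² for i from 0 to n."""
    return sum(i * i for i in range(n + 1))

for n in (10, 100, 1000):
    total = sum_of_squares(n)
    assert total == n * (n + 1) * (2 * n + 1) // 6  # closed form
    print(n, total, total / n**3)  # the ratio approaches 1/3 as n grows
```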

#### Parsing Expression Grammars and Packrat Parsing

The productions of context-free grammars are *generative*. [Parsing expression grammars](https://bford.info/pub/lang/peg/) are an alternative to context-free grammars that describe how practical parsers *recognize:*
- They do not allow nondeterminism in productions: choices are resolved in the grammar using *prioritized choice* and *greedy operators*.
- *Syntactic predicates* allow certain non-context-free languages to be expressed.
- Parsing allows unlimited lookahead and backtracking but still runs in time linear in the length of the input.
- Parsers can be implemented by recursive descent and are simple enough to be written by hand.

Productions are written as `A ← E`, and `E` is a parsing expression. Parsing according to a parsing expression may *succeed* or *fail:*

| expression            | name                | meaning   |
|:----------------|:-----------------------------|:------|
| `'ε'`         |empty string | succeed without consuming |
| `'a'`         |literal string | consume `a` literally, otherwise fail |
| `B`             | nonterminal `B` | consume `B`, otherwise fail |
| `(E)`          | grouping | consume `E`, otherwise fail |
| `E?`          | optional  | consume `E` if possible |
| `E*`          | zero-or-more | consume `E` as often as possible |
| `E+`          | one-or-more | consume `E` once, otherwise fail, and then as often as possible |
| `&E`          | and-predicate | match `E` and do not consume, otherwise fail  |
| `!E`          | not-predicate | match anything but `E`  and do not consume, otherwise fail |
| `E₁ E₂`    | sequence | consume `E₁`, then `E₂`, otherwise fail |
| `E₁ / E₂` | prioritized choice | consume `E₁`, otherwise consume `E₂`, otherwise fail |

In EBNF, 

  `A → a b | a`    and    `A → a | a b`

are equivalent. The PEG rules

  `A ← a b / a`    and    `A ← a / a b`

are different: the second rule will never match `a b` as the first alternative is given priority. For example, the EBNF production

    statement → "if" expression "then" statement ["else" statement]
    
allows `if E then if F then S else T` to be parsed as either `if E then (if F then S) else T` or as `if E then (if F then S else T)`, known as the *dangling else* problem. In EBNF, this is resolved informally or by complicating the grammar. In PEG, a prioritized choice or optional expression resolves the ambiguity:

    statement ← "if" expression "then" statement "else" statement / "if" expression "then" statement
    statement ← "if" expression "then" statement ("else" statement)?

That is, `E?` is *greedy*: `E` must be consumed if possible; it is a shorthand for `E / ε`.
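
The effect of the greedy option on the dangling else can be sketched with a small recognizer that also builds a tree. To keep the example short, it assumes single-character tokens that are not part of the text above: `i` for `if`, `c` for the condition, `t` for `then`, `e` for `else`, and `s` for a simple statement. The rule is `statement ← 'i' 'c' 't' statement ('e' statement)? / 's'`.

```python
def statement(src: str, k: int):
    """statement ← 'i' 'c' 't' statement ('e' statement)? / 's'
    Returns (next index, tree), or None on failure."""
    if src.startswith('ict', k):
        r = statement(src, k + 3)
        if r is not None:
            k1, then_part = r
            if src.startswith('e', k1):  # greedy option: take the else part if possible
                r2 = statement(src, k1 + 1)
                if r2 is not None:
                    k2, else_part = r2
                    return k2, ('if', then_part, 'else', else_part)
            return k1, ('if', then_part)  # option not taken, nothing consumed by it
    if src.startswith('s', k):  # second alternative of the prioritized choice
        return k + 1, 's'
    return None

# "if c then if c then s else s": the else is attached to the inner if
print(statement('ictictses', 0))
```

The inner call consumes the `e` before the outer call gets a chance to, which corresponds to the reading `if E then (if F then S else T)`.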

Consider the definition of symbols in terms of characters:

    operator → '<' '=' | '<' | '=' | '<' '<'

The sentence `<<=` is an ambiguous sequence of symbols. This is informally resolved by applying the *longest match rule*. The longest match can be expressed in PEG:

    operator ← '<' '=' / '<' '<' / '<' / '='
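
The prioritized choice for `operator` can be sketched in Python by trying the alternatives in the order given. The function names `operator` and `operators` are chosen for this illustration; `operators` repeatedly applies `operator` to split a string into symbols.

```python
def operator(src: str, k: int):
    """operator ← '<' '=' / '<' '<' / '<' / '='
    Returns the index after the operator, or None on failure."""
    for alt in ('<=', '<<', '<', '='):  # longer alternatives come before their prefixes
        if src.startswith(alt, k):
            return k + len(alt)
    return None

def operators(src: str):
    """Split src into operators, stopping at the first position where none matches."""
    k, result = 0, []
    while k < len(src):
        r = operator(src, k)
        if r is None:
            break
        result.append(src[k:r]); k = r
    return result

print(operators('<<='))
```

If `'<'` were listed first, it would always win over `'<' '='` and `'<' '<'`, and the longer operators would never be recognized.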
    
The regular expression `a* a` matches a non-empty sequence of `a`s. The PEG expression `a* a` will not match any sequence of `a`s as `a*` matches the whole sequence, and the final `a` cannot be matched. Greedy repetition is equivalent to recursion with prioritized choice:

  `A ← E*`    is equivalent to    `A ← E A / ε`  
  `A ← E+`    is equivalent to    `A ← E E*`

For example,

  `('0' / '1')*`    matches    <code><u>110</u>+10</code>
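
Prioritized choice and greedy repetition can be illustrated with a few parser combinators. This is a minimal sketch, not the construction used later in this section: the names `lit`, `seq`, `choice`, and `star` are hypothetical, and each parser is a function that takes the source and a start index and returns the index after the match, or `None` on failure.

```python
def lit(a):
    """Literal string a."""
    return lambda src, k: k + len(a) if src.startswith(a, k) else None

def seq(*es):
    """Sequence E₁ E₂ …: fails as soon as one part fails."""
    def p(src, k):
        for e in es:
            k = e(src, k)
            if k is None:
                return None
        return k
    return p

def choice(*es):
    """Prioritized choice E₁ / E₂ / …: the first alternative that succeeds wins."""
    def p(src, k):
        for e in es:
            r = e(src, k)
            if r is not None:
                return r
        return None
    return p

def star(e):
    """Greedy repetition E*: consume E as often as possible, never fails."""
    def p(src, k):
        while True:
            r = e(src, k)
            if r is None or r == k:  # stop on failure or when no input is consumed
                return k
            k = r
    return p

bits = star(choice(lit('0'), lit('1')))
print(bits('110+10', 0))                        # index after the prefix 110
print(seq(star(lit('a')), lit('a'))('aaa', 0))  # a* a fails, as a* consumes every a
```

The same combinators show that the order of alternatives matters: `choice(seq(lit('a'), lit('b')), lit('a'))` consumes both characters of `ab`, while `choice(lit('a'), seq(lit('a'), lit('b')))` consumes only the `a`.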

The and-predicate and not-predicate match but do not consume their operands. For example,

  `'f' 'o' 'o' 'd' &('i' 'e')`    matches    <code><u>food</u>ie</code>    and fails on    `foodchain`  
  `'f' 'o' 'o' 'd' !('i' 'e')`    fails on     `foodie`    and matches   <code><u>food</u>chain</code>
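
The two predicate examples can be written out by hand as follows. The function names `food_and` and `food_not` are chosen for this sketch; each returns the index after the consumed input, or `None` on failure.

```python
def food_and(src: str, k: int = 0):
    """'f' 'o' 'o' 'd' &('i' 'e')"""
    if not src.startswith('food', k):
        return None
    k += 4
    # and-predicate: 'ie' must follow, but k is not advanced past it
    return k if src.startswith('ie', k) else None

def food_not(src: str, k: int = 0):
    """'f' 'o' 'o' 'd' !('i' 'e')"""
    if not src.startswith('food', k):
        return None
    k += 4
    # not-predicate: 'ie' must not follow, and again nothing is consumed
    return None if src.startswith('ie', k) else k

print(food_and('foodie'), food_and('foodchain'))
print(food_not('foodie'), food_not('foodchain'))
```

In both successful cases the returned index is `4`, the position right after `food`.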

A parsing expression grammar `G = (T, N, P, S)` is specified by 
- a finite set `T` of *terminal symbols*,
- a finite set `N` of *nonterminal symbols*,
- a finite set `P` of *productions*,
- a symbol `S ∈ N`, the *start symbol*

where `N ∩ T = {}` and `V = T ∪ N` is its *vocabulary*. Productions are  pairs, written `A ← E`, where `A` is a nonterminal and `E` is a parsing expression.

The language accepted by `G` is the language accepted by a parser for `G`. A *packrat* parser consists of a set of mutually recursive parsing procedures. A parsing procedure for nonterminal `A` takes a parameter, `k`, at which parsing the input should start and returns either the index to where `A` was recognized or `Fail`:

| `p`             | `pr(p)`                      |
|:----------------|:-----------------------------|
| `B ← E`         | `procedure B(k: integer) → integer \| Fail` <br> `pr(E)` <br> `return k` |

The input `src` is a global variable. The rules for constructing `pr(E)` for recognizing `E` starting at position `k` in `src` are:

| `E`             | `pr(E)`                             |
|:----------------|:------------------------------------|
| `'a'`           | `if prefix('a', src[k:]) then k := k + len('a') else k := Fail` |
| `B`             | `k := B(k)`                               |
| `(E₁)`          | `pr(E₁)`                            |
| `E₁?`          | `var r  := k` <br> `pr(E₁)` <br> `if k = Fail then k := r`    |
| `E₁*`          | `var r` <br> `while k ≠ Fail do` <br>  `r := k ; pr(E₁)` <br> `k := r`   |
| `&E₁`          |  `var r := k` <br> `pr(E₁)` <br> `if k ≠ Fail then k := r`   |
| `!E₁`          |  `var r := k` <br> `pr(E₁)` <br> `if k = Fail then k := r else k := Fail`   |
| `E₁ E₂  …`       | `pr(E₁)` <br>`if k ≠ Fail then pr(E₂)` <br>`…`       |
| `E₁ / E₂ /  …`   | `var r := k` <br> `pr(E₁)` <br> `if k = Fail then` <br>  `k := r ; pr(E₂)` <br>  `…`  |

The procedure of the start symbol has to be called for recognizing a sentence of the language.
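
As an illustration, the rules can be applied step by step to a production that is not part of the text, `A ← 'x'* 'y'? !'z'`. In this sketch `Fail` is represented by `None`, the source is passed as a parameter rather than kept in a global variable, and the helper `literal` is an assumption made for the example.

```python
def literal(src: str, k: int, a: str):
    """pr('a'): index after a if src continues with a at k, otherwise None."""
    return k + len(a) if src.startswith(a, k) else None

def A(src: str, k: int):
    """A ← 'x'* 'y'? !'z'"""
    # pr('x'*): repeat until 'x' fails, then restore the last good position
    r = k
    while k is not None:
        r = k; k = literal(src, k, 'x')
    k = r
    # sequence: 'x'* never fails, so the next part is always attempted
    # pr('y'?): on failure, restore the position
    r = k; k = literal(src, k, 'y')
    if k is None:
        k = r
    # pr(!'z'): succeed without consuming only if 'z' does not follow
    r = k; k = literal(src, k, 'z')
    k = r if k is None else None
    return k

print(A('xxy', 0), A('xxz', 0))
```

The loop for `'x'*` terminates because a successful `literal` call always advances `k`; a repeated expression that can succeed without consuming input would need an extra check.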

Consider the following EBNF grammar; it cannot be parsed by recursive descent with `k` symbols lookahead for any `k`:

    S → A | B
    A → x A | y
    B → x B | z

The equivalent PEG `P` is:

    S ← A / B
    A ← x A / y
    B ← x B / z

The corresponding packrat parser is expressed as a Python class with the source as a field; a failing parsing procedure returns `None`:

In [10]:
class PBacktrack:
    src: str

    def literal(self, k : int, a: str):
        if self.src.startswith(a, k): return k + len(a) # else return None

    def S(self, k: int):
        r = k; k = self.A(k)
        if k is None: k = self.B(r)
        return k

    def A(self, k: int):
        r = k; k = self.literal(k, 'x')
        if k is not None: k = self.A(k)
        if k is None: k = self.literal(r, 'y')
        return k

    def B(self, k: int):
        r = k; k = self.literal(k, 'x')
        if k is not None: k = self.B(k)
        if k is None: k = self.literal(r, 'z')
        return k

    def parse(self, s: str):
        self.src = s; return self.S(0)

In [11]:
P = PBacktrack()
assert P.parse('') is None and P.parse('x') is None and P.parse('y') == 1 and P.parse('z') == 1 and P.parse('xy') == 2 and P.parse('xz') == 2 and P.parse('xxxxxxxxz') == 9

Parsing is greedy and stops when no further input can be recognized, even if there is more input:

In [13]:
assert P.parse('xxyx') == 3