---

# 3. Analysis of Context-free Languages
**[Emil Sekerinski](http://www.cas.mcmaster.ca/~emil/), McMaster University, February 2024**

---

> This notebook contains [type hints](https://www.python.org/dev/peps/pep-0484/) that allow type-checking with [mypy](http://mypy-lang.org/). See also this [introduction](https://www.python.org/dev/peps/pep-0483/), the Python [typing](https://docs.python.org/3/library/typing.html) library, and this [cheat sheet](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html). The [nb_mypy](https://pypi.org/project/nb-mypy/) notebook extension type-checks notebook cells with mypy as they are executed. The extension can be installed by `python3 -m pip install nb_mypy`, which also installs mypy, and then has to be enabled by running the line magic below. 

In [None]:
%load_ext nb_mypy

### Pushdown Automata

Context-free languages can contain nested structures, e.g.

    S → a S c | b

Recognizing this language requires matching an unbounded number of `a` symbols with the same number of `c` symbols, which finite state automata cannot do.

Context-free languages can be recognized by _pushdown automata_ that operate on a stack: transitions can push on and pop from the stack. The size of the stack is not bounded.

A pushdown automaton `P = (T, S, R, s₀)` is specified by

- a finite set `T` of *input symbols*,
- a finite set `S` of *stack symbols*,
- a finite set `R` of *transitions*,
- an _initial stack symbol_ `s₀ ∈ S ∪ {ε}`,

where `T` is also called the _vocabulary_ and each transition is a triple with a sequence `σ ∈ S*`, an input symbol `t ∈ T ∪ {ε}`, and a sequence `σ' ∈ S*`, written:

	  σ t → σ'

The pushdown automaton starts with just `s₀` on the stack. A transition `σ t → σ'` can be taken if the top of the stack is `σ` and `t` can be consumed from the input: the transition pops `σ` from the stack and pushes `σ'` on the stack. The input string is accepted if some sequence of transitions consumes all input symbols and leaves the stack empty; otherwise, it is rejected.

<div style="float:right;border-left:4em solid white">

| transition | stack | input  |
|:-----------|------:|-------:|
| ` `        | `ε`   | `aabbba` |
| (1)        | `a`   | `abbba`  |
| (1)        | `aa`   | `bbba`  |
| (4)        | `a`   | `bba`   |
| (4)        | `ε`   | `ba`   |
| (3)        | `b`   | `a`    |
| (2)        | `ε`   | `ε`    |

</div>

For example, an automaton accepting sequences over `T = {a, b}` with the same number of occurrences of `a` and `b`, formally `{τ ∈ T* | a#τ = b#τ}`, is `P = (T, S, R, s₀)` where `S = {a, b}`, `s₀ = ε`, and the transitions `R` are:

  `ε a → a` <span style="float:right">(1)</span>  
  `b a → ε` <span style="float:right">(2)</span>  
  `ε b → b` <span style="float:right">(3)</span>  
  `a b → ε` <span style="float:right">(4)</span>

Here, the stack symbols coincide with the input symbols. The input `aabbba` is accepted by `P` as in the table. By convention, the top of the stack is to the left.
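
The behaviour of such an automaton can be explored with a small simulator. The following is a minimal sketch and not part of the original notes: the name `pda_accepts` and the representation of transitions as triples of strings (with `''` for `ε` and single-character stack symbols) are assumptions made for illustration. It searches all reachable configurations, so it copes with nondeterminism, but it may not terminate if transitions can grow the stack without consuming input.

```python
from typing import List, Set, Tuple

Transition = Tuple[str, str, str]  # (σ, t, σ'): pop σ, consume t, push σ'; '' stands for ε

def pda_accepts(R: List[Transition], s0: str, inp: str) -> bool:
    """Search over configurations (stack, remaining input); the top of the stack is leftmost."""
    todo: List[Tuple[str, str]] = [(s0, inp)]
    seen: Set[Tuple[str, str]] = set()
    while todo:
        stack, rest = todo.pop()
        if (stack, rest) in seen: continue
        seen.add((stack, rest))
        if stack == '' and rest == '': return True  # input consumed, stack empty
        for pop, t, push in R:
            if stack.startswith(pop) and rest.startswith(t):
                todo.append((push + stack[len(pop):], rest[len(t):]))
    return False

R = [('', 'a', 'a'), ('b', 'a', ''), ('', 'b', 'b'), ('a', 'b', '')]  # transitions (1) to (4)
print(pda_accepts(R, '', 'aabbba'))  # True
print(pda_accepts(R, '', 'aab'))     # False
```

Keeping the stack as a string with the top to the left mirrors the convention of the table.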

**Question.** What is a pushdown automaton for accepting the language generated by `S → a S c | b`?

<div style="float:right;border-left:4em solid white">

| transition | stack | input   |
|:-----------|------:|:--------|
| ` `        | `s`   | `aabcc` |
| (1)        | `s.`  | `abcc`  |
| (1)        | `s..` | `bcc`   |
| (2)        | `..`  | `cc`    |
| (3)        | `.`   | `c`     |
| (3)        | `ε`   | `ε`     |

</div>

_Answer._ The automaton is `P = (T, S, R, s₀)` where `T = {a, b, c}`, `S = {s, .}`, `s₀ = s`, and the transitions `R` are:

  `s a → s.` <span style="float:right">(1)</span>  
  `s b → ε` <span style="float:right">(2)</span>  
  `. c → ε` <span style="float:right">(3)</span>

The input `aabcc` is accepted as in the table.

**Exercise:** What is the pushdown automaton for accepting palindromes over `{a, b, c}`, i.e. sequences that read backward the same as forward?

For every finite state automaton, an equivalent pushdown automaton can be constructed. For example, the finite state automaton accepting `E₁ = ab|ac` is `A₁ = (T, Q, R, q₀, F)` with `T = {a, b, c}`, `Q = {0, 1, 2, 3, 4}`, `q₀ = 0`, `F = {3, 4}`, and transitions `R`:

    0 a → 1
    0 a → 2
    1 b → 3
    2 c → 4

The equivalent pushdown automaton `P₁ = (T, S, R, s₀)` has the same vocabulary `T`, has `S = {0, 1, 2}`, `s₀ = 0`, and has transitions `R`:

  `0 a → 1` <span style="float:right">(1)</span>  
  `0 a → 2` <span style="float:right">(2)</span>  
  `1 b → ε` <span style="float:right">(3)</span>  
  `2 c → ε` <span style="float:right">(4)</span>

That is, the stack is initialized with the initial state of `A₁`, and the transitions of `P₁` are those of `A₁`, except that a transition to a final state in `A₁` pops the state from the stack in `P₁` instead of pushing the final state.
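
This construction can be sketched in a few lines. The function name `fsa_to_pda` and the encoding of states as strings are assumptions for illustration; the sketch also assumes, as is the case for `A₁`, that final states have no outgoing transitions and that the initial state is not final. A general construction would need more care.

```python
from typing import List, Set, Tuple

def fsa_to_pda(R: List[Tuple[str, str, str]], F: Set[str]) -> List[Tuple[str, str, str]]:
    """Map each FSA transition q t → r to a PDA transition; '' stands for ε."""
    # A transition into a final state pops the current state instead of pushing r.
    return [(q, t, '' if r in F else r) for q, t, r in R]

R1 = [('0', 'a', '1'), ('0', 'a', '2'), ('1', 'b', '3'), ('2', 'c', '4')]
for q, t, r in fsa_to_pda(R1, {'3', '4'}):
    print(q, t, '→', r or 'ε')
```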

**Question.** What are the steps to accept `ab` with `P₁`?

*Answer.*

| transition | stack | input  |
|:-----------|------:|:-------|
| ` `        | `0`   | `ab`   |
| (1)        | `1`   | `b`    |
| (3)        | `ε`   | `ε`    |

For every context-free grammar, an equivalent pushdown automaton can be constructed and vice versa.

As with finite state automata, pushdown automata can be deterministic or nondeterministic. Unlike finite state automata, it is generally impossible to make a pushdown automaton deterministic and run in linear time. The best one can achieve in general is to accept in approximately `n³` time, where `n` is the length of the input.

For accepting in linear time, restrictions on the languages must be imposed and, therefore, on the grammars generating these languages. There are different ways of constructing a pushdown automaton given a grammar, each with different restrictions on the grammar. Since our ultimate goal is to determine the meaning of a sentence through its parse tree and not just to accept it, we have to be careful with grammar modifications to suit the construction of the pushdown automaton.

### Top-down and Bottom-up Parsing

<div style="float:right;border-left:4em solid white">

| step   | stack     | input         |
|:-------|----------:|:--------------|
|        | `E`       | `x × (y + z)` |
| P (1)  | `T`       | `x × (y + z)` |
| P (4)  | `T × F`   | `x × (y + z)` |
| P (3)  | `F × F`   | `x × (y + z)` |
| P (5)  | `id × F`  | `x × (y + z)` |
| M (7)  | `× F`     | `× (y + z)`   |
| M (9)  | `F`       | `(y + z)`     |
| P (6)  | `(E)`     | `(y + z)`     |
| M (10) | `E)`      | `y + z)`      |
| P (2)  | `E + T)`  | `y + z)`      |
| P (1)  | `T + T)`  | `y + z)`      |
| P (3)  | `F + T)`  | `y + z)`      |
| P (5)  | `id + T)` | `y + z)`      |
| M (7)  | `+ T)`    | `+ z)`        |
| M (8)  | `T)`      | `z)`          |
| P (3)  | `F)`      | `z)`          |
| P (5)  | `id)`     | `z)`          |
| M (7)  | `)`       | `)`           |
| M (11) | ` `       | ` `           |

</div>

_Top-down parsing_ starts to build the parse tree with the start symbol as the goal, which is split into subgoals for each non-terminal according to the grammar rules until the terminals match the input.

Consider parsing the sentence `x × (y + z)` with grammar `G₂`:

    E → T | E + T
    T → F | T × F
    F → id | ( E )

The equivalent top-down pushdown automaton `P₂ = (T, S, R, s₀)` has the same vocabulary `T = {+, ×, id, (, )}`, has stack symbols `S = {E, T, F, +, ×, id, (, )}`, `s₀ = E`, and has transitions `R`:

  `E ε → T` <span style="float:right">(1)</span>  
  `E ε → E + T` <span style="float:right">(2)</span>  
  `T ε → F` <span style="float:right">(3)</span>  
  `T ε → T × F` <span style="float:right">(4)</span>  
  `F ε → id` <span style="float:right">(5)</span>  
  `F ε → ( E )` <span style="float:right">(6)</span>  
  `id id → ε` <span style="float:right">(7)</span>  
  `+ + → ε` <span style="float:right">(8)</span>  
  `× × → ε` <span style="float:right">(9)</span>  
  `( ( → ε` <span style="float:right">(10)</span>  
  `) ) → ε` <span style="float:right">(11)</span>

A top-down parser takes *produce (expand) steps* for transitions (1) to (6) and *match steps* for transitions (7) to (11).
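
The transitions of a top-down automaton follow mechanically from the grammar: one produce transition per alternative and one match transition per terminal. The sketch below illustrates this; the name `top_down_transitions`, the dictionary encoding of the grammar, and the use of `None` for an `ε` input are assumptions made here, not part of the notes.

```python
from typing import Dict, List, Optional, Tuple

Symbols = Tuple[str, ...]
Transition = Tuple[Symbols, Optional[str], Symbols]  # (σ, t, σ'); None stands for ε input

def top_down_transitions(G: Dict[str, List[List[str]]], terminals: List[str]) -> List[Transition]:
    # Produce steps: replace a nonterminal on top of the stack by one of its alternatives.
    produce: List[Transition] = [((A,), None, tuple(rhs)) for A, alts in G.items() for rhs in alts]
    # Match steps: pop a terminal from the stack while consuming the same input symbol.
    match: List[Transition] = [((t,), t, ()) for t in terminals]
    return produce + match

G2 = {'E': [['T'], ['E', '+', 'T']],
      'T': [['F'], ['T', '×', 'F']],
      'F': [['id'], ['(', 'E', ')']]}
for n, (pop, t, push) in enumerate(top_down_transitions(G2, ['id', '+', '×', '(', ')']), 1):
    print(f"({n})", ' '.join(pop), t or 'ε', '→', ' '.join(push) or 'ε')
```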

Because of the direct correspondence between a context-free grammar and its top-down pushdown automaton, we omit from now on the explicit definition of the automaton.

<div style="float:right;border-left:4em solid white">

| step   | stack         | input           |
|:-------|--------------:|:----------------|
|        |               | `x × ( y + z )` |
| S (7)  | `x`           | `× ( y + z )`   |
| R (5)  | `F`           | `× ( y + z )`   |
| R (3)  | `T`           | `× ( y + z )`   |
| S (9)  | `× T`         | `( y + z )`     |
| S (10) | `( × T`       | `y + z )`       |
| S (7)  | `y ( × T`     | `+ z )`         |
| R (5)  | `F ( × T`     | `+ z )`         |
| R (3)  | `T ( × T`     | `+ z )`         |
| R (1)  | `E ( × T`     | `+ z )`         |
| S (8)  | `+ E ( × T`   | `z )`           |
| S (7)  | `z + E ( × T` | `)`             |
| R (5)  | `F + E ( × T` | `)`             |
| R (3)  | `T + E ( × T` | `)`             |
| R (2)  | `E ( × T`     | `)`             |
| S (11) | `) E ( × T`   |                 |
| R (6)  | `F × T`       |                 |
| R (4)  | `T`           |                 |
| R (1)  | `E`           |                 |

</div>

_Bottom-up parsing_ proceeds without a specific goal; the parse tree grows from bottom to top; the input is accepted if it is reduced to the start symbol by two kinds of steps:

- Shift steps shift the next input symbol on the stack.
- Reduce steps reduce a sequence of symbols on the stack according to a transition.

Consider parsing the sentence `x × (y + z)` with grammar `G₂`:

    E → T | E + T
    T → F | T × F
    F → id | ( E )

The equivalent bottom-up pushdown automaton `P₂' = (T, S, R, s₀)` has the same vocabulary `T = {+, ×, id, (, )}`, has stack symbols `S = {E, T, F, +, ×, id, (, )}`, `s₀ = ε`, and has transitions `R`:

  `T ε → E` <span style="float:right">(1)</span>  
  `T + E ε → E` <span style="float:right">(2)</span>  
  `F ε → T` <span style="float:right">(3)</span>  
  `F × T ε → T` <span style="float:right">(4)</span>  
  `id ε → F` <span style="float:right">(5)</span>  
  `) E ( ε → F` <span style="float:right">(6)</span>  
  `ε id → id` <span style="float:right">(7)</span>  
  `ε + → +` <span style="float:right">(8)</span>  
  `ε × → ×` <span style="float:right">(9)</span>  
  `ε ( → (` <span style="float:right">(10)</span>  
  `ε ) → )` <span style="float:right">(11)</span>

The parser takes *shift steps* for transitions (7) to (11) and *reduce steps* for transitions (1) to (6).

As in pushdown automata, an input is accepted if the stack is empty; this can be achieved by adding one more transition, `E ε → ε`.
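
A sequence of shift and reduce steps can be replayed mechanically to check that it reduces the input to the start symbol. The following sketch does this for `id × ( id + id )`; the name `replay`, the table `REDUCE`, and the encoding of steps (`'S'` for a shift, an integer for a reduce transition) are assumptions for illustration. The stack is a list with its top at index 0, matching the convention that the top is to the left.

```python
from typing import Dict, List, Tuple, Union

# Reduce transitions (1) to (6): symbols popped (top first) and the symbol pushed.
REDUCE: Dict[int, Tuple[List[str], str]] = {
    1: (['T'], 'E'), 2: (['T', '+', 'E'], 'E'),
    3: (['F'], 'T'), 4: (['F', '×', 'T'], 'T'),
    5: (['id'], 'F'), 6: ([')', 'E', '('], 'F')}

def replay(tokens: List[str], steps: List[Union[str, int]]) -> Tuple[List[str], List[str]]:
    stack: List[str] = []
    rest = list(tokens)
    for step in steps:
        if isinstance(step, str):  # shift: move the next input symbol onto the stack
            stack.insert(0, rest.pop(0))
        else:                      # reduce: replace the top symbols by a nonterminal
            pop, push = REDUCE[step]
            if stack[:len(pop)] != pop: raise ValueError("cannot reduce by " + str(step))
            stack[:len(pop)] = [push]
    return stack, rest

steps: List[Union[str, int]] = ['S', 5, 3, 'S', 'S', 'S', 5, 3, 1, 'S', 'S', 5, 3, 2, 'S', 6, 4, 1]
print(replay(['id', '×', '(', 'id', '+', 'id', ')'], steps))  # (['E'], [])
```

The last two steps reduce `F × T` to `T` by transition (4) and `T` to `E` by transition (1).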

### Conditions for Top-down Parsing

<div style="float:right;border-left:2em solid white">

| step | stack | input  |
|:-----|------:|:-------|
| ` `  | `S`   | `xxxz` |
| P    | `A`   | `xxxz` |
| P    | `xA`  | `xxxz` |
| M    | `A`   | `xxz`  |
| P    | `xA`  | `xxz`  |
| M    | `A`   | `xz`   |
| P    | `xA`  | `xz`   |
| M    | `A`   | `z`    |
    
</div>

We continue with top-down parsing, also called _predictive parsing_ since we have to predict which production to expand at each P step. The key to deterministic parsing is to select the right production. For this, we only allow the parser to select a production with _one symbol lookahead_.

Consider parsing `xxxz` in grammar `G₃`:

    S → A | B
    A → x A | y
    B → x B | z

After an unfortunate initial P step, we get stuck. One would need to look ahead to the last input symbol to prevent that. However, with this grammar there is no bound on the number of symbols one would need to look ahead in general.

The required restrictions on the grammar are expressed in terms of the _first_ and _follow sets_.

The set `first(ω)` is the set of all terminals that can appear in the first position of sentences derived from `ω`:

    first(ω) = {t ∈ T | ω ⇒* t ν, for some ν ∈ V*}
	
The set `follow(ω)` is the set of all terminal symbols that may follow `ω` in any sentence:

    follow(ω) = {t ∈ T | S ⇒* μ ω t ν, for some μ, ν ∈ V*}

For example, in `G₃` with `S → A | B`, `A → x A | y`, `B → x B | z` we have:

- `first(A) = {x, y}`, `first(B) = {x, z}`, `first(S) = {x, y, z}`, `first(xA) = {x}`
- `follow(x) = {x, y, z}`, `follow(xA) = {}`, `follow(A) = {}`
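
The sets can be computed by iterating until nothing changes. The sketch below uses a standard fixpoint computation, which the notes do not spell out; the name `first_follow` and the dictionary encoding of grammars are assumptions. For brevity, it computes `first` and `follow` only for single nonterminals, not for arbitrary sequences `ω`, and does not add an end-of-input marker.

```python
from typing import Dict, List, Set, Tuple

Grammar = Dict[str, List[List[str]]]  # nonterminal -> alternatives; [] stands for ε

def first_follow(G: Grammar) -> Tuple[Set[str], Dict[str, Set[str]], Dict[str, Set[str]]]:
    nullable: Set[str] = set()
    first: Dict[str, Set[str]] = {A: set() for A in G}
    follow: Dict[str, Set[str]] = {A: set() for A in G}

    def first_seq(seq: List[str]) -> Tuple[Set[str], bool]:
        """First set of a sequence and whether the whole sequence is nullable."""
        result: Set[str] = set()
        for X in seq:
            if X not in G: return result | {X}, False  # X is a terminal
            result |= first[X]
            if X not in nullable: return result, False
        return result, True

    changed = True
    while changed:  # repeat until no set grows any more
        changed = False
        for A, alts in G.items():
            for rhs in alts:
                f, n = first_seq(rhs)
                if not f <= first[A]: first[A] |= f; changed = True
                if n and A not in nullable: nullable.add(A); changed = True
                for i, X in enumerate(rhs):
                    if X in G:  # what may follow nonterminal X in this alternative
                        f, n = first_seq(rhs[i + 1:])
                        new = f | (follow[A] if n else set())
                        if not new <= follow[X]: follow[X] |= new; changed = True
    return nullable, first, follow

G3 = {'S': [['A'], ['B']], 'A': [['x', 'A'], ['y']], 'B': [['x', 'B'], ['z']]}
nullable, first, follow = first_follow(G3)
print(first['A'], first['B'], first['S'], follow['A'])
```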

Two conditions are required to ensure that a deterministic, one-symbol lookahead top-down parser can be constructed.

**Condition 1.** If `A` is defined by the production

    A → χ₁ | χ₂ | …

then the initial symbols of all sentences that can be derived from all `χᵢ` must be distinct, i.e.:

  `first(χᵢ) ∩ first(χⱼ) = {}`  for all  `i ≠ j`

For example, for `G₃` with  `S → A | B`, `A → x A | y`, `B → x B | z` we have that `first(A) = {x, y}` and `first(B) = {x, z}`. Production `S → A | B` does not satisfy Condition 1. An equivalent grammar with production `S → x S | y | z` satisfies the condition.
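
Given the first sets of the alternatives, Condition 1 amounts to a pairwise disjointness test. A minimal sketch, with the function name `condition1` chosen here for illustration and the first sets entered by hand:

```python
from itertools import combinations
from typing import List, Set

def condition1(firsts: List[Set[str]]) -> bool:
    """True if the first sets of the alternatives are pairwise disjoint."""
    return all(not (f & g) for f, g in combinations(firsts, 2))

print(condition1([{'x', 'y'}, {'x', 'z'}]))  # S → A | B in G₃: False
print(condition1([{'x'}, {'y'}, {'z'}]))     # S → x S | y | z: True
```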

<div style="float:right;border-left:2em solid white">

| step | stack | input |
|:-----|------:|:------|
| ` `  | `S`   | `x`   |
| P    | `Ax`  | `x`   |
| P    | `xx`  | `x`   |
| M    | `x`   |       |

</div>

Consider parsing `x` in grammar `G₄`; we again may get stuck:

    S → A x
    A → x | ε

**Condition 2.** For every nonterminal `A` from which the empty sequence can be derived, the set of initial symbols must be disjoint from the set of symbols that may follow any sequence generated from `A`:

  `first(A) ∩ follow(A) = {}`  for all `A` such that `A ⇒* ε`

If `A ⇒* ε`, then `A` is called _nullable_.

For example, in `G₄` with `S → A x`, `A → x | ε` we have:

- `first(A) = {x} = follow(A)`
- `A ⇒* ε`

Hence, Condition 2 is violated.
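
Condition 2 can be checked in the same style once the nullable nonterminals and the first and follow sets are known. In this sketch, the name `condition2_violations` is an assumption, and the sets for `G₄` are entered by hand rather than computed:

```python
from typing import Dict, List, Set

def condition2_violations(nullable: Set[str], first: Dict[str, Set[str]],
                          follow: Dict[str, Set[str]]) -> List[str]:
    """Nullable nonterminals whose first and follow sets overlap."""
    return sorted(A for A in nullable if first[A] & follow[A])

# G₄: S → A x, A → x | ε
print(condition2_violations({'A'}, {'S': {'x'}, 'A': {'x'}}, {'S': set(), 'A': {'x'}}))  # ['A']
```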

### Recursive Descent Parsing

The appeal of top-down parsing is that an acceptor can be directly expressed in a programming language with mutually recursive procedures by a parsing technique known as _recursive descent_. There is no need to represent the stack of the pushdown automaton explicitly; the stack of the programming language is sufficient. Recursive descent parsing assumes that the grammar is in EBNF.

For each production `p`, a parsing procedure `pr(p)` is constructed. For production `B → E`, the name of the procedure is `B`, and its body is `pr(E)`, a parser recognizing EBNF expression `E`:

| `p`             | `pr(p)`                      |
|:----------------|:-----------------------------|
| `B → E`         | `procedure B()` <br> `pr(E)` |

Assume that procedure `next` reads and assigns the next input symbol to global variable `sym`. The rules for constructing `pr(E)` for recognizing `E` are:

| `E`             | `pr(E)`                             |
|:----------------|:------------------------------------|
| `'a'`           | `if sym = 'a' then next else error` |
| `B`             | `B()`                               |
| `(E₁)`          | `pr(E₁)`                            |
| `[E₁]`          | `if sym ∈ first(E₁) then pr(E₁)`    |
| `{E₁}`          | `while sym ∈ first(E₁) do pr(E₁)`   |
| `E₁ E₂ …`       | `pr(E₁) ; pr(E₂) ; …`               |
| `E₁ │ E₂ │ …`   | `case sym of`<br> `first(E₁): pr(E₁)`<br> `first(E₂): pr(E₂)`<br> `…`<br> `otherwise error` |

The procedure of the start symbol has to be called for recognizing a sentence of the language.
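
To illustrate the rules for terminals, `[E₁]`, `{E₁}`, and sequences, here is a sketch of a recognizer for the hypothetical production `S → 'a' [ 'b' ] { 'c' }`, which does not occur elsewhere in these notes. The name `accepts_S` and the choice to return a Boolean instead of raising an error are assumptions for illustration.

```python
END = chr(0)  # end-of-input symbol, assumed not to occur in the input

def accepts_S(src: str) -> bool:
    """Recognizer for the hypothetical production S → 'a' [ 'b' ] { 'c' }."""
    src, pos = src + END, 0
    if src[pos] == 'a': pos += 1     # 'a':    if sym = 'a' then next else error
    else: return False
    if src[pos] == 'b': pos += 1     # ['b']:  if sym ∈ first('b') then pr('b')
    while src[pos] == 'c': pos += 1  # {'c'}:  while sym ∈ first('c') do pr('c')
    return pos == len(src) - 1       # all input must have been consumed

print(accepts_S("abcc"), accepts_S("acc"), accepts_S("abb"))  # True True False
```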

For example, consider grammar `G₅`:

    A → a A c | b

Condition 1 requires `first(a A c) ∩ first(b) = {}`, which holds. Condition 2 applies only to nullable nonterminals, but `A` is not nullable; both conditions are satisfied. Applying the rules above to `A` results in:

    procedure A()
        case sym of
            'a': next; A() ; if sym = 'c' then next else error
            'b': next
        otherwise error

Above, we have already applied several simplifications. Generally useful transformations are:  

| Parser          | Simplified Parser                   |
|:----------------|:------------------------------------|
|<code>case sym of<br>    L: if sym ∈ L then S else error<br>    …</code>|<code>case sym of<br>    L: S<br>    …</code> |
|<code>while sym ∈ L do<br>    if sym ∈ L then S else error</code>|<code>while sym ∈ L do<br>    S</code>|
|<code>case sym of<br>    L1: S1<br>    L2: S2<br>    …<br> otherwise error</code>|<code>if sym ∈ L1 then S1<br> else if sym ∈ L2 then S2<br> …<br> else error<br> </code>|

The implementation below of the parser for `G₅` consists of a single recursive parsing procedure `A`. The input is a sequence of characters stored in the global variable `src`. An index to the next symbol is maintained. An end-of-input symbol that does not occur otherwise is appended to the input. If that symbol is encountered prematurely, the parser exits from the recursion without consuming further input symbols. After returning from the recursion, there is a test that all input symbols have been processed:

In [4]:
src: str; pos: int; sym: str

def nxt():
    global pos, sym
    if pos < len(src): sym, pos = src[pos], pos + 1
    else: sym = chr(0) # end of input symbol

def A(): # A → a A c | b
    if sym == 'a':
        nxt(); A();
        if sym == 'c': nxt()
        else: raise Exception("'c' expected at " + str(pos))
    elif sym == 'b': nxt()
    else: raise Exception("'a' or 'b' expected at " + str(pos))

def parse(s: str):
    global src, pos;
    src, pos = s, 0; nxt(); A()
    if sym != chr(0): raise Exception("unexpected characters at " + str(pos))

parse("aaabccc")

Python (since version 3.10) has a `match` statement, a generalized `case` statement that can match a sequence against patterns:

In [5]:
for s in (['a', 'b', 'c'], ['b', 'c', 'd', 'e'], ['b'], ['a', 'b', 'c', 'd'], 'abc'):
    match s:
        case ['a', second, third]: print('second =', second, ' third =', third) # matches exactly 3
        case ['b', *tail]: print('tail = ', tail) # matches 1 or more
        case _: print('default case')

second = b  third = c
tail =  ['c', 'd', 'e']
tail =  []
default case
default case


Patterns can be used for matching the first symbol of alternatives. In Python, sequence patterns do not match strings, only sequences like lists. Here, the source string is first converted to a list:

In [7]:
src: list

def A(): # A → a A c | b
    global src
    match src:
        case ['a', *src]:
            A()
            match src:
                case ['c', *src]: pass
                case _: raise Exception("'c' expected")
        case ['b', *src]: pass
        case _: raise Exception("'a' or 'b' expected")

def parse(s: str):
    global src; src = list(s); A()
    if src != []: raise Exception("unexpected characters")

parse("aabcc")

The pattern `case ['a', *src]` overwrites global variable `src` with the tail of `src` after matching. A variation is to pass `src` as a parameter to each parsing procedure and have that return the tail. This style of *functional parsing* avoids global variables altogether: 

In [8]:
def A(src: list): # A → a A c | b
    match src:
        case ['a', *src]:
            match A(src):
                case ['c', *src]: return src
                case _: raise Exception("'c' expected")
        case ['b', *src]: return src
        case _: raise Exception("'a' or 'b' expected")

def parse(s: str):
    src = A(list(s))
    if src != []: print(src); raise Exception("unexpected characters")

parse("aabcc")

A sentence is parsed if the parsing procedure of the start symbol returns the empty sequence. While this style of parsing is common in purely functional languages, we continue with imperative parsing in Python: strings do not have to be converted to lists, and it is easier to keep the position of the next input in global variables to produce error messages. 

Now consider grammar `G₆` with:

    A → A a | b

Condition 1 for recursive descent parsing requires that `first(A a) ∩ first(b) = {}`:

    first(A a) = {b}
    first(b) = {b}

Hence, a recursive descent parser cannot be constructed. In general, a recursive descent parser cannot be constructed for any grammar with left-recursive productions.
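
The problem also shows up operationally. In the sketch below, the production `A → A a | b` is translated literally, trying the alternative `A a` first; the name `A_naive` and the position-passing style are assumptions for illustration. The procedure calls itself at the same input position without consuming a symbol, so the recursion never ends until Python aborts it.

```python
def A_naive(src: str, pos: int) -> int:  # A → A a | b, translated literally
    # Both alternatives start with 'b', so one symbol of lookahead cannot choose;
    # trying A a first calls A_naive again at the same position.
    pos = A_naive(src, pos)
    if pos < len(src) and src[pos] == 'a': return pos + 1
    raise SyntaxError("'a' expected at " + str(pos))

try:
    A_naive("baa", 0)
except RecursionError:
    print("infinite recursion")
```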

However, any grammar with left-recursive productions can be rewritten into an equivalent grammar with right-recursive productions. Grammar `G₆` can be rewritten with EBNF without recursion at all:

    A → b {a}
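
The right-recursive alternative can be parsed by recursive descent as well. As a sketch, assume the rewriting `A → b A'`, `A' → a A' | ε`, where `A'` is a new nonterminal introduced here for illustration; the procedure names `A_rr` and `A_rest` are likewise hypothetical. Each procedure takes the current position and returns the position after the part it recognized.

```python
def A_rr(src: str, pos: int) -> int:    # A → b A'
    if pos < len(src) and src[pos] == 'b': return A_rest(src, pos + 1)
    raise SyntaxError("'b' expected at " + str(pos))

def A_rest(src: str, pos: int) -> int:  # A' → a A' | ε
    if pos < len(src) and src[pos] == 'a': return A_rest(src, pos + 1)
    return pos  # ε alternative: consume nothing

def accepts_rr(src: str) -> bool:
    try: return A_rr(src, 0) == len(src)  # accepted only if all input is consumed
    except SyntaxError: return False

print(accepts_rr("baa"), accepts_rr("ab"))  # True False
```

The EBNF form avoids the extra nonterminal and replaces the recursion with a loop.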

For an EBNF grammar, the two conditions for recursive descent parsing are rephrased as follows:

| `E`           | `condition(E)`                               |
|:--------------|:---------------------------------------------|
| `[E₁]`        | `first(E₁) ∩ follow(E) = {}`                 |
| `{E₁}`        | `first(E₁) ∩ follow(E) = {}`                 |
| `E₁ E₂ …`     | `first(Eᵢ) ∩ first(Eᵢ₊₁ Eᵢ₊₂ …) = {}` for any nullable `Eᵢ`, provided `Eᵢ₊₁ Eᵢ₊₂ …` is not nullable |
| `E₁ │ E₂ │ …` | `first(Eᵢ) ∩ first(Eⱼ) = {}` for all `i ≠ j` |

For the production `A → b {a}`, the conditions are:

1. `first(b) ∩ first({a}) = {}` if `b` is nullable; as `b` is not nullable, the condition holds,
2. `first(a) ∩ follow({a}) = {}`, which holds as `first(a) = {a}` and `follow({a}) = follow(A) = {}`.

As both conditions hold, a parser can be constructed:

    procedure A()
        if sym = 'b' then next else error
        while sym = 'a' do next

Here is an implementation in Python:

In [9]:
src: str; pos: int; sym: str

def nxt():
    global pos, sym
    if pos < len(src): sym, pos = src[pos], pos+1
    else: sym = chr(0) # end-of-input symbol

def A(): # A → b {a}
    if sym == 'b': nxt()
    else: raise Exception("'b' expected at " + str(pos))
    while sym == 'a': nxt()

def parse(s: str):
    global src, pos;
    src, pos = s, 0; nxt(); A()
    if sym != chr(0): raise Exception("unexpected characters at " + str(pos))

parse("baa")

Let us construct a parser for regular expressions. The symbols are characters:

    expression  →  term { '|' term }
    term  →  factor { factor }
    factor  →  atom [ '*' | '+' | '?' ]
    atom  →  plainchar | escapedchar | '(' expression ')'
    plainchar  →  ' ' | '!' | '"' | '#' | '$' | '%' | '&' | '\'' | ',' | '-' | '.' | '/' |
         '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | ';' | '<' | '=' | '>' | 
         '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' |
         'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | '[' | ']' | '^' | '_' |
         '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' |
         'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | '{' | '}' | '~'
    escapedchar  → '\' ( '(' | ')' | '*' | '+' | '?' | '\' | '|' )


We need to check the conditions for recursive descent parsing:

- For `term { '|' term }`:   `term` is not nullable, condition holds.
- For `{ '|' term }`:  `first('|' term) ∩ follow({ '|' term }) = {}`, condition holds.
- For `'|' term`:  `'|'` is not nullable, condition holds.
- For `factor { factor }`:  `factor` is not nullable, condition holds.
- For `{ factor }`:  `first(factor) ∩ follow({ factor }) = {}`, condition holds.
- For `atom [ '*' | '+' | '?' ]`:  `atom` is not nullable, condition holds.
- For `[ '*' | '+' | '?' ]`:  `first('*' | '+' | '?') ∩ follow([ '*' | '+' | '?' ]) = {}`, condition holds.
- For `'*' | '+' | '?'`:  `first` of any two of `'*'`, `'+'`, `'?'` are disjoint, conditions hold.
- For `plainchar | escapedchar | '(' expression ')'`:  `first` of any two of `plainchar`, `escapedchar`, `'(' expression ')'` are disjoint, conditions hold.
- For `'(' expression ')'`: neither `(` nor `expression` is nullable, conditions hold.
- For `' ' | '!' | '"' | '#' | '$' | '%' | ...`:  `first` of any two are disjoint, conditions hold.
- For `'\' ( '(' | ')' | '*' | '+' | '?' | '\' | '|' )`:  `'\'` is not nullable, condition holds.
- For `'(' | ')' | '*' | '+' | '?' | '\' | '|'`:  `first` of any two of `'('`, `')'`, `'*'`, `'+'`, `'?'`, `'\'`, `'|'` are disjoint, conditions hold.

As the symbols are characters, the implementation uses strings rather than sets of characters for the `first` sets.

In [10]:
PlainChars = ' !"#$%&\',-./0123456789:;<=>@ABCDEFGHIJKLMNO' + \
                       'PQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{}~'
EscapedChars = '()*+?\\|'
FirstFactor = PlainChars + '\\('

src: str; pos: int; sym: str

def nxt():
    global pos, sym
    if pos < len(src): sym, pos = src[pos], pos+1
    else: sym = chr(0) # end-of-input symbol

def expression(): # expression → term { '|' term }
    term()
    while sym == '|': nxt(); term()

def term(): # term → factor { factor } 
    factor()
    while sym in FirstFactor: factor()

def factor(): # factor → atom [ '*' | '+' | '?' ]
    atom()
    if sym in '*+?': nxt()

def atom(): # atom → plainchar | escapedchar | '(' expression ')'
    if sym in PlainChars: nxt()
    elif sym == '\\':
        nxt()
        if sym in EscapedChars: nxt()
        else: raise Exception("invalid escaped character at " + str(pos))
    elif sym == '(':
        nxt(); expression()
        if sym == ')': nxt()
        else: raise Exception("')' expected at " + str(pos))
    else: raise Exception("invalid character at " + str(pos))

def parse(s: str):
    global src, pos;
    src, pos = s, 0; nxt(); expression()
    if sym != chr(0): raise Exception("unexpected character at " + str(pos))

#parse("a\\$") # Exception: invalid escaped character at 3
#parse("a(b") # Exception: ')' expected at 3
#parse("a(" + chr(5) + ")") # invalid character at 3
#parse("a" + chr(5)) # unexpected character at 2
parse("(a*)*abcc")