In [1]:
from IPython.core.display import HTML
with open('../style.css', 'r') as file:
    css = file.read()
HTML(css)

The following cell loads the `mypy` extension for notebooks.  This enables us to check the type annotations of the cells.

In [None]:
%load_ext nb_mypy

# A Parser for Regular Expressions

This notebook implements a parser for regular expressions. The function `parse` parses a regular expression 
according to the following <em style="color:blue">EBNF grammar</em>.
```
   regExp  → product ('+' product)*
   product → factor factor*
   factor  → atom '*'?
   atom    → '(' regExp ')' | CHAR | '𝜀' | '∅'
```
The parse tree is represented as a nested tuple.
- Letters are represented by themselves,
- The character `'∅'` is interpreted as the regular expression $\emptyset$ denoting the empty set; in the parse tree it is represented by the integer `0`,
- The character `'𝜀'` represents the regular expression $\varepsilon$ denoting the empty string,
- $r_1 r_2$ is represented as `(`$r_1$ `, '⋅',` $r_2$ `)`, 
- $r_1 + r_2$ is represented as `(`$r_1$ `, '+',` $r_2$ `)`,
- $r^*$ is represented as `(` $r$ `, '*')`.
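To make the tuple encoding concrete, the following sketch converts a parse tree back into a readable string. The function name `to_string` is hypothetical and not part of the notebook; the sketch assumes that the empty set is stored as the integer `0`, as the function `parseAtom` does further below.

```python
def to_string(tree) -> str:
    """Render a parse tree in the nested-tuple format as a readable string."""
    if tree == 0:                  # the integer 0 encodes the empty set
        return '∅'
    if isinstance(tree, str):      # a letter or '𝜀' stands for itself
        return tree
    if len(tree) == 2:             # (r, '*') is the Kleene star
        return f'({to_string(tree[0])})*'
    left, op, right = tree         # (r1, '⋅', r2) or (r1, '+', r2)
    if op == '+':
        return f'({to_string(left)} + {to_string(right)})'
    return f'{to_string(left)}{to_string(right)}'

# a* followed by (b + ∅)
example = (('a', '*'), '⋅', ('b', '+', 0))
print(to_string(example))  # (a)*(b + ∅)
```

The sketch parenthesizes every sum and every starred expression, so the output is unambiguous but not minimal.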

The parser is implemented as a recursive *top-down* parser.

As we have annotated our functions with types, we need to import `TypeVar` from the module `typing`.

In [2]:
from typing import TypeVar

We start with a definition of the type of the parse trees that are generated.  A parse tree is either
* an integer,
* a string,
* a tuple of parse trees.

Hence, this type is a *recursive type*.  First, we define a type variable.

In [3]:
ParseTree = TypeVar('ParseTree')

Next, we give the recursive definition of this type variable.
The *ellipsis* `...` specifies that the tuple can be of any length.

In [4]:
ParseTree = int | str | tuple[ParseTree, ...]
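The three alternatives of this type can also be checked at runtime. The sketch below is only an illustration: the helper `is_parse_tree` is a hypothetical name, and the alias is written with a string forward reference instead of a type variable, which is an alternative spelling that may or may not be accepted by a given type checker version.

```python
from typing import Union

# Same three alternatives as above; the recursion is a string forward reference.
ParseTree = Union[int, str, tuple["ParseTree", ...]]

def is_parse_tree(x: object) -> bool:
    """Return True if x is an int, a str, or a tuple built from parse trees."""
    if isinstance(x, (int, str)):
        return True
    if isinstance(x, tuple):
        # a tuple of any length qualifies if all of its components do
        return all(is_parse_tree(child) for child in x)
    return False

print(is_parse_tree((('a', '*'), '+', 0)))  # True
print(is_parse_tree(('a', 1.5)))            # False
```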

In order to tokenize strings, we need regular expressions from the module `re`.

In [5]:
import re

The function `tokenize(s)` partitions the string `s` into a list of tokens.
It recognizes 
- the operator symbols `+` and `*`, 
- the parentheses `(`, `)`, 
- single upper or lower case letters, 
- the symbol `∅` for the empty set, 
- the symbol `𝜀` for the empty string.

All whitespace characters (and, indeed, all characters that could not be matched) are discarded.

In [6]:
def tokenize(s: str) -> list[str]:
    regExp = r'''
              [+*()]   |  # operators and parentheses
              [a-zA-Z] |  # single characters from the alphabet
              ∅        |  # empty regular expression
              𝜀           # epsilon
              '''
    return re.findall(regExp, s, flags=re.VERBOSE)

In [7]:
tokenize('a*bc + ba*c + (𝜀+c*) + ∅*')

['a',
 '*',
 'b',
 'c',
 '+',
 'b',
 'a',
 '*',
 'c',
 '+',
 '(',
 '𝜀',
 '+',
 'c',
 '*',
 ')',
 '+',
 '∅',
 '*']
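Because `re.findall` only returns the matches, every character that the pattern does not cover is dropped silently. The following sketch repeats the token pattern under the hypothetical names `TOKEN_PATTERN` and `tokenize_demo` so that it runs on its own, and shows that a digit and a question mark disappear without any error.

```python
import re

# Same alternatives as in tokenize; the names are chosen for this sketch only.
TOKEN_PATTERN = r'''
    [+*()]   |  # operators and parentheses
    [a-zA-Z] |  # single letters
    ∅        |  # empty set
    𝜀           # empty string
'''

def tokenize_demo(s: str) -> list[str]:
    # findall returns only the matched substrings; everything else is skipped
    return re.findall(TOKEN_PATTERN, s, flags=re.VERBOSE)

print(tokenize_demo('a b? + 7c'))  # ['a', 'b', '+', 'c']
```

A consequence of this design is that typing errors in the input are not reported by the tokenizer.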

Below we give forward declarations of some functions that are used later. 
This is necessary, since these functions are mutually recursive.  

As these are only stubs, there is no need to type check their bodies.  
Therefore, we switch off the type checking for the return statements.  
This is done via the *pragma* `# type: ignore`.

In [None]:
def parseRegExp(TokenList: list[str])  -> tuple[ParseTree, list[str]]: 
    return None # type: ignore

def parseProduct(TokenList: list[str]) -> tuple[ParseTree, list[str]]: 
    return None # type: ignore

def parseFactor(TokenList: list[str])  -> tuple[ParseTree, list[str]]: 
    return None # type: ignore

def parseAtom(TokenList: list[str]) -> tuple[str | int | ParseTree, list[str]]:
    return None # type: ignore
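Python itself looks up a function name only when the call is executed, so mutual recursion works at runtime even without stubs. Presumably the forward declarations are there so that each cell can be type checked before the later definitions exist; this is an assumption about how the checker processes cells. The two functions in the following sketch are hypothetical and serve only to illustrate the runtime behaviour.

```python
def is_even(n: int) -> bool:
    # is_odd is not defined yet when this function is created;
    # the name is resolved only when the call below is executed
    return True if n == 0 else is_odd(n - 1)

def is_odd(n: int) -> bool:
    return False if n == 0 else is_even(n - 1)

print(is_even(10))  # True
print(is_odd(4))    # False
```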

The function `parse` takes a string `s` and tries to parse it as a regular expression.  
It returns the parse tree.

In [None]:
def parse(s: str) -> ParseTree:
    TokenList = tokenize(s)
    regExp, Rest = parseRegExp(TokenList)
    assert Rest == [], f'Parse Error: could not parse {TokenList}'
    return regExp

The function `parseRegExp` takes a token list `TokenList` and tries to interpret this list as a regular expression.  
It returns the regular expression in the form of a nested tuple and a list of those tokens that could not be parsed.  
It is implemented as a <em style="color:blue">top-down parser</em>. 

The function `parseRegExp` implements the following grammar rule:
```
regExp → product ('+' product)*
```

In [None]:
def parseRegExp(TokenList: list[str]) -> tuple[ParseTree, list[str]]:
    result, Rest = parseProduct(TokenList)
    while len(Rest) >= 2 and Rest[0] == '+':
        arg, Rest = parseProduct(Rest[1:])
        result = (result, '+', arg)
    return result, Rest
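The `while` loop builds the tree for a sum from left to right, so a sum with several operands is nested to the left. The sketch below isolates this accumulation step with `functools.reduce`; the name `fold_sum` is hypothetical, and the operands are assumed to be already parsed products.

```python
from functools import reduce

def fold_sum(products: list):
    """Combine already parsed products into a left-nested sum."""
    # each step wraps the tree built so far as the left operand of '+'
    return reduce(lambda left, right: (left, '+', right), products)

print(fold_sum(['a', 'b', 'c']))  # (('a', '+', 'b'), '+', 'c')
```

The function `parseProduct` accumulates its factors in the same way, using `'⋅'` instead of `'+'`.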

The function `parseProduct` implements the following grammar rule:
```
product → factor factor*
```

In [None]:
def parseProduct(TokenList: list[str]) -> tuple[ParseTree, list[str]]:
    result, Rest = parseFactor(TokenList)
    while len(Rest) > 0 and Rest[0] not in ["+", ")"]:
        arg, Rest = parseFactor(Rest)
        result = (result, '⋅', arg)
    return result, Rest

The function `parseFactor` implements the following grammar rule:
```
factor → atom '*'?
```

In [None]:
def parseFactor(TokenList: list[str]) -> tuple[ParseTree, list[str]]:
    atom, Rest = parseAtom(TokenList)
    if len(Rest) > 0 and Rest[0] == "*":
        return (atom, '*'), Rest[1:]
    return atom, Rest

The function `parseAtom` implements the following grammar rule:
```
atom  → '(' regExp ')'
      | '∅'            # denotes empty set ∅
      | '𝜀'            # denotes empty string 𝜀
      | CHAR           # any other character denotes itself
```

In [None]:
def parseAtom(TokenList: list[str]) -> tuple[str | int | ParseTree, list[str]]:
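    # an empty token list would otherwise raise an IndexError below
    assert len(TokenList) > 0, 'Parse Error: unexpected end of input'
    if TokenList[0] == '(':
        regExp, Rest = parseRegExp(TokenList[1:])
        assert len(Rest) > 0 and Rest[0] == ")", "Parse Error: missing ')'"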
        return regExp, Rest[1:]
    if TokenList[0] == '∅':
        return 0, TokenList[1:]
    if TokenList[0] == '𝜀':
        return '𝜀', TokenList[1:]
    s = TokenList[0]
    if s not in ['+', '*', '(', ')']:
        return s, TokenList[1:]
    assert False, f'parse error: {TokenList}'

In [None]:
parse('a*+(b*+cd*)*(a+c*)')