**XLR8R: A lex/syntax assistant**

This notebook presents $XLR8R$, a tool for building tokenizers and parsers. The name is a mnemonic for "accelerator". Its purpose is to extract structure from files with unknown formats or grammars so that the data can be analyzed further.

The tool works as a feedback-control loop. Previously seen patterns are applied to the input data. As these patterns are recognized in context, new patterns emerge from what has already been seen. The developer makes small, incremental adjustments by incorporating these new patterns back into the set of seen patterns, and the process repeats until all data is properly organized.

For example, let's bootstrap a calculator-expression parser by repeatedly feeding back into the system the output patterns produced from matching properly formed samples of these expressions. With each cycle, the parser should become more complete and accurate.

In [None]:
!pip install SNOBOL4python==0.4.5
import sys
from pprint import pprint, pformat
## Thirty one (31) flavors of patterns to choose from ...
from SNOBOL4python import ε, σ, π, λ, Λ, ζ, θ, Θ, φ, Φ, α, ω
from SNOBOL4python import ABORT, ANY, ARB, ARBNO, BAL, BREAK, BREAKX, FAIL
from SNOBOL4python import FENCE, LEN, MARB, MARBNO, NOTANY, POS, REM, RPOS
from SNOBOL4python import RTAB, SPAN, SUCCEED, TAB
# Miscellaneous
from SNOBOL4python import GLOBALS, TRACE, PATTERN, Ϩ, STRING
from SNOBOL4python import ALPHABET, DIGITS, UCASE, LCASE, NULL
from SNOBOL4python import nPush, nInc, nPop, Shift, Reduce, Pop
# Instantiate the global variable space
GLOBALS(globals())

To ensure an abundance of properly formed expressions, use an expression generator, either exhaustive or random. For our purpose, a random expression generator works nicely. The function below, rand_expression, returns a well-formed expression of random length and randomized content.

In [None]:
import random
def rand_item():
    r = random.randint(1, 100)
    if   r <=  13: return "x"
    elif r <=  26: return "y"
    elif r <=  39: return "z"
    elif r <=  80: return str(random.randint(0, 16))
    elif r <= 100: return "(" + rand_term() + ")"

def rand_element():
    r = random.randint(1, 100)
    if   r <=  90: return rand_item()
    elif r <=  95: return '+' + rand_element()
    elif r <= 100: return '-' + rand_element()

def rand_factor():
    r = random.randint(1, 100)
    if   r <=  70: return rand_element()
    elif r <=  85: return rand_element() + '*' + rand_factor()
    elif r <= 100: return rand_element() + '/' + rand_factor()

def rand_term():
    r = random.randint(1, 100)
    if   r <=  70: return rand_factor()
    elif r <=  85: return rand_factor() + '+' + rand_term()
    elif r <= 100: return rand_factor() + '-' + rand_term()

def rand_expression():
    return rand_term()
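
A table-driven restatement of the same generator can make the weights easier to audit and the runs reproducible. The sketch below is an alternative under stated assumptions, not the notebook's own code: `RULES`, `expand`, and the `"<int>"` placeholder are hypothetical names, and the weights are copied from the thresholds above.

```python
import random

# Hypothetical grammar table: rule name -> list of (weight, production).
# Symbols that are not rule names are emitted literally.
RULES = {
    "item":    [(13, ["x"]), (13, ["y"]), (13, ["z"]),
                (41, ["<int>"]), (20, ["(", "term", ")"])],
    "element": [(90, ["item"]), (5, ["+", "element"]), (5, ["-", "element"])],
    "factor":  [(70, ["element"]), (15, ["element", "*", "factor"]),
                (15, ["element", "/", "factor"])],
    "term":    [(70, ["factor"]), (15, ["factor", "+", "term"]),
                (15, ["factor", "-", "term"])],
}

def expand(symbol, rng):
    """Recursively expand a grammar symbol into a string."""
    if symbol == "<int>":
        return str(rng.randint(0, 16))       # same integer range as rand_item
    if symbol not in RULES:
        return symbol                        # literal character
    weights, productions = zip(*RULES[symbol])
    chosen = rng.choices(productions, weights=weights, k=1)[0]
    return "".join(expand(s, rng) for s in chosen)

rng = random.Random(42)                      # private, seeded generator
samples = [expand("term", rng) for _ in range(5)]
for s in samples:
    print(s)
```

Seeding a private `random.Random` instance keeps the samples repeatable between runs without touching the global generator.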

One of the very first steps is to determine the usage of characters within the unknown input, i.e. the set of characters and frequency of use.

In [None]:
chars = {}
for _ in range(0, 1024):
    for ch in rand_expression():
        if ch not in chars:
            chars[ch] = 1
        else: chars[ch] += 1
pprint(chars)
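
The same tally can be written with `collections.Counter` from the standard library. This is a minimal sketch; the `samples` list is a hypothetical stand-in for the output of rand_expression so that the block runs on its own.

```python
from collections import Counter

# Hypothetical fixed samples standing in for rand_expression() output.
samples = ["x+12*(y-3)", "-z/4", "7", "(x)*(y)+16"]

chars = Counter()
for s in samples:
    chars.update(s)          # counts every character of the sample

# Most frequent first, which makes the alphabet and its skew easy to read.
for ch, n in chars.most_common():
    print(repr(ch), n)
```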

Another first step is to discover possible tokens. Begin by looking at sequences of traditional classes of characters like digits, upper-case letters, lower-case letters, and special character sequences as operator symbols.

In [None]:
token_statistics = \
    ( POS(0)
    + ARBNO(
        ( SPAN(DIGITS)
        | SPAN(UCASE)
        | SPAN(LCASE)
        | SPAN("+-*/")
        | LEN(1)
        ) % "token"
      + λ(f'''if token not in tokens: tokens[token] = 0''')
      + λ(f'''tokens[token] += 1''')
      )
    + RPOS(0)
    )
tokens = {}
for _ in range(0, 128):
    rand_expression() in token_statistics
pprint(tokens)
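
For comparison, a rough standard-library counterpart of this token survey can be written with `re`. It is only a sketch: the regular expression approximates the character classes used above, it does not use SNOBOL4python, and `TOKEN` and `samples` are hypothetical names.

```python
import re
from collections import Counter

# Digit runs, upper-case runs, lower-case runs, operator runs, else one character.
TOKEN = re.compile(r"[0-9]+|[A-Z]+|[a-z]+|[+\-*/]+|.")

# Hypothetical samples standing in for rand_expression() output.
samples = ["x+12*(y-3)", "-z/4", "(x)*-16"]

tokens = Counter()
for s in samples:
    tokens.update(TOKEN.findall(s))

for tok, n in tokens.most_common():
    print(repr(tok), n)
```

Note that a run such as `*-` is counted as a single candidate token here, which is a hint that operator characters may need to be treated one at a time.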

In this next code snippet, the $XLR8R\_expressions$ function is a simple engine that produces a number of random expressions, analyzes each one with the XLR8R pattern, and prints the results of that analysis.

In [None]:
def XLR8R_expressions():
    count = 0
    while count < 8:
        subject = rand_expression()
        if len(subject) <= 10:
            if matches := subject in XLR8R("P"):
                print(f"{str(matches):5s} {subject:10s}  {P}")
            else: print(f"{str(matches):5s} {subject:10s}")
            count += 1

What follows is the core component of the feedback-control mechanism. The translator function, $XLR8R$, returns a PATTERN that recognizes a sequence of known patterns and records what it recognized. What it records is Python code which, when compiled and executed, creates a PATTERN that matches that specific input. Thus, the XLR8R pattern encodes **proper subsequents** within the context of a particular set of known patterns.

$XLR8R$ produces a PATTERN that matches the string that XLR8R just matched. That PATTERN is a sequence of PATTERNs from the known set of patterns.

The $XLR8R\_expressions$ function then produces a set of varying PATTERNs by iterating over several samples of properly formed input. Each of these varying PATTERNs recognizes properly formed inputs. Thus, the $XLR8R\_expressions$ function encodes the **proper alternates** within the context of a particular set of known patterns.

This $XLR8R$ function will be revised at every iteration of the feedback-control loop by incorporating patterns from the output back into the XLR8R pattern for the next iteration.

In [None]:
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( ANY(LCASE)            + λ(f'''{x}.append("ANY(LCASE)")''')
        | SPAN(DIGITS)          + λ(f'''{x}.append("SPAN(DIGITS)")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()
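
The translation step can also be pictured without SNOBOL4python. The following is a simplified standard-library sketch of the same idea, assuming greedy matching with no backtracking; `KNOWN` and `describe` are hypothetical names, not part of the library.

```python
import re

# Known patterns, tried in order; anything else falls through to a literal.
KNOWN = [
    ("ANY(LCASE)",   re.compile(r"[a-z]")),
    ("SPAN(DIGITS)", re.compile(r"[0-9]+")),
]

def describe(subject):
    """Return a ' + '-joined description of subject in terms of KNOWN patterns."""
    parts, pos = [], 0
    while pos < len(subject):
        for name, rx in KNOWN:
            m = rx.match(subject, pos)
            if m:
                parts.append(name)
                pos = m.end()
                break
        else:
            # Unknown character: emit it as a literal, like the LEN(1) fallback.
            parts.append(f"σ('{subject[pos]}')")
            pos += 1
    return " + ".join(parts)

print(describe("x+12"))   # ANY(LCASE) + σ('+') + SPAN(DIGITS)
```

Each pass of the feedback loop then amounts to adding an entry to the known set and describing the samples again.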

Note from the token statistics that we saw spans of digits and single lowercase letters. So we define a new pattern $V$ to recognize $ANY(LCASE)$ and a pattern $I$ to recognize $SPAN(DIGITS)$, and use both within the XLR8R pattern. Repeatedly rerun XLR8R to generate varying sets of patterns from random samples.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( V                     + λ(f'''{x}.append("V")''')
        | I                     + λ(f'''{x}.append("I")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()
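
When rerunning, it can help to tally how often each emitted pattern string occurs, so that recurring alternatives stand out. The sketch below does this with the standard library only; `shape` is a hypothetical helper that mimics the V, I, and literal output, and `samples` is made-up data.

```python
import re
from collections import Counter

def shape(subject):
    """Describe a subject as a ' + '-joined sequence of I, V, and literals."""
    parts = []
    for tok in re.findall(r"[0-9]+|[a-z]|.", subject):
        if tok.isdigit():
            parts.append("I")                # run of digits
        elif tok.islower():
            parts.append("V")                # single lowercase letter
        else:
            parts.append(f"σ('{tok}')")      # anything else stays literal
    return " + ".join(parts)

# Hypothetical samples standing in for rand_expression() output.
samples = ["x+1", "y+2", "3*z", "-x", "7", "z+16", "4*y"]

shapes = Counter(shape(s) for s in samples)
for pattern, n in shapes.most_common():
    print(n, pattern)
```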

Keep rerunning XLR8R. Now look for some smaller expression-pattern alternatives to inject back into XLR8R. Some choices could be:

* I
* V
* σ('-') + V
* I + σ('*') + V
* V + σ('+') + I

We don't want to choose the bare I or V patterns yet, since these could obscure results; we want higher-level constructs. After confirming over multiple XLR8R runs that the following patterns appear as proper alternatives, each matching a FULL expression, let's choose these four new patterns to inject:

* V + ANY('+-*/') + V
* V + ANY('+-*/') + I
* I + ANY('+-*/') + V
* I + ANY('+-*/') + I

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
X = ( V + ANY('+-*/') + V
    | V + ANY('+-*/') + I
    | I + ANY('+-*/') + V
    | I + ANY('+-*/') + I
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | V                     + λ(f'''{x}.append("V")''')
        | I                     + λ(f'''{x}.append("I")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Keep running XLR8R. Several new smaller patterns are now emerging:

* X + σ('+') + V, X + σ('+') + I
* X + σ('-') + V, X + σ('-') + I
* X + σ('*') + V, X + σ('*') + I
* X + σ('/') + V, X + σ('/') + I

which are equivalent to:

* X + ANY('+-*/') + V
* X + ANY('+-*/') + I

So, let's choose these two new patterns to inject back into XLR8R.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
X = ( V + ANY('+-*/') + V
    | V + ANY('+-*/') + I
    | I + ANY('+-*/') + V
    | I + ANY('+-*/') + I
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | X + ANY('+-*/') + V   + λ(f'''{x}.append("X")''')
        | X + ANY('+-*/') + I   + λ(f'''{x}.append("X")''')
        | V                     + λ(f'''{x}.append("V")''')
        | I                     + λ(f'''{x}.append("I")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Keep running XLR8R. This looks promising, so let's inject these two patterns into pattern X. However, two problems arise when we attempt to do so. First, the two patterns reference X, and while X is being defined the variable X does not yet exist. Second, keeping X on the left of the operator makes the pattern left-recursive: X would invoke itself before consuming any input and spin in an infinite loop. We solve the first problem, the undefined reference, by using the unevaluated pattern expression ζ("X"). We solve the second problem, the infinite loop, by moving X to the right side of the operator, since the two forms likely recognize the same strings in this case.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
X = ( V + ANY('+-*/') + ζ("X")
    | I + ANY('+-*/') + ζ("X")
    | V + ANY('+-*/') + V
    | V + ANY('+-*/') + I
    | I + ANY('+-*/') + V
    | I + ANY('+-*/') + I
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | V                     + λ(f'''{x}.append("V")''')
        | I                     + λ(f'''{x}.append("I")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()
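
The left-versus-right issue is not specific to SNOBOL4python. The plain-Python sketch below, with hypothetical function names, contrasts a right-recursive recognizer with a naively left-recursive one; in Python the runaway recursion surfaces as a `RecursionError` rather than an endless loop.

```python
import re

NUM = re.compile(r"[a-z]|[0-9]+")   # stands in for a letter or a span of digits
OPS = "+-*/"

def match_right(s, pos=0):
    """Right-recursive X := N op X | N. Returns the end position, or None."""
    m = NUM.match(s, pos)
    if m is None:
        return None
    end = m.end()
    if end < len(s) and s[end] in OPS:
        rest = match_right(s, end + 1)   # input was consumed before recursing
        if rest is not None:
            return rest
    return end

def match_left(s, pos=0):
    """Left-recursive X := X op N | N, written naively."""
    left = match_left(s, pos)            # recurses at the same position: no progress
    if left is not None and left < len(s) and s[left] in OPS:
        m = NUM.match(s, left + 1)
        if m:
            return m.end()
    m = NUM.match(s, pos)
    return m.end() if m else None

print(match_right("x+12*y"))             # 6: the whole string is consumed

try:
    match_left("x+12*y")
    left_terminated = True
except RecursionError:
    left_terminated = False
print(left_terminated)                   # False
```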

Keep running XLR8R. It is beginning to appear that, within the grammar, V and I behave the same syntactically. Knowing, of course, that these are calculator expressions, we can probably assume that the lowercase letters also represent numbers, perhaps memory register names. So let's merge them into a new pattern N, for number. Notice also that N alone is a well-formed expression, so we inject N into the X pattern, then simplify and refactor.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
N = V | I
X = ( N + ANY('+-*/') + ζ("X")
    | N
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | N                     + λ(f'''{x}.append("N")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Keep running XLR8R. Notice two new patterns emerging:

* σ('+') + X
* σ('-') + X

So let's inject these into pattern X.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
N = V | I
X = ( N + ANY('+-*/') + ζ("X")
    | ANY('+-') + ζ("X")
    | N
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | N                     + λ(f'''{x}.append("N")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Keep running XLR8R. Notice one new pattern emerging:

* σ('(') + X + σ(')')

So let's inject that into pattern X.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
N = V | I
X = ( N + ANY('+-*/') + ζ("X")
    | ANY('+-') + ζ("X")
    | N
    | σ('(') + ζ("X") + σ(')')
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | N                     + λ(f'''{x}.append("N")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Keep running XLR8R. Notice some very prevalent patterns emerging:

1. X + X
2. X + σ('+') + X
3. X + σ('/') + X

The first pattern seems strange and incorrect, so we will not choose it. However, the second and third indicate that an existing alternative needs extending.

* N + ANY('+-*/') + ζ("X")

should somehow be extended to

* ζ("X") + ANY('+-*/') + ζ("X")

But again, that would cause an infinite loop. So we inject this new emerging pattern by rearranging and refactoring as follows:

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
N = V | I
E = N | σ('(') + ζ("X") + σ(')')
X = ( E + ANY('+-*/') + ζ("X")
    | ANY('+-') + ζ("X")
    | E
    )
def XLR8R(x): return \
    ( POS(0)                    + λ(f'''{x} = []''')
    + ARBNO(
        ( X                     + λ(f'''{x}.append("X")''')
        | E                     + λ(f'''{x}.append("E")''')
        | N                     + λ(f'''{x}.append("N")''')
        | LEN(1) % "tx"         + λ(f'''{x}.append("σ('" + tx + "')")''')
        )
      )
    + RPOS(0)                   + λ(f'''{x} = " + ".join({x})''')
    )
XLR8R_expressions()

Notice that there is now only one pattern being generated for all generated expressions: the pattern X. So it appears that we have a working calculator-expression parser named X, and can begin applying it as needed. With just a few adjustments, the Python code for an expression parser is completed using only four pattern variables.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
E = V | I | σ('(') + ζ('X') + σ(')')
X = ( E + ANY('+-') + ζ("X")
    | E + ANY('*/') + ζ("X")
    | ANY('+-') + ζ('X')
    | E
    )
for _ in range(0, 16):
    subject = rand_expression()
    pprint([subject in POS(0) + X + RPOS(0), subject])
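
To cross-check the shape of the final grammar, here is a hand-written recognizer in plain Python that mirrors E and X. It is a sketch under assumptions: it uses no general backtracking, and `match_e`, `match_x`, and `is_expression` are hypothetical names.

```python
import re

ATOM = re.compile(r"[a-z]|[0-9]+")       # a single letter or a span of digits

def match_e(s, pos):
    """E := V | I | '(' X ')'. Returns the end position, or None."""
    m = ATOM.match(s, pos)
    if m:
        return m.end()
    if pos < len(s) and s[pos] == "(":
        end = match_x(s, pos + 1)
        if end is not None and end < len(s) and s[end] == ")":
            return end + 1
    return None

def match_x(s, pos):
    """X := E op X | sign X | E. Returns the end position, or None."""
    end = match_e(s, pos)
    if end is not None:
        if end < len(s) and s[end] in "+-*/":
            rest = match_x(s, end + 1)
            if rest is not None:
                return rest
        return end                       # fall back to E alone
    if pos < len(s) and s[pos] in "+-":
        return match_x(s, pos + 1)       # unary sign
    return None

def is_expression(s):
    """True when X matches the whole string, like POS(0) + X + RPOS(0)."""
    return match_x(s, 0) == len(s)

for s in ["x+(y*3)", "-(x)", "x+-y", "x+", "(x", "2x"]:
    print(is_expression(s), s)
```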

Now, decorate the pattern to calculate and print results.

In [None]:
V = ANY(LCASE) % "N"      + λ("S.append(int(globals()[N]))")
I = SPAN(DIGITS) % "N"    + λ("S.append(int(N))")
E = ( V
    | I
    | σ('(') + ζ("X") + σ(')')
    )
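# S.pop(-2) removes the left operand and S.pop() the right one,
# so '-' and '/' are applied in the order the operands were written.
X = ( E + σ('+') + ζ("X") + λ("S.append(S.pop(-2) + S.pop())")
    | E + σ('-') + ζ("X") + λ("S.append(S.pop(-2) - S.pop())")
    | E + σ('*') + ζ("X") + λ("S.append(S.pop(-2) * S.pop())")
    | E + σ('/') + ζ("X") + λ("S.append(S.pop(-2) // S.pop())")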
    | σ('+') + ζ("X")
    | σ('-') + ζ("X")     + λ("S.append(-S.pop())")
    | E
    )
C = POS(0) + λ("S = []") + X + λ("print(S.pop())") + RPOS(0)
x = 1; y = 2; z = 3
for s in ["x+y*z", "x+(y*z)", "(x+y)*z"]:
    s in C

You can decorate parts of the pattern with debug-tracing. The following traces everything. Note the PRINT function must return True to allow pattern matching to continue.

In [None]:
def PRINT(s): print(s, end="·"); return True
V = Θ("pos") + Λ(lambda: PRINT(f'V{pos}')) + ANY(LCASE) @ "OUTPUT"
I = Θ("pos") + Λ(lambda: PRINT(f'I{pos}')) + SPAN(DIGITS) @ "OUTPUT"
E = Θ("pos") + Λ(lambda: PRINT(f'E{pos}')) + \
    ( V
    | I
    | Θ("pos") + Λ(lambda: PRINT(f'({pos}')) + σ('(') @ "OUTPUT"
    + ζ("X")
    + Θ("pos") + Λ(lambda: PRINT(f'){pos}')) + σ(')') @ "OUTPUT"
    )
X = Θ("pos") + Λ(lambda: PRINT(f'X{pos}')) + \
    ( E + Θ("pos") + Λ(lambda: PRINT(f'+-*/{pos}')) + ANY('+-*/') @ "OUTPUT" + ζ("X")
    |     Θ("pos") + Λ(lambda: PRINT(f'+-{pos}'))   + ANY('+-')   @ "OUTPUT" + ζ("X")
    | E
    )
for _ in range(0, 3):
    subject = rand_expression()
    pprint([subject in POS(0) + X + RPOS(0), subject])
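
The same kind of trace, a rule name followed by the position at which it was attempted, can be sketched in plain Python with a decorator. This is an analogy under assumptions, not SNOBOL4python code: `traced`, `TRACE_LOG`, and the three matcher functions are hypothetical.

```python
import functools

TRACE_LOG = []

def traced(fn):
    """Record '<rule name><position>' each time a matcher is attempted."""
    @functools.wraps(fn)
    def wrapper(s, pos):
        TRACE_LOG.append(f"{fn.__name__}{pos}")
        return fn(s, pos)
    return wrapper

@traced
def V(s, pos):
    # One lowercase letter.
    return pos + 1 if pos < len(s) and s[pos].islower() else None

@traced
def I(s, pos):
    # A non-empty run of digits.
    end = pos
    while end < len(s) and s[end].isdigit():
        end += 1
    return end if end > pos else None

@traced
def E(s, pos):
    # Try each alternative in order and return the first success.
    for alt in (V, I):
        end = alt(s, pos)
        if end is not None:
            return end
    return None

E("42", 0)
print("·".join(TRACE_LOG))   # E0·V0·I0
```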

Alternatively, you can call TRACE with a level of 10, 20, or 30. Call TRACE(40) to revert to normal error reporting.

In [None]:
V = ANY(LCASE)
I = SPAN(DIGITS)
E = V | I | σ('(') + ζ('X') + σ(')')
X = ( E + ANY('+-') + ζ("X")
    | E + ANY('*/') + ζ("X")
    | ANY('+-') + ζ('X')
    | E
    )
subject = rand_expression()
TRACE(50)
pprint([subject in POS(0) + X + RPOS(0), subject])
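
The numeric levels coincide with the standard level numbers of Python's `logging` module. Whether TRACE maps onto them is an assumption here, not something the text above states; the sketch only lists the standard names for reference.

```python
import logging

# Standard logging level numbers and their names.
levels = {n: logging.getLevelName(n) for n in (10, 20, 30, 40, 50)}
for number, name in levels.items():
    print(number, name)
```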