Excerpts from The SNOBOL4 Programming Language (2nd edition)\
by R. E. Griswold, J. F. Poage, I. P. Polonsky\
Copyright © Bell Telephone Laboratories, Incorporated, 1971, 1968

SNOBOL4 is a computer programming language containing many features not commonly found in other programming languages. It evolved from SNOBOL [1,2,3], a language for string manipulation, developed at Bell Telephone Laboratories, Incorporated, in 1962. Extensions to SNOBOL through various versions have made it a useful tool in such areas as compilation techniques, machine simulation, symbolic mathematics, text preparation, natural language translation, linguistics, and music analysis.

The new SNOBOL4python package provides a PATTERN data type allowing SNOBOL4-style string pattern matching within a Python program.

In [None]:
!pip install SNOBOL4python==0.4.5
import sys
from pprint import pprint
## Thirty one (31) flavors of patterns to choose from ...
from SNOBOL4python import ε, σ, π, λ, Λ, ζ, θ, Θ, φ, Φ, α, ω
from SNOBOL4python import ABORT, ANY, ARB, ARBNO, BAL, BREAK, BREAKX, FAIL
from SNOBOL4python import FENCE, LEN, MARB, MARBNO, NOTANY, POS, REM, RPOS
from SNOBOL4python import RTAB, SPAN, SUCCESS, TAB
# Miscellaneous
from SNOBOL4python import GLOBALS, TRACE, PATTERN, Ϩ, STRING
from SNOBOL4python import ALPHABET, DIGITS, UCASE, LCASE, NULL
from SNOBOL4python import nPush, nInc, nPop, Shift, Reduce, Pop
# Instantiate the global variable space
GLOBALS(globals())

Collecting SNOBOL4python==0.4.4
  Downloading snobol4python-0.4.4-py3-none-any.whl.metadata (823 bytes)
Downloading snobol4python-0.4.4-py3-none-any.whl (25 kB)
Installing collected packages: SNOBOL4python
Successfully installed SNOBOL4python-0.4.4


**1.4** Pattern Matching Statements

The operation of examining strings for the occurrence of specified substrings (i.e. pattern
matching) is fundamental to the SNOBOL4 language. Pattern matching can be specified by an expression of the following form:

* *the_subject* in *a_pattern*

where *the_subject* specifies a string that is to be examined, and *a_pattern* can be thought of as specifying a set of strings. The expression causes the subject string to be scanned from the left for the occurrence of a string specified by the PATTERN.

In the following, the first expression examines the value of TRADE for an occurrence of GRAM. The second expression is equivalent. Both expressions use the primitive pattern matching function σ (lower-case sigma), which matches a literal string, that is, a sequence of characters.

In [None]:
TRADE = 'PROGRAMMER'
TRADE in σ('GRAM') # first expression
PART = 'GRAM'
TRADE in σ(PART) # second expression

True

The following example applies a pattern matching expression whose pattern is a simple string, built by concatenation, to be matched literally:

In [None]:
ROW = 'K'
NO = 20
'K24' in σ(ROW + str(NO + 4))

True

**1.6** Patterns

The patterns in the preceding examples each specify matching a single string. It is also possible to specify more complex patterns. There are two operations available for constructing such patterns:

* alternation, and
* concatenation.

Alternation is indicated by an expression of the form:

* P1 $|$ P2

where the two patterns P1 and P2 are separated by a vertical bar character, the *or operator*. The binary $|$ operator implements the $PATTERN alternate$ $operator$. The value of the expression is a pattern structure that matches any string specified by either P1 or P2.

For example, the following code snippet builds a pattern structure that matches any of four strings and executes a search to scan for a match within the subject string. Internally, a list of alternates is implemented using the primitive pattern function $Π$(...), $PI$, a mnemonic for productions.

In [None]:
from SNOBOL4python import Π # Used internally
print('NOUN' in σ('ADJECTIVE') | σ('ADVERB') | σ('NOUN') | σ('VERB'))
print('NOUN' in Π(σ('ADJECTIVE'), σ('ADVERB'), σ('NOUN'), σ('VERB')))

True
True


As another example of alternation, consider the following code snippet:

* The first statement assigns to KEYWORD a pattern structure that matches either of two strings.
* KEYWORD may then be used wherever patterns are permitted. The second statement gives KEYWORD a new pattern value.
* The new value is equivalent to the pattern used in the third statement.

In [None]:
KEYWORD = σ('COMPUTER') | σ('PROGRAM') # first statement
print('PROGRAM' in KEYWORD)
print('ALGORITHM' in KEYWORD)
KEYWORD = KEYWORD | σ('ALGORITHM') # second statement
print('PROGRAM' in KEYWORD)
print('ALGORITHM' in KEYWORD)
print('ALGORITHM' in σ('COMPUTER') | σ('PROGRAM') | σ('ALGORITHM')) # third statement

True
False
True
True
True


Concatenation of two patterns, P1 and P2, is specified in the same way as the concatenation of two strings:

* P1 $+$ P2

That is, the two patterns are separated by a plus character, the addition operator. The value of the expression is a pattern that matches a string consisting of two substrings, the first matched by P1, the second matched by P2. The binary $+$ operator implements the $PATTERN sequence$ $operator$.

For example, the following expression succeeds in finding a pattern match. Internally, a list of subsequents is implemented using the primitive pattern function $Σ$(...), $SIGMA$, signifying a sequence of subsequents.

In [None]:
from SNOBOL4python import Σ, Π # Used internally
SCALE = σ('Fixed') | σ('Float')
BASE = σ('Binary') | σ('Decimal') | σ('Hex')
ATTRIBUTE = SCALE + BASE
DCL = 'AreaFixedDecimal'
print(DCL in ATTRIBUTE)
print(DCL in SCALE + BASE)
print(DCL in Σ(Π(σ('Fixed'), σ('Float')), Π(σ('Binary'), σ('Decimal'), σ('Hex'))))

True
True
True


**1.7** Conditional Value Assignment

It is possible to associate a variable with a component of a pattern such that if the pattern matches, the variable is assigned the substring matched by the component. The operator $\%$ is the conditional value-assignment operator and it is used in an expression of the form

* P $\%$ “variable_name”

where the variable name is specified as a string value, usually a string literal.

For example, the following code snippet assigns to BASE a pattern that matches either HEX or DEC. If BASE is used successfully in a pattern match, the value of B1 is set to the substring matched by BASE.

In [None]:
BASE = (σ('HEX') | σ('DEC')) % "B1"
pprint(["HEX" in BASE, B1])
pprint(["DEC" in BASE, B1])

[True, 'HEX']
[True, 'DEC']


There is also an operator $@$ for immediate value assignment which assigns a value to a variable if the associated component of the pattern matches regardless of whether the entire pattern matches. Immediate value assignment is discussed in more detail later.

**CHAPTER 2**

Pattern Matching

**2.1** Introduction

Strings of characters can be synthesized from smaller strings by concatenation. The converse of synthesis, decomposition of strings into substrings, is performed using pattern matching. Fundamentally, pattern matching is the process of examining a subject string for a substring which is one of a set specified by a pattern. The substring and parts thereof, formed by pattern matching, can be assigned as the values of variables.

Before matching actually occurs, the pattern expression is evaluated. Its value is a PATTERN, which may be thought of as a set of strings. The structure of a PATTERN is used to drive a pattern matching procedure (the scanner) which performs the actual matching. Should any string specified by the PATTERN appear as a substring of the subject, pattern matching succeeds.

For many users of the new PATTERN datatype, a knowledge of how patterns are actually matched is of little importance. The success or failure of matching is all that matters. However, by understanding the scanning procedure, a programmer can write more efficient patterns and make use of features such as immediate value assignment and unevaluated expressions that can actually change a pattern during matching. Thus, the secondary purpose of this chapter is to describe how the scanner works.

**2.2** Alternation and Concatenation

Alternation and concatenation are used to build pattern structures that match sets of strings. Alternation, indicated by the binary operator $|$, builds a single pattern structure from its two arguments. If P1 and P2 are PATTERNs, the statement

* P3 = P1 $|$ P2

builds a new structure and assigns it as the value of P3. P3 matches any string matched by P1 or P2.

The binary operator + is used to indicate concatenation. If P4 and P5 are PATTERNs, the statement

* P6 = P4 $+$ P5

builds a pattern structure and assigns it as the value of P6. P6 matches any string that can be formed from a string matched by P4, followed by a string matched by P5. Alternation and concatenation can be used to build pattern structures that match large numbers of strings. Consider the following statements:

In [None]:
P = σ('BE') | σ('BEA') | σ('BEAR')
Q = σ('RO') | σ('ROO') | σ('ROOS')
R = σ('DS') | σ('D')
S = σ('TS') | σ('T')
PAT = P + R | Q + S

PAT matches any of the twelve strings: BEDS, BED, BEADS, BEAD, BEARDS, BEARD, ROTS, ROT, ROOTS, ROOT, ROOSTS, or ROOST:

In [None]:
for subject in (
    'BEDS', 'BED', 'BEADS'
  , 'BEAD', 'BEARDS', 'BEARD'
  , 'ROTS', 'ROT', 'ROOTS'
  , 'ROOT', 'ROOSTS', 'ROOST'
): pprint([subject in POS(0) + PAT + RPOS(0), subject])

[True, 'BEDS']
[True, 'BED']
[True, 'BEADS']
[True, 'BEAD']
[True, 'BEARDS']
[True, 'BEARD']
[True, 'ROTS']
[True, 'ROT']
[True, 'ROOTS']
[True, 'ROOT']
[True, 'ROOSTS']
[True, 'ROOST']
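As a rough cross-check that does not use the package, the set of strings PAT describes can be enumerated in plain Python. This is only a sketch: the lists below are ordinary Python lists standing in for the alternations P, Q, R, and S. The grouping relies on Python's usual precedence, in which + binds more tightly than |, so PAT reads as (P + R) | (Q + S).

```python
from itertools import product

# Plain string lists standing in for the four alternations (not PATTERN objects).
P = ["BE", "BEA", "BEAR"]
Q = ["RO", "ROO", "ROOS"]
R = ["DS", "D"]
S = ["TS", "T"]

# Concatenation pairs every left string with every right string;
# alternation joins the two resulting lists, left operand first.
strings = [a + b for a, b in product(P, R)] + [a + b for a, b in product(Q, S)]

print(len(strings))
print(strings)
```

The enumeration order follows the order of the alternatives, which is also the order in which the scanner would try them.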


**2.3** Scanning

Matching a pattern structure against a subject string is done by a procedure called the scanner. The pattern structure behaves like a program that indicates to the scanner how to examine the subject string. At any instant during scanning, the scanner uses two pieces of information:

* where in the subject string it should be looking, and
* what component of the pattern structure it should match.

The scanner has a pointer called the cursor which is positioned to the left of the character that the scanner must match. A second pointer called the needle points at the component to be matched.

Consider the following example, in which the string of characters READS is matched by a pattern structure that is the value of BR.

In [None]:
BR = (σ('B') | σ('R')) + (σ('E') | σ('EA')) + (σ('D') | σ('DS'))
pprint(["READS" in BR])

[True]


For illustrative purposes, it is convenient to think of components of a pattern structure as a set of beads that the scanner is trying to thread using the needle. A bead diagram representing BR is shown in the slide show.

In bead diagrams, left-to-right order of concatenation is preserved. Alternation is represented top to bottom. The needle points at the bead that the scanner is currently trying to match. If a bead matches, the needle passes through and moves upward as far as it can go without crossing a horizontal line. If a bead does not match, the needle moves down to an alternative bead, provided one exists. Downward movement may not cross a horizontal line. If no alternative exists, the needle is pulled back through the last successfully matched bead, and an alternative is sought there.

The following figure illustrates the steps in matching BR against “READS”. The arrow pointing at “READS” represents the cursor, while the arrow pointing at the beads represents the needle. Failure in the fifth step causes the needle to be pulled back. The cursor is moved back at the same time.

Bead diagrams graphically illustrate one important control that the programmer has over the scanner. In a pattern-valued expression, alternatives are matched by the scanner in left-to-right order (top to bottom in the bead diagram). Thus, regarding BR, the scanner attempts to match 'B' before 'R', 'E' before 'EA', and 'D' before 'DS'. By positioning alternatives correctly, a programmer can control the order in which the scanner looks at them.
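The needle-and-cursor behaviour can be imitated with a few Python generators. The sketch below is not how SNOBOL4python is implemented; the helpers `lit`, `alt`, `seq`, and `search` are hypothetical names introduced only for illustration. Each helper yields candidate cursor positions, and asking a generator for its next value plays the role of pulling the needle back to look for an alternative.

```python
def lit(text):
    """Match the literal text at the cursor; yield the new cursor on success."""
    def match(subject, cursor, trace):
        ok = subject.startswith(text, cursor)
        trace.append((cursor, text, ok))      # record every attempt
        if ok:
            yield cursor + len(text)
    return match

def alt(*patterns):
    """Try each alternative in left-to-right order."""
    def match(subject, cursor, trace):
        for p in patterns:
            yield from p(subject, cursor, trace)
    return match

def seq(*patterns):
    """Match patterns one after another, backing up into earlier ones on failure."""
    def match(subject, cursor, trace):
        if not patterns:
            yield cursor
            return
        for mid in patterns[0](subject, cursor, trace):
            yield from seq(*patterns[1:])(subject, mid, trace)
    return match

def search(pattern, subject):
    """Unanchored scan: try each starting cursor position from the left."""
    trace = []
    for start in range(len(subject) + 1):
        for end in pattern(subject, start, trace):
            return subject[start:end], trace
    return None, trace

BR = seq(alt(lit("B"), lit("R")), alt(lit("E"), lit("EA")), alt(lit("D"), lit("DS")))
matched, trace = search(BR, "READS")
print(matched)
for cursor, text, ok in trace:
    print(cursor, text, "matched" if ok else "failed")
```

In this sketch the fifth recorded attempt is the failure of 'DS' at cursor 2, after which the needle returns to the second group and tries 'EA'.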

**2.5.1** Conditional Value Assignment

The binary operator % is used to indicate conditional value assignment. The value of the expression

* P $\%$ "V"

is a pattern in which V is associated with the pattern P. This pattern is the same as P, except that upon successful completion of pattern matching, the substring matched by P is assigned as the value of the variable V. Thus, by associating several variables with portions of a pattern, it is possible to ascertain what the overall pattern matches, and also which components of the pattern are used in the match. For example, rewriting BR as:

In [None]:
BR = ((σ('B') | σ('R')) + (σ('E') | σ('EA')) + (σ('D') | σ('DS'))) % "BRVAL"
pprint(["READS" in BR, BRVAL])

[True, 'READ']


associates the variable BRVAL with the entire pattern. On successful completion of matching, the entire substring matched is assigned as value of BRVAL. Rewriting still further, variables can be associated with pieces of the pattern.

In [None]:
BR = (
    (σ('B') | σ('R'))  % "FIRST"
  + (σ('E') | σ('EA')) % "SECOND"
  + (σ('D') | σ('DS')) % "THIRD"
) % "BRVAL"
pprint(["READS" in BR, FIRST, SECOND, THIRD, BRVAL])

[True, 'R', 'EA', 'D', 'READ']


A successful match causes the entire substring to be assigned as the value of BRVAL. "B" or "R" becomes the value of FIRST, "E" or "EA" becomes the value of SECOND, and "D" or "DS" becomes the value of THIRD. Failure to match leaves the values of all variables unchanged.

**2.5.2** Immediate Value Assignment

The binary operator $@$ signifies immediate value assignment. The expression

* P $@$ "V"

associates a variable V with a pattern P so that whenever P matches a substring, the substring immediately becomes the new value of V. It is possible, by using $@$, to associate variables with parts of a large pattern, to see how far scanning progressed in the event of failure. Value assignment is done for those parts of the pattern that match, even if the overall match fails. Suppose BR is rewritten using $@$ instead of $\%$ where shown.

In the following statement, pattern matching fails.

In [None]:
BR = (
    (σ('B') | σ('R')) @ "FIRST"
  + (σ('E') | σ('EA')) @ "SECOND"
  + (σ('D') | σ('DS')) @ "THIRD"
) % "BRVAL"
FIRST = SECOND = THIRD = BRVAL = None
pprint(["BEATS" in BR, FIRST, SECOND, THIRD, BRVAL])

[False, 'B', 'EA', None, None]


However, since immediate assignment is performed whenever the associated part of the pattern matches, the following assignments are made.

* FIRST = 'B'
* SECOND = 'E'
* SECOND = 'EA'

Values of THIRD and BRVAL are unchanged. If conditional assignment is used, values of all four variables are unchanged. In the following example, the pattern matches.

In [None]:
FIRST = SECOND = THIRD = BRVAL = None
pprint(["READS" in BR, FIRST, SECOND, THIRD, BRVAL])

[True, 'R', 'EA', 'D', 'READ']


Values assigned both during and after scanning are:

* FIRST = 'R'
* SECOND = 'E'
* SECOND = 'EA'
* THIRD = 'D'
* BRVAL = 'READ'

The outcome is the same as if conditional value assignment had been used. Immediate value assignment is less efficient in this case because a redundant assignment is made: SECOND = 'E' is later overwritten by SECOND = 'EA'. As a general rule, conditional value assignment should be used whenever possible. Immediate value assignment should be used only in those cases where intermediate results are important.
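The difference between the two kinds of assignment can be sketched in plain Python. The code below is an illustration under stated assumptions, not the package's implementation: `immediate` and `conditional` are hypothetical helpers standing in for the @ and % operators, and `state` is a dictionary invented here to hold variables, a log of immediate assignments, and the captures awaiting commitment.

```python
def lit(text):
    def match(subject, cursor, state):
        if subject.startswith(text, cursor):
            yield cursor + len(text)
    return match

def alt(*patterns):
    def match(subject, cursor, state):
        for p in patterns:
            yield from p(subject, cursor, state)
    return match

def seq(*patterns):
    def match(subject, cursor, state):
        if not patterns:
            yield cursor
            return
        for mid in patterns[0](subject, cursor, state):
            yield from seq(*patterns[1:])(subject, mid, state)
    return match

def immediate(pattern, name):
    """Assign as soon as the component matches, even if the overall match later fails."""
    def match(subject, cursor, state):
        for end in pattern(subject, cursor, state):
            state["vars"][name] = subject[cursor:end]
            state["log"].append((name, subject[cursor:end]))
            yield end
    return match

def conditional(pattern, name):
    """Remember the capture; it is committed only if the whole match succeeds."""
    def match(subject, cursor, state):
        for end in pattern(subject, cursor, state):
            state["pending"].append((name, subject[cursor:end]))
            yield end
            state["pending"].pop()            # discarded when the scanner backs up
    return match

def search(pattern, subject):
    state = {"vars": {}, "log": [], "pending": []}
    for start in range(len(subject) + 1):
        for _ in pattern(subject, start, state):
            state["vars"].update(state["pending"])   # commit conditional captures
            return True, state
    return False, state

BR = conditional(
    seq(
        immediate(alt(lit("B"), lit("R")), "FIRST"),
        immediate(alt(lit("E"), lit("EA")), "SECOND"),
        immediate(alt(lit("D"), lit("DS")), "THIRD"),
    ),
    "BRVAL",
)
ok_beats, beats = search(BR, "BEATS")
ok_reads, reads = search(BR, "READS")
print(ok_beats, beats["log"], beats["vars"])
print(ok_reads, reads["log"], reads["vars"])
```

The log lists immediate assignments in the order they occur, so it shows both the assignments made during a failed match and any assignment that is later overwritten.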

**2.5.4** Association with the Variable $OUTPUT$

The pseudo-variable $OUTPUT$ is provided as a convenience; it may be associated with any portion of a pattern. An association with $OUTPUT$ causes the successful alternative to be printed. Using $@$ to associate $OUTPUT$ with several parts of a pattern achieves the effect of tracing the progress of the scanner. With BR constructed as follows, matching it against 'READ' prints each alternative as it succeeds:

In [None]:
BR = (
    (σ('B') | σ('R'))  @ "OUTPUT"
  + (σ('E') | σ('EA')) @ "OUTPUT"
  + (σ('D') | σ('DS')) @ "OUTPUT"
)
pprint(["READ" in BR])

R·E·EA·D·
[True]


**2.6** The epsilon pattern, ε(), in Pattern Matching

The epsilon pattern, ε(), matches the null string, a string of zero length. Attempts by the scanner to match the epsilon pattern always succeed. The ε() pattern is frequently used in more complex patterns. For example, a pattern that matches the eight strings, C, D, AC, AD, BC, BD, ABC, ABD, can be written as

* (ε() $|$ σ('A')) $+$ (ε() $|$ σ('B')) $+$ (σ('C') $|$ σ('D'))

Matching a pattern of the form

* ε() $@$ "X" $@$ "Y" $+$ PAT

sets the values of X and Y to the null string before matching of PAT begins.
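The role of the null string as an "optional element" can be checked without the package. In this sketch, each optional component is modelled as a choice between the empty string and a literal, and the same set is then written as a regular expression; both are stand-ins chosen for illustration, not the library's own representation.

```python
import re
from itertools import product

# Each optional element is a choice between "" (the null string) and the literal.
candidates = ["".join(parts) for parts in product(["", "A"], ["", "B"], ["C", "D"])]
print(candidates)

# The same set written as a regular expression with optional elements.
optional_ab = re.compile(r"A?B?[CD]")
print(all(optional_ab.fullmatch(c) for c in candidates))
```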

**2.7** The theta pattern, $Θ$("variable_name"), Cursor Position

The theta pattern, $Θ$, is the cursor position pattern. Its argument is a variable. The value of $Θ$(“X”) is a pattern structure that matches the null string and assigns the current cursor position as an integer value of the variable X. Assignment of the cursor position to the argument of the $Θ$ function takes place as immediate value assignment. Value is assigned when the cursor position operator is encountered during pattern matching, not following successful completion.

Execution of the following statements assigns the integers 0, 1, 2, 3, 4, and finally 5 to the variable OUTPUT, effectively tracing the progress of the cursor.

In [None]:
HEAD = None
pprint(["Test THETA" in ARB() + Θ("OUTPUT") + σ('THETA'), HEAD])

0·1·2·3·4·5·
[True, None]


Pattern matching succeeds when the cursor is positioned immediately to the left of THETA. The cursor position at this point is 5, the final value assigned to OUTPUT. HEAD is never referenced by the pattern, so it remains None.
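The sequence of cursor positions can be reproduced with a short loop. This is a simplified sketch, assuming a single scan that starts at the left end of the subject; `cursor_trace` is a hypothetical helper, and it models only the case traced above, not the library's scanner.

```python
def cursor_trace(subject, target):
    """Return the cursor positions visited until `target` is found at the cursor."""
    visited = []
    for cursor in range(len(subject) + 1):
        visited.append(cursor)                  # the value a cursor-position pattern would record
        if subject.startswith(target, cursor):
            break
    return visited

print(cursor_trace("Test THETA", "THETA"))
```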

**2.8** $LEN$

$LEN$(integer) is a primitive function whose value is a pattern structure that matches any string of the specified length. The argument of $LEN$ must have a non-negative integer value when pattern matching is performed.

**2.9** $SPAN$ and $BREAK$

$SPAN$ and $BREAK$ are primitive functions whose values are pattern structures that match runs of characters. Patterns described by

* a run of blanks,
* a string of digits, and
* a word (run of letters),

can be formed using $SPAN$ as

* $SPAN$(' ')
* $SPAN$('0123456789')
* $SPAN$('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

Patterns described by

* everything up to the next blank,
* everything up to the next punctuation mark, and
* everything up to the next number,

can be formed using $BREAK$ as

* $BREAK$(' ')
* $BREAK$(',.;:!?')
* $BREAK$('+-0123456789')

Arguments of $BREAK$ and $SPAN$ must be non-null strings when pattern matching is performed.

The pattern structure for $SPAN$ matches the longest string beginning at the cursor that consists solely of characters appearing in the argument. $SPAN$ may be thought of as streaming from the cursor until a character not included in the argument is found. $SPAN$ must match at least one character, or it fails.

$BREAK$ generates a pattern structure that matches the longest string beginning at the cursor that does not contain a character of the argument. Thus, regarding its argument as a list of 'break' characters, BREAK streams from the cursor up to, but not including, the first break character. $BREAK$ must find a break character, or it fails. If the cursor is positioned immediately to the left of a break character, $BREAK$ matches the null string.

A bead diagram for the following statement illustrates how the cursor is moved by $SPAN$ and $BREAK$: ...

In [None]:
SUBJECT = PREDICATE = None
pprint([
  'It runs.' in BREAK(' ') % "SUBJECT" + SPAN(' ') + BREAK('.') % "PREDICATE" + σ('.')
, SUBJECT
, PREDICATE
])

[True, 'It', 'runs']
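The cursor movements described for $SPAN$ and $BREAK$ can be written as two small functions. This is a sketch in plain Python rather than the package's code: `span` and `break_` are hypothetical helpers that take a subject, a cursor, and a character set, and return the new cursor or None on failure.

```python
def span(subject, cursor, chars):
    """Advance over the longest run of characters in `chars`; fail if the run is empty."""
    end = cursor
    while end < len(subject) and subject[end] in chars:
        end += 1
    return end if end > cursor else None

def break_(subject, cursor, chars):
    """Advance up to, but not including, the next character in `chars`; fail if none is found."""
    end = cursor
    while end < len(subject) and subject[end] not in chars:
        end += 1
    return end if end < len(subject) else None

text = "It runs."
after_subject = break_(text, 0, " ")               # stops before the blank
after_blanks = span(text, after_subject, " ")      # skips the blank
after_predicate = break_(text, after_blanks, ".")  # stops before the period
print(text[:after_subject], text[after_blanks:after_predicate])
```

Note that `break_` returns the unchanged cursor when a break character is immediately to the right of the cursor, which corresponds to matching the null string.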


**2.10** $ANY$ and $NOTANY$

$ANY$(string) and $NOTANY$(string) are primitive functions whose values are pattern structures that match single characters. $ANY$ matches any character appearing in its argument. $NOTANY$ matches any character not appearing in its argument. Thus, the pattern structure for $ANY$('AEIOU') matches any vowel. The pattern for $NOTANY$('AEIOU') matches any character that is not a vowel. Arguments of $ANY$ and $NOTANY$ must be non-null strings when pattern matching is performed. $ANY$ and $NOTANY$ are fast ways of looking for one of a set of single characters. For example,

* $ANY$('AEIOU')

is preferable to

* σ('A') $|$ σ('E') $|$ σ('I') $|$ σ('O') $|$ σ('U')

**2.11** $TAB$, $RTAB$, and $REM$

$TAB$(integer) and $RTAB$(integer) are primitive functions whose values are pattern structures that match all characters from the current cursor position up to a specific point in the subject string. $TAB$(N) matches up through the Nth character of the subject string. $RTAB$(N) matches up to but not including the Nth character from the right end of the subject string. Stated another way, $TAB$(N) ensures that N characters are matched by positioning the cursor to the right of the Nth character. $RTAB$(N) ensures that all but N characters are matched by positioning the cursor to the left of the Nth character from the end.

$RTAB$(0) is particularly useful for matching everything to the end of the subject string. For backwards compatibility, the primitive function $REM$() has the same pattern structure as $RTAB$(0).

$TAB$ and $RTAB$ require integer arguments when pattern matching is performed. If the argument of $TAB$ or $RTAB$ is negative, a program error occurs. An argument that would require moving the cursor left causes failure.

$TAB$ and $RTAB$ are particularly valuable in breaking fields out of structured data:

* TAB(3) + TAB(30) % "NAME" + TAB(35) + REM() % "PO"
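The field-extraction pattern above can be mirrored with two cursor functions. The sketch below is an assumption-laden illustration: `tab` and `rtab` are hypothetical helpers returning the new cursor or None, and the sample record and its column layout are made up for the example.

```python
def tab(subject, cursor, n):
    """Move the cursor right to absolute position n; fail if that would move it left."""
    if n < 0:
        raise ValueError("TAB requires a non-negative argument")
    return n if cursor <= n <= len(subject) else None

def rtab(subject, cursor, n):
    """Move the cursor right until n characters remain; fail if that would move it left."""
    if n < 0:
        raise ValueError("RTAB requires a non-negative argument")
    target = len(subject) - n
    return target if target >= cursor else None

# Made-up record: 3-column id, 27-column name, 5-column filler, then the rest.
record = "001" + "ADA LOVELACE".ljust(27) + "APT 7" + "PO BOX 42"

c1 = tab(record, 0, 3)
c2 = tab(record, c1, 30)
name = record[c1:c2]
c3 = tab(record, c2, 35)
c4 = rtab(record, c3, 0)      # same effect as REM: everything to the end
po = record[c3:c4]
print(repr(name.rstrip()), repr(po))
```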

**2.12** $POS$ and $RPOS$

$POS$(integer) and $RPOS$(integer) are primitive functions whose values are pattern structures. These pattern structures match the null string if the cursor is at a point in the subject string specified by the integer argument. $POS$(N) succeeds, matching the null string, only if the cursor is positioned just at the right of the Nth character. $RPOS$(N) succeeds, matching the null string, only if the cursor is positioned just at the left of the Nth character from the end of the subject string. POS and RPOS never cause the cursor to be moved; they test its position.

$POS$(0) is a pattern that succeeds only if the cursor is at the left of the subject string. $RPOS$(0) succeeds only if the cursor is at the right of the subject string. $POS$(0) and $RPOS$(0) can serve as left and right anchors for any pattern P, as in

* ENTIRE = POS(0) + P + RPOS(0)

Arguments for $POS$ and $RPOS$ must have nonnegative integer values when pattern matching is performed. Negative or non-integer arguments cause a program error.

The following program uses $POS$, $RPOS$, $SPAN$, and $BREAK$ to list cards that do not conform to a specified format. Cards, to be valid, must have four columns of data right justified at columns 10, 20, 30, and 40. Data in any field must contain no more than nine non-blank characters. $SPAN$ and $BREAK$ are used to locate fields on a card while $POS$ and $RPOS$ verify the location of the fields.

In [None]:
FIELD = SPAN(' ') + BREAK(' ')
FORMAT = (
    POS(0)  + FIELD
  + POS(10) + FIELD
  + POS(20) + FIELD
  + POS(30) + FIELD
  + POS(40) + SPAN(' ')
  + RPOS(0)
)
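For comparison, the same card format can be checked in plain Python. This is a sketch of an equivalent validator, not part of the notebook's program: `card_is_valid` is a hypothetical function, and the sample cards are invented. It assumes, as FORMAT does, that each ten-column field holds one or more blanks followed by non-blank data, and that at least one blank follows column 40.

```python
import re

# One or more blanks followed by one or more non-blank characters.
FIELD = re.compile(r" +[^ ]+")

def card_is_valid(card):
    """Check four right-justified ten-column fields followed by trailing blanks."""
    for left in (0, 10, 20, 30):
        if not FIELD.fullmatch(card[left:left + 10]):
            return False
    tail = card[40:]
    return len(tail) > 0 and tail.strip(" ") == ""

good = "".join(value.rjust(10) for value in ["12", "345", "6", "7890"]) + " "
left_justified = "".join(value.ljust(10) for value in ["12", "345", "6", "7890"]) + " "

for card in (good, left_justified, good.rstrip(" ")):
    print(card_is_valid(card), repr(card))
```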

**2.13** $FAIL$

$FAIL$ is a primitive pattern function that always fails to match. FAIL() causes the scanner to seek alternatives.

**2.14** $FENCE$

The primitive pattern function $FENCE$ has a pattern structure which matches the null string when encountered by the scanner moving left to right through a pattern. If a subsequent failure causes the scanner to back up into $FENCE$, seeking an alternative, the pattern match is terminated. Considering $FENCE$ as a bead, the needle passes freely from left to right. Attempting to pull the needle back through $FENCE$ causes failure of pattern matching.
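The one-way behaviour of a fence can be imitated with a generator that raises an exception when it is resumed. The following is a minimal sketch, assuming a match anchored at the start of the subject; `MatchAborted`, `fence`, `lit`, `alt`, `seq`, and `matches_at_start` are hypothetical names, not the package's API.

```python
class MatchAborted(Exception):
    """Raised when the scanner tries to back up through a fence."""

def lit(text):
    def match(subject, cursor):
        if subject.startswith(text, cursor):
            yield cursor + len(text)
    return match

def alt(*patterns):
    def match(subject, cursor):
        for p in patterns:
            yield from p(subject, cursor)
    return match

def seq(*patterns):
    def match(subject, cursor):
        if not patterns:
            yield cursor
            return
        for mid in patterns[0](subject, cursor):
            yield from seq(*patterns[1:])(subject, mid)
    return match

def fence():
    def match(subject, cursor):
        yield cursor          # forward direction: match the null string
        raise MatchAborted    # backed into: terminate the whole match
    return match

def matches_at_start(pattern, subject):
    try:
        return any(True for _ in pattern(subject, 0))
    except MatchAborted:
        return False

without_fence = seq(alt(lit("A"), lit("AB")), lit("C"))
with_fence = seq(alt(lit("A"), lit("AB")), fence(), lit("C"))
print(matches_at_start(without_fence, "ABC"))
print(matches_at_start(with_fence, "ABC"))
```

Without the fence, the failure of 'C' after 'A' sends the scanner back to try 'AB'. With the fence, the same failure ends the match before 'AB' is ever tried.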

**2.15** $ABORT$

$ABORT$ is a primitive pattern function with a pattern structure that causes immediate failure of the entire pattern match. No alternatives are tried, and the statement fails. $ABORT$() is useful in constructing conditional pattern matching statements. For instance, in processing source decks as data, the following pattern ignores comment cards, but matches all others against the pattern CARD. Similarly, the pattern SHORT_PAT permits an attempt to match PAT only if the subject string is less than 12 characters long.

In [None]:
#CARD = σ('?')
#PAT = σ('?')
#CARD_FORM = σ('*') + ABORT() | CARD # ABORT not yet implemented
#SHORT_PAT = LEN(12) + ABORT() | PAT # ABORT not yet implemented

**2.16** Unevaluated Expressions

There are four forms of unevaluated expressions:

1. the lambda expression, (lambda: expression): The built-in Python lambda, when used as an argument to a primitive pattern matching function, postpones evaluation of the argument until pattern matching is performed, after pattern building is complete.

2. the zeta function, ζ("pattern_variable") or ζ(lambda: pattern_variable): The primitive pattern matching function ζ postpones the evaluation of its argument, the name of a variable containing a PATTERN as its value. If P is a variable, then ζ("P") is an unevaluated expression. The unevaluated expression is evaluated when the scanner encounters a primitive function with an unevaluated expression as its argument within a pattern structure.

3. the LAMBDA function, Λ("boolean expression") or Λ(lambda: bool_expression): The primitive pattern matching function, LAMBDA, is used for general conditional matching. A boolean value of False causes failure and backtracking. A value of True allows scanning to continue. This pattern pairs well with the @ operator, immediate assignment, which could have previously captured results useful in the bool_expression.

4. The lambda function, λ("Python line of code"): This primitive pattern-matching function allows specifying free-form Python code to execute after a successful pattern match, i.e. conditional command execution. This pattern pairs well with the % operator, conditional assignment, which could have previously captured results useful in the Python code.

If an unevaluated expression appears as part of a pattern, the expression is evaluated when encountered during pattern matching. If evaluation of the expression is successful, the value becomes part of the pattern structure and pattern matching continues. If evaluation of the expression fails, the scanner backs up, seeking alternatives. Failure during evaluation of an expression does not necessarily cause failure of the entire pattern match.

A typical use for unevaluated expressions is motivated by the following example. Two strings are read as input data and a list is made of the words appearing in both strings. The list is generated by obtaining words one at a time from the first string using the pattern WORD and using the pattern

* (POS(0) | ANY(' .,')) + W + ANY(' .,')

to determine if each word appears in the second string.

In [None]:
WORD = BREAK(" .,") % "W" + SPAN(" .,")
STRING1 = "THESE TWO STRINGS ARE ALMOST ALIKE."
STRING2 = "THE TWO STRINGS AREN'T ALIKE."
LIST = POS(0) + λ("WORDS = []") + ARBNO(WORD + λ("WORDS.append(W)")) + RPOS(0)
if STRING1 in LIST:
    for W in WORDS:
        if STRING2 in (POS(0) | ANY(" .,")) + σ(W) + ANY(' .,'):
            print(W)
else: print("Boo!")

TWO
STRINGS
ALIKE


As programmed above, a pattern structure for

* (POS(0) | ANY(' .,')) + W + ANY(' .,')

must be built during each pass through the loop because of the structure for ANY. Constructing the pattern outside of the loop is not appropriate either, since the value of W changes for each iteration of the loop. Using an unevaluated expression in place of the variable W does permit the pattern structure to be constructed outside of the loop. As illustrated below, the pattern structure for FINDW contains σ(lambda: W) in place of σ(W). The expression lambda: W is not evaluated until needed in pattern matching. The value of W used during pattern matching is the current value, in this case the word currently selected by the loop.

In [None]:
WORD = BREAK(" .,") % "W" + SPAN(" .,")
FINDW = (POS(0) | ANY(" .,")) + σ(lambda: W) + ANY(' .,')
STRING1 = "THESE TWO STRINGS ARE ALMOST ALIKE."
STRING2 = "THE TWO STRINGS AREN'T ALIKE."
LIST = POS(0) + λ("WORDS = []") + ARBNO(WORD + λ("WORDS.append(W)")) + RPOS(0)
if STRING1 in LIST:
    for W in WORDS:
        if STRING2 in FINDW:
            print(W)
else: print("Boo!")

TWO
STRINGS
ALIKE


Unevaluated lambda expressions are valid arguments for primitive pattern-valued functions. A pattern structure for the function is built, but the argument remains unevaluated until pattern matching is performed.
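The effect of deferring evaluation can be seen in a few lines of ordinary Python. This sketch does not use the package: `literal` is a hypothetical helper that accepts either a string or a zero-argument callable, and `state` is a dictionary invented here to hold the changing value of W.

```python
def literal(value):
    """Build a matcher once; `value` is a string or a zero-argument callable."""
    def found_in(subject):
        # A callable is evaluated now, at match time, not when the matcher was built.
        text = value() if callable(value) else value
        return text in subject
    return found_in

state = {"W": "TWO"}
eager = literal(state["W"])             # value captured while building
lazy = literal(lambda: state["W"])      # value looked up while matching

state["W"] = "ALIKE"
print(eager("THE TWO STRINGS"))   # True: still looks for TWO
print(lazy("THE TWO STRINGS"))    # False: now looks for ALIKE
```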

In pattern matching, unevaluated expressions can be used in a variety of ways, as illustrated by the following examples.

**2.16.1** Example 1

PAIR is a pattern that matches any two consecutive identical characters. PAIR uses LEN(1) to match any character, and immediate value assignment to assign the character as value of X. The expression (lambda: X) is then evaluated and must match the same character as LEN(1).

In [None]:
PAIR = (LEN(1) @ "X" + σ(lambda: X)) % "OUTPUT"
'COOK' in PAIR
'COMMON' in PAIR
'AARON' in PAIR
'CHICKADEE' in PAIR

OO
MM
AA
EE


True
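Readers familiar with regular expressions may recognise PAIR as a backreference. The sketch below uses Python's standard `re` module as a point of comparison only; it is not how the package evaluates PAIR, and `.` differs slightly from LEN(1) in that it does not match a newline by default.

```python
import re

# (.) captures one character; \1 requires the same character again.
pair = re.compile(r"(.)\1")

words = ["COOK", "COMMON", "AARON", "CHICKADEE"]
doubles = [pair.search(word).group() for word in words]
print(doubles)
```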

**2.16.2** Example 2

Given any subject string STR and any pattern P, BIGP finds the longest substring of STR that P matches.

* BIG = ""
* BIGP = (ζ("P") @ "TRY" + Λ(lambda: len(TRY) > len(BIG))) @ "BIG" + FAIL()

BIGP uses two variables, BIG and TRY. During pattern matching, the value of BIG is the longest substring found. Before pattern matching, BIG must be initialized to the null string. TRY is assigned every substring that the pattern P matches. If TRY is longer than BIG, the value of BIG is updated.

BIGP utilizes unevaluated expressions in two ways. ζ("P") allows BIGP to be constructed without specifying the value of P. The value of P is determined during pattern matching. The predicate Λ(lambda: len(TRY) > len(BIG)) is evaluated during pattern matching whenever ζ("P") matches a substring. It compares the size of TRY with the size of BIG. If the new substring is shorter, the predicate fails. Failure of a predicate or function during pattern matching causes the scanner to back up seeking alternatives. If the new substring is longer, the predicate succeeds, returning the null string as value. This null string is immediately matched. The variable BIG is then assigned the new substring as value. FAIL causes the scanner to back up and look for another substring matched by P. The following is a test program for BIGP. It ends the pattern with σ('\n') rather than FAIL(); since STR contains no newline, σ('\n') never matches and forces the same backing up.

In [None]:
BIGP = (ζ("P") @ "TRY" + Λ(lambda: len(TRY) > len(BIG))) @ "BIG" + σ('\n')
STR = 'IN 1964 NFL ATTENDANCE JUMPED TO 4,807,884; AN INCREASE OF 401,810.'
P = SPAN("0123456789,")
BIG = ""; STR in BIGP
print("Largest number is:", BIG)
P = SPAN(UCASE)
BIG = ""; STR in BIGP
print("Largest word is:", BIG)

Largest number is: 4,807,884
Largest word is: ATTENDANCE
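The same two answers can be obtained in plain Python by collecting every maximal run of the allowed characters and keeping the longest. This is a sketch for comparison, not a translation of BIGP: `longest_run` is a hypothetical helper, and on ties it keeps the earliest run, as the strict comparison in BIGP does.

```python
import re
import string

STR = "IN 1964 NFL ATTENDANCE JUMPED TO 4,807,884; AN INCREASE OF 401,810."

def longest_run(subject, chars):
    """Return the longest run of characters drawn from `chars` (earliest on ties)."""
    runs = re.findall("[" + re.escape(chars) + "]+", subject)
    return max(runs, key=len, default="")

print("Largest number is:", longest_run(STR, "0123456789,"))
print("Largest word is:", longest_run(STR, string.ascii_uppercase))
```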


**2.16.3** Example 3

Recursive definitions of patterns are possible using recursive pattern functions. The pattern structure for P is constructed referencing a future invocation of P giving rise to a recursive definition. The pattern P matches either Y or anything matched by P followed by Z. Therefore, since P matches Y, it also matches YZ. Since P matches YZ, it also matches YZZ, etc. Thus, P matches strings of the form Y, YZ, YZZ, YZZZ, and so on.

In [None]:
P = σ('Y') | ζ('P') + σ('Z')
PO = POS(0) + P @ "OUTPUT" + RPOS(0)
for subject in ['Y', 'YZ', 'YZZ', 'YZZZ']:
    subject in PO

Y·
Y·YZ·
Y·YZ·YZZ·
Y·YZ·YZZ·YZZZ·


Recursive definitions can be quite complicated, as in the following example, which recognizes a simple class of arithmetic expressions.

In [None]:
VAR     = ANY('XYZ')
ADDOP   = ANY('+-')
MULOP   = ANY('*/')
FACTOR  = VAR | σ('(') + ζ('EXP') + σ(')')
TERM    = FACTOR | FACTOR + MULOP + ζ('TERM')
EXP     = ADDOP + TERM | TERM | TERM + ADDOP + ζ('EXP')
for subject in ["X+Y*(Z+X)", "X+Y+Z", "XY"]:
    if    subject in POS(0) + EXP + RPOS(0):
          print(subject, "is an expression.")
    else: print(subject, "is NOT an expression!")

X+Y*(Z+X) is an expression.
X+Y+Z is an expression.
XY is NOT an expression!
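The grammar behind FACTOR, TERM, and EXP can also be written as three mutually recursive Python generators. This is a sketch of a backtracking recognizer under the same grammar, not the package's mechanism; the function names are hypothetical. Each function yields every cursor position at which its rule can end, so the caller can try the next candidate when a later step fails.

```python
def factor(s, i):
    """FACTOR = VAR | '(' EXP ')'"""
    if i < len(s) and s[i] in "XYZ":
        yield i + 1
    if i < len(s) and s[i] == "(":
        for j in exp(s, i + 1):
            if j < len(s) and s[j] == ")":
                yield j + 1

def term(s, i):
    """TERM = FACTOR | FACTOR MULOP TERM"""
    for j in factor(s, i):
        yield j
        if j < len(s) and s[j] in "*/":
            yield from term(s, j + 1)

def exp(s, i):
    """EXP = ADDOP TERM | TERM | TERM ADDOP EXP"""
    if i < len(s) and s[i] in "+-":
        yield from term(s, i + 1)
    for j in term(s, i):
        yield j
        if j < len(s) and s[j] in "+-":
            yield from exp(s, j + 1)

def is_expression(s):
    """Anchored check: some way of matching EXP must consume the whole string."""
    return any(j == len(s) for j in exp(s, 0))

for subject in ["X+Y*(Z+X)", "X+Y+Z", "XY"]:
    print(subject, is_expression(subject))
```

The order in which candidates are yielded differs slightly from the order of alternatives in the patterns above, which does not affect whether a string is recognised.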


**2.17** $ARB$

$ARB$ is a function whose initial value is a primitive pattern structure that matches zero or more characters. When first encountered by the scanner moving from left to right, $ARB$ matches the null string. When 'backed into' on subsequent occasions, $ARB$ increases the size of the substring it matches by one. $ARB$ fails only when it can no longer increase the length of the substring it matches.

The pattern structure for $ARB$ has implicit alternatives. When 'backed into' because of failure, $ARB$ attempts to find another suitable substring rather than fail. Only when all implicit alternatives have failed is the needle passed to an explicit alternative or back to a previously successful component.

The following definition of $ARB$ is equivalent to the definition given to $ARB$().

* $ARB$ = ε() | LEN(1) + $ζ$(lambda: $ARB$)

A bead diagram for $ARB$ is: ...  

If the bead for $ARB$ is replaced with the bead diagram for $ARB$, an expanded bead diagram for $ARB$ becomes ...

It can be seen from the bead diagram that (1) the null string is matched on the first attempt, (2) subsequent attempts increase the substring matched by one character, and (3) failure occurs when the size of the substring cannot be increased.
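The three observations can be illustrated with generators that yield candidate cursor positions. This is a sketch, not the package's code: `arb` and `arb_recursive` are hypothetical helpers, the second following the recursive definition ε() | LEN(1) + ARB given above.

```python
def arb(subject, cursor):
    """Yield end cursors: the null match first, then one character longer each time."""
    for end in range(cursor, len(subject) + 1):
        yield end

def arb_recursive(subject, cursor):
    """The same behaviour written as: null string, or one character followed by ARB."""
    yield cursor                          # first alternative: the null string
    if cursor < len(subject):             # LEN(1) succeeds only if a character remains
        yield from arb_recursive(subject, cursor + 1)

print(list(arb("ABC", 0)))
print(list(arb_recursive("ABC", 0)))
```

Each generator is exhausted once the end of the subject is reached, which corresponds to ARB failing when it can no longer lengthen its match.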

**2.18** $ARBNO$

$ARBNO$ is a mnemonic for 'arbitrary number of'. $ARBNO$(pattern) is a primitive function whose value is a pattern structure that matches zero or more consecutive occurrences of strings matched by its argument. When encountered by the scanner in the forward direction, $ARBNO$(pattern) matches the null string. When backed into, it tries to increase the length of the substring matched by its argument.

The value of $ARBNO$(P) is a pattern structure with implicit alternatives. It is equivalent to the pattern structure defined in the statement

* $ARBNO$ = ε() | P + $ζ$(lambda: $ARBNO$)

A bead diagram having a form similar to that for ARB illustrates the implicit alternatives of $ARBNO$(P).
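The implicit alternatives can be sketched with a recursive generator. The helpers `lit` and `arbno` below are hypothetical and written only for illustration. The check that skips zero-length matches is an assumption added here so the sketch cannot recurse forever; it is not taken from the source.

```python
def lit(text):
    def match(subject, cursor):
        if subject.startswith(text, cursor):
            yield cursor + len(text)
    return match

def arbno(pattern):
    """Zero occurrences first; each time it is backed into, try one more occurrence."""
    def match(subject, cursor):
        yield cursor                              # the null string
        for mid in pattern(subject, cursor):
            if mid > cursor:                      # assumed guard against endless null matches
                yield from match(subject, mid)
    return match

ends = list(arbno(lit("AB"))("ABABX", 0))
print(ends)
```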

**2.19** $BAL$

The initial value of the variable $BAL$ is a primitive pattern structure that matches any non-null string of characters which is balanced with respect to parentheses. $BAL$ matches

* X
* XYZ
* (A+B)
* A(B*C)
* (E/F)G+H

$BAL$ does not match

* )A+B(
* ((A+B)

A pair of patterns that are equivalent to $BAL$ are

* BALEXP = NOTANY('()') | σ('(') + ARBNO(ζ("BALEXP")) + σ(')')
* BAL = BALEXP + ARBNO(BALEXP)

The value of $BALEXP$ matches a single balanced expression that consists of a single character that is neither an open nor closed parenthesis, or it matches an arbitrary number of balanced expressions (possibly none) enclosed in parentheses. $BAL$ itself matches the concatenation of one or more single balanced expressions.
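The notion of "balanced with respect to parentheses" can be tested directly with a depth counter. This sketch checks a whole string, whereas $BAL$ as a pattern matches balanced substrings starting at the cursor; `is_balanced` is a hypothetical helper, not part of the package.

```python
def is_balanced(text):
    """True if text is non-null and its parentheses are properly nested."""
    if not text:
        return False
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a closing parenthesis with no opener
                return False
    return depth == 0              # every opener must have been closed

for sample in ["X", "XYZ", "(A+B)", "A(B*C)", "(E/F)G+H", ")A+B(", "((A+B)"]:
    print(is_balanced(sample), sample)
```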

**2.20** $SUCCEED$

The variable $SUCCEED$ has a pattern structure as its initial value (note that the import list at the top of this notebook exposes the name SUCCESS rather than SUCCEED). $SUCCEED$ matches the null string when first encountered by the scanner moving left to right through a pattern. If a subsequent failure causes the scanner to back up to $SUCCEED$, seeking an alternative, $SUCCEED$ again matches the null string. Thus, $SUCCEED$ always matches the null string, both in the forward direction and when alternatives are sought. $SUCCEED$ has a bead representation where all implicit alternatives are the null string.

Since the number of implied alternatives is infinite, the scanner can never back through $SUCCEED$. Practical uses for $SUCCEED$ seem limited.
