Excerpts from VANILLA SNOBOL4 TUTORIAL AND REFERENCE MANUAL\
by Mark B. Emmer [marke@snobol4.com]\
© Copyright 1985, 1988 by Catspaw, Inc. \

Welcome to the world of SNOBOL4. It's a world where you can manipulate text and search for patterns in a simple and natural manner. SNOBOL4's pattern programming provides a new way to work with computers. If you would like to add SNOBOL4 to your repertoire of problem-solving tools, and learn why so many people are excited about it, read on.

This tutorial is addressed to the beginning Python/SNOBOL4 programmer. It assumes a modest knowledge of general programming concepts, and experience with another high-level language, such as BASIC, C, Fortran, or Pascal.

In [None]:
!pip install SNOBOL4python==0.4.5
import sys
from pprint import pprint
## Thirty one (31) flavors of patterns to choose from ...
from SNOBOL4python import ε, σ, π, λ, Λ, ζ, θ, Θ, φ, Φ, α, ω
from SNOBOL4python import ABORT, ANY, ARB, ARBNO, BAL, BREAK, BREAKX, FAIL
from SNOBOL4python import FENCE, LEN, MARB, MARBNO, NOTANY, POS, REM, RPOS
from SNOBOL4python import RTAB, SPAN, SUCCESS, TAB
# Miscellaneous
from SNOBOL4python import GLOBALS, TRACE, PATTERN, Ϩ, STRING
from SNOBOL4python import ALPHABET, DIGITS, UCASE, LCASE, NULL
from SNOBOL4python import nPush, nInc, nPop, Shift, Reduce, Pop
# Instantiate the global variable space
GLOBALS(globals())

**1.3** An example

Just to get a feel for where we're going, let's take a look at a small SNOBOL4 program. It produces a sorted list of the words in a string, along with a count of how many times each word appears. Don't be concerned if you don't understand the program; I just want to give you a taste of the language:

In [None]:
letters = LCASE + "-’"
subject = "I know what I like, and I like what I know."
subject = subject.lower()
subject in (
    POS(0) + λ("tally = {}")
  + ARBNO(
      BREAK(LCASE)
    + SPAN(letters) % "word"
    + λ("if word not in tally: tally[word] = 0")
    + λ("tally[word] += 1")
    )
  + ARBNO(NOTANY(letters))
  + RPOS(0)
)
print("Word counts:")
pprint(sorted(tally.items(), key=(lambda item: (-item[1], item[0]))))

Notice some of the things that seem to occur so effortlessly here. A word is defined to be any combination of lower case letters, hyphen, and apostrophe. The subject string is converted to lower case before matching. A table of word counts uses the words themselves as subscripts.

**6.1** Introduction

Pattern matching examines a subject string for some combination of characters, called a pattern. The matching process may be very simple, or extremely complex.

* The subject contains several color names. The pattern is the string "BLUE". Does the
subject string contain the word "BLUE"?
* The subject contains a nucleic acid (DNA) sequence. The pattern searches for a subsequence that is replicated in two other places in the string.
* The subject contains a paragraph of English text. The pattern describes the spacing rules
to be applied after punctuation. Does the subject string conform to the punctuation rules?
* The subject string represents the current board position in a game of Tick-Tack-Toe. The
pattern examines this string and determines the next move.
* The subject contains a program statement from a prototype computer language. The pattern contains the grammar of that language. Is the statement properly formed according
to the grammar?

Most programming languages provide rudimentary facilities to examine a string for a specific
character sequence. SNOBOL4 patterns are far more powerful, because they can specify complex
(and convoluted) interrelationships. The colors of a painting, the words of a sentence, the notes of a musical score have limited significance in isolation. It is their relationship with one another which provides meaning to the whole. Likewise, SNOBOL4 patterns can specify context; they may be qualified by what precedes or follows them, or by their position in the subject.

**6.1.1** Knowns and unknowns

Patterns are composed of known and unknown components.
Knowns are specific character strings, such as the string "BLUE" in the first example above.
We are looking for a yes/no answer to the question: 'Does this known item appear in the subject
string?'

Unknowns specify the kind of subject characters we are looking for; the specific characters are
not identifiable in advance. We might want to match only characters from a restricted alphabet,
or any substring of a certain length, or some arbitrary number of repetitions of a string. If the pattern matches, we can then capture the particular subject substring matched.

**6.2** Specifying pattern matching

A pattern match requires a subject string and a pattern. If SUBJECT is the subject string, and PATTERN is the pattern, it looks like this:

* SUBJECT in PATTERN

The pattern match succeeds if the pattern is found in the subject string; otherwise it fails. This expression returns True or False depending on success or failure.

**6.3** Subject string

The subject string may be a literal string, a variable, or an expression. If it is not a string, its string equivalent will be produced before pattern matching begins. For example, if the subject is the integer 48, integer to string conversion produces the character string "48".


**6.4** Pattern subsequents and alternates

Arithmetic expressions are composed of elements and simpler subexpressions. Similarly, patterns are composed of simpler subpatterns which are joined together as subsequents and alternates. The binary subsequent operator is the plus (+). If P1 and P2 are two subpatterns, the expression P1 + P2 is also a pattern. The subject must contain whatever P1 matches, immediately followed by whatever P2 matches. P2 is the subsequent of P1.

The binary alternation operator is the vertical bar (|). The pattern P1 | P2 matches whatever P1 matches, or whatever P2 matches. SNOBOL4 tries the various alternatives from left to right.

Normally, concatenation is performed before alternation, so the pattern P1 | P2 + P3 matches P1 alone, or P2 followed by P3. Parentheses can be used to alter the grouping of subpatterns. For example: (P1 | P2) + P3 matches P1 or P2, followed by P3.

When a pattern successfully matches a portion of the subject, the matching subject characters are bound to it. The next pattern in the statement must match beginning with the very next subject character. If a subsequent fails to match, SNOBOL4 backtracks, unbinding patterns until another alternative can be tried. A pattern match fails when SNOBOL4 cannot find an alternative that matches.
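
The binding and backtracking just described can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `lit`, `seq`, `alt`, and `search` are hypothetical and are not part of SNOBOL4python, and the real matcher is certainly more elaborate. Each toy pattern is a generator that yields every cursor position at which it could finish, so a later failure simply asks the earlier pattern for its next candidate.

```python
# Plain-Python sketch of backtracking; none of these names come from SNOBOL4python.
# A toy "pattern" is a function (subject, cursor) -> iterator of end cursors.

def lit(text):
    """Known string: succeeds once if the subject continues with `text`."""
    def match(subject, cursor):
        if subject.startswith(text, cursor):
            yield cursor + len(text)
    return match

def seq(p1, p2):
    """Subsequent: p2 must match immediately after whatever p1 matched."""
    def match(subject, cursor):
        for mid in p1(subject, cursor):   # each way p1 can match...
            yield from p2(subject, mid)   # ...is a starting point for p2
    return match

def alt(p1, p2):
    """Alternation: try p1 first, then p2, left to right."""
    def match(subject, cursor):
        yield from p1(subject, cursor)
        yield from p2(subject, cursor)
    return match

def search(pattern, subject):
    """Unanchored match: try each starting cursor; return (start, end) or None."""
    for start in range(len(subject) + 1):
        for end in pattern(subject, start):
            return (start, end)
    return None

# ('A' | 'AB') + 'C': on 'ABC' the first alternative binds 'A', the
# subsequent 'C' fails against 'B', and the matcher backs up to try 'AB'.
pat = seq(alt(lit('A'), lit('AB')), lit('C'))
print(search(pat, 'ABC'))   # (0, 3)
print(search(pat, 'AB'))    # None
```

Generators make the "unbinding" implicit: abandoning one candidate and pulling the next from the iterator is the backtrack.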

The null string, ε(), may appear in a pattern. It always matches, but does not bind any subject characters. We can think of it as matching the invisible space between two subject characters. One possible use is as the last of a series of alternatives. For example, the pattern

In [None]:
ROOT = "MATCH"
print("MATCHES" in POS(0) + σ(ROOT) + (σ('S') | σ('ES') | ε()) + RPOS(0))

matches the pattern in ROOT, with an optional suffix of 'S' or 'ES'. If ROOT matches, but is not followed by 'S' or 'ES', the null string matches and successfully completes the clause. Its presence gives the pattern match a successful escape.

The conditional functions of the previous chapter may appear in patterns. If they fail when evaluated, the current alternative fails. If they succeed, they match the null string, and so do not consume any subject characters. They behave like a gate, allowing the match to proceed beyond them only if they are true. This pattern will match 'FOX' if N is 1, or 'WOLF' if N is 2:

In [None]:
for N in range(1,3):
    for subject in ['FOX', 'WOLF']:
        matches = subject in (
            Λ(lambda: N == 1) + σ('FOX')
          | Λ(lambda: N == 2) + σ('WOLF')
        )
        pprint([N, subject, matches])

Parentheses may be used to factor a pattern. The strings 'COMPATIBLE', 'COMPREHENSIBLE', and 'COMPRESSIBLE' are matched by the pattern:

In [None]:
for subject in ['COMPATIBLE', 'COMPREHENSIBLE', 'COMPRESSIBLE']:
    print(subject in
        POS(0)
      + σ('COMP')
      + (σ('AT') | σ('RE') + (σ('HEN') | σ('S')) + σ('S'))
      + σ('IBLE')
      + RPOS(0)
    )

**6.5** Simple pattern matches

Here are examples of pattern matches using a string literal or variable for the subject. The patterns consist entirely of known elements.

In [None]:
print('BLUEBIRD' in σ('BIRD')) # first statement
print('BLUEBIRD' in σ('bird')) # second statement
B = 'THE BLUEBIRD' # third statement
print(B in σ('FISH')) # fourth statement
print(B in (σ('FISH') | σ('BIRD'))) # fifth statement
print(B in (σ('GOLD') | σ('BLUE')) + (σ('FISH') | σ('BIRD'))) # last statement

Regarding the previous code snippet, the first statement shows that the matching substring ('BIRD') need not begin at the start of the subject string. This is called unanchored matching. The second statement fails because string matching is case sensitive. The third statement creates a variable to be used as the subject. The fourth statement fails because 'FISH' does not occur in B. The fifth statement employs an alternate: we are matching for 'FISH' or 'BIRD'.

The last statement uses subsequents and alternates. We are looking for a substring in B that contains 'GOLD' or 'BLUE', followed by 'FISH' or 'BIRD'. It will match 'GOLDFISH', 'GOLDBIRD', 'BLUEFISH' or 'BLUEBIRD'. If the parentheses were omitted, concatenation of 'BLUE' and 'FISH' would be performed before alternation, and the pattern would match 'GOLD', 'BLUEFISH', or 'BIRD'.
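
Because the patterns in this notebook are ordinary Python expressions, the grouping described above follows Python's own operator precedence, in which + binds more tightly than |. The sketch below makes the grouping visible with a small stand-in class; `Show` is a hypothetical helper written for this illustration and has nothing to do with the library's pattern objects.

```python
class Show:
    """Records how Python groups + and | (illustrative helper, not a library class)."""
    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Show(f"({self.text} + {other.text})")

    def __or__(self, other):
        return Show(f"({self.text} | {other.text})")

GOLD, BLUE, FISH, BIRD = (Show(name) for name in ('GOLD', 'BLUE', 'FISH', 'BIRD'))

with_parens = (GOLD | BLUE) + (FISH | BIRD)
without_parens = GOLD | BLUE + FISH | BIRD

print(with_parens.text)     # ((GOLD | BLUE) + (FISH | BIRD))
print(without_parens.text)  # ((GOLD | (BLUE + FISH)) | BIRD)
```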

**6.6** The PATTERN data type

If we execute the 1st statement in the cell below, the variable COLOR contains the string 'BLUE', and can appear in the pattern portion of the 2nd statement.

Even though it is used as a pattern, COLOR has the string data type. However, complicated patterns may be stored in a variable just like a string or numeric value. The 3rd statement will create a structure describing the pattern, and store it in the variable COLOR. COLOR now has the PATTERN data type.

In [None]:
COLOR = 'BLUE' # 1st statement
B in σ(COLOR) # 2nd statement
COLOR = σ('GOLD') | σ('BLUE') # 3rd statement

The preceding example from section 6.5 can now be written as:

In [None]:
CRITTER = σ('FISH') | σ('BIRD')
BOTH = COLOR + CRITTER
B in BOTH

**6.7** Capturing match results

If the pattern match

* B in BOTH

succeeds, we may want to know which of the many pattern alternatives were used in the match. The binary operator conditional assignment assigns the matching subject substring to a variable. The operator is called conditional, because assignment occurs only if the entire pattern match is successful. Its graphic symbol is a percent (%). It assigns the matching substring on its left to the variable on its right. Note that the direction of assignment is just the opposite of the statement assignment operator (=). Continuing with the previous example, we'll redefine COLOR and CRITTER to use conditional assignment:

In [None]:
COLOR = (σ('GOLD') | σ('BLUE')) % "SHADE"
CRITTER = (σ('FISH') | σ('BIRD')) % "ANIMAL"
BOTH = COLOR + CRITTER
SHADE = ANIMAL = None
pprint([B in BOTH, SHADE, ANIMAL])

The substrings that match the subpatterns COLOR and CRITTER are assigned to SHADE and ANIMAL respectively.

Conditional assignment may appear at any level of pattern nesting, and may include other conditional assignments within its embrace. The pattern

* ((σ('B') | σ('F') | σ('N')) % "FIRST" + σ('EA') + (σ('R') | σ('T')) % "LAST") % "WORD"

matches 'BEAR', 'FEAR', 'NEAR', 'BEAT', 'FEAT', or 'NEAT', assigning the first letter matched to FIRST, the last letter to LAST, and the entire result to WORD.
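
For readers who know regular expressions, nested named groups give a rough analogue of nested conditional assignment. The sketch below uses Python's standard `re` module rather than SNOBOL4python, so it illustrates the idea without showing how the library does it: the group names stand in for the variables FIRST, LAST, and WORD, and the captures are available only once the whole match has succeeded.

```python
import re

# Rough regex analogue of the nested conditional assignments above.
word_re = re.compile(r'(?P<WORD>(?P<FIRST>[BFN])EA(?P<LAST>[RT]))')

m = word_re.search('A NEAT TRICK')
captures = m.groupdict() if m else {}
print(captures.get('FIRST'), captures.get('LAST'), captures.get('WORD'))  # N T NEAT
```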

The variable OUTPUT may be used as the target of conditional assignment. Try:

In [None]:
matches = 'B2' in (σ('A') | σ('B')) % "OUTPUT" + (σ(1) | σ(2) | σ(3)) % "OUTPUT"
pprint([matches])

**6.8** Unknowns

All of the previous examples used patterns created from literal strings. We may also want to
specify the qualities of a match component, rather than its specific characters. Using unknowns
greatly increases the power of pattern matching. There are two types, primitive patterns and
pattern functions.

**6.8.1** Primitive patterns

There are seven primitive patterns built into the SNOBOL4 system. The two used most frequently will be discussed here. Section 9.4, Other primitive patterns, introduces the remaining five.

**6.8.1.1** **REM** Match remainder of subject

REM is short for the remainder pattern. It will match zero or more characters at the end of the subject string. Try the following:

In [None]:
'THE WINTER WINDS' in σ('WIN') + REM() % "OUTPUT"

The subpattern 'WIN' matched its first occurrence in the subject, at the beginning of the
word 'WINTER'. REM matched from there to the end of the subject string - the characters
'TER WINDS' - and assigned them to the variable OUTPUT. If we change the pattern slightly,
to:

In [None]:
'THE WINTER WINDS' in σ('WINDS') + REM() % "tx" + λ("print('<' + tx + '>')")

then 'WINDS' matches at the end of the subject string, leaving a null remainder for REM.
REM matches this null string, assigns it to tx, and an empty pair of angle brackets, <>, is displayed.

The pattern components to the left of REM must successfully match some portion of the
subject string. REM begins where they left off, matching all subject characters through the
end of string. There are no restrictions on the particular characters matched.


**6.8.1.2** **ARB** Match arbitrary characters

ARB matches an arbitrary number of characters from the subject string. It matches the
shortest possible substring, including the null string. The pattern components on either
side of ARB determine what is matched. Try the statements

In [None]:
print('MOUNTAIN' in σ('O') + ARB() % "OUTPUT" + σ('A')) # first statement
print('MOUNTAIN' in σ('O') + ARB() % "OUTPUT" + σ('U')) # second statement

In the first statement, the ARB pattern is constrained on either side by the known patterns
'O' and 'A'. ARB expands to match the subject characters between, 'UNT'. In the second
statement, there is nothing between 'O' and 'U', so ARB matches the null string. ARB
behaves like a spring, expanding as needed to fill the gap defined by neighboring patterns.
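
A non-greedy `.*?` in Python's standard `re` module behaves in a broadly similar way, and may be a helpful comparison. This is an analogy only, not the library's mechanism, and `.` skips newlines unless `re.DOTALL` is given.

```python
import re

# ".*?" tries the shortest match first and grows only as far as the
# neighbouring literals require, much like ARB between two knowns.
between_o_a = re.search(r'O(.*?)A', 'MOUNTAIN').group(1)
between_o_u = re.search(r'O(.*?)U', 'MOUNTAIN').group(1)

print(repr(between_o_a))  # 'UNT'
print(repr(between_o_u))  # ''
```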

**6.8.2** Cursor position

During a pattern match, the cursor is SNOBOL4's pointer into the subject string. It is integer valued, and points between two subject characters. The cursor is set to zero when a pattern match begins, corresponding to a position immediately to the left of the first subject character.

As the pattern match proceeds, the cursor moves right and left across the subject to indicate where SNOBOL4 is attempting a match. The value of the cursor will be used by some of the pattern functions that follow.

The cursor position function assigns the current cursor value to a variable. In the examples below it is written with the capital 'theta' sign (Θ). Its argument is the name of a variable. By using OUTPUT as the variable, we can display the cursor position on the screen. For instance:

In [None]:
print('VALLEY' in σ('A') + Θ("OUTPUT") + ARB() + σ('E') + Θ("OUTPUT"))
print('DOUBT' in Θ("OUTPUT") + σ('B'))
print('FIX' in Θ("OUTPUT") + σ('B'))

Cursor assignment is performed whenever the pattern match encounters the operation, including retries. It occurs even if the pattern ultimately fails. The element Θ("OUTPUT") behaves like the null string: it doesn't consume subject characters or interfere with the match in any way.

**6.8.3** Integer pattern functions

These functions return a pattern based on their integer argument. The pattern produced can be used directly in a pattern match statement, or stored in a variable for later retrieval.

**6.8.3.1** **LEN(integer)** Match fixed-length string

LEN(I) produces a pattern which matches a string exactly I characters long. I must be an integer greater than or equal to zero. Any characters may appear in the matched string. For example, LEN(5) matches any 5-character string, and LEN(0) matches the null string. LEN may be constrained to certain portions of the subject by other adjacent patterns:

In [None]:
S = 'ABCDA'
print(S in LEN(3) % "OUTPUT")
print(S in LEN(2) % "OUTPUT" + σ('A'))

The first pattern match had only one constraint - the subject had to be at least three characters long - so LEN(3) matched its first three characters. The second case imposes the additional restriction that LEN(2)'s match be followed immediately by the letter 'A'. This disqualifies the intermediate match attempts 'AB' and 'BC'.

Using ALPHABET as the subject provides a simple way to convert a decimal character code between 0 and 255 to its one character equivalent. For example, by consulting an ASCII character code chart we find that the BEL character is decimal 7. We can load that character into variable BEEP with one statement:

In [None]:
ALPHABET in LEN(7) + LEN(1) % "BEEP"
pprint(BEEP)

ALPHABET contains all 256 members of the extended ASCII character set (codes 0-255), in ascending order. LEN(7) matches the first seven characters (codes 0-6), leaving BEL as the next match position for LEN(1). This operation is analogous to the CHR$ function in BASIC.

The inverse operation, obtaining the numerical value of a character code, is also possible. If variable CHAR contains a one character string, variable N will be set to its decimal equivalent with the second statement below:

In [None]:
CHAR = 'A'
ALPHABET in ARB() + Θ("N") + σ(CHAR)
print(N)
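
For comparison, the same two conversions in plain Python are `chr` and `ord`. The sketch below builds its own 256-character string, named `alphabet` here as a local stand-in on the assumption that it mirrors the library's ALPHABET, and checks that indexing and searching it agree with the built-ins.

```python
# Plain-Python equivalents; `alphabet` is a local stand-in, not the library's ALPHABET.
alphabet = ''.join(chr(code) for code in range(256))

beep = alphabet[7]        # the character that LEN(7) + LEN(1) picks out
n = alphabet.index('A')   # the cursor position immediately in front of 'A'

print(repr(beep), beep == chr(7))  # '\x07' True
print(n, n == ord('A'))            # 65 True
```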

**6.8.3.2** **POS(integer)**, **RPOS(integer)** Verify cursor position

The POS(I) and RPOS(I) patterns do not match subject characters. Instead, they succeed only if the current cursor position is a specified value. They often are used to tie points of the pattern to specific character positions in the subject.

POS(I) counts from the left end of the subject string, succeeding if the current cursor position is equal to I. RPOS(I) is similar, but counts from the right end of the subject. If the subject length is N characters, RPOS(I) requires the cursor to be N - I. If the cursor is not the correct value, these functions fail, and SNOBOL4 tries other pattern alternatives, perhaps extending a previous substring matched by ARB, or beginning the match further along in the subject.

In [None]:
S = 'ABCDA'
print(S in POS(0) + σ('B')) # first example
print(S in LEN(3) % "OUTPUT" + RPOS(0)) # similarly
print(S in POS(3) + LEN(1) % "OUTPUT") # next example
print(S in POS(0) + σ('ABCD') + RPOS(0)) # finally

The first example requires a 'B' at cursor position 0, and fails for this subject. POS(0)
anchors the match, forcing it to begin with the first subject character. Similarly, RPOS(0)
anchors the end of the pattern to the tail of the subject. The next example matches
at a specific mid-string character position, POS(3). Finally, enclosing a pattern between
POS(0) and RPOS(0) forces the match to use the entire subject string.

At first glance these functions appear to be setting the cursor to a specified value. Actually,
they never alter the cursor, but instead wait for the cursor to come to them as various
match alternatives are attempted. This, in turn, allows other patterns in the statement
to be processed in an orderly fashion. You can demonstrate this waiting for the cursor
behavior like this:

In [None]:
S in ARB() + Θ("OUTPUT") + POS(3)

**6.8.3.3** **TAB(integer)**, **RTAB(integer)** Match to fixed position

These patterns are hybrids of ARB, POS, and RPOS. They use specific cursor positions, like POS and RPOS, but bind (match) subject characters, like ARB. TAB(I) matches any characters from the current cursor position up to the specified position I. RTAB(I) does the same, except, as in RPOS, the target position is measured from the end of the subject.

TAB and RTAB will match the null string, but will fail if the current cursor is to the right of the target. They also fail if the target position is past the end of the subject string. These patterns are useful when working with tabular data. For example, if a data file contains name, street address, city and state in columns 1, 30, 60, and 75, this pattern will break out those elements from a line:

In [None]:
P = TAB(29) % "NAME" + TAB(59) % "STREET" + TAB(74) % "CITY" + REM() % "STATE"

The pattern RTAB(0) is equivalent to primitive pattern REM. One potential source of confusion is just what it is that RTAB matches. It counts from the right end of the subject,
but matches to the left of its target cursor. Try:

In [None]:
'ABCDE' in TAB(2) % "OUTPUT" + RTAB(1) % "OUTPUT"

TAB(2) matches 'AB', leaving the cursor at 2, between 'B' and 'C'. The subject is 5
characters long, so RTAB(1) specifies a target cursor of 5 - 1, or 4, which is between the
'D' and 'E'. RTAB matches everything from the current cursor, 2, to the target, 4.
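
The arithmetic above can be checked with a small plain-Python sketch. The functions `tab` and `rtab` below are hypothetical helpers written for this illustration (lower-case on purpose, to avoid confusion with the library's TAB and RTAB); they model only the target calculation and the two failure rules stated earlier.

```python
def tab(subject, cursor, target):
    """Sketch of TAB: match from `cursor` up to absolute position `target`.

    Returns (matched_text, new_cursor), or None on failure.
    """
    if target < cursor or target > len(subject):
        return None
    return subject[cursor:target], target

def rtab(subject, cursor, offset):
    """Sketch of RTAB: the same, with the target counted from the right end."""
    return tab(subject, cursor, len(subject) - offset)

first, cursor = tab('ABCDE', 0, 2)
second, cursor = rtab('ABCDE', cursor, 1)
print(first, second, cursor)   # AB CD 4
print(tab('ABCDE', 3, 2))      # None: cursor is already to the right of the target
print(tab('ABCDE', 0, 9))      # None: target is past the end of the subject
```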

**6.8.4** Character pattern functions

These functions produce a pattern based on a string-valued argument. Once again, the pattern may be used directly or stored in a variable.

**6.8.4.1** **ANY(string)**, **NOTANY(string)** Match one character

Each function produces a pattern which matches one subject character, chosen according to the argument string. ANY(S) matches the next subject character if it appears in the string S, and fails otherwise. NOTANY(S) matches a subject character only if it does not appear in S. Here are some sample uses of each:

In [None]:
VOWEL = ANY('AEIOU')
DVOWEL = VOWEL + VOWEL
NOTVOWEL = NOTANY('AEIOU')
print('VACUUM' in VOWEL % "OUTPUT")
print('VACUUM' in DVOWEL % "OUTPUT")
print('VACUUM' in (VOWEL + NOTVOWEL) % "OUTPUT")

The argument string specifies a set of characters to be used in creating the ANY or NOTANY pattern. It may contain duplicate characters, and the order of characters in S is immaterial.

**6.8.4.2** **SPAN(string)**, **BREAK(string)** Match a run of characters

These are multicharacter versions of ANY and NOTANY. Each requires a nonnull argument to specify a set of characters.

SPAN(S) matches one or more subject characters from the set in S. SPAN must match at least one subject character, and will match the longest subject string possible.

BREAK(S) matches up to but not including any character in S. The string matched must always be followed in the subject by a character in S. Unlike SPAN and NOTANY, BREAK will match the null string.

These two functions are called stream functions because each streams by a series of subject characters. SPAN is most useful for matching a group of characters with a common trait.
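
The rules for the two stream functions can be restated as a plain-Python sketch. The helpers `span` and `brk` are hypothetical names for this illustration, not library calls; each takes a cursor and returns the new cursor, or None on failure, and neither models backtracking.

```python
def span(subject, cursor, chars):
    """Sketch of SPAN: longest run of characters from `chars`; needs at least one."""
    end = cursor
    while end < len(subject) and subject[end] in chars:
        end += 1
    return end if end > cursor else None

def brk(subject, cursor, chars):
    """Sketch of BREAK: run up to, but not including, the next character in `chars`.

    May match the null string, but fails if no such character follows.
    """
    end = cursor
    while end < len(subject) and subject[end] not in chars:
        end += 1
    return end if end < len(subject) else None

text = ': ONE, TWO'
letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
gap_end = brk(text, 0, letters)           # streams past ': '
word_end = span(text, gap_end, letters)   # streams past 'ONE'
print(gap_end, word_end, text[gap_end:word_end])  # 2 5 ONE
print(span(text, 0, letters))             # None: ':' is not a letter
print(brk('NO COMMA', 0, ','))            # None: no break character follows
```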

For example, we can say an English word is composed of one or more alphabetic characters, apostrophe, and hyphen. The statements

In [None]:
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ’-"
WORD = SPAN(LETTERS)

produce a suitable pattern in WORD. To match the material between words (white space, punctuation, etc.), use the pattern:

In [None]:
GAP = BREAK(LETTERS)

SPAN and BREAK are two of the most useful SNOBOL4 functions. Try some examples.

In [None]:
print('SAMPLE LINE' in WORD % "OUTPUT")
print('PLUS TEN DEGREES' in σ(' ') + WORD % "OUTPUT")
GAPO = GAP % "OUTPUT"
WORDO = WORD % "OUTPUT"
print(': ONE, TWO, THREE' in GAPO + WORDO + GAPO + WORDO)
DIGITS = '0123456789'
INTEGER = (ANY('+-') | ε()) + SPAN(DIGITS)
print('SET -43 VOLTS' in INTEGER % "OUTPUT")
REAL = INTEGER + σ('.') + (SPAN(DIGITS) | ε())
print('SET -43.625 VOLTS' in REAL % "OUTPUT")
S = '0ZERO,1ONE,2TWO,3THREE,4FOUR,5FIVE,'
print(S in σ('4') + BREAK(',') % "OUTPUT")

If you require a version of SPAN which will match the null string, or a BREAK which will not match the null string, you can use the following constructions:

In [None]:
print(SPAN(S) | ε())
print(NOTANY(S) + BREAK(S))

I've introduced a lot of concepts in this chapter; it's time to see how they fit together into programs.

**6.10.1** Word Counting

The first program counts the number of words in the input string.

In [None]:
N = 0
WORD = "'-" + '0123456789' + UCASE + LCASE
WPAT = BREAK(WORD) + SPAN(WORD)
"Now is the time for all good men to come to the aid of their country." in (
    POS(0) + λ("N = 0")
  + ARBNO(WPAT + λ("N += 1"))
  + ARBNO(NOTANY(WORD))
  + RPOS(0)
)
print(N, 'words')

**7.2** Unevaluated expressions

An unevaluated expression, written here as a Python lambda, may be used as the argument of the pattern functions ANY, BREAK, LEN, NOTANY, POS, RPOS, RTAB, SPAN, or TAB. Here are two examples:

* LEN(lambda: POSITION)
* SPAN(lambda: CHARACTERS)

The current values of POSITION and CHARACTERS are ignored when the pattern is built. Later, when the pattern is used in a match, the unevaluated expression operator tells SNOBOL4 to fetch the values current at that moment.
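
The difference between an argument evaluated when the pattern is built and one evaluated when it is used can be seen in plain Python. In this sketch `make_len` is a hypothetical toy, not the library's LEN; it accepts either an integer or a zero-argument callable and resolves the callable only when asked to match.

```python
def make_len(n):
    """Toy LEN: `n` is an int, or a zero-argument callable read at match time."""
    def match(subject, cursor):
        length = n() if callable(n) else n
        end = cursor + int(length)
        return subject[cursor:end] if end <= len(subject) else None
    return match

POSITION = 2
fixed = make_len(POSITION)             # the value 2 is captured now
deferred = make_len(lambda: POSITION)  # the name is looked up on every match

POSITION = 4
print(fixed('ABCDEF', 0))     # AB
print(deferred('ABCDEF', 0))  # ABCD
```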

**7.3** Immediate assignment

Our examples have made extensive use of the conditional assignment operator to capture matched substrings after a successful pattern match. The immediate assignment operator allows us to capture intermediate results during the pattern match.

Immediate assignment occurs whenever a subpattern matches, even if the entire pattern match ultimately fails. Immediate assignment is a binary operator whose graphic symbol is the at sign (@). Like conditional assignment, the matching substring on its left is assigned to the variable on its right. Here are examples where we use variable OUTPUT to reveal the work of the pattern matcher:

In [None]:
S = 'ABCDEFG'
print(S in σ('A') + ARB() @ "OUTPUT" + σ('E'))
print(S in (σ('B') + LEN(2) | σ('C') + LEN(3)) @ "OUTPUT" + σ('G'))

**7.3.1** Immediate assignment and unevaluated expressions

As useful as immediate assignment is for revealing the inner workings of a pattern match, a more powerful use is possible. It can be used with the unevaluated expression operator to develop a new class of patterns. An interesting substring at the beginning of the subject is immediately assigned to a variable, and the variable is then subsequently used in the very same pattern.

Suppose a number at the beginning of the subject specifies the length of a variable width field that follows. We would like to capture the number into variable N, then use it with the LEN function to transfer the data into variable FIELD. When used with LEN, N must be preceded by the unevaluated expression operator, so that its new value is retrieved. For instance:

In [None]:
FPAT = SPAN('0123456789') @ "N" + LEN(lambda: N) % "FIELD"
N = FIELD = None
pprint(['12ABCDEFGHIJKLMNOPQ' in FPAT, N, FIELD])

SPAN matched the field length, 12, and immediately assigned it to N. LEN(lambda: N) then matched the next 12 characters. Another subject, with a different field length, would update N appropriately.

Type conversion was working quietly behind the scenes here: N was assigned the string '12', yet it appeared as integer 12 to the LEN function.
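
Written out by hand in plain Python, the same length-prefixed field needs the conversion spelled out. The function `read_field` below is a hypothetical helper for this comparison; it is anchored at the start of the subject and does no backtracking.

```python
def read_field(subject):
    """Plain-Python sketch: leading digits give the width of the field that follows."""
    i = 0
    while i < len(subject) and subject[i] in '0123456789':
        i += 1
    if i == 0:
        return None                  # no length prefix
    n = subject[:i]                  # captured as a string, like N above
    width = int(n)                   # explicit string-to-integer conversion
    if i + width > len(subject):
        return None                  # not enough characters for the field
    return n, subject[i:i + width]

print(read_field('12ABCDEFGHIJKLMNOPQ'))  # ('12', 'ABCDEFGHIJKL')
print(read_field('3AB'))                  # None
```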

Now here is an example which provides a glimpse of just how powerful SNOBOL4's pattern matching can be. Problem: Examine a subject for an arbitrary three-character substring which appears twice in a row, or bracketed in parentheses. Solution:

In [None]:
TWOPAT = (
      LEN(3) @ "X"
    + ( σ(lambda: X)
      | σ("(") + σ(lambda: X) + σ(")")
      )
)
X = None
pprint(['ABCDECDEFGH' in TWOPAT, X])
pprint(['ABCDE(CDE)BA' in TWOPAT, X])
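
Readers familiar with regular expressions may recognise this as a backreference. The sketch below uses Python's standard `re` module as a rough analogue, with `\1` playing the role of the deferred reference to X. It is not how the library works internally, and `.` will not match a newline unless `re.DOTALL` is set.

```python
import re

# (.{3}) captures three characters; \1 must then repeat them, bare or in parentheses.
two_re = re.compile(r'(.{3})(?:\1|\(\1\))')

found = {}
for subject in ('ABCDECDEFGH', 'ABCDE(CDE)BA', 'ABCDEFGH'):
    m = two_re.search(subject)
    found[subject] = m.group(1) if m else None
    print(subject, found[subject])
```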

**9.1** The ARBNO function

This function produces a pattern which will match zero or more consecutive occurrences of the pattern specified by its argument. As its name implies, ARBNO is useful when an arbitrary number of instances of a pattern may occur. For example, ARBNO(LEN(3)) matches strings of length 0, 3, 6, 9, ... There is no restriction on the complexity of the pattern argument.

Like the ARB pattern, ARBNO is shy, and tries to match the shortest possible string. Initially, it simply matches the null string. If a subsequent pattern component fails to match, SNOBOL4 backs up, and asks ARBNO to try again. Each time ARBNO is retried, it supplies another instance of its argument pattern. In other words, ARBNO(PAT) behaves like

* (ε() | PAT | PAT + PAT | PAT + PAT + PAT | ...)
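
The expansion above can be sketched as a Python generator that offers the shortest candidate first and one more repetition on each retry. The names `arbno` and `len3` are hypothetical helpers for this illustration. As a simplifying assumption the repeated pattern is a function returning a single new cursor or None, so it has no alternatives of its own.

```python
def arbno(step):
    """Sketch of ARBNO: yield the cursor after 0, 1, 2, ... repetitions of `step`."""
    def match(subject, cursor):
        while True:
            yield cursor                      # shortest candidates come first
            nxt = step(subject, cursor)
            if nxt is None or nxt == cursor:  # stop on failure or lack of progress
                return
            cursor = nxt
    return match

def len3(subject, cursor):
    """Toy LEN(3): advance three characters if that many remain."""
    return cursor + 3 if cursor + 3 <= len(subject) else None

candidates = list(arbno(len3)('ABCDEFGH', 0))
print(candidates)  # [0, 3, 6]
```

A surrounding matcher would take these candidates one at a time, moving to the next only when whatever follows ARBNO fails to match.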

Also like ARB, ARBNO is usually used with adjacent patterns to draw it out. Let's consider a simple example. We want to write a pattern to test for a list. We'll define a list as being one or more numbers separated by commas, and enclosed by parentheses. Try this definition:

In [None]:
ITEM = SPAN('0123456789')
LIST = POS(0) + σ('(') + ITEM + ARBNO(σ(',') + ITEM) + σ(')') + RPOS(0)
print('(12,345,6)' in LIST)
print('(12,,34)' in LIST)

ARBNO is retried and extended until its subsequent, ')', finally matches. POS(0) and RPOS(0) force the pattern to be applied to the entire subject string.

Alternation may be used within ARBNO's argument. This pattern matches any number of pairs of certain letters:

In [None]:
PAIRS = POS(0) + ARBNO(σ('AA') | σ('BB') | σ('CC')) + RPOS(0)
print('CCBBAAAACC' in PAIRS)
print('AABBB' in PAIRS)

**9.2** Recursive patterns

This is the pattern analogue of a recursive function - a pattern is defined in terms of itself. The unevaluated expression operator makes the definition possible.

Suppose we wanted to expand the previous definition of a list to say that a list item may be a span of digits, or another list. The definition proceeds as before, except that the unevaluated expression operator is used in the first statement; the concept of a list has not yet been defined:

In [None]:
ITEM = SPAN('0123456789') | ζ('LIST')
LIST = σ('(') + ITEM + ARBNO(σ(',') + ITEM) + σ(')')
TEST = POS(0) + LIST + RPOS(0)
print('(12,(3,45,(6)),78)' in TEST)
print('(12,(34)' in TEST)

Recursion occurs because LIST is defined in terms of ITEM, which is defined in terms of LIST, and so on. Note that functions POS(0) and RPOS(0) were moved out one level, to TEST, because LIST must now match substrings within the subject.

In our previous discussion of recursive functions, we said they work because successive calls present the function with progressively simpler problems, until the problem can be solved without further recursion. Similarly, patterns ITEM and LIST are applied to successively smaller substrings, until ITEM can use its SPAN alternative instead of invoking LIST again.

In general, you will need an alternative somewhere in the recursive loop to allow the pattern matcher a way out. Also, you should place recursive objects last in a series of alternatives, so that the simpler, nonrecursive patterns are attempted first and recursive plunges can be avoided.

SNOBOL4 saves information on a pattern stack during the pattern match process. Heavily recursive patterns and long subject strings can sometimes result in stack overflow. If this occurs, you should break the problem apart into several smaller pattern matches.
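
The mutual recursion between ITEM and LIST has a direct counterpart in a hand-written recursive-descent checker. The plain-Python sketch below uses hypothetical helper names and does not use the library. It tries the non-recursive digit alternative first, as recommended above, and because this grammar never needs a second attempt it does no backtracking. Deeply nested input would eventually hit Python's own recursion limit, which is the counterpart of the stack overflow just mentioned.

```python
def parse_item(s, i):
    """ITEM: a run of digits (tried first) or a nested list. Returns end index or None."""
    j = i
    while j < len(s) and s[j] in '0123456789':
        j += 1
    if j > i:
        return j
    return parse_list(s, i)

def parse_list(s, i):
    """LIST: '(' ITEM, then zero or more ',' ITEM, then ')'. Returns end index or None."""
    if i >= len(s) or s[i] != '(':
        return None
    i = parse_item(s, i + 1)
    while i is not None and i < len(s) and s[i] == ',':
        i = parse_item(s, i + 1)
    if i is None or i >= len(s) or s[i] != ')':
        return None
    return i + 1

def is_list(s):
    """Anchored at both ends, like POS(0) + LIST + RPOS(0)."""
    return parse_list(s, 0) == len(s)

print(is_list('(12,(3,45,(6)),78)'))  # True
print(is_list('(12,(34)'))            # False
```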

**9.4** Other primitive patterns

We can accomplish quite a lot with just the primitive patterns ARB and REM. However, there are five additional patterns which you should be aware of:

**9.4.1** **ABORT** End pattern match

The ABORT pattern causes immediate failure of the entire pattern match, without seeking other alternatives. Usually a match succeeds when we find a subject sequence which satisfies the pattern. The ABORT pattern does the opposite: if we find a certain pattern, we will abort the match and fail immediately. For example, suppose we are looking for an 'A' or 'B', but want to fail if '1' is encountered first:

In [None]:
# print('--AB-1-' in (ANY('AB') | σ('1') + ABORT())) # ABORT Not yet implemented
# print('--1B-A-' in (ANY('AB') | σ('1') + ABORT())) # ABORT Not yet implemented

The last example may be confusing because the ANY function appears as the first alternative, fostering the illusion that it will find the 'B' in the subject before the other pattern alternative is tried. However, that is not the order of pattern matching; all pattern alternatives are tried at cursor position zero in the subject. If none succeeds, the cursor is advanced by one, and all alternatives are tried again. When the cursor is in front of subject character '1', ANY still does not match, but the second alternative now does. Once the '1' matches, ABORT is reached, causing failure.
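
Since the two ABORT examples above are commented out, the intended behaviour can at least be sketched in plain Python. The function below is a hypothetical stand-in for this one pattern and not a general implementation of ABORT. It walks the cursor from left to right and tries both alternatives at each position.

```python
def first_ab_before_1(subject):
    """Plain-Python sketch of ANY('AB') | '1' + ABORT, scanned left to right."""
    for ch in subject:      # one iteration per cursor position
        if ch in 'AB':      # first alternative: ANY('AB') matches, so succeed
            return True
        if ch == '1':       # second alternative: '1' matches, then ABORT
            return False    # fail at once; no later positions are tried
    return False            # ran off the end: ordinary failure

print(first_ab_before_1('--AB-1-'))  # True
print(first_ab_before_1('--1B-A-'))  # False
```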

**9.4.2** **BAL** Match balanced string

The BAL pattern matches the shortest non-null string in which parentheses are balanced. (A string without parentheses is also considered to be balanced.) These strings are balanced:

* (X)
* Y
* (A!(C:D))
* (AB)+(CD)
* 9395

These are not:

* )A+B
* (A*(B+)
* (X))

BAL is concerned only with left and right parentheses. The matching string does not have to be a well-formed expression in the algebraic sense; in fact, it needn't be an algebraic expression at all. Like ARB, BAL is most useful when constrained on both sides by other pattern components.
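
The definition of a balanced string can be checked with a short plain-Python function. Note the limits of this sketch: `is_balanced` is a hypothetical helper that tests a whole string against the definition, and it does not reproduce BAL's behaviour of finding the shortest balanced substring during a match.

```python
def is_balanced(s):
    """True if `s` is non-null and balanced with respect to ( and )."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:   # a ')' with no matching '(' before it
                return False
    return bool(s) and depth == 0

balanced = ['(X)', 'Y', '(A!(C:D))', '(AB)+(CD)', '9395']
unbalanced = [')A+B', '(A*(B+)', '(X))']
for s in balanced + unbalanced:
    print(s, is_balanced(s))
```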

**9.4.3** **FAIL** Seek other alternatives

The FAIL pattern signals failure of this portion of the pattern match, causing the pattern matcher to backtrack and seek other alternatives. FAIL will also suppress a successful match, which can be very useful when the match is being performed for its side effects, such as immediate assignment. For example, in unanchored mode, this statement will display the subject characters, one per line:

* SUBJECT in LEN(1) @ "OUTPUT" + FAIL()

LEN(1) matches the first subject character, and immediately assigns it to OUTPUT. FAIL tells the pattern matcher to try again, and since there are no other alternatives, the entire match is retried at the next subject character. Forced failure and retries continue until the subject is exhausted.

**9.4.4** **FENCE** Prevent match retries

Pattern FENCE matches the null string and has no effect when the pattern matcher is moving left to right in a pattern. However, if the pattern matcher is backing up to try other alternatives, and encounters FENCE, the match fails.

FENCE can be used to lock in an earlier success. Suppose we want to succeed if the first 'A' or 'B' in the subject is followed by a plus sign. In the following example, the 'A' matches, we go through the FENCE, and find that '+' does not match the next subject character, 'B'. SNOBOL4 tries to backtrack, but is stopped by the FENCE and fails:

In [None]:
print('1AB+' in ANY('AB') + FENCE() + σ('+'))

If FENCE were omitted, backtracking would match ANY to 'B', and then proceed forward again to match '+'.
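
The contrast between the fenced and unfenced versions can be written out in plain Python. Both functions below are hypothetical sketches of this one pattern, not of FENCE in general: the first commits to the first 'A' or 'B' it meets, the second is free to move on to later ones.

```python
def first_ab_then_plus(subject):
    """Sketch of ANY('AB') + FENCE + '+': only the first 'A' or 'B' gets a chance."""
    for i, ch in enumerate(subject):
        if ch in 'AB':
            return subject[i + 1:i + 2] == '+'   # no retry after this point
    return False

def any_ab_then_plus(subject):
    """The same pattern without FENCE: every 'A' or 'B' may be tried."""
    return any(ch in 'AB' and subject[i + 1:i + 2] == '+'
               for i, ch in enumerate(subject))

print(first_ab_then_plus('1AB+'))  # False
print(any_ab_then_plus('1AB+'))    # True
```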

If FENCE appears as the first component of a pattern, SNOBOL4 cannot back up through it to try another subject starting position. This results in an anchored pattern.

In [None]:
print('ABC' in FENCE() + σ('B'))

**9.4.5** **SUCCEED** Match always

This pattern (imported as SUCCESS in the first cell of this notebook) matches the null string and always succeeds. If the scanner is backtracking when it encounters SUCCEED, it reverses and starts forward again. Placing a pattern between SUCCEED and FAIL causes the pattern matcher to oscillate.