**STEP 1**

The goal of this 12-step program is to show you an enhanced library for string pattern matching, using the Python programming language.

First, let's review regular expressions, which are built into the Python programming language. The module is named "re", also referred to as RE.

Say we are using regular expressions to analyze a text that contains some unknown program, written in a mystery programming language. Say the following is a line of text in mystery code:

* DEFINE('f(x,y,z)c,i,n,s,t,v')

You are told that this line of code invokes the DEFINE function. It declares a new function based on what is written between the single quotes. In the example above, it is declaring a function named f that has three parameters named x, y, z, and six local variables named c, i, n, s, t, and v.

Let's write a Python regular expression that extracts all of these names (the name of the function, the names of the parameters and the names of the local variables). The program will store this information in three variables, and if we are successful, this will be the result:

* function = 'f',
* parameters = ['x', 'y', 'z'],
* variables = ['c', 'i', 'n', 's', 't', 'v'].

In [None]:
import re # this imports the built-in regular expression module in Python
subject = "DEFINE('f(x,y,z)c,i,n,s,t,v')" # here is the text that we are analyzing

**STEP 2**

Let's try using the *re.search* function. Two well-known regular expression anchors are:

*   the '^' (*caret*) character, which anchors the pattern at the beginning of the text, also referred to as left-position zero, and
*   the '$' (*dollar-sign*) character, which anchors the pattern at the end of the text, also referred to as right-position zero.
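
As a small sketch of what the two anchors do, the snippet below searches a made-up line (the `text` value is only an illustration, not part of the mystery program) with and without them:

```python
import re

# Hypothetical subject: DEFINE is present, but not at the very beginning.
text = "x = DEFINE('f(x)c')"

unanchored = bool(re.search(r"DEFINE", text))   # found anywhere in the text
at_start = bool(re.search(r"^DEFINE", text))    # must sit at left-position zero
at_end = bool(re.search(r"'\)$", text))         # must sit at right-position zero

print(unanchored, at_start, at_end)  # True False True
```

Without the caret, *re.search* is free to find the pattern anywhere, which is why both anchors are needed when the whole line has to be validated.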

After a successful regular expression match, *re.search* returns an *re.Match* object. To nicely display any Python object, the pretty-print function, *pprint*, can be used. To gain access to any captured groups being returned, *re.Match.groups* and *re.Match.groupdict* are used. The *groups* method returns a tuple of all captured groups in positional order, and the *groupdict* method returns a dictionary of named captured groups.\
\
This is going to be amazing! You mean all I have to do is create just one regular expression pattern to verify the validity of the text and simultaneously extract all the separate elements into structured data in *one fell swoop*?
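
For a quick feel of the two accessors, here is a minimal sketch on a much smaller pattern. The subject `"width=42"` and the group name `key` are made up for illustration:

```python
import re

# One named group (key) followed by one unnamed group.
m = re.search(r"(?P<key>[a-z]+)=(\d+)", "width=42")

positional = m.groups()   # every group, in order, as a tuple
named = m.groupdict()     # only the named groups, as a dictionary

print(positional)  # ('width', '42')
print(named)       # {'key': 'width'}
```

Note that *groups* includes named groups too, while *groupdict* leaves out anything that was not given a name.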

In [None]:
if results := re.search(
      r"^DEFINE\('"    # the caret denotes to match left-position zero
      r"([a-z])"       # (___) denotes to group and to capture the function name
                       # [a-z] denotes to match a single lower-case letter
      r"\("            # literal left-paren must be escaped by back-slash
      r"([a-z])"       # group and capture 1st parameter name
      r"(?:,([a-z]))*" # (?:___) denotes to group but not to capture
                       # (___)* denotes matching a repetition of zero or more patterns
      r"\)"
      r"([a-z])"       # group and capture 1st variable name
      r"(?:,([a-z]))*" # also capture all remaining variable names ignoring commas
      r"'\)$"          # the dollar-sign denotes to match right-position zero
    , subject):
      print(["Matched:", results.groups()])
else: print(["Unmatched!"])

['Matched:', ('f', 'x', 'z', 'c', 'v')]


**STEP 3**

What happened!? I can't do my work.\
Where is the y parameter name?\
Where are most of the variable names?\
Where are the two lists for the repetitions?\
I think I might have chosen unwisely.\
\
Can it be that the construct (?:___) specifying a non-capturing group is in some way interfering with properly capturing its nested pattern? So let's modify these to simply be capturing groups instead.

In [None]:
if results := re.search(
      r"^DEFINE\('"
      r"([a-z])"
      r"\("
      r"([a-z])"
      r"(,([a-z]))*" # (?:_(___))* becomes (_(___))*
      r"\)"
      r"([a-z])"
      r"(,([a-z]))*" # (?:_(___))* becomes (_(___))*
      r"'\)$"
    , subject):
      print(["Matched:", results.groups()])
else: print(["Unmatched!"])

['Matched:', ('f', 'x', ',z', 'z', 'c', ',v', 'v')]


**STEP 4**

What happened! I still can't do my work.\
If anything, things seem worse than before.\
I think I might have chosen unwisely.\
\
Can it be that capturing elements of a repetition just isn't possible?\
Can it be that the *re.search* function will just not return a list of elements?\
Unfortunately, the RE module is working as designed and hence this limitation will likely never be lifted. A capturing group inside a repetition only ever keeps the last element it matched; every earlier element of the repetition is overwritten.\
\
Unfortunately, the developers of the RE module might just have built this bug into the product as a feature. So, let's try one more time. Let's attempt to capture the entire text of these lists and not capture their nested elements.
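
Before that attempt, the limitation can be seen in isolation with a much smaller pattern. This is only a sketch, and the subjects `"a,b,c,d"` and `"a"` are made up for illustration:

```python
import re

pattern = r"^([a-z])(?:,([a-z]))*$"

many = re.search(pattern, "a,b,c,d")
print(many.group(0))   # a,b,c,d   (the whole list was matched)
print(many.groups())   # ('a', 'd')   (only the last repetition was kept)

single = re.search(pattern, "a")
print(single.groups()) # ('a', None)   (zero repetitions, so nothing captured)
```

The match itself covers the whole list; it is only the capture inside the repetition that keeps one element.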

In [None]:
if results := re.search(
      r"^DEFINE\('"
      r"([a-z])"
      r"\("
      r"([a-z])"
      r"((?:,[a-z])*)" # (_(___))* becomes ((?:_ ___)*)
      r"\)"
      r"([a-z])"
      r"((?:,[a-z])*)" # (_(___))* becomes ((?:_ ___)*)
      r"'\)$"
    , subject):
      print(["Matched:", results.groups()])
else: print(["Unmatched!"])

['Matched:', ('f', 'x', ',y,z', 'c', ',i,n,s,t,v')]


**STEP 5**

What happened? I'll never get my work done.\
\
So, it appears this caliber of results is the best we can accomplish using the RE module, and that our initial desire to develop one regular expression pattern to extract structured data from text will not be fulfilled. For this task, it appears more coding will be necessary. Oh but I desperately wanted to avoid procedural coding altogether. I wanted the pattern to look similar to the subject. I want the solution to resemble the problem.\
\
Given that we are limited to a single capture group returning a repetition in its entirety, and since individual elements cannot be captured, let's merge the patterns which return the first and remaining parts into one pattern, giving a proper comma-separated string for later processing.

In [None]:
if results := re.search(
      r"^DEFINE\('"
      r"([a-z])"
      r"\("
      r"((?:[a-z])(?:,[a-z])*)" # (___)((?:_ ___)*) becomes ((?:___)(?:_ ___)*)
      r"\)"
      r"((?:[a-z])(?:,[a-z])*)" # (___)((?:_ ___)*) becomes ((?:___)(?:_ ___)*)
      r"'\)$"
    , subject):
      print(["Matched:", results.groups()])
else: print(["Unmatched!"])

['Matched:', ('f', 'x,y,z', 'c,i,n,s,t,v')]


**STEP 6**

So what happens now? I've got so much work and not enough time.\
I must write more Python code! Ooh, somebody stop me.\
If the elements of a list, referred to below as items, are to be processed separately in Python code, and revalidating these items is necessary, the following code is representative of what's required.

In [None]:
if results := re.search(
      r"^DEFINE\('"
      r"(?P<func>[a-z])" # (?P<name>___) denotes to match, group and capture by name
      r"\("
      r"(?P<params>(?:[a-z])(?:,[a-z])*)" # (?P<params>___) denotes to capture group named params
      r"\)"
      r"(?P<vars>(?:[a-z])(?:,[a-z])*)" # 'func', 'params', and 'vars' are keys of groupdict()
      r"'\)$"
    , subject):
      function = results.groupdict()['func']
      parameters = []
      for item_results in re.finditer(r",?(?P<item>[a-z])", results.groupdict()['params']):
          parameters.append(item_results.groupdict()['item'])
      variables = []
      for item_results in re.finditer(r",?(?P<item>[a-z])", results.groupdict()['vars']):
          variables.append(item_results.groupdict()['item'])
      print([function, parameters, variables])
else: print(["Unmatched!"])

['f', ['x', 'y', 'z'], ['c', 'i', 'n', 's', 't', 'v']]


**STEP 7**

So what happened there? I created a beautiful mess.\
\
But it works. It validates the input text and produces three variables: one containing the function name, and two containing the lists of names for parameters and variables.\
\
Also, if revalidation is not necessary, this code can be simplified further.

In [None]:
if results := re.search(
      r"^DEFINE\('"
      r"(?P<func>[a-z])"
      r"\("
      r"(?P<params>(?:[a-z])(?:,[a-z])*)"
      r"\)"
      r"(?P<vars>(?:[a-z])(?:,[a-z])*)"
      r"'\)$"
    , subject):
      function   = results.groupdict()['func']
      parameters = results.groupdict()['params'].split(',')
      variables  = results.groupdict()['vars'].split(',')
      print(["Matched:", function, parameters, variables])
else: print(["Unmatched!"])

['Matched:', 'f', ['x', 'y', 'z'], ['c', 'i', 'n', 's', 't', 'v']]
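
If this is needed in more than one place, the same approach could be wrapped in a small function. The sketch below is one way to do that; the names `DEFINE_RE` and `parse_define` are hypothetical and chosen here only for illustration:

```python
import re

# Same pattern as above, compiled once. DEFINE_RE is a hypothetical name.
DEFINE_RE = re.compile(
    r"^DEFINE\('"
    r"(?P<func>[a-z])"
    r"\("
    r"(?P<params>[a-z](?:,[a-z])*)"
    r"\)"
    r"(?P<vars>[a-z](?:,[a-z])*)"
    r"'\)$"
)

def parse_define(text):
    """Return (function, parameters, variables), or None if text is not valid."""
    m = DEFINE_RE.search(text)
    if m is None:
        return None
    # The regex has already validated each list, so a plain split is enough.
    return m["func"], m["params"].split(","), m["vars"].split(",")

print(parse_define("DEFINE('f(x,y,z)c,i,n,s,t,v')"))
# ('f', ['x', 'y', 'z'], ['c', 'i', 'n', 's', 't', 'v'])
print(parse_define("DEFINE('f(x,,z)c')"))
# None
```

The split still lives outside the pattern, so the solution continues to need procedural code next to the regular expression.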


**STEP 8**

Can there be another way?\
Let's try using the SNOBOL4python library as an alternative.\
The following code will install and import the SNOBOL4python library.

In [None]:
!pip install SNOBOL4python==0.4.5
import sys
from pprint import pprint
## Thirty one (31) flavors of patterns to choose from ...
from SNOBOL4python import ε, σ, π, λ, Λ, ζ, θ, Θ, φ, Φ, α, ω
from SNOBOL4python import ABORT, ANY, ARB, ARBNO, BAL, BREAK, BREAKX, FAIL
from SNOBOL4python import FENCE, LEN, MARB, MARBNO, NOTANY, POS, REM, RPOS
from SNOBOL4python import RTAB, SPAN, SUCCESS, TAB
# Miscellaneous
from SNOBOL4python import GLOBALS, TRACE, PATTERN, Ϩ, STRING
from SNOBOL4python import ALPHABET, DIGITS, UCASE, LCASE, NULL
from SNOBOL4python import nPush, nInc, nPop, Shift, Reduce, Pop
# Instantiate the global variable space
GLOBALS(globals())

Collecting SNOBOL4python==0.4.4
  Downloading snobol4python-0.4.4-py3-none-any.whl.metadata (823 bytes)
Downloading snobol4python-0.4.4-py3-none-any.whl (25 kB)
Installing collected packages: SNOBOL4python
Successfully installed SNOBOL4python-0.4.4


**STEP 9**

To use the new PATTERN datatype provided by the SNOBOL4python module:

*   r"^" becomes *POS*(0)
*   r"$" becomes *RPOS*(0)
*   r"[a-z]" becomes *ANY*(LCASE)
*   r"xyz" becomes σ('xyz'), or alternatively
*   r"xyz" becomes σ('x') + σ('y') + σ('z')
*   r"(\_\_\_)*" becomes *ARBNO*(___)
*   re.search(pattern, subject) becomes subject in PATTERN

Let's start by just getting the PATTERN to work, and for now not deal with capturing any results.

In [None]:
if subject in \
      ( POS(0)
      + σ("DEFINE('")
      + ANY(LCASE)
      + σ("(")
      + ANY(LCASE) + ARBNO(σ(',') + ANY(LCASE))
      + σ(")")
      + ANY(LCASE) + ARBNO(σ(',') + ANY(LCASE))
      + σ("')")
      + RPOS(0)
      ):
      print(["Matched."])
else: print(['Unmatched!'])

['Matched.']


**STEP 10**

What just happened? It matched! Is there any hope I can complete my work?

Now, let's decorate the above pattern with Python code to capture the PATTERN matching results into variables containing strings and lists.

* r"(?P<name>\_\_\_)" becomes ___ % "name"
* r"<no-can-do>" becomes λ(python_code_string)

In [None]:
if subject in \
      ( POS(0)
      + σ("DEFINE('")
      + ANY(LCASE) % "function"
      + σ("(")
      + ( ANY(LCASE) % "param" + λ("parameters = [param]")
        + ARBNO(σ(',') + ANY(LCASE) % "param" + λ("parameters.append(param)"))
        )
      + σ(")")
      + ( ANY(LCASE) % "var" + λ("variables = [var]")
        + ARBNO(σ(',') + ANY(LCASE) % "var" + λ("variables.append(var)"))
        )
      + σ("')")
      + RPOS(0)
      ):
      print(["Matched.", function, parameters, variables])
else: print(['Unmatched!'])

['Matched.', 'f', ['x', 'y', 'z'], ['c', 'i', 'n', 's', 't', 'v']]
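
The idea of attaching an action to each repetition can be mimicked in plain Python. The sketch below is not how SNOBOL4python is implemented and does not use the library at all; `parse_name_list` is a hypothetical hand-written helper that matches a letter followed by zero or more comma-letter pairs, appending to a list as it goes:

```python
import string

def parse_name_list(text, pos):
    """Match letter (',' letter)* from pos; return (names, new_pos) or None."""
    if pos >= len(text) or text[pos] not in string.ascii_lowercase:
        return None
    names = [text[pos]]               # action after the first item
    pos += 1
    # This loop plays the role of the zero-or-more repetition.
    while (pos + 1 < len(text)
           and text[pos] == ","
           and text[pos + 1] in string.ascii_lowercase):
        names.append(text[pos + 1])   # action after each repeated item
        pos += 2
    return names, pos

names, end = parse_name_list("x,y,z)", 0)
print(names, end)  # ['x', 'y', 'z'] 5
```

Every item is kept because the collecting code runs once per repetition, which is exactly what a capture group inside a regular expression repetition cannot do.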


**STEP 11**

What happened? My work is done. It's a miracle.\
The solution does seem to resemble the problem.\
Can it really be that easy?

Now introducing the PATTERN phi, φ(r'___'). It will match a regular expression. And now a solution using regular expression patterns as an integral part of the new PATTERN datatype.

In [None]:
if subject in \
      ( φ(r"^DEFINE\('")
      + φ(r'(?P<function>[a-z])')
      + φ(r'\(')
      + ( φ(r'(?P<param>[a-z])') + λ("parameters = [param]")
        + ARBNO(φ(r',(?P<param>[a-z])') + λ("parameters.append(param)"))
        )
      + φ(r'\)')
      + ( φ(r'(?P<var>[a-z])') + λ("variables = [var]")
        + ARBNO(φ(r',(?P<var>[a-z])') + λ("variables.append(var)"))
        )
      + φ(r"'\)$")
      ):
      print(["Matched:", function, parameters, variables])
else: print(['Unmatched!'])

['Matched:', 'f', ['x', 'y', 'z'], ['c', 'i', 'n', 's', 't', 'v']]


**STEP 12**

What happens next! You can do any work you want!\
This SNOBOL4python Python module can process all four levels of the Chomsky hierarchy.\
This concludes this 12-step program. Enjoy Nirvana.