# Parser 1.00 - Unsigned Decimal Integer Literals

The first goal of the parser is to accept unsigned decimal integer literals, that is, strings of characters consisting of one or more of the digits '0' to '9'. If such a string is recognized, the parser converts it to internal binary form. This is usually (but not always) some form of [two's complement](https://en.wikipedia.org/wiki/Two's_complement).

It is true that Python will happily [convert decimal literals expressed as strings into its internal form](https://docs.python.org/3.4/library/functions.html?highlight=int#int) without any fuss at all. As a learning exercise we're simply going to ignore that inconvenient fact in favor of doing the job manually.

We also want the parser to reject any string which does not consist of only decimal digit characters.


## Library

In [None]:
import re         # Python regular expressions

The easiest way to recognize unsigned decimal literals is to use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). The Python library module [re](https://docs.python.org/3/library/re.html) provides [Perl-style](https://www.regular-expressions.info/perl.html) regular expressions (as opposed to [Posix-style](https://www.regular-expressions.info./posix.html)).

The parser will use regular expressions extensively to simplify the job of matching things we might be looking for in expressions. This effectively sweeps most of the mechanics of pattern matching "under the rug". We do this partly to avoid cluttering up the overall presentation and partly because the mechanics of matching is not really what this series is about anyway.

It is not the purpose of this series to teach how to use regular expressions, although some effort will be made to explain what any particular one is supposed to do. If you are already familiar with regular expressions you know how useful they can be. If you are not, a simple search using any popular engine will turn up many tutorials and references of varying quality.

>Python's own documentation includes both a [tutorial](https://docs.python.org/3/howto/regex.html) and a [reference](https://docs.python.org/3/library/re.html). Another good source is [www.regular-expressions.info](https://www.regular-expressions.info/).


## User output

In [None]:
def UIwriteln(this):
    '''write a single line to output'''
    print( f'{this}\n' )
    
def UIwriteSep():
    '''write a visual separator'''
    UIwriteln( '--------------')

def UIshow(tag, value):
    '''write a tagged value to output'''
    UIwriteln( f'{tag}: {value}' )

def UIerror(this):
    '''write an error message to output'''
    UIshow( 'Error', this )

Since the parser should detect incorrect input, we add a function to allow it to inform the user of errors.

# Parser

In [None]:
versionNumber = '1.00'

# operands accepted:
# - decimal integer literals

# errors detected:
# - unrecognized input
# - out of range numeric input

# result tuple:
# - (True, number)
# - (False, None)

def PEdoparse(this):
    
    uintMax = 4294967295                  # 2**32-1, for range checking
    
    # convert unsigned decimal literal to internal form
    # - all chars in input known to be decimal characters
    # - checks that value is within range
    
    def convertUint(ulit):
        
        uint = 0
        for digit in ulit:
            digval = '0123456789'.find(digit)
            if uint <= (uintMax - digval)/10:
                uint =  uint * 10 + digval
            else:
                UIerror(f'\'{ulit}\' is out of range')
                return (False, None)
        
        return (True, uint)
        
    # top level
    
    if re.fullmatch('[0-9]+', this):
        return convertUint(this)
    else:
        UIerror(f'\'{this}\' not recognized')
        return (False, None)


### How it works

The regular expression used here simply means *a string consisting of one or more decimal characters*. Any other character in the string, including a leading or trailing space, causes the input to be rejected as erroneous.
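To see what the pattern accepts, it can be tried on a few strings. This is only a minimal sketch: the helper name `isUintLiteral` and the sample strings are made up for illustration and are not part of the parser.

```python
import re

def isUintLiteral(text):
    '''True if text consists only of one or more of the digits 0-9'''
    # fullmatch() succeeds only if the whole string matches the pattern
    return re.fullmatch('[0-9]+', text) is not None

# digits only are accepted; anything else (including spaces) is rejected
for sample in ['42', '007', '', ' 42', '42 ', '4 2', '-42', '4.2']:
    print(repr(sample), isUintLiteral(sample))
```

Only the first two samples are accepted. The empty string fails because `+` demands at least one digit.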

The nested conversion function uses a fairly simple algorithm. Set the current value to zero. Beginning at the first character of the input string and for each character thereafter, extract it, multiply the current value by 10 (because this is a decimal value, each digit represents a power of 10) and add the value represented by the current digit character.
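The accumulation can be traced step by step. The sketch below is a stripped-down variant of the conversion loop with the range check left out. The function name `accumulate` is hypothetical, and the input is assumed to contain only decimal digits.

```python
def accumulate(ulit):
    '''convert a string of decimal digits to an int, one digit at a time'''
    value = 0
    for digit in ulit:
        digval = '0123456789'.find(digit)   # digit character -> digit value
        value = value * 10 + digval         # shift left one decimal place, add digit
        print(f'after {digit!r}: {value}')
    return value

accumulate('4096')
```

For the input `'4096'` the running value goes 4, 40, 409, 4096.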

A slight complication arises because decimal digit *characters* are not the same as decimal digit *values*.  In the [ASCII character set](https://en.wikipedia.org/wiki/ASCII), for example, the character ‘0’ is the 49th member and has a value of 48 (starting from zero). Put another way, if the bit pattern in a computer memory cell is interpreted as an ASCII character to display, a value of 48 in that cell will cause a picture (or *glyph*) of the character ‘0’ to be shown.

To convert from character value to decimal value we use the *find()* string method. We've arranged the string to be searched so that each decimal digit appears at the index which represents its value. We don't need to worry that *find()* will return -1, indicating that no match with the current digit character occurred, since we already ruled that possibility out at the top level.
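The difference between a digit character's code and its value is easy to inspect. This short sketch uses only built-in functions. The variable name `digits` is an illustration, and subtracting `ord('0')` is shown as a common alternative, not as what the parser does.

```python
digits = '0123456789'

for ch in ['0', '7', '9']:
    # character, its character code, its value via find(), its value via ord()
    print(ch, ord(ch), digits.find(ch), ord(ch) - ord('0'))

# find() reports -1 for a character that is not in the string
print(digits.find('x'))
```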

Another complication arises from the possibility of overflow. As the input string is allowed to be any length, it is conceivable that it represents a value larger than can be represented internally. Since there are an infinite number of integers but digital computers are finite, every one must have some limit on how large an integer it can be programmed to handle.

The parser has this limit set artificially low, to a little over four billion (4,294,967,295). This is the value stored in the variable *uintMax* (unsigned integer Maximum). It is much less than Python’s real upper limit, but is a convenient value for studying the problem.

>For most programming languages, anyway. Python 3 is actually unusual in that it has no explicitly defined maximum integer. In theory integers are allowed to grow with unlimited precision. In practice there must always be some hardware limit beyond which there is simply no more room to store a value.
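Python's behavior here is easy to confirm. In this small sketch (the variable name `big` is only for illustration), a value too large for a 32-bit unsigned integer is computed without any complaint:

```python
big = 4294967295 * 10 + 9    # well beyond 2**32 - 1
print(big)                   # 42949672959
print(big.bit_length())      # 36 bits needed, more than the 32 we allow
```

This is why the parser has to impose its limit explicitly rather than rely on the language to do it.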

Overflow during conversion occurs if the result of multiplying the current value by 10 and adding a new digit exceeds the maximum representable value:

```Python
uint * 10 + digval > uintMax
```

But this cannot be checked after performing the operations. The very definition of “maximum representable value” means that it is not possible to represent a larger value. Whatever the result of an overflow condition actually is, it is not larger than *uintMax*.

In practice what happens most often in cases of overflow is *wrapping*. If the largest representable value has **M** bits (where at present **M** is typically 32 or 64), results of addition can have at most **M + 1** bits and results of multiplication **2 * M** bits. Wrapping occurs if the result of every arithmetic operation is truncated to at most **M** bits, which can always be represented. Another way to look at it is that computer arithmetic is often modular.

If the truncation happens silently (as it most often does), the inequality above cannot help to detect overflow. Under conventional precedence rules the comparison operator has the lowest priority in the expression, so by the time it is executed any wrapping involved has already occurred. Thus the value to its left is always less than or equal to the representable maximum on its right.
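Python integers do not wrap, so the effect has to be simulated. The sketch below assumes a 32-bit unsigned machine word and imitates truncation by masking every intermediate result. The names `UINT_MAX` and `wrap32` are hypothetical helpers, not part of the parser.

```python
UINT_MAX = 4294967295                  # 2**32 - 1

def wrap32(n):
    '''keep only the low 32 bits, as wrapping hardware would'''
    return n & UINT_MAX

uint = 429496729                       # uint * 10 still fits in 32 bits
digval = 6                             # but adding 6 gives exactly 2**32

exact = uint * 10 + digval                       # what the mathematics says
wrapped = wrap32(wrap32(uint * 10) + digval)     # what the machine would hold

print(exact > UINT_MAX)      # True:  the real result is too large
print(wrapped)               # 0:     the wrapped result
print(wrapped > UINT_MAX)    # False: the after-the-fact test sees nothing wrong
```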

All is not lost. The inequality can be re-arranged to the equivalent form:

```Python
uint > (uintMax - digval) / 10
```

It is clear the right-hand side must always be less than *uintMax*, and so can never overflow. This is still not much help if *uint* has already overflowed, but suppose we apply the test *before* attempting to add the next digit rather than after? At that point *uint* is also never greater than *uintMax*. If the test condition is false, the result of multiplying the current value by 10 and adding the next digit will always be less than or equal to *uintMax*. If the condition is true, the result would overflow the largest representable value.

What is really detected here is not that overflow has occurred, but that it is *about* to occur. Which is quite often the best way to handle the problem from the point of view of computer arithmetic.
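The rearranged test can be checked against the original inequality, because Python's unbounded integers let us compute the true result for comparison. This is a sketch under stated assumptions: the function name `wouldOverflow` is made up, and only a small range of values around the boundary is examined.

```python
uintMax = 4294967295                   # 2**32 - 1

def wouldOverflow(uint, digval):
    '''True if uint * 10 + digval would exceed uintMax (tested beforehand)'''
    return uint > (uintMax - digval) / 10

# compare the pre-check with the exact arithmetic near the boundary
for uint in range(429496720, 429496740):
    for digval in range(10):
        assert wouldOverflow(uint, digval) == (uint * 10 + digval > uintMax)

print(wouldOverflow(429496729, 5))     # False: 4294967295 is exactly uintMax
print(wouldOverflow(429496729, 6))     # True:  4294967296 is one too many
```

In Python the `/` operator yields a float. That is harmless at this magnitude, since every value involved is far below the point where floats lose integer precision.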

In the parser the condition of the overflow test is reversed:

```Python
uint <= (uintMax - digval) / 10
```

Written this way a true result rather than a false one allows us to safely add the next digit to the accumulating result. This enables us to put the shorter branch closest to the test in the source code, which is often the most readable way to arrange things.

The shorter branch also happens to be the most often executed – the longer branch will execute at most once – and there is sometimes a (very) slight speed advantage to placing the most often executed branch closest to the test. If there is any advantage, it derives from the specific way the underlying hardware handles conditional tests and control flow changes based on them. In most cases it’s better to focus on readability.

# Evaluator

In [None]:
# a null evaluator that does nothing

def EEdoeval(this):
    return (True, this)


# Interacting with the parser

In [None]:
passCnt = failCnt = 0                       # most useful for test input files, but never any harm

def startUp():
    '''begin execution'''
    global passCnt, failCnt
    UIshow( 'Parser', versionNumber )
    passCnt = failCnt = 0
    
def shutDown():
    '''terminate execution'''
    UIwriteSep()
    UIshow( 'Pass', passCnt )
    UIshow( 'Fail', failCnt )
    
# run parser

def parseOne(this):
    '''parse/evaluate one expression'''
    global passCnt, failCnt
    UIwriteSep()
    UIshow( 'Input', this )
    ok, res = PEdoparse( this )
    if ok:
        UIshow( 'Final Parse', res )
        ok, res = EEdoeval( res )
        if ok:
            UIshow( 'Final Eval', res )
    if ok:
        passCnt += 1
    else:
        failCnt += 1
 

We are going to use files containing expression tests to help development. Here we introduce two global variables, *passCnt* and *failCnt*, to let us keep track of how many of our tests passed or failed, respectively.

## Interactive use

In [None]:
def parse():
    
    startUp()
    while True:
        inp = input( 'Expression: ' )
        UIwriteln( '' )                      # looks better with a blank line here
        if inp[:1].upper() == 'Q':           # slicing avoids an IndexError on empty input
            break
        elif inp.strip():
            parseOne( inp )
    shutDown()

## Batch processing

In [None]:
testDir = '..\\ParserTest\\'            # directory holding test input files (empty string if same as notebook directory)
    
# run one test

def runTest(this):
    
    # create the name of the test file
    # - done this way so we can update only the version number and everything still works

    head = versionNumber[:len(versionNumber)-3]
    tail = versionNumber[-2:]
    fn = f'{testDir}{this}{head:0>2}{tail}.txt'
    
    UIwriteln(f'Parser {versionNumber} vs {fn[-12:-4]}')
    
    with open(fn) as f:
        data = f.readlines()              # read the whole test file
    for line in data:
        test = line.strip()
        if test and test[0] != '#':       # skip blank and comment lines
            parseOne(test)
    
# run a test which should succeed
    
def good():
    
    startUp()
    runTest('pass')
    shutDown()
    
# run a test which should fail

def bad():
    
    startUp()
    runTest('fail')
    shutDown()
            

### Testing the parser

It is almost impossible to overemphasize how important testing is in program development. While it is always gratifying to know that a program behaves as expected, it is also usually quite instructive to understand exactly why it does not. By developing tests for each new version of the parser, we can also help guarantee that our most recent changes do not break anything that previously worked.

Beyond the null version, there are three files associated with most parser versions, the parser itself and two test files:

- `parserXXXX.ipynb`

- `passXXXX.txt`

- `failXXXX.txt`

where **XXXX** is a four-digit number corresponding to the parser version. For example, this parser is version 1.00 and the associated Jupyter notebook file is named **parser0100.ipynb**. The two test files for this version are simple text files  called **pass0100.txt** (all expressions in it should be accepted) and **fail0100.txt** (all expressions should be rejected).

It is often helpful to make testing as easy as possible (and thus more likely to actually happen). To this end two new no-argument functions are added to the interaction section: *good()*, which tests the current version of the parser against its own **passXXXX.txt** file, and *bad()*, which does the same for the **failXXXX.txt** file.

In [None]:
parse()  # interactive, one expression at a time

In [None]:
good()   # run 'all expressions pass' test file for this parser

In [None]:
bad()    # run 'all expressions fail' test file for this parser