# Regular Expression Engine

My implementation follows [Thompson's construction](https://en.wikipedia.org/wiki/Thompson%27s_construction) of an NFA:
- First, I add implicit characters representing concatenation (`•`) and convert square brackets into unions: `[abc]` => `(a|b|c)`
- Then, I convert the expression to postfix notation, so that operator precedence no longer depends on parentheses
- Next, I convert the postfix expression to an NFA
- Finally, I match the input against the NFA as described in the paper.
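
The steps above can be sketched in a few dozen lines. This is a minimal illustration, not the code in `regex_engine`: the names `add_concat`, `infix_to_postfix`, `build_nfa` and `nfa_matches` are hypothetical, bracket expansion is left out, and whole-string matching is assumed because that is what the examples in this README suggest.

```python
CONCAT = '•'                      # explicit concatenation symbol used in this README
PRECEDENCE = {'|': 1, CONCAT: 2}  # binary operators; a higher number binds tighter


def add_concat(regex):
    """Insert CONCAT wherever two operands sit next to each other."""
    out = []
    for i, ch in enumerate(regex):
        # The previous symbol can end an operand and ch can start one.
        if i > 0 and regex[i - 1] not in '(|' and ch not in ')|*+?':
            out.append(CONCAT)
        out.append(ch)
    return ''.join(out)


def infix_to_postfix(infix):
    """Shunting-yard conversion; the unary *, + and ? are already postfix."""
    out, stack = [], []
    for ch in infix:
        if ch == '(':
            stack.append(ch)
        elif ch == ')':
            while stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()  # discard the matching '('
        elif ch in PRECEDENCE:
            while stack and stack[-1] != '(' and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]:
                out.append(stack.pop())
            stack.append(ch)
        else:
            out.append(ch)  # literal or unary operator
    return ''.join(out + stack[::-1])


class State:
    """NFA state: any number of epsilon edges plus at most one labelled edge."""
    def __init__(self):
        self.eps, self.char, self.out = [], None, None


def build_nfa(postfix):
    """Thompson's construction: combine (start, accept) fragments on a stack."""
    frags = []
    for ch in postfix:
        start, accept = State(), State()
        if ch == CONCAT:
            b, a = frags.pop(), frags.pop()
            a[1].eps.append(b[0])            # accept of a flows into start of b
            start, accept = a[0], b[1]
        elif ch == '|':
            b, a = frags.pop(), frags.pop()
            start.eps += [a[0], b[0]]
            a[1].eps.append(accept)
            b[1].eps.append(accept)
        elif ch in '*+?':
            a = frags.pop()
            start.eps.append(a[0])
            a[1].eps.append(accept)
            if ch in '*?':
                start.eps.append(accept)     # zero occurrences allowed
            if ch in '*+':
                a[1].eps.append(a[0])        # repetition allowed
        else:
            start.char, start.out = ch, accept  # literal, or '.' for any character
        frags.append((start, accept))
    return frags.pop()


def closure(states):
    """All states reachable from `states` through epsilon edges alone."""
    todo, seen = list(states), set(states)
    while todo:
        for nxt in todo.pop().eps:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def nfa_matches(nfa, text):
    """Whole-string match that tracks the set of live states."""
    start, accept = nfa
    current = closure({start})
    for c in text:
        current = closure({s.out for s in current if s.char in ('.', c)})
    return accept in current


postfix = infix_to_postfix(add_concat('((a|b)*c)?'))
nfa = build_nfa(postfix)
print(postfix)                                             # ab|*c•?
print(nfa_matches(nfa, 'ababc'), nfa_matches(nfa, 'aab'))  # True False
```

Because the simulation keeps a set of states instead of backtracking, each input character is processed once, and the work per character is bounded by the number of NFA states.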


Supported operations:
- dot (`.`) - matches any character
- letters, digits, whitespace - literals
- operators:
    - zero or more `*`
    - one or more `+`
    - zero or one `?`
- parentheses `(foo)`:
    - can be nested
    - define precedence
    - discarded if none of the operators above follows

- square brackets `[foo]`:
    - matches any symbol within
    - may include characters and digits, e.g. `[ab12]`, or ranges such as `[a-c]`
    - or character classes:
        - `\d` matches any digit
        - `\w` matches any word character (alphanumeric & underscore)
        - `\s` matches any whitespace character (spaces, tabs, linebreaks)
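
As an illustration of the bracket rules above, here is a minimal sketch of rewriting a bracket group as a union. The function name `expand_brackets` and the exact contents of `CLASSES` are assumptions made for this example and are not taken from `regex_engine`. Members that are themselves operators (such as `|` or `(`) are not handled.

```python
import string

# Assumed contents of the three character classes described above.
CLASSES = {
    'd': string.digits,
    'w': string.ascii_letters + string.digits + '_',
    's': ' \t\n\r',
}


def expand_brackets(regex):
    """Rewrite every [...] group as a parenthesised union of its members."""
    out, i = [], 0
    while i < len(regex):
        if regex[i] != '[':
            out.append(regex[i])
            i += 1
            continue
        end = regex.index(']', i)  # raises ValueError on an unclosed bracket
        body, members, j = regex[i + 1:end], [], 0
        while j < len(body):
            if body[j] == '\\' and body[j + 1:j + 2] in CLASSES:
                members.extend(CLASSES[body[j + 1]])       # \d, \w or \s
                j += 2
            elif body[j + 1:j + 2] == '-' and j + 2 < len(body):
                lo, hi = ord(body[j]), ord(body[j + 2])    # range such as a-c
                members.extend(chr(c) for c in range(lo, hi + 1))
                j += 3
            else:
                members.append(body[j])                    # plain member
                j += 1
        # dict.fromkeys removes duplicates while keeping first-seen order
        out.append('(' + '|'.join(dict.fromkeys(members)) + ')')
        i = end + 1
    return ''.join(out)


print(expand_brackets('[a-c]?d'))    # (a|b|c)?d
print(expand_brackets(r'ala[\d]*'))  # ala(0|1|2|3|4|5|6|7|8|9)*
```

The outputs in the examples below list the members in a different order, such as `(c|a|b)`. The order of alternatives in a union does not change what it matches.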

## Examples



In [29]:
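def show_steps(regex, tests):
    """Print the intermediate forms of `regex`, then the match result for each test string."""
    print('-' * 18 + ' Compiling ' + '-' * 18)
    parsed = initial_parse(regex)  # explicit concatenation, brackets expanded to unions
    postfix = to_postfix(parsed)   # operands first, operators after
    nfa = compile(regex)           # the engine's compile, not Python's built-in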

    print('Regex:   ', regex)
    print('Parsed:  ', parsed)
    print('Postfix: ', postfix)
    print('-' * 18 + ' Matching ' + '-' * 19)

    for test in tests:
        print(f'Case: {test}, matched => {nfa.matches(test)}')
    
    print('-' * 47)

In [35]:
from regex_engine import *

regex = '[a-c]?d'
tests = [
    'foo',
    'abd',
    'ad',
    'bd',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [a-c]?d
Parsed:   (c|a|b)?•d
Postfix:  ca|b|?d•
------------------ Matching -------------------
Case: foo, matched => False
Case: abd, matched => False
Case: ad, matched => True
Case: bd, matched => True
-----------------------------------------------


In [42]:
regex = r'ala[\d]*'  # raw string: \d is not a valid Python escape sequence
tests = [
    '189',
    'ala',
    'ala89',
    'ala123',
    'ala678678055807606906707',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    ala[\d]*
Parsed:   a•l•a•(8|5|4|7|1|3|2|6|9|0)*
Postfix:  al•a•85|4|7|1|3|2|6|9|0|*•
------------------ Matching -------------------
Case: 189, matched => False
Case: ala, matched => True
Case: ala89, matched => True
Case: ala123, matched => True
Case: ala678678055807606906707, matched => True
-----------------------------------------------


In [60]:
# Example with nested parentheses
regex = '((a|b)*c)?'
tests = [
    'aab',
    '',
    'c',
    'aaabc',
    'ababc',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    ((a|b)*c)?
Parsed:   ((a|b)*•c)?
Postfix:  ab|*c•?
------------------ Matching -------------------
Case: aab, matched => False
Case: , matched => True
Case: c, matched => True
Case: aaabc, matched => True
Case: ababc, matched => True
-----------------------------------------------


In [59]:
regex = '[hc]?at'

tests = [
    'hcat',  # two leading letters, but at most one is allowed
    'at',    # optional letter omitted
    'hat',
    'cat',
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [hc]?at
Parsed:   (h|c)?•a•t
Postfix:  hc|?a•t•
------------------ Matching -------------------
Case: hcat, matched => False
Case: at, matched => True
Case: hat, matched => True
Case: cat, matched => True
-----------------------------------------------


In [56]:
# Matching binary numbers that are multiples of 3 (decimal values in the comments)
regex = '(0|(1(01*0)*1))+'

tests = [
    '0000', # 0
    '0010', # 2
    '0011', # 3
    '1001', # 9
    '1111'  # 15
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    (0|(1(01*0)*1))+
Parsed:   (0|(1•(0•1*•0)*•1))+
Postfix:  0101*•0•*•1•|+
------------------ Matching -------------------
Case: 0000, matched => True
Case: 0010, matched => False
Case: 0011, matched => True
Case: 1001, matched => True
Case: 1111, matched => True
-----------------------------------------------
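
The pattern above can be cross-checked against plain modulo arithmetic. The sketch below uses Python's standard `re` module instead of `regex_engine`, so it checks the pattern itself and not this engine. The helper name `mismatches` is made up for this example.

```python
import re

# Same pattern as above, compiled with the standard library.
MULTIPLE_OF_3 = re.compile(r'(0|(1(01*0)*1))+')


def mismatches(limit, width=0):
    """Return every n < limit where the regex and n % 3 == 0 disagree."""
    bad = []
    for n in range(limit):
        bits = format(n, 'b').zfill(width)  # binary digits, zero-padded to `width`
        matched = MULTIPLE_OF_3.fullmatch(bits) is not None
        if matched != (n % 3 == 0):
            bad.append(n)
    return bad


print(mismatches(1024))         # []
print(mismatches(16, width=4))  # []
```

The pattern mirrors the three-state automaton that tracks the remainder modulo 3: `0` keeps the remainder at zero, and `1(01*0)*1` leaves remainder zero and returns to it.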


In [61]:
# Matching a numeral (`,` is the decimal separator, since `.` is reserved for any character)
regex = r'[-]?[\d]*[,]?[\d]*'  # raw string: \d is not a valid Python escape sequence

tests = [
    'not-numeric',
    '13',
    '123,12',
    '-123,12'
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [-]?[\d]*[,]?[\d]*
Parsed:   (-)?•(8|5|4|7|1|3|2|6|9|0)*•(,)?•(8|5|4|7|1|3|2|6|9|0)*
Postfix:  -?85|4|7|1|3|2|6|9|0|*•,?•85|4|7|1|3|2|6|9|0|*•
------------------ Matching -------------------
Case: not-numeric, matched => False
Case: 13, matched => True
Case: 123,12, matched => True
Case: -123,12, matched => True
-----------------------------------------------
