# Regular Expression Engine

My implementation follows [Thompson's construction](https://en.wikipedia.org/wiki/Thompson%27s_construction) of an NFA:
- First, I add implicit concatenation operators (`•`) and convert square brackets into unions: `[abc]` => `(a|b|c)`
- Then, I convert the expression to postfix notation, which makes operator precedence explicit and removes the need for parentheses
- Next, I convert the postfix expression to an NFA
- Finally, I match the input by simulating the NFA, as described in Thompson's paper.
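
The first two steps can be sketched as follows. This is a minimal illustration rather than the code in `regex_engine`: the names `add_concat_sketch` and `to_postfix_sketch` are hypothetical, brackets are assumed to be expanded into unions already, and every character other than `(`, `)`, `|`, `*`, `+`, `?` and `•` is treated as a literal.

```python
CONCAT = '•'
# Binary operators only; the postfix unary operators (* + ?) bind tightest
# and are copied to the output as soon as they are read.
PRECEDENCE = {'|': 1, CONCAT: 2}


def add_concat_sketch(regex):
    """Insert an explicit concatenation symbol between adjacent operands."""
    out = []
    for i, ch in enumerate(regex):
        out.append(ch)
        if i + 1 < len(regex):
            nxt = regex[i + 1]
            # Concatenation happens when the current token can end an operand
            # and the next token can start one.
            if ch not in '(|' and nxt not in ')|*+?':
                out.append(CONCAT)
    return ''.join(out)


def to_postfix_sketch(expr):
    """Shunting-yard conversion of an infix regex to postfix notation."""
    out, stack = [], []
    for ch in expr:
        if ch == '(':
            stack.append(ch)
        elif ch == ')':
            while stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()  # discard the matching '('
        elif ch in PRECEDENCE:
            # Pop operators of equal or higher precedence (left associativity).
            while stack and stack[-1] != '(' and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]:
                out.append(stack.pop())
            stack.append(ch)
        else:
            out.append(ch)  # literal or postfix unary operator
    while stack:
        out.append(stack.pop())
    return ''.join(out)


print(to_postfix_sketch(add_concat_sketch('(h|c)+at')))  # hc|+a•t•
```

Under these assumptions the sketch reproduces the `Parsed` and `Postfix` lines shown in the examples for patterns without character classes.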


Supported operations:
- dot (`.`) - represents any character
- letters, digits, whitespace - literals
- operators:
    - zero or more `*`
    - one or more `+`
    - zero or one `?`
- parentheses `(foo)`:
    - can be nested
    - define precedence
    - are discarded if none of the above operators follows

- square brackets `[foo]`:
    - matches any symbol within
    - may include characters, digits and ranges, e.g. `[ab12]` or `[a-f]`
    - or character classes:
        - `\d` matches any digit
        - `\w` matches any word character (alphanumeric & underscore)
        - `\s` matches any whitespace character (spaces, tabs, linebreaks)
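
A possible way to expand the inside of a bracket expression into a union is sketched below. The helper names (`CLASSES`, `expand_brackets`, `brackets_to_union`) are hypothetical and not taken from `regex_engine`. The sketch assumes ASCII-only classes and does not escape characters that are themselves operators.

```python
import string

# Assumed contents of each character class (ASCII only).
CLASSES = {
    'd': string.digits,
    'w': string.ascii_letters + string.digits + '_',
    's': ' \t\n\r\f\v',
}


def expand_brackets(body):
    """Return the characters matched by the inside of a `[...]` expression."""
    chars = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '\\' and i + 1 < len(body) and body[i + 1] in CLASSES:
            chars.extend(CLASSES[body[i + 1]])  # character class such as \d
            i += 2
        elif i + 2 < len(body) and body[i + 1] == '-':
            # Range such as a-f: every code point from the first to the last.
            chars.extend(chr(c) for c in range(ord(ch), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(ch)  # plain literal
            i += 1
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(chars))


def brackets_to_union(body):
    """Turn the inside of `[...]` into an equivalent parenthesised union."""
    return '(' + '|'.join(expand_brackets(body)) + ')'


print(brackets_to_union('hc'))   # (h|c)
print(brackets_to_union('a-f'))  # (a|b|c|d|e|f)
```

The order of the alternatives does not affect which strings match, so an implementation that collects the characters in a set may print them in a different order, as the example outputs do.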

This implementation could be extended to support other operations.
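
As a starting point for such extensions, the last two steps of the pipeline (building the NFA from a postfix expression and simulating it) can be sketched as follows. This is an illustration under stated assumptions, not the code in `regex_engine`: `State`, `build_nfa`, `epsilon_closure` and `nfa_matches` are hypothetical names, the postfix expression is assumed to be non-empty and well formed, and `•` is assumed to be the concatenation operator.

```python
CONCAT = '•'


class State:
    """NFA state with labelled edges and epsilon (empty-string) edges."""

    def __init__(self):
        self.edges = {}    # symbol -> list of target states
        self.epsilon = []  # targets reachable without consuming input


def build_nfa(postfix):
    """Thompson's construction; each fragment is a (start, accept) pair."""
    stack = []
    for ch in postfix:
        if ch == CONCAT:
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            a1.epsilon.append(s2)  # glue the two fragments together
            stack.append((s1, a2))
        elif ch == '|':
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.epsilon += [s1, s2]
            a1.epsilon.append(accept)
            a2.epsilon.append(accept)
            stack.append((start, accept))
        elif ch in '*+?':
            s, a = stack.pop()
            start, accept = State(), State()
            start.epsilon.append(s)
            a.epsilon.append(accept)
            if ch in '*?':
                start.epsilon.append(accept)  # allow zero occurrences
            if ch in '*+':
                a.epsilon.append(s)           # allow repetition
            stack.append((start, accept))
        else:
            start, accept = State(), State()
            start.edges.setdefault(ch, []).append(accept)
            stack.append((start, accept))
    return stack.pop()


def epsilon_closure(states):
    """All states reachable from `states` through epsilon edges only."""
    seen, todo = set(states), list(states)
    while todo:
        for nxt in todo.pop().epsilon:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def nfa_matches(postfix, text):
    """Simulate the NFA on the whole of `text`, tracking a set of states."""
    start, accept = build_nfa(postfix)
    current = epsilon_closure({start})
    for ch in text:
        moved = set()
        for state in current:
            moved.update(state.edges.get(ch, []))
            moved.update(state.edges.get('.', []))  # '.' matches any character
        current = epsilon_closure(moved)
    return accept in current


print(nfa_matches('hc|+a•t•', 'hcat'))  # True
print(nfa_matches('hc|+a•t•', 'at'))    # False
```

In a design like this, supporting a new postfix operator would mostly mean adding one more branch to `build_nfa`, since the simulation only depends on the edges of the resulting states.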


## Examples

In [1]:
from regex_engine import *

def show_steps(regex, tests):
    print('-' * 18 + ' Compiling ' + '-' * 18)
    parsed = initial_parse(regex)
    postfix = to_postfix(parsed)

    print('Regex:   ', regex)
    print('Parsed:  ', parsed)
    print('Postfix: ', postfix)
    print('-' * 18 + ' Matching ' + '-' * 19)
    
    nfa = compile(regex)
    for test in tests:
        print(f'Case: {test}, matched => {nfa.matches(test)}')
    
    print('-' * 47)

In [10]:
regex = 'python[ ]*((sucks)|(rules)|(<3))?[!]?'

tests = [
    ' ',
    'python',
    'python sucks!', 
    'python rules!', 
    'python    <3!'
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    python[ ]*((sucks)|(rules)|(<3))?[!]?
Parsed:   p•y•t•h•o•n•( )*•((s•u•c•k•s)|(r•u•l•e•s)|(<•3))?•(!)?
Postfix:  py•t•h•o•n• *•su•c•k•s•ru•l•e•s•|<3•|?•!?•
------------------ Matching -------------------
Case:  , matched => False
Case: python, matched => True
Case: python sucks!, matched => True
Case: python rules!, matched => True
Case: python    <3!, matched => True
-----------------------------------------------


In [3]:
regex = '[hc]+at'

tests = [
    'at', 
    'hat', 
    'cat', 
    'hcat',
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [hc]+at
Parsed:   (h|c)+•a•t
Postfix:  hc|+a•t•
------------------ Matching -------------------
Case: at, matched => False
Case: hat, matched => True
Case: cat, matched => True
Case: hcat, matched => True
-----------------------------------------------


In [4]:
regex = '[a-f]?d'
tests = [
    'foo',
    'abd',
    'ad',
    'bd',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [a-f]?d
Parsed:   (b|a|c|e|f|d)?•d
Postfix:  ba|c|e|f|d|?d•
------------------ Matching -------------------
Case: foo, matched => False
Case: abd, matched => False
Case: ad, matched => True
Case: bd, matched => True
-----------------------------------------------


In [5]:
regex = r'ala[\d]*'  # raw string, so Python does not treat \d as a string escape
tests = [
    '189',
    'ala',
    'ala89',
    'ala123',
    'ala678678055807606906707',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    ala[\d]*
Parsed:   a•l•a•(3|2|7|8|6|1|5|9|4|0)*
Postfix:  al•a•32|7|8|6|1|5|9|4|0|*•
------------------ Matching -------------------
Case: 189, matched => False
Case: ala, matched => True
Case: ala89, matched => True
Case: ala123, matched => True
Case: ala678678055807606906707, matched => True
-----------------------------------------------


In [6]:
# Example with nested parentheses
regex = '((a|b)*c)?'
tests = [
    'aab',
    '',
    'c',
    'aaabc',
    'ababc',
]
show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    ((a|b)*c)?
Parsed:   ((a|b)*•c)?
Postfix:  ab|*c•?
------------------ Matching -------------------
Case: aab, matched => False
Case: , matched => True
Case: c, matched => True
Case: aaabc, matched => True
Case: ababc, matched => True
-----------------------------------------------


In [7]:
# Matching words separated by whitespace, with no whitespace at the end.
regex = r'([\w]*[\s]?)*[\w]'

tests = [
    'Ala',
    'Alachcekota',
    'Ala ma kota',
    'Ala         ma         kota',
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    ([\w]*[\s]?)*[\w]
||
)?)*•(I|U|V|N|s|m|x|l|1|5|4|F|B|E|f|L|w|T|t|C|Z|Q|j|9|R|W|k|G|p|q|H|P|0|n|a|J|o|K|M|6|z|X|d|u|b|3|2|D|Y|S|O|7|v|e|r|y|8|h|i|c|_|g|A)
||
|?•*IU|V|N|s|m|x|l|1|5|4|F|B|E|f|L|w|T|t|C|Z|Q|j|9|R|W|k|G|p|q|H|P|0|n|a|J|o|K|M|6|z|X|d|u|b|3|2|D|Y|S|O|7|v|e|r|y|8|h|i|c|_|g|A|•
------------------ Matching -------------------
Case: Ala, matched => True
Case: Alachcekota, matched => True
Case: Ala ma kota, matched => True
Case: Ala         ma         kota, matched => True
-----------------------------------------------


In [8]:
# Matching binary numbers which are multiples of 3
regex = '(0|(1(01*0)*1))+'

tests = [
    '0000', # 0
    '0010', # 2
    '0011', # 3
    '1001', # 9
    '1111'  # 15
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    (0|(1(01*0)*1))+
Parsed:   (0|(1•(0•1*•0)*•1))+
Postfix:  0101*•0•*•1•|+
------------------ Matching -------------------
Case: 0000, matched => True
Case: 0010, matched => False
Case: 0011, matched => True
Case: 1001, matched => True
Case: 1111, matched => True
-----------------------------------------------
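
The pattern above can be cross-checked against plain arithmetic. The sketch below uses Python's built-in `re` module as a stand-in for the engine, with `fullmatch` assumed to correspond to `nfa.matches`. It compares the pattern with `int(s, 2) % 3 == 0` for every binary string of length 1 to 10.

```python
import re
from itertools import product

pattern = re.compile(r'(0|(1(01*0)*1))+')

mismatches = []
for length in range(1, 11):
    for bits in product('01', repeat=length):
        s = ''.join(bits)
        expected = int(s, 2) % 3 == 0              # arithmetic definition
        actual = pattern.fullmatch(s) is not None  # regex verdict
        if expected != actual:
            mismatches.append(s)

print('mismatches:', mismatches)  # mismatches: []
```

The pattern mirrors a three-state automaton that tracks the remainder modulo 3: `0` keeps the remainder at zero, while `1(01*0)*1` leaves remainder zero, moves between remainders one and two, and returns.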


In [9]:
# Matching a decimal number (I used `,` as the separator, since `.` is reserved for any character)
regex = r'[-]?[\d]*[,]?[\d]*'

tests = [
    'non-numeric',
    '13',
    '123,12',
    '-123,12'
]

show_steps(regex, tests)

------------------ Compiling ------------------
Regex:    [-]?[\d]*[,]?[\d]*
Parsed:   (-)?•(3|2|7|8|6|1|5|9|4|0)*•(,)?•(3|2|7|8|6|1|5|9|4|0)*
Postfix:  -?32|7|8|6|1|5|9|4|0|*•,?•32|7|8|6|1|5|9|4|0|*•
------------------ Matching -------------------
Case: non-numeric, matched => False
Case: 13, matched => True
Case: 123,12, matched => True
Case: -123,12, matched => True
-----------------------------------------------
