# Regex Examples

## Useful resources:
https://www.rexegg.com/regex-quickstart.html

https://regex101.com/

http://www.regular-expressions.info/

http://regex.info/book.html

## Cheat sheet:
http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf

In [None]:
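# Add the path to the folder that contains all the text files
folderPath = r""
# Append a trailing backslash only if a path was given (indexing an empty string would raise IndexError)
if folderPath and not folderPath.endswith("\\"):
    folderPath += "\\"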

In [None]:
import re

def getListMatches(pattern, text):
    # Return the full text of every match; unlike re.findall(pattern, text),
    # this ignores capturing groups and always yields the whole match
    return [match.group() for match in re.finditer(pattern, text)]
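
re.findall and re.finditer behave differently once a pattern contains a capturing group, which is a plausible reason for collecting match.group() here. A minimal sketch with a made-up sample string (not one of the notebook's text files):

```python
import re

sample = "logwood plywood redwood"  # illustrative text, assumed for this sketch
pattern = r"(log|ply)wood"

# With exactly one capturing group, findall returns only the group contents
groups_only = re.findall(pattern, sample)

# finditer yields match objects; group() returns the whole matched text
full_matches = [m.group() for m in re.finditer(pattern, sample)]

print(groups_only)   # ['log', 'ply']
print(full_matches)  # ['logwood', 'plywood']
```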

In [None]:
from IPython.display import HTML

def addColouredHTMLText(text, colour):
    return "<font color='{}'>{}</font> ".format(colour, text)

def printSearchedExpressions(file, pattern):
    filePath = folderPath + file
    
    html_string = ""
    
    with open(filePath) as txt:
        for line in txt:
            findings = getListMatches(pattern, line)
            if(len(findings) != 0):
                for finding in findings:
                    html_string += addColouredHTMLText(finding, "green")
                #html_string += addColouredHTMLText(line, "green")
                html_string += "<br> "
                    
            else:
                html_string += addColouredHTMLText(line, "red")
                html_string += "<br> "
    
    return HTML(html_string)

def printSearchedReplacedExpressions(file, pattern, outputFormat):
    filePath = folderPath + file
    
    html_string = ""
    
    with open(filePath) as txt:
        for line in txt:
            
            replacement = re.sub(pattern, outputFormat, line)
            html_string += addColouredHTMLText(replacement, "black")
            html_string += "<br> "
        return HTML(html_string)

# regex01.txt

<font color='green'>fooaaaabar</font> <br>
<font color='green'>fooabar</font> <br>
<font color='green'>foobar</font> <br>
<font color='green'>fooaabar</font> <br>
<font color='red'>fooxxxbar</font> <br>
<font color='red'>fooxbar</font> <br>

<b>a*</b> stands for zero or more occurrences of 'a'.

In [None]:
file = 'regex01.txt'
pattern = r"fooa*bar"

printSearchedExpressions(file, pattern)
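
The same check can be tried without any file. The sketch below copies the six lines from the listing above into a list (the variable names are only illustrative) and keeps those in which the pattern is found:

```python
import re

# Lines copied from the regex01.txt listing above
lines = ["fooaaaabar", "fooabar", "foobar", "fooaabar", "fooxxxbar", "fooxbar"]
pattern = r"fooa*bar"

# a* allows zero or more 'a' between "foo" and "bar", so "foobar" matches too
matched = [line for line in lines if re.search(pattern, line)]

print(matched)  # ['fooaaaabar', 'fooabar', 'foobar', 'fooaabar']
```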

# regex02.txt

<font color='green'>fooabar</font> <br>
<font color='green'>fooxbar</font> <br>
<font color='red'>baryfoo</font> <br>
<font color='red'>foobar</font> <br>
<font color='red'>fooxybar</font> <br>
<font color='green'>foocbar</font> <br>

**.** stands for exactly **one** character, whether it is a letter, a digit or anything else (except a line break).<br>
It is therefore a single-character wildcard: it can represent any character at exactly that position.

In [None]:
file = 'regex02.txt'
pattern = 'foo.bar'

printSearchedExpressions(file, pattern)

# regex03.txt

<font color='green'>foobar</font> <br>
<font color='red'>barfoo</font> <br>
<font color='green'>fooabcbar</font> <br>
<font color='green'>foobxcbar</font> <br>
<font color='red'>barcbyfoo</font> <br>
<font color='green'>foozbar</font> <br>
<font color='red'>barafoo</font> <br>
<font color='red'>barabfoo</font> <br>

<b>.*</b> stands for zero or more occurrences of the wildcard.<br>
In other words: zero or more occurrences of **any** character.

In [None]:
file = 'regex03.txt'
pattern = 'foo.*bar'

printSearchedExpressions(file, pattern)

# regex04.txt

<font color='red'>fooxxxbar</font> <br>
<font color='green'>foo   bar</font> <br>
<font color='red'>fooxbar</font> <br>
<font color='red'>fooxxbar</font> <br>
<font color='green'>foo bar</font> <br>
<font color='green'>foo       bar</font> <br>
<font color='green'>foobar</font> <br>
<font color='red'>fooyyybar</font> <br>

\s represents a whitespace character. \s* represents zero or more occurrences of it.

In [None]:
file = 'regex04.txt'
pattern = r'foo\s*bar'

printSearchedExpressions(file, pattern)

# regex05.txt - Part 1

<font color='green'>foo</font> <br>
<font color='red'>moo</font> <br>
<font color='green'>coo</font> <br>
<font color='red'>moo</font> <br>
<font color='red'>doo</font> <br>
<font color='green'>poo</font> <br>
<font color='red'>boo</font> <br>
<font color='red'>hoo</font> <br>

In [None]:
file = 'regex05.txt'
# No commas inside the brackets: a comma would be matched as a literal character
pattern = '[fcp]oo'

printSearchedExpressions(file, pattern)

# regex05.txt - Part 2

<font color='green'>foo</font> <br>
<font color='red'>moo</font> <br>
<font color='green'>coo</font> <br>
<font color='green'>moo</font> <br>
<font color='green'>doo</font> <br>
<font color='green'>poo</font> <br>
<font color='green'>boo</font> <br>
<font color='red'>hoo</font> <br>

In [None]:
file = 'regex05.txt'
pattern = '[^mh]oo'

printSearchedExpressions(file, pattern)
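
Inside square brackets every character stands for itself, so a comma used as a separator becomes one of the allowed characters. A small sketch of that pitfall and of the negated class, using a made-up word list that includes the artificial entry ",oo":

```python
import re

words = ["foo", "coo", "poo", ",oo", "moo"]  # ",oo" is an artificial test case

# The commas are part of the class, so ",oo" is accepted as well
with_commas = [w for w in words if re.fullmatch(r"[f,c,p]oo", w)]

# Only f, c and p are allowed as the first character
without_commas = [w for w in words if re.fullmatch(r"[fcp]oo", w)]

# [^mh] accepts any first character except m and h
negated = [w for w in words if re.fullmatch(r"[^mh]oo", w)]

print(with_commas)     # ['foo', 'coo', 'poo', ',oo']
print(without_commas)  # ['foo', 'coo', 'poo']
print(negated)         # ['foo', 'coo', 'poo', ',oo']
```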

# regex08.txt - Part 1

<font color='green'>joo</font> <br>
<font color='red'>boo</font> <br>
<font color='green'>koo</font> <br>
<font color='green'>loo</font> <br>
<font color='red'>woo</font> <br>
<font color='green'>moo</font> <br>
<font color='green'>zoo</font> <br>
<font color='red'>coo</font> <br>

In [None]:
file = 'regex08.txt'
# pattern = '[^bwc]oo'  # alternative: exclude the unwanted first letters
pattern = '[j-mz]oo'

printSearchedExpressions(file, pattern)

# regex10.txt

<font color='green'>joo</font> <br>
<font color='red'>boo</font> <br>
<font color='green'>Koo</font> <br>
<font color='green'>Loo</font> <br>
<font color='red'>woo</font> <br>
<font color='green'>moo</font> <br>
<font color='green'>zoo</font> <br>
<font color='red'>coo</font> <br>

In [None]:
file = 'regex10.txt'
pattern = '[j-mzJ-M]oo'

printSearchedExpressions(file, pattern)

# regex11.txt

<font color='green'>xxx.yy</font> <br>
<font color='green'>xx.yyyy</font> <br>
<font color='green'>x.yy</font> <br>
<font color='red'>xy</font> <br>
<font color='red'>xxyy</font> <br>
<font color='red'>yyxx</font> <br>
<font color='red'>yx</font> <br>
<font color='red'>yxxx</font> <br>

x* zero or more occurrences of x <br>
\. followed by a literal dot, which needs to be escaped <br>
y* zero or more occurrences of y

In [None]:
file = 'regex11.txt'
pattern = r'x*\.y*'

printSearchedExpressions(file, pattern)

# regex12.txt

<font color='green'>x#y</font> <br>
<font color='green'>x:y</font> <br>
<font color='green'>x.y</font> <br>
<font color='red'>x&y</font> <br>
<font color='red'>x%y</font> <br>

In [None]:
file = 'regex12.txt'
# pattern = 'x[#:.]y'  # alternative: list the allowed characters
pattern = 'x[^&%]y'

printSearchedExpressions(file, pattern)

# regex13.txt

<font color='green'>x#y</font> <br>
<font color='green'>x:y</font> <br>
<font color='green'>x^y</font> <br>
<font color='red'>x&y</font> <br>
<font color='red'>x%y</font> <br>

In [None]:
file = 'regex13.txt'
pattern = r'x[#:\^]y'

printSearchedExpressions(file, pattern)

# regex14.txt

<font color='green'>x#y</font> <br>
<font color='green'>x\y</font> <br>
<font color='green'>x^y</font> <br>
<font color='red'>x&y</font> <br>
<font color='red'>x%y</font> <br>

In [None]:
file = 'regex14.txt'
#pattern = 'x[^&%]y'
pattern = r'x[#\\\^]y'

printSearchedExpressions(file, pattern)
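
Backslashes are easier to follow when the pattern is a raw string, because Python then passes every backslash to the regex engine unchanged. A sketch with the lines from the listing above typed in as Python strings (so the backslash in the data has to be doubled):

```python
import re

# "x\\y" is the three-character string x, backslash, y
lines = ["x#y", "x\\y", "x^y", "x&y", "x%y"]

# In the raw string, \\ is a literal backslash and \^ a literal caret
pattern = r"x[#\\\^]y"

matched = [line for line in lines if re.fullmatch(pattern, line)]

print(matched)  # ['x#y', 'x\\y', 'x^y']
```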

# regex15.txt

<font color='green'>foo bar baz</font> <br>
<font color='red'>bar foo baz</font> <br>
<font color='red'>baz foo bar</font> <br>
<font color='red'>bar baz foo</font> <br>
<font color='green'>foo baz bar</font> <br>
<font color='red'>baz bar foo</font> <br>

^ is an anchor that marks the beginning of a line. <br> 
The interpretation of ^ differs inside and outside of square brackets. <br>
Outside, it anchors the match to the beginning of a line; inside, as the first character, it negates the character class.

In [None]:
file = 'regex15.txt'
pattern = '^foo.*'

printSearchedExpressions(file, pattern)
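
The two meanings of ^ can be compared side by side. The following sketch uses inline sample data (three of the lines listed above plus a short made-up word list):

```python
import re

lines = ["foo bar baz", "bar foo baz", "foo baz bar"]
words = ["boo", "moo", "zoo"]  # made-up list for the bracket case

# Outside brackets: ^ anchors the match to the start of the text
starts_with_foo = [l for l in lines if re.search(r"^foo.*", l)]

# Inside brackets, as the first character: ^ negates the class
not_starting_with_b = [w for w in words if re.fullmatch(r"[^b]oo", w)]

print(starts_with_foo)      # ['foo bar baz', 'foo baz bar']
print(not_starting_with_b)  # ['moo', 'zoo']
```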

# regex16.txt

<font color='red'>foo bar baz</font> <br>
<font color='red'>bar foo baz</font> <br>
<font color='green'>baz foo bar</font> <br>
<font color='red'>bar baz foo</font> <br>
<font color='green'>foo baz bar</font> <br>
<font color='red'>baz bar foo</font> <br>

$ means "end of the line".

In [None]:
file = 'regex16.txt'
pattern = '.*bar$'

printSearchedExpressions(file, pattern)

# regex17.txt

<font color='green'>foo</font> <br>
<font color='red'>foo bar</font> <br>
<font color='red'>baz foo</font> <br>
<font color='red'>foo bar baz</font> <br>
<font color='red'>baz bar foo</font> <br>

In [None]:
file = 'regex17.txt'
pattern = '^foo$'

printSearchedExpressions(file, pattern)

# regex18.txt

<font color='green'>834</font> <br>
<font color='green'>519</font> <br>
<font color='red'>4874</font> <br>
<font color='red'>5</font> <br>
<font color='red'>89</font> <br>
<font color='red'>45687</font> <br>
<font color='red'>25</font> <br>
<font color='green'>645</font> <br>

{n} repeats the preceding pattern exactly n times.

In [None]:
file = 'regex18.txt'
# pattern = '^[0-9][0-9][0-9]$'  # same result, written out
pattern = '^[0-9]{3}$'

printSearchedExpressions(file, pattern)

# regex19.txt

<font color='green'>lion</font> <br>
<font color='green'>tiger</font> <br>
<font color='red'>leopard</font> <br>
<font color='red'>fox</font> <br>
<font color='red'>kangaroo</font> <br>
<font color='red'>bat</font> <br>
<font color='green'>mouse</font> <br>
<font color='green'>cuckoo</font> <br>
<font color='green'>deer</font>

{n,m} repeats the preceding pattern between n and m times (both inclusive).

In [None]:
file = 'regex19.txt'
pattern = '^[a-z]{4,6}$'

printSearchedExpressions(file, pattern)

# regex20.txt

<font color='red'>ha</font> <br>
<font color='green'>hahahahaha</font> <br>
<font color='red'>hahaha</font> <br>
<font color='green'>hahahaha</font> <br>
<font color='red'>haha</font> <br>
<font color='red'></font> <br>
<font color='green'>hahahahahaha</font> <br>
<font color='green'>hahahahahahahaha</font> <br>
<font color='green'>hahahahahahahahaha</font>

{n,} at least n repetitions of the pattern.

In [None]:
file = 'regex20.txt'
pattern = '^(ha){4,}$'

printSearchedExpressions(file, pattern)

# regex21.txt

<font color='green'>ha</font> <br>
<font color='green'>haha</font> <br>
<font color='red'>hahahahaha</font> <br>
<font color='red'>hahahaha</font> <br>
<font color='red'>hahaha</font> <br>
<font color='red'>hahahahahahaha</font> <br>
<font color='red'>hahahahahaha</font> <br>

{,m} stands for at most m repetitions (including zero).

In [None]:
file = 'regex21.txt'
pattern = '^(ha){,2}$'

printSearchedExpressions(file, pattern)
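
The four counting forms {n}, {n,m}, {n,} and {,m} can be compared on one list of candidates. This is a sketch with made-up inputs; the helper name matching is an assumption for illustration only:

```python
import re

candidates = ["", "ha", "haha", "hahaha", "hahahaha", "hahahahaha"]

def matching(pattern):
    """Return the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

exactly_two = matching(r"(ha){2}")     # exactly 2 repetitions
two_to_four = matching(r"(ha){2,4}")   # 2, 3 or 4 repetitions
at_least_four = matching(r"(ha){4,}")  # 4 or more repetitions
at_most_two = matching(r"(ha){,2}")    # 0, 1 or 2 repetitions

print(exactly_two)    # ['haha']
print(two_to_four)    # ['haha', 'hahaha', 'hahahaha']
print(at_least_four)  # ['hahahaha', 'hahahahaha']
print(at_most_two)    # ['', 'ha', 'haha']
```

Note that {,2} also accepts the empty string, since zero repetitions are allowed.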

# regex22.txt

<font color='green'>fooaaaabar</font> <br>
<font color='green'>fooabar</font> <br>
<font color='red'>foobar</font> <br>
<font color='green'>fooaabar</font> <br>
<font color='red'>fooxxxbar</font> <br>
<font color='red'>fooxbar</font>

In [None]:
file = 'regex22.txt'
pattern = '^foo[a]{1,}bar$'

printSearchedExpressions(file, pattern)

**a+** means one or more occurrences of a

In [None]:
file = 'regex22.txt'
pattern = '^foo[a]+bar$'

printSearchedExpressions(file, pattern)

# regex23.txt

<font color='green'>https://website.com</font> <br>
<font color='green'>http://website.com</font> <br>
<font color='red'>httpss://website.de</font> <br>
<font color='red'>httpx://website.de</font> <br>
<font color='red'>httpxx://website.com</font> <br>

**s?** means zero or one occurrence of s. <br>
**( | )** means alternation, i.e. the logical "or".

In [None]:
file = 'regex23.txt'
# The dot is escaped so that it only matches a literal "."
pattern = r'(http)s?://website\.(com|de)'

printSearchedExpressions(file, pattern)
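
A short sketch of the optional s and the alternation, run on inline URLs. The last entry, "https://websiteXde", is a made-up input that shows what an unescaped dot before the domain ending would let through:

```python
import re

urls = [
    "https://website.com",
    "http://website.com",
    "httpss://website.de",
    "httpx://website.de",
    "https://websiteXde",  # artificial case, not in regex23.txt
]

strict = r"https?://website\.(com|de)"  # escaped dot: literal "."
loose = r"https?://website.(com|de)"    # unescaped dot: any character

strict_matched = [u for u in urls if re.fullmatch(strict, u)]
loose_matched = [u for u in urls if re.fullmatch(loose, u)]

print(strict_matched)  # ['https://website.com', 'http://website.com']
print(loose_matched)   # ['https://website.com', 'http://website.com', 'https://websiteXde']
```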

# regex24.txt

<font color='red'>sapwood</font> <br>
<font color='red'>rosewood</font> <br>
<font color='green'>logwood</font> <br>
<font color='red'>teakwood</font> <br>
<font color='green'>plywood</font> <br>
<font color='red'>redwood</font>

In [None]:
file = 'regex24.txt'
pattern = r'^(log|ply)wood$'

printSearchedExpressions(file, pattern)

# regex25.txt

<font color='black'>1280x720 convert to 1280 pix by 720 pix</font> <br>
<font color='black'>1920x1080</font> <br>
<font color='black'>1600x900</font> <br>
<font color='black'>1280x1024</font> <br>
<font color='black'>800x600</font> <br>
<font color='black'>1024x768</font>

**( )** captures the matched text as a group; in the replacement string, \1 and \2 refer to the first and second group.

In [None]:
file = 'regex25.txt'
pattern = r'([0-9]+)x([0-9]+)'
outputFormat = r'\1 pix by \2 pix'

printSearchedReplacedExpressions(file, pattern, outputFormat)
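
What the groups contain, and how re.sub uses them, can be inspected directly. A minimal sketch on two of the resolutions listed above:

```python
import re

pattern = r"([0-9]+)x([0-9]+)"
resolutions = ["1280x720", "1920x1080"]

# groups() returns the text captured by each pair of parentheses
captured = re.match(pattern, resolutions[0]).groups()

# \1 and \2 in the replacement are filled with the captured groups
converted = [re.sub(pattern, r"\1 pix by \2 pix", r) for r in resolutions]

print(captured)   # ('1280', '720')
print(converted)  # ['1280 pix by 720 pix', '1920 pix by 1080 pix']
```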

# regex26.txt

<font color='black'>John Wallace convert to Wallace, John</font> <br>
<font color='black'>Steve King</font> <br>
<font color='black'>Martin Cook</font> <br>
<font color='black'>Adam Smith</font> <br>
<font color='black'>Irene Peter</font> <br>
<font color='black'>Alice Johnson</font> 

In [None]:
file = 'regex26.txt'
pattern = r'([a-zA-Z]+)\s([a-zA-Z]+)'
outputFormat = r'\2, \1'

printSearchedReplacedExpressions(file, pattern, outputFormat)

# regex27.txt

<font color='black'>7:32 convert to 32 mins past 7</font> <br>
<font color='black'>6:12</font> <br>
<font color='black'>12:23</font> <br>
<font color='black'>1:23</font> <br>
<font color='black'>11:33</font> <br>
<font color='black'>4:21</font> 

In [None]:
file = 'regex27.txt'
pattern = r'([0-9]{1,2}):([0-9]{1,2})'
outputFormat = r'\2 mins past \1'

printSearchedReplacedExpressions(file, pattern, outputFormat)

# regex28.txt

<font color='black'>914.582.3013 convert to xxx.xxx.3013</font> <br>
<font color='black'>873.334.2589</font> <br>
<font color='black'>521.589.3147</font> <br>
<font color='black'>625.235.3698</font> <br>
<font color='black'>895.568.2145</font> <br>
<font color='black'>745.256.3369</font> 

In [None]:
file = 'regex28.txt'
pattern = r'[0-9]{3}\.[0-9]{3}\.([0-9]{4})'
outputFormat = r'xxx.xxx.\1'

printSearchedReplacedExpressions(file, pattern, outputFormat)

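A quantifier can also be applied to a whole group: here the group for three digits plus the dot is repeated twice, so the last four digits become group 2.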

In [None]:
file = 'regex28.txt'
pattern = r'([0-9]{3}\.){2}([0-9]{4})'
outputFormat = r'xxx.xxx.\2'

printSearchedReplacedExpressions(file, pattern, outputFormat)
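
When a capturing group is repeated, Python keeps only the text of its last repetition. The sketch below checks this on the first phone number from the listing:

```python
import re

pattern = r"([0-9]{3}\.){2}([0-9]{4})"
number = "914.582.3013"

m = re.fullmatch(pattern, number)

# Group 1 was matched twice ("914." and "582."); only the last one is kept
last_repetition = m.group(1)
last_four = m.group(2)

masked = re.sub(pattern, r"xxx.xxx.\2", number)

print(last_repetition)  # 582.
print(last_four)        # 3013
print(masked)           # xxx.xxx.3013
```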

# regex29.txt

<font color='black'>Jan 5th 1987 convert to 5-Jan-87</font> <br>
<font color='black'>Dec 26th 2010 </font> <br>
<font color='black'>Mar 2nd 1923</font> <br>
<font color='black'>Oct 1st 2008</font> <br>
<font color='black'>Aug 3rd 2009</font> <br>
<font color='black'>Jun 10th 2001</font> 

In [None]:
file = 'regex29.txt'
pattern = r'([A-Za-z]{3})\s([0-9]{1,2})[A-Za-z]{2}\s[0-9]{2}([0-9]{2})'
outputFormat = r'\2-\1-\3'

printSearchedReplacedExpressions(file, pattern, outputFormat)

# regex30.txt

<font color='black'>(914).582.3013 convert to 914.582.3013</font> <br>
<font color='black'>(873).334.2589</font> <br>
<font color='black'>(521).589.3147</font> <br>
<font color='black'>(625).235.3698</font> <br>
<font color='black'>(895).568.2145</font> <br>
<font color='black'>(745).256.3369</font> 

In [None]:
file = 'regex30.txt'
pattern = r'\(([0-9]{3})\)(\.[0-9]{3}\.[0-9]{4})'
outputFormat = r'\1\2'

printSearchedReplacedExpressions(file, pattern, outputFormat)

# Own example 1:
taken from: https://stackoverflow.com/questions/65649033/regex-how-to-ignore-dots-in-connected-words/65649084

## Input
08-01-2021: There is a System.InvalidCalculationException and a System.OutOfboundsException- System reboots <br>
09-01-2021: SuperSystem recognised a System.IO.WritingException ask user what to do next <br>
10-01-2021: Hello again, how are you today? <br>
10-01-2021: Oh no, not again an InternalException.NullReference.NonCritical.User we should fix it! <br>

## Output
System.InvalidCalculationException System.OutOfboundsException <br>
System.IO.WritingException <br>
10-01-2021: Hello again, how are you today? <br>
InternalException.NullReference.NonCritical.User

### Explanation:
**\b** matches a word boundary <br>
**(?: )** makes the parentheses a non-capturing group: the term is grouped but not stored as a group <br>
**\w** matches any word character: letters [a-zA-Z], digits [0-9] as well as _ <br>
**\w+** states 1 or more repetitions of \w <br>
<b>(term)*</b> states 0 or more repetitions of the given term <br>
<b>\\.</b> describes a literal ".". Normally, a dot within a regex pattern matches any character (except line breaks). <br>
But here you are looking for a literal ".". Therefore, you need to escape it.

In [None]:
file = 'ownExample01.txt'
pattern = r'\b(?:\w+\.)*\w*Exception(?:\.\w+)*\b'

printSearchedExpressions(file, pattern)
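
The pattern can also be tried without ownExample01.txt. The sketch below applies it to three of the input lines shown above, typed in as plain strings; because all groups are non-capturing, re.findall returns the complete matches:

```python
import re

log_lines = [
    "08-01-2021: There is a System.InvalidCalculationException and a "
    "System.OutOfboundsException- System reboots",
    "10-01-2021: Hello again, how are you today?",
    "10-01-2021: Oh no, not again an "
    "InternalException.NullReference.NonCritical.User we should fix it!",
]

pattern = r"\b(?:\w+\.)*\w*Exception(?:\.\w+)*\b"

# One list of matches per input line; a line without exceptions gives []
found = [re.findall(pattern, line) for line in log_lines]

for matches in found:
    print(matches)
```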