# Regular Expressions

The Good 😍, The Bad 😈, The Ugly 😱

# A colorful bouquet of topics 💐

- RegEx Basics
- RegEx Lookarounds
- Examples
- Greediness
- Best Practices
- Flavors
- More examples
- Building our own RegEx engine

# RegEx Basics

- Meta Characters
- Quantifiers
- Character Classes
- Anchors


# Why RegEx? 🤷
- Working with data
- Extracting data from data
- Validating data (e.g. user input)

# Where RegEx is used
- In the shell
- In the editor/IDE
- In source code

# How it works 1️⃣0️⃣1️⃣

- String literals match string literals
- The leftmost match wins
- The engine works from left to right

In [10]:
import re
m = re.search(r"fire", "Dear Sir/Madam. I am writing to inform you of a fire which has broken out ...")
print(m)

m = re.search(r"fire", "Dear Sir/Madam. Fire! Fire! Help me")
print(m)

<re.Match object; span=(48, 52), match='fire'>
None
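
The rules above can be tried out in a small sketch. The sample strings are made up for illustration; `re.IGNORECASE` is a standard flag of Python's `re` module and is only one possible way to also find the capitalized `Fire`.

```python
import re

# Leftmost match wins: "cat" occurs inside "concatenate" before the standalone word.
m = re.search(r"cat", "concatenate the cat")
print(m.span())

# Literals are case-sensitive by default, so "fire" does not match "Fire".
m_cs = re.search(r"fire", "Fire! Fire! Help me")
print(m_cs)

# With re.IGNORECASE the first (leftmost) "Fire" is found.
m_ci = re.search(r"fire", "Fire! Fire! Help me", re.IGNORECASE)
print(m_ci.group(), m_ci.span())
```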


# Engine simulation with PowerPoint! 🚂

# Meta Characters

- A string literal matches a string literal
- Exceptions: **meta characters**

        Meta characters:
        . ^ $ * + ? { } [ ] \ | ( )
        
        Meta character . (dot)
        The meta character . matches any single character (except a newline by default)
        
        To match a meta character literally, escape it with \ (backslash)
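
A minimal sketch of the difference between an unescaped and an escaped dot. The sample strings are made up; `re.escape` is part of Python's `re` module and escapes meta characters in a plain string for you.

```python
import re

# An unescaped dot matches any character, so "1x5" is accepted as well.
unescaped_hit = re.fullmatch(r"1.5", "1x5")
print(bool(unescaped_hit))

# An escaped dot only matches a literal ".".
escaped_miss = re.fullmatch(r"1\.5", "1x5")
escaped_hit = re.fullmatch(r"1\.5", "1.5")
print(bool(escaped_miss), bool(escaped_hit))

# re.escape adds the backslashes automatically.
escaped = re.escape("1.5")
print(escaped)
```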
        

# Simple filename validator

In this example, a filename consists of:

- Exactly 8 arbitrary characters
- Followed by a .
- Followed by exactly 3 arbitrary characters

In [71]:
import re

def simple_validator(filename):
    # Replace ... with valid RegEx
    m = re.match(r"...", filename)
    return m is not None

assert simple_validator("test.txt") is False
assert simple_validator("autofile.cmd") is True
assert simple_validator("test0010.txt") is True
assert simple_validator("test001.txt") is False
assert simple_validator("test.tar.gz") is False
assert simple_validator("test00100.tx") is False
print("Good RegEx!")

AssertionError: 

# Quantifiers

    Quantifier | Count       |  Meaning
    -----------------------------------------
    *            0..x           any number
    +            1..x           at least one
    ?            0..1           optional
    {3}          3              exactly 3
    {42}         42             exactly 42
    {3,}         3..x           at least 3
    {10,20}      10..20         between 10 and 20
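
The table can be checked against a few sample strings. This is only a sketch: the samples and the helper name `accepted` are made up, and `re.fullmatch` is used so that the whole string has to match.

```python
import re

samples = ["", "a", "aaa", "aaaaa"]

def accepted(pattern):
    """Return the samples that the pattern matches completely."""
    return [s for s in samples if re.fullmatch(pattern, s)]

for pattern in [r"a*", r"a+", r"a?", r"a{3}", r"a{3,}", r"a{2,4}"]:
    print(f"{pattern:8} accepts {accepted(pattern)}")
```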

In [16]:
import re

def simple_validator(filename):
    # 8 arbitrary characters, a literal dot, then 3 arbitrary characters
    m = re.match(r".{8}\..{3}", filename)
    return m is not None

assert simple_validator("test.txt") is False
assert simple_validator("autofile.cmd") is True
assert simple_validator("test0010.txt") is True
assert simple_validator("test001.txt") is False
assert simple_validator("test.tar.gz") is False
assert simple_validator("test00100.tx") is False
print("Good RegEx!")

Good RegEx!


More on quantifier: https://alexanderkosik.github.io/kickstart_regex/content/quantifier.html

# Character classes

- What if we only want to match certain characters?
    - Only digits
    - Only uppercase letters
    - ...

- `[abc]` matches a or b or c
- `[xyz]` matches x or y or z
- `[0123456789]` matches 0 or 1 or ... 9
- `[0-9]` same
- `[a-zA-Z]` matches any letter from a to z or from A to Z
- `[-F-H]` matches - or F or G or H (a literal - must come first or last, or be escaped)
- `[^0-9]` matches any character that is NOT a digit, but it still has to match a character!
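
A short sketch of these classes in action. The sample text is made up for illustration.

```python
import re

text = "Room B-12, Floor 3"

digits = re.findall(r"[0-9]", text)        # every single digit
uppercase = re.findall(r"[A-Z]", text)     # every uppercase letter
dash_or_a_to_c = re.findall(r"[-A-C]", text)  # a literal "-" or A, B, C
print(digits, uppercase, dash_or_a_to_c)

# A negated class still needs one character to match.
negated_on_empty = re.fullmatch(r"[^0-9]", "")
negated_on_letter = re.fullmatch(r"[^0-9]", "x")
print(negated_on_empty, negated_on_letter)
```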

### Shortcuts
    \w    Word: [a-zA-Z0-9_]
    \W    Non Word Character
    \d    Digit: [0-9]
    \D    Non Digit
    \s    Whitespace: Space, Tab, Newline
    \S    Non Whitespace
    
`^` has a double meaning 😱: negation as the first character of a character class, anchor outside of one
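
The different roles of `^` can be seen side by side in a minimal sketch. The sample strings are made up.

```python
import re

# Outside a character class, ^ anchors the match at the start.
as_anchor = re.findall(r"^\d", "42 apples")
print(as_anchor)

# As the first character inside a class, ^ negates the class.
as_negation = re.findall(r"[^\d]", "4a2b")
print(as_negation)

# Anywhere else inside a class, ^ is just a literal caret.
as_literal = re.findall(r"[\d^]", "2^8")
print(as_literal)
```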


More on character classes: https://alexanderkosik.github.io/kickstart_regex/content/char_classes.html

# Anchors

Anchors match a position, **not** a character

    ^    Matches at the beginning of a line 😱
    $    Matches at the end of a line
    \b   Matches on a word boundary (beginning or end of a word)
         Matches, without consuming any characters, immediately between a character 
         matched by \w and a character not matched by \w (in either order)
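
A small sketch of how anchors restrict where a match may start or end. The sample text is made up; note that Python's `^` and `$` refer to the whole string unless `re.MULTILINE` is set.

```python
import re

text = "cat concatenate cat"

anywhere = re.findall(r"cat", text)         # also finds "cat" inside "concatenate"
whole_word = re.findall(r"\bcat\b", text)   # only the standalone words
at_start = re.findall(r"^cat", text)
at_end = re.findall(r"cat$", text)
print(len(anywhere), len(whole_word), len(at_start), len(at_end))

# \b matches positions only: every match is an empty string.
boundaries = [m.start() for m in re.finditer(r"\b", "hi you")]
print(boundaries)
```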

# Groups

- Groups are created with `()`
- 3 use cases for groups
    - Printing sub-matches
    - Bundling things together, e.g. for quantifiers
    - Alternation (sub-RegEx A, or B, or C, ...)

# Examples

## Printing sub-matches

In [4]:
import re
files = [
    "holiday1999.png",
    "invoice_car_insurance.pdf",
    "invoice_telekom2021.pdf",
    "resumee.pdf",
]

pattern = r"(\w+)\.([a-z]+)$"

for file in files:
    m = re.match(pattern, file)
    if m:
        print("filename:", m.group(1))
        print("ending:", m.group(2))
        print("")



filename: holiday1999
ending: png

filename: invoice_car_insurance
ending: pdf

filename: invoice_telekom2021
ending: pdf

filename: resumee
ending: pdf



## Bundling things together

In [6]:
import re

# How can we match "abc" 3 times
m = re.search(r"abcabcabc", "abcabcabc")
print(m.group())

# This looks simpler
m = re.search(r"(abc){3}", "abcabcabc")
print(m.group())

abcabcabc
abcabcabc


## Alternation
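
A minimal sketch of the third use case, assuming we only want to accept a few file endings. The pattern, the helper name `ending` and the filenames are made up for illustration.

```python
import re

# The group keeps the alternation local to the file ending.
pattern = r"\w+\.(png|jpe?g|pdf)$"

def ending(filename):
    """Return the matched ending, or None if the name is not accepted."""
    m = re.match(pattern, filename)
    return m.group(1) if m else None

for name in ["holiday1999.png", "scan.jpeg", "resumee.pdf", "notes.txt"]:
    print(name, "->", ending(name))
```

Without the parentheses, `\w+\.png|pdf` would be read as "either `\w+\.png` or `pdf`", which is why alternation usually goes inside a group.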

# IP address validator

### Different levels of RegEx "precision"

### 1) Imprecise! But sufficient for searching log files

In [34]:
!cat log_file_ip.txt

192.168.0.1      Error transmitting 3 Bytes
192.168.0.10     Error transmitting 3 Bytes
192.168.1.42     Error transmitting 2 Bytes
192.168.1.52     Error transmitting 2 Bytes
192.168.0.100    Error transmitting 8 Bytes
172.148.1.1      Error transmitting 2 Bytes
192.168.0.1      Error transmitting 1 Bytes
178.148.1.1      Error transmitting 2 Bytes
178.148.1.1      Error transmitting 1 Bytes
178.148.1.2      Error transmitting 1 Bytes
178.148.1.2      Error transmitting 6 Bytes


In [37]:
import re

# RegEx for matching 192.168.0.xxx with 1 to 3 digits in the last segment
# Matches every number between 0 and 999
pattern = r"192\.168\.0\.\d{1,3}"

# iterate over our file and print the row if the pattern matches
with open('log_file_ip.txt') as f:
    for row in f:
        if re.match(pattern, row):
            print("Error detected:", row, end="")

Error detected: 192.168.0.1      Error transmitting 3 Bytes
Error detected: 192.168.0.10     Error transmitting 3 Bytes
Error detected: 192.168.0.100    Error transmitting 8 Bytes
Error detected: 192.168.0.1      Error transmitting 1 Bytes


### 2) Also imprecise! But too imprecise and potentially dangerous! 😈

- Suppose we only want to find IP addresses whose last segment is < 10 (`[0-9]`)

In [43]:
!cat log_file_ip.txt |grep -o '192\.168\.0\.[0-9]'

192.168.0.1
192.168.0.1
192.168.0.1
192.168.0.1
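
The output above reports `192.168.0.1` four times, although only two rows really end in `.1`: the pattern also matches the beginning of `.10` and `.100`. One possible fix, sketched here in Python with three of the log rows, is a word boundary `\b` after the digit. This is an illustration, not necessarily the fix the author had in mind.

```python
import re

rows = [
    "192.168.0.1      Error transmitting 3 Bytes",
    "192.168.0.10     Error transmitting 3 Bytes",
    "192.168.0.100    Error transmitting 8 Bytes",
]

loose = r"192\.168\.0\.[0-9]"      # stops after the first digit, whatever follows
strict = r"192\.168\.0\.[0-9]\b"   # the digit must be the end of the number

loose_hits = [row.split()[0] for row in rows if re.match(loose, row)]
strict_hits = [row.split()[0] for row in rows if re.match(strict, row)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```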


### 3) Precise. But maybe too complex? 😱

In [60]:
import re

def exact_ip_validator(ip_address):
    m = re.match(r"192\.168\.1\.(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])$", ip_address)
    #                           ^                             ^
    #                          this group has to be repeated for every segment to verify a generic IP address
    
    return m is not None

assert exact_ip_validator("192.168.1.") is False

assert exact_ip_validator("192.168.1.1") is True
assert exact_ip_validator("192.168.1.11") is True
assert exact_ip_validator("192.168.1.99") is True
assert exact_ip_validator("192.168.1.111") is True
assert exact_ip_validator("192.168.1.199") is True
assert exact_ip_validator("192.168.1.200") is True
assert exact_ip_validator("192.168.1.255") is True

assert exact_ip_validator("192.168.1.256") is False
assert exact_ip_validator("192.168.1.999") is False
assert exact_ip_validator("192.168.1.x") is False
assert exact_ip_validator("192.168.1.xx") is False
assert exact_ip_validator("192.168.1.xxx") is False
print("Good RegEx!")





Good RegEx!


### 4) KISS. Outside the terminal I probably don't need a RegEx

In [65]:
def exact_ip_validator(ip_address):
    try:
        segments = [int(seg) for seg in ip_address.split(".")]
    except ValueError:
        return False
    # Exactly four segments, each between 0 and 255, and not all zero
    return (
        len(segments) == 4
        and all(0 <= seg <= 255 for seg in segments)
        and any(segments)
    )
    
assert exact_ip_validator("192.168.1.") is False
assert exact_ip_validator("0.0.0.0") is False     # <--- additional validation

assert exact_ip_validator("192.168.1.1") is True
assert exact_ip_validator("192.168.1.11") is True
assert exact_ip_validator("192.168.1.99") is True
assert exact_ip_validator("192.168.1.111") is True
assert exact_ip_validator("192.168.1.199") is True
assert exact_ip_validator("192.168.1.200") is True
assert exact_ip_validator("192.168.1.255") is True

assert exact_ip_validator("192.168.1.256") is False
assert exact_ip_validator("192.168.1.999") is False
assert exact_ip_validator("192.168.1.x") is False
assert exact_ip_validator("192.168.1.xx") is False
assert exact_ip_validator("192.168.1.xxx") is False
print("Good RegEx!")

Good RegEx!
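
In the same KISS spirit, Python's standard library already ships an `ipaddress` module. The sketch below is an alternative, not the validator from the cell above; the function name `is_valid_ipv4` is made up. Unlike the cell above, it treats `0.0.0.0` as a valid address.

```python
import ipaddress

def is_valid_ipv4(ip_address):
    """Return True if the string parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(ip_address)
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

for candidate in ["192.168.1.1", "192.168.1.256", "192.168.1.", "192.168.1.x"]:
    print(candidate, "->", is_valid_ipv4(candidate))
```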


# Best Practices

- Online tools
    - [RegEx101](https://regex101.com)
    - [Regexr](https://regexr.com)

# Verbose Mode

In [70]:
import re

def exact_ip_validator(ip_address):
    m = re.match(r"""
        192\.168\.1\. # Match string literal 192.168.1.
        (            # New group used for alternation
            \d       # Match 0-9
            |        # OR
            \d\d     # Match 10-99
            |        # OR
            1\d\d    # Match 100-199
            |        # OR
            2[0-4]\d # Match 200-249
            |        # OR
            25[0-5]  # Match 250-255
        )            # End group
        $            # EOL
    """, ip_address, re.VERBOSE)
    
    return m is not None

assert exact_ip_validator("192.168.1.") is False
...
print("Good Regex")

Good Regex
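
One thing to keep in mind with `re.VERBOSE`: unescaped whitespace in the pattern is ignored, so a backslash directly followed by a space means "literal space". A minimal sketch with made-up patterns:

```python
import re

# Unescaped whitespace is ignored in verbose mode ...
ignored_space = re.fullmatch(r"a b", "ab", re.VERBOSE)
# ... so this pattern does not match a string that contains the space.
no_literal_space = re.fullmatch(r"a b", "a b", re.VERBOSE)

# A literal space has to be escaped or put into a character class.
escaped_space = re.fullmatch(r"a\ b", "a b", re.VERBOSE)
class_space = re.fullmatch(r"a[ ]b", "a b", re.VERBOSE)

print(bool(ignored_space), bool(no_literal_space))
print(bool(escaped_space), bool(class_space))
```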


# How precise do I need to be?

- The more precise the RegEx, the more complex it gets
- How much precision is really necessary? It depends ... see the IP address example

# Greediness 🤑

- All quantifiers are greedy by default
- They take everything they can get
- They only give something back when they have to
- `.*` inside a RegEx is always a potential source of errors

## Greediness example

- We want to find all HTML tags that start with `h`


In [86]:
# This is our example html homepage
html = """
<!DOCTYPE html>
<html>
    <head>
        <title>Example Page</title>
    </head>
    <body>
        <h1>Hello World</h1>
        <h3>This is some smaller heading</h3>
    </body>
</html>
"""

# we now want to find every occurrence of <html-tags> starting with a lower case `h`.
# Our pattern might look like this:
pattern = r"<h.*>"

# so we have a literal `<` followed by a literal `h`
# followed by any number of arbitrary characters (to match html, head, h1, ...)
# and a literal `>`

# we use findall to find all results
# have a look at the results. Is this what we wanted?!
print(re.findall(pattern, html))

['<html>', '<head>', '<h1>Hello World</h1>', '<h3>This is some smaller heading</h3>']


- The greediness of a quantifier can be switched to `non-greedy`
- By appending `?` to the `*` (careful: not to be confused with the `?` quantifier)
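
A sketch of the non-greedy variant on a shortened version of the example page. The negated character class `[^>]*` is shown as a common alternative; it is an addition, not something the cell above uses.

```python
import re

html = """
<html>
    <head>
        <title>Example Page</title>
    </head>
    <body>
        <h1>Hello World</h1>
        <h3>This is some smaller heading</h3>
    </body>
</html>
"""

greedy = re.findall(r"<h.*>", html)      # runs to the last ">" on each line
lazy = re.findall(r"<h.*?>", html)       # stops at the first ">"
negated = re.findall(r"<h[^>]*>", html)  # anything but ">" up to the first ">"

print(greedy)
print(lazy)
print(negated)
```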

# Flavors

# More examples

# Thousands separators

    - 1000 -> 1.000
    - 83240000 -> 83.240.000
    - 70123 -> 70.123
    
    The content is to be replaced in one very large file or in several files.
    How to proceed? (Brainstorming) 🧠
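
One idea for the brainstorming, sketched with lookarounds: insert a `.` at every position that has a digit before it and a multiple of three digits after it. The function name `add_thousands_separator` is made up, and the sketch deliberately ignores decimal places, years and IDs, which it would also change.

```python
import re

def add_thousands_separator(text):
    """Insert '.' between groups of three digits, counted from the right."""
    # (?<=\d)            a digit must come directly before the position
    # (?=(?:\d{3})+(?!\d))  followed by 3, 6, 9, ... digits and then no further digit
    return re.sub(r"(?<=\d)(?=(?:\d{3})+(?!\d))", ".", text)

for number in ["1000", "83240000", "70123", "999"]:
    print(number, "->", add_thousands_separator(number))
```

Because the pattern only matches positions, `re.sub` inserts the separator without replacing any digit. For large files the function could be applied line by line.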

# Detecting duplicated words
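
A possible starting point, assuming "duplicated" means the same word twice in a row: capture a word in a group and ask for the same text again with the backreference `\1`. The function name `find_duplicated_words` and the sample sentences are made up.

```python
import re

def find_duplicated_words(text):
    """Return every word that is immediately followed by itself."""
    # \b(\w+)  a whole word, captured as group 1
    # \s+      whitespace between the two words
    # \1\b     the same text again, ending at a word boundary
    pattern = r"\b(\w+)\s+\1\b"
    return [m.group(1) for m in re.finditer(pattern, text, re.IGNORECASE)]

print(find_duplicated_words("This is is a test test."))
print(find_duplicated_words("The the end"))
print(find_duplicated_words("this island is fine"))
```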

# Building our own RegEx engine