### Regular Expressions

Regular expressions are patterns used to match character combinations in strings.

In [48]:
import re

### BackSlash "\\" Confusion
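- In regular expressions (regex), the backslash has a special meaning: it starts a special sequence or escapes a metacharacter.
- Python string literals also treat the backslash as an escape character, so each backslash meant for the regex engine must be doubled unless a raw string (`r"..."`) is used.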


In [49]:
print('Python String:', "\\n", 'Raw String:', r"\n")
print('Python String:', "\\\\", 'Raw String:', r"\\\\")

Python String: \n Raw String: \n
Python String: \\ Raw String: \\\\


In [50]:
a = '\tHello'
b = r'\tHello'
print(a)
print(b)

	Hello
\tHello


The next cell raises an error. The Python string literal `"\\"` is reduced to a single backslash before the regex engine sees it, and a lone backslash is an incomplete escape sequence in regex, so `re` raises an error instead of matching.

In [51]:
text = r"The \ rain in \ Spain."  # raw string: "\ " is not a valid escape in a normal string literal
print(re.findall("\\", text))

error: bad escape (end of pattern) at position 0

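To match a single literal backslash using a normal Python string, the pattern must be written as `"\\\\"`: Python reduces it to `\\`, which the regex engine reads as an escaped (literal) backslash. The raw string `r"\\"` produces the same pattern and is easier to read, so raw strings are the recommended way to write regex patterns.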
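
As a quick check of the explanation above, the sketch below compares the two spellings of the pattern. The variable names `plain` and `raw` are only illustrative.

```python
import re

plain = "\\\\"   # normal string: Python turns four typed backslashes into two
raw = r"\\"      # raw string: the two typed backslashes are kept as they are

# Both spellings hand the same two-character pattern to the regex engine
print(plain == raw, len(plain))

# The pattern matches one literal backslash in the text
found = re.findall(raw, r"a\b")
print(found)
```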

In [52]:
text = r"The \ rain in \ Spain."  # raw string keeps the backslashes without an invalid-escape warning
print(re.findall("\\\\", text))

['\\', '\\']


In [53]:
text = r"The \ rain in \ Spain."  # raw string keeps the backslashes without an invalid-escape warning
print(re.findall(r"\\", text))

['\\', '\\']


Regular expressions can contain both special and ordinary characters.
- Ordinary characters match themselves: 'A', 'a', '0'
- Special characters change how the pattern is interpreted: '|', '('

### Important Functions
- match(): Determines whether the RE matches at the beginning of the string.
- search(): Scans through the string and returns the first location where the RE matches.
- findall(): Finds all substrings where the RE matches and returns them as a list.
- finditer(): Finds all substrings where the RE matches and returns them as an iterator of match objects.
- split(): Returns a list where the string has been split at each match.
- sub(): Replaces one or more matches with a string.

In [54]:
# finditer()
my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')
matches = pattern.finditer(my_string)
print(matches)  # returns an iterator object
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<callable_iterator object at 0x000001B2CEF6B040>
<re.Match object; span=(3, 6), match='123'>
(3, 6) 3 6
123
<re.Match object; span=(9, 12), match='123'>
(9, 12) 9 12
123


In [55]:
# findall()
pattern = re.compile(r'123')
matches = pattern.findall(my_string)
print(matches)  # returns a list of strings
for match in matches:
    print(match)

['123', '123']
123
123


In [56]:
# match
pattern = re.compile(r'123')
match = pattern.match(my_string)
print(match)
pattern = re.compile(r'abc')
match = pattern.match(my_string)
print(match)
match

None
<re.Match object; span=(0, 3), match='abc'>


<re.Match object; span=(0, 3), match='abc'>

In [57]:
# search
match = pattern.search(my_string)
match

<re.Match object; span=(0, 3), match='abc'>
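
The `search` cell above reuses the `abc` pattern, which happens to match at position 0, so it does not show how `search()` differs from `match()`. A minimal sketch of the contrast, using the same sample string with the `123` pattern:

```python
import re

my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')

# match() only tries position 0, where the string starts with 'abc'
at_start = pattern.match(my_string)

# search() scans forward and stops at the first occurrence
anywhere = pattern.search(my_string)

print(at_start)          # None
print(anywhere.span())   # (3, 6)
```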

### Methods on a Match object
- group(): Return the string matched by the RE
- start(): Return the starting position of the match
- end(): Return the ending position of the match
- span(): Return a tuple containing the (start, end) positions of the match

In [58]:
test_string = '123abc456789abc123ABC'
pattern = re.compile(r'abc')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the substring that was matched by the RE

<re.Match object; span=(3, 6), match='abc'>
(3, 6) 3 6
abc
<re.Match object; span=(12, 15), match='abc'>
(12, 15) 12 15
abc


### Meta characters
Metacharacters are characters with a special meaning in regex.

**All meta characters:** `. ^ $ * + ? { } [ ] \ | ( )`

- Meta characters need to be escaped (with `\`) if we actually want to search for the literal character.

- `.` Any character (except newline) → `"he..o"`
- `^` Starts with → `"^hello"`
- `$` Ends with → `"world$"`
- `*` Zero or more occurrences → `"aix*"`
- `+` One or more occurrences → `"aix+"`
- `?` Zero or one occurrence → `"aix?"`
- `{}` Exactly the specified number of occurrences → `"al{2}"`
- `[]` A set of characters → `"[a-m]"`
- `\` Signals a special sequence (can also be used to escape special characters) → `r"\d"` (or `"\\d"`)
- `|` Either or → `"falls|stays"`
- `()` Capture and group
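
A short sketch of several of the metacharacters listed above. The sample strings are made up for illustration.

```python
import re

text = "hello world"

any_char = re.findall(r"he..o", text)       # '.' stands for any single character
starts = re.search(r"^hello", text)         # '^' anchors the match at the start
ends = re.search(r"world$", text)           # '$' anchors the match at the end
either = re.findall(r"hello|world", text)   # '|' accepts either alternative

# '*' allows zero or more 'x', '+' requires at least one
star = re.findall(r"aix*", "ai aix aixx")
plus = re.findall(r"aix+", "ai aix aixx")

# Escaped metacharacters are matched literally
literal = re.findall(r"\$\d\.\d\d", "Total: $5.00 (approx.)")

print(any_char, either)
print(star, plus)
print(literal)
```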



### Special Sequences in Regex

A *special sequence* is a backslash (\) followed by one of the characters below, and it has a special meaning.

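- \d : Matches any decimal digit (same as [0-9] for ASCII text; without the ASCII flag it also matches other Unicode digits)
- \D : Matches any non-digit character (the opposite of \d)
- \s : Matches any whitespace character (space, tab, newline, and so on)
- \S : Matches any non-whitespace character
- \w : Matches any alphanumeric (word) character or underscore (same as [a-zA-Z0-9_] for ASCII text; without the ASCII flag it also matches Unicode letters and digits)
- \W : Matches any non-word character (the opposite of \w)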

- \b : Word boundary (beginning or end of a word)  
      Example: r"\bain" , r"ain\b"

- \B : Not a word boundary (NOT at beginning or end of a word)  
      Example: r"\Bain" , r"ain\B"

- \A : Matches if the specified characters are at the beginning of the string  
      Example: r"\AThe"

- \Z : Matches if the specified characters are at the end of the string  
      Example: r"Spain\Z"


In [66]:
test_string = 'hello 123_ heyho hohey'
pattern = re.compile(r'\d')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>


In [67]:
pattern = re.compile(r'\s')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(5, 6), match=' '>
<re.Match object; span=(10, 11), match=' '>
<re.Match object; span=(16, 17), match=' '>


In [68]:
pattern = re.compile(r'\w')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>


In [69]:
pattern = re.compile(r'\bhey')
matches = pattern.finditer('heyho hohey') # the 'hey' in 'ho-hey' or 'ho\nhey' would also match
for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='hey'>


In [70]:
pattern = re.compile(r'\Ahello')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='hello'>


In [72]:
pattern = re.compile(r'hey\Z')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(19, 22), match='hey'>
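
The cells above cover `\d`, `\s`, `\w`, `\b`, `\A` and `\Z`. The negated forms work the same way. A small sketch, with a made-up sample string:

```python
import re

sample = 'a1 b_2!'

non_digits = re.findall(r'\D', sample)   # everything that is not a digit
non_space = re.findall(r'\S', sample)    # everything that is not whitespace
non_word = re.findall(r'\W', sample)     # neither alphanumeric nor underscore

# \B requires that 'hey' does NOT start at a word boundary,
# so only the 'hey' inside 'hohey' qualifies
inner = re.search(r'\Bhey', 'heyho hohey')

print(non_digits)
print(non_space)
print(non_word)
print(inner.span())
```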


### Sets in Regex
A set is a group of characters inside square brackets [ ] with a special meaning.
You can append multiple conditions back-to-back, e.g. [a-zA-Z].

- A ^ (caret) inside a set negates the expression.
- A - (dash) inside a set specifies a range if it is in between; otherwise it represents the dash itself.

---

### Examples

- [arn]  
  Returns a match where one of the specified characters (a, r, or n) is present.

- [a-n]  
  Returns a match for any lowercase character alphabetically between a and n.

- [^arn]  
  Returns a match for any character EXCEPT a, r, and n.

- [0123]  
  Returns a match where any of the specified digits (0, 1, 2, or 3) is present.

- [0-9]  
  Returns a match for any digit between 0 and 9.

- [0-5][0-9]  
  Returns a match for any two-digit number from 00 to 59.

- [a-zA-Z]  
  Returns a match for any character alphabetically between a and z, lowercase OR uppercase.
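
A few of the sets above, run against small made-up strings:

```python
import re

listed = re.findall(r"[arn]", "rain")               # only a, r or n
negated = re.findall(r"[^arn]", "rain")             # anything except a, r, n
two_digit = re.findall(r"[0-5][0-9]", "07 45 61 99")  # 00 to 59 only
with_dash = re.findall(r"[a-c-]", "a-d")            # trailing '-' is a literal dash

print(listed)
print(negated)
print(two_digit)
print(with_dash)
```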


In [84]:
test_string = 'hello 123_'
pattern = re.compile(r'[a-z]')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>


In [85]:
dates = '''
01.04.2020
2020.04.01
2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11
2020/04/02
2020_04_04
2020_04_04
'''

print('all dates with a character in between')
pattern = re.compile(r'\d\d\d\d.\d\d.\d\d')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

all dates with a character in between
<re.Match object; span=(12, 22), match='2020.04.01'>
<re.Match object; span=(23, 33), match='2020-04-01'>
<re.Match object; span=(34, 44), match='2020-05-23'>
<re.Match object; span=(45, 55), match='2020-06-11'>
<re.Match object; span=(56, 66), match='2020-07-11'>
<re.Match object; span=(67, 77), match='2020-08-11'>
<re.Match object; span=(78, 88), match='2020/04/02'>
<re.Match object; span=(89, 99), match='2020_04_04'>
<re.Match object; span=(100, 110), match='2020_04_04'>


In [86]:
print('only dates with - or . in between in May or June')
pattern = re.compile(r'\d\d\d\d[-.]0[56][-.]\d\d')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

only dates with - or . in between in May or June
<re.Match object; span=(34, 44), match='2020-05-23'>
<re.Match object; span=(45, 55), match='2020-06-11'>


In [87]:
print('only dates with - or . in between in May, June, July')
pattern = re.compile(r'\d\d\d\d[-.]0[5-7][-.]\d\d') #  no escape for the . here in the set
matches = pattern.finditer(dates)
for match in matches:
    print(match)

only dates with - or . in between in May, June, July
<re.Match object; span=(34, 44), match='2020-05-23'>
<re.Match object; span=(45, 55), match='2020-06-11'>
<re.Match object; span=(56, 66), match='2020-07-11'>


### Quantifiers

Quantifiers specify *how many times* a character, group, or pattern should occur.

- `*` : 0 or more occurrences
- `+` : 1 or more occurrences
- `?` : 0 or 1 occurrence (optional)
- `{4}` : exactly 4 occurrences
- `{4,6}` : between 4 and 6 occurrences (min = 4, max = 6)
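
A minimal sketch of the `{4}` and `{4,6}` forms. The sample string is made up, and `\b` is added on both sides so that only whole numbers of the right length count.

```python
import re

codes = "12 1234 12345 1234567"

# exactly four digits between word boundaries
exact = re.findall(r"\b\d{4}\b", codes)

# four to six digits between word boundaries
ranged = re.findall(r"\b\d{4,6}\b", codes)

print(exact)
print(ranged)
```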


In [88]:
my_string = 'hello_123'
pattern = re.compile(r'\d*')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(1, 1), match=''>
<re.Match object; span=(2, 2), match=''>
<re.Match object; span=(3, 3), match=''>
<re.Match object; span=(4, 4), match=''>
<re.Match object; span=(5, 5), match=''>
<re.Match object; span=(6, 9), match='123'>
<re.Match object; span=(9, 9), match=''>


In [89]:
pattern = re.compile(r'\d+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 9), match='123'>


In [90]:
my_string = 'hello_1_2-3'
pattern = re.compile(r'_?\d')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(5, 7), match='_1'>
<re.Match object; span=(7, 9), match='_2'>
<re.Match object; span=(10, 11), match='3'>


In [91]:
my_string = '2020-04-01'
pattern = re.compile(r'\d{4}') # or if you need a range r'\d{3,5}'
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='2020'>


In [92]:
dates = '''
2020.04.01
2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11
2020/04/02
2020_04_04
2020_04_04
'''
pattern = re.compile(r'\d{4}.\d{2}.\d{2}')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='2020.04.01'>
<re.Match object; span=(13, 23), match='2020-04-01'>
<re.Match object; span=(24, 34), match='2020-05-23'>
<re.Match object; span=(35, 45), match='2020-06-11'>
<re.Match object; span=(46, 56), match='2020-07-11'>
<re.Match object; span=(57, 67), match='2020-08-11'>
<re.Match object; span=(69, 79), match='2020/04/02'>
<re.Match object; span=(81, 91), match='2020_04_04'>
<re.Match object; span=(92, 102), match='2020_04_04'>


In [93]:
pattern = re.compile(r'\d+.\d+.\d+')
matches = pattern.finditer(dates)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='2020.04.01'>
<re.Match object; span=(13, 23), match='2020-04-01'>
<re.Match object; span=(24, 34), match='2020-05-23'>
<re.Match object; span=(35, 45), match='2020-06-11'>
<re.Match object; span=(46, 56), match='2020-07-11'>
<re.Match object; span=(57, 67), match='2020-08-11'>
<re.Match object; span=(69, 79), match='2020/04/02'>
<re.Match object; span=(81, 91), match='2020_04_04'>
<re.Match object; span=(92, 102), match='2020_04_04'>


### Conditions
Use `|` to match either of two alternatives (an either-or condition).

In [94]:
my_string = """
Mr Simpson
Mrs Simpson
Mr. Brown
Ms Smith
Mr. T
"""
pattern = re.compile(r'Mr\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mr Simpson'>
<re.Match object; span=(24, 33), match='Mr. Brown'>
<re.Match object; span=(43, 48), match='Mr. T'>


In [95]:
pattern = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(1, 11), match='Mr Simpson'>
<re.Match object; span=(12, 23), match='Mrs Simpson'>
<re.Match object; span=(24, 33), match='Mr. Brown'>
<re.Match object; span=(34, 42), match='Ms Smith'>
<re.Match object; span=(43, 48), match='Mr. T'>


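### Grouping
Parentheses `( )` group parts of a pattern. The substring matched by each group can be retrieved separately with `group(n)`, where `group(0)` is the whole match.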

In [97]:
emails = """
pythonengineer@gmail.com
Python-engineer@gmx.de
python-engineer123@my-domain.org
"""
# the patterns below are successive refinements; only the last assignment is used
pattern = re.compile(r'[a-zA-Z0-9-]+@[a-zA-Z-]+\.[a-zA-Z]+')
pattern = re.compile(r'[a-zA-Z0-9-]+@[a-zA-Z-]+\.(com|de)')
pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')
matches = pattern.finditer(emails)
for match in matches:
    print(match)
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

<re.Match object; span=(1, 25), match='pythonengineer@gmail.com'>
pythonengineer@gmail.com
pythonengineer
gmail
com
<re.Match object; span=(26, 48), match='Python-engineer@gmx.de'>
Python-engineer@gmx.de
Python-engineer
gmx
de
<re.Match object; span=(49, 81), match='python-engineer123@my-domain.org'>
python-engineer123@my-domain.org
python-engineer123
my-domain
org
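
When every group is needed at once, the match object's `groups()` method returns them as a tuple, and `group()` accepts several indices. A short sketch using one of the addresses above (the variable names are illustrative):

```python
import re

pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')
match = pattern.search('python-engineer123@my-domain.org')

# groups() returns all captured groups in order
user, domain, tld = match.groups()

# group() with several indices returns a tuple of just those groups
user_and_tld = match.group(1, 3)

print(user, domain, tld)
print(user_and_tld)
```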


### Modifying strings
- split(): Split the string into a list, splitting it wherever the RE matches
- sub(): Find all substrings where the RE matches, and replace them with a different string

In [98]:
my_string = 'abc123ABCDEF123abc'
pattern = re.compile(r'123')
matches = pattern.split(my_string)
print(matches)

['abc', 'ABCDEF', 'abc']


In [99]:
my_string = "hello world, you are the best world"
pattern = re.compile(r'world')
subbed_string = pattern.sub(r'planet', my_string)
print(subbed_string)

hello planet, you are the best planet


In [100]:
urls = """
http://python-engineer.com
https://www.python-engineer.org
http://www.pyeng.net
"""
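# first attempt: the repeated group (\w|-)+ keeps only its last character
pattern = re.compile(r'https?://(www\.)?(\w|-)+\.\w+')
# refined: one group for the domain name and one for the top-level domain
pattern = re.compile(r'https?://(www\.)?([a-zA-Z-]+)(\.\w+)')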
matches = pattern.finditer(urls)
for match in matches:
    #print(match)
    print(match.group()) # 0
    #print(match.group(1))
    #print(match.group(2))
    print(match.group(3))

# use backreferences \2 and \3 to replace each URL with its domain name and top-level domain
subbed_urls = pattern.sub(r'\2\3', urls)
print(subbed_urls)

http://python-engineer.com
.com
https://www.python-engineer.org
.org
http://www.pyeng.net
.net

python-engineer.com
python-engineer.org
pyeng.net



### Compilation Flags
- ASCII, A : Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
- DOTALL, S : Make . match any character, including newlines.
- IGNORECASE, I : Do case-insensitive matches.
- LOCALE, L : Do a locale-aware match.
- MULTILINE, M : Multi-line matching, affecting ^ and $.
- VERBOSE, X (for ‘extended’) : Enable verbose REs, which can be organized more cleanly and understandably.

In [101]:
my_string = "Hello World"
pattern = re.compile(r'world', re.IGNORECASE) # No match without I flag
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

my_string = '''
hello
cool
Hello
'''
# line starts with ...
pattern = re.compile(r'^[a-z]', re.MULTILINE) # No match without M flag
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(6, 11), match='World'>
<re.Match object; span=(1, 2), match='h'>
<re.Match object; span=(7, 8), match='c'>
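
Two of the other flags listed above, DOTALL and VERBOSE, can be sketched the same way. The sample strings and the name `date_pattern` are made up for illustration.

```python
import re

text = "first line\nsecond line"

# Without DOTALL, '.' stops at the newline, so there is no match
no_flag = re.search(r'first.*second', text)

# With DOTALL, '.' also matches '\n'
with_flag = re.search(r'first.*second', text, re.DOTALL)

# VERBOSE ignores whitespace in the pattern and allows '#' comments
date_pattern = re.compile(r'''
    \d{4}   # year
    [-.]    # separator
    \d{2}   # month
    [-.]    # separator
    \d{2}   # day
''', re.VERBOSE)
found_dates = date_pattern.findall('2020-04-01 and 2020.05.23')

print(no_flag)
print(with_flag.group())
print(found_dates)
```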
