### Regular Expressions

Regular expressions are patterns used to match character combinations in strings.

In [16]:
import re

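### Backslash "\\" Confusion
- In regular expressions (regex), the backslash has a special meaning: it introduces special sequences and escapes metacharacters.
- Python string literals also treat the backslash as an escape character, so a backslash in a pattern can be interpreted twice: first by Python, then by the regex engine.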


In [12]:
print('Python String:', "\\n", 'Raw String:', r"\n")
print('Python String:', "\\\\", 'Raw String:', r"\\\\")

Python String: \n Raw String: \n
Python String: \\ Raw String: \\\\


In [28]:
a = '\tHello'
b = r'\tHello'
print(a)
print(b)

	Hello
\tHello


The code below raises an error. Python converts the string literal `"\\"` into a single backslash `\`, and a lone backslash is an incomplete escape sequence in regex, so the pattern fails to compile instead of matching.

In [25]:
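text = r"The \ rain in \ Spain."  # raw string: "\ " is an invalid escape in a normal literal
print(re.findall("\\", text))  # the pattern is a single backslash, so this raises re.error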

error: bad escape (end of pattern) at position 0

To match a literal backslash using a normal Python string, the pattern must be written as `"\\\\"`: Python reduces it to `\\`, which the regex engine reads as an escaped, literal backslash. A raw string (`r"\\"`) is recommended because it avoids this double escaping.

In [26]:
text = r"The \ rain in \ Spain."
print(re.findall("\\\\", text))

['\\', '\\']


In [27]:
text = r"The \ rain in \ Spain."
print(re.findall(r"\\", text))

['\\', '\\']
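
To see why both spellings behave the same, it can help to compare the strings Python actually hands to the regex engine. The following is a minimal sketch; the variable names `normal_pattern` and `raw_pattern` are illustrative and not part of the original notebook.

```python
import re

# Both literals produce the same two-character pattern: backslash, backslash.
normal_pattern = "\\\\"   # normal string: each escaped pair collapses to one backslash
raw_pattern = r"\\"       # raw string: backslashes are kept exactly as written

print(normal_pattern == raw_pattern)  # True
print(len(raw_pattern))               # 2

# The regex engine reads those two characters as one escaped (literal) backslash.
text = r"The \ rain in \ Spain."
print(len(re.findall(raw_pattern, text)))  # 2
```

The raw string is shorter and shows the pattern exactly as the regex engine will see it, which is the main reason it is usually preferred.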


Regular expressions can contain both ordinary and special characters.
- Ordinary characters match themselves: 'A', 'a', '0'
- Special characters (metacharacters) change how the pattern is interpreted: '|', '('
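
As a small illustration of the difference, the sketch below uses the same character, `|`, once as a special character and once as an ordinary one. The sample string `"a|b"` is an assumption chosen for the example.

```python
import re

text = "a|b"

# Unescaped, "|" means alternation: match "a" or "b".
print(re.findall(r"a|b", text))   # ['a', 'b']

# Escaped, "\|" is an ordinary character: match the literal text "a|b".
print(re.findall(r"a\|b", text))  # ['a|b']
```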

### Important Functions
- match(): Determines whether the RE matches at the beginning of the string.
- search(): Scans through the string and returns the first location where the RE matches.
- findall(): Finds all non-overlapping substrings where the RE matches and returns them as a list.
- finditer(): Finds all non-overlapping substrings where the RE matches and returns them as an iterator of Match objects.
- split(): Returns a list where the string has been split at each match.
- sub(): Replaces one or more matches with a replacement string.
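
A short sketch of `split()` and `sub()`, using the module-level `re.split` and `re.sub` functions. The variable names (`sample`, `parts`, `replaced`, `replaced_once`) and the replacement string `"-"` are chosen only for illustration.

```python
import re

sample = "abc123ABC123abc"

# split(): cut the string at every match of the pattern.
parts = re.split(r"123", sample)
print(parts)          # ['abc', 'ABC', 'abc']

# sub(): replace every match with the replacement string.
replaced = re.sub(r"123", "-", sample)
print(replaced)       # abc-ABC-abc

# The optional count argument limits how many matches are replaced.
replaced_once = re.sub(r"123", "-", sample, count=1)
print(replaced_once)  # abc-ABC123abc
```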

In [31]:
# finditer()
my_string = 'abc123ABC123abc'
pattern = re.compile(r'123')
matches = pattern.finditer(my_string)
print(matches)  # returns an iterator object
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<callable_iterator object at 0x000001B2D0596490>
<re.Match object; span=(3, 6), match='123'>
(3, 6) 3 6
123
<re.Match object; span=(9, 12), match='123'>
(9, 12) 9 12
123


In [34]:
# findall()
pattern = re.compile(r'123')
matches = pattern.findall(my_string)
print(matches)  # returns a list of strings
for match in matches:
    print(match)

['123', '123']
123
123


In [43]:
# match
pattern = re.compile(r'123')
match = pattern.match(my_string)
print(match)  # None: '123' does not occur at the start of the string
pattern = re.compile(r'abc')
match = pattern.match(my_string)
print(match)  # matches: the string starts with 'abc'
match

None
<re.Match object; span=(0, 3), match='abc'>


<re.Match object; span=(0, 3), match='abc'>

In [46]:
# search (pattern is still the compiled r'abc' from the previous cell)
match = pattern.search(my_string)
match

<re.Match object; span=(0, 3), match='abc'>
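
Because `'abc'` sits at the start of the string, `match()` and `search()` return the same result for it. A pattern that occurs later in the string shows the difference more clearly. A minimal sketch, with `sample`, `digits`, and `found` as illustrative names:

```python
import re

sample = "abc123ABC123abc"
digits = re.compile(r"123")

# match() only tries position 0, where the string has 'abc', so it fails.
print(digits.match(sample))          # None

# search() scans forward and returns the first match it finds.
found = digits.search(sample)
print(found.span(), found.group())   # (3, 6) 123
```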

### Methods on a Match object
- group(): Return the string matched by the RE
- start(): Return the starting position of the match
- end(): Return the ending position of the match
- span(): Return a tuple containing the (start, end) positions of the match

In [47]:
test_string = '123abc456789abc123ABC'
pattern = re.compile(r'abc')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the substring that was matched by the RE

<re.Match object; span=(3, 6), match='abc'>
(3, 6) 3 6
abc
<re.Match object; span=(12, 15), match='abc'>
(12, 15) 12 15
abc


### Meta characters
Metacharacters are characters with a special meaning in regex.

**All meta characters:** `. ^ $ * + ? { } [ ] \ | ( )`

- Meta characters need to be escaped (with `\`) if we actually want to search for the literal character.

- `.` Any character (except newline) → `"he..o"`
- `^` Starts with → `"^hello"`
- `$` Ends with → `"world$"`
- `*` Zero or more occurrences → `"aix*"`
- `+` One or more occurrences → `"aix+"`
- `?` Zero or one occurrence → `"aix?"`
- `{}` Exactly the specified number of occurrences (or a range, as in `{2,4}`) → `"al{2}"`
- `[]` A set of characters → `"[a-m]"`
- `\` Signals a special sequence (can also be used to escape special characters) → `r"\d"` (or `"\\d"`)
- `|` Either or → `"falls|stays"`
- `()` Capture and group → `"(abc)+"`
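
The metacharacters above can be tried out with a few one-line checks. This is a rough sketch: most patterns come from the list, while the sample strings and the `?`, `()`, and escaping examples are assumptions made for illustration.

```python
import re

print(re.findall(r"he..o", "hello world"))          # ['hello']
print(bool(re.search(r"^hello", "hello world")))    # True
print(bool(re.search(r"world$", "hello world")))    # True
print(re.findall(r"aix*", "ai aix aixx"))           # ['ai', 'aix', 'aixx']
print(re.findall(r"aix+", "ai aix aixx"))           # ['aix', 'aixx']
print(re.findall(r"aix?", "ai aix aixx"))           # ['ai', 'aix', 'aix']
print(re.findall(r"al{2}", "al all alll"))          # ['all', 'all']
print(re.findall(r"[a-m]", "regex"))                # ['e', 'g', 'e']
print(re.findall(r"\d", "a1b22"))                   # ['1', '2', '2']
print(re.findall(r"falls|stays", "it falls or stays"))  # ['falls', 'stays']

# With groups, findall() returns the captured parts instead of the whole match.
print(re.findall(r"(\d+)-(\d+)", "10-20 and 3-4"))  # [('10', '20'), ('3', '4')]

# Escaping: "." as a metacharacter versus a literal dot.
print(re.findall(r"a.c", "abc a.c"))                # ['abc', 'a.c']
print(re.findall(r"a\.c", "abc a.c"))               # ['a.c']
```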

