# Regular Expressions in Python

[regexplained](https://regexplained.com/) (presentations)

### Introduction

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module.

Using this little language, you specify the rules for the set of possible strings that you want to match. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.



**Objectives**:

- Understand the concept and purpose of regex.
- Learn the common characters and metacharacters used in regex patterns.
- Use Python's `re` module for regex matching and substitution.

Typical use cases for regexes include text searching, data validation, and text manipulation.

In [None]:
import re
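The questions from the introduction map fairly directly onto functions in the `re` module. The following is a minimal sketch, not an exhaustive tour; the sample text and the date pattern are made up for illustration.

```python
import re

text = "Order 66 shipped on 2024-05-01"  # illustrative sample text

# "Is there a match for the pattern anywhere in this string?"
found = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(found.group())  # 2024-05-01

# "Does this string match the pattern?" (the whole string must match)
whole_ok = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2024-05-01") is not None
whole_bad = re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None
print(whole_ok, whole_bad)  # True False

# Modify a string: replace every run of digits with "#"
masked = re.sub(r"\d+", "#", text)
print(masked)  # Order # shipped on #-#-#

# Split a string apart on runs of whitespace or hyphens
parts = re.split(r"[-\s]+", text)
print(parts)  # ['Order', '66', 'shipped', 'on', '2024', '05', '01']
```

The patterns used here (`\d`, `{4}`, `+`, `[...]`) are explained one by one in the sections that follow.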

### Characters and Metacharacters

In [None]:
text = "Adam Abrar and Ibrahim"
print(re.findall(r"a", text))
print(re.findall(r"A", text))
print(re.findall(r"a\w\w", text)) # \w is any word character (letter, digit, or underscore)
print(re.findall(r"a..", text))   # . is any character except newline

['a', 'a', 'a', 'a']
['A', 'A']
['and', 'ahi']
['am ', 'ar ', 'and', 'ahi']


In [None]:
text = "Adam Abrar and Ibrahim"
print(re.findall(r"\w+", text)) # \w+ is one or more word characters
print(re.findall(r"\w*", text)) # * is zero or more (so it can also match the empty string)
print(re.findall(r"\w?", text)) # ? is zero or one

['Adam', 'Abrar', 'and', 'Ibrahim']
['Adam', '', 'Abrar', '', 'and', '', 'Ibrahim', '']
['A', 'd', 'a', 'm', '', 'A', 'b', 'r', 'a', 'r', '', 'a', 'n', 'd', '', 'I', 'b', 'r', 'a', 'h', 'i', 'm', '']
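The empty strings in the `*` and `?` results come from positions where zero characters satisfy the pattern. One way to see this is to print the span of each match with `re.finditer`. This is a small sketch on a shorter, made-up string, and it assumes the empty-match behaviour of Python 3.7 and later.

```python
import re

text = "ab cd"  # short illustrative string

# Record each match of \w* together with its (start, end) span
spans = [(m.group(), m.span()) for m in re.finditer(r"\w*", text)]
for matched, span in spans:
    print(repr(matched), span)
# 'ab' (0, 2)
# '' (2, 2)
# 'cd' (3, 5)
# '' (5, 5)

# + requires at least one character, so it never yields empty matches
words = re.findall(r"\w+", text)
print(words)  # ['ab', 'cd']
```

An empty match has equal start and end positions: here one sits at the space (index 2) and one at the end of the string (index 5).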


In [None]:
text = "Adam Abrar and Ibrahim"
print(re.findall(r"\ba\w\w\b", text)) # \b is Word Boundary
print(re.findall(r"^A\w+", text)) # ^ is Start of the string
print(re.findall(r"\w+m$", text)) # $ is End of the string

['and']
['Adam']
['Ibrahim']
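By default `^` and `$` anchor to the start and end of the whole string. If the text spans several lines, the `re.MULTILINE` flag makes them match at each line as well. A minimal sketch, using a made-up multi-line variant of the names above:

```python
import re

text = "Adam\nAbrar\nIbrahim"  # one name per line

# Default: ^ matches only at the very start of the string
default_start = re.findall(r"^A\w+", text)
print(default_start)  # ['Adam']

# With re.MULTILINE, ^ also matches right after each newline
multi_start = re.findall(r"^A\w+", text, re.MULTILINE)
print(multi_start)  # ['Adam', 'Abrar']

# Likewise, $ also matches right before each newline
multi_end = re.findall(r"\w+m$", text, re.MULTILINE)
print(multi_end)  # ['Adam', 'Ibrahim']
```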


#### The `\w` metacharacter

In [None]:
text = "betty bought a bit of bitter butter"
print(re.findall(r"\w", text))
print(re.findall(r"\W", text))
print(re.findall(r".", text))

['b', 'e', 't', 't', 'y', 'b', 'o', 'u', 'g', 'h', 't', 'a', 'b', 'i', 't', 'o', 'f', 'b', 'i', 't', 't', 'e', 'r', 'b', 'u', 't', 't', 'e', 'r']
[' ', ' ', ' ', ' ', ' ', ' ']
['b', 'e', 't', 't', 'y', ' ', 'b', 'o', 'u', 'g', 'h', 't', ' ', 'a', ' ', 'b', 'i', 't', ' ', 'o', 'f', ' ', 'b', 'i', 't', 't', 'e', 'r', ' ', 'b', 'u', 't', 't', 'e', 'r']


#### The `\d` metacharacter

In [None]:
text = "this tree is 300 years old"
print(re.findall(r"\d", text))
print(re.findall(r"\D", text))
print(re.findall(r".", text))

['3', '0', '0']
['t', 'h', 'i', 's', ' ', 't', 'r', 'e', 'e', ' ', 'i', 's', ' ', ' ', 'y', 'e', 'a', 'r', 's', ' ', 'o', 'l', 'd']
['t', 'h', 'i', 's', ' ', 't', 'r', 'e', 'e', ' ', 'i', 's', ' ', '3', '0', '0', ' ', 'y', 'e', 'a', 'r', 's', ' ', 'o', 'l', 'd']


#### Character sets using `[]`

In [None]:
text = """
1. Pick it up
2. Put it down
"""

print(re.findall(r"[ptui]", text)) # character set
print(re.findall(r"[a-z]", text))  # character range
print(re.findall(r"[a-z]+", text)) # quantifiers
print(re.findall(r"[A-Z]", text))
print(re.findall(r"[a-zA-Z0-9]", text)) # multiple ranges
print(re.findall(r"[^a-zA-Z]", text)) # ^ in the set notation is negation

['i', 'i', 't', 'u', 'p', 'u', 't', 'i', 't']
['i', 'c', 'k', 'i', 't', 'u', 'p', 'u', 't', 'i', 't', 'd', 'o', 'w', 'n']
['ick', 'it', 'up', 'ut', 'it', 'down']
['P', 'P']
['1', 'P', 'i', 'c', 'k', 'i', 't', 'u', 'p', '2', 'P', 'u', 't', 'i', 't', 'd', 'o', 'w', 'n']
['\n', '1', '.', ' ', ' ', ' ', '\n', '2', '.', ' ', ' ', ' ', '\n']
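Character sets are also handy for data validation. The sketch below checks whether a string is a six-digit hexadecimal colour code; the helper name `is_hex_color` and the sample values are hypothetical, and `{6}` means "exactly six repetitions".

```python
import re

# "#" followed by exactly six characters from the set 0-9, a-f, A-F
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")

def is_hex_color(value):
    """Return True if the whole string looks like #1A2b3C (hypothetical helper)."""
    return HEX_COLOR.fullmatch(value) is not None

print(is_hex_color("#1A2b3C"))   # True
print(is_hex_color("#12345G"))   # False: G is outside the a-f range
print(is_hex_color("#1234567"))  # False: one character too many
```

`fullmatch` is used instead of `findall` because validation usually requires the entire string to fit the pattern, not just part of it.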


### Finding Digits

In [None]:
text = "The product is $20.15 which is equivalent to 94.3125 SAR"

print(re.findall(r"\d", text))  # \d is Any Digit
print(re.findall(r"[0-9]", text))  # \d is equivalent to [0-9]
print(re.findall(r"\d\d", text))
print(re.findall(r"\d{4}", text))
print(re.findall(r"\d+", text)) # + is One or more
print(re.findall(r"\d+\.\d+", text))   # \. is a literal "." since an unescaped "." is a metacharacter matching any character
print(re.findall(r"\$\d+\.\d+", text)) # \$ is a literal "$" since an unescaped "$" is a metacharacter matching the end of the string
print(re.findall(r"\d+\.\d+\sSAR", text)) # \s is Any Whitespace Character

['2', '0', '1', '5', '9', '4', '3', '1', '2', '5']
['2', '0', '1', '5', '9', '4', '3', '1', '2', '5']
['20', '15', '94', '31', '25']
['3125']
['20', '15', '94', '3125']
['20.15', '94.3125']
['$20.15']
['94.3125 SAR']
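Once the digits are found, they usually need to be converted to numbers. The sketch below makes the decimal part optional with a non-capturing group `(?:...)`, which is not covered above and is introduced here only for illustration, then converts each match to `float`.

```python
import re

text = "The product is $20.15 which is equivalent to 94.3125 SAR"

# \d+ followed by an optional ".digits" part.
# (?:...) groups without capturing, so findall returns the whole match.
number_pattern = r"\d+(?:\.\d+)?"

numbers = [float(n) for n in re.findall(number_pattern, text)]
print(numbers)  # [20.15, 94.3125]

# The same pattern also accepts integers without a decimal part
print(re.findall(number_pattern, "3 apples cost 4.5"))  # ['3', '4.5']
```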


### Grouping matches `()`

In [None]:
# key value matching
text = r"""{key1: value1, key2: value2, key3:   value3, key4:
value4}"""

print(re.findall(r"(\w+):\s(\w+)", text)) # grouping the key and value
print(re.findall(r"(\w+):\s*(\w+)", text)) # \s* is zero or more whitespace characters (including tabs and newlines)

[('key1', 'value1'), ('key2', 'value2'), ('key4', 'value4')]
[('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3'), ('key4', 'value4')]
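Because `re.findall` returns a list of `(key, value)` tuples when the pattern has two groups, the result can be passed straight to `dict`. A minimal sketch using the same text:

```python
import re

text = """{key1: value1, key2: value2, key3:   value3, key4:
value4}"""

# Each match contributes one (key, value) tuple
pairs = re.findall(r"(\w+):\s*(\w+)", text)
mapping = dict(pairs)

print(mapping["key3"])  # value3
print(list(mapping))    # ['key1', 'key2', 'key3', 'key4']
```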


#### Using `|` (OR) Operator

In [None]:
text = """
I like cats
I like horses
I like trees
"""

print(re.findall(r"I like (cats|horses)", text)) # | is OR

['cats', 'horses']


In [None]:
text = "55 thousands, 77 hundreds"

print(re.findall(r"(\d+)\s+(thousands|hundreds)", text))

[('55', 'thousands'), ('77', 'hundreds')]
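The parentheses around the alternatives matter: `|` has very low precedence, so without a group it splits the entire pattern in two. The sketch below contrasts the ungrouped form with a non-capturing group `(?:...)`, which is an addition not used elsewhere in this notebook.

```python
import re

text = "I like cats\nI like horses\nI like trees"

# Without parentheses, the alternatives are "I like cats" and just "horses"
ungrouped = re.findall(r"I like cats|horses", text)
print(ungrouped)  # ['I like cats', 'horses']

# (?:...) limits the alternation without capturing,
# so findall returns the full matched text
full = re.findall(r"I like (?:cats|horses)", text)
print(full)  # ['I like cats', 'I like horses']
```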


### String literal concatenation

In [1]:
# Note: adjacent string literals are concatenated
assert ("spam " 'eggs') == "spam eggs"
assert ("spam " "eggs") == "spam eggs"



This feature can be used to reduce the number of backslashes needed, to split long strings conveniently across long lines, or even **to add comments to parts of strings**, for example:

In [2]:
import re
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )

re.compile(r'[A-Za-z_][A-Za-z0-9_]*', re.UNICODE)

See: https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation
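An alternative way to comment parts of a pattern is the `re.VERBOSE` flag, which ignores whitespace and `#` comments inside the pattern (outside character sets). The sketch below writes the same identifier pattern both ways and checks that they agree on a few made-up candidates.

```python
import re

# Identifier pattern written with re.VERBOSE: whitespace and comments are ignored
identifier = re.compile(r"""
    [A-Za-z_]       # letter or underscore
    [A-Za-z0-9_]*   # letter, digit or underscore
""", re.VERBOSE)

# The same pattern built from adjacent string literals
concatenated = re.compile("[A-Za-z_]" "[A-Za-z0-9_]*")

results = {}
for candidate in ["spam_1", "_eggs", "1spam"]:
    results[candidate] = (
        identifier.fullmatch(candidate) is not None,
        concatenated.fullmatch(candidate) is not None,
    )
    print(candidate, results[candidate])
# spam_1 (True, True)
# _eggs (True, True)
# 1spam (False, False)
```

Which style to prefer is largely a matter of taste: literal concatenation keeps the pattern free of flags, while `re.VERBOSE` lets a long pattern live in a single triple-quoted string.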

### Naming patterns

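We use `(?P<name>...)` to give a capturing group a name. The name must be a valid Python identifier, and the matched text can then be retrieved with `m.group("name")`.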

In [None]:
text = "my email is adam@example.com and yours is belal@example.com"

matches_iterator = re.finditer(r"(?P<name>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)", text)
for m in matches_iterator:
    print(m.group("name"))

adam
belal
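Named groups can also be read all at once with `groupdict()` and referenced in a replacement string as `\g<name>`. A short sketch on the same text; masking the user name is just an illustrative use.

```python
import re

text = "my email is adam@example.com and yours is belal@example.com"
pattern = re.compile(r"(?P<name>\w+)@(?P<domain>\w+)\.(?P<tld>\w+)")

# groupdict() returns every named group of a match as a dictionary
records = [m.groupdict() for m in pattern.finditer(text)]
print(records[0])  # {'name': 'adam', 'domain': 'example', 'tld': 'com'}

# \g<name> in the replacement refers back to a named group
hidden = pattern.sub(r"***@\g<domain>.\g<tld>", text)
print(hidden)  # my email is ***@example.com and yours is ***@example.com
```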


### Resources to Learn Regex

**Tutorials with interactive exercises**

- [RegexLearn](https://regexlearn.com/) - Interactive tutorial and practice problems.
    - Languages: 🇺🇸, 🇹🇷, 🇷🇺, 🇪🇸, 🇨🇳, 🇩🇪, 🇺🇦, 🇫🇷, 🇵🇱, 🇰🇷, 🇧🇷, 🇨🇿, 🇬🇪.
- [RegexOne](https://regexone.com/) - Interactive tutorial and practice problems.

**Videos**

- [*Demystifying Regular Expressions*](https://www.youtube.com/watch?v=M7vDtxaD7ZU) - Great presentation for beginners, by Lea Verou at HolyJS 2017 (1hr 12m).
- [*Learn Regular Expressions In 20 Minutes*](https://www.youtube.com/watch?v=rhzKDrUiJVk) - Live syntax walkthrough in a regex tester, by Kyle Cook.

**Other Resources**:
- [**rexegg** Cheat Sheet](https://www.rexegg.com/regex-quickstart.php)
- [Python Docs (Regular Expression HOWTO)](https://docs.python.org/3/howto/regex.html#regex-howto) - An introductory tutorial on using regular expressions in Python with the `re` module. It provides a gentler introduction than the corresponding section in the Library Reference.