# <a id='toc1_'></a>[Regular Expressions (Regex)](#toc0_)


Regex, short for Regular Expression, is a **sequence of characters** that **defines a search pattern**. 

**Table of contents**<a id='toc0_'></a>    
- [Regular Expressions (Regex)](#toc1_)    
  - [Why RegEx?](#toc1_1_)    
  - [RegEx intuition](#toc1_2_)    
  - [RegEx Cheat Sheet](#toc1_3_)    
- [RegEx in Python 🐍 - Python's `re` Module](#toc2_)    
  - [Common Regex Functions](#toc2_1_)    
      - [`re.search(pattern, string)`:](#toc2_1_1_1_)    
      - [`re.findall(pattern, string)`:](#toc2_1_1_2_)    
      - [`re.match(pattern, string)`:](#toc2_1_1_3_)    
      - [`re.finditer(pattern, string)`:](#toc2_1_1_4_)    
      - [`re.sub(pattern, replacement, string)`:](#toc2_1_1_5_)    
      - [`re.split(pattern, string)`:](#toc2_1_1_6_)    
    - [Note](#toc2_1_2_)    
- [RegEx patterns](#toc3_)    
  - [General tokens](#toc3_1_)    
  - [Quantifiers](#toc3_2_)    
  - [Collections](#toc3_3_)    
  - [💡 Exercise 1: Help Elton out](#toc3_4_)    
  - [💡Exercise 2: Email patterns](#toc3_5_)    
- [Extra: Weird number formats](#toc4_)    
- [Extra: `re` in web scraping](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Why RegEx?](#toc0_)

In [49]:
0 in [0, 1, 2]

True

In [51]:
# Substring check: is the string on the left contained in the string on the right?
("cat" or "hat" or "mat") in "My cat went to look at my hat on the mat"

True
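
The cell above returns `True`, but Python evaluates `("cat" or "hat" or "mat")` to `"cat"` before `in` runs, so only `"cat"` is tested. A minimal sketch of how one regex can check all three words; the variable names and the pattern `cat|hat|mat` are illustrative choices, not part of the original cell:

```python
import re

sentence = "My cat went to look at my hat on the mat"

# `or` returns its first truthy operand, so the parentheses collapse to "cat"
first_truthy = ("cat" or "hat" or "mat")
print(first_truthy)  # cat

# Alternation (|) looks for all three words in a single pass
words = re.findall(r"cat|hat|mat", sentence)
print(words)  # ['cat', 'hat', 'mat']
```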

sabinagio@xyz.com

sabinagioxyz.com

S@B1naGIO

sabinagio

Regex is commonly used for **pattern search**, **matching**, and **data manipulation** in strings.

This makes it invaluable for tasks such as **data validation**, **text parsing**, and **data extraction**. Some typical applications include:

1. Emails
2. URLs
3. Phone Numbers
4. Dates and Times
5. Social Security Numbers
6. Credit Card Numbers
7. File Paths
8. HTML Tags
9. Log Files
10. Natural Language Processing

## <a id='toc1_2_'></a>[RegEx intuition](#toc0_)

We'll use [RegExR](https://regexr.com/) to get an idea of how RegEx works, using a few of the patterns below:

## <a id='toc1_3_'></a>[RegEx Cheat Sheet](#toc0_)

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/regex.png?raw=true)

# <a id='toc2_'></a>[RegEx in Python 🐍 - Python's `re` Module](#toc0_)

Python provides the `re` module, which allows you to work with regular expressions. Before using `re`, you need to import it:

In [52]:
import re
import warnings

warnings.filterwarnings('ignore')

## <a id='toc2_1_'></a>[Common Regex Functions](#toc0_)

Regex functions in Python allow you to find and work with patterns in strings:

1. To find the pattern:
   - `re.search(pattern, string)`: Searches the string for a match to the pattern and returns a match object if found.
   - `re.findall(pattern, string)`: Returns all occurrences of the pattern in the string as a list of strings.
   - `re.match(pattern, string)`: Searches the string for a match only at the beginning and returns a match object if found.
   - `re.finditer(pattern, string)`: Returns an iterator yielding match objects for all occurrences of the pattern in the string.

2. To work with the pattern:
   - `re.sub(pattern, replacement, string)`: Replaces all occurrences of the pattern in the string with the replacement string.
   - `re.split(pattern, string)`: Splits the string by occurrences of the pattern and returns a list of substrings.



#### <a id='toc2_1_1_1_'></a>[`re.search(pattern, string)`:](#toc0_)
   - Searches the `string` for a match to the `pattern`.
   - Returns a match object if the pattern is found, otherwise returns `None`.


 We can use the `re.search()` function to find specific patterns within a string. Let's explore some regex patterns and how they work:

In [56]:
text = 'I have 10 apples   and  2 bananas.'
pattern = r'\d'    # a single digit (raw string, so the backslash reaches the regex engine unchanged)

result = re.search(pattern, text)

if result:
    print(f"Match found: {result}")
    print(f"Match: {result.group()}")
    print(f"Position: {result.span()}")
else:
    print("No match found.")

Match found: <re.Match object; span=(7, 8), match='1'>
Match: 1
Position: (7, 8)


In [55]:
pattern = r'\w'  # any word character: a letter, a digit, or an underscore (equivalent to [a-zA-Z0-9_] for ASCII text)

result = re.search(pattern, text) # Returns first match, 'I'
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

Match found: I


#### <a id='toc2_1_1_2_'></a>[`re.findall(pattern, string)`:](#toc0_)
   - Returns all occurrences of the `pattern` in the `string` as a list of strings.


In [54]:
pattern = r'\d'  # a single digit, so each digit is returned separately
text = 'I have 10 apples   and  2 bananas.'

result = re.findall(pattern, text)
print(f"Occurrences: {result}")

Occurrences: ['1', '0', '2']


We'll take a quick break now to look into [RegEx patterns](#toc3_) before going into all the `re` functions!

#### <a id='toc2_1_1_3_'></a>[`re.match(pattern, string)`:](#toc0_)
   - Searches the `string` for a match only at the beginning.
   - Returns a match object if the pattern is found at the start, otherwise returns `None`.

In [None]:
pattern = r'\w'  # any word character: a letter, a digit, or an underscore (equivalent to [a-zA-Z0-9_] for ASCII text)
text = ' I have an apple and a banana.'

result = re.match(pattern, text)
if result:
    print(f"Match found: {result.group()}")
else:
    print("No match found.")

Because the text starts with a space, which is not a word character, `re.match()` returns `None` and the cell prints `No match found.`
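
To make the difference from `re.search()` concrete, here is a small sketch on the same text. The variable names are illustrative, and `lstrip()` is just one possible way to get rid of the leading space:

```python
import re

text = ' I have an apple and a banana.'

at_start = re.match(r'\w', text)           # only tries index 0, which is a space
anywhere = re.search(r'\w', text)          # scans forward to the first word character
stripped = re.match(r'\w', text.lstrip())  # leading whitespace removed first

print(at_start)          # None
print(anywhere.group())  # I
print(anywhere.span())   # (1, 2)
print(stripped.group())  # I
```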

#### <a id='toc2_1_1_4_'></a>[`re.finditer(pattern, string)`:](#toc0_)
   - Returns an iterator yielding match objects for all occurrences of the `pattern` in the `string`.

In [None]:
pattern = r'\d+'  # one or more consecutive digits
text = 'I have 3 apples and 5 bananas.'

matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found: {match.group()}")
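
Because `re.finditer()` yields match objects rather than plain strings, each match also carries its position. A short sketch on the same text (the list name `found` is an illustrative choice):

```python
import re

text = 'I have 3 apples and 5 bananas.'

# Collect the matched text together with its start and end index
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]
print(found)  # [('3', 7, 8), ('5', 20, 21)]
```

This is the main reason to prefer `re.finditer()` over `re.findall()` when the location of each match matters.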

#### <a id='toc2_1_1_5_'></a>[`re.sub(pattern, replacement, string)`:](#toc0_)
   - Replaces all occurrences of the `pattern` in the `string` with the `replacement` string.

In [None]:
pattern = r'apples'
text = 'I have 3 apples and apples.'

result = re.sub(pattern, 'oranges', text)
print(f"Updated text: {result}")

In [None]:
re.sub(r'\d+', '', text)   # replaces every run of digits with an empty string
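
Two further options of `re.sub()` are worth knowing: the `count` keyword limits how many replacements are made, and the replacement can be a function that receives each match object. A minimal sketch, with illustrative variable names:

```python
import re

text = 'I have 3 apples and apples.'

# count=1 replaces only the first occurrence
first_only = re.sub(r'apples', 'oranges', text, count=1)
print(first_only)  # I have 3 oranges and apples.

# A function as replacement: double every number found in the text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)
print(doubled)  # I have 6 apples and apples.
```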

#### <a id='toc2_1_1_6_'></a>[`re.split(pattern, string)`:](#toc0_)
   - Splits the `string` by occurrences of the `pattern` and returns a list of substrings.


In [None]:
pattern = r'\s+'  # one or more whitespace characters

result = re.split(pattern, text)  # `text` is reused from the re.sub cell above
print(f"Split text: {result}")

### <a id='toc2_1_2_'></a>[Note](#toc0_)

If you don't need a regex pattern, you can use Python's built-in string methods instead: `replace()` instead of `re.sub()`, or `split()` instead of `re.split()`.
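
A quick sketch of when the two approaches give the same result and when only a regex will do. The sample strings are made up for illustration:

```python
import re

text = 'I have 3 apples   and  2 bananas.'

# A fixed substring: str.replace and re.sub give the same result
plain_replace = text.replace('apples', 'oranges')
regex_replace = re.sub(r'apples', 'oranges', text)
print(plain_replace == regex_replace)  # True

# Splitting on runs of whitespace: str.split() with no argument is enough
print(text.split() == re.split(r'\s+', text))  # True

# Splitting on several different delimiters needs a pattern
parts = re.split(r'[,;]\s*', 'apples, bananas;cherries')
print(parts)  # ['apples', 'bananas', 'cherries']
```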

# <a id='toc3_'></a>[RegEx patterns](#toc0_)

## <a id='toc3_1_'></a>[General tokens](#toc0_)

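`\d` - digit  
`\w` - word character: a letter, a digit, or an underscore  
`\s` - whitespace character, e.g. space, tab (`\t`), newline (`\n`)  
`.` - any character except a newline  
**`^`** - start of the string (start of each line with `re.MULTILINE`)  
**`$`** - end of the string (end of each line with `re.MULTILINE`)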
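
By default, Python anchors `^` and `$` to the whole string; they only work line by line when the `re.MULTILINE` flag is passed. A small sketch with a made-up three-line string:

```python
import re

lines = "10 apples\n2 bananas\nno fruit"

# Without the flag, ^ and $ refer to the start and end of the whole string
start_default = re.findall(r'^\d+', lines)
end_default = re.findall(r'\w+$', lines)
print(start_default)  # ['10']
print(end_default)    # ['fruit']

# With re.MULTILINE, they also match at the start and end of every line
start_multi = re.findall(r'^\d+', lines, flags=re.MULTILINE)
end_multi = re.findall(r'\w+$', lines, flags=re.MULTILINE)
print(start_multi)  # ['10', '2']
print(end_multi)    # ['apples', 'bananas', 'fruit']
```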

In [57]:
text = 'I have 10 apples   and  2 bananas.'
pattern = r'\d'    # a single digit

# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

# Get all matches
result = re.findall(pattern, text)

print(result)

<re.Match object; span=(7, 8), match='1'>
1
(7, 8)
['1', '0', '2']


In [59]:
text = 'I have 10 apples   and  2 bananas.'
pattern = r'^\d'    # a digit at the very start of the string (^ is the caret anchor)

# Get first result
result = re.search(pattern, text)
print(result)

None


In [72]:
text = '1 have 10 apples   and  2 bananesaugkrneahbrebwrhas.'
pattern = r'^\d'    # a digit at the very start of the string (^ is the caret anchor)

# Get first result
result = re.search(pattern, text)
print(result)

<re.Match object; span=(0, 1), match='1'>


In [69]:
text = 'I have 10 orange apples  banananananannanananananna and  2 bananas.'
pattern = r'\.$'    # a literal dot at the very end of the string

# Get first result
result = re.search(pattern, text)
print(result)

<re.Match object; span=(66, 67), match='.'>


In [87]:
# How can I change this pattern to look at 2 digits instead of 1?
text = '1 have 10 apples   and  22 bananesaugkrneahbrebwrhas.'
pattern = r'\b\d{2}\b'    # exactly two digits, with a word boundary (\b) on each side

# Get first result
result = re.search(pattern, text)
print(result)

# Get all matches
result = re.findall(pattern, text)
print(result)

<re.Match object; span=(7, 9), match='10'>
['10', '22']


In [None]:
# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

# Get all matches
result = re.findall(pattern, text)

print(result)

In [88]:
text = 'I would like to find a white courtain.'
pattern = r'\w'    # a single word character

# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

# Get all matches
result = re.findall(pattern, text)

print(result)

<re.Match object; span=(0, 1), match='I'>
I
(0, 1)
['I', 'w', 'o', 'u', 'l', 'd', 'l', 'i', 'k', 'e', 't', 'o', 'f', 'i', 'n', 'd', 'a', 'w', 'h', 'i', 't', 'e', 'c', 'o', 'u', 'r', 't', 'a', 'i', 'n']


In [92]:
# How can I change the pattern to find only single-letter words?
text = 'I would like to find a white courtain.'
pattern = r'\bcourtain\b'    # \b is a word boundary, so this matches 'courtain' as a whole word; r'\b\w\b' would match only single-letter words

# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

# Get all matches
result = re.findall(pattern, text)

print(result)

<re.Match object; span=(29, 37), match='courtain'>
courtain
(29, 37)
['courtain']


In [106]:
# How can I match both the US and the GB spelling of "color"?
text_US = 'I would like to find a courtain of my favorite color.'
text_GB = 'I would like to find a courtain of my favourite colour.'
text_gibberish = 'I would like to find a courtain of my favourite colobr.'
text_loooooooooooooooong = 'I would like to find a courtain of my favourite colooooooooooooooooooooooooooooooooooooooooooooor.'
text_wrong = 'I would like to find a courtain of my favourite colr.'
pattern = r'\bcolo+u*r\b'    # 'col', one or more 'o', zero or more 'u', then 'r', as a whole word

# Get first result
result = re.search(pattern, text_US)
print("US match:", result)
result = re.search(pattern, text_GB)
print("GB match:", result)
result = re.search(pattern, text_gibberish)
print("Gibberish match:", result)
result = re.search(pattern, text_loooooooooooooooong)
print("Loooooong match:", result)
result = re.search(pattern, text_wrong)
print("Wrong match:", result)

US match: <re.Match object; span=(47, 52), match='color'>
GB match: <re.Match object; span=(48, 54), match='colour'>
Gibberish match: None
Loooooong match: <re.Match object; span=(48, 97), match='colooooooooooooooooooooooooooooooooooooooooooooor>
Wrong match: None


In [None]:
# What if my text was formatted like this?
text = 'I.would.like.to.find.a.white.courtain.'

In [None]:
pattern = r"co.kie" # The dot (.) in regex represents any character (except newline, \n).
text = "I love my cookie and coke."

# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

# Get all matches
result = re.findall(pattern, text)

print(result)

In [None]:
pattern = r"that$" # The dollar sign ($) anchors the match to the end of the string.
text = "I want to have a look at that"

# Get first result
result = re.search(pattern, text)

print(result)
print(result.group())
print(result.span())

In [None]:
# How can I change this pattern to accommodate punctuation?

In [None]:
# What if we had a mix of sentences with and without punctuation?
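
One possible answer to the two questions above, offered as a sketch rather than the only solution: an optional character class before `$` lets the pattern accept sentences with or without closing punctuation. The choice of `[.!?]` and the sample sentences are assumptions for illustration:

```python
import re

# "that", optionally followed by one of . ! ?, at the end of the string
pattern = r"that[.!?]?$"

sentences = [
    "I want to have a look at that",
    "I want to have a look at that.",
    "I want to have a look at that!",
    "Is that all",
]

results = [bool(re.search(pattern, s)) for s in sentences]
print(results)  # [True, True, True, False]
```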

## <a id='toc3_2_'></a>[Quantifiers](#toc0_)

`?` - 0 or 1 occurrence of the preceding token  
`+` - 1 or more occurrences  
`*` - 0 or more occurrences  
`{k}` - exactly k occurrences  
`{k,n}` - between k and n occurrences (no space after the comma)  

Often listed with the quantifiers, although it is an alternation operator rather than a quantifier:  
`|` - or  
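
A compact way to compare the quantifiers is to run each one against the same made-up list of words, where only the number of `a` characters changes:

```python
import re

words = "ct cat caat caaat caaaat"

optional = re.findall(r'\bca?t\b', words)         # 0 or 1 'a'
one_or_more = re.findall(r'\bca+t\b', words)      # 1 or more 'a'
zero_or_more = re.findall(r'\bca*t\b', words)     # 0 or more 'a'
exactly_two = re.findall(r'\bca{2}t\b', words)    # exactly 2 'a'
two_to_three = re.findall(r'\bca{2,3}t\b', words) # between 2 and 3 'a'

print(optional)      # ['ct', 'cat']
print(one_or_more)   # ['cat', 'caat', 'caaat', 'caaaat']
print(zero_or_more)  # ['ct', 'cat', 'caat', 'caaat', 'caaaat']
print(exactly_two)   # ['caat']
print(two_to_three)  # ['caat', 'caaat']
```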

In [None]:
# Quantifiers specify how many times a character or group should repeat.
pattern = r"\d{3}-\d{2}-\d{4}" # 3 digits - 2 digits - 4 digits
text = "My social security number is 123-45-6789."

result = re.search(pattern, text)
if result:
    print("SSN found:", result.group())
else:
    print("No SSN found.")


In [None]:
pattern = r"apple|banana" # Alternation allows matching one of several patterns separated by |.
text = "I have a banana and an apple."

result = re.search(pattern, text)
if result:
    print("Fruit found:", result.group())
else:
    print("No fruit found.")

## <a id='toc3_3_'></a>[Collections](#toc0_)

`[a-z]` - any one lowercase letter  
`[A-Z]` - any one uppercase letter  
`[asdf]` - any one of the characters `a`, `s`, `d`, `f`  
`[^asdf]` - any one character except `a`, `s`, `d`, `f`  

In [109]:
pattern = r"[aei]" # Character classes allow matching any one of several characters at a specific position.
text = "The quick brown fox jumps over the lazy dog."

result = re.search(pattern, text)
if result:
    print("Vowel found:", result.group())
else:
    print("No vowel found.")

results = re.findall(pattern, text)
print(results)

Vowel found: e
['e', 'i', 'e', 'e', 'a']


In [120]:
pattern = r"\w*[aei]\w*" # Whole words containing a, e, or i: any word characters, one of the three vowels, then any word characters.
text = "a quick brown fox jumps over the lazy dog."

result = re.search(pattern, text)
if result:
    print("Vowel found:", result.group())
else:
    print("No vowel found.")

results = re.findall(pattern, text)
print(results)

Vowel found: a
['a', 'quick', 'over', 'the', 'lazy']


In [None]:
# What if the text was uppercase?
text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."
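
A sketch of two ways to handle the uppercase text, assuming the goal is still to find the vowels `a`, `e`, and `i`: either list both cases in the character class, or keep the pattern and pass the `re.IGNORECASE` flag.

```python
import re

text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."

# The lowercase-only class finds nothing in uppercase text
lower_only = re.findall(r"[aei]", text)
print(lower_only)  # []

# Option 1: list both cases explicitly
both_cases = re.findall(r"[aeiAEI]", text)
print(both_cases)  # ['E', 'I', 'E', 'E', 'A']

# Option 2: keep the pattern and ignore case
ignore_case = re.findall(r"[aei]", text, flags=re.IGNORECASE)
print(ignore_case)  # ['E', 'I', 'E', 'E', 'A']
```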

## <a id='toc3_4_'></a>[💡 Exercise 1: Help Elton out](#toc0_)

How would you find `looooaaaaaoooaoaoaong` and `looaaooaoooaaoooot` in the sentence:
`Elton, this is a looooaaaaaoooaoaoaong sentence with a looaaooaoooaaoooot of repetition`?

In [None]:
# Your code here

In addition to using the plus sign (`+`) to look for one or more characters, you can also use the asterisk (`*`) for sets of characters that may appear zero or more times:

In [45]:
# Will 'o' appear at all?
re.findall('l.[nto]g?','Elton, this is a loooooong sentence with a looooooooooooot of repetition')

['lto', 'loo', 'loo']

## <a id='toc3_5_'></a>[💡Exercise 2: Email patterns](#toc0_)

Let's look at the regex pattern `[\w\.]+@\w+\.\w+`, designed to match email addresses:

1. `[\w\.]+`: Matches one or more occurrences of word characters or dots (`.`).
   - `\w` represents word characters (letters, digits, and underscores).
   - `\.` matches a literal dot (period).

2. `@`: Matches the `@` symbol.

3. `\w+`: Matches one or more occurrences of word characters after the `@` symbol.
   - `\w` represents word characters (letters, digits, and underscores).

4. `\.`: Matches a literal dot (period).

5. `\w+`: Matches one or more occurrences of word characters after the dot.
   - `\w` represents word characters (letters, digits, and underscores).
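
The same breakdown can be written directly into the pattern with the `re.VERBOSE` flag, which ignores whitespace and allows `#` comments inside the pattern. This is a sketch of that idea; the name `email_pattern` and the sample sentence are illustrative:

```python
import re

email_pattern = re.compile(r"""
    [\w\.]+   # one or more word characters or dots
    @         # the @ symbol
    \w+       # one or more word characters
    \.        # a literal dot
    \w+       # one or more word characters
""", re.VERBOSE)

match = email_pattern.search("Contact john.doe@example.com for details")
print(match.group())  # john.doe@example.com
```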


In [None]:
emails_text = """
Here are some made-up email addresses:
john.doe@example.com
mary_smith123@gmail.com
theodore@example.co.uk
contact_us@company.net
info123@yahoo.com
alice.bob@example.org
support@website.io
sales.department@example.com
test.email@domain.com
random.email@subdomain.co
"""

pattern = r'[\w\.]+@\w+\.\w+'

re.findall(pattern, emails_text)

What do you observe in the result? How would you fix it?

In [None]:
# Your answer here

# <a id='toc4_'></a>[Extra: Weird number formats](#toc0_)

In [None]:
string = 'The phone numbers, as they gave them to us, are 00351 933456789, +351927654321, 00351 915 678 901, 969 343 291'

In [None]:
re.findall(r'((\+\d{3}|00\d{3} ?)?)((\d{3} ?){3})', string)

In [None]:
# Get just the phone numbers: the third capturing group holds the nine digits after the country prefix
groupings_complex = re.findall(r'((\+\d{3} ?|00\d{3} ?)?)((\d{3} ?){3})', string)
[groups[2] for groups in groupings_complex]
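
When a pattern contains capturing groups, `re.findall()` returns one tuple per match with an entry for each group, which is why the cell above has to pick out index 2. A sketch of an alternative using non-capturing groups `(?:...)`, so that `re.findall()` returns the whole match instead; the shortened `sample` string is an assumption for illustration:

```python
import re

sample = 'Call +351927654321 or 969 343 291'  # shortened, hypothetical sample

# Capturing groups: each result is a tuple with one entry per group
with_groups = re.findall(r'((\+\d{3} ?|00\d{3} ?)?)((\d{3} ?){3})', sample)
print(len(with_groups[0]))  # 4

# Non-capturing groups: each result is the full matched text
no_groups = re.findall(r'(?:\+\d{3} ?|00\d{3} ?)?(?:\d{3} ?){3}', sample)

# The optional trailing space can end up inside the match, so strip it
numbers = [n.strip() for n in no_groups]
print(numbers)  # ['+351927654321', '969 343 291']
```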

# <a id='toc5_'></a>[Extra: `re` in web scraping](#toc0_)

In addition to using the `BeautifulSoup` library to search for HTML tags, attributes and CSS selectors, we can also use RegEx to find patterns:

In [None]:
# I will create a typical pattern for matching a script tag
pattern = '<script>.*</script>'

In [None]:
# Then I'll just get the Wikipedia landing page for my example
import requests
response = requests.get('https://wikipedia.com')
response.content

In [None]:
# Now I'll extract the JS scripts from the page:
re.findall(pattern, response.content)

I get an error because `response.content` is a `bytes` object while my pattern is a `str`, and `re` requires both to be of the same type. Wrapping the bytes in `str()` would only produce their printed representation (`"b'...'"`), so I use `response.text`, which holds the decoded string:

In [None]:
re.findall(pattern, response.text)
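
One thing to watch for with `<script>.*</script>` is that `.*` is greedy: it matches as much as possible, so several script tags on one line are returned as a single match. A sketch on a small, made-up HTML string (no network request) comparing it with the lazy form `.*?`:

```python
import re

# Hypothetical HTML snippet with two script tags on one line
html = "<script>a()</script><p>text</p><script>b()</script>"

# Greedy: runs from the first <script> to the last </script>
greedy = re.findall(r'<script>.*</script>', html)
print(greedy)  # ['<script>a()</script><p>text</p><script>b()</script>']

# Lazy: stops at the first </script> after each <script>
lazy = re.findall(r'<script>.*?</script>', html)
print(lazy)  # ['<script>a()</script>', '<script>b()</script>']
```

On real pages, script tags often carry attributes and span several lines, so the pattern would need further adjustment (for example the `re.DOTALL` flag so that `.` also matches newlines).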