## 1.0 Intro to Regular Expressions: 5 Basic Operations 
---

### Background

Have you ever needed to **extract or validate data** but couldn’t find an automated way to pull exactly what you need?  

**Regular Expressions (RegEx)** provide a powerful, flexible way to describe **patterns in text**, making them indispensable for tasks such as:

- Searching  
- Filtering  
- Validating input  
- Transforming data  

Python provides a built-in regular expressions package, as do many other programming languages like **Java, C#, and Rust**. Additionally, RegEx is deeply integrated in **Linux, Unix, and Mac operating systems**, and can be used on **Windows via PowerShell**.

---

### What You’ll Learn

In this notebook, we’ll cover 5 of the most commonly used functions from the python library for regular expressions. Specifically, we'll:

1. Review the match, search, findall, sub, and subn regex functions
2. Show you basic examples calling for these functions for real-world application
3. Provide additional practice problems, with progressing difficulty that you can try on your own for<br><span style="color:#1f77b4;">`re.match()`</span>, <span style="color:#1f77b4;">`re.search()`</span>, <span style="color:#1f77b4;">`re.findall()`</span>, <span style="color:#1f77b4;">`re.sub()`</span>, and <span style="color:#1f77b4;">`re.subn()`</span>

In the next video we'll cover practical applications of RegEx to solve business problems at speed.  

---


### Regex Module Functions: 

Let's begin by importing Python's Regular Expression module, `re`, and reviewing the available functions.

In [3]:
#Begin by importing regular expressions module (re)
import re

#Also load our dummy data: 
import pandas as pd
df = pd.read_csv(r'data/Contact_Directory.csv')

#Uncomment to view regex functions . 
#print([name for name in dir(re) if callable(getattr(re, name))])

#For more details go here: 
#https://docs.python.org/3/library/re.html

In [4]:
df.sample(5)

Unnamed: 0,id,name,email,phone,region,notes
171,172,Johnson,emma at johnson dot com,4723537339,Australia,No notes
896,897,Bob Brown,bob36@example.co.uk,,East Asia,Call after 5pm
470,471,Garcia,jose_garcia@invalid,429.227.1040,West Africa,No notes
231,232,Jane Martinez,,,South America,Preferred contact: email
359,360,Noah Doe,,(984) 853-6864,North America,Call after 5pm


#### Function 1: Match

**Calling Method:** <span style="color:#1f77b4;">`re.match(pattern, string)`</span>

**Returns:**  
<span style="color:#d62728;">`None`</span> if no match is found, else a <span style="color:#2ca02c;">`match object`</span>, with:  

- the matched text  
- start & end position of the match  
- span of the match  
- captured groups  

**Explanation:**<br>This RegEx function searches for substrings in text that satisfy a specified pattern. It <u>only returns a match if it is found at the *start*</u> of a string.

**Example Usage:**  
Let's check a list of report names to only include **Quarterly Reports** for global operations (e.g., North_America, Europe, Asia), while avoiding files with incomplete or invalid data.
```python
pattern = r"\d{8}_Quarterly_Report\.csv"
text = df["filename"].iloc[7]
match = re.match(pattern,text)

```

In [3]:
#Example: 

#Load datafile & view sample records 
df = pd.read_csv(r'data/report_files.csv')
df.sample(5)

Unnamed: 0,filename,comment
120,Market_Insight.csv,Document Classification: Public Record Prepare...
273,20220918_Global_Quarterly_Report.csv,Document Classification: Business Confidential...
62,20250712_Global_Quarterly_Report.csv,Document Classification: Business Confidential...
157,incomplete_inventory.csv,Document Classification: not classified yet Pr...
370,inventory.csv,Document Classification: Public Record Prepare...


In [4]:
#Declare pattern and text to check for a match
pattern = r"\d{8}_Global_Quarterly_Report\.csv"
text = df["filename"].iloc[2]
match = re.match(pattern,text)

# #Example outputs if a match is found:
print(type(match), match.start(), match.end(), match.span(), match.group())

<class 're.Match'> 0 36 (0, 36) 20200517_Global_Quarterly_Report.csv


In [5]:
#Example outputs if no match is found:
text = df["filename"].iloc[0]
match = re.match(pattern,text)
print(type(match))

#If you try to get the start, end, span or group from a non-match, you'll get an error.
#match.start(), match.end(), match.span(), match.group()

<class 'NoneType'>


**Quick Note:** Since the regex's <span style="color:#1f77b4;">`re.match()`</span> function only searches the beginning of each string for an exact match, it can execute very quickly. However, often our substring of interest is somewhere in the middle the text. In this case we can use Regex's *Search* function.    


#### Function 2: Search

**Calling Method:** <span style="color:#1f77b4;">`re.search(pattern, string)`</span>

**Returns:**  
Similar to <span style="color:#1f77b4;">`re.match()`</span>, this function will return <span style="color:#d62728;">`None`</span> if no match is found, else it returns a <span style="color:#2ca02c;">`match object`</span>, with:  

- the matched text  
- start & end position of the match  
- span of the match  
- captured groups  

**Explanation:**<br>This RegEx function searches for substrings in text that satisfy a specified pattern. It <u>only returns the first match </u> found in a string, even if multiple other matches exist.

**Example Usage:**  
The finance director wants you to **pull all records** that were **prepared or validated by Frank B.** to highlight examples of superb work quality. 

```python
pattern = r"Frank B."
text = df["comments"].iloc[4]
match = re.search(pattern,text)

```

In [10]:
# Example outputs if a match is found:

pattern = r"Frank B."
text = df["comment"].iloc[4]  # Row 4 contains Frank B.
search_match = re.search(pattern, text)


print('search_match type: ', type(search_match)) 
print('start position: ', search_match.start()) 
print('end position: ', search_match.end()) 
print('span: ', search_match.span()) 
print('search_match text: ', search_match.group())

search_match type:  <class 're.Match'>
start position:  52
end position:  60
span:  (52, 60)
search_match text:  Frank B.


In [11]:
# Example outputs if no search match is found:

text = df["comment"].iloc[3] # The file at row 3 want not prepared or validated by Frank B.
search_match = re.search(pattern, text)

print('search_match type: ', type(search_match))

search_match type:  <class 'NoneType'>


**Quick Note:** So, just to re-cap <span style="color:#1f77b4;">`re.match()`</span> checks only at the start of a text, while <span style="color:#1f77b4;">`re.search()`</span> checks anywhere in the text, but **only pulls the first match found**. So what if we want to pull multiple matches? That's where we will use <span style="color:#1f77b4;">`re.findall()`</span>


#### Function 3: Findall

**Calling Method:**  
`re.findall(pattern, string)`

**Returns:**  
This function returns a **list** of matches:

- If **no match** is found → an empty list `[]`
- If matches are found:
  - A list of matched strings (when no capture groups are used), **or**
  - A list of tuples (when the pattern contains capture groups)

> Unlike `re.search()`, `re.findall()` does **not** return match objects and does **not** provide start/end positions or spans.

**Explanation:**  
This RegEx function searches for **all non-overlapping substrings** in text that satisfy a specified pattern. It returns **every match** found in the string, not just the first one.

**Example Usage:**  
For many older texts, it's common to update phrases over time so they better reflect the intended meaning, or so they are more socially acceptable. We might also change phrases as we make versions that are easier to read for younger audiences. For this example, we've pulled the book *Alice's Adventures in Wonderland* by Lewis Carroll from Project Gutenberg. In the original text, most dialogus was written as, "said the king", "said the queen", or "said the [noun]". We'd like to update that to "the king said", or "the queen said", and so on. 

So, let's **find all instances** where **"said the""** occurs in the original text.

```python

#Open the file
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

    pattern = r" [sS]aid the "
    matches = re.findall(pattern, text)
    print("Matches found:", matches)
    print("Number of matches:", len(matches))


In [14]:
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

    pattern = r" [sS]aid the "
    matches = re.findall(pattern, text)
    print("Number of matches:", len(matches))

Number of matches: 197


**Quick Note:** Excellent, so we can find how many instances this phrase appears in the original text using <span style="color:#1f77b4;">`re.findall()`</span>. Now, what if we want to actually perform the replacement? Well, we would need to do a substitution when the pattern is found. So looks examine the next function, <span style="color:#1f77b4;">`re.sub()`</span>


#### Function 4: Substitution

**Calling Method:**  
`re.sub(pattern, repl, string)`

**Returns:**  
This function returns a **new string** where matches from the specified pattern have been replaced with the specified replacment (repl):

- If **no match** is found → the original string is returned unchanged
- Otherwise → all matching substrings are replaced according to the `repl` argument

> Unlike `re.findall()`, `re.sub()` does **not** return a list of matches. Instead, it **transforms the text** by replacing matching patterns.

**Explanation:**  
So, rather than simply locating patterns like re.findall, re.match, or re.search, this RegEx function actually updates the input string which is why it is commonly used for cleaning, updating, and standardizing text.

**Example Usage:**  
Let's revisit the last example where we wanted to find and update old phrases from *Alice's Adventures in Wonderland*. In the original text, most dialogue was written as `"said the king"`, `"said the queen"`, or `"said the [noun]"`. We'd like to update that phrasing so the subject comes first, such as `"the king said"` or `"the queen said"`.

So instead of finding all instances of `"said the"`, we can **replace** them directly using `re.sub()`.

```python

# Open the file
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

pattern = r"([sS]aid the )([a-zA-Z]+)"
replacement = r"the \2 said"

updated_text = re.sub(pattern, replacement, text)

print(updated_text[:500])  # Preview the first 500 characters

In [15]:
# Open the file
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

pattern = r"([sS]aid the )([a-zA-Z]+)"
replacement = r"the \2 said"

updated_text = re.sub(pattern, replacement, text)

#show some examples of the new text replacement
old_pattern = r" [sS]aid the "
new_pattern = r"the [a-zA-Z]+ said"

old_matches = re.findall(old_pattern, text)
new_matches = re.findall(new_pattern, text)

if old_matches == None: 
    print('All matches replaced successfully.')

if new_matches: 
    print(new_matches[0:5])  # Preview first 5 new matches

['the Duchess said', 'the Duchess said', 'the Cat said', 'the Hatter said', 'the Hatter said']


Perfect! Now we've updated the text as desired. 

It might have been nice though, to see how many replacements were performed when we made the change. Rather than having to write a counter method to keep track, RegEx has an alternative method `re.subn()`. It's basically the same as `re.sub()`, except that it returns 2 objects instead of just one. Let's repeat the last task using `re.subn()`

#### Function 5: Substitution with a Counter (Subn)

**Calling Method:**  
`re.sub(pattern, repl, string)`

**Returns:**  
This function returns a **tuple** containing:

1. A **new string** where matches from the specified pattern have been replaced with the specified replacement (`repl`)
2. An **integer count** representing how many substitutions were made

- If **no match** is found → the original string is returned and the count is `0`
- Otherwise → all matching substrings are replaced, and the count reflects the number of replacements performed

> Unlike `re.sub()`, which returns only the modified string, `re.subn()` provides **additional insight** by reporting how many substitutions occurred.

**Explanation:**  
So, rather than simply locating patterns like re.findall, re.match, or re.search, this RegEx function actually updates the input string which is why it is commonly used for cleaning, updating, and standardizing text.

**Example Usage:**  
Let's revisit the last example where we wanted to find and update old phrases from *Alice's Adventures in Wonderland*. In the original text, most dialogue was written as `"said the king"`, `"said the queen"`, or `"said the [noun]"`. We'd like to update that phrasing so the subject comes first, such as `"the king said"` or `"the queen said"`.

So instead of finding all instances of `"said the"`, we can **replace** them directly using `re.sub()`.

```python

# Open the file
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

pattern = r"([sS]aid the )([a-zA-Z]+)"
replacement = r"the \2 said"

updated_text, replacement_count = re.subn(pattern, replacement, text)

print("Number of replacements:", replacement_count)

In [18]:
with open("data/alice_in_wonderland.txt", "r", encoding="utf-8") as file:
    text = file.read()

pattern = r"([sS]aid the )([a-zA-Z]+)"
replacement = r"the \2 said"

updated_text, replacement_count = re.subn(pattern, replacement, text)

print("Number of replacements:", replacement_count)
print("First 500 characters of updated text:\n", updated_text[:500])

Number of replacements: 199
First 500 characters of updated text:
 The Project Gutenberg eBook of Alice's Adventures in Wonderland
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using 


Excellent, not only did we make the changes, but we we able to get a quick, reliable counter as well. 

### Summary: 

In this notebook, we've demonstrated how to use <span style="color:#1f77b4;">`re.match()`</span>, <span style="color:#1f77b4;">`re.search()`</span>, <span style="color:#1f77b4;">`re.findall()`</span>, <span style="color:#1f77b4;">`re.sub()`</span>, and <span style="color:#1f77b4;">`re.subn()`</span>. We've explained the basics of how to call each function, what they return, and we've explained when they can be useful. Now, as reminder, if you want 1 on 1 tutoring, or have a project you'd like to tackle, we offer consulting support for a fee. If you're interested, reach out at info@envirolytica.com. 


In the next lecture, we'll cover additional regex functions such as <span style="color:#1f77b4;">`re.split()`</span>, <span style="color:#1f77b4;">`re.fullmatch()`</span>, <span style="color:#1f77b4;">`re.finditer()`</span>,<span style="color:#1f77b4;">`re.escape()`</span>, <span style="color:#1f77b4;">`re.purge()`</span>, and <span style="color:#1f77b4;">`re.compile()`</span>

If you want to additional practice with the functions we've just covered, feel free to continue with our practice problems in the next video!