## 1.0 Intro to Regular Expressions
---

### Background

Have you ever needed to **extract or validate data** but couldn’t find an automated way to pull exactly what you need?  

**Regular Expressions (RegEx)** provide a powerful, flexible way to describe **patterns in text**, making them indispensable for tasks such as:

- Searching  
- Filtering  
- Validating input  
- Transforming data  

Python provides a built-in regular expression module (`re`), and many other programming languages, such as **Java, C#, and Rust**, offer regex support through their standard libraries or widely used packages. RegEx is also deeply integrated into the command-line tools of **Linux, Unix, and macOS**, and can be used on **Windows via PowerShell**.

---

### What You’ll Learn

In this notebook, we’ll:
1. Provide an overview of regex **pattern operators**, including quantifiers, character classes, grouping, lookaheads, and lookbehinds, which control *how patterns are matched* (lookaheads and lookbehinds check the surrounding text *without consuming it*)
2. Show you how to **construct and interpret RegEx patterns**  
3. Highlight some **resources for practicing** regular expressions

If you're already comfortable with recognizing and crafting RegEx strings, go ahead and check out our other resources that cover RegEx functions, basic example problems, and real-world RegEx applications.

Let's go ahead and get started with the pattern operators!


---

### 1. Overview of Regex Pattern Operators: 

#### 1.A Operators: 

To learn how to create regular expressions, we need to first be familiar with the operators that exist. The table below provides a brief summary of the main operators, what they do, and example usage.

| RegEx Operator | Purpose | Example Task | Pattern | Example of Potential Matches |
|----------------|---------|--------------|---------|------------------------------|
| `.` | Matches any character *except a newline* | Find a 3-character string that starts with `c` and ends with `t` | `c.t` | `cat`, `cut`, `cot` |
| `^` | Matches the start of the string | Find a string that starts with the string `Rat` | `^Rat` | `Rat`, `Rather`, `Rat-like`, `Rattled` |
| `$` | Matches the end of the string | Find any string that ends with `ing` | `ing$` | `Running`, `Jumping`, `hiking`, `walking` |
| `*` | Matches the preceding token 0 or more times | Find strings starting with `cand` that may or may not end with `y` | `candy*` | `cand`, `candy`, `candyy` |
| `+` | Matches the preceding token 1 or more times | Find strings starting with `cand` that must contain at least one `y` | `candy+` | `candy`, `candyy`, `candyyy` |
| `?` | Matches the preceding token 0 or 1 time | Find both American and British spellings of color | `colou?r` | `color`, `colour` |
| `\d` | Any numerical digit (0–9) | Find any number listed in a document | `\d+` | `42`, `2025`, `3`, `17`, `00` |
| `{n}` | Matches exactly *n* repetitions | Find the exact string `111222333` | `1{3}2{3}3{3}` | `111222333` |
| `{n,m}` | Matches between *n* and *m* repetitions | Find numbers with 2–4 digits | `\d{2,4}` | `12`, `576`, `4787` |
| `[]` | Character class: matches any one character inside | Find any vowel in a string | `[aeiouy]` | `a`, `e`, `i`, `o`, `u`, `y` |
| `[^ ]` | Negated character class: matches anything *not* listed | Find a single non-digit character | `[^0-9]` | `a`, `#`, `?`, `!`, `P` |
| `\w` | Word character (letter, digit, or `_`) | Find word-like strings in text | `\w+` | `hello_123`, `yikes` |
| `\s` | Whitespace character | Find whitespace characters in text | `\s+` | ` ` (space), `\t`, `\n`, `\r` |
| `()` | Capturing group | Capture repeated occurrences of `ha` | `(ha)+` | `ha`, `hahaha`, `hahahahaha` |
| `\|` | OR operator | Check whether `cat` or `dog` appears | `cat\|dog` | `cat`, `dog` |
| `(?= )` | Positive lookahead | Match a word only if it is followed by a period | `\w+(?=\.)` | `file` from `file.txt` |
| `(?<= )` | Positive lookbehind | Match digits only if preceded by a `$` sign | `(?<=\$)\d+` | `100` from `$100` |

Now, **we can combine** these **operators** to create a **variety of pattern matches**. In fact, you may have noticed some combinations in the table above, such as pairing a lookahead with the word character operator, or pairing the digit operator `\d` with the one-or-more quantifier `+` to match a continuous run of digits of any length. We can repeat the same operator multiple times, as shown with the repetition operator, and we can mix and match operators as needed. Before we get into crafting and interpreting regex patterns, let's review how we handle special characters.
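To make the combinations above concrete, here is a minimal sketch using Python's built-in `re` module. The sample sentence and variable names are made up for illustration; the patterns come straight from the table.

```python
import re

# Hypothetical sample text, invented for this illustration
text = "Order 42 shipped to file.txt for $100 on 2025-01-15"

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", text)

# \w+(?=\.) : a word only if a period follows (the period is not part of the match)
before_dot = re.findall(r"\w+(?=\.)", text)

# (?<=\$)\d+ : digits only if a dollar sign comes immediately before them
prices = re.findall(r"(?<=\$)\d+", text)

# colou?r : the "u" is optional
spellings = re.findall(r"colou?r", "color and colour")

print(numbers)     # ['42', '100', '2025', '01', '15']
print(before_dot)  # ['file']
print(prices)      # ['100']
print(spellings)   # ['color', 'colour']
```

Notice that the lookahead and lookbehind only check their surroundings: the `.` and the `$` are required but never appear in the results.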



#### 1.B Handling Special Characters: 
For letters or numbers, we just need to type the character in a regex pattern. As we've shown though, some characters are not read directly as characters, but instead as *special operators*. These characters are: 
`.`,  `^`,  `$`,  `*`,  `+`,  `?`,  `{`,  `}`,  `[`,  `]`,  `\`,  `|`,  `(`,  `)`

So what about when we want to match one of these characters literally? For example, what if we wanted to extract all phone numbers from a directory, including the area codes? For US-based directories, those phone numbers might look like (123)-456-7890, so we'd need to handle the parentheses. To do this, we escape the special character with a backslash (`\`). For our phone number match, if we only wanted phone numbers that have the parentheses and are 10 digits long in groups of 3, 3, and 4, we could craft a pattern like this: `\(\d{3}\)-\d{3}-\d{4}`.
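Here is a short sketch of that phone number pattern in action. The directory text and names are invented for the example; only the first and last entries use the parenthesized format, so only they should be found.

```python
import re

# Escaped parentheses, then digit groups of 3, 3, and 4 separated by hyphens
phone_pattern = r"\(\d{3}\)-\d{3}-\d{4}"

# Hypothetical directory entries, made up for illustration
directory = "Ann: (123)-456-7890, Bob: 123-456-7890, Cy: (555)-867-5309"

matches = re.findall(phone_pattern, directory)
print(matches)  # ['(123)-456-7890', '(555)-867-5309']
```

Writing the pattern as a raw string (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.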

**Important Note**: If you want to match a literal hyphen, you do not need a backslash unless it is inside a character class. The pattern `9-5` matches exactly `9-5`, but inside a character class a hyphen between two characters defines a range: `[5-9]` matches any single digit from 5 to 9 (a reversed range such as `[9-5]` is an error in Python). To match a 9, a 5, or a literal `-`, escape the hyphen with a backslash: `[9\-5]`. Adding the `+` quantifier (as in `[9\-5]+`) lets us retrieve strings like `555-999`, `9-5`, `995-`, `-`, and `5`. If this part is a little confusing, don't worry: with practice it becomes almost second nature.
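The hyphen rules are easier to see by running them. The sketch below uses `re.fullmatch` (the whole string must match) and a handful of made-up sample strings; it also shows what Python does with a reversed range.

```python
import re

# Hypothetical sample strings for illustration
samples = ["555-999", "9-5", "995-", "-", "5", "7"]

# Escaped hyphen: the class contains 9, a literal '-', and 5
literal_class = r"[9\-5]+"
matching = [s for s in samples if re.fullmatch(literal_class, s)]
print(matching)  # ['555-999', '9-5', '995-', '-', '5']

# Unescaped hyphen between two characters: a range of digits 5 through 9
in_range = re.findall(r"[5-9]", "3 5 7 9 -")
print(in_range)  # ['5', '7', '9']

# A reversed range is rejected when the pattern is compiled
try:
    re.compile(r"[9-5]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False
print(reversed_range_ok)  # False
```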

Let's go ahead and practice interpreting RegEx patterns: 



---- 

### 2.0 Practice Problems: Interpreting RegEx Patterns 


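**Problem 1:** `re.fullmatch()`

Using Python's `re` module, which of the following strings would the pattern below match in full, as with `re.fullmatch`? Unless a problem says otherwise, assume the pattern must match the entire string. Select all that apply. 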

```python
pattern = 'cat'
```
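Which strings count as a match depends on which `re` function is used, so it helps to see the three most common ones side by side. This is a minimal sketch with a different pattern (`dog`) and made-up candidate strings, so it does not give away the answer to the problem.

```python
import re

pattern = r"dog"
# Hypothetical candidates, chosen only to show the differences
candidates = ["dog", "dogma", "hotdog", "Dog"]

results = {}
for s in candidates:
    results[s] = (
        bool(re.fullmatch(pattern, s)),  # whole string must match
        bool(re.match(pattern, s)),      # must match at the start of the string
        bool(re.search(pattern, s)),     # may match anywhere in the string
    )
    print(s, results[s])

# dog    (True, True, True)
# dogma  (False, True, True)
# hotdog (False, False, True)
# Dog    (False, False, False)
```

Matching is case-sensitive by default, which is why `Dog` fails all three checks.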



In [6]:
import ipywidgets as widgets
from IPython.display import display

# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="cat", indent=False)
cb2 = widgets.Checkbox(description="CAT", indent=False)
cb3 = widgets.Checkbox(description="Cat", indent=False)
cb4 = widgets.Checkbox(description="catatonic", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"cat"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, submit, output)



**Problem 2:** Dot Wildcard

Which of the following would be matched in full by the regex pattern `b.t`? Select all that apply. 

```python
import re

pattern = r'b.t'
re.fullmatch(pattern, string)  # `string` stands for each candidate below
```



In [12]:
import ipywidgets as widgets
from IPython.display import display

# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="bat", indent=False)
cb2 = widgets.Checkbox(description="bot", indent=False)
cb3 = widgets.Checkbox(description="bin", indent=False)
cb4 = widgets.Checkbox(description="bit", indent=False)
cb5 = widgets.Checkbox(description="but", indent=False)
cb6 = widgets.Checkbox(description="byte", indent=False)


# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"bat", "bot", "bit", "but"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4, cb5, cb6) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, cb5, cb6, submit, output)



**Problem 3:** Character Class

Which of the following would be identified from the regex pattern `c[aou]t`? Select all that apply. 

```python
pattern = 'c[aou]t'
```



In [13]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="cot", indent=False)
cb2 = widgets.Checkbox(description="coat", indent=False)
cb3 = widgets.Checkbox(description="cat", indent=False)
cb4 = widgets.Checkbox(description="cut", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"cot", "cat", "cut"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, submit, output)



**Problem 4:** Multi-Character Class

Which of the following would be identified from the regex pattern `[a-z][0-9]`? Select all that apply. 

```python
pattern = '[a-z][0-9]'
```



In [14]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="at1", indent=False)
cb2 = widgets.Checkbox(description="b4", indent=False)
cb3 = widgets.Checkbox(description="s7", indent=False)
cb4 = widgets.Checkbox(description="0c", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"b4", "s7"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, submit, output)



**Problem 5:** Multi-Character Class with `+` quantifier

Which of the following would be identified from the regex pattern `[a-z]+[0-9]`? Select all that apply. 

```python
pattern = '[a-z]+[0-9]'
```



In [15]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="apple1", indent=False)
cb2 = widgets.Checkbox(description="agent007", indent=False)
cb3 = widgets.Checkbox(description="a007", indent=False)
cb4 = widgets.Checkbox(description="a7", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"apple1", "a7"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, submit, output)



**Problem 6:** Multi-Character Class with `*` quantifier

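Which of the following would be identified from the regex pattern `[a-z][0-9]*`? For this problem, assume `re.match`, which only requires the pattern to match at the start of the string rather than the whole string. Select all that apply. 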

```python
pattern = '[a-z][0-9]*'
```



In [16]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="quantum", indent=False)
cb2 = widgets.Checkbox(description="7mechanics", indent=False)
cb3 = widgets.Checkbox(description="mechanical", indent=False)
cb4 = widgets.Checkbox(description="fifteen77", indent=False)
cb5 = widgets.Checkbox(description="thirty9", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"quantum", "mechanical", "fifteen77", "thirty9"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4, cb5) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, cb5, submit, output)



**Problem 7:** Word character `\w+` operator

Which of the following would be identified from the regex pattern `\w+`? Select all that apply. 

```python
pattern = r'\w+'
```



In [17]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="howdy", indent=False)
cb2 = widgets.Checkbox(description="yessir77", indent=False)
cb3 = widgets.Checkbox(description="hello_world", indent=False)
cb4 = widgets.Checkbox(description="right!", indent=False)
cb5 = widgets.Checkbox(description="sure thing", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"howdy", "yessir77", "hello_world"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4, cb5) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, cb5, submit, output)



**Problem 8:** Question Mark Operator

Which of the following would be identified from the regex pattern `t?oo`? Select all that apply. 

```python
pattern = 't?oo'
```



In [None]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description="too", indent=False)
cb2 = widgets.Checkbox(description="to", indent=False)
cb3 = widgets.Checkbox(description="oo", indent=False)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {"too", "oo"}
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, submit, output)



**Problem 9:** Email address matches

If we wanted to match any potentially valid email address that ends with the suffix '.com', '.org', or '.gov', which patterns would work? Select all that apply.
*Assume hyphens and spaces are not supported, and accented letters are not needed.* 




In [None]:
# Create checkboxes (left aligned by default)
cb1 = widgets.Checkbox(description=r"\w+@(com|org|gov)", indent=False)          # ✅ valid
cb2 = widgets.Checkbox(description=r"[a-zA-Z0-9._%+-]+@(com|org|gov)", indent=False)  # ✅ valid
cb3 = widgets.Checkbox(description=r"\w+@(\.com|\.org|\.gov)", indent=False)    # ✅ valid
cb4 = widgets.Checkbox(description=r"\w+/@(com|org|gov)", indent=False)         # ❌ invalid
cb5 = widgets.Checkbox(description=r"\w+@(com|org|gov)+", indent=False)         # ❌ close but invalid
cb6 = widgets.Checkbox(description=r"\w+@[com|org|gov]", indent=False)         # ❌ close but invalid

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answers
        correct = {r"\w+@(com|org|gov)",
                   r"[a-zA-Z0-9._%+-]+@(com|org|gov)",
                   r"\w+@(\.com|\.org|\.gov)",
                   }
        
        # User selections
        selected = {
            cb.description for cb in (cb1, cb2, cb3, cb4, cb5, cb6) if cb.value
        }

        if selected == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(cb1, cb2, cb3, cb4, cb5, cb6, submit, output)



**Problem 10:** Pattern Comprehension

Which text option matches this pattern `([A-Z][a-z]+)\s([A-Z][a-z]+)`?


In [21]:
radio = widgets.RadioButtons(
    options=["John Doe", "john Doe", "John doe", "J Doe"],
    description="Select:",
    indent=False
)

# Submit button
submit = widgets.Button(
    description="Submit",
    button_style="primary"
)

# Output area
output = widgets.Output()

def check_answers(b):
    with output:
        output.clear_output()
        
        # Correct answer
        correct = "John Doe"
        
        # Check user selection
        if radio.value == correct:
            print("✅ Correct!")
        else:
            print("❌ Incorrect.")
            print("Try again!")

submit.on_click(check_answers)

# Display UI
display(radio, submit, output)


#### Final Note: A Caution on Creating Patterns: 
You'll find that with regex, there are many ways to get the outcome you want. Going back to our example of pulling phone records, each of these patterns achieves the same outcome: 
*   `\(\d{3}\)-\d{3}-\d{4}`
*   `\([0-9]{3}\)-[0-9]{3}-[0-9]{4}`
*   `\([0-9][0-9][0-9]\)-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]`

Some of these patterns are easier to write and interpret than others, though. When creating your regex patterns, it's important to guard against unexpected edge cases and unintended matches from the rules you construct. For example, while those three patterns give the same result, the pattern `\(\d+\)[\-\d+]+` is not equivalent: it places no limit on the number of digits (and inside the character class, `+` is a literal plus sign), so it would also match numerical strings with more or fewer than 10 digits, such as `(999)-09-000000`.
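One way to check claims like these is to run every pattern against the same set of strings. The sketch below is only an illustration: the sample strings and variable names are made up, and `re.fullmatch` is used so that each pattern must account for the whole string.

```python
import re

# Three ways of writing the same strict phone number pattern
strict_patterns = [
    r"\(\d{3}\)-\d{3}-\d{4}",
    r"\([0-9]{3}\)-[0-9]{3}-[0-9]{4}",
    r"\([0-9][0-9][0-9]\)-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
]

# The looser pattern: no limit on how many digits appear
loose_pattern = r"\(\d+\)[\-\d+]+"

# Hypothetical test strings for illustration
samples = ["(123)-456-7890", "(999)-09-000000", "(12)-3", "123-456-7890"]

strict_results = {}
loose_results = {}
for s in samples:
    # Collect the result of each strict pattern; one distinct value means they agree
    outcomes = {bool(re.fullmatch(p, s)) for p in strict_patterns}
    assert len(outcomes) == 1, "strict patterns disagree"
    strict_results[s] = outcomes.pop()
    loose_results[s] = bool(re.fullmatch(loose_pattern, s))
    print(f"{s!r}: strict={strict_results[s]}, loose={loose_results[s]}")
```

The three strict patterns agree on every sample, while the loose pattern also accepts `(999)-09-000000` and `(12)-3`, which are not valid 10-digit phone numbers.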

In any pattern you create, it's always a good idea to ask, "How could this pattern fail?" and "What might this pattern match that I *don't want*?"