<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312_H11/blob/main/CST2312_H11_Spr2025_Class_17_RegexPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CST2312 Class17 - Regex**

*This Colab notebook is available at [bit.ly/cst2312cl17](https://bit.ly/cst2312cl17).*

*This notebook includes CST2312 homework Assignment #4, which includes an exercise to be completed (⚪ 4.1) at the end of the notebook.*

## Contents  
  
1. About Regular Expressions (Regex)
2. Python and Regex -- The Basics
3. Raw Strings
4. Regex Syntax Fundamentals
5. Finding Regex Patterns with `re.search()` and `re.match()`
6. Using an Online Regex Tester
7. Multiple Matches with Regex `re.findall()`
8. Regex Match Objects
9. Looping with Regex `re.finditer()`
10. Substitution with Regex `re.sub()`
11. Character Classes and Ranges
12. Quantifiers and Greediness
13. Grouping and Capture
14. Compiling Regex Patterns with `re.compile()`  
  
15. Homework Assignment 4.1  
16. Additional Practice -- NOAA Tides





---



## 1. About Regular Expressions (Regex)

**Reading** from the required textbook ([https://www.py4e.com/lessons/](https://www.py4e.com/lessons/)):

* [Regular Expressions](https://www.py4e.com/lessons/regex) (Chapter 12)
* [Data Science Cheat Sheet Python Regular Expressions](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)



So far we have been using methods like `split` and `find` to extract portions of strings, or to check whether a particular item is part of a collection (list, set, tuple, or dictionary) or whether a substring is part of a longer string.

Regular Expressions
-------------------

Regular expressions (Regex or re's) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists.

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement.

We will present examples using Python's standard [re regular expression library](http://docs.python.org/library/re.html).

We will discuss Python libraries in detail later.

Many examples are from this [Google tutorial](https://developers.google.com/edu/python/regular-expressions). Note that the tutorial itself uses a 2.x version of Python, so several statements (for example, `print`) look different than in later versions of Python. In this notebook, the examples from the Google tutorial are converted to the current Python version.



---



## 2. Python and Regex -- The Basics

1. The `re` Module
The `re` module provides regular expression support in Python. It must be explicitly imported before use.

2. Raw Strings in Regex
Raw strings (e.g., r"...") are commonly used in regular expressions to prevent Python from interpreting backslashes as escape characters. This ensures the regex engine receives the intended pattern.

3. Core Matching Functions

- `re.search(pattern, string)`: Scans through the string and returns the first match found.  
  
- `re.match(pattern, string)`: Attempts to match a pattern only at the beginning of the string.  
  
- `re.fullmatch(pattern, string)`: Requires the entire string to match the pattern exactly.  

4. Basic Example
A simple example might include identifying a word or number within a sentence. Typical usage includes checking if a match is found and retrieving the result using `.group()`.

5. Function Selection Considerations
Each matching function serves a different use case. The choice among `search`, `match`, and `fullmatch` depends on whether the pattern must match the entire string, the start of the string, or any part of the string.

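As a minimal sketch of how the three functions differ, the following compares them on a few made-up strings. The sample strings and the name `results` are illustrative assumptions, not part of the course material.

```python
import re

pattern = r"\d+"                        # one or more digits
samples = ["123", "123abc", "abc123"]   # hypothetical sample strings

results = {}
for text in samples:
    results[text] = (
        re.search(pattern, text) is not None,     # match anywhere?
        re.match(pattern, text) is not None,      # match at the start?
        re.fullmatch(pattern, text) is not None,  # match the entire string?
    )
    print(text, results[text])
```

Only the string made up entirely of digits satisfies all three; a string with digits at the end satisfies `search` alone.
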
### Housekeeping   

The regular expression library `re` must be imported into your program before you can use it.

In [None]:
# first import the regular expression library
import re



---



## 3. Raw Strings

First, a reminder about *raw strings* which are delineated with the `r` character preceding the first quotation (single or double) mark of a *string*.  When a string is denoted as a *raw string* then the contents inside the quotes are interpreted exactly as they are.  That is, special control characters such as `\n`, `\t`, etc.  are interpreted as the text characters '\' and 'n' or '\' and 't' respectively and not as their control characters (new line, or tab).  

In Python, a raw string is a string prefixed with the letter `r` or `R`, like this:  

In [None]:
r"This is a raw string"

### Purpose

  
Raw strings are used to treat backslashes (`\`) as literal characters rather than escape characters. This is particularly useful for things like:

- Regular expressions  
- Windows file paths  
- Other cases where many backslashes are used  



In [None]:
# Regular string
print("C:\\Users\\John\\Documents")
# Output: C:\Users\John\Documents

# Raw string
print(r"C:\Users\John\Documents")
# Output: C:\Users\John\Documents

In the regular string, each `\\` is interpreted as a single backslash. In the raw string, the backslashes are preserved as-is.

*Important Notes*  
  
Raw strings do not escape backslashes, so sequences like `\n`, `\t`, or `\\` are taken literally.

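A quick way to see the difference is to count characters. This is a small sketch; the variable names `normal` and `raw` are illustrative.

```python
normal = "a\nb"    # backslash-n is converted to one newline character
raw = r"a\nb"      # the backslash and the 'n' stay as two separate characters

print(len(normal))        # counts 'a', the newline, and 'b'
print(len(raw))           # counts 'a', the backslash, 'n', and 'b'
print(raw == "a\\nb")     # the same characters, written two different ways
```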
A raw string cannot end with a single backslash (or any odd number of backslashes), because the final backslash would escape the closing quote and cause a syntax error:

In [None]:
# Invalid: ends with single backslash
r"Path ends with backslash\"

In [None]:
# Valid: the raw string ends with two backslashes, and both are kept
r"Path ends with backslash\\"

### Practice


#### 1.1 Printing a string with embedded control characters    


In [None]:
print('This string\nprints two lines.')

#### 1.2 Printing a raw string with the same embedded control characters    


In [None]:
print(r'This string\n does not print two lines.')

#### 1.3 Try printing a string with embedded tab control characters as an `r` string and as a string (not `r`)      




---



## 4. Regex Syntax Fundamentals

see [Python Documentation](https://docs.python.org/3/library/re.html)

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are some basic patterns.  

### Regular Expression Pattern Reference by Category

| **Pattern**     | **Description**                                         | **Usage Example**                            |
|-----------------|---------------------------------------------------------|----------------------------------------------|
|                 | **Literal Characters**                                  |                                              |
| `a`, `Z`, `7`   | Match exact characters                                  | `re.search("dog", "hotdog")` → Match `'dog'`  |
|                 | **Metacharacters (must be escaped to match literally)** |                                              |
| `.`             | Matches any character except newline                    | `re.search("a.c", "abc")` → Match `'abc'`    |
| `\.`            | Matches a literal dot                                   | `re.search(r"a\.c", "a.c")` → Match `'a.c'`  |
| `\*`, `\+`      | Escape special characters to match them literally       | `re.search(r"\*", "a*b")` → Match `'*'`      |
|                 | **Character Classes**                                   |                                              |
| `[abc]`         | Matches any one of `a`, `b`, or `c`                     | `re.search(r"[abc]", "xay")` → `'a'`         |
| `[a-z]`         | Matches any lowercase letter                            | `re.search(r"[a-z]", "ABCd")` → `'d'`        |
| `\d`            | Matches any digit (`[0-9]` in ASCII text)               | `re.search(r"\d", "abc3")` → `'3'`           |
| `\D`            | Matches any non-digit                                   | `re.search(r"\D", "1a")` → `'a'`             |
| `\w`            | Matches any word character (`[a-zA-Z0-9_]` in ASCII text) | `re.search(r"\w", "@a")` → `'a'`           |
| `\W`            | Matches any non-word character                          | `re.search(r"\W", "a!")` → `'!'`             |
| `\s`            | Matches any whitespace character                        | `re.search(r"\s", "a b")` → `' '`            |
| `\S`            | Matches any non-whitespace character                    | `re.search(r"\S", " a")` → `'a'`             |
|                 | **Quantifiers**                                         |                                              |
| `*`             | Matches 0 or more of the preceding element              | `re.search(r"a*b", "aaab")` → `'aaab'`       |
| `+`             | Matches 1 or more of the preceding element              | `re.search(r"a+b", "aaab")` → `'aaab'`       |
| `?`             | Matches 0 or 1 of the preceding element                 | `re.search(r"a?b", "b")` → `'b'`             |
| `{n}`           | Matches exactly `n` times                               | `re.search(r"a{3}", "aaab")` → `'aaa'`       |
| `{n,}`          | Matches `n` or more times                               | `re.search(r"a{2,}", "aaa")` → `'aaa'`       |
| `{n,m}`         | Matches between `n` and `m` times                       | `re.search(r"a{2,3}", "aaaa")` → `'aaa'`     |
|                 | **Anchors**                                             |                                              |
| `^`             | Anchors to the beginning of a string                    | `re.search(r"^abc", "abcdef")` → `'abc'`     |
| `$`             | Anchors to the end of a string                          | `re.search(r"abc$", "123abc")` → `'abc'`     |
| `\b`            | Matches a word boundary                                 | `re.search(r"\bcat", "the cat")` → `'cat'`   |
| `\B`            | Matches a non-word boundary                             | `re.search(r"\Bcat", "scat")` → `'cat'`      |
|                 | **Groups and Capturing**                                |                                              |
| `(abc)`         | Captures a group                                        | `re.search(r"(abc)", "abc").group(1)` → `'abc'`  |
| `(?:abc)`       | Non-capturing group                                     | `re.search(r"(?:abc)", "abc")` → `'abc'`     |
| `(?P<name>abc)` | Named capturing group                                   | `re.search(r"(?P<grp>abc)", "abc").group("grp")` → `'abc'`  |
|                 | **Alternation**                                         |                                              |
| `a\|b`          | Matches either `a` or `b`                               | `re.search(r"a\|b", "cat")` → `'a'`           |
|                 | **Escapes and Special Sequences**                       |                                              |
| `\\`            | Matches a literal backslash                             | `re.search(r"\\", "\\path")` → `'\\'`        |
| `\n`            | Matches newline character                               | `re.search(r"\n", "a\nb")` → `'\n'`          |
| `\t`            | Matches tab character                                   | `re.search(r"\t", "a\tb")` → `'\t'`          |
| `\xFF`          | Matches a character by hexadecimal value                | `re.search(r"\x41", "A")` → `'A'`            |
| `\u1234`        | Matches a Unicode character                             | `re.search(r"\u0061", "a")` → `'a'`          |

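A few rows of the table above can be checked directly in a code cell. This is a minimal sketch; the list name `checks` is an illustrative assumption.

```python
import re

# (pattern, text, expected match) triples copied from the table above
checks = [
    (r"a\.c", "a.c", "a.c"),
    (r"\d", "abc3", "3"),
    (r"a{2,3}", "aaaa", "aaa"),
    (r"\bcat", "the cat", "cat"),
    (r"a|b", "cat", "a"),
]

for pattern, text, expected in checks:
    found = re.search(pattern, text).group()
    print(f"{pattern!r:12} in {text!r:10} -> {found!r}  (expected {expected!r})")
```

Try adding more rows from the table to `checks` and rerunning the cell.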

### Case Sensitivity in Regular Expressions

Regular expressions are case-sensitive by default.

In [None]:
# Regular expressions are compiled into pattern objects
# Regular expressions are case-sensitive
regex = re.compile(r'in.*on')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print('The end')

But we can make them case-insensitive by passing the flag `re.IGNORECASE`:

In [None]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile('in.*on',re.IGNORECASE)
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print('The end')

### Regular Expression Functions for Analyzing Patterns

* `.compile()`    
* `.finditer()`    
* `.findall()`    
* `.match()`    
* `.search()`    
* `.sub()`    

[https://docs.python.org/3/library/re.html](https://docs.python.org/3/library/re.html)

Let's separate these functions into four groups:  
1. `.match()`, `.search()`  
2. `.finditer()`, `.findall()`    
3. `.sub()`   
4. `.compile()`    
   
We will discuss each of those in this notebook.  Along the way, we will cover related topics as they are revealed in our practice with the various `re` functions.  Many brief examples are provided.  Try changing strings to be searched and search patterns as you practice with this notebook.

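As a quick preview, here is one call from each group applied to the same text. This is only a sketch: the sample string and the variable names are made up for illustration, and each function is covered in its own section of this notebook.

```python
import re

text = "Order 66 shipped on day 12"   # hypothetical sample string

first_word = re.match(r"\w+", text).group()      # group 1: anchored at the start
first_number = re.search(r"\d+", text).group()   # group 1: first match anywhere
all_numbers = re.findall(r"\d+", text)           # group 2: list of all matches
spans = [m.span() for m in re.finditer(r"\d+", text)]   # group 2: match objects
masked = re.sub(r"\d+", "#", text)               # group 3: substitution
digits = re.compile(r"\d+")                      # group 4: reusable pattern object

print(first_word, first_number, all_numbers, spans)
print(masked)
print(digits.findall(text))
```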


---



## 5. Finding Regex Patterns with `re.search()` and `re.match()`

#### `re.search()`
  
**Purpose**:  

Scans the entire string for the first match of the pattern.

Returns a match object if a match is found; otherwise, returns `None`.

**Example**:



In [None]:
re.search(r'\d+', 'abc123xyz')  # Matches '123'

<b>Key Points</b>:  

Useful when you only need to check if a pattern exists in the string.  

Stops after the first match.  

Unlike `re.match()`, it doesn’t require the match to start at the beginning.  

#### Simple (relatively) Regex Examples with `re.search()`

In [None]:
# let's create a string variable and try to use `split` to find 'cat'
inputStr = 'An example word:cat!!'
input_list = inputStr.split(' ')
for element in input_list :
    print(element)

In [None]:
inputStr = 'An example word:cat!!'
search_pattern_raw = r'cat'

print('Search for the pattern "' + search_pattern_raw + '" in "' + inputStr + '"')

In [None]:
# the `re.search()` method returns a Match object which can be evaluated as a Boolean
inputStr = 'An example word:cat!!'

# tell me a pattern and where to look for that pattern...
if re.search(r'cat', inputStr):   # search takes arg1 of a search pattern and arg2 of string to search
  print('Found that pattern in the following string:')
  print(inputStr)

# then print confirmation of execution
print ("Done with the example")

In [None]:
# The actual Match object returned by the `re.search()` method which can be evaluated as a Boolean
inputStr = 'An example word:cat!!'
re_returns = re.search(r'cat', inputStr)
print(re_returns)

In [None]:
# for the last example there is no need for a raw `r` string
inputStr = 'An example word:cat!!'
if re.search('cat', inputStr):   # search takes arg1 of a search term and arg2 of string to search
  print('Found that pattern in the following string:')
  print(inputStr)

print ("Done with the example")

In [None]:
# the `re.search()` method returns a Match object which can be evaluated as a Boolean
inputStr = 'An example word: cat!!'
if re.search(r'dog', inputStr):
    print(inputStr)
else:
    print('Sorry.  Could not find "' + r'dog' + '" in "' + inputStr + '"')

In [None]:
# The Match object returned by the `re.search()` method which can be evaluated as a Boolean
inputStr = 'An example word:cat!!'
re_returns = re.search(r'dog', inputStr)
print(re_returns)

#### Some additional search patterns with `re.search()`

In [None]:
inputStr = 'An example word: cat!!'
if re.search(r'word: \w\w\w', inputStr):
  print('Found it in -', inputStr)

print ("Done with the example")

In [None]:
# The Match object returned by the `re.search()` method which can be evaluated as a Boolean
inputStr = 'An example thing: cat!!'
re_returns = re.search(r'thing: \w\w\w', inputStr)
print(re_returns)

In [None]:
inputStr = 'An example word: dog, anda, cat!!'
if re.search(r'word: \w\w\w', inputStr):
  print('Found it in -', inputStr)

print ("Done with the example")

In [None]:
# The Match object returned by the `re.search()` method which can be evaluated as a Boolean
inputStr = 'An example word: dog, anda, cat!!'
re_returns = re.search(r'word: \w\w\w', inputStr)
print(re_returns)

In [None]:
# What if the regular expression pattern includes a space after 'word:' but the string does not?
inputStr = 'An example word:dog, anda, cat!!'
if re.search(r'word: \w\w\w', inputStr):
    print('Found it in -', inputStr)
else:
    print('Sorry.  Could not find "' + r'word: \w\w\w' + '" in "' + inputStr + '"')

print ("Done with the example")

In [None]:
# The Match object returned by the `re.search()` method which can be evaluated as a Boolean
inputStr = 'An example word:dog, anda, cat!!'
re_returns = re.search(r'word: \w\w\w', inputStr)
print(re_returns)

In [None]:
the_text = 'there are many words here some with three characters and also this phrase "stuffH dog", plus cat!!'
if re.search(r'stuffH \w\w\w', the_text):
  print("I found the pattern you asked about.")

print ("Done with the example")

In [None]:
# The Match object returned by the `re.search()` method which can be evaluated as a Boolean
the_text = 'there are many words here some with three characters and also this phrase "stuffH dog", plus cat!!'
re_returns = re.search(r'stuffH \w\w\w', the_text)
print(re_returns)

In [None]:
# What if we did that but had a lower case 'h' in the raw search string?
the_text = 'there are many words here some with three characters and also this phrase "stuffH dog", plus cat!!'
if re.search(r'stuffh \w\w\w', the_text):
    print("I found the pattern you asked about.")
else:
    print('Sorry.  Could not find "' + r'stuffh \w\w\w' + '" in "' + the_text + '"')

print ("Done with the example")

#### Pattern Matching in File Contents with `re.search()`

Let's try using some regex patterns with the `re.search()` function.  To begin, we will bring a copy of Dr. Chuck's `mbox-short.txt` file to our current working directory with the following `curl` command.

In [None]:
!curl "https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/refs/heads/main/mbox-short.txt" -o mbox-short.txt

The regular expression module re must be imported into your program before you can use it. The simplest use of the regular expression module is the `search()` function. The following program demonstrates a trivial use of the search function.

In [None]:
# Search for lines that contain 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'From:', line):
        print(line)

The power of regular expressions comes when we add special characters to the search string that allow us to more precisely control which lines match the string. Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.

For example, the caret character is used in regular expressions to match “the beginning” of a line. We could change our program to only match lines where “From:” was at the beginning of the line as follows:

In [None]:
# Search for lines that start with 'From:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^From:', line):
        print(line)

There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period or full stop, which matches any character.


In the following example, the regular expression `F..m:` would match any of the strings “From:”, “Fxxm:”, “F12m:”, or “F!@m:” since the period characters in the regular expression match any character.

In [None]:
# Search for lines that start with 'F', followed by
# 2 characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^F..m:', line):
        print(line)

In [None]:
# Search for lines that start with 'F', followed by
# 3 characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^F...m:', line):
        print(line)

In [None]:
# Search for lines that start with 'F', followed by
# exactly 3 characters (written with the {3} quantifier), followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^F.{3}m:', line):
        print(line)

In [None]:
# Search for lines that start with 'F', followed by
# 1 or more characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^F.+m:', line):
        print(line)

In [None]:
# Search for lines that start with 'F', followed by
# 0 or more characters, followed by 'm:'
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^F.*m:', line):
        print(line)

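To compare the five patterns above without reading the file, the following sketch runs each one against a few made-up lines. The sample lines and the name `results` are illustrative assumptions and do not come from `mbox-short.txt`.

```python
import re

patterns = [r"^F..m:", r"^F...m:", r"^F.{3}m:", r"^F.+m:", r"^F.*m:"]
samples = ["Fm: a", "From: b", "Fxxxm: c"]   # hypothetical sample lines

results = {}
for pattern in patterns:
    # keep only the sample lines in which the pattern is found
    results[pattern] = [s for s in samples if re.search(pattern, s)]
    print(f"{pattern:10} matches {results[pattern]}")
```

Note that `...` and `.{3}` behave identically, while `.+` requires at least one character between the 'F' and the 'm:' and `.*` allows none at all.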


---



#### More Basic Regex Pattern Examples


The basic rules of regular expression search for a pattern within a string are:

* The search proceeds through the string from start to end, stopping at the first match found

* All of the pattern must be matched, but not all of the string

* If `match = re.search(pat, str)` is successful, match is not `None` and in particular `match.group()` is the matching text

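The rules above can be sketched with a made-up string in which the pattern could match in several places. The sample text and the variable names are illustrative.

```python
import re

text = "pig dig fig"   # hypothetical sample string

first = re.search(r"\wig", text)
print(first.group())     # only the first (leftmost) match is reported

partial = re.search(r"igs", text)
print(partial)           # 'ig' is present but 'igs' is not, so the result is None
```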
In [None]:
## Search for pattern 'ooo' in string 'pooog'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.

inputStr = 'pooog'

look_for = r'ooo'
matched_string = re.search(look_for, inputStr)

if matched_string:
  print ('pattern:', look_for)
  print ('found, matched_string.group() ==', matched_string.group())

In [None]:
match = re.search(r'ig', 'pooog')
# match = re.search(r'igs', 'pooog')
if not match:
    print ('not found, match ==', match)
else:
    print ('found, match.group() ==', match.group())

In [None]:
## . = any char but \n

# try each of these separately
# inputStr = 'pooog'
# inputStr = '9oooz'
# inputStr = '\nxyz'
inputStr = r'\nxyz'

# look_for = r'..g'
# look_for = r'g..'
# look_for = r'p.'
look_for = r'.'

match = re.search(look_for, inputStr)

if match:
  print ('pattern:', look_for)
  print ('found, match.group() ==', match.group())

print("All done.")

In [None]:
## \d = digit char, \w = word char

inputStr = 'p123g'

look_for = r'\d\d\d'

match = re.search(look_for, inputStr)

# match = re.search(r'\d\d\d', 'p123g')

if match:
  print ('pattern:', look_for)
  print ('found, match.group() ==', match.group())

In [None]:
## \d = digit char, \w = word char

inputStr = '@@abcd!!'

# try each of these separately
# look_for = r'\w\w\w'
# look_for = r'\w\w.\w'
# look_for = r'\w\w\w.'
look_for = r'.\w\w\w'

match = re.search(look_for, inputStr)

if match:
  print ('pattern:', look_for)
  print ('found, match.group() ==', match.group())
else:
  print ("Nothing found")

In [None]:
## looking for four word characters

# try each of these separately
# inputStr = '@@abcd!!'
# inputStr = '@@abc7!!'
inputStr = '@@abc!7??'

look_for = r'\w\w\w\w'

match = re.search(look_for, inputStr)

if match:
  print ('pattern:', look_for)
  print ('found, match.group() ==', match.group())
else:
  print ("Nothing found")

#### <u><b><font color=blue>Regex Email Exercise 1</font></b></u>
    
You just started a new job and have been asked to look at the code used in a program that has not been working.  The program specification says that it should read a string with a full email address then extract the domain name and the TLD (top-level domain) from that email address.  The code you reviewed assigns email addresses to a variable `email` but it is not printing the domain name or TLD consistently.  For example, the email addresses:    
 `patrick.slattery16@citytech.cuny.edu`    or
 your CityTech email address would result in incorrect output.   

Review and fix the following code to extract the domain name and TLD according to the specifications.


In [None]:
# email = 'patrick.slattery16@citytech.cuny.edu'
email = 'fname.lname99@mail.citytech.cuny.edu'

### FIX/REPLACE THE CODE BETWEEN HERE
email_match = re.search(r'(\w+)@(\w+)\.(\w+)', email)

print('found:' + email_match.group())

match_list= re.split(r"@",email_match.group())
### ... AND HERE

print('The domain and TLD of',  email,'are:',match_list)

In [None]:
# type your code here


##### ***SOLUTION - don't look here***

The following regular expression matches the host name (sub-domains, domain name, and TLD) at the end of an email address and captures it in a single group; the code then splits that group on dots to separate the sub-domains, the domain name, and the TLD. The regular expression uses character classes and non-capturing groups to match the different parts of the host name, and the `$` symbol ensures that the match ends at the end of the string.

In [None]:
# email = 'patrick.slattery16@citytech.cuny.edu'
email = 'fname.lname99@mail.citytech.cuny.edu'

pattern = r'@((?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:\.[A-Za-z]{2,})?)$'

match = re.search(pattern, email)

if match:
    sub_domains = match.group(1).split(".")[:-2]
    domain_name = match.group(1).split(".")[-2]
    tld = match.group(1).split(".")[-1]
    print(f"Sub-domains: {sub_domains}")
    print(f"Domain name: {domain_name}")
    print(f"TLD: {tld}")


<b><u><center><h3>Breaking Down the Solution Regex Pattern</h3></center></u></b>
    
```
r'@((?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,}(?:\.[A-Za-z]{2,})?)$'
```

**The regular expression search pattern is built from three fundamental parts:**

1. <b><u>`@`</u></b>: This matches the "@" symbol in an email address, which marks the start of the host name.

2. <b><u>`(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+`</u></b>: This non-capturing group matches one label followed by a dot, such as `mail.` or `citytech.`. The `+` at the end repeats it, so any number of sub-domains and the domain name can precede the final label.

3. <b><u>`[A-Za-z]{2,}`</u></b>: This matches the final label, normally the TLD, which must be composed of two or more letters.

The outer parentheses that wrap parts 2 and 3 form capture group 1, so `match.group(1)` holds everything after the "@". The code then splits that text on dots to separate the sub-domains, the domain name, and the TLD.

*Let's break this down further*:

  - <b><u>`[A-Za-z0-9]`</u></b>: This matches the first character of a label (a sub-domain or domain name), which must be a letter or number.

  - <b><u>`(?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?`</u></b>: This optional non-capturing group matches the rest of the label: up to 61 letters, numbers, or hyphens followed by one final letter or number. A label can therefore be at most 63 characters long and cannot end with a hyphen.

  - <b><u>`\.`</u></b>: This matches the literal dot that separates one label from the next.

  - <b><u>`[A-Za-z]{2,}`</u></b>: This matches a label made up of two or more letters.

  - <b><u>`(?:\.[A-Za-z]{2,})?`</u></b>: This optionally matches one more dot and alphabetic label, which allows for two-part endings such as `.co.uk`. The `(?:...)` indicates a non-capturing group, and the `?` indicates that the group is optional.

  - <b><u>`$`</u></b>: This matches the end of the string, ensuring that the regular expression matches only the host name at the end of an email address.
    
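One way to keep a breakdown like this next to the pattern itself is the `re.VERBOSE` flag, which lets a pattern span several lines and carry `#` comments. The following is a sketch of the same solution pattern written that way; the name `HOST_PATTERN` is a hypothetical choice for illustration.

```python
import re

# Same pattern as the solution, spread over several lines with comments.
HOST_PATTERN = re.compile(r"""
    @                                        # separator before the host name
    (                                        # group 1: the whole host name
      (?:                                    # one label followed by a dot ...
        [A-Za-z0-9]                          #   first character: letter or digit
        (?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?   #   optional rest of the label
        \.                                   #   literal dot
      )+                                     # ... repeated one or more times
      [A-Za-z]{2,}                           # a final alphabetic label
      (?:\.[A-Za-z]{2,})?                    # optionally one more alphabetic label
    )
    $                                        # end of the string
""", re.VERBOSE)

match = HOST_PATTERN.search('fname.lname99@mail.citytech.cuny.edu')
if match:
    print(match.group(1))
```

In verbose mode, whitespace in the pattern is ignored, so a literal space would have to be escaped or placed inside a character class.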

In [None]:
# Muhammad's solution

# email = 'patrick.slattery16@citytech.cuny.edu'
email = 'fname.lname99@mail.citytech.cuny.edu'
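# email = 'cserv@michigan.edu'   # this shorter address would not match the pattern below

# the pattern expects exactly two dot-separated names before '@' and four after it
match = re.match(r'(\w+)\.(\w+)@(\w+)\.(\w+)\.(\w+)\.(\w+)', email)

if match:
    # groups 3-5 are the host labels before the TLD; group 5 is the domain, group 6 the TLD
    host_prefix = match.group(3) + '.' + match.group(4) + '.' + match.group(5)
    domain = match.group(5)
    tld = match.group(6)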
    # print the results
    print('The domain and TLD of',  email,'are:', domain + ' and ' + tld)

In [None]:
# Olekseii's solution

# email = 'patrick.slattery16@citytech.cuny.edu'
email = 'fname.lname99@mail.citytech.cuny.edu'

email_match = re.search(r'^[a-zA-Z0-9._]+@[a-zA-Z.]+$', email)

match_list = re.split(r"@",email_match.group())

dt_list = match_list[-1].split('.')

domain_tld = (dt_list[-2], dt_list[-1])

print('The domain and TLD of',  email,'are:', domain_tld)

#### <u><b><font color=blue>Regex Email Exercise 2</font></b></u>

Suppose you want to find the email address inside the string `'xyz my-email@citytech.edu data science'`.

Here's an attempt using the pattern `r'\w+@\w+'`:

##### ***SOLUTION - don't look here***

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'\w+@\w+', emailText)
if match:
  print (match.group())

The search does not get the whole email address in this case because the `\w` does not match the `'-'` or `'.'` in the address. We'll fix this using the regular expression features below.

**Square Brackets**

Square brackets can be used to indicate a set of chars, so `[abc]` matches `'a'` or `'b'` or `'c'`. The codes `\w`, `\s` etc. work inside square brackets too with the one exception that dot (`.`) just means a literal dot. For the emails problem, the square brackets are an easy way to add `'.'` and `'-'` to the set of chars which can appear around the `@` with the pattern `r'[\w.-]+@[\w.-]+'` to get the whole email address:

In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'[\w.-]+@[\w.-]+', emailText)
if match:
  print (match.group())

We will use the `group()` method to extract the account name and the domain from the email address:


In [None]:
emailText = 'xyz my-email@citytech.edu data science'
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

### More Regex `.search()` and `.match()` Examples

In [None]:
inputStr = 'an example word: dog!!'
match_object3 = re.search(r'word: \w\w\w', inputStr)
# If-statement after search() tests if it succeeded
if match_object3:
  # print ('found', match_object3.group()) ## 'found word: dog'
  print ('found', match_object3.group(0)) ## 'found word: dog'
else:
  print ('did not find')
print ("Done with the example")

In [None]:
line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line) # two capture groups inside one pattern

if matchObj:
   # print ("matchObj.group() : ", matchObj.group(0))
   print ("matchObj.group(0) : ", matchObj.group(0))
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")

In [None]:
line = "My cats are smarter than your dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line) # two capture groups inside one pattern

if matchObj:
   # print ("matchObj.group() : ", matchObj.group(0))
   print ("matchObj.group(0) : ", matchObj.group(0))
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")



---



---
#### <b>`re.match()`</b>

    
**Purpose**:  
Checks only the beginning of the string for a match.  

Returns a match object or None.  

**Example**:  

In [None]:
re.match(r'\d+', '123abc')  # Matches '123'

In [None]:
re.match(r'\d+', 'abc123')  # Returns None

**Key Points**:  
Only useful if the pattern must start at the beginning of the string.  

Often confused with `re.search()` —but `re.search()` scans the whole string.  



---



## 6. Using an Online Regex Tester

Before we go further, have a look at one of the many regex testers available online.  The site `regexr.com` offers an easy way to test a regex pattern against a set of test text.  At the bottom of the web page, an interactive frame lets you see which parts of the regex pattern are impacting the result.

<b>[Click for RegExr](https://regexr.com/)</b>



---



## 7. Multiple Matches with Regex `re.findall()`

### `re.findall()`  
**Purpose**:   
Finds all non-overlapping matches of the pattern in the string.  

Returns a list of strings (or tuples if groups are used).  

**Example**:  

In [None]:
re.findall(r'\d+', 'abc123xyz456')  # Returns ['123', '456']


**Key Points**:  
Use when you want a list of all matches.  

Does not return **match objects**—just the matched substrings.  

Efficient for extracting multiple items like numbers, words, etc.  

In [None]:
emailText = '''
your2_email@citytech.edu,
'''

In [None]:
emailText = '''
your2_email -at- citytech.edu,
'''

In [None]:
emailText = '''
xyz my-email@citytech.edu data science or
your2_email@citytech.edu,
also we can mention their.email@citytech.edu'''

In [None]:
# Using `.findall()` with regex groups
matches = re.findall(r'([\w\.-]+)@([\w\.-]+)', emailText)
print('The variable `matches`:')
print (matches)
print()
print('Iterating through each element of `matches`:')
for match in matches:
  print ('Group 1:', match[0], end='\t')  ## username
  print ('Group 2:', match[1])  ## host

In [None]:
type(matches)

In [None]:
print(len(matches))

In [None]:
# Using `.findall()` without regex groups
matches = re.findall(r'[\w\.-]+@[\w\.-]+', emailText)
print (matches)
print()
for match in matches:
  print (match)

In [None]:
# Comparing use of `.search()` with using `.findall()` without regex groups
matches_one = re.search(r'[\w\.-]+@[\w\.-]+', emailText)
print ('The search match result: ', matches_one)
print()
# the result is not an iterable
# for match in matches_one:
#   print (match)

matches_all = re.findall(r'[\w\.-]+@[\w\.-]+', emailText)
print ('The findall match result: ', matches_all)
print()
# the result is an iterable
for match in matches_all:
  print (match)



---



## 8. Regex Match Objects

#### We can store the result of `re.search(pat, str)` in a variable.

In Python a regular expression search is typically written as:

`match = re.search(pat, str)`

The code `match = re.search(pat, str)` stores the search result in a variable named "match".

The `re.search()` function takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, `search()` returns a match object; otherwise it returns `None`.

The `if`-statement then tests the result. A match object counts as true, so the search succeeded and `match.group()` is the matching text (e.g. 'word: cat'). `None` counts as false, so the search did not succeed and there is no matching text.

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions. It is recommended that you always write pattern strings with the 'r' just as a habit.
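
As a small illustration of why the `r` prefix matters, the sketch below compares `'\b'` with `r'\b'`. The sample string is made up for this example.

```python
import re

# '\b' in a normal string is one character (backspace);
# r'\b' is two characters, which the regex engine reads as a word boundary.
print(len('\b'), len(r'\b'))   # 1 2

sample = 'cat concat cat'
raw_hits = re.findall(r'\bcat\b', sample)    # word-boundary pattern
plain_hits = re.findall('\bcat\b', sample)   # looks for backspace characters
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```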



In [None]:
inputStr = 'An example word: cat!!'
search_result = re.search(r'word: \w\w\w', inputStr) ## both 'cat' and 'dog' are 3-letter words
# search_result
print(search_result)

In [None]:
inputStr = 'An example word: ice!!'
search_object = re.search(r'word: \w\w\w', inputStr) ## 'ice' is also a 3-letter word, so it matches too
# search_object
print(search_object)

### 🧠 What Is a Match Object?

When you use functions like `re.match()` or `re.search()` in Python's `re` module, they return a Match object, but only if the pattern matches; otherwise they return `None`. `re.finditer()` yields one Match object for each match it finds.  

A Match object contains detailed information about the match:  

- What part of the string matched  

- Where the match starts and ends  

- Any groups (sub-patterns in parentheses)  
  
**Example**:  

In [None]:
text = "Order number: 12345"
match = re.search(r'\d+', text)  # Looks for one or more digits

if match:
    print("Match:", match.group())       # '12345'
    print("Start index:", match.start()) # 14
    print("End index:", match.end())     # 19


### 🧰 Useful `Match` Object Methods

| Method        | Description                                           | Example Output       |
|---------------|-------------------------------------------------------|-----------------------|
| `.group()`    | Returns the full matched text                         | `'12345'`             |
| `.start()`    | Returns the starting index of the match in the string | `14`                  |
| `.end()`      | Returns the ending index (one past the last character)| `19`                  |
| `.span()`     | Returns a tuple with the start and end positions      | `(14, 19)`            |
| `.group(n)`   | Returns the text matched by group `n` (using `()` in the pattern) | (see the example with groups) |
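
A short sketch of how these methods relate to each other, reusing the order-number string from the example above: the span can be used to slice the original text and recover the same substring that `.group()` returns.

```python
import re

text = "Order number: 12345"
match = re.search(r'\d+', text)

if match:
    start, end = match.span()       # same values as match.start() and match.end()
    sliced = text[start:end]        # slicing with the span recovers the match
    print(start, end)               # 14 19
    print(sliced)                   # 12345
    print(sliced == match.group())  # True
```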


🧩 **Example with Groups**:


In [None]:
text = "Name: Alice, Age: 30"
match = re.search(r'Name: (\w+), Age: (\d+)', text)

if match:
    print("Name:", match.group(1))  # 'Alice'
    print("Age:", match.group(2))   # '30'


In the case above:  

-  `group(0)` is the entire match: 'Name: Alice, Age: 30'

- `group(1)` is the first group: 'Alice'

- `group(2)` is the second group: '30'



✅ **Key Takeaways**:  
- A Match object holds the details of a successful match.  

- Always check `if match:` before using its methods to avoid errors.  

- Use `.group()`, `.start()`, `.end()`, and `.span()` to extract match info.  

- Use parentheses in your regex pattern to define groups for finer control.  

#### `.match()` vs `.search()`

This is the trickiest part. A good explanation is provided in this tutorial: https://www.8bitavenue.com/difference-between-re-search-and-re-match-in-python/ which is paraphrased here.

Additional information can also be found at ▶

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/



When searching or matching, regular expression operations can take optional flags or modifiers. The following two modifiers are the most used ones…

* `re.M` (multiline)
* `re.I` (ignore case)
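
The following sketch shows what each flag changes. The three-line sample text is invented for illustration.

```python
import re

text = "first line\nSecond line\nthird LINE"

# Without flags, ^ anchors only at the very start of the string.
first_only = re.findall(r'^\w+', text)

# re.M lets ^ (and $) match at the start (and end) of every line.
each_line = re.findall(r'^\w+', text, flags=re.M)

# re.I ignores case; flags are combined with the | operator.
any_case = re.findall(r'line$', text, flags=re.M | re.I)

print(first_only)  # ['first']
print(each_line)   # ['first', 'Second', 'third']
print(any_case)    # ['line', 'line', 'LINE']
```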

#### .match()

`re.match` matches an expression at the beginning of a string. If a match is found, a match object is returned; otherwise `None` is returned. If the input is a multiline string (for example, a triple-quoted string that spans several lines), that does not change the behavior of the match operation: `re.match` always tries to match at the beginning of the string. In regular expression syntax, the caret character (`^`) is used to match the beginning of a string. If this character is used at the start of a pattern passed to `re.match`, it has no additional effect. The syntax for the `re.match` operation is as follows…



`mobj = re.match(pat, str, flags)`

* pat: regular expression pattern to match
* str: string in which to search for the pattern
* flags: one or more modifiers, for example re.M|re.I

#### `.match(pattern, string, flags=0)`

Looks for a regex match at the beginning of a string.

If zero or more characters at the beginning of *string* match the regular expression *pattern*, return a corresponding match object. Return `None` if the string does not match the pattern; note that this is different from a zero-length match.

This is identical to `re.search()`, except that `re.search()` returns a match if the regular expression *pattern* matches anywhere in *string*, whereas `re.match()` returns a match only if the regular expression *pattern* matches at the beginning of *string*.
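
The difference between a zero-length match and no match at all is easy to miss, so here is a minimal sketch of both cases.

```python
import re

# \d* allows zero digits, so it matches the empty string at position 0.
empty = re.match(r'\d*', 'abc')
print(repr(empty.group()))   # ''
print(empty.span())          # (0, 0)

# \d+ needs at least one digit at position 0, so there is no match object.
nothing = re.match(r'\d+', 'abc')
print(nothing)               # None
```

Note that `empty` is still a match object and therefore counts as true in an `if` test, even though the text it matched is empty.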

### .search()

`re.search` attempts to find the first occurrence of the pattern anywhere in the input string as opposed to the beginning. If the search is successful, `re.search` returns a match object, otherwise it returns `None`. The syntax for re.search operation is as follows…

`mobj = re.search(pat, str, flags)`

* pat: regular expression pattern to search for
* str: string in which to search for the pattern
* flags: one or more modifiers, for example re.M|re.I

#### `.search(pattern, string, flags=0)`

Scans a string for a regex match.

Scan through *string* looking for the **first** location where the regular expression *pattern* produces a match, and return a corresponding match object. Return `None` if no position in the string matches the pattern.


You can use this function in a Boolean context, such as a conditional statement that checks whether a pattern is found in the input string (a match object, which counts as `True`) or not (`None`, which counts as `False`).
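
As a sketch of that idea, the snippet below filters a list of strings by using `re.search()` directly as the condition. The sample lines and the phone-like pattern are made up for this example.

```python
import re

lines = ['call 555-1234 today', 'no number here', 'fax: 555-9876']

# A match object is truthy and None is falsy, so the result of
# re.search() can be used directly as a condition.
with_numbers = [line for line in lines if re.search(r'\d{3}-\d{4}', line)]
print(with_numbers)   # ['call 555-1234 today', 'fax: 555-9876']
```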



---



## 9. Looping with Regex `re.finditer()`

### Iteration Using Regular Expressions

If we want to extract data from a string in Python we can use the `findall()` or `finditer()` methods to extract all of the substrings which match a regular expression.

These two methods produce results of different types.

Let’s use the example of wanting to extract anything that looks like an email address from any line regardless of format. For example, we want to pull every email address out of a block of text like the `emailText` samples used in this notebook.

### `re.finditer()`  
**Purpose**:  
Like `findall()`, but returns an iterator of match objects.  

Each match object provides more detail (start/end position, groups, etc.)

**Example**:


In [None]:
for match in re.finditer(r'\d+', 'abc123xyz456'):
    print(match.group(), match.start(), match.end())


**Key Points**:  
More memory-efficient than findall() for large texts.  

Allows access to match **metadata**, which `findall()` does not provide.  

In [None]:
emailText = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
match = re.search(r'([\w.-]+)@([\w.-]+)', emailText)
if match:
  print ("email address:\t", match.group())
  print ("email account:\t", match.group(1))
  print ("email domain:\t", match.group(2))

#### Using `.finditer` for Regular Expression Matches

In [None]:
emailText = '''xyz my-email@citytech.edu data science or
your2_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
matches = re.finditer(r'[\w.-]+@[\w.-]+', emailText)
for match in matches:
    print(match.group())

In [None]:
matches = re.finditer(r'([\w.-]+)@([\w.-]+)', emailText)
for match in matches:
    print(match.group(),"\t", match.group(1),"\t", match.group(2))

In [None]:
print (matches)

#### `.finditer()` vs. `.findall()`

* `.findall()` returns a **list** of all matches of a regex in a string.
* `.finditer()` returns an iterator that yields regex matches.

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/

https://www.8bitavenue.com/difference-between-re-search-and-re-match-in-python/

`re.findall(<regex>, <string>)` returns a list of all non-overlapping matches of `<regex>` in `<string>`. It scans the search string from left to right and returns all matches in the order found:

In [None]:
re.findall(r'#(\w+)#', '#foo#.#bar#.#baz#')

In this case, the specified regex is `#(\w+)#`. The matching strings are `'#foo#'`, `'#bar#'`, and `'#baz#'`. But the hash (`#`) characters don’t appear in the return list because they’re outside the grouping parentheses.

If `<regex>` contains more than one capturing group, then `re.findall()` returns a list of tuples containing the captured groups. The length of each tuple is equal to the number of groups specified:

In [None]:
re.findall(r'(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')

In [None]:
re.findall(r'(\w+),(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')

`re.finditer(<regex>, <string>)` scans `<string>` for non-overlapping matches of `<regex>` and returns an iterator that yields the match objects from any it finds. It scans the search string from left to right and returns matches in the order it finds them:

In [None]:
it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')
it

In [None]:
for i in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
  print(i.group())

`re.findall()` and `re.finditer()` are very similar, but they differ in two respects:

1. `re.findall()` returns a **list**, whereas `re.finditer()` returns an **iterator**.

2. The items in the list that `re.findall()` returns are the actual matching strings, whereas the items yielded by the iterator that `re.finditer()` returns are match objects.

Any task that you could accomplish with one, you could probably also manage with the other. Which one you choose will depend on the circumstances. However, a lot of useful information can be obtained from a match object. If you need that information, then `re.finditer()` will probably be the better choice. For example, a match object's `.group()` method returns the matched text, and its `.start()` and `.end()` methods return where that text was found.

#### Match objects

If a match is found when using `re.finditer()`, `re.match` or `re.search`, we can use some useful methods provided by the match object. Here is a short list of such methods, you may check the reference section for more details…

* `group()` returns the part of the string matched by the entire regular expression
* `group(1)` returns the text matched by the first capturing group
* `start()` and `end()` return the indices of the start and end of the substring matched by the whole pattern (or by a capturing group, if a group number is passed).

---

#### 🧰 Summary of Differences: Python `re` Functions


| Function        | Searches Whole String? | Returns What?              | Multiple Matches? | Match Object? |
|-----------------|------------------------|-----------------------------|-------------------|---------------|
| `re.search()`   | Yes                    | First match                 | No                | Yes           |
| `re.findall()`  | Yes                    | List of matches (strings)   | Yes               | No            |
| `re.finditer()` | Yes                    | Iterator of match objects   | Yes               | Yes           |
| `re.match()`    | No (only at start)     | First match (if at start)   | No                | Yes           |
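
The rows of the table can be checked side by side with a short sketch that applies all four functions to one string and one pattern.

```python
import re

text = 'abc123xyz456'
pattern = r'\d+'

at_start = re.match(pattern, text)                        # no digits at position 0
first = re.search(pattern, text)                          # first match only
all_strings = re.findall(pattern, text)                   # plain strings
all_spans = [m.span() for m in re.finditer(pattern, text)]  # match objects

print(at_start)        # None
print(first.group())   # 123
print(all_strings)     # ['123', '456']
print(all_spans)       # [(3, 6), (9, 12)]
```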




---



## 10. Substitution with Regex `re.sub()`

### Substitution with `.sub()`

The `re.sub(pat, replacement, str)` function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include `'\1'`, `'\2'` which refer to the text from `group(1)`, `group(2)`, and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user `(\1)` but have `nyc.gov` as the host.

In [None]:
text = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print (re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@nyc.gov', text))
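
A replacement string can use more than one group reference. The sketch below uses both `\1` and `\2` to reorder the captured text; the names are sample data invented for illustration.

```python
import re

names = 'Lovelace, Ada; Hopper, Grace'

# Group 1 is the surname and group 2 is the given name;
# the replacement writes them back in the opposite order.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', names)
print(swapped)   # Ada Lovelace; Grace Hopper
```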


#### Regular expression substitution    

Regular expression substitution works on an `input-string` to produce a result with all (or a specified-number) of instances matching a `regex-pattern` replaced with a `substitution-string`.    

More at [PythonExamples.org](https://pythonexamples.org/python-re-sub/)    
    
There are also some good examples and practice exercises using `re.sub()` at [Python Regex Replace Pattern in a string using `re.sub()`](https://pynative.com/python-regex-replace-re-sub/) on **PyNative.com**.      


##### `re.sub(pattern, repl, string, count=0, flags=0)`

Replace matches of the regular expression `pattern` in `string` with the replacement `repl`, at most `count` times (the default, `count=0`, replaces every match), using any `flags` given.


#### ***Example 1: `re.sub()` - Replace Pattern Matches with a Replacement String***    
    
In this example, we take a string and replace every run of consecutive digits with the string `NN`, using the `re.sub()` function.    

In [None]:
pattern = '[0-9]+'
string = 'Account Number - 12345, Amount - 586.32'
repl = r'NN'

print('Original string')
print(string)

result = re.sub(pattern, repl, string)

print('After replacement')
print(result)

#### ***Example 2: `re.sub()` - Limit the Maximum Number of Replacements***

We can limit the maximum number of replacements made by `re.sub()` by specifying the optional `count` argument.    

In this example, we will take the same `pattern`, `string`, and `replacement` as in the previous example. But, we limit the maximum number of replacements to `2`.    

In [None]:
pattern = '[0-9]+'
string = 'Account Number - 12345, Amount - 586.32'
repl = r'NN'

print('Original string')
print(string)

result = re.sub(pattern, repl, string, count=2)

print('After replacement')
print(result)

#### ***Example 3: `re.sub()` - Optional Flags***    
    
In this example, we pass the optional `flags` argument `re.IGNORECASE` to `re.sub()`. This flag tells `re.sub()` to ignore case while matching the pattern in the string.    

In [None]:
pattern = '[a-z]+'
string = 'Account Number - 12345, Amount - 586.32'
repl = r'AA'

print('Original string')
print(string)

result = re.sub(pattern, repl, string, flags=re.IGNORECASE)

print('After replacement')
print(result)



---



Another explanation of the basics of regular expressions, with examples of the different functions and descriptions of pattern matching, can be found in the Geeks for Geeks post on Regular Expressions for Python here ▶ https://www.geeksforgeeks.org/regular-expression-python-examples-set-1/



---



## 11. Character Classes and Ranges

**1. Predefined Character Classes**  
The `re` module includes shorthand notations for common character categories:  
- `\d` matches any digit (`0–9`)  
- `\w` matches any word character (letters, digits, and underscore)  
- `\s` matches any whitespace character (spaces, tabs, and newlines)
- `\D` matches any non-digit (anything other than `0–9`)  
- `\W` matches any non-word character (anything other than letters, digits, and underscore)  
- `\S` matches any non-whitespace character (anything other than spaces, tabs, and newlines)
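
The shorthand classes above can be tried out on a single string, as in this sketch. The sample string is invented for illustration.

```python
import re

sample = 'Room 42, floor_3\tEast'   # \t here is a real tab character

digits = re.findall(r'\d', sample)
spaces = re.findall(r'\s', sample)
words = re.findall(r'\w+', sample)
non_words = re.findall(r'\W', sample)
digits_joined = re.sub(r'\D', '', sample)   # delete every non-digit

print(digits)         # ['4', '2', '3']
print(spaces)         # [' ', ' ', '\t']
print(words)          # ['Room', '42', 'floor_3', 'East']
print(non_words)      # [' ', ',', ' ', '\t']
print(digits_joined)  # 423
```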

**2. Custom Character Sets**  
Character sets allow the definition of specific groups of characters to match.  
- `[aeiou]` matches any lowercase vowel  
- `[A-Z]` matches any uppercase letter  
- `[^0-9]` matches any character *except* a digit

**3. Character Ranges**  
Within brackets, a hyphen (`-`) can define a range of characters. For example, `[a-z]` matches any lowercase letter.

**4. Special Characters Inside Sets**  
Certain characters have special behavior in sets. For example, `-` is used for ranges, and `^` negates the set when placed first.
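
Here is a small sketch of custom sets, ranges, negation, and literal `^` and `-` inside a set. The sample string is made up, and the last pattern shows one way (an escaped `^` and a trailing `-`) to match those two characters literally.

```python
import re

sample = 'Regex-101: a^b = 7'

vowels = re.findall(r'[aeiou]', sample)      # custom set
capitals = re.findall(r'[A-Z]', sample)      # range
digits_only = re.sub(r'[^0-9]', '', sample)  # negated set: drop non-digits
symbols = re.findall(r'[\^-]', sample)       # literal ^ and literal -

print(vowels)       # ['e', 'e', 'a']
print(capitals)     # ['R']
print(digits_only)  # 1017
print(symbols)      # ['-', '^']
```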




---



## 12. Quantifiers and Greediness

**1. Basic Quantifiers**  
Quantifiers define how many times a pattern element should occur.  
- `*` matches zero or more repetitions  
- `+` matches one or more repetitions  
- `?` matches zero or one occurrence  
- `{n}` matches exactly *n* repetitions  
- `{n,}` matches *n* or more repetitions  
- `{n,m}` matches between *n* and *m* repetitions
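
To compare the quantifiers listed above, this sketch applies each one to the letter `b` in the same sample text. The sample and the use of `\b` (word boundary) to force whole-word matches are choices made for this illustration.

```python
import re

sample = 'a ab abb abbb abbbb'

# \b marks a word boundary, so each pattern must cover a whole word.
patterns = [r'\bab*\b', r'\bab+\b', r'\bab?\b',
            r'\bab{2}\b', r'\bab{2,}\b', r'\bab{1,3}\b']

results = {pattern: re.findall(pattern, sample) for pattern in patterns}
for pattern, found in results.items():
    print(f'{pattern:12} {found}')
```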

**2. Greedy Matching**  
By default, quantifiers are greedy, meaning they match as much text as possible.  
Example: In the string `"abc123xyz"`, the pattern `".*"` will match the entire string.

**3. Non-Greedy (Lazy) Matching**  
Appending a `?` to a quantifier makes it non-greedy, meaning it matches the smallest possible portion.  
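Example: Against `"abc123xyz"`, the pattern `".*?"` on its own matches the empty string, because a lazy quantifier takes as few characters as it can; it only extends when the rest of the pattern requires more.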

**4. Use Cases for Greediness Control**  
Greedy and non-greedy quantifiers are particularly important when parsing nested structures or matching content between delimiters.
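
A common delimiter case is text inside quotation marks. The sketch below, using an invented sentence, shows how the greedy and lazy forms differ.

```python
import re

sample = 'She said "hello" and then "goodbye".'

# Greedy: .* runs to the last quotation mark in the string.
greedy = re.findall(r'"(.*)"', sample)

# Lazy: .*? stops at the first closing quotation mark it can.
lazy = re.findall(r'"(.*?)"', sample)

print(greedy)  # ['hello" and then "goodbye']
print(lazy)    # ['hello', 'goodbye']
```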


### Greedy vs. Non-Greedy Regular Expressions

Suppose you have text with tags in it: `<b>foo</b> and <i>so on</i>`

Suppose you are trying to match each tag with the pattern `'(<.*>)'`. What does it match first?

In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*>)')
match = regex.search(text)
print (match.group())

print('The end')



The result is a little surprising, but the greedy aspect of the `.*` causes it to match the whole `<b>foo</b> and <i>so on</i>` as one big match. The problem is that the `.*` goes as far as it can, instead of stopping at the first `>` (that is, it is "greedy").

There is an extension to regular expressions where you add a `?` after the quantifier, such as `.*?` or `.+?`, changing it to be non-greedy. Now they stop as soon as they can. So the pattern `'(<.*?>)'` will get just `'<b>'` as the first match, and `'</b>'` as the second match, and so on, getting each `<..>` pair in turn. The usual style is to write `.*?` and then, immediately to its right, some concrete marker (`>` in this case) that forces the end of the `.*?` run.



In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')
match = regex.search(text)
print (match.group())

print('The end')

In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.*?>)')
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print('The end')

In [None]:
text = "<b>foo</b> and <i>so on</i>"
regex = re.compile(r'(<.+?>)')
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print('The end')



---



## 13. Grouping and Capture

**1. Grouping with Parentheses**  
Parentheses `()` are used to group parts of a regular expression. This allows quantifiers to be applied to an entire group rather than a single character.
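
The effect of grouping on a quantifier can be seen in this short sketch; the test strings are made up for illustration.

```python
import re

# With parentheses, + repeats the whole group 'ab'.
grouped = re.search(r'(ab)+', 'ababab!').group()

# Without parentheses, + applies only to the 'b'.
ungrouped = re.search(r'ab+', 'ababab!').group()
repeated_b = re.search(r'ab+', 'abbb!').group()

print(grouped)     # ababab
print(ungrouped)   # ab
print(repeated_b)  # abbb
```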

**2. Capturing Groups**  
Text matched by parentheses is stored and can be retrieved using methods such as `.group(n)` on a match object. Groups are indexed starting from 1; `group(0)` is the entire match.

**3. Named Capturing Groups**  
Groups can be assigned names using the syntax `(?P<name>...)`. These named groups can be accessed using `.group('name')`, which improves code readability.
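
As a sketch of named groups, the name-and-age pattern used elsewhere in this notebook can be rewritten with `(?P<name>...)`. The group names `name` and `age` are choices made for this example.

```python
import re

text = "Name: Alice, Age: 30"
match = re.search(r'Name: (?P<name>\w+), Age: (?P<age>\d+)', text)

if match:
    print(match.group('name'))  # Alice
    print(match.group('age'))   # 30
    print(match.group(1))       # Alice (named groups keep their numbers too)
    print(match.groupdict())    # {'name': 'Alice', 'age': '30'}
```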

**4. Backreferences**  
Backreferences allow a pattern to refer to a previously captured group within the same expression. This is useful for identifying repeated content.  
- Numeric backreference: `\1`, `\2`, etc. (written inside a raw string)  
- Named backreference: `(?P=name)`
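
One typical use of a backreference is spotting a word that is accidentally repeated. The sketch below uses an invented sentence and shows both the numeric and the named form.

```python
import re

sample = 'This is is a test of the the parser'

# \1 must match exactly the same text that group 1 captured.
doubled = re.findall(r'\b(\w+) \1\b', sample)

# The same idea with a named group and a named backreference.
first_pair = re.search(r'\b(?P<word>\w+) (?P=word)\b', sample).group()

print(doubled)     # ['is', 'the']
print(first_pair)  # is is
```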

**5. Nested Groups**  
Regular expressions support nesting of groups. Each opening parenthesis creates a new group scope, and care must be taken when interpreting group numbers or names in complex patterns.
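
Group numbers follow the order of the opening parentheses, reading left to right. This sketch, with a made-up number string, shows the numbering for one outer group containing two inner groups.

```python
import re

# Group 1 is the outer pair; groups 2 and 3 are nested inside it.
nested = re.match(r'((\d{3})-(\d{4}))', '555-1234')

print(nested.group(1))  # 555-1234
print(nested.group(2))  # 555
print(nested.group(3))  # 1234
print(nested.groups())  # ('555-1234', '555', '1234')
```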


The `re.match()` function returns a match object on success and `None` on failure. We use the `group(num)` method of a match object to get the matched text.

`group(num=0)`: returns the entire match (or the specific subgroup `num`)

In [None]:
inputStr = 'An example word: cat!!'
# If-statement after search() tests if it succeeded
match = re.search(r'word: \w\w\w', inputStr)
if match:
  print ('found', match.group()) ## 'found word: cat'
else:
  print ('did not find')
print ("Done with the example")



Grouping regular expression patterns with parentheses:   

In [None]:
inputStr = 'An example word: ice!!'
match_object = re.match(r'(\w+)', inputStr) ## \w+ stops at the first non-word character, so this matches 'An'
# match_object
print(match_object)

In [None]:
inputStr = 'A good example word: ice!!'
match_object = re.match(r'(\w+)', inputStr) ## the first word is now 'A', so this matches 'A'
# match_object
print(match_object)

In [None]:
inputStr = 'A good example word: ice!!'
match_object = re.match(r'\w+', inputStr) ## same match ('A') without parentheses, so no group 1 is defined
# match_object
print(match_object)

In [None]:
inputStr = 'Another good example word: ice!!'
match_object = re.match(r'\w+', inputStr) ## matches the whole first word, 'Another'
# match_object
print(match_object)

#### <b>Geeks-for-Geeks</b> Regex Example

See **Geeks for Geeks** on the use of arguments in the `.group()` method with `re.search()`:    
- https://www.geeksforgeeks.org/re-matchobject-group-function-in-python-regex/

*Here is the example used by <b>Geeks for Geeks</b>*

In [None]:
"""We create a match object (re.Match) and store it
   in the match_object variable;
   the '()' parentheses are used to define a
   specific group"""

inputStr = 'username@geekforgeeks.org'

match_object = re.match(r'(\w+)@(\w+)\.(\w+)', inputStr)

# \w in the pattern above matches a word character
#    (a letter, a digit, or an underscore)
#    + matches one or more consecutive characters
#    satisfying that condition,
#    so \w+ matches a run of consecutive word characters

In [None]:
print(match_object)

#### <u><b><font color=blue>Regex Email Exercise 3</font></b></u>
    

Try that with your school email address and assign the result to `match_object2` in the following cell:    


##### ***SOLUTION -- don't look here***

In [None]:

inputStr = 'patrick.slattery16@citytech.cuny.edu'

match_object2 = re.match(r'(\w+)@(\w+)\.(\w+)', inputStr)

# \w matches a word character (a letter, a digit, or an underscore)
#    and + matches one or more of them in a row.
#    Because \w does not match '.', the first group cannot reach
#    the '@' in this address, so re.match() returns None here.

In [None]:
print(match_object2)

Let's look at what `.match()` returned...

In [None]:
# for entire match
# print('The full pattern match is', match_object.group()) # one way
print('The full pattern match is', match_object.group(0))  # another way
# also print(match_object.group(0)) can be used

# for the first parenthesized subgroup
print('The first pattern group match is', match_object.group(1))

# for the second parenthesized subgroup
print('The second pattern group match is', match_object.group(2))

# for the third parenthesized subgroup
print('The third pattern group match is', match_object.group(3))

# for a tuple of all matched subgroups
print('A tuple of all pattern group matches is', match_object.group(1, 2, 3))

In [None]:
# for a tuple of the full match and all matched subgroups
print('A tuple of the match and all pattern group matches is', match_object.group(0, 1, 2, 3))

In [None]:
# What if we ask for the fourth parenthesized subgroup?
# The pattern defines only three groups, so this raises IndexError: no such group
print('The fourth pattern group match is', match_object.group(4))



---



## 14. Compiling Regex Patterns with `re.compile()`  


`re.compile()` method.

#### Regular expression compilation

Regular expression compilation produces a Python object that can be used to do all sorts of regular expression operations. What is the benefit of that, given that we can use `re.match` and `re.search` directly? This technique is convenient when we want to use a regular expression more than once: it can make our code more efficient and more readable.

##### `re.compile(pattern, flags=0)`

Compile a regular expression pattern into a regular expression object, which can be used for matching using its `match()`, `search()` and other methods.


This sequence

In [None]:
inStr = 'This is my 123 example string'
prog = re.compile(r'\d+')
result = prog.search(inStr)
result

is equivalent to

In [None]:
result = re.search(r'\d+', inStr)
result

By using `re.compile()` and saving the resulting regular expression object for reuse you can make your code more efficient when the expression will be used several times in a single program.
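
A minimal sketch of that reuse: the pattern is compiled once and the same object is applied to several strings. The sample lines and variable names are invented for this example.

```python
import re

# Compile once, then reuse the same pattern object on many strings.
number_pattern = re.compile(r'\d+')

lines = ['order 12 shipped', 'no digits here', 'items: 3 and 45']
found_per_line = [number_pattern.findall(line) for line in lines]

for line, found in zip(lines, found_per_line):
    print(line, '->', found)
```

Giving the compiled pattern a descriptive name such as `number_pattern` also documents what the expression is meant to find.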

In [None]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
emailText = '''xyz my-email@citytech.edu data science or
your_email@citytech.edu, also we can mention their.email@citytech.edu'''

In [None]:
# We are looking for email addresses
regex = re.compile(r'[\w\.-]+@[\w\.-]+')
matches = regex.finditer(emailText)
for match in matches:
    print(match.group())

In [None]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1400 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

In [None]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text = "CUNY, Citytech, Information and Data Management, my-email@citytech.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

print ("The end")



---



## 15. Homework Assignment 4.1  

Imagine we want to conceal all phone numbers in a document.
Read the below multiline docstring and substitute all phone numbers with

'XXX-XXX-XXXX'

In [None]:
docstring_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

### Solution for Assignment 4.1

In [None]:
## your solution here

*solution provided after assignments are submitted*

---

## 16. Additional Practice -- NOAA Tides

Read text data to find the times of high tides for you to fish at Battery Park in NYC.   

*This notebook uses text captured from a snapshot of the NOAA data on tides at Battery, New York as of 01-Nov-2022.*


[A current snapshot of the NOAA tide data for Battery, NYC](https://tidesandcurrents.noaa.gov/map/index.html?id=8518750)  


**References**  

Review material from [Class 17 on Regex](https://bit.ly/cst2312cl12)      

Use the [Regex Cheat Sheet](https://bit.ly/cst2312regex)        

Practice use cases from [Class 15 Lab Exercises](https://bit.ly/cst2312cl15)    

**Resources**

The following provide interactive tools for developing and testing regular expressions:    

 - https://regexr.com/     

 - https://regexlearn.com/playground    

 - https://regex101.com/    



---

### Housekeeping   
    
**(for Additional Regex Practice)**        

#### Import libraries

In [None]:
import re

#### Access text file(s)   

In [None]:
!curl "https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/680aa4214a98cff5b62ccf432d3a6029809b322c/battery-nyc-tides.txt" -o tide-data.txt

### Read the file into a string variable  

In [None]:
with open("tide-data.txt", "r") as file:
    text_str = file.read()  # read() already returns a string

print(text_str)

---

### Search the string variable with `re.search()`

In [None]:
# 1 - grab a substring of text and search for it
re_searched = re.search(r'2:33 AM	high	4\.19 ft', text_str)

In [None]:
print(re_searched)

In [None]:
# 2 - use an even smaller substring of text and search for it
re_searched = re.search(r'2:33 AM', text_str)

In [None]:
print(re_searched)

In [None]:
# 3 - try a regex pattern for time and search for it
re_searched = re.search(r'(\d+:\d+ \wM)', text_str)

In [None]:
print(re_searched)

In [None]:
# 4 - try a regex pattern for time followed by a space then 'high'
re_searched = re.search(r'(\d+:\d+ \wM high)', text_str)

In [None]:
print(re_searched)

In [None]:
# 5 - what went wrong with the #4 search?
re_searched = re.search(r'(\d+:\d+ \wM.high)', text_str)

In [None]:
print(re_searched)

### Using an `|` (or) operator for high and low tides   

In [None]:
# 6 - searching for high or low tides
re_searched = re.search(r'(\d+:\d+ \wM.)(high|low)', text_str)

In [None]:
print(re_searched)

#### Adding the sea level to the pattern

In [None]:
# 7 - adding the sea height to the search pattern
re_searched = re.search(r'(\d+:\d+ \wM.)(high|low).+(\d+.\d+)', text_str)

In [None]:
print(re_searched)

#### Including the unit of measure for sea height in the search pattern   


In [None]:
# 8 - including the unit of measure for sea height to the search pattern
re_searched = re.search(r'(\d+:\d+ \wM.)(high|low).+(\d+.\d+).(\w+)', text_str)

In [None]:
print(re_searched)

---

### Search the string for matches with `re.findall` to find all high and low tides  

In [None]:
print(text_str)

In [None]:
matches = re.findall(r'(\d+:\d+ \wM).+(high|low).+(\d+.\d+).(\w+)', text_str)
print(matches)
print()
# iterating through the result
for match in matches:
  print(match)

In [None]:
print('Matches 0:', matches[0])
print('Matches 1:', matches[1])
print('Matches 2:', matches[2])
print('Matches 3:', matches[3])

In [None]:
# iterating through matches with an index counter
i=0
while i < len(matches):
    print(matches[i])
    i +=  1
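
The tuples returned by `re.findall()` can also be unpacked and filtered. The sketch below keeps only the high tides. It does not read the real file: `sample` is a hypothetical stand-in that assumes the tab-separated layout seen in search #1, and the tighter pattern (explicit tabs and an escaped decimal point) is an assumption about that layout rather than a tested fit for the NOAA data.

```python
import re

# Hypothetical sample: only the first line's values come from search #1;
# the other lines and the exact layout are assumptions.
sample = "2:33 AM\thigh\t4.19 ft\n8:41 AM\tlow\t0.52 ft\n2:50 PM\thigh\t4.87 ft"

tide_pattern = re.compile(r'(\d+:\d+ [AP]M)\t(high|low)\t(\d+\.\d+) (\w+)')

# Each findall() result is a (time, kind, height, unit) tuple.
high_tides = [(time, float(height))
              for time, kind, height, unit in tide_pattern.findall(sample)
              if kind == 'high']

print(high_tides)   # [('2:33 AM', 4.19), ('2:50 PM', 4.87)]
```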

---

### <b>Your turn:</b>     

**What other `regex` searches could you perform to extract data from the text in the file?**   



---



*notes:*



Some [practice examples](https://www.programiz.com/python-programming/regex) using different Regex patterns.

Additional reading on an 'almost perfect' regular expression approach to finding email addresses ▶ [Almost Perfect Email Regex](https://emailregex.com/)

Additional reading on the complexities of thorough regex rules for email addresses ▶ [O'Reilly Regular Expressions Cookbook 4.1, Validate Email Addresses](https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch04s01.html) by Jan Goyvaerts and Steven Levithan.



---

