**Table of contents**<a id='toc0_'></a>    
- [Introduction to Regular Expressions](#toc1_)    
  - [Key Functions in the `re` Module](#toc1_1_)    
  - [Raw Strings](#toc1_2_)    
- [Basic Pattern Matching](#toc2_)    
  - [Literal Characters](#toc2_1_)    
  - [Escape Characters](#toc2_2_)    
  - [Wildcard character](#toc2_3_)    
- [Examples](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Introduction to Regular Expressions](#toc0_)

In [13]:
import re

In [14]:
# Basic example of a regex match

text = "Hello, World!"
pattern = r"Hello"
result = re.search(pattern, text)

if result:
    print(f"Pattern found: {result.group()}")
else:
    print("Pattern not found")

Pattern found: Hello


## <a id='toc1_1_'></a>[Key Functions in the `re` Module](#toc0_)

Python's `re` module provides several functions for working with regex:

- `re.search()` - Finds the first match of a pattern within a string
- `re.match()` - Checks if a pattern matches at the beginning of a string
- `re.findall()` - Returns all non-overlapping matches as a list
- `re.finditer()` - Returns an iterator of match objects
- `re.sub()` - Replaces matches with a replacement string
- `re.split()` - Splits a string by pattern matches

In [15]:
text = "Python is amazing. Python is powerful."

# re.search() - Returns the first match
search_result = re.search(r"Python", text)
print(f"search result: {search_result.group()}")  # Outputs: Python

# re.match() - Matches only at the beginning of the string
match_result = re.match(r"Python", text)
print(f"match result: {match_result.group() if match_result else 'No match'}")  # Outputs: Python

# re.findall() - Returns all matches as a list
findall_result = re.findall(r"Python", text)
print(f"findall result: {findall_result}")  # Outputs: ['Python', 'Python']

# re.finditer() - Returns an iterator of match objects
finditer_result = re.finditer(r"Python", text)
for match in finditer_result:
    print(f"Found '{match.group()}' at position {match.start()}")

# re.sub() - Substitutes matches with replacement
sub_result = re.sub(r"Python", "Regex", text)
print(f"sub result: {sub_result}")  # Outputs: Regex is amazing. Regex is powerful.

# re.split() - Splits the string by pattern matches
split_result = re.split(r"\.", text)
print(f"split result: {split_result}")  # Outputs: ['Python is amazing', ' Python is powerful', '']

search result: Python
match result: Python
findall result: ['Python', 'Python']
Found 'Python' at position 0
Found 'Python' at position 19
sub result: Regex is amazing. Regex is powerful.
split result: ['Python is amazing', ' Python is powerful', '']
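
The example above cannot show the difference between `re.search()` and `re.match()`, because the text happens to start with the pattern. A minimal sketch with a made-up sentence (the sample text is an assumption chosen for illustration) makes the difference visible:

```python
import re

# Hypothetical sample text where the pattern is NOT at the start
text = "I like Python."

# re.search() scans the whole string and finds the first occurrence
search_hit = re.search(r"Python", text)
print(search_hit.span())  # (7, 13)

# re.match() only tries to match at position 0, where the text starts with "I"
match_hit = re.match(r"Python", text)
print(match_hit)  # None
```

Because `re.match()` can return `None`, guard calls to `.group()` with a check, as the cell above does for `match_result`.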


In [16]:
# Without raw string - need double backslash for literal backslash
pattern1 = "\\d+"  
print(pattern1)  # Outputs: \d+

# In a raw string both backslashes are kept, so this is a different pattern:
# a literal backslash followed by one or more 'd' characters
pattern2 = r"\\d+"
print(pattern2)  # Outputs: \\d+

# Example with a file path (why raw strings are useful)
windows_path = "C:\\Users\\Username\\Documents"  # Without raw string
windows_path_raw = r"C:\Users\Username\Documents"  # With raw string

print(windows_path)
print(windows_path_raw)  # Both output the same, but raw string is more readable

\d+
\\d+
C:\Users\Username\Documents
C:\Users\Username\Documents


In [17]:
# Example: Raw strings vs normal strings

# Without raw string - need double backslash for literal backslash
pattern1 = "\\d+"  
print(pattern1)  # Outputs: \d+

# With raw string (preferred) - more readable
pattern2 = r"\d+"
print(pattern2)  # Outputs: \d+

\d+
\d+


# <a id='toc2_'></a>[Basic Pattern Matching](#toc0_)

## <a id='toc2_1_'></a>[Literal Characters](#toc0_)

In [18]:
text = "The quick brown fox jumps over the lazy dog"

# Matching literal words
result = re.search(r"brown", text)
print(f"Found: {result.group()}")  # Outputs: Found: brown


Found: brown


## Case sensitivity

In [19]:

# Case-insensitive matching using flags
result = re.search(r"BROWN", text, re.IGNORECASE)
print(f"Found (case-insensitive): {result.group()}")  # Found (case-insensitive): brown

Found (case-insensitive): brown
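
For contrast, here is a small sketch of what happens without the flag, along with the inline `(?i)` form of the same option. The variable names are illustrative only:

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Matching is case-sensitive by default, so "BROWN" is not found
default_result = re.search(r"BROWN", text)
print(default_result)  # None

# The flag can be passed as an argument ...
flag_result = re.search(r"BROWN", text, re.IGNORECASE)

# ... or written inline at the start of the pattern
inline_result = re.search(r"(?i)BROWN", text)

print(flag_result.group(), inline_result.group())  # brown brown
```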


## <a id='toc1_2_'></a>[Raw Strings](#toc0_)

Define regex patterns as raw strings (prefix `r`) so that Python does not process backslashes as string escape sequences before the regex engine sees them.

### Why Use Raw Strings in Regex?

In a normal string literal, Python interprets escape sequences first. `"\b"`, for instance, is a single backspace character, whereas `r"\b"` reaches the regex engine as the word-boundary token.

Likewise, `\d` should be written as `"\\d"` in a normal string (recent Python versions warn about the invalid escape sequence `"\d"`), but in a raw string you can simply write `r"\d"`. This makes regex patterns more readable and less error-prone.
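
To see why the backslash handling matters, here is a minimal sketch using the word-boundary token `\b`. The sample text is an assumption chosen for illustration:

```python
import re

text = "a cat sat"

# In a normal string, "\b" is the backspace character (one character long)
normal_pattern = "\bcat\b"
# In a raw string, "\b" stays as backslash + "b" (two characters),
# which the regex engine reads as a word boundary
raw_pattern = r"\bcat\b"

print(len("\b"), len(r"\b"))            # 1 2
print(re.findall(normal_pattern, text))  # []
print(re.findall(raw_pattern, text))     # ['cat']
```

The normal-string version fails silently: it searches for backspace characters around `cat`, finds none, and returns an empty list.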

## <a id='toc2_2_'></a>[Escape Characters](#toc0_)

Special regex characters: `. ^ $ * + ? { } [ ] \ | ( )`

These need to be escaped with a backslash to match them literally.

In [20]:
text = "What is the cost? It's $10.99."

# Escaping special characters to match them literally
price_pattern = r"\$\d+\.\d+"
price = re.search(price_pattern, text)
print(f"Price found: {price.group()}")  # Outputs: Price found: $10.99

# Another example with parentheses
text_with_parens = "The result (42) is interesting."
parens_pattern = r"\((\d+)\)"
parens_match = re.search(parens_pattern, text_with_parens)
if parens_match:
    print(f"Full match: {parens_match.group(0)}")  # (42)
    print(f"Inside parens: {parens_match.group(1)}")  # 42

Price found: $10.99
Full match: (42)
Inside parens: 42


### Notes on Regex Special Characters

Regex uses special characters like `. ^ $ * + ? { } [ ] \ | ( )` to define patterns. 
If you want to match these characters literally, you need to escape them with a backslash (`\`).

For example:
- To match a literal dot (`.`), use `\.`.
- To match a dollar sign (`$`), use `\$`.


## <a id='toc2_3_'></a>[Wildcard character](#toc0_)

The dot (.) matches any character except a newline.


In [22]:
text = "The bat, cat, and rat sat on the mat."

# . matches any single character except newline, so "..at" needs two characters before "at"
pattern = r"..at"
matches = re.findall(pattern, text)
print(matches)  # Outputs: [' bat', ' cat', ' rat', ' sat', ' mat'] (each match includes the leading space)

# To match a literal dot, escape it
ip_pattern = r"\d+\.\d+\.\d+\.\d+"
ip_address = "Server IP is 192.168.0.1"
ip_match = re.search(ip_pattern, ip_address)
print(f"IP address: {ip_match.group() if ip_match else 'Not found'}")

# Using dot to match anything between two words
between_pattern = r"bat(.*)mat"
between_match = re.search(between_pattern, text)
print(f"Text between 'bat' and 'mat': '{between_match.group(1)}'")  # ', cat, and rat sat on the '

[' bat', ' cat', ' rat', ' sat', ' mat']
IP address: 192.168.0.1
Text between 'bat' and 'mat': ', cat, and rat sat on the '


### Wildcard Character in Regex

The dot (`.`) is a special character in regex that matches any character except a newline. 
For example, `r"..at"` will match any two characters followed by `at`.

If you want to match a literal dot, you need to escape it using `\.`.

In [None]:
# Example: Using the wildcard character
text = "The bat, cat, and rat sat on the mat mmat chaat cheeeat"

# . matches any character except newline, so r"..at" matches any two characters followed by 'at'.
# Every match is therefore exactly 4 characters long, even when the word itself is longer.
pattern = r"..at"
matches = re.findall(pattern, text)
print(matches)  # Outputs: [' bat', ' cat', ' rat', ' sat', ' mat', 'mmat', 'haat', 'eeat']

[' bat', ' cat', ' rat', ' sat', ' mat', 'mmat', 'haat', 'eeat']
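
The "except a newline" part of the rule is easy to overlook. The sketch below uses a made-up two-line string to show it, together with the `re.DOTALL` flag, which lets the dot match newlines as well:

```python
import re

# Hypothetical text with a newline between the two words
text = "bat\nmat"

# By default the dot does not match "\n", so this search fails
default_match = re.search(r"bat.mat", text)
print(default_match)  # None

# With re.DOTALL the dot matches any character, including "\n"
dotall_match = re.search(r"bat.mat", text, re.DOTALL)
print(repr(dotall_match.group()))  # 'bat\nmat'
```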


# <a id='toc3_'></a>[Examples](#toc0_)

### Practical Examples of Regex

Here are some practical examples of regex usage:

- Extracting email addresses from text
- Finding words that start with a specific letter
- Reformatting phone numbers

In [11]:
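# Example 1: Extract email addresses (simple pattern)
# Note: \w+\.\w+ covers only one dot in the domain, so 'support@company.co.uk'
# is captured as 'support@company.co'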
text = "Contact us at info@example.com or support@company.co.uk"
emails = re.findall(r"\w+@\w+\.\w+", text)
print(f"Simple email extraction: {emails}")

# Example 2: Find all words starting with 'p'
text = "Python is a powerful programming language"
p_words = re.findall(r"\bp\w+", text, re.IGNORECASE)
print(f"Words starting with 'p': {p_words}")

# Example 3: Replace phone number format
text = "Call me at 123-456-7890"
formatted = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2-\3", text)
print(f"Reformatted: {formatted}")  # Call me at (123) 456-7890

Simple email extraction: ['info@example.com', 'support@company.co']
Words starting with 'p': ['Python', 'powerful', 'programming']
Reformatted: Call me at (123) 456-7890
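
The simple email pattern cuts `support@company.co.uk` short because it allows only one dot in the domain. One possible refinement, shown here as a sketch rather than a complete email validator, repeats the "dot plus label" part of the domain:

```python
import re

text = "Contact us at info@example.com or support@company.co.uk"

# Local part: word characters plus ".", "+", "-"
# Domain: one label followed by one or more ".label" groups
# (?:...) is a non-capturing group, so findall returns whole matches
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(email_pattern, text)
print(emails)  # ['info@example.com', 'support@company.co.uk']
```

Real-world addresses allow more variations than this pattern covers, so treat it as an extraction aid only.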


In [28]:
# Example 1: Extracting area codes from phone numbers
phone_numbers = "Call 555-123-4567 or (123) 456-7890 for assistance"
# Anchor the capture to the rest of the number: r"\(?(\d{3})\)?" on its own
# would capture every run of three digits, not only the area codes
area_codes = re.findall(r"\(?(\d{3})\)?[-\s]\d{3}-\d{4}", phone_numbers)
print(f"Area codes: {area_codes}")  # Outputs: ['555', '123']

# Example 2: Validating username format (letters, numbers, underscores only)
def is_valid_username(username):
    pattern = r"^[a-zA-Z0-9_]+$"
    return bool(re.match(pattern, username))

usernames = ["john_doe", "jane.doe", "user123", "user-name"]
for username in usernames:
    print(f"'{username}' is valid: {is_valid_username(username)}")

# Example 3: Finding words with a specific pattern
# (words that start with 'p' and end with 'n')
text = "Python is a programming language. A pen is on the table."
p_words = re.findall(r"\b[pP][a-zA-Z]*n\b", text)
print(f"Words starting with 'p' and ending with 'n': {p_words}")  # ['Python', 'pen'] ('programming' ends with 'g')


Area codes: ['555', '123']
'john_doe' is valid: True
'jane.doe' is valid: False
'user123' is valid: True
'user-name' is valid: False
Words starting with 'p' and ending with 'n': ['Python', 'pen']
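
One subtlety of the username check: `$` also matches just before a trailing newline, so `re.match()` with `^...$` accepts `"user123\n"`. A stricter variant can use `re.fullmatch()`, which requires the whole string to match. The function name `is_valid_username_strict` is hypothetical and introduced only for this sketch:

```python
import re

def is_valid_username_strict(username):
    # fullmatch() succeeds only if the entire string matches the pattern,
    # so no ^ or $ anchors are needed
    return re.fullmatch(r"[a-zA-Z0-9_]+", username) is not None

# The anchored re.match() version accepts a trailing newline
loose_result = bool(re.match(r"^[a-zA-Z0-9_]+$", "user123\n"))
print(loose_result)  # True

# The fullmatch() version rejects it
strict_result = is_valid_username_strict("user123\n")
print(strict_result)  # False
print(is_valid_username_strict("john_doe"))  # True
```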
