# Module 2: Basic Regex Patterns

In this notebook, we'll explore the fundamentals of regular expressions in Python, including pattern matching, metacharacters, and basic regex functions.

In [1]:
import re

## 1. Matching Literal Characters

The simplest form of pattern matching is searching for exact text matches.

In [None]:
# Example of literal character matching
text = "Hello, Python!"
pattern = "Python"

# Using re.search() to find the pattern
match = re.search(pattern, text)
print(f"Pattern found: {match is not None}")
if match:
    print(f"Found at position: {match.start()}-{match.end()}")
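
Literal matching is case-sensitive by default. The short sketch below reuses the same sample string (the variable names are chosen here for illustration) to show that a lowercase pattern does not match, and how the standard `re.IGNORECASE` flag changes that.

```python
import re

text = "Hello, Python!"

# Literal patterns are case-sensitive: "python" does not match "Python".
case_sensitive = re.search("python", text)
print(f"Case-sensitive match found: {case_sensitive is not None}")

# re.IGNORECASE makes the same pattern match regardless of case.
case_insensitive = re.search("python", text, re.IGNORECASE)
if case_insensitive:
    # group() returns the text exactly as it appears in the string.
    print(f"Case-insensitive match: {case_insensitive.group()}")
```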

## 2. Understanding Metacharacters

Metacharacters are special characters in regex that have specific meanings:
- `.` - Matches any character except newline
- `^` - Matches start of string
- `$` - Matches end of string
- `*` - Matches 0 or more repetitions
- `+` - Matches 1 or more repetitions
- `?` - Matches 0 or 1 repetition
- `{}` - Specifies a repetition count or range (e.g. `{3}`, `{2,}`, `{1,3}`)
- `[]` - Defines a character set
- `\` - Escapes special characters
- `|` - Alternation (OR)
- `()` - Groups patterns

In [4]:
# Examples of metacharacters

# . (dot) - matches any character
print("Dot pattern:")
text = "cat, hat, rat, dog"
pattern = ".at"
matches = re.findall(pattern, text)
print(f"Words ending in 'at': {matches}")

# [] - character set
print("\nCharacter set:")
pattern = "[chr]at"
matches = re.findall(pattern, text)
print(f"Words starting with c, h, or r: {matches}")

# * - zero or more occurrences
print("\nZero or more occurrences:")
text = "ca cat caat caaat ct"
pattern = "ca*t"
matches = re.findall(pattern, text)
print(f"Matching 'ca*t': {matches}")

Dot pattern:
Words ending in 'at': ['cat', 'hat', 'rat']

Character set:
Words starting with c, h, or r: ['cat', 'hat', 'rat']

Zero or more occurrences:
Matching 'ca*t': ['cat', 'caat', 'caaat', 'ct']
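
The cell above covers `.`, `[]`, and `*`. As a rough sketch, the remaining metacharacters from the list can be tried the same way. The sample strings and variable names here are illustrative, not part of the original notebook.

```python
import re

animals = "cat, hat, rat, dog"
repeats = "ca cat caat caaat ct"

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_cat = re.findall(r"^cat", animals)
ends_with_dog = re.findall(r"dog$", animals)

# + needs at least one 'a', so 'ct' is excluded (unlike 'ca*t').
one_or_more = re.findall(r"ca+t", repeats)

# ? makes the preceding character optional.
optional_u = re.findall(r"colou?r", "color colour")

# {m,n} bounds the number of repetitions.
two_to_three = re.findall(r"ca{2,3}t", repeats)

# | matches either alternative.
cat_or_dog = re.findall(r"cat|dog", animals)

# () groups a sub-pattern so the quantifier applies to the whole group.
laugh = re.search(r"(ha)+", "hahaha!").group()

# \ escapes a metacharacter so it is matched literally.
literal_dots = re.findall(r"\.", "a.b.c")

print(f"^cat: {starts_with_cat}")
print(f"dog$: {ends_with_dog}")
print(f"ca+t: {one_or_more}")
print(f"colou?r: {optional_u}")
print(f"ca{{2,3}}t: {two_to_three}")
print(f"cat|dog: {cat_or_dog}")
print(f"(ha)+: {laugh}")
print(f"\\.: {literal_dots}")
```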


## 3. Using re.search() and re.match()

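- `re.search()`: Scans the whole string and returns the first place the pattern matches
- `re.match()`: Matches the pattern only at the beginning of the string

Both return a match object on success and `None` when there is no match.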

In [6]:
text = "Python is awesome!"

# re.search() example
search_result = re.search(r"awesome", text)
print(f"search() found 'awesome': {search_result is not None}")

# re.match() example
match_result1 = re.match(r"Python", text)
match_result2 = re.match(r"awesome", text)

print(f"match() found 'Python' at start: {match_result1 is not None}")
print(f"match() found 'awesome' at start: {match_result2 is not None}")

search() found 'awesome': True
match() found 'Python' at start: True
match() found 'awesome' at start: False
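
Because a failed search or match yields `None`, it is safest to check the result before using it. The sketch below, with illustrative variable names, reads the matched text and its position from a match object and shows that anchoring `re.search()` with `^` behaves like `re.match()` on this single-line string.

```python
import re

text = "Python is awesome!"

found = re.search(r"awesome", text)
if found is not None:
    matched_text = found.group()   # the substring that matched
    matched_span = found.span()    # (start, end) with end exclusive
    print(f"Matched {matched_text!r} at {matched_span}")

# re.match() returns None here, so calling .group() on it directly
# would raise AttributeError. Guard with a check instead.
at_start = re.match(r"awesome", text)
print(f"match() result: {at_start}")

# Anchoring search() with ^ gives the same outcome as match() here.
anchored = re.search(r"^awesome", text)
print(f"search() with ^ anchor: {anchored}")
```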


## Practice Problems

Try solving these exercises to practice what you've learned:

In [None]:
# Exercise 1: Match all email addresses in the text
text = "Contact us at: support@example.com or sales@company.com"
# Your code here - create a pattern to match email addresses

# Exercise 2: Find all words that start with 'p' or 'P'
text = "Python programming is powerful and practical"
# Your code here - create a pattern to match words starting with p/P

# Exercise 3: Match phone numbers in format XXX-XXX-XXXX
text = "Call us: 123-456-7890 or 987-654-3210"
# Your code here - create a pattern to match phone numbers

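### Solutions

The pattern below is one possible solution to Exercise 1. The breakdown that follows explains each part.
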
```python
email_pattern = r'[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}'
```

### Components of the Regular Expression

1. **`[\w\.-]+`**:
   - **`[...]`**: A character class that matches any one of the characters inside the brackets.
   - **`\w`**: Matches any word character: a letter, a digit, or an underscore. For ASCII text this is equivalent to `[a-zA-Z0-9_]`; on Python 3 strings it also matches Unicode letters and digits unless the `re.ASCII` flag is set.
   - **`\.`**: Matches a literal dot (`.`). Inside a character class the dot already loses its special meaning, so the backslash is optional here.
   - **`-`**: Matches a literal hyphen (`-`). Placing `-` at the end of a character class keeps it from being read as a range operator.
   - **`+`**: Quantifier that matches one or more occurrences of the preceding element.
   - **Combined**: This part matches the username of the email, allowing letters, digits, underscores, dots, and hyphens.

2. **`@`**:
   - Matches the literal `@` symbol, separating the username from the domain.

3. **`[\w\.-]+`**:
   - Similar to the first component, this matches the domain name part, allowing letters, digits, underscores, dots, and hyphens.

4. **`\.`**:
   - Matches a literal dot (`.`). The backslash escapes the dot to indicate that it should be interpreted as a literal character rather than its special meaning in regular expressions.

5. **`[a-zA-Z]{2,}`**:
   - **`[a-zA-Z]`**: Matches any uppercase or lowercase letter.
   - **`{2,}`**: Quantifier that matches two or more occurrences of the preceding element.
   - **Combined**: This part matches the top-level domain (TLD) of the email, ensuring it has at least two letters (e.g., `.com`, `.org`, `.net`).

### Summary

The regular expression `r'[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}'` is designed to match email addresses by:

- Allowing usernames with letters, digits, underscores, dots, and hyphens.
- Requiring an `@` symbol between the username and the domain.
- Matching domain names with similar allowed characters.
- Requiring a top-level domain of at least two letters.
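
One way to keep this breakdown next to the pattern itself is the standard `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The sketch below is an illustrative rewrite of the same pattern (the names `email_verbose` and `sample` are assumptions made here), not a stricter email validator.

```python
import re

# Same pattern as above, split into commented parts with re.VERBOSE.
email_verbose = re.compile(r"""
    [\w\.-]+        # username: word characters, dots, hyphens
    @               # literal @ separator
    [\w\.-]+        # domain name
    \.              # literal dot before the TLD
    [a-zA-Z]{2,}    # TLD: at least two letters
""", re.VERBOSE)

sample = "Contact us at: support@example.com or sales@company.com"
verbose_emails = email_verbose.findall(sample)
print(f"Found emails: {verbose_emails}")
```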



In [7]:
# Solution 1: Email addresses
text = "Contact us at: support@example.com or sales@company.com"
email_pattern = r'[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(f"Found emails: {emails}")

# Solution 2: Words starting with p/P
text = "Python programming is powerful and practical"
p_pattern = r'\b[pP]\w*'  # \b anchors at a word boundary; \w* also allows a one-letter word
p_words = re.findall(p_pattern, text)
print(f"\nWords starting with p/P: {p_words}")

# Solution 3: Phone numbers
text = "Call us: 123-456-7890 or 987-654-3210"
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone_numbers = re.findall(phone_pattern, text)
print(f"\nPhone numbers: {phone_numbers}")

Found emails: ['support@example.com', 'sales@company.com']

Words starting with p/P: ['Python', 'programming', 'powerful', 'practical']

Phone numbers: ['123-456-7890', '987-654-3210']
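
The phone pattern in Solution 3 can also match inside a longer run of digits. As a hedged sketch, adding `\b` word boundaries on both sides restricts matches to numbers that stand alone. The test string and the name `bounded_pattern` are made up for illustration.

```python
import re

phone_pattern = r'\d{3}-\d{3}-\d{4}'
bounded_pattern = r'\b\d{3}-\d{3}-\d{4}\b'

# The first number has extra digits on both sides and is not a valid phone number.
messy_text = "Order 1234-567-89012, call 123-456-7890"

unbounded_hits = re.findall(phone_pattern, messy_text)
bounded_hits = re.findall(bounded_pattern, messy_text)

print(f"Without boundaries: {unbounded_hits}")
print(f"With boundaries: {bounded_hits}")
```

Without boundaries the pattern picks up `234-567-8901` from the middle of the order number; with `\b` only the standalone phone number is returned.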
