## Regular Expressions in Python

### Introduction

Regular expressions (regex) are powerful tools used for pattern matching and text manipulation. They allow you to search, edit, and manipulate text based on specific patterns. Python provides the `re` module, which includes functions for working with regular expressions.

### Basic Concepts

- **Pattern**: A sequence of characters that defines a search pattern.
- **String**: The text to be searched.
- **Match**: The process of comparing a regex pattern with a string.

### Special Characters and Their Meaning

1. **`.`**: Matches any single character except newline.
2. **`^`**: Matches the start of a string.
3. **`$`**: Matches the end of a string, or just before a newline at the end of the string.
4. **`*`**: Matches 0 or more repetitions of the preceding element.
5. **`+`**: Matches 1 or more repetitions of the preceding element.
6. **`?`**: Matches 0 or 1 repetition of the preceding element.
7. **`{m}`**: Matches exactly `m` repetitions of the preceding element.
8. **`{m,n}`**: Matches between `m` and `n` repetitions of the preceding element.
9. **`[]`**: Matches any one of the characters inside the brackets.
10. **`|`**: Matches either the pattern before or the pattern after the `|`.
11. **`()`**: Groups patterns together and captures the matched text.
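
Several of the metacharacters above can be seen working together in a short sketch. The sample strings and variable names are illustrative only and are not part of the `re` API.

```python
import re

# '.' matches any single character except a newline
dot_matches = re.findall(r'c.t', 'cat cot c\nt cut')

# '^' and '$' anchor the pattern to the start and end of the string
anchored = bool(re.search(r'^\d{3}$', '123'))
not_anchored = bool(re.search(r'^\d{3}$', '1234'))

# '|' chooses between alternatives; '()' groups them and captures the text
alt = re.search(r'(cat|dog)s?', 'two dogs')

# '{m,n}' bounds the number of repetitions
bounded = re.findall(r'\b\d{2,3}\b', '1 22 333 4444')

print(dot_matches)                # ['cat', 'cot', 'cut']
print(anchored, not_anchored)     # True False
print(alt.group(), alt.group(1))  # dogs dog
print(bounded)                    # ['22', '333']
```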

### Character Classes

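1. **`\d`**: Matches any digit. For `str` patterns this includes Unicode decimal digits; it is equivalent to `[0-9]` only when the `re.ASCII` flag is set.
2. **`\D`**: Matches any non-digit; the inverse of `\d`.
3. **`\w`**: Matches any word character (alphanumeric + underscore). For `str` patterns this includes Unicode letters and digits; it is equivalent to `[a-zA-Z0-9_]` only when the `re.ASCII` flag is set.
4. **`\W`**: Matches any non-word character; the inverse of `\w`.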
5. **`\s`**: Matches any whitespace character (spaces, tabs, newlines).
6. **`\S`**: Matches any non-whitespace character.
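
In Python 3, `\d` and `\w` are Unicode-aware for `str` patterns unless `re.ASCII` is passed. The sketch below shows the difference; the sample strings are made up for illustration and written with escapes so the file encoding does not matter.

```python
import re

# Sample text: '42' in ASCII digits, then the same number in Arabic-Indic digits
text = 'room 42, \u0664\u0662'

unicode_digits = re.findall(r'\d+', text)
ascii_digits = re.findall(r'\d+', text, flags=re.ASCII)

# 'caf\u00e9' is "cafe" with an accented final letter
unicode_words = re.findall(r'\w+', 'caf\u00e9')
ascii_words = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(unicode_digits)  # two matches: '42' and the Arabic-Indic number
print(ascii_digits)    # ['42']
print(unicode_words)   # one match: the whole word, accent included
print(ascii_words)     # ['caf']
```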

### Common Functions in the `re` Module

#### `re.compile(pattern, flags=0)`

Compiles a regular expression pattern into a regex object, which can be used for matching. Flags modify the behavior of the regex.

- **`re.IGNORECASE`**: Ignore case.
- **`re.MULTILINE`**: Make `^` and `$` match the start and end of each line.
- **`re.DOTALL`**: Make `.` match any character, including newlines.
- **`re.VERBOSE`**: Ignore whitespace and `#` comments inside the pattern so it can be laid out readably.

```python
import re

pattern = re.compile(r'\d+')
```
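
As a rough illustration of what each flag changes, the following sketch runs the same three-line sample text through patterns with and without flags. The text and variable names are assumptions chosen for the example.

```python
import re

text = "first line\nSecond LINE\nthird line"

# re.IGNORECASE: 'line' also matches 'LINE'
case_insensitive = re.findall(r'line', text, flags=re.IGNORECASE)

# re.MULTILINE: '^' matches at the start of every line, not only the string
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# re.DOTALL: '.' also matches '\n', so '.+' can span several lines
with_dotall = re.search(r'first.+third', text, flags=re.DOTALL)
without_dotall = re.search(r'first.+third', text)

# Flags are combined with the bitwise OR operator
combined = re.findall(r'^second', text, flags=re.IGNORECASE | re.MULTILINE)

print(case_insensitive)         # ['line', 'LINE', 'line']
print(line_starts)              # ['first', 'Second', 'third']
print(with_dotall is not None)  # True
print(without_dotall)           # None
print(combined)                 # ['Second']
```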

#### `re.match(pattern, string)`

Determines if the regex matches at the start of the string.

```python
result = re.match(r'\d+', '123abc')
if result:
    print(result.group())
```

#### `re.search(pattern, string)`

Searches the string for the first location where the regex pattern produces a match.

```python
result = re.search(r'\d+', 'abc123def')
if result:
    print(result.group())
```
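
The difference between matching at the start, searching anywhere, and matching the whole string is easiest to see on one input. This is a small sketch with an arbitrary sample string; `re.fullmatch` is included for comparison.

```python
import re

text = 'abc123'

at_start = re.match(r'\d+', text)         # digits are not at the start
anywhere = re.search(r'\d+', text)        # finds '123' further along
whole = re.fullmatch(r'[a-z]+\d+', text)  # pattern must cover the whole string
partial = re.fullmatch(r'[a-z]+', text)   # fails because '123' is left over

print(at_start)          # None
print(anywhere.group())  # 123
print(whole.group())     # abc123
print(partial)           # None
```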

#### `re.findall(pattern, string)`

Finds all substrings where the regex pattern matches and returns them as a list.

```python
result = re.findall(r'\d+', 'abc123def456')
print(result)
```

#### `re.finditer(pattern, string)`

Finds all substrings where the regex pattern matches and returns them as an iterator of match objects.

```python
for match in re.finditer(r'\d+', 'abc123def456'):
    print(match.group())
```

#### `re.sub(pattern, repl, string)`

Replaces the matches with the replacement text.

```python
result = re.sub(r'\d+', 'NUMBER', 'abc123def456')
print(result)
```
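
`re.sub` also accepts a `count` argument and a function as the replacement. The sketch below shows both on the same sample string; doubling the numbers is an arbitrary transformation picked for illustration.

```python
import re

text = 'abc123def456'

# count=1 replaces only the first match
first_only = re.sub(r'\d+', 'NUMBER', text, count=1)

# A function receives each match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

print(first_only)  # abcNUMBERdef456
print(doubled)     # abc246def912
```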

#### `re.split(pattern, string)`

Splits the string by the occurrences of the pattern.

```python
result = re.split(r'\d+', 'abc123def456ghi')
print(result)
```
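
Two details of `re.split` are worth a quick sketch: a capturing group in the pattern keeps the separators in the result, and `maxsplit` limits the number of splits. The sample string is the same one used above.

```python
import re

text = 'abc123def456ghi'

plain = re.split(r'\d+', text)
with_separators = re.split(r'(\d+)', text)    # captured separators are kept
limited = re.split(r'\d+', text, maxsplit=1)  # split at the first match only

print(plain)            # ['abc', 'def', 'ghi']
print(with_separators)  # ['abc', '123', 'def', '456', 'ghi']
print(limited)          # ['abc', 'def456ghi']
```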

### Match Objects

When a pattern matches a string, a match object is returned. The match object provides information about the match.

#### Methods of Match Objects

- **`group()`**: Returns the matched text.
- **`start()`**: Returns the start index of the match.
- **`end()`**: Returns the end index of the match.
- **`span()`**: Returns a tuple containing the start and end indices.

```python
match = re.search(r'\d+', 'abc123def')
if match:
    print(match.group())
    print(match.start())
    print(match.end())
    print(match.span())
```
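
When the pattern contains capturing groups, the same methods accept a group number, and `groups()` returns all captured groups at once. A minimal sketch, using a made-up date string:

```python
import re

match = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Released on 2024-03-15.')

print(match.group(0))  # whole match: 2024-03-15
print(match.group(1))  # first group: 2024
print(match.groups())  # ('2024', '03', '15')
print(match.span())    # (12, 22)
print(match.span(2))   # span of the second group: (17, 19)
```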

### Advanced Topics

#### Lookahead and Lookbehind

- **Positive Lookahead `(?=...)`**: Asserts that what follows the regex must match the lookahead pattern.
- **Negative Lookahead `(?!...)`**: Asserts that what follows the regex must not match the lookahead pattern.
- **Positive Lookbehind `(?<=...)`**: Asserts that what precedes the regex must match the lookbehind pattern.
- **Negative Lookbehind `(?<!...)`**: Asserts that what precedes the regex must not match the lookbehind pattern.

```python
# Positive Lookahead
result = re.search(r'abc(?=\d)', 'abc123')
if result:
    print(result.group())

# Negative Lookahead
result = re.search(r'abc(?!\d)', 'abc123')
if not result:
    print("No match")

# Positive Lookbehind
result = re.search(r'(?<=abc)\d+', 'abc123')
if result:
    print(result.group())

# Negative Lookbehind
result = re.search(r'(?<!abc)\d+', 'def123')
if result:
    print(result.group())
```

#### Non-capturing Groups

Non-capturing groups are useful when you want to group parts of the regex without capturing the matched text.

```python
result = re.search(r'(?:abc)\d+', 'abc123')
if result:
    print(result.group())
```
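
The practical effect of a non-capturing group shows up with `re.findall`, which returns the captured groups when the pattern has any and the whole match otherwise. A short sketch with an illustrative input:

```python
import re

text = 'abc123 abc456'

capturing = re.findall(r'(abc)\d+', text)        # returns only the group
non_capturing = re.findall(r'(?:abc)\d+', text)  # returns the whole match

print(capturing)      # ['abc', 'abc']
print(non_capturing)  # ['abc123', 'abc456']
```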

#### Named Groups

Named groups allow you to assign a name to a capturing group.

```python
pattern = re.compile(r'(?P<word>\w+)\s+(?P<digit>\d+)')
match = pattern.search('abc 123')
if match:
    print(match.group('word'))
    print(match.group('digit'))
```
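
Named groups can also be read as a dictionary with `groupdict()` and referenced in a replacement string with `\g<name>`. The sketch below reuses the same pattern; swapping the two fields is just an example transformation.

```python
import re

pattern = re.compile(r'(?P<word>\w+)\s+(?P<digit>\d+)')
match = pattern.search('abc 123')

fields = match.groupdict()                            # name -> captured text
swapped = pattern.sub(r'\g<digit> \g<word>', 'abc 123')
by_index = match.group(1)                             # named groups keep their numbers

print(fields)    # {'word': 'abc', 'digit': '123'}
print(swapped)   # 123 abc
print(by_index)  # abc
```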

#### Verbose Mode

Verbose mode (`re.VERBOSE`) ignores whitespace and `#` comments inside the pattern, so you can lay out a regular expression across several commented lines.

```python
pattern = re.compile(r"""
    \d+  # Match one or more digits
    \s+  # Followed by one or more spaces
    \w+  # Followed by one or more word characters
    """, re.VERBOSE)

result = pattern.search('123 abc')
if result:
    print(result.group())
```

### Examples

1. **Email Validation**

```python
email_pattern = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')
result = email_pattern.match('example@example.com')
if result:
    print("Valid email")
else:
    print("Invalid email")
```

2. **Phone Number Validation**

```python
phone_pattern = re.compile(r'^\+?\d{1,3}?\s?\(?\d{1,4}?\)?[\s.-]?\d{1,4}[\s.-]?\d{1,9}$')
result = phone_pattern.match('+1 123-456-7890')
if result:
    print("Valid phone number")
else:
    print("Invalid phone number")
```

3. **Extracting URLs from Text**

```python
text = 'Check out https://www.example.com and http://www.test.com'
url_pattern = re.compile(r'https?://[a-zA-Z0-9./-]+')
urls = url_pattern.findall(text)
print(urls)
```

4. **Password Strength Validation**

```python
password_pattern = re.compile(r'''
    (?=.*[A-Z])       # at least one uppercase letter
    (?=.*[a-z])       # at least one lowercase letter
    (?=.*\d)          # at least one digit
    (?=.*[@$!%*?&])   # at least one special character
    [A-Za-z\d@$!%*?&]{8,}  # minimum length of 8 characters
    ''', re.VERBOSE)

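# fullmatch() requires every character to be in the allowed set; match() would
# accept any string whose first 8 characters qualify
result = password_pattern.fullmatch('StrongP@ssw0rd')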
if result:
    print("Strong password")
else:
    print("Weak password")
```

### Conclusion

Regular expressions are versatile and powerful for text processing tasks. By mastering regex, you can efficiently search, validate, and manipulate text data. Use `re.compile` when a pattern is reused, and flags such as `re.VERBOSE` to keep complex patterns readable. Regular expressions may have a steep learning curve, but with practice they become an invaluable skill for any Python developer.

---

## Using Python for Matching with Grep

### Introduction

Grep is a powerful command-line utility used for searching plain-text data for lines that match a regular expression. In Python, similar functionality can be achieved using the `re` module along with other standard libraries. This guide will cover how to replicate the behavior of grep using Python, including searching for patterns in files, filtering lines, and other advanced techniques.

### Basic Concepts

1. **Pattern**: A sequence of characters that defines a search pattern (regex).
2. **String**: The text to be searched.
3. **Match**: The process of comparing a regex pattern with a string.
4. **File I/O**: Reading from and writing to files.

### The `re` Module

Python’s `re` module provides the necessary functions for working with regular expressions.

- **`re.compile(pattern, flags=0)`**: Compiles a regex pattern into a regex object.
- **`re.search(pattern, string)`**: Searches for the pattern within the string.
- **`re.findall(pattern, string)`**: Returns all non-overlapping matches of the pattern in the string.
- **`re.finditer(pattern, string)`**: Returns an iterator yielding match objects for all non-overlapping matches.
- **`re.match(pattern, string)`**: Determines if the regex matches at the start of the string.
- **`re.sub(pattern, repl, string)`**: Replaces the matches with the replacement text.
- **`re.split(pattern, string)`**: Splits the string by the occurrences of the pattern.

### Reading and Searching Files

To replicate grep’s functionality, we need to read files and search for patterns within the content.

#### Reading Files

Python provides several methods to read files:

```python
# Reading all lines of a file into a list
with open('example.txt', 'r') as file:
    lines = file.readlines()

# Reading the entire file content
with open('example.txt', 'r') as file:
    content = file.read()
```

#### Searching for Patterns

Combining file reading with the `re` module allows us to search for patterns within files.

```python
import re

pattern = re.compile(r'your_regex_pattern')

with open('example.txt', 'r') as file:
    for line in file:
        if pattern.search(line):
            print(line.strip())
```

### Command-Line Grep with Python

To create a Python script that mimics grep, you can use the following structure:

#### Simple Grep Script

```python
import re
import sys

def grep(pattern, filename):
    regex = re.compile(pattern)
    with open(filename, 'r') as file:
        for line in file:
            if regex.search(line):
                print(line.strip())

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python grep.py <pattern> <file>")
        sys.exit(1)
    pattern = sys.argv[1]
    filename = sys.argv[2]
    grep(pattern, filename)
```

### Advanced Grep-like Functionality

#### Case-Insensitive Search

```python
import re

def grep(pattern, filename, ignore_case=False):
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    with open(filename, 'r') as file:
        for line in file:
            if regex.search(line):
                print(line.strip())

if __name__ == "__main__":
    import sys
    pattern = sys.argv[1]
    filename = sys.argv[2]
    ignore_case = '--ignore-case' in sys.argv
    grep(pattern, filename, ignore_case)
```

#### Recursive Search in Directories

To search for patterns in all files within a directory recursively, you can use the `os` and `glob` modules:

```python
import re
import os
import glob

def grep(pattern, path, ignore_case=False):
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    for filepath in glob.glob(os.path.join(path, '**'), recursive=True):
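        if os.path.isfile(filepath):
            try:
                # open() is inside the try block so unreadable files are reported too
                with open(filepath, 'r') as file:
                    for line in file:
                        if regex.search(line):
                            print(f"{filepath}: {line.strip()}")
            except (OSError, UnicodeDecodeError) as e:
                # Covers permission errors and binary or undecodable content
                print(f"Could not read {filepath}: {e}")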

if __name__ == "__main__":
    import sys
    pattern = sys.argv[1]
    path = sys.argv[2]
    ignore_case = '--ignore-case' in sys.argv
    grep(pattern, path, ignore_case)
```

### Additional Features

#### Counting Matches

To count the number of lines that match the pattern:

```python
import re

def grep(pattern, filename, count=False):
    regex = re.compile(pattern)
    match_count = 0
    with open(filename, 'r') as file:
        for line in file:
            if regex.search(line):
                if count:
                    match_count += 1
                else:
                    print(line.strip())
    if count:
        print(f"Total matches: {match_count}")

if __name__ == "__main__":
    import sys
    pattern = sys.argv[1]
    filename = sys.argv[2]
    count = '--count' in sys.argv
    grep(pattern, filename, count)
```

#### Printing Line Numbers

To print the line numbers of matching lines:

```python
import re

def grep(pattern, filename, show_line_numbers=False):
    regex = re.compile(pattern)
    with open(filename, 'r') as file:
        for i, line in enumerate(file, start=1):
            if regex.search(line):
                if show_line_numbers:
                    print(f"{i}:{line.strip()}")
                else:
                    print(line.strip())

if __name__ == "__main__":
    import sys
    pattern = sys.argv[1]
    filename = sys.argv[2]
    show_line_numbers = '--line-numbers' in sys.argv
    grep(pattern, filename, show_line_numbers)
```
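
The options shown so far can also be combined in one place. The following is a minimal sketch, not a full grep replacement: the names `grep_lines` and `format_matches` and the sample lines are hypothetical, and the search is separated from the printing so it can be run on any iterable of lines, including an open file.

```python
import re


def grep_lines(pattern, lines, ignore_case=False):
    """Yield (line_number, line) for every line that contains a match."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip('\n')  # drop only the newline, keep indentation
        if regex.search(line):
            yield number, line


def format_matches(matches, count=False, show_line_numbers=False):
    """Turn (line_number, line) pairs into the strings to print."""
    if count:
        return [f"Total matches: {len(matches)}"]
    if show_line_numbers:
        return [f"{number}:{line}" for number, line in matches]
    return [line for _, line in matches]


# Illustrative input; an open file object could be passed instead
sample_lines = ["Error: disk full\n", "all good\n", "  error: retry\n"]
matches = list(grep_lines(r"error", sample_lines, ignore_case=True))

for output in format_matches(matches, show_line_numbers=True):
    print(output)
```

Keeping the matching logic in a generator means counting, numbering, and plain printing all reuse the same search loop.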

### Example Use Cases

#### Searching for an Email Pattern

```python
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
grep(email_pattern, 'example.txt')
```

#### Searching for a Specific Word (Case-Insensitive)

```python
word_pattern = r'\bword\b'
grep(word_pattern, 'example.txt', ignore_case=True)
```

#### Recursive Search for a Phone Number Pattern in a Directory

```python
phone_pattern = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
grep(phone_pattern, '/path/to/directory', ignore_case=True)
```

### Conclusion

By using Python’s `re` module along with standard file handling operations, you can create powerful tools that replicate and extend the functionality of grep. Whether you need simple pattern matching, case-insensitive searches, recursive directory searches, or more advanced features like counting matches and showing line numbers, Python provides the flexibility and power to achieve these tasks efficiently.

---

## Simple Matching Using Python

### Introduction

Simple matching in Python involves checking if a string contains a specific substring or pattern. This can be done using built-in string methods and the `re` module for more complex patterns. This guide covers the various methods and techniques for simple string matching in Python.

### Basic String Methods for Simple Matching

Python provides several built-in methods for basic string matching and manipulation. These methods are straightforward and efficient for simple tasks.

#### `str.find()`

- **Usage**: Finds the first occurrence of a substring in a string. Returns the index of the substring or `-1` if not found.

```python
text = "Hello, world!"
index = text.find("world")
print(index)  # Output: 7

index = text.find("Python")
print(index)  # Output: -1
```

#### `str.index()`

- **Usage**: Similar to `find()`, but raises a `ValueError` if the substring is not found.

```python
try:
    index = text.index("world")
    print(index)  # Output: 7
except ValueError:
    print("Substring not found")

try:
    index = text.index("Python")
except ValueError:
    print("Substring not found")  # Output: Substring not found
```

#### `str.startswith()`

- **Usage**: Checks if a string starts with a specified substring. Returns `True` or `False`.

```python
result = text.startswith("Hello")
print(result)  # Output: True

result = text.startswith("world")
print(result)  # Output: False
```

#### `str.endswith()`

- **Usage**: Checks if a string ends with a specified substring. Returns `True` or `False`.

```python
result = text.endswith("world!")
print(result)  # Output: True

result = text.endswith("Hello")
print(result)  # Output: False
```

#### `str.count()`

- **Usage**: Counts the occurrences of a substring in a string.

```python
text = "Hello, world! Hello, Python!"
count = text.count("Hello")
print(count)  # Output: 2

count = text.count("Java")
print(count)  # Output: 0
```

### Using the `re` Module for Simple Matching

The `re` module in Python provides more powerful and flexible functions for pattern matching using regular expressions.

#### `re.search()`

- **Usage**: Searches for the first occurrence of a pattern in a string. Returns a match object or `None`.

```python
import re

pattern = r"world"
match = re.search(pattern, text)
if match:
    print(f"Found '{match.group()}' at position {match.start()}")  # Output: Found 'world' at position 7
else:
    print("Pattern not found")
```

#### `re.match()`

- **Usage**: Checks if the pattern matches the beginning of the string. Returns a match object or `None`.

```python
pattern = r"Hello"
match = re.match(pattern, text)
if match:
    print(f"Found '{match.group()}' at the beginning of the string")  # Output: Found 'Hello' at the beginning of the string
else:
    print("Pattern not found at the beginning")
```

#### `re.fullmatch()`

- **Usage**: Checks if the pattern matches the entire string. Returns a match object or `None`.

```python
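pattern = r"Hello, world!"
# `text` was reassigned to a longer string above, so match the original greeting explicitly
match = re.fullmatch(pattern, "Hello, world!")
if match:
    print("The entire string matches the pattern")  # Output: The entire string matches the pattern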
else:
    print("Pattern does not match the entire string")
```

#### `re.findall()`

- **Usage**: Finds all non-overlapping matches of the pattern in the string. Returns a list of matches.

```python
pattern = r"Hello"
matches = re.findall(pattern, text)
print(matches)  # Output: ['Hello', 'Hello']
```

#### `re.finditer()`

- **Usage**: Finds all non-overlapping matches of the pattern in the string. Returns an iterator yielding match objects.

```python
pattern = r"Hello"
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Found '{match.group()}' at position {match.start()}")
# Output:
# Found 'Hello' at position 0
# Found 'Hello' at position 14
```

#### `re.sub()`

- **Usage**: Replaces occurrences of the pattern with a replacement string.

```python
pattern = r"Hello"
replacement = "Hi"
result = re.sub(pattern, replacement, text)
print(result)  # Output: Hi, world! Hi, Python!
```

### Examples of Simple Matching

#### Check if a String Contains a Substring

```python
def contains_substring(text, substring):
    return substring in text

print(contains_substring("Hello, world!", "world"))  # Output: True
print(contains_substring("Hello, world!", "Python"))  # Output: False
```

#### Find All Occurrences of a Substring

```python
def find_all_occurrences(text, substring):
    return [i for i in range(len(text)) if text.startswith(substring, i)]

print(find_all_occurrences("Hello, world! Hello, Python!", "Hello"))  # Output: [0, 14]
```
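
One property of the list comprehension above is that it reports overlapping occurrences, whereas `re.finditer` does not unless the pattern is wrapped in a lookahead. A small sketch comparing the three approaches, with an input chosen to make the overlap visible:

```python
import re


def find_all_occurrences(text, substring):
    return [i for i in range(len(text)) if text.startswith(substring, i)]


text = 'aaaa'

overlapping = find_all_occurrences(text, 'aa')
non_overlapping = [m.start() for m in re.finditer(re.escape('aa'), text)]
# A zero-width lookahead consumes nothing, so matches may overlap
with_lookahead = [m.start() for m in re.finditer(r'(?=aa)', text)]

print(overlapping)      # [0, 1, 2]
print(non_overlapping)  # [0, 2]
print(with_lookahead)   # [0, 1, 2]
```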

#### Case-Insensitive Matching

```python
import re

def contains_substring_case_insensitive(text, substring):
    pattern = re.compile(re.escape(substring), re.IGNORECASE)
    return bool(pattern.search(text))

print(contains_substring_case_insensitive("Hello, world!", "hello"))  # Output: True
print(contains_substring_case_insensitive("Hello, world!", "WORLD"))  # Output: True
print(contains_substring_case_insensitive("Hello, world!", "Python"))  # Output: False
```

#### Validate Email Addresses

```python
import re

def is_valid_email(email):
    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
    return re.match(pattern, email) is not None

print(is_valid_email("example@example.com"))  # Output: True
print(is_valid_email("example@.com"))  # Output: False
```

#### Extract All Email Addresses from Text

```python
import re

def extract_emails(text):
    # The domain is written as dot-separated labels so a sentence-ending period is not captured
    pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+"
    return re.findall(pattern, text)

text = "Contact us at support@example.com or sales@example.com."
emails = extract_emails(text)
print(emails)  # Output: ['support@example.com', 'sales@example.com']
```

### Summary

- **Basic String Methods**: Use `str.find()`, `str.index()`, `str.startswith()`, `str.endswith()`, and `str.count()` for simple matching tasks.
- **The `re` Module**: Use `re.search()`, `re.match()`, `re.fullmatch()`, `re.findall()`, `re.finditer()`, and `re.sub()` for more complex pattern matching with regular expressions.
- **Examples**: Demonstrated common use cases such as checking for substrings, finding all occurrences, case-insensitive matching, validating email addresses, and extracting emails from text.

By mastering these methods and techniques, you can effectively perform simple and complex string matching tasks in Python.

---

## Wildcards and Character Classes in Python

### Introduction

Wildcards and character classes are essential components of regular expressions, enabling more flexible and powerful text matching. In Python, the `re` module is used to work with regular expressions, allowing you to create patterns that include wildcards and character classes.

### Wildcards in Regular Expressions

Wildcards are symbols that represent one or more characters in a search pattern. The most common wildcard is the dot (`.`), which matches any single character except a newline.

#### Basic Wildcard Usage

- **`.`**: Matches any single character except newline.

```python
import re

pattern = r"."
text = "abc"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c']
```

#### Using Wildcards in Patterns

- **`a.b`**: Matches `a`, followed by any single character (except a newline), followed by `b`.

```python
pattern = r"a.b"
text = "acb aeb a#b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['acb', 'aeb', 'a#b']
```

### Character Classes in Regular Expressions

Character classes allow you to specify a set of characters to match. They are defined using square brackets `[]`.

#### Basic Character Class Usage

- **`[abc]`**: Matches any one of the characters `a`, `b`, or `c`.

```python
pattern = r"[abc]"
text = "abcdef"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c']
```

#### Ranges in Character Classes

- **`[a-z]`**: Matches any lowercase letter from `a` to `z`.
- **`[A-Z]`**: Matches any uppercase letter from `A` to `Z`.
- **`[0-9]`**: Matches any digit from `0` to `9`.

```python
pattern = r"[a-z]"
text = "aBcDeFg"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'c', 'e', 'g']
```

#### Combining Character Classes

- **`[a-zA-Z]`**: Matches any letter, regardless of case.
- **`[a-zA-Z0-9]`**: Matches any alphanumeric character.

```python
pattern = r"[a-zA-Z0-9]"
text = "aBcDeFg123"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'B', 'c', 'D', 'e', 'F', 'g', '1', '2', '3']
```

#### Negation in Character Classes

- **`[^abc]`**: Matches any character except `a`, `b`, or `c`.
- **`[^a-z]`**: Matches any character not in the range `a` to `z`.

```python
pattern = r"[^a-z]"
text = "aBcDeFg123"
matches = re.findall(pattern, text)
print(matches)  # Output: ['B', 'D', 'F', '1', '2', '3']
```

### Predefined Character Classes

Python’s `re` module provides several predefined character classes for common patterns.

#### Digits

- **`\d`**: Matches any digit. For `str` patterns this includes Unicode decimal digits; it is equivalent to `[0-9]` only with the `re.ASCII` flag.
- **`\D`**: Matches any non-digit; the inverse of `\d`.

```python
pattern = r"\d"
text = "abc123"
matches = re.findall(pattern, text)
print(matches)  # Output: ['1', '2', '3']
```

#### Word Characters

- **`\w`**: Matches any word character (alphanumeric + underscore). For `str` patterns this includes Unicode letters and digits; it is equivalent to `[a-zA-Z0-9_]` only with the `re.ASCII` flag.
- **`\W`**: Matches any non-word character; the inverse of `\w`.

```python
pattern = r"\w"
text = "abc_123!"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c', '_', '1', '2', '3']
```

#### Whitespace Characters

- **`\s`**: Matches any whitespace character (spaces, tabs, newlines).
- **`\S`**: Matches any non-whitespace character.

```python
pattern = r"\s"
text = "a b\tc\nd"
matches = re.findall(pattern, text)
print(matches)  # Output: [' ', '\t', '\n']
```

### Combining Wildcards and Character Classes

Combining wildcards and character classes can create powerful and flexible patterns.

#### Example: Matching Email Addresses

```python
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
text = "Contact us at support@example.com or sales@example.co.uk"
matches = re.findall(pattern, text)
print(matches)  # Output: ['support@example.com', 'sales@example.co.uk']
```

#### Example: Matching Phone Numbers

```python
pattern = r"\+?\d[\d -]{8,}\d"
text = "Call us at +1 123-456-7890 or 9876543210"
matches = re.findall(pattern, text)
print(matches)  # Output: ['+1 123-456-7890', '9876543210']
```

### Advanced Character Class Techniques

#### Intersection and Union

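Union needs no special syntax: listing several ranges or classes inside one pair of brackets matches a character from any of them. Python’s `re` module has no intersection operator for character classes, but a lookahead placed before a class gives the same effect, because both conditions must hold for the same character.

- **Intersection**: Match characters that belong to both classes. Here the lookahead requires a lowercase letter and the negated class then excludes vowels, leaving lowercase consonants.

```python
pattern = r"(?=[a-z])[^aeiou]"
text = "a1b2c3"
matches = re.findall(pattern, text)
print(matches)  # Output: ['b', 'c']
```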

- **Union**: Match characters that belong to either class.

```python
pattern = r"[a-f0-9]"
text = "abc123xyz"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'b', 'c', '1', '2', '3']
```

#### Custom Character Classes

You can create custom character classes using combinations of predefined classes and ranges.

```python
pattern = r"[\w.-]"
text = "aB_c-123.xyz"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'B', '_', 'c', '-', '1', '2', '3', '.', 'x', 'y', 'z']
```

### Summary

- **Wildcards**: Use `.` to match any single character except a newline.
- **Character Classes**: Use `[...]` to define a set of characters to match, and `[^...]` for negation.
- **Ranges**: Use `-` within character classes to define ranges, e.g., `[a-z]`, `[A-Z]`, `[0-9]`.
- **Predefined Classes**: Use `\d`, `\D`, `\w`, `\W`, `\s`, and `\S` for common patterns.
- **Combining Patterns**: Combine wildcards, character classes, and predefined classes to create powerful regex patterns.
- **Advanced Techniques**: Combine ranges inside one set of brackets for union, use a lookahead before a class for intersection, and create custom classes by combining predefined classes and ranges.

By mastering wildcards and character classes, you can create versatile and powerful regular expressions to match a wide variety of text patterns in Python.

---

## Repetition Qualifiers in Python

### Introduction

Repetition qualifiers in regular expressions allow you to specify how many times a particular pattern should occur. They provide a powerful way to match sequences of varying lengths. In Python, repetition qualifiers are used in conjunction with the `re` module to create flexible and dynamic patterns.

### Basic Repetition Qualifiers

There are several repetition qualifiers in regular expressions, each with specific usage:

1. **`*` (Star)**
2. **`+` (Plus)**
3. **`?` (Question Mark)**
4. **`{n}` (Exact Count)**
5. **`{n,}` (At Least n)**
6. **`{n,m}` (Between n and m)**

### `*` (Star) Qualifier

The `*` qualifier matches zero or more occurrences of the preceding element.

- **Pattern**: `a*`
- **Description**: Matches a run of zero or more consecutive 'a' characters, so it can also match the empty string.

```python
import re

pattern = r"a*"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['aaa', '', '', '', '']
```

### `+` (Plus) Qualifier

The `+` qualifier matches one or more occurrences of the preceding element.

- **Pattern**: `a+`
- **Description**: Matches a run of one or more consecutive 'a' characters.

```python
pattern = r"a+"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['aaa']
```

### `?` (Question Mark) Qualifier

The `?` qualifier matches zero or one occurrence of the preceding element.

- **Pattern**: `a?`
- **Description**: Matches a single optional 'a': either one 'a' or the empty string.

```python
pattern = r"a?"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a', 'a', 'a', '', '', '', '']
```

### `{n}` (Exact Count) Qualifier

The `{n}` qualifier matches exactly `n` occurrences of the preceding element.

- **Pattern**: `a{3}`
- **Description**: Matches exactly three consecutive 'a' characters.

```python
pattern = r"a{3}"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['aaa']
```

### `{n,}` (At Least n) Qualifier

The `{n,}` qualifier matches `n` or more occurrences of the preceding element.

- **Pattern**: `a{2,}`
- **Description**: Matches a run of two or more consecutive 'a' characters.

```python
pattern = r"a{2,}"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['aaa']
```

### `{n,m}` (Between n and m) Qualifier

The `{n,m}` qualifier matches between `n` and `m` occurrences of the preceding element.

- **Pattern**: `a{2,3}`
- **Description**: Matches a run of two or three consecutive 'a' characters.

```python
pattern = r"a{2,3}"
text = "aaabbb"
matches = re.findall(pattern, text)
print(matches)  # Output: ['aaa']
```

### Greedy vs. Non-Greedy Matching

By default, repetition qualifiers are greedy, meaning they match as many characters as possible. You can make them non-greedy by appending a `?`.

#### Greedy Matching

- **Pattern**: `a.*b`
- **Description**: Matches the longest string starting with 'a' and ending with 'b'.

```python
pattern = r"a.*b"
text = "a123b456b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a123b456b']
```

#### Non-Greedy Matching

- **Pattern**: `a.*?b`
- **Description**: Matches the shortest string starting with 'a' and ending with 'b'.

```python
pattern = r"a.*?b"
text = "a123b456b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['a123b'] (the remaining '456b' contains no 'a', so there is no second match)
```
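
The `?` suffix works on every repetition qualifier, not only `*`. The sketch below compares greedy and non-greedy forms of `+` and `{m,n}` on two illustrative inputs.

```python
import re

html = '<b>bold</b>'
greedy_tags = re.findall(r'<.+>', html)  # '.+' runs to the last '>'
lazy_tags = re.findall(r'<.+?>', html)   # '.+?' stops at the first '>'

digits = '123456'
greedy_count = re.search(r'\d{2,4}', digits).group()  # as many as allowed
lazy_count = re.search(r'\d{2,4}?', digits).group()   # as few as allowed

print(greedy_tags)   # ['<b>bold</b>']
print(lazy_tags)     # ['<b>', '</b>']
print(greedy_count)  # 1234
print(lazy_count)    # 12
```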

### Using Repetition Qualifiers in Real-World Examples

#### Matching Phone Numbers

```python
pattern = r"\d{3}-\d{3}-\d{4}"
text = "Call me at 123-456-7890 or 987-654-3210."
matches = re.findall(pattern, text)
print(matches)  # Output: ['123-456-7890', '987-654-3210']
```

#### Matching Email Addresses

```python
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
text = "Contact us at support@example.com or sales@example.co.uk."
matches = re.findall(pattern, text)
print(matches)  # Output: ['support@example.com', 'sales@example.co.uk']
```

#### Matching HTML Tags

```python
pattern = r"<.*?>"
text = "<div><p>Hello, world!</p></div>"
matches = re.findall(pattern, text)
print(matches)  # Output: ['<div>', '<p>', '</p>', '</div>']
```

### Advanced Techniques with Repetition Qualifiers

#### Nested Repetition Qualifiers

You can use nested repetition qualifiers to match more complex patterns.

- **Pattern**: `((\d+)\s*)+`
- **Description**: Matches one or more runs of digits, each followed by optional whitespace. When a group is repeated, only its last repetition is captured.

```python
pattern = r"((\d+)\s*)+"
text = "123 456 789"
matches = re.findall(pattern, text)
print(matches)  # Output: [('789', '789')]
```
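
A repeated capturing group keeps only its final repetition, so collecting every repetition usually takes a second, simpler pattern. A minimal sketch of that two-step approach, using the same sample text:

```python
import re

text = "123 456 789"

# Step 1: check the overall shape; the group holds only the last number
shape = re.fullmatch(r'(?:(\d+)\s*)+', text)
last_number = shape.group(1)

# Step 2: extract each repetition with the inner pattern on its own
all_numbers = re.findall(r'\d+', text)

print(last_number)  # 789
print(all_numbers)  # ['123', '456', '789']
```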

#### Using Lookaheads and Lookbehinds

Lookaheads and lookbehinds can be combined with repetition qualifiers for more advanced matching.

- **Pattern**: `(?=\d{3,})(\d+\.\d{2})`
- **Description**: Matches a number with at least three digits followed by a dot and exactly two digits.

```python
pattern = r"(?=\d{3,})(\d+\.\d{2})"
text = "The prices are 12.34, 123.45, and 1234.56."
matches = re.findall(pattern, text)
print(matches)  # Output: ['123.45', '1234.56']
```

### Summary

- **Basic Repetition Qualifiers**:
  - `*`: Zero or more occurrences
  - `+`: One or more occurrences
  - `?`: Zero or one occurrence
  - `{n}`: Exactly `n` occurrences
  - `{n,}`: At least `n` occurrences
  - `{n,m}`: Between `n` and `m` occurrences

- **Greedy vs. Non-Greedy Matching**:
  - Greedy: Matches as many characters as possible.
  - Non-Greedy: Matches as few characters as possible, using `?`.

- **Real-World Examples**:
  - Matching phone numbers, email addresses, and HTML tags.

- **Advanced Techniques**:
  - Nested repetition qualifiers.
  - Combining lookaheads and lookbehinds with repetition qualifiers.

By mastering repetition qualifiers, you can create powerful and flexible regular expressions to match patterns of varying lengths and complexity in Python.

---

## Escaping Characters in Python

### Introduction

Escaping characters in Python, especially within regular expressions, is crucial for accurately matching specific characters that have special meanings in regex syntax. This guide covers the various methods and techniques for escaping characters in Python, with a focus on regular expressions using the `re` module.

### Understanding Special Characters

Regular expressions use certain characters as metacharacters, which have special meanings. Some common metacharacters include:

- `.` (dot)
- `^` (caret)
- `$` (dollar sign)
- `*` (asterisk)
- `+` (plus)
- `?` (question mark)
- `{}` (curly braces)
- `[]` (square brackets)
- `()` (parentheses)
- `|` (pipe)
- `\` (backslash)

To match these characters literally, you need to escape them.

### Escaping Special Characters

In Python regular expressions, you escape a special character by preceding it with a backslash (`\`). This tells the regex engine to treat the metacharacter as a literal character.

#### Example: Matching a Literal Dot

To match a literal dot (`.`), which normally matches any character except a newline:

```python
import re

pattern = r"\."
text = "www.example.com"
matches = re.findall(pattern, text)
print(matches)  # Output: ['.', '.']
```
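To make the difference concrete, the following sketch runs the same search with and without the escape; the sample string is made up for illustration:

```python
import re

text = "abc a.c axc"

# Unescaped: '.' stands for any character, so all three words match.
unescaped = re.findall(r"a.c", text)

# Escaped: only the literal 'a.c' matches.
escaped = re.findall(r"a\.c", text)

print(unescaped)  # ['abc', 'a.c', 'axc']
print(escaped)    # ['a.c']
```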

### Escaping Backslashes

Backslashes are used for escaping characters, so to match a literal backslash (`\`), you need to escape it with another backslash:

```python
pattern = r"\\"
text = "C:\\path\\to\\file"
matches = re.findall(pattern, text)
print(matches)  # Output: ['\\', '\\', '\\']  (three backslashes, shown in repr form)
```
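There are two layers of escaping at work here: Python's string literal syntax and the regex syntax. The sketch below shows that a raw string avoids doubling the backslashes a second time; the path is an illustrative value:

```python
import re

# In a raw string, two backslashes reach the regex engine unchanged.
raw_pattern = r"\\"
# In a normal string, each of them must be doubled again.
plain_pattern = "\\\\"
same = raw_pattern == plain_pattern

text = r"C:\path\to\file"          # contains three single backslashes
parts = re.split(raw_pattern, text)

print(same)   # True
print(parts)  # ['C:', 'path', 'to', 'file']
```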

### Common Escaped Characters in Regular Expressions

Here are some commonly escaped characters and their usage:

- `\.`: Matches a literal dot.
- `\\`: Matches a literal backslash.
- `\*`: Matches a literal asterisk.
- `\+`: Matches a literal plus.
- `\?`: Matches a literal question mark.
- `\{`: Matches a literal opening curly brace.
- `\}`: Matches a literal closing curly brace.
- `\(`: Matches a literal opening parenthesis.
- `\)`: Matches a literal closing parenthesis.
- `\[`: Matches a literal opening square bracket.
- `\]`: Matches a literal closing square bracket.
- `\|`: Matches a literal pipe.
- `\^`: Matches a literal caret.
- `\$`: Matches a literal dollar sign.

### Escaping Inside Character Classes

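Inside character classes (square brackets `[]`), most metacharacters lose their special meaning. The exceptions are `\`, `]`, `^` (only as the first character, where it negates the class), and `-` (only between two characters, where it forms a range). Escaping the remaining metacharacters is optional but harmless, and some authors do it for clarity.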

```python
pattern = r"[.\*\+\?\{\}\(\)\[\]\|\\]"
text = ". * + ? { } ( ) [ ] | \\"
matches = re.findall(pattern, text)
print(matches)  # Output: ['.', '*', '+', '?', '{', '}', '(', ')', '[', ']', '|', '\\']
```
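The characters that stay special inside a class depend on their position. A short sketch, with sample strings chosen only for illustration:

```python
import re

# Unescaped metacharacters are already literal inside a class.
ops = re.findall(r"[.+*]", "1.5+2*3")

# An unescaped '-' between two characters forms a range ...
as_range = re.findall(r"[a-z]", "a-z b")
# ... while an escaped '-' is a literal hyphen.
as_literal = re.findall(r"[a\-z]", "a-z b")

# '^' negates the class only when it comes first.
negated = re.findall(r"[^ab]", "abc^")
literal_caret = re.findall(r"[ab^]", "abc^")

print(ops)            # ['.', '+', '*']
print(as_range)       # ['a', 'z', 'b']
print(as_literal)     # ['a', '-', 'z']
print(negated)        # ['c', '^']
print(literal_caret)  # ['a', 'b', '^']
```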

### Using `re.escape()`

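The `re` module provides a convenient function, `re.escape()`, which escapes the characters in a string that have special meaning in a regular expression (before Python 3.7 it escaped every character other than ASCII letters, digits, and `_`). This is particularly useful when constructing patterns from user input or variable data.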

```python
pattern = re.escape(".*+?{}()[]|\\")
text = ".*+?{}()[]|\\"
matches = re.findall(pattern, text)
print(matches)  # Output: ['.*+?{}()[]|\\']
```
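A more typical use is searching for text supplied at runtime. In this sketch `user_input` is a hypothetical value standing in for data from a form or file:

```python
import re

user_input = "1+1=2 (approx.)"   # hypothetical user-supplied search text
text = "Claim: 1+1=2 (approx.) holds."

# Without escaping, '+', '(', ')' and '.' are treated as regex syntax.
unescaped_match = re.search(user_input, text)

# With escaping, the input is matched character for character.
escaped_match = re.search(re.escape(user_input), text)

print(unescaped_match)        # None
print(escaped_match.group())  # 1+1=2 (approx.)
```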

### Practical Examples of Escaping Characters

#### Matching File Paths

```python
pattern = r"C:\\path\\to\\file"
text = "C:\\path\\to\\file is the location."
matches = re.findall(pattern, text)
print(matches)  # Output: ['C:\\path\\to\\file']
```

#### Matching URLs

```python
pattern = r"https?://www\.example\.com/\S*"
text = "Visit http://www.example.com/test or https://www.example.com/path."
matches = re.findall(pattern, text)
print(matches)  # Output: ['http://www.example.com/test', 'https://www.example.com/path.']
# Note: \S* also consumes the sentence-final period after the second URL.
```
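Because `\S*` consumes every non-whitespace character, punctuation that directly follows a URL in a sentence becomes part of the match. One simple way to handle this, sketched here under the assumption that URLs of interest never end in such punctuation, is to strip it afterwards:

```python
import re

pattern = r"https?://www\.example\.com/\S*"
text = "Visit http://www.example.com/test or https://www.example.com/path."

raw = re.findall(pattern, text)
# Remove trailing sentence punctuation from each match.
cleaned = [url.rstrip(".,;:!?") for url in raw]

print(raw)      # ['http://www.example.com/test', 'https://www.example.com/path.']
print(cleaned)  # ['http://www.example.com/test', 'https://www.example.com/path']
```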

#### Matching Email Addresses with Escaped Characters

```python
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
text = "Contact us at support@example.com or sales@example.co.uk."
matches = re.findall(pattern, text)
print(matches)  # Output: ['support@example.com', 'sales@example.co.uk']
```

### Combining Escaped Characters with Other Patterns

You can combine escaped characters with other patterns to create more complex regular expressions.

#### Example: Matching JavaScript Function Definitions

```python
pattern = r"function\s+\w+\s*\([^)]*\)\s*\{[^}]*\}"  # braces escaped explicitly
text = "function test(param) { console.log(param); }"
matches = re.findall(pattern, text)
print(matches)  # Output: ['function test(param) { console.log(param); }']
```

### Summary

- **Special Characters**: Recognize the special characters in regular expressions and understand their roles.
- **Escaping Characters**: Use a backslash (`\`) to escape special characters and match them literally.
- **Escaping Backslashes**: Double the backslash (`\\`) to match a literal backslash.
- **Common Escaped Characters**: Familiarize yourself with the common escaped characters like `\.` and `\\`.
- **Character Classes**: Be aware of the special handling of certain characters inside character classes.
- **`re.escape()`**: Use `re.escape()` to automatically escape the regex metacharacters in a string.
- **Practical Examples**: Apply escaping in practical scenarios like matching file paths, URLs, and email addresses.
- **Combining Patterns**: Combine escaped characters with other patterns to create complex regex patterns.

By mastering escaping characters in regular expressions, you can create precise and effective patterns for a wide range of text processing tasks in Python.

---

# Regular Expressions in Action

## Introduction

Regular expressions (regex) are sequences of characters that define a search pattern. They are used for string matching and manipulation. Python's `re` module provides support for regex.

## Basics of Regular Expressions

### Special Characters

1. **`.`** - Matches any character except a newline.
2. **`^`** - Matches the start of the string.
3. **`$`** - Matches the end of the string.
4. **`*`** - Matches 0 or more repetitions of the preceding element.
5. **`+`** - Matches 1 or more repetitions of the preceding element.
6. **`?`** - Matches 0 or 1 repetition of the preceding element.
7. **`*?`**, **`+?`**, **`??`** - Non-greedy versions of `*`, `+`, and `?`.
8. **`\`** - Escapes special characters.
9. **`[]`** - Matches any one of the characters inside the brackets.
10. **`[^]`** - Matches any character not inside the brackets.
11. **`()`** - Groups expressions and captures the matched text.
12. **`|`** - Matches either the expression before or the expression after the `|`.

### Character Classes

1. **`\d`** - Matches any digit; equivalent to `[0-9]`.
2. **`\D`** - Matches any non-digit; equivalent to `[^0-9]`.
3. **`\w`** - Matches any word character (alphanumeric or underscore); equivalent to `[a-zA-Z0-9_]`.
4. **`\W`** - Matches any non-word character; equivalent to `[^a-zA-Z0-9_]`.
5. **`\s`** - Matches any whitespace character; equivalent to `[ \t\n\r\f\v]`.
6. **`\S`** - Matches any non-whitespace character; equivalent to `[^ \t\n\r\f\v]`.

The bracketed equivalents hold for ASCII matching. For `str` patterns, these classes also match Unicode digits, letters, and whitespace unless the `re.ASCII` flag is set.

### Quantifiers

1. **{n}** - Matches exactly n occurrences of the preceding element.
2. **{n,}** - Matches n or more occurrences of the preceding element.
3. **{n,m}** - Matches between n and m occurrences of the preceding element.

## Using the `re` Module

### Importing the Module

```python
import re
```

### Functions in the `re` Module

1. **re.compile(pattern, flags=0)**: Compiles a regex pattern into a regex object, which can be used for matching.
2. **re.search(pattern, string, flags=0)**: Searches the string for a match to the pattern.
3. **re.match(pattern, string, flags=0)**: Checks for a match only at the beginning of the string.
4. **re.fullmatch(pattern, string, flags=0)**: Succeeds only if the entire string matches the pattern.
5. **re.findall(pattern, string, flags=0)**: Finds all substrings where the pattern matches and returns them as a list.
6. **re.finditer(pattern, string, flags=0)**: Finds all substrings where the pattern matches and returns them as an iterator of match objects.
7. **re.split(pattern, string, maxsplit=0, flags=0)**: Splits the string by occurrences of the pattern.
8. **re.sub(pattern, repl, string, count=0, flags=0)**: Replaces occurrences of the pattern with `repl` in the string.
9. **re.subn(pattern, repl, string, count=0, flags=0)**: Replaces occurrences of the pattern with `repl` in the string and returns a tuple (new_string, number_of_subs_made).

### Example Usage

1. **Compiling a Pattern**

```python
pattern = re.compile(r'\d+')
```

2. **Searching for a Pattern**

```python
match = re.search(r'\d+', 'The number is 42')
if match:
    print(match.group())  # Output: 42
```

3. **Matching a Pattern at the Beginning**

```python
match = re.match(r'\d+', '42 is the answer')
if match:
    print(match.group())  # Output: 42
```

4. **Finding All Matches**

```python
matches = re.findall(r'\d+', '12 drummers drumming, 11 pipers piping')
print(matches)  # Output: ['12', '11']
```

5. **Iterating Over Matches**

```python
for match in re.finditer(r'\d+', '12 drummers drumming, 11 pipers piping'):
    print(match.group())  # Output: 12, then 11
```

6. **Splitting a String**

```python
parts = re.split(r'\d+', 'one1two2three3four')
print(parts)  # Output: ['one', 'two', 'three', 'four']
```

7. **Substituting Patterns**

```python
new_string = re.sub(r'\d+', '#', '12 drummers drumming, 11 pipers piping')
print(new_string)  # Output: '# drummers drumming, # pipers piping'
```

8. **Substituting Patterns with Count**

```python
new_string, num_subs = re.subn(r'\d+', '#', '12 drummers drumming, 11 pipers piping')
print(new_string)  # Output: '# drummers drumming, # pipers piping'
print(num_subs)    # Output: 2
```
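`re.fullmatch()` appears in the function list but not in the examples above. The following sketch compares it with `re.match()` and `re.search()` on the same input; the variable names are illustrative:

```python
import re

text = '42 is the answer'

at_start = re.match(r'\d+', text)          # anchored at the beginning only
anywhere = re.search(r'answer', text)      # scans the whole string
not_at_start = re.match(r'answer', text)   # None: 'answer' is not at position 0
whole = re.fullmatch(r'\d+', text)         # None: the trailing text is not digits
whole_ok = re.fullmatch(r'\d+', '42')      # the entire string is digits

print(at_start.group())  # 42
print(anywhere.group())  # answer
print(not_at_start)      # None
print(whole)             # None
print(whole_ok.group())  # 42
```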

## Advanced Usage

### Flags

Flags modify the behavior of regex functions. Common flags include:

1. **re.IGNORECASE (re.I)** - Makes the pattern case-insensitive.
2. **re.MULTILINE (re.M)** - '^' and '$' match the start and end of each line.
3. **re.DOTALL (re.S)** - '.' matches any character, including a newline.
4. **re.VERBOSE (re.X)** - Allows for more readable regex by ignoring whitespace and comments within the pattern.

### Example with Flags

```python
pattern = re.compile(r'^hello', re.I | re.M)
matches = re.findall(pattern, 'Hello\nhello\nHello')
print(matches)  # Output: ['Hello', 'hello', 'Hello']
```
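The other two flags in the list can be sketched in the same way. The phone-number layout and sample strings below are assumptions made for illustration:

```python
import re

# re.VERBOSE: whitespace and '#' comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # area code
    -         # separator
    (\d{3})   # prefix
    -         # separator
    (\d{4})   # line number
""", re.VERBOSE)
parts = phone.search("Call 555-867-5309 now").groups()

# re.DOTALL: '.' also matches a newline.
without_dotall = re.findall(r'a.b', 'a\nb')
with_dotall = re.findall(r'a.b', 'a\nb', re.DOTALL)

print(parts)           # ('555', '867', '5309')
print(without_dotall)  # []
print(with_dotall)     # ['a\nb']
```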

### Named Groups

Named groups allow for more readable code when working with multiple groups.

```python
pattern = re.compile(r'(?P<first>\w+)\s(?P<last>\w+)')
match = pattern.search('John Doe')
if match:
    print(match.group('first'))  # Output: John
    print(match.group('last'))   # Output: Doe
```

### Non-capturing Groups

Non-capturing groups are used when you need to group part of the regex but do not want to capture the matched text.

```python
pattern = re.compile(r'(?:\d{3})-(\d{2})-(\d{4})')
match = pattern.search('123-45-6789')
if match:
    print(match.group(1))  # Output: 45
```

### Lookahead and Lookbehind

Lookahead and lookbehind assertions are used to assert that a certain pattern is followed or preceded by another pattern.

1. **Positive Lookahead (?=...)**

```python
pattern = re.compile(r'\d+(?= dollars)')
match = pattern.search('100 dollars')
if match:
    print(match.group())  # Output: 100
```

2. **Negative Lookahead (?!...)**

```python
pattern = re.compile(r'\d+(?! dollars)')
match = pattern.search('100 euros')
if match:
    print(match.group())  # Output: 100
```

3. **Positive Lookbehind (?<=...)**

```python
pattern = re.compile(r'(?<=\$)\d+')
match = pattern.search('$100')
if match:
    print(match.group())  # Output: 100
```

4. **Negative Lookbehind (?<!...)**

```python
pattern = re.compile(r'(?<!\$)\d+')
match = pattern.search('100 dollars')
if match:
    print(match.group())  # Output: 100
```
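Negative lookaheads interact with backtracking in a way that can surprise. In the sketch below, the first pattern still finds a match in `'100 dollars'` because `\d+` gives back a digit until the lookahead passes; adding `\d` to the lookahead is one possible way to prevent that:

```python
import re

naive = re.compile(r'\d+(?! dollars)')
# Also forbid a following digit, so \d+ cannot stop in the middle of a number.
strict = re.compile(r'\d+(?!\d| dollars)')

naive_hit = naive.search('100 dollars')    # backtracks and matches '10'
strict_hit = strict.search('100 dollars')  # no match
strict_ok = strict.search('100 euros')

print(naive_hit.group())  # 10
print(strict_hit)         # None
print(strict_ok.group())  # 100
```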

## Practical Examples

### Validating an Email Address

```python
pattern = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')
email = 'example@example.com'
if pattern.match(email):
    print('Valid email')
else:
    print('Invalid email')
```

### Extracting URLs

```python
pattern = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')
text = 'Visit https://www.example.com and http://example.org'
urls = pattern.findall(text)
print(urls)  # Output: ['https://www.example.com', 'http://example.org']
```

### Tokenizing a Sentence

```python
pattern = re.compile(r'\b\w+\b')
sentence = 'This is a sample sentence.'
tokens = pattern.findall(sentence)
print(tokens)  # Output: ['This', 'is', 'a', 'sample', 'sentence']
```

### Replacing Profanity

```python
pattern = re.compile(r'\b(badword1|badword2|badword3)\b', re.IGNORECASE)
text = 'This is a badword1 and BADWORD2 test.'
censored_text = pattern.sub('****', text)
print(censored_text)  # Output: This is a **** and **** test.
```

## Conclusion

Regular expressions are a powerful tool for text processing and manipulation in Python. The `re` module provides a wide range of functions and capabilities to work with regex effectively. Understanding and mastering regex can greatly enhance your ability to handle complex string operations.

---



# When Could You Use Regular Expressions in Python

## Introduction

Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. They can be used in a wide variety of situations where you need to search, match, or manipulate strings. This guide explores various scenarios and examples where regex can be particularly useful.

## Scenarios for Using Regular Expressions

### 1. Validating Input

Regular expressions are ideal for validating input data to ensure it conforms to expected patterns. Common use cases include:

#### a. Validating Email Addresses

```python
import re

def is_valid_email(email):
    pattern = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')
    return bool(pattern.match(email))

email = 'example@example.com'
print(is_valid_email(email))  # Output: True
```

#### b. Validating Phone Numbers

```python
def is_valid_phone(phone):
    pattern = re.compile(r'^\+?\d{1,4}?\s?-?\(?\d{1,4}?\)?\s?-?\d{1,4}\s?-?\d{1,4}\s?-?\d{1,9}$')
    return bool(pattern.match(phone))

phone = '+1 (123) 456-7890'
print(is_valid_phone(phone))  # Output: True
```

#### c. Validating URLs

```python
def is_valid_url(url):
    pattern = re.compile(r'^(https?://)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$')
    return bool(pattern.match(url))

url = 'https://www.example.com'
print(is_valid_url(url))  # Output: True
```

### 2. Searching and Extracting Data

Regex can be used to search for specific patterns within strings and extract matching data. Examples include:

#### a. Extracting Dates

```python
text = 'The event is on 2024-07-08 and the deadline is 2024-08-15.'
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
dates = pattern.findall(text)
print(dates)  # Output: ['2024-07-08', '2024-08-15']
```
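The pattern above checks only the shape of a date, so a string such as `2024-13-45` would also be extracted. A possible follow-up step, sketched here with the standard `datetime` module and made-up sample text, is to keep only candidates that parse as real calendar dates:

```python
import re
from datetime import datetime

text = 'Valid: 2024-07-08. Invalid: 2024-13-45.'
candidates = re.findall(r'\d{4}-\d{2}-\d{2}', text)

valid_dates = []
for candidate in candidates:
    try:
        # strptime raises ValueError for impossible months or days.
        datetime.strptime(candidate, '%Y-%m-%d')
    except ValueError:
        continue
    valid_dates.append(candidate)

print(candidates)   # ['2024-07-08', '2024-13-45']
print(valid_dates)  # ['2024-07-08']
```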

#### b. Extracting Hashtags from Tweets

```python
tweet = 'Loving the weather! #sunny #summer #vacation'
pattern = re.compile(r'#\w+')
hashtags = pattern.findall(tweet)
print(hashtags)  # Output: ['#sunny', '#summer', '#vacation']
```

#### c. Extracting Prices

```python
text = 'The items cost $10, $15, and $20 each.'
pattern = re.compile(r'\$\d+')
prices = pattern.findall(text)
print(prices)  # Output: ['$10', '$15', '$20']
```

### 3. Replacing or Modifying Text

Regex is useful for replacing or modifying parts of a string that match a certain pattern. Examples include:

#### a. Censoring Profanity

```python
text = 'This is a badword and anotherbadword test.'
pattern = re.compile(r'\b(badword|anotherbadword)\b', re.IGNORECASE)
censored_text = pattern.sub('****', text)
print(censored_text)  # Output: 'This is a **** and **** test.'
```

#### b. Formatting Phone Numbers

```python
text = 'Call me at 1234567890 or 0987654321.'
pattern = re.compile(r'(\d{3})(\d{3})(\d{4})')
formatted_text = pattern.sub(r'(\1) \2-\3', text)
print(formatted_text)  # Output: 'Call me at (123) 456-7890 or (098) 765-4321.'
```

#### c. Reformatting Dates

```python
text = 'The dates are 2024-07-08 and 2024-08-15.'
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
reformatted_text = pattern.sub(r'\2/\3/\1', text)
print(reformatted_text)  # Output: 'The dates are 07/08/2024 and 08/15/2024.'
```

### 4. Splitting Strings

Regex can be used to split strings based on complex patterns. Examples include:

#### a. Splitting by Multiple Delimiters

```python
text = 'apple, orange; banana|grape'
pattern = re.compile(r'[,\;\|]')
fruits = pattern.split(text)
print(fruits)  # Output: ['apple', ' orange', ' banana', 'grape']
```
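The split above leaves leading spaces on some items. One way to avoid a separate `strip()` pass is to let the pattern absorb whitespace around each delimiter, as in this sketch:

```python
import re

text = 'apple, orange; banana|grape'
# \s* on both sides swallows any spaces next to the delimiter.
fruits = re.split(r'\s*[,;|]\s*', text)

print(fruits)  # ['apple', 'orange', 'banana', 'grape']
```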

#### b. Splitting CamelCase Strings

```python
text = 'CamelCaseStringExample'
pattern = re.compile(r'(?<!^)(?=[A-Z])')
words = pattern.split(text)
print(words)  # Output: ['Camel', 'Case', 'String', 'Example']
# Requires Python 3.7+, where re.split() can split on empty (zero-width) matches.
```

### 5. Advanced String Manipulation

Regex is useful for advanced string manipulation tasks such as:

#### a. Parsing Logs

```python
log = 'ERROR 2024-07-08 12:34:56 - Something bad happened.'
pattern = re.compile(r'(?P<level>\w+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+(?P<message>.+)')
match = pattern.match(log)
if match:
    print(match.group('level'))    # Output: ERROR
    print(match.group('date'))     # Output: 2024-07-08
    print(match.group('time'))     # Output: 12:34:56
    print(match.group('message'))  # Output: Something bad happened.
```

#### b. Cleaning and Normalizing Text

```python
text = 'This!!! is   an   example... text, with lots  of: noise.'
pattern = re.compile(r'[^\w\s]')
cleaned_text = pattern.sub('', text)
normalized_text = re.sub(r'\s+', ' ', cleaned_text).strip()
print(normalized_text)  # Output: 'This is an example text with lots of noise'
```

### 6. Conditional Text Processing

Regex allows for conditional text processing based on the presence of specific patterns.

#### a. Conditional Replacements

```python
text = 'The color is red. The colour is blue.'
pattern = re.compile(r'colou?r')
conditional_text = pattern.sub(lambda x: 'color' if x.group() == 'colour' else x.group(), text)
print(conditional_text)  # Output: 'The color is red. The color is blue.'
```

#### b. Conditional Text Extraction

```python
text = 'Price: $50, Quantity: 20, Discount: 5%'
pattern = re.compile(r'(?P<key>\w+):\s(?P<value>\$\d+|\d+%|\d+)')
for match in pattern.finditer(text):
    key = match.group('key')
    value = match.group('value')
    print(f'{key}: {value}')
# Output:
# Price: $50
# Quantity: 20
# Discount: 5%
```

### 7. Handling Complex Text Formats

Regex can simplify handling and parsing of complex text formats such as:

#### a. Parsing CSV Files with Embedded Commas

```python
text = 'name,age,location\n"John Doe",30,"New York, USA"\n"Jane Smith",25,"Los Angeles, USA"'
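# Split on a comma only when an even number of quotes follows it (the comma is outside a quoted field)
pattern = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')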
lines = text.split('\n')
for line in lines:
    fields = pattern.split(line)
    print(fields)
# Output:
# ['name', 'age', 'location']
# ['"John Doe"', '30', '"New York, USA"']
# ['"Jane Smith"', '25', '"Los Angeles, USA"']
```
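For real CSV data, the standard library's `csv` module is usually a safer choice than a regex, since it handles quoting rules directly. A minimal sketch on the same input (note that it also removes the surrounding quotes):

```python
import csv
import io

text = 'name,age,location\n"John Doe",30,"New York, USA"\n"Jane Smith",25,"Los Angeles, USA"'

# csv.reader accepts any iterable of lines; StringIO wraps the string as a file.
rows = list(csv.reader(io.StringIO(text)))

for row in rows:
    print(row)
# ['name', 'age', 'location']
# ['John Doe', '30', 'New York, USA']
# ['Jane Smith', '25', 'Los Angeles, USA']
```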

#### b. Parsing Nested Data Structures

```python
text = '{"name": "John", "details": {"age": 30, "location": "New York"}}'
pattern = re.compile(r'\"(\w+)\":\s*\"?(\w+|\d+|{.+?})\"?')
matches = pattern.findall(text)
for match in matches:
    print(match)
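# Output:
# ('name', 'John')
# ('details', '{"age": 30, "location": "New York"}')
# The nested "age" and "location" keys are not reported separately:
# findall() returns non-overlapping matches, and the "details" match already consumed them.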
```
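Regular expressions cannot track arbitrary nesting, so for JSON-like text a parser is generally more reliable. A short sketch using the standard `json` module on the same string:

```python
import json

text = '{"name": "John", "details": {"age": 30, "location": "New York"}}'
data = json.loads(text)

# Nested values are reached with ordinary dictionary access.
name = data['name']
age = data['details']['age']
location = data['details']['location']

print(name, age, location)  # John 30 New York
```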

## Conclusion

Regular expressions are a versatile tool that can be applied in many scenarios involving text processing and manipulation. From validating input and extracting data to replacing text and handling complex text formats, regex can significantly simplify and enhance your ability to work with strings in Python. Mastering regular expressions can greatly improve your efficiency and effectiveness in handling a wide range of text-related tasks.

---



# Capturing Groups in Python's Regular Expressions

## Introduction

Capturing groups are a fundamental feature of regular expressions that allow you to isolate and extract specific parts of a string that match a given pattern. In Python, capturing groups are created using parentheses `()` in the pattern.

## Basics of Capturing Groups

### Creating a Capturing Group

To create a capturing group, enclose the part of the pattern you want to capture in parentheses.

```python
import re

pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
match = pattern.search('2024-07-08')
if match:
    print(match.group(1))  # Output: 2024
    print(match.group(2))  # Output: 07
    print(match.group(3))  # Output: 08
```

### Accessing Captured Groups

- **group()**: Returns the entire match or a specific captured group by index.
- **groups()**: Returns a tuple containing all the captured groups.

```python
match = pattern.search('2024-07-08')
if match:
    print(match.group())    # Output: 2024-07-08
    print(match.groups())   # Output: ('2024', '07', '08')
```

### Named Capturing Groups

Named capturing groups allow you to assign names to groups, making your patterns more readable and easier to manage.

```python
pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = pattern.search('2024-07-08')
if match:
    print(match.group('year'))   # Output: 2024
    print(match.group('month'))  # Output: 07
    print(match.group('day'))    # Output: 08
```
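When every named group is needed at once, `Match.groupdict()` returns them as a dictionary. A brief sketch that repeats the pattern so it runs on its own:

```python
import re

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = pattern.search('2024-07-08')

# Keys are the group names, values are the matched substrings.
fields = match.groupdict()
print(fields)  # {'year': '2024', 'month': '07', 'day': '08'}
```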

## Using Capturing Groups in Different Functions

### re.match() and re.search()

Both `re.match()` and `re.search()` return match objects from which you can extract captured groups.

```python
pattern = re.compile(r'(\w+)\s(\w+)')
match = pattern.search('Hello World')
if match:
    print(match.group(1))  # Output: Hello
    print(match.group(2))  # Output: World
```

### re.findall()

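The return value of `re.findall()` depends on the number of capturing groups: with no groups it returns a list of full matches, with one group a list of that group's text, and with two or more groups a list of tuples containing all captured groups for each match.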

```python
pattern = re.compile(r'(\d{2})/(\d{2})/(\d{4})')
matches = pattern.findall('Dates: 08/07/2024, 15/08/2024')
print(matches)  # Output: [('08', '07', '2024'), ('15', '08', '2024')]
```
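The shape of the result changes with the number of groups in the pattern. The following sketch runs three variants of one pattern over the same made-up text:

```python
import re

text = '08/07 15/08'

no_groups = re.findall(r'\d{2}/\d{2}', text)       # full matches
one_group = re.findall(r'(\d{2})/\d{2}', text)     # only the group's text
two_groups = re.findall(r'(\d{2})/(\d{2})', text)  # one tuple per match

print(no_groups)   # ['08/07', '15/08']
print(one_group)   # ['08', '15']
print(two_groups)  # [('08', '07'), ('15', '08')]
```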

### re.finditer()

`re.finditer()` returns an iterator yielding match objects for all matches.

```python
pattern = re.compile(r'(\d{2})/(\d{2})/(\d{4})')
for match in pattern.finditer('Dates: 08/07/2024, 15/08/2024'):
    print(match.groups())
# Output:
# ('08', '07', '2024')
# ('15', '08', '2024')
```

### re.split()

When the pattern contains capturing groups, `re.split()` also returns the text matched by those groups as part of the resulting list.

```python
pattern = re.compile(r'(\d{2})')
parts = pattern.split('1234567890')
print(parts)  # Output: ['', '12', '', '34', '', '56', '', '78', '', '90', '']
```
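The effect of the capturing group is easier to see with a visible delimiter. In this sketch the only difference between the two calls is the pair of parentheses:

```python
import re

text = 'a-b-c'

without_group = re.split(r'-', text)   # delimiters are dropped
with_group = re.split(r'(-)', text)    # delimiters are kept in the list

print(without_group)  # ['a', 'b', 'c']
print(with_group)     # ['a', '-', 'b', '-', 'c']
```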

### re.sub() and re.subn()

`re.sub()` and `re.subn()` allow you to use captured groups in the replacement string.

```python
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
new_date = pattern.sub(r'\2/\3/\1', '2024-07-08')
print(new_date)  # Output: 07/08/2024
```
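Two related forms are worth knowing: named groups can be referenced in the replacement with `\g<name>`, and the replacement can be a function that receives each match object. A sketch with illustrative sample text:

```python
import re

# Named groups are referenced as \g<name> in the replacement string.
date_pattern = re.compile(r'(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})')
us_date = date_pattern.sub(r'\g<m>/\g<d>/\g<y>', '2024-07-08')

# A function replacement computes the new text from each match.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, '3 apples and 10 pears')

print(us_date)  # 07/08/2024
print(doubled)  # 6 apples and 20 pears
```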

## Advanced Capturing Group Techniques

### Nested Groups

Groups can be nested, and the numbering is determined by the opening parenthesis from left to right.

```python
pattern = re.compile(r'((\d{4})-(\d{2})-(\d{2}))')
match = pattern.search('2024-07-08')
if match:
    print(match.group(1))  # Output: 2024-07-08
    print(match.group(2))  # Output: 2024
    print(match.group(3))  # Output: 07
    print(match.group(4))  # Output: 08
```

### Non-Capturing Groups

Non-capturing groups allow you to group parts of the pattern without capturing them. This is useful for applying quantifiers or alternations without affecting the captured groups.

```python
pattern = re.compile(r'(?:\d{4})-(\d{2})-(\d{2})')
match = pattern.search('2024-07-08')
if match:
    print(match.group(1))  # Output: 07
    print(match.group(2))  # Output: 08
```

### Lookahead and Lookbehind Assertions

Lookahead and lookbehind assertions allow you to match a pattern only if it is followed or preceded by another pattern. They do not consume any characters; they only assert a condition at the current position.

#### Positive Lookahead

```python
pattern = re.compile(r'\d+(?= dollars)')
match = pattern.search('100 dollars')
if match:
    print(match.group())  # Output: 100
```

#### Negative Lookahead

```python
pattern = re.compile(r'\d+(?! dollars)')
match = pattern.search('100 euros')
if match:
    print(match.group())  # Output: 100
```

#### Positive Lookbehind

```python
pattern = re.compile(r'(?<=\$)\d+')
match = pattern.search('$100')
if match:
    print(match.group())  # Output: 100
```

#### Negative Lookbehind

```python
pattern = re.compile(r'(?<!\$)\d+')
match = pattern.search('100 dollars')
if match:
    print(match.group())  # Output: 100
```

### Backreferences

Backreferences allow you to reuse a captured group within the same regex pattern. This is useful for matching repeated substrings.

```python
pattern = re.compile(r'(\b\w+\b) \1')
match = pattern.search('hello hello')
if match:
    print(match.group())  # Output: hello hello
```

### Named Backreferences

Named backreferences allow you to refer to a named capturing group within the same pattern.

```python
pattern = re.compile(r'(?P<word>\b\w+\b) (?P=word)')
match = pattern.search('hello hello')
if match:
    print(match.group())  # Output: hello hello
```

## Practical Examples

### Parsing Logs with Capturing Groups

```python
log_entry = 'ERROR 2024-07-08 12:34:56 - Something bad happened.'
pattern = re.compile(r'(?P<level>\w+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+(?P<message>.+)')
match = pattern.search(log_entry)
if match:
    print(match.group('level'))    # Output: ERROR
    print(match.group('date'))     # Output: 2024-07-08
    print(match.group('time'))     # Output: 12:34:56
    print(match.group('message'))  # Output: Something bad happened.
```

### Extracting Data with Multiple Capturing Groups

```python
text = 'John Smith, 30 years old, john@example.com'
pattern = re.compile(r'(?P<name>[A-Za-z ]+),\s+(?P<age>\d+)\s+years old,\s+(?P<email>\S+)')
match = pattern.search(text)
if match:
    print(match.group('name'))  # Output: John Smith
    print(match.group('age'))   # Output: 30
    print(match.group('email')) # Output: john@example.com
```

### Reformatting Strings with Capturing Groups

```python
text = '2024-07-08'
pattern = re.compile(r'(\d{4})-(\d{2})-(\d{2})')
new_text = pattern.sub(r'\2/\3/\1', text)
print(new_text)  # Output: 07/08/2024
```

## Conclusion

Capturing groups in regular expressions provide a powerful way to isolate, extract, and manipulate specific parts of a string. By using parentheses to define groups, you can access matched substrings, perform substitutions, and apply complex text processing techniques. Understanding and utilizing capturing groups can greatly enhance your ability to work with strings and patterns in Python.

---



# More on Repetition Qualifiers in Python's Regular Expressions

## Introduction

Repetition qualifiers in regular expressions allow you to specify how many times a particular part of a pattern can or must occur. Understanding these qualifiers is crucial for crafting effective and precise regex patterns. This guide covers the different types of repetition qualifiers and their usage in Python's `re` module.

## Basic Repetition Qualifiers

### Zero or More (`*`)

The asterisk `*` matches zero or more occurrences of the preceding element.

```python
import re

pattern = re.compile(r'ab*c')
matches = pattern.findall('ac abc abbc')
print(matches)  # Output: ['ac', 'abc', 'abbc']
```

### One or More (`+`)

The plus `+` matches one or more occurrences of the preceding element.

```python
pattern = re.compile(r'ab+c')
matches = pattern.findall('ac abc abbc')
print(matches)  # Output: ['abc', 'abbc']
```

### Zero or One (`?`)

The question mark `?` matches zero or one occurrence of the preceding element.

```python
pattern = re.compile(r'ab?c')
matches = pattern.findall('ac abc abbc')
print(matches)  # Output: ['ac', 'abc']
```

## Specifying Exact Numbers and Ranges

### Exact Number (`{n}`)

Matches exactly `n` occurrences of the preceding element.

```python
pattern = re.compile(r'a{3}')
matches = pattern.findall('aaa aa aaaa a')
print(matches)  # Output: ['aaa', 'aaa']
```

### Range (`{n,m}`)

Matches between `n` and `m` occurrences of the preceding element.

```python
pattern = re.compile(r'a{2,4}')
matches = pattern.findall('a aa aaa aaaa aaaaa')
print(matches)  # Output: ['aa', 'aaa', 'aaaa', 'aaaa']
```

### At Least (`{n,}`)

Matches at least `n` occurrences of the preceding element.

```python
pattern = re.compile(r'a{2,}')
matches = pattern.findall('a aa aaa aaaa aaaaa')
print(matches)  # Output: ['aa', 'aaa', 'aaaa', 'aaaaa']
```

### At Most (`{,m}`)

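Matches at most `m` occurrences of the preceding element; `{,m}` is shorthand for `{0,m}` (this form is rarely used). Because zero occurrences are allowed, the pattern can also match the empty string, so the example filters empty matches out.

```python
pattern = re.compile(r'a{,3}')
# findall() also returns empty matches; keep only the non-empty ones
matches = [m for m in pattern.findall('a aa aaa aaaa aaaaa') if m]
print(matches)  # Output: ['a', 'aa', 'aaa', 'aaa', 'a', 'aaa', 'aa']
```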

## Greedy vs. Non-Greedy (Lazy) Matching

### Greedy Matching

By default, repetition qualifiers are greedy, meaning they match as many occurrences as possible.

```python
pattern = re.compile(r'a.*b')
match = pattern.search('aabab')
print(match.group())  # Output: 'aabab'
```

### Non-Greedy (Lazy) Matching

Non-greedy (or lazy) matching matches as few occurrences as possible. Add a question mark `?` after the repetition qualifier to make it non-greedy.

```python
pattern = re.compile(r'a.*?b')
match = pattern.search('aabab')
print(match.group())  # Output: 'aab'
```
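The `?` suffix works on every repetition qualifier, including counted ranges. A short sketch with made-up inputs:

```python
import re

# Counted range: greedy takes up to four digits, lazy stops at two.
greedy_range = re.search(r'\d{2,4}', '12345').group()
lazy_range = re.search(r'\d{2,4}?', '12345').group()

# With '+': greedy spans both tags, lazy stops at the first '>'.
greedy_tags = re.findall(r'<.+>', '<b>bold</b>')
lazy_tags = re.findall(r'<.+?>', '<b>bold</b>')

print(greedy_range)  # 1234
print(lazy_range)    # 12
print(greedy_tags)   # ['<b>bold</b>']
print(lazy_tags)     # ['<b>', '</b>']
```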

## Practical Examples

### Matching Phone Numbers

```python
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
matches = pattern.findall('Contact: 123-456-7890 or 987-654-3210')
print(matches)  # Output: ['123-456-7890', '987-654-3210']
```

### Validating Passwords

A password with at least 8 characters, including one uppercase letter, one lowercase letter, one digit, and one special character.

```python
pattern = re.compile(r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$')
password = 'Password1!'
print(bool(pattern.match(password)))  # Output: True
```

### Extracting HTML Tags

```python
html = '<div><p>Hello, World!</p></div>'
pattern = re.compile(r'<.*?>')
tags = pattern.findall(html)
print(tags)  # Output: ['<div>', '<p>', '</p>', '</div>']
```

### Parsing Repeated Patterns

```python
text = 'word1 word2 word3'
pattern = re.compile(r'\b\w+\b')
words = pattern.findall(text)
print(words)  # Output: ['word1', 'word2', 'word3']
```

## Combining Repetition Qualifiers with Other Constructs

### Using Groups with Repetition Qualifiers

```python
pattern = re.compile(r'(ab){2,3}')
matches = pattern.findall('abab ababab ab')
print(matches)  # Output: ['ab', 'ab']
```
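With a capturing group, `findall()` returns only the group's last repetition rather than the full match. If the whole repeated run is wanted, a non-capturing group is one option, as this sketch shows:

```python
import re

text = 'abab ababab ab'

# (?:...) groups for repetition without capturing, so findall()
# returns the full text of each match.
runs = re.findall(r'(?:ab){2,3}', text)

print(runs)  # ['abab', 'ababab']
```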

### Using Character Classes with Repetition Qualifiers

```python
pattern = re.compile(r'[A-Za-z]{2,4}')
matches = pattern.findall('a ab abc abcd abcde')
print(matches)  # Output: ['ab', 'abc', 'abcd', 'abcd']  (the last 'abcd' comes from 'abcde')
```

### Using Assertions with Repetition Qualifiers

```python
pattern = re.compile(r'(?<=\d{3})-')
matches = pattern.findall('123-456-789')
print(matches)  # Output: ['-', '-']  (both hyphens are preceded by three digits)
```
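One restriction applies when mixing lookbehinds with repetition: Python's `re` module requires a lookbehind to match a fixed number of characters, so `{3}` is accepted but `+` or `*` is not. A sketch that checks both cases:

```python
import re

# Fixed-width lookbehind: compiles without error.
fixed = re.compile(r'(?<=\d{3})-')

# Variable-width lookbehind: rejected by the re module at compile time.
try:
    re.compile(r'(?<=\d+)-')
    variable_ok = True
except re.error:
    variable_ok = False

print(fixed.findall('123-456-789'))  # ['-', '-']
print(variable_ok)                   # False
```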

## Advanced Examples

### Validating IP Addresses

```python
pattern = re.compile(r'^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')
ip = '192.168.1.1'
print(bool(pattern.match(ip)))  # Output: True
```

### Matching Dates in Various Formats

```python
pattern = re.compile(r'\b(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})\b')
matches = pattern.findall('Dates: 12/31/2024, 1-1-24')
print(matches)  # Output: [('12', '31', '2024'), ('1', '1', '24')]
```

### Extracting Repeated Patterns with Overlapping Matches

```python
pattern = re.compile(r'(?=(ab))')
matches = [match.group(1) for match in pattern.finditer('ababab')]
print(matches)  # Output: ['ab', 'ab', 'ab']
```
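The benefit of the lookahead form shows up when the occurrences actually overlap. In this sketch, an ordinary search finds one `aba` in `ababa`, while the lookahead version finds both:

```python
import re

text = 'ababa'

# Ordinary matching consumes characters, so the second 'aba' is missed.
plain = re.findall(r'aba', text)

# A capturing group inside a lookahead matches without consuming,
# so the search can restart one character later.
overlapping = [m.group(1) for m in re.finditer(r'(?=(aba))', text)]

print(plain)        # ['aba']
print(overlapping)  # ['aba', 'aba']
```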

## Conclusion

Repetition qualifiers in Python's regular expressions provide powerful mechanisms to match patterns that occur multiple times. By understanding and utilizing these qualifiers, you can create flexible and efficient regex patterns to handle a wide variety of text processing tasks. Whether you need to match simple repeated sequences or complex patterns with specific repetition constraints, mastering repetition qualifiers is essential for effective regex usage in Python.

---

# Extracting a PID Using Regexes in Python

## Introduction

A Process ID (PID) is a unique identifier assigned to each process running on an operating system. In various scenarios, especially in system administration and monitoring, you might need to extract PIDs from text data, such as logs, command outputs, or configuration files. Regular expressions (regexes) are powerful tools for pattern matching and can be effectively used to extract PIDs from such text data.

## Understanding PIDs

PIDs are typically numeric values. The format and range of PIDs can vary depending on the operating system, but they are usually positive integers. For simplicity, we will consider PIDs as sequences of digits.

## Basic Regex for PIDs

A basic regular expression to match a PID would be a sequence of digits. The regex pattern `\d+` matches one or more digits, which is suitable for matching PIDs.

```python
import re

pattern = re.compile(r'\d+')
```

## Extracting PIDs from Text

### Example Text

Let's consider an example text that contains multiple PIDs:

```text
User john started process with PID 12345
System process PID 6789 terminated unexpectedly
Current running processes: 1001, 1002, 1003
```

### Writing the Regex

To extract PIDs, we can use the regex pattern `\b\d+\b`, which matches whole numbers only: the word boundaries `\b` prevent matches inside longer alphanumeric tokens.

```python
pattern = re.compile(r'\b\d+\b')
```

### Extracting PIDs

Using `re.findall()` to extract all PIDs from the text:

```python
text = '''
User john started process with PID 12345
System process PID 6789 terminated unexpectedly
Current running processes: 1001, 1002, 1003
'''

pattern = re.compile(r'\b\d+\b')
pids = pattern.findall(text)
print(pids)  # Output: ['12345', '6789', '1001', '1002', '1003']
```

### Extracting PIDs with Context

If you want to ensure that the numbers you are extracting are indeed PIDs, you might want to include some context in your regex. For example, looking for the word "PID" followed by a number.

```python
pattern = re.compile(r'\bPID\s+(\d+)\b')
matches = pattern.findall(text)
print(matches)  # Output: ['12345', '6789']
```

## Handling Various Text Formats

### Extracting PIDs from Command Output

Consider the output of a command like `ps aux` which lists running processes:

```text
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  22520  1020 ?        Ss   Jun17   0:00 /sbin/init
root       198  0.0  0.0  47432  3876 ?        Ss   Jun17   0:00 /lib/systemd/systemd-journald
```

To extract PIDs from this output:

```python
ps_output = '''
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1  22520  1020 ?        Ss   Jun17   0:00 /sbin/init
root       198  0.0  0.0  47432  3876 ?        Ss   Jun17   0:00 /lib/systemd/systemd-journald
'''

pattern = re.compile(r'^\w+\s+(\d+)', re.MULTILINE)
pids = pattern.findall(ps_output)
print(pids)  # Output: ['1', '198']
```

### Extracting PIDs from Logs

Consider a log file with entries like:

```text
[2024-07-08 12:34:56] INFO: Process started with PID 4321
[2024-07-08 12:35:00] ERROR: Process 4321 crashed
```

To extract PIDs:

```python
log_text = '''
[2024-07-08 12:34:56] INFO: Process started with PID 4321
[2024-07-08 12:35:00] ERROR: Process 4321 crashed
'''

pattern = re.compile(r'\bPID\s+(\d+)\b')
pids = pattern.findall(log_text)
print(pids)  # Output: ['4321']
```

### Extracting PIDs with Variations

If the PIDs in your text can appear in various formats or contexts, you might need a more flexible regex. For instance:

```text
Process ID: 12345
PID=6789
Started process (PID: 1001)
```

A more flexible pattern:

```python
varied_text = '''
Process ID: 12345
PID=6789
Started process (PID: 1001)
'''

pattern = re.compile(r'(?:PID|Process ID)[:=]?\s*(\d+)')
pids = pattern.findall(varied_text)
print(pids)  # Output: ['12345', '6789', '1001']
```

## Advanced Techniques

### Using Named Groups

For better readability, you can use named groups in your regex pattern.

```python
pattern = re.compile(r'\bPID\s+(?P<pid>\d+)\b')
matches = pattern.finditer(text)
for match in matches:
    print(match.group('pid'))  # Prints 12345, then 6789 (one per line)
```

### Extracting PIDs and Other Information

If you need to extract PIDs along with other related information, you can use multiple capturing groups.

```python
log_text = '''
[2024-07-08 12:34:56] INFO: Process started with PID 4321 by user root
[2024-07-08 12:35:00] ERROR: Process 4321 crashed by user root
'''

pattern = re.compile(r'\[(?P<datetime>[^\]]+)\] (?P<level>\w+): Process started with PID (?P<pid>\d+) by user (?P<user>\w+)')
matches = pattern.finditer(log_text)
for match in matches:
    print(f"PID: {match.group('pid')}, User: {match.group('user')}, DateTime: {match.group('datetime')}, Level: {match.group('level')}")
# Output:
# PID: 4321, User: root, DateTime: 2024-07-08 12:34:56, Level: INFO
```
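When every group is named, `match.groupdict()` returns all captured fields at once. The sketch below reuses the pattern above on a single log line; the `record` variable and the conversion of the PID to `int` are illustrative choices, not part of the original example.

```python
import re

line = '[2024-07-08 12:34:56] INFO: Process started with PID 4321 by user root'
pattern = re.compile(
    r'\[(?P<datetime>[^\]]+)\] (?P<level>\w+): '
    r'Process started with PID (?P<pid>\d+) by user (?P<user>\w+)'
)

match = pattern.search(line)
# groupdict() maps each group name to the text it captured
record = match.groupdict() if match else {}
if record:
    record['pid'] = int(record['pid'])  # PIDs are integers, so convert for later use

print(record)
```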

### Handling Overlapping Matches

A capturing group inside a lookahead reports a candidate at every position where the pattern can start, so the results include the overlapping tails of each number.

```python
overlapping_text = '12345 123 12345'
pattern = re.compile(r'(?=(\d{3,5}))')
matches = [match.group(1) for match in pattern.finditer(overlapping_text)]
print(matches)  # Output: ['12345', '2345', '345', '123', '12345', '2345', '345']
```

## Conclusion

Extracting PIDs using regular expressions in Python is a powerful technique for parsing text data. By understanding how to construct and apply regex patterns, you can effectively extract PIDs from various text formats, ensuring accurate and efficient text processing. Whether you are dealing with log files, command outputs, or other sources, mastering regexes will enhance your ability to handle text data in Python.

---

# Splitting and Replacing Using Python's Regular Expressions

## Introduction

Splitting and replacing are two common operations in text processing. Regular expressions (regexes) provide powerful tools for performing these operations with high flexibility and precision. Python's `re` module offers methods like `re.split()` and `re.sub()` (and `re.subn()`) to split strings and replace substrings based on regex patterns.

## Splitting Strings with Regular Expressions

### Basic Splitting with `re.split()`

The `re.split()` function splits a string by the occurrences of the regex pattern. It returns a list of substrings.

```python
import re

pattern = re.compile(r'\s+')
text = "This is a sample text."
result = pattern.split(text)
print(result)  # Output: ['This', 'is', 'a', 'sample', 'text.']
```

### Splitting by Multiple Delimiters

You can split a string by multiple delimiters by using a regex pattern that includes all desired delimiters.

```python
pattern = re.compile(r'[,\s]+')
text = "apple, orange, banana,grape"
result = pattern.split(text)
print(result)  # Output: ['apple', 'orange', 'banana', 'grape']
```

### Keeping Delimiters in the Result

If you want to keep the delimiters in the result, you can use capturing groups.

```python
pattern = re.compile(r'(\s+|,)')
text = "apple, orange, banana,grape"
result = pattern.split(text)
print(result)  # Output: ['apple', ',', '', ' ', 'orange', ',', '', ' ', 'banana', ',', 'grape']
```

### Controlling the Maximum Number of Splits

The `maxsplit` parameter limits the number of splits performed.

```python
pattern = re.compile(r'\s+')
text = "This is a sample text."
result = pattern.split(text, maxsplit=2)
print(result)  # Output: ['This', 'is', 'a sample text.']
```

## Replacing Substrings with Regular Expressions

### Basic Replacing with `re.sub()`

The `re.sub()` function replaces occurrences of the regex pattern with the replacement string.

```python
pattern = re.compile(r'\d+')
text = "There are 123 apples and 456 oranges."
result = pattern.sub('many', text)
print(result)  # Output: "There are many apples and many oranges."
```

### Advanced Replacing with `re.sub()`

You can use backreferences in the replacement string to refer to captured groups in the pattern.

```python
pattern = re.compile(r'(\d+) apples and (\d+) oranges')
text = "There are 123 apples and 456 oranges."
result = pattern.sub(r'\2 oranges and \1 apples', text)
print(result)  # Output: "There are 456 oranges and 123 apples."
```

### Using a Function as the Replacement

The `re.sub()` function allows you to use a function to generate the replacement string.

```python
def increment(match):
    return str(int(match.group()) + 1)

pattern = re.compile(r'\d+')
text = "There are 123 apples and 456 oranges."
result = pattern.sub(increment, text)
print(result)  # Output: "There are 124 apples and 457 oranges."
```

### Controlling the Maximum Number of Replacements

The `count` parameter limits the number of replacements performed.

```python
pattern = re.compile(r'\d+')
text = "There are 123 apples and 456 oranges."
result = pattern.sub('many', text, count=1)
print(result)  # Output: "There are many apples and 456 oranges."
```

### Getting the Number of Replacements with `re.subn()`

The `re.subn()` function works like `re.sub()` but returns a tuple containing the new string and the number of replacements made.

```python
pattern = re.compile(r'\d+')
text = "There are 123 apples and 456 oranges."
result, num_replacements = pattern.subn('many', text)
print(result)  # Output: "There are many apples and many oranges."
print(num_replacements)  # Output: 2
```

## Practical Examples

### Splitting CSV Lines

```python
pattern = re.compile(r'\s*,\s*')
text = "apple, orange, banana , grape"
result = pattern.split(text)
print(result)  # Output: ['apple', 'orange', 'banana', 'grape']
```
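A regex split like this treats every comma as a separator, so it breaks on quoted fields that contain commas. For such input, one alternative is the standard-library `csv` module. This is a sketch with a made-up sample line, not a replacement for the regex approach on simple data:

```python
import csv

line = 'apple, "kiwi, gold", banana , grape'

# skipinitialspace=True ignores the space after each comma so the quotes are recognised
row = next(csv.reader([line], skipinitialspace=True))
fields = [field.strip() for field in row]  # strip() removes remaining trailing spaces

print(fields)  # ['apple', 'kiwi, gold', 'banana', 'grape']
```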

### Replacing Dates with a Standard Format

```python
pattern = re.compile(r'(\d{2})/(\d{2})/(\d{4})')
text = "The date is 08/07/2024."
result = pattern.sub(r'\3-\2-\1', text)
print(result)  # Output: "The date is 2024-07-08."
```

### Masking Sensitive Information

```python
pattern = re.compile(r'\b\d{12}(\d{4})\b')
text = "My credit card number is 1234567812345678."
result = pattern.sub(r'****\1', text)
print(result)  # Output: "My credit card number is ****5678."
```

### Parsing and Modifying Logs

```python
log_text = '''
[2024-07-08 12:34:56] INFO: User john logged in
[2024-07-08 12:35:00] ERROR: Failed login attempt
'''

pattern = re.compile(r'\[(.*?)\] (INFO|ERROR): (.*)')
result = pattern.sub(r'\1 - \2 - \3', log_text)
print(result)
# Output (preceded and followed by the blank lines from log_text):
# 2024-07-08 12:34:56 - INFO - User john logged in
# 2024-07-08 12:35:00 - ERROR - Failed login attempt
```

## Using Named Groups in Replacement

### Named Groups for Readable Patterns

```python
pattern = re.compile(r'(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})')
text = "The date is 08/07/2024."
result = pattern.sub(r'\g<year>-\g<month>-\g<day>', text)
print(result)  # Output: "The date is 2024-07-08."
```

## Summary

Regular expressions in Python provide powerful tools for splitting and replacing substrings within text. By understanding and utilizing the functions and capabilities of the `re` module, you can perform complex text manipulations efficiently and effectively. Whether you are parsing data, formatting strings, or cleaning text, regex-based splitting and replacing offer the flexibility needed to handle a wide range of tasks.

---

# Advanced Regular Expressions in Python

Regular expressions (regex) are a powerful tool for pattern matching and text processing. While basic regex patterns handle simple tasks, advanced regex techniques provide the flexibility to handle more complex text manipulations. Python's `re` module offers extensive functionality for advanced regex operations.

## Advanced Regex Features

### Lookahead and Lookbehind Assertions

Lookahead and lookbehind assertions are zero-width: they check whether a subpattern matches immediately after (lookahead) or immediately before (lookbehind) the current position, without consuming any characters.

#### Lookahead (`?=`)

Positive lookahead `(?=...)` matches only where the given subpattern follows. Here, `foo` matches only when it is followed by `bar`.

```python
import re

pattern = re.compile(r'foo(?=bar)')
text = "foobar foobaz"
matches = pattern.findall(text)
print(matches)  # Output: ['foo']
```

Negative lookahead matches a group that is not followed by a specified pattern.

```python
pattern = re.compile(r'foo(?!bar)')
text = "foobar foobaz"
matches = pattern.findall(text)
print(matches)  # Output: ['foo']
```

#### Lookbehind (`?<=`)

Positive lookbehind `(?<=...)` matches only where the given subpattern comes immediately before. Here, `bar` matches only when it is preceded by `foo`.

```python
pattern = re.compile(r'(?<=foo)bar')
text = "foobar foobaz"
matches = pattern.findall(text)
print(matches)  # Output: ['bar']
```

Negative lookbehind matches a group that is not preceded by a specified pattern.

```python
pattern = re.compile(r'(?<!foo)bar')
text = "foobar bazbar"
matches = pattern.findall(text)
print(matches)  # Output: ['bar']
```

### Non-Capturing Groups

Non-capturing groups are useful for grouping parts of a regex without creating backreferences.

```python
pattern = re.compile(r'(?:foo|bar)baz')
text = "foobaz barbaz"
matches = pattern.findall(text)
print(matches)  # Output: ['foobaz', 'barbaz']
```

### Named Groups

Named groups make patterns more readable and allow referencing groups by name.

```python
pattern = re.compile(r'(?P<first>\w+)\s(?P<last>\w+)')
text = "John Doe"
match = pattern.search(text)
print(match.group('first'))  # Output: John
print(match.group('last'))   # Output: Doe
```

### Backreferences

Backreferences allow referring to previously matched groups later in the pattern.

```python
pattern = re.compile(r'(\b\w+)\s+\1')
text = "hello hello world"
matches = pattern.findall(text)
print(matches)  # Output: ['hello']
```
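The same backreference idea can be combined with `re.sub()` to collapse repeated words. A minimal sketch, with sample sentences chosen for illustration:

```python
import re

# \1 must repeat the exact word captured by group 1; \b keeps partial words from matching
duplicate_word = re.compile(r'\b(\w+)(\s+\1\b)+')

fixed_short = duplicate_word.sub(r'\1', 'hello hello world')
fixed_long = duplicate_word.sub(r'\1', 'the the the cat sat')

print(fixed_short)  # hello world
print(fixed_long)   # the cat sat
```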

### Conditional Statements

A conditional `(?(1)yes|no)` matches `yes` if group 1 took part in the match and `no` otherwise. Here, `bar` must be followed by `baz` when `foo` was matched and by `qux` when it was not.

```python
pattern = re.compile(r'(foo)?bar(?(1)baz|qux)')
text = "foobar foobarbaz barqux"
matches = [m.group() for m in pattern.finditer(text)]
print(matches)  # Output: ['foobarbaz', 'barqux']
```

### Atomic Groups

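Atomic groups, written `(?>...)` and available from Python 3.11, discard their backtracking positions once they have matched, which can make matching more efficient. In the example below the group commits to the first alternative, `foo`, so the longer alternative is never tried.

```python
# Requires Python 3.11 or later
pattern = re.compile(r'(?>foo|foobarbaz)')
text = "foobarbaz"
match = pattern.search(text)
print(match.group())  # Output: foo
```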

### Possessive Quantifiers

Possessive quantifiers match as many characters as possible and do not backtrack. Python's `re` module supports them (`*+`, `++`, `?+`, `{m,n}+`) from version 3.11; in earlier versions they can be emulated with a lookahead and a backreference.
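One way to approximate possessive behavior on any Python version is the `(?=(...))\1` idiom: the lookahead captures the longest run, and the backreference then consumes it as a single unit that cannot be shortened. The pattern names below are illustrative.

```python
import re

# Greedy: a+ can give back one 'a' so the final 'a' still matches
greedy = re.compile(r'a+a')

# Possessive-like: the lookahead captures the whole run and \1 consumes it without backtracking
possessive_like = re.compile(r'(?=(a+))\1a')

greedy_result = greedy.search('aaaa')
possessive_result = possessive_like.search('aaaa')

print(greedy_result.group())  # aaaa
print(possessive_result)      # None
```

The second search fails because nothing is left for the final `a` once the whole run has been consumed.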

### Recursive Patterns

Python's `re` module does not support recursive patterns such as `(?R)`. The third-party `regex` module does, and nested structures can also be handled in plain Python with a depth counter.
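As a sketch of one such alternative, nested parentheses can be tracked with a simple depth counter instead of a recursive pattern. The function name `balanced_spans` is hypothetical, and the code assumes only top-level groups are wanted.

```python
def balanced_spans(text):
    """Return each top-level balanced parenthesised substring of text."""
    spans = []
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == '(':
            if depth == 0:
                start = index  # remember where the outermost group opens
            depth += 1
        elif char == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:index + 1])
    return spans

print(balanced_spans('x (a(b)c) y (d)'))  # ['(a(b)c)', '(d)']
```

Unmatched closing parentheses are ignored, and an unclosed opening parenthesis produces no span.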

## Advanced Examples

### Matching Balanced Parentheses

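Matching balanced parentheses requires recursion, which the built-in `re` module lacks. The example below uses the third-party `regex` module (`pip install regex`), whose `(?R)` construct recurses into the whole pattern.

```python
import regex  # third-party module; the built-in re rejects (?R)

pattern = regex.compile(r'\((?:[^)(]+|(?R))*\)')
text = "(a(b)c)"
matches = pattern.findall(text)
print(matches)  # Output: ['(a(b)c)']
```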

### Parsing and Modifying Complex Data

Consider parsing a CSV file with fields that may contain escaped quotes and commas.

```python
pattern = re.compile(r'''
    "(?:[^"\\]|\\.)*"   # Match double-quoted fields
    |                   # or
    [^,\s"][^,]*        # Match unquoted fields, skipping leading whitespace
''', re.VERBOSE)

text = 'John, "Doe, Jane", "Escaped \\"Quote\\""'
matches = pattern.findall(text)
print(matches)  # Output: ['John', '"Doe, Jane"', '"Escaped \\"Quote\\""']
```

### Extracting Nested HTML Tags

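Python's `re` module cannot recurse, so it cannot track nesting depth on its own. A backreference to the tag name is enough to capture the outermost element of a simple, well-formed fragment; for arbitrary HTML, a dedicated parser is more reliable.

```python
pattern = re.compile(r'<(\w+)[^>]*>.*</\1>', re.DOTALL)
html = '<div><p>Hello, <em>World!</em></p></div>'
matches = pattern.findall(html)
print(matches)  # Output: ['div'] (findall returns the captured tag name)
```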

### Validating Complex Passwords

Validating a password with multiple conditions: at least 8 characters, one uppercase letter, one lowercase letter, one digit, and one special character.

```python
pattern = re.compile(r'''
    (?=.*[A-Z])       # At least one uppercase letter
    (?=.*[a-z])       # At least one lowercase letter
    (?=.*\d)          # At least one digit
    (?=.*[@$!%*?&])   # At least one special character
    [A-Za-z\d@$!%*?&] # Allowed characters
    {8,}              # At least 8 characters long
''', re.VERBOSE)

password = 'Password1!'
print(bool(pattern.fullmatch(password)))  # Output: True (fullmatch rejects trailing disallowed characters)
```
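A single combined pattern only answers yes or no. When it is useful to report which requirement failed, one option is to check each rule separately. The sketch below is an assumption-laden illustration: the `RULES` table, its messages, and the `failed_rules` helper are hypothetical names, and it does not enforce the allowed-character restriction from the pattern above.

```python
import re

# Hypothetical rule table: (compiled pattern, description of the requirement)
RULES = [
    (re.compile(r'.{8,}'), 'at least 8 characters'),
    (re.compile(r'[A-Z]'), 'an uppercase letter'),
    (re.compile(r'[a-z]'), 'a lowercase letter'),
    (re.compile(r'\d'), 'a digit'),
    (re.compile(r'[@$!%*?&]'), 'a special character'),
]

def failed_rules(password):
    """Return the descriptions of every rule the password does not satisfy."""
    return [message for rule, message in RULES if not rule.search(password)]

print(failed_rules('Password1!'))  # []
print(failed_rules('password'))    # ['an uppercase letter', 'a digit', 'a special character']
```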

## Combining Multiple Patterns

Advanced regex often involves combining multiple patterns to achieve the desired result.

### Example: Extracting Data from Logs

Consider extracting IP addresses and timestamps from log entries.

```python
log_text = '''
[2024-07-08 12:34:56] INFO: User logged in from 192.168.1.1
[2024-07-08 12:35:00] ERROR: Failed login attempt from 10.0.0.1
'''

pattern = re.compile(r'\[(?P<timestamp>[^\]]+)\].*?from (?P<ip>\d+\.\d+\.\d+\.\d+)')
matches = pattern.finditer(log_text)
for match in matches:
    print(f"Timestamp: {match.group('timestamp')}, IP: {match.group('ip')}")
# Output:
# Timestamp: 2024-07-08 12:34:56, IP: 192.168.1.1
# Timestamp: 2024-07-08 12:35:00, IP: 10.0.0.1
```
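The `\d+\.\d+\.\d+\.\d+` part accepts any dotted group of digits, including out-of-range values such as `999.1.1.1`. A possible follow-up step, sketched here with a made-up sample string and a hypothetical `is_ipv4` helper, is to filter the candidates with the standard-library `ipaddress` module:

```python
import ipaddress
import re

sample = 'from 192.168.1.1, 999.1.1.1 and 10.0.0.1'
candidates = re.findall(r'\b\d+\.\d+\.\d+\.\d+\b', sample)

def is_ipv4(value):
    """Return True if value parses as a valid IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:  # raised for out-of-range or malformed addresses
        return False

valid = [candidate for candidate in candidates if is_ipv4(candidate)]
print(valid)  # ['192.168.1.1', '10.0.0.1']
```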

## Performance Considerations

### Optimizing Regex Patterns

1. **Avoiding Catastrophic Backtracking:** Be cautious with patterns that can cause excessive backtracking.
2. **Using Atomic Groups:** Use atomic groups (Python 3.11 and later) to prevent backtracking within certain parts of the pattern.
3. **Compiling Patterns:** Compile regex patterns once if they are used multiple times to improve performance.

```python
# Example of compiling a regex pattern
pattern = re.compile(r'\bfoo\b')
```
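To illustrate the first point, nested quantifiers such as `(a+)+` are a typical source of catastrophic backtracking: when the overall match fails, the engine tries every way of splitting the run of `a` characters. The sketch below uses deliberately short inputs and shows that a single quantifier accepts the same strings.

```python
import re

risky = re.compile(r'^(a+)+$')  # nested quantifiers: many ways to split a run of 'a'
safe = re.compile(r'^a+$')      # accepts the same strings with one quantifier

samples = ['aaaa', 'aaab', '']
# Both patterns should give the same verdict on every sample
agreement = [bool(risky.match(s)) == bool(safe.match(s)) for s in samples]
print(agreement)  # [True, True, True]
```

With the risky pattern, the time taken on a failing input such as a long run of `a` followed by `b` grows roughly exponentially with the length of the run, so it is best tried only on short strings.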

### Profiling and Benchmarking

Use tools like `timeit` and `cProfile` to measure the performance of regex operations and identify bottlenecks.

```python
import timeit

setup = '''
import re
pattern = re.compile(r'\bfoo\b')
text = "foo bar foo baz"
'''

stmt = '''
matches = pattern.findall(text)
'''

print(timeit.timeit(stmt, setup=setup, number=100000))
```

## Conclusion

Advanced regular expressions in Python provide powerful tools for complex text processing tasks. By mastering lookaheads, lookbehinds, named groups, backreferences, and other advanced features, you can handle intricate pattern matching and text manipulation efficiently. Understanding performance considerations and best practices ensures that your regex patterns are both effective and performant. Whether you are parsing logs, validating input, or extracting data, advanced regex techniques are invaluable for sophisticated text processing in Python.

---



# Summary of Regular Expressions in Python

## Introduction to Regular Expressions

Regular expressions (regex) are sequences of characters that define a search pattern, primarily used for string pattern matching. Python's `re` module provides powerful tools to work with regex for various text processing tasks.

## Basic Components of Regex

### Special Characters and Meta-characters

- `.` : Matches any character except a newline.
- `^` : Matches the start of the string.
- `$` : Matches the end of the string.
- `*` : Matches 0 or more repetitions of the preceding element.
- `+` : Matches 1 or more repetitions of the preceding element.
- `?` : Matches 0 or 1 repetition of the preceding element.
- `{n}` : Matches exactly n repetitions of the preceding element.
- `{n,}` : Matches n or more repetitions of the preceding element.
- `{n,m}` : Matches between n and m repetitions of the preceding element.
- `[]` : Matches any one of the characters inside the brackets.
- `\` : Escapes special characters or signals a special sequence.
- `|` : Acts as a logical OR between patterns.

### Character Classes

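- `\d` : Matches any digit; equivalent to `[0-9]` for ASCII text (for `str` patterns it also matches other Unicode digits unless `re.ASCII` is set).
- `\D` : Matches any non-digit.
- `\w` : Matches any word character (alphanumeric + underscore); equivalent to `[a-zA-Z0-9_]` for ASCII text.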
- `\W` : Matches any non-word character.
- `\s` : Matches any whitespace character (spaces, tabs, line breaks).
- `\S` : Matches any non-whitespace character.

## Basic Regex Functions in Python

### `re.compile()`

Compiles a regex pattern into a regex object, which can be reused.

```python
import re

pattern = re.compile(r'\d+')
```

### `re.match()`

Checks for a match only at the beginning of the string.

```python
result = re.match(r'\d+', '123abc')
print(result.group())  # Output: '123'
```

### `re.search()`

Searches for the first occurrence of the pattern within the string.

```python
result = re.search(r'\d+', 'abc123')
print(result.group())  # Output: '123'
```

### `re.findall()`

Finds all occurrences of the pattern within the string and returns them as a list.

```python
result = re.findall(r'\d+', 'abc123def456')
print(result)  # Output: ['123', '456']
```

### `re.finditer()`

Returns an iterator yielding match objects for all non-overlapping matches.

```python
results = re.finditer(r'\d+', 'abc123def456')
for result in results:
    print(result.group())  # Output: '123' '456'
```

### `re.sub()`

Replaces occurrences of the pattern with a replacement string.

```python
result = re.sub(r'\d+', 'number', 'abc123def456')
print(result)  # Output: 'abcnumberdefnumber'
```

### `re.split()`

Splits the string by occurrences of the pattern.

```python
result = re.split(r'\d+', 'abc123def456')
print(result)  # Output: ['abc', 'def', '']
```

### `re.subn()`

Similar to `re.sub()`, but returns a tuple containing the new string and the number of replacements made.

```python
result = re.subn(r'\d+', 'number', 'abc123def456')
print(result)  # Output: ('abcnumberdefnumber', 2)
```

## Advanced Regex Features

### Lookahead and Lookbehind Assertions

- **Lookahead** (`?=` and `?!`): Asserts whether a pattern is followed by another pattern.
- **Lookbehind** (`?<=` and `?<!`): Asserts whether a pattern is preceded by another pattern.

```python
positive_lookahead = re.compile(r'foo(?=bar)')
negative_lookahead = re.compile(r'foo(?!bar)')
positive_lookbehind = re.compile(r'(?<=foo)bar')
negative_lookbehind = re.compile(r'(?<!foo)bar')
```

### Non-Capturing Groups

Groups part of a pattern without creating a backreference.

```python
pattern = re.compile(r'(?:foo|bar)baz')
```

### Named Groups

Names a group, allowing for easier reference and readability.

```python
pattern = re.compile(r'(?P<first>\w+)\s(?P<last>\w+)')
```

### Backreferences

References a previously matched group.

```python
pattern = re.compile(r'(\b\w+)\s+\1')
```

### Conditional Statements

Chooses between patterns based on a condition.

```python
pattern = re.compile(r'(foo)?bar(?(1)baz|qux)')
```

### Atomic Groups

Prevents backtracking within the group (requires Python 3.11 or later).

```python
pattern = re.compile(r'(?>foo|foobarbaz)')
```

## Practical Examples

### Extracting Data from Text

```python
log_text = '''
[2024-07-08 12:34:56] INFO: User logged in from 192.168.1.1
[2024-07-08 12:35:00] ERROR: Failed login attempt from 10.0.0.1
'''

pattern = re.compile(r'\[(?P<timestamp>[^\]]+)\].*?from (?P<ip>\d+\.\d+\.\d+\.\d+)')
matches = pattern.finditer(log_text)
for match in matches:
    print(f"Timestamp: {match.group('timestamp')}, IP: {match.group('ip')}")
```

### Validating Complex Input

```python
pattern = re.compile(r'''
    (?=.*[A-Z])       # At least one uppercase letter
    (?=.*[a-z])       # At least one lowercase letter
    (?=.*\d)          # At least one digit
    (?=.*[@$!%*?&])   # At least one special character
    [A-Za-z\d@$!%*?&] # Allowed characters
    {8,}              # At least 8 characters long
''', re.VERBOSE)

password = 'Password1!'
print(bool(pattern.fullmatch(password)))  # Output: True (fullmatch rejects trailing disallowed characters)
```

## Performance Considerations

### Avoiding Catastrophic Backtracking

Be cautious with patterns that can cause excessive backtracking.

### Using Atomic Groups

Use atomic groups (Python 3.11 and later) to prevent backtracking within certain parts of the pattern.

### Compiling Patterns

Compile regex patterns once if they are used multiple times to improve performance.

```python
pattern = re.compile(r'\bfoo\b')
```

### Profiling and Benchmarking

Use tools like `timeit` and `cProfile` to measure the performance of regex operations and identify bottlenecks.

```python
import timeit

setup = '''
import re
pattern = re.compile(r'\bfoo\b')
text = "foo bar foo baz"
'''

stmt = '''
matches = pattern.findall(text)
'''

print(timeit.timeit(stmt, setup=setup, number=100000))
```

## Conclusion

Regular expressions in Python offer a powerful and flexible way to handle text processing tasks. From basic searches and replacements to advanced pattern matching with lookaheads, lookbehinds, and named groups, regex can significantly simplify and enhance your text processing capabilities. Understanding these tools and best practices ensures that your regex usage is both effective and efficient, making it an invaluable skill in data manipulation, validation, and extraction.