# **Regular Expressions**



**Introduction to Regular Expressions**

    What are regular expressions?

    Why are regular expressions useful?

    Basic syntax of regular expressions in Python.

    Literal characters and metacharacters.

    Using the re module in Python for regular expressions.

    Simple pattern matching using re.search() and re.match().

**Character Classes and Quantifiers**

    Understanding character classes ([...]) and their usage.

    Using predefined character classes (\d, \w, \s, etc.).

    Quantifiers: *, +, ?, {}.

    Greedy vs. non-greedy matching.

    Matching specific quantities of characters.

**Anchors and Boundaries**

    Using anchors: ^ and $.

    Utilizing word boundaries: \b and \B.

    Understanding the importance of anchors and boundaries.

    Examples of using anchors and boundaries.

**Groups and Capturing**

    Introduction to groups with parentheses.

    Non-capturing groups (?:...).

    Capturing groups and accessing matched groups.

    Backreferences: Using captured groups in the pattern.

    Practical applications of groups and capturing.

**Alternation and Flags**

    Alternation with the pipe character |.

    Case-insensitive matching with flags.

    Other useful flags: re.IGNORECASE, re.MULTILINE, etc.

    Practical examples demonstrating alternation and flags.

**Lookahead and Lookbehind Assertions**

    Positive and negative lookahead assertions.

    Positive and negative lookbehind assertions.

    Understanding the importance of lookahead and lookbehind.

    Practical examples showcasing lookahead and lookbehind assertions.

**Advanced Topics and Optimization Techniques**

    Recursive patterns and their applications.

    Performance considerations and optimization techniques.

    Using compiled regex objects for improved performance.

    Practical examples demonstrating advanced topics and optimization techniques.

**Project**

# **Overview of Regular Expressions**





**Regular expressions** (regex or regexp) are sequences of characters that define a search pattern. They are widely used in text processing tasks to search, manipulate, and validate strings.


**Basic Syntax of Regular Expressions in Python:**

A regular expression is typically a string pattern composed of literal characters and metacharacters.

**Examples of metacharacters** include . (matches any single character except newline), * (matches zero or more occurrences), + (matches one or more occurrences), ? (matches zero or one occurrence), [] (defines a character class), () (defines a group), etc.

**Literal Characters and Metacharacters:**

Literal characters match themselves in the input string. For example, the regex cat will match the string "cat" in the input.

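Metacharacters have special meanings in regular expressions. For example, . matches any character except newline, and * matches zero or more occurrences of the preceding element (a character, character class, or group). To match a metacharacter literally, escape it with a backslash: \. matches a period.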
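
The difference between a metacharacter and its escaped, literal form can be sketched as follows. The sample text is made up for illustration; re.escape() is the standard-library helper that escapes every metacharacter in a string.

```python
import re

text = "Version 3.10 costs $3x10 (approx.)"

# Unescaped dot: "." matches any character except newline,
# so the pattern 3.10 matches both "3.10" and "3x10".
any_char = re.findall(r'3.10', text)
print(any_char)        # ['3.10', '3x10']

# Escaped dot: "\." matches only a literal period.
literal_dot = re.findall(r'3\.10', text)
print(literal_dot)     # ['3.10']

# re.escape() builds a pattern that matches the given string literally.
escaped = re.escape("(approx.)")
found = re.search(escaped, text).group()
print(found)           # (approx.)
```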

**Using the re Module in Python for Regular Expressions:**

Python's re module provides functions and methods to work with regular expressions.

Commonly used functions include re.search(), re.match(), re.findall(), re.sub(), etc.

In [1]:
import re

# Using re.search()
pattern = r'fox'
text = "The quick brown fox jumps over the lazy dog"
match = re.search(pattern, text)
if match:
    print("Pattern found:", match.group())
else:
    print("Pattern not found")

# Using re.match()
pattern = r'The'
text = "The quick brown fox jumps over the lazy dog"
match = re.match(pattern, text)
if match:
    print("Pattern found at the beginning of the string:", match.group())
else:
    print("Pattern not found at the beginning of the string")


Pattern found: fox
Pattern found at the beginning of the string: The


In this example, re.search() searches for the pattern "fox" in the text and re.match() matches the pattern "The" only at the beginning of the string.
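
To make the contrast concrete, here is a small sketch that applies the same pattern with re.match(), re.search(), and re.fullmatch() (a related function not used above, which requires the whole string to match). The sample text is illustrative.

```python
import re

text = "The quick brown fox"

# re.match() only tries to match at position 0, so "fox" is not found.
at_start = re.match(r'fox', text)
# re.search() scans the whole string for the first match.
anywhere = re.search(r'fox', text)
# re.fullmatch() succeeds only if the entire string matches the pattern.
whole = re.fullmatch(r'fox', text)

print(at_start)          # None
print(anywhere.span())   # (16, 19)
print(whole)             # None
```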

Regular expressions are powerful tools for string manipulation and pattern matching in Python. Understanding their basic syntax and functions is essential for effective text processing tasks.

# **Character Classes and Quantifiers**

**Character classes** [...] match any single character listed within the brackets. For example, [aeiou] matches any one lowercase vowel.

**Quantifiers** specify how many occurrences of the preceding element should be matched.
For instance, * matches zero or more occurrences, + matches one or more occurrences, ? matches zero or one occurrence, {n} matches exactly n occurrences, {n,} matches n or more occurrences, and {n,m} matches between n and m occurrences.

In [2]:
import re

# Character Classes
pattern = r'[aeiou]'
text = "The quick brown fox jumps over the lazy dog"
matches = re.findall(pattern, text)
print("Vowels found:", matches)

# Quantifiers
pattern = r'ab*'
text = "ab abbb abb"
matches = re.findall(pattern, text)
print("Matches found:", matches)


Vowels found: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']
Matches found: ['ab', 'abbb', 'abb']
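
The predefined classes (\d, \w, \s), counted quantifiers, and greedy versus non-greedy matching can be sketched in the same style. The sample strings are made up for illustration.

```python
import re

# Predefined classes: \d is a digit, \w a word character, \s whitespace.
text = "Order 66 shipped on 2024-01-15"
all_numbers = re.findall(r'\d+', text)
print(all_numbers)       # ['66', '2024', '01', '15']

# {n} matches exactly n occurrences; {n,m} matches between n and m.
four_digits = re.findall(r'\b\d{4}\b', text)
two_to_four = re.findall(r'\b\d{2,4}\b', text)
print(four_digits)       # ['2024']
print(two_to_four)       # ['66', '2024', '01', '15']

# Greedy quantifiers take as much text as possible;
# adding ? makes them non-greedy (as little as possible).
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r'<.+>', html)
lazy = re.findall(r'<.+?>', html)
print(greedy)            # ['<b>bold</b> and <i>italic</i>']
print(lazy)              # ['<b>', '</b>', '<i>', '</i>']
```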


# **Anchors and Boundaries**

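**Anchors** ^ and $ match the beginning and end of a string, respectively. With the re.MULTILINE flag they also match at the start and end of each line, and $ also matches just before a trailing newline.

**Word boundaries:** \b matches the position between a word character (\w) and a non-word character, or at the start or end of the string next to a word character. \B matches every position that is not a word boundary, for example between two word characters.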
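
As a quick illustration of how \b and \B differ, the following sketch searches a made-up string for "cat" as a whole word and for "cat" inside a longer word.

```python
import re

text = "cat concat cats scatter"

# \bcat\b: "cat" with a word boundary on both sides, i.e. a whole word.
whole_word = re.findall(r'\bcat\b', text)
print(whole_word)        # ['cat']

# \Bcat: "cat" that does NOT start at a word boundary,
# i.e. it is preceded by another word character.
inside_starts = [m.start() for m in re.finditer(r'\Bcat', text)]
print(inside_starts)     # [7, 17]  ("concat" and "scatter")
```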

In [3]:
import re

# Anchors
pattern = r'^The'
text = "The quick brown fox jumps over the lazy dog"
match = re.match(pattern, text)
if match:
    print("Pattern found at the beginning of the string:", match.group())
else:
    print("Pattern not found at the beginning of the string")

# Word Boundaries
pattern = r'\bfox\b'
text = "The quick brown fox jumps over the lazy dog"
match = re.search(pattern, text)
if match:
    print("Pattern found:", match.group())
else:
    print("Pattern not found")


Pattern found at the beginning of the string: The
Pattern found: fox


In this example, ^The matches "The" only at the beginning of the string, while \bfox\b matches the word "fox" only when it appears as a separate word.

# **Groups and Capturing**

**Groups** (...) are used to group multiple characters together. They also enable capturing of the matched text.

**Capturing** groups allow extracting specific parts of the matched text.

In [4]:
import re

# Groups and Capturing
pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = "Phone numbers: 123-456-7890, 456-789-1234"
matches = re.findall(pattern, text)
for match in matches:
    print("Phone number:", "-".join(match))


Phone number: 123-456-7890
Phone number: 456-789-1234


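In this example, the pattern (\d{3})-(\d{3})-(\d{4}) matches phone numbers in the format XXX-XXX-XXXX and captures the three parts separately. Because the pattern contains groups, re.findall() returns one tuple of captured groups per match, which the loop joins back together with hyphens.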
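
Groups can also be accessed on a single match object, given names, or made non-capturing. The sketch below reuses the phone number text; the group names area, prefix, and line are chosen here for illustration.

```python
import re

text = "Phone numbers: 123-456-7890, 456-789-1234"

# Named groups (?P<name>...) can be read by name as well as by number.
pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
m = re.search(pattern, text)
full = m.group(0)          # the whole match
area = m.group('area')     # same as m.group(1)
parts = m.groups()         # all captured groups as a tuple
print(full, area, parts)   # 123-456-7890 123 ('123', '456', '7890')

# A non-capturing group (?:...) groups without capturing, so
# re.findall() returns the whole match instead of a tuple of groups.
numbers = re.findall(r'(?:\d{3}-){2}\d{4}', text)
print(numbers)             # ['123-456-7890', '456-789-1234']
```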


Regular expressions provide a flexible and efficient way to search, manipulate, and validate strings in Python.

# **Alternation and Flags**

**Alternation:**

Alternation is represented by the pipe character | and allows matching one of several patterns.

In [5]:
import re

# Alternation
pattern = r'cat|dog'
text = "I have a cat and a dog"
matches = re.findall(pattern, text)
print("Matches found:", matches)


Matches found: ['cat', 'dog']


In this example, the pattern cat|dog matches either "cat" or "dog" in the input text.

**Flags:**

Flags modify the behavior of regular expressions.

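Common flags include **re.IGNORECASE** for case-insensitive matching and **re.MULTILINE**, which makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole string. Flags can be combined with the | operator.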

In [6]:
import re

# Flags
pattern = r'dog'
text = "I have a Dog and a dog"
matches = re.findall(pattern, text, flags=re.IGNORECASE)
print("Matches found with case-insensitive flag:", matches)


Matches found with case-insensitive flag: ['Dog', 'dog']


In this example, **re.IGNORECASE** flag enables case-insensitive matching, so the pattern dog matches both "Dog" and "dog" in the input text.
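
The effect of re.MULTILINE, and of combining it with re.IGNORECASE, can be sketched on a small multi-line string. The log text is made up for illustration.

```python
import re

log = "error: disk full\nWarning: low memory\nERROR: timeout"

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r'^error', log, flags=re.IGNORECASE)
print(first_only)    # ['error']

# With re.MULTILINE, ^ also matches at the start of every line.
# Flags are combined with the | operator.
each_line = re.findall(r'^error', log, flags=re.IGNORECASE | re.MULTILINE)
print(each_line)     # ['error', 'ERROR']
```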

# **Lookahead and Lookbehind Assertions**

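**Lookahead assertions** (?=...) and (?!...) assert that the text at the current position is, or is not, followed by a given pattern.

**Lookbehind assertions** (?<=...) and (?<!...) assert that the current position is, or is not, preceded by a given pattern. In Python's re module, the pattern inside a lookbehind must match strings of a fixed length.

Both kinds are zero-width: they check the surrounding text without including it in the match.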

In [7]:
import re

# Lookahead and Lookbehind Assertions
pattern = r'(?<=quick\s)(brown)'
text = "The quick brown fox jumps over the lazy dog"
match = re.search(pattern, text)
if match:
    print("Match found using lookbehind assertion:", match.group())
else:
    print("Match not found")

pattern = r'(brown)(?=\sfox)'
text = "The quick brown fox jumps over the lazy dog"
match = re.search(pattern, text)
if match:
    print("Match found using lookahead assertion:", match.group())
else:
    print("Match not found")


Match found using lookbehind assertion: brown
Match found using lookahead assertion: brown


In this example, the lookbehind assertion (?<=quick\s) ensures that "brown" is preceded by "quick ", while the lookahead assertion (?=\sfox) ensures that "brown" is followed by " fox".
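
The negative forms work the same way but succeed only when the surrounding text is absent. A minimal sketch on made-up sample strings:

```python
import re

text = "apple pie, apple tart, cherry pie"

# Negative lookahead: "apple" NOT followed by " pie".
apple_starts = [m.start() for m in re.finditer(r'apple(?!\spie)', text)]
print(apple_starts)   # [11]  (the "apple" in "apple tart")

# Negative lookbehind: "pie" NOT preceded by "apple ".
pie_starts = [m.start() for m in re.finditer(r'(?<!apple\s)pie', text)]
print(pie_starts)     # [30]  (the "pie" in "cherry pie")

# Lookarounds are zero-width, so the asserted text is not part of the match.
usd_amounts = re.findall(r'\d+(?= USD)', "100 USD, 200 EUR")
print(usd_amounts)    # ['100']
```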

# **Advanced Topics and Optimization Techniques**

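**Backreferences for Repeated Text:**

Recursive patterns are patterns that refer to themselves. Python's built-in re module does not support them, although some other engines do, such as PCRE and the third-party regex package. What re does support is backreferences: \1 matches the same text that group 1 captured earlier in the pattern.

In [8]:
import re

# Backreference: \1 must match the same text that group 1 captured
pattern = r'(\w+)\s(\1)'
text = "The quick brown fox jumps over the lazy dog dog"
matches = re.findall(pattern, text)
print("Matches found with backreference:", matches)


Matches found with backreference: [('dog', 'dog')]


In this example, (\w+)\s(\1) matches a word immediately followed by a copy of itself, so it finds the repeated "dog" in the input text.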
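
The pattern above has no word boundaries, so it can also match across partial words. The sketch below, on a made-up sentence, shows that pitfall and one possible way to remove duplicated words with re.sub().

```python
import re

text = "the theme is is fine"

# Without word boundaries, "the" followed by the "the" in "theme" matches.
loose = re.findall(r'(\w+)\s\1', text)
print(loose)      # ['the', 'is']

# \b on both sides restricts the match to whole repeated words.
strict = re.findall(r'\b(\w+)\s+\1\b', text)
print(strict)     # ['is']

# Collapse each run of repeated words into a single copy.
cleaned = re.sub(r'\b(\w+)(\s+\1\b)+', r'\1', text)
print(cleaned)    # the theme is fine
```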


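**Performance Considerations and Optimization Techniques:**

Regular expressions can be slow on complex patterns or large inputs. A common cause is backtracking: when a pattern can match the same text in many ways, for example with nested quantifiers such as (a+)+, the engine may try an exponential number of combinations before reporting a failure.

Techniques such as simplifying the pattern, using compiled regex objects (re.compile()), and avoiding unnecessary backtracking can improve performance.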
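
A minimal sketch of two of these techniques follows. The name WORD_RE and the sample lines are illustrative. Note that the re module also caches recently used patterns internally, so the benefit of compiling is often modest and mainly helps readability and tight loops.

```python
import re

# Compile the pattern once, outside the loop, and reuse the compiled object.
WORD_RE = re.compile(r'\b\w+\b')

lines = ["The quick brown fox", "jumps over", "the lazy dog"]
word_counts = [len(WORD_RE.findall(line)) for line in lines]
print(word_counts)   # [4, 2, 3]

# Prefer a simple quantifier to a nested one. a+b describes the same
# strings as (a+)+b, but the nested form can backtrack exponentially
# on input like "aaaa...c", where the match fails.
no_match = re.fullmatch(r'a+b', 'a' * 30 + 'c')
print(no_match)      # None
```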

**Email Address Validation:**

Regular expressions are commonly used to validate email addresses.

In [9]:
import re

email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def validate_email(email):
    # re.match() returns a match object or None; bool() turns that into True/False
    return bool(re.match(email_pattern, email))

emails = ["user@example.com", "invalid_email", "another.user@example.co.uk"]
for email in emails:
    if validate_email(email):
        print(email, "is a valid email address")
    else:
        print(email, "is not a valid email address")


user@example.com is a valid email address
invalid_email is not a valid email address
another.user@example.co.uk is a valid email address


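In this example, the regex pattern ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ accepts addresses of the common form local-part@domain.tld. It is a simplified check rather than a full implementation of the email address standard, so it can accept some invalid addresses and reject some valid ones.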
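
A few edge cases show where this simple pattern is more permissive than one might expect. The test addresses are made up, and re.fullmatch() is shown as one possible way to tighten the trailing-newline case.

```python
import re

email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# Consecutive dots in the domain are accepted by this pattern.
double_dot = bool(re.match(email_pattern, "user@example..com"))
print(double_dot)          # True

# $ also matches just before a trailing newline, so this passes too.
with_newline = bool(re.match(email_pattern, "user@example.com\n"))
print(with_newline)        # True

# re.fullmatch() requires the entire string to match and rejects the newline.
strict_newline = bool(re.fullmatch(email_pattern, "user@example.com\n"))
print(strict_newline)      # False
```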

**Data Extraction from Text:**

Regular expressions are useful for extracting structured data from unstructured text.

In [10]:
import re

text = "Product: Apple iPhone 12, Price: $999.99, In stock: Yes"
product_pattern = r'Product: ([\w\s]+), Price: (\$\d+\.\d+), In stock: (Yes|No)'

match = re.match(product_pattern, text)
if match:
    product_name = match.group(1)
    price = match.group(2)
    in_stock = match.group(3)
    print("Product:", product_name)
    print("Price:", price)
    print("In stock:", in_stock)
else:
    print("No match found")


Product: Apple iPhone 12
Price: $999.99
In stock: Yes


In this example, the regex pattern Product: ([\w\s]+), Price: (\$\d+\.\d+), In stock: (Yes|No) extracts product name, price, and stock status from a text string. The class [\w\s]+ lets the product name span any number of words; a stricter \w+\s\w+ would accept only two-word names and fail on "Apple iPhone 12".

**Text Replacement and Formatting:**

Regular expressions can be used for find and replace operations and text formatting.

In [11]:
import re

text = "The quick brown fox jumps over the lazy dog"
new_text = re.sub(r'brown', 'red', text)
print("Original text:", text)
print("Modified text:", new_text)


Original text: The quick brown fox jumps over the lazy dog
Modified text: The quick red fox jumps over the lazy dog
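
re.sub() can do more than replace fixed text. The sketch below, on made-up strings, shows group references in the replacement, a function as the replacement, and the count argument. The helper name double is illustrative.

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
text = "Released 2024-01-15, patched 2024-03-02"
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)
print(reordered)   # Released 15/01/2024, patched 02/03/2024

# The replacement can be a function that receives each match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, "3 apples and 12 pears")
print(doubled)     # 6 apples and 24 pears

# count limits how many replacements are made.
once = re.sub(r'a', 'A', "banana", count=1)
print(once)        # bAnana
```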


**URL Parsing:**

Regular expressions are helpful in parsing and extracting information from URLs.

In [12]:
import re

url = "https://www.example.com/path/to/page?param1=value1&param2=value2"
url_pattern = r'https?://([\w.-]+)/([\w./-]+)\?([\w=&]+)'

match = re.match(url_pattern, url)
if match:
    domain = match.group(1)
    path = match.group(2)
    query_params = match.group(3)
    print("Domain:", domain)
    print("Path:", path)
    print("Query parameters:", query_params)
else:
    print("No match found")


Domain: www.example.com
Path: path/to/page
Query parameters: param1=value1&param2=value2


In this example, the regex pattern https?://([\w.-]+)/([\w./-]+)\?([\w=&]+) parses the domain, path, and query parameters from a URL.
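
The pattern above requires both a path and a query string. As a sketch of one possible variation, the version below uses named groups and makes the path and query optional. The function name parse_url and the group names are assumptions for illustration, and the pattern covers only simple URLs.

```python
import re

url_pattern = re.compile(
    r'^(?P<scheme>https?)://'        # http or https
    r'(?P<domain>[\w.-]+)'           # host name
    r'(?P<path>/[\w./-]*)?'          # optional path
    r'(?:\?(?P<query>[\w=&]+))?$'    # optional query string
)

def parse_url(url):
    """Return a dict of URL parts, or None if the URL does not match."""
    match = url_pattern.match(url)
    if match is None:
        return None
    parts = match.groupdict()
    # Split "a=1&b=2" into a dictionary of parameters.
    parts['params'] = dict(re.findall(r'(\w+)=(\w*)', parts['query'] or ''))
    return parts

full = parse_url("https://www.example.com/path/to/page?param1=value1&param2=value2")
print(full['domain'], full['path'], full['params'])

bare = parse_url("http://example.com")
print(bare['path'], bare['params'])   # None {}
```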

# **Project**

**Project Ideas:**

Here are some project ideas for individual or group work:

**Email Address Validator:**


Build a program that validates email addresses using regular expressions.
Ensure the email addresses adhere to standard formats and rules.

**Data Extraction Tool:**


Create a tool that extracts specific data from a given text using regular expressions.
For example, extract phone numbers, dates, or URLs from a text document.

**Web Scraper with Regex:**

Develop a web scraper that utilizes regular expressions to extract desired information from web pages.
Extract product names, prices, or other relevant data from e-commerce websites.

**Log File Analyzer:**

Build a log file analyzer that parses log files using regular expressions.
Extract important information such as error messages, timestamps, and IP addresses.
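
As a possible starting point, the sketch below parses lines in a hypothetical log format (timestamp, level, IP address, message). The format, the name LOG_LINE, and the sample lines are all assumptions; real log files will need their own pattern.

```python
import re

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL IP message"
LOG_LINE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+) '
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) '   # shape check only, not a 0-255 range check
    r'(?P<message>.*)$'
)

sample_lines = [
    "2024-01-15 10:32:01 ERROR 192.168.0.7 Connection refused",
    "2024-01-15 10:32:05 INFO 10.0.0.2 Request completed",
    "malformed line",
]

errors = []
for line in sample_lines:
    match = LOG_LINE.match(line)
    # Skip lines that do not fit the format, keep only ERROR entries.
    if match and match.group('level') == 'ERROR':
        errors.append(match.groupdict())

print(errors)
```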

**Text Search and Replace Tool:**

Create a text search and replace tool that allows users to search for specific patterns using regular expressions and replace them with desired text.
Provide options for case-sensitive or case-insensitive searches.

**URL Validator and Parser:**

Develop a program that validates URLs and parses them into components (scheme, domain, path, query parameters, etc.) using regular expressions.
Ensure that the URLs follow standard formats and protocols.

**Password Strength Checker:**

Build a tool that checks the strength of passwords using regular expressions.
Define criteria for strong passwords (e.g., minimum length, use of special characters, etc.) and validate user input accordingly.
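
One way to start is with lookahead assertions, each checking a single rule. The criteria below (at least 8 characters, one lowercase letter, one uppercase letter, one digit, one punctuation character) and the names STRONG_PASSWORD and is_strong are assumptions for illustration, not a recommended security policy.

```python
import re

STRONG_PASSWORD = re.compile(
    r'^(?=.*[a-z])'      # at least one lowercase letter
    r'(?=.*[A-Z])'       # at least one uppercase letter
    r'(?=.*\d)'          # at least one digit
    r'(?=.*[^\w\s])'     # at least one character that is not a word character or whitespace
    r'.{8,}$'            # at least 8 characters in total
)

def is_strong(password):
    """Return True if the password meets all of the assumed criteria."""
    return STRONG_PASSWORD.fullmatch(password) is not None

for candidate in ["Passw0rd!", "password", "Sh0rt!", "NoDigits!!"]:
    print(candidate, is_strong(candidate))
```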

**CSV Data Validator:**

Create a program that validates CSV data using regular expressions.
Ensure that the CSV data conforms to specified formats and rules (e.g., correct number of columns, proper data types, etc.).