## What are Regular Expressions?
Regular expressions (regex or regexp) are sequences of characters that define a search pattern. They are used for pattern matching within strings.

## Real-World Applications:

    - Data validation: Validating user input, such as email addresses, phone numbers, or dates
    - Text processing: Extracting information from unstructured text data
    - Web scraping: Parsing HTML or XML documents to extract specific data
    - Log analysis: Searching and extracting relevant information from log files
    - Data cleaning: Removing unwanted characters or formatting from text data
    - Code analysis: Searching for patterns or specific constructs in source code

Regular expression pattern for email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

## Basic Syntax of Regular Expressions

### Literal Characters:
    - Matches exactly what you type. For example, cat matches the string "cat".

### Special Characters:

    .: Matches any character except a newline.
    *: Matches 0 or more repetitions of the preceding element.
    +: Matches 1 or more repetitions of the preceding element.
    ?: Matches 0 or 1 repetition of the preceding element.
    {n}: Matches exactly n repetitions of the preceding element.
    {n,}: Matches n or more repetitions.
    {n,m}: Matches between n and m repetitions.
    []: Matches any one of the characters inside the brackets. For example, [aeiou] matches any vowel.
    ^: Matches the start of the string.
    $: Matches the end of the string.
    |: Alternation, matches either the pattern before or after the |.
    (): Groups patterns together.
    \: Escapes special characters.
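
A few of the quantifiers and anchors listed above can be tried out with a short sketch. The sample strings and variable names here are made up for illustration.

```python
import re

# Sample strings below are illustrative assumptions.

# '?' makes the preceding 'u' optional, so both spellings match.
colour_matches = re.findall(r"colou?r", "color colour colr")

# '+' requires at least one 'o'; '*' also accepts zero.
plus_matches = re.findall(r"go+d", "gd god good")
star_matches = re.findall(r"go*d", "gd god good")

# '^' and '$' anchor the pattern to the start and end of the string.
anchored = re.search(r"^\d{4}$", "2024")
not_anchored = re.search(r"^\d{4}$", "year 2024")

print(colour_matches)  # ['color', 'colour']
print(plus_matches)    # ['god', 'good']
print(star_matches)    # ['gd', 'god', 'good']
print(anchored is not None, not_anchored is None)  # True True
```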

### Character Classes:

    \d: Matches any digit (0-9).
    \w: Matches any word character (letters, digits, and the underscore).
    \s: Matches any whitespace character.

## Creating and Using Regular Expressions in Python

### Importing the re Module:

In [None]:
import re

## Basic Functions:

    - re.search(pattern, string): Searches for the first location where the pattern matches in the string.
    - re.match(pattern, string): Checks for a match only at the beginning of the string.
    - re.findall(pattern, string): Finds all substrings where the pattern matches and returns them as a list.
    - re.sub(pattern, repl, string): Replaces the matches with the replacement string.

In Python, the re module provides functions for working with regular expressions. One of these functions is re.search(),
which searches for a pattern within a string. When re.search() finds a match, it returns a match object; otherwise it returns None.
The match.group() method retrieves the matched portion of the string from the match object.
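
Because re.search() returns None when nothing matches, calling .group() on the result without a check raises an AttributeError. One way to guard against that is sketched below; `first_match` is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = re.search(pattern, text)
    # Guard before calling .group(), since match may be None.
    return match.group() if match else None

print(first_match(r"fox", "The quick brown fox"))  # fox
print(first_match(r"cat", "The quick brown fox"))  # None
```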


## re.search()

In [4]:

sentence = "The quick brown fox jumps over the lazy dog on 12/12/2020 at 10:00 AM."

# Find the literal word "fox"
pattern = r"fox"
match = re.search(pattern, sentence)
print(match.group())


fox



## Basic Syntax - Metacharacters


| Symbol|Description|
|------|-----------|
|`.`| Dot/period matches any single character except newline.|
|`^`| Caret matches the start of a string.|
|`$`| Dollar matches the end of a string.|


In [None]:
pattern = r"h.llo"
string1 = "hello"
string2 = "hallo"

# Using the re.search() function to search string1 and string2 for a match to the pattern
match1 = re.search(pattern, string1)
match2 = re.search(pattern, string2)

print(match1)  # Output: <re.Match object; span=(0, 5), match='hello'>
print(match2)  # Output: <re.Match object; span=(0, 5), match='hallo'>

In [None]:
pattern = r"^hello$"
string1 = "hello"
string2 = "hello world"

# Using the re.search() function to search string1 and string2 for a match to the pattern
match1 = re.search(pattern, string1)
match2 = re.search(pattern, string2)

print(match1)  # Output: <re.Match object; span=(0, 5), match='hello'>
print(match2)  # Output: None

## Basic Syntax - Shorthand Character Classes

| Class|Description|
|------|-----------|
| `\d`|Matches any digit (0-9).|
| `\w`|Matches any word character (a-z, A-Z, 0-9, _).|
| `\s`|Matches any whitespace character (space, tab, newline).|

In [None]:
pattern = r"\d"
string = "Class starts at 08:00 and ends at 10:50"
matches = re.findall(pattern, string)
print(matches)  # Output: ['0', '8', '0', '0', '1', '0', '5', '0']

## Basic Syntax - Negated Shorthand Character Classes

| Class|Description|
|------|-----------|
|`\D`| Matches any **non**-digit character.|
|`\W`| Matches any **non**-word character.|
|`\S`| Matches any **non**-whitespace character.|

In [None]:
pattern = r"\D"
string = "Class starts at 08:00 and ends at 10:50"
matches = re.findall(pattern, string)
print(matches)  # Output: ['C', 'l', 'a', 's', 's', ' ', 's', 't', 'a', 'r', 't', 's', ' ', 'a', 't', ' ', ':', ' ', 'a', 'n', 'd', ' ', 'e', 'n', 'd', 's', ' ', 'a', 't', ' ', ':']

## Braces

Braces '{}' specify an exact count or a range of occurrences.


| Syntax|Description|
|------|-----------|
|`{n}`| Exact count: matches exactly n occurrences of the preceding character or group.|
|`{n,m}`| Range: matches between n and m occurrences of the preceding character or group.|



In [None]:
pattern = r"a{3}"
string1 = "aaa"
string2 = "aa"
string3 = "aaaa"
match1 = re.search(pattern, string1)
match2 = re.search(pattern, string2)
match3 = re.search(pattern, string3)
print(match1)  # Output: <re.Match object; span=(0, 3), match='aaa'>
print(match2)  # Output: None
print(match3)  # Output: <re.Match object; span=(0, 3), match='aaa'>
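
The range forms `{n,m}` and `{n,}` can be illustrated in the same way. This is a minimal sketch with a made-up sample string; the `\b` word boundaries are added so that only whole runs of "a" are reported.

```python
import re

text = "a aa aaa aaaa"  # illustrative sample string

# {2,3}: runs of two or three 'a' characters (whole words only, because of \b)
range_matches = re.findall(r"\ba{2,3}\b", text)

# {3,}: runs of three or more 'a' characters
open_matches = re.findall(r"\ba{3,}\b", text)

print(range_matches)  # ['aa', 'aaa']
print(open_matches)   # ['aaa', 'aaaa']
```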

## Grouping and Capturing

Parentheses `()` group characters together and create a capturing group.

Captured groups can be referred to later using backreferences `\1`, `\2`, etc., both inside the pattern and in a replacement string.

In [18]:
pattern = r"(\d{3}) (\d{3} \d{4})"
string = "My phone number is 078 123 4567"
match = re.search(pattern, string)
print(match.group())   # Output: 078 123 4567
print(match.group(1))  # Output: 078
print(match.group(2))  # Output: 123 4567

078 123 4567
078
123 4567
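
Backreferences are mentioned above but not demonstrated, so here is a small sketch. It assumes a made-up sentence with accidentally doubled words; `\1` must repeat exactly the text captured by the first group.

```python
import re

text = "This is is a test of the the backreference."  # illustrative sample

# (\w+) captures a word; " \1" requires the same word again after a space.
doubled_pattern = r"\b(\w+) \1\b"

repeated = re.findall(doubled_pattern, text)
cleaned = re.sub(doubled_pattern, r"\1", text)

print(repeated)  # ['is', 'the']
print(cleaned)   # This is a test of the backreference.
```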


In [None]:
sentence = "Contact us at support@example.com."
pattern = r"(\w+)@(\w+\.\w+)"
match = re.search(pattern, sentence)

if match:
    print(match.group())   # Output: support@example.com
    print(match.group(1))  # Output: support
    print(match.group(2))  # Output: example.com


Let's break down the meaning of each element of the pattern r"(\w+)@(\w+\.\w+)":

Group 1: "(\w+)"

   - "(": This opening parenthesis marks the beginning of a capturing group; the matching ")" closes it.
   - "\w": This matches a single word character. Word characters include letters (uppercase and lowercase), digits (0-9), and underscores (_).
   - "+": This quantifier indicates that the preceding element ("\w" in this case) must be matched one or more times.

Between the groups, "@" matches a literal at sign.

Group 2: "(\w+\.\w+)"

   - "(": Similar to group 1, this marks the beginning of a capturing group.
   - "\w+": As explained before, this matches one or more word characters.
   - "\.": This matches a single literal dot (.). Without the backslash, the dot would match any character.
   - "\w+": Another occurrence of one or more word characters.

## Alternation

The vertical bar `|` matches either the expression before it or the expression after it.

In [None]:
pattern = r"cat|dog"
string1 = "I have a cat"
string2 = "I have a dog"
string3 = "I have a bird"
match1 = re.search(pattern, string1)
match2 = re.search(pattern, string2)
match3 = re.search(pattern, string3)
print(match1)  # Output: <re.Match object; span=(9, 12), match='cat'>
print(match2)  # Output: <re.Match object; span=(9, 12), match='dog'>
print(match3)  # Output: None

## re.findall()
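The re.findall() function finds all non-overlapping matches of a pattern in a string and returns them as a list of strings. If the pattern contains one capturing group, the list holds the text of that group; with several groups, it holds tuples.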

In [None]:
# Find all digits

sentence = "The quick brown fox jumps over the lazy dog on 12/12/2020 at 10:00 AM."
pattern = r"\d+"
matches = re.findall(pattern, sentence)
print(matches)  # Output: ['12', '12', '2020', '10', '00']


In [None]:
sentence = "The quick brown fox jumps over the lazy dog on 12/12/2020 at 10:00 AM."

# Find words starting with 'b'
pattern = r"\bb\w+"
matches = re.findall(pattern, sentence)
print(matches)  # Output: ['brown']

# Find any three-letter word
pattern = r"\b\w{3}\b"
matches = re.findall(pattern, sentence)
print(matches)  # Output: ['The', 'fox', 'the', 'dog']
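
The result of re.findall() changes shape when the pattern contains capturing groups. The sketch below reuses the sample sentence from above to show the difference; the variable names are illustrative.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog on 12/12/2020 at 10:00 AM."

# No groups: each list item is the whole match.
whole = re.findall(r"\d{2}:\d{2}", sentence)

# Two groups: each list item is a tuple of the captured groups.
parts = re.findall(r"(\d{2}):(\d{2})", sentence)

print(whole)  # ['10:00']
print(parts)  # [('10', '00')]
```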

## Python Regex Functions - `re.finditer()`

- `re.finditer()` returns an iterator yielding Match objects for all non-overlapping matches, which gives access to positions as well as the matched text.

In [None]:
pattern = r"\d+"
string = "I have 2 apples and 3 oranges"
matches = re.finditer(pattern, string)
for match in matches:
    print(match.group())  # Prints 2, then 3 (one per line)
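
Since each item from re.finditer() is a Match object, it also carries position information. A minimal sketch using the same sample string:

```python
import re

string = "I have 2 apples and 3 oranges"

# Collect the matched text together with its start and end offsets.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", string)]

print(positions)  # [('2', 7, 8), ('3', 20, 21)]
```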

## Basic Syntax - Character Classes

- **Definition:** A set of characters enclosed in square brackets [ ] that matches any one character in the set. 


| Class|Description|
|------|-----------|
| `[a-z]`|Matches any lowercase letter.|
| `[A-Z]`|Matches any uppercase letter.|
| `[0-9]`|Matches any digit.|


In [None]:
pattern = r"[aeiou]"
string = "hello"
matches = re.findall(pattern, string)
print(matches)  # Output: ['e', 'o']
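
Character classes can also be negated with a leading `^` and can combine several ranges. The following sketch uses made-up sample strings to show both.

```python
import re

# [^...] matches any one character NOT in the set.
# Here: anything that is not a vowel, whitespace, or a digit.
consonants = re.findall(r"[^aeiou\s\d]", "hello world 42")

# Several ranges can be combined in one class, e.g. hexadecimal digits.
hex_runs = re.findall(r"[0-9a-fA-F]+", "ff 7G 1a2B")

print(consonants)  # ['h', 'l', 'l', 'w', 'r', 'l', 'd']
print(hex_runs)    # ['ff', '7', '1a2B']
```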

## Named Groups

- Named groups allow you to assign names to capturing groups using the syntax `(?P<name>...)`
- They can be referenced using the group name instead of the group number

In [2]:
import re 
pattern = r"(?P<first_name>\w+) (?P<last_name>\w+)"
string = "John Doe"
match = re.search(pattern, string)
print(match.group("first_name"))  # Output: John
print(match.group("last_name"))  # Output: Doe

John
Doe


In [7]:
pattern = r"(?P<first_part>\d{3}) (?P<second_part>\d{3} \d{4})"
string = "My phone number is 078 123 4567"
match = re.search(pattern, string)
print(match.group('first_part'))  # Output: 078
print(match.group('second_part'))  # Output: 123 4567

078
123 4567


## re.match()
The re.match() function looks for the pattern only at the beginning of the string.
It returns a match object if the pattern matches starting at position 0 (the match does not have to cover the entire string); otherwise it returns None.


In [27]:
import re

text = "This is a string to search."

# "This" is at the beginning of the string, so re.match() succeeds
match_search = re.match("This", text)

# "is" occurs in the string, but not at the beginning, so re.match() returns None
match_match = re.match("is", text)

print(match_search)  # Output: <re.Match object; span=(0, 4), match='This'>

print(match_match)  # Output: None



<re.Match object; span=(0, 4), match='This'>
None
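
To make the difference concrete, the sketch below compares re.match() with re.search() and with re.fullmatch(), which is the function that does require the whole string to match. It reuses the sample text from the cell above.

```python
import re

text = "This is a string to search."

at_start = re.match(r"This", text)       # anchored at position 0
anywhere = re.search(r"is", text)        # first occurrence anywhere
whole = re.fullmatch(r"This", text)      # must cover the entire string
whole_ok = re.fullmatch(r"This.*", text)

print(at_start.span())  # (0, 4)
print(anywhere.span())  # (2, 4), the "is" inside "This"
print(whole)            # None
print(whole_ok is not None)  # True
```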


## Example: Extracting Phone Numbers

Use regular expressions to find all phone numbers in the text. Assume phone numbers can be in formats like:

    (123) 456-7890
    123-456-7890
    123.456.7890
    1234567890

In [None]:
import re

# Sample text
text = """
Contact us at (123) 456-7890 or 123-456-7890 or 123.456.7890 or 1234567890.
"""

# Regular expression pattern for phone numbers
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

phone_numbers = re.findall(phone_pattern, text)
print(phone_numbers)  # Output: ['(123) 456-7890', '123-456-7890', '123.456.7890', '1234567890']
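
Once the numbers are extracted, a common follow-up is to normalize them to a single format. This is a minimal sketch that assumes stripping every non-digit character is acceptable for the data at hand.

```python
import re

phone_numbers = ['(123) 456-7890', '123-456-7890', '123.456.7890', '1234567890']

# \D matches any non-digit, so replacing it with "" keeps only the digits.
normalized = [re.sub(r"\D", "", number) for number in phone_numbers]

print(normalized)  # ['1234567890', '1234567890', '1234567890', '1234567890']
```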


## Python Regex Substitution Functions - `re.sub()` and `re.subn()` 

- `re.sub()`: Replaces all occurrences of the pattern with a replacement string
- `re.subn()`: Same as `re.sub()` but also returns the number of replacements made

In [3]:
pattern = r"\d+"
string = "I have 2 apples and 3 oranges"
new_string = re.sub(pattern, "N", string)
print(new_string)  # Output: I have N apples and N oranges

I have N apples and N oranges


In [4]:
pattern = r"\d+"
string = "I have 2 apples and 3 oranges"
new_string, count = re.subn(pattern, "N", string)
print(new_string)  # Output: I have N apples and N oranges
print(count)  # Output: 2

I have N apples and N oranges
2
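
The replacement passed to re.sub() does not have to be a fixed string. It can reference captured groups with `\1`, `\2`, and so on, or it can be a function that receives each Match object. The sketch below shows both; `double` is a hypothetical helper name.

```python
import re

# Group references in the replacement: reorder MM/DD/YYYY into YYYY-MM-DD.
iso_date = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\1-\2", "05/20/2024")

def double(match):
    """Hypothetical helper: return the matched number doubled, as a string."""
    return str(int(match.group()) * 2)

# Function as replacement: called once per match.
doubled = re.sub(r"\d+", double, "I have 2 apples and 3 oranges")

print(iso_date)  # 2024-05-20
print(doubled)   # I have 4 apples and 6 oranges
```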


## Python Regex Splitting Strings Functions - `re.split()`

- `re.split()`: Splits the string by the occurrences of the pattern

In [None]:
pattern = r"\d+"
string = "I have 2 apples and 3 oranges"
result = re.split(pattern, string)
print(result)  # Output: ['I have ', ' apples and ', ' oranges']
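
A few variations on re.split() are sketched below with made-up input: splitting on several delimiters at once, limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping the pattern in a capturing group.

```python
import re

# Split on a comma or semicolon, followed by optional whitespace.
fields = re.split(r"[,;]\s*", "apples, oranges;pears,  plums")

# maxsplit=1 stops after the first split.
head_tail = re.split(r"\s+", "one two three four", maxsplit=1)

# A capturing group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"(\d+)", "I have 2 apples and 3 oranges")

print(fields)       # ['apples', 'oranges', 'pears', 'plums']
print(head_tail)    # ['one', 'two three four']
print(with_delims)  # ['I have ', '2', ' apples and ', '3', ' oranges']
```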

## Example: Extract Information from the Sentence

Given the following sentence:
"Contact us at support@example.com or visit us at http://example.com on 05/20/2024."
    1. Find and extract the email address.
    2. Find and extract the URL.
    3. Find and extract the date.

In [None]:
sentence = "Contact us at support@example.com or visit us at http://example.com on 05/20/2024."

# Find the email address
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
email_match = re.search(email_pattern, sentence)
print(email_match.group())  # Output: support@example.com

# Find the URL
url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
url_match = re.search(url_pattern, sentence)
print(url_match.group())  # Output: http://example.com

# Find the date
date_pattern = r"\d{2}/\d{2}/\d{4}"
date_match = re.search(date_pattern, sentence)
print(date_match.group())  # Output: 05/20/2024


## Best Practices

- Compile regular expressions for better performance, especially when using them multiple times
- Use raw strings `r"..."` to avoid escaping backslashes
- Keep regular expressions readable and maintainable by using comments and whitespace (for example, with the `re.VERBOSE` flag)
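
One way to add comments and whitespace to a pattern is the `re.VERBOSE` flag, under which unescaped whitespace in the pattern is ignored and `#` starts a comment. The sketch below rewrites the date pattern from the earlier example this way; the group names are chosen here for illustration.

```python
import re

# With re.VERBOSE, whitespace and '#' comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (?P<month>\d{2})   # two-digit month
    /
    (?P<day>\d{2})     # two-digit day
    /
    (?P<year>\d{4})    # four-digit year
""", re.VERBOSE)

match = date_pattern.search("Visit us on 05/20/2024.")
date_parts = match.groupdict()

print(date_parts)  # {'month': '05', 'day': '20', 'year': '2024'}
```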

## Compiling Regular Expressions

- `re.compile()` compiles a regular expression pattern into a regex object that can be reused
- This is useful when you want to use the same pattern multiple times
- It can improve performance in that case, although the `re` module also caches recently used patterns, so the gain is often modest


In [None]:
pattern = re.compile(r"\d+")
string = "I have 2 apples and 3 oranges"
matches = pattern.findall(string)
print(matches)  # Output: ['2', '3']
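
Reuse is where a compiled pattern pays off. The sketch below applies one compiled pattern to several strings; the list of lines is made up for illustration.

```python
import re

# Compile once, then call the pattern object's own methods repeatedly.
time_pattern = re.compile(r"\b\d{2}:\d{2}\b")

# Hypothetical input lines.
lines = ["Class starts at 08:00", "No time here", "Class ends at 10:50"]

times = []
for line in lines:
    match = time_pattern.search(line)
    if match:  # skip lines without a time
        times.append(match.group())

print(times)  # ['08:00', '10:50']
```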

## Best Practices - Avoiding Complex Regular Expressions


- Regular expressions can become complex and hard to read
- Sometimes, simpler solutions like string methods or list comprehensions can be more appropriate
- Consider the readability and maintainability of your code
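
As a small illustration of this point, the sketch below does the same job twice: once with plain string methods and once with a regular expression. The file name is a made-up example.

```python
import re

filename = "report_2024.csv"  # hypothetical file name

# Plain string methods: often enough for simple checks.
is_csv = filename.endswith(".csv")
stem, _, extension = filename.rpartition(".")

# Regex equivalent: more flexible, but harder to read at a glance.
regex_match = re.fullmatch(r"(.+)\.(\w+)", filename)

print(is_csv)                # True
print(stem, extension)       # report_2024 csv
print(regex_match.groups())  # ('report_2024', 'csv')
```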

## Resources
- Python re module documentation: https://docs.python.org/3/library/re.html
- Regular Expression HOWTO: https://docs.python.org/3/howto/regex.html
- Regular Expression 101 (online regex tester): https://regex101.com/
- Regular Expressions Cookbook (book): https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/

## Conclusion

Regular expressions are a powerful tool for pattern matching and text manipulation. 
Understanding their syntax and how to use them in Python will greatly enhance your ability 
to process and analyze text data.