# Introduction to Regular Expressions (Regex)

<img src="https://miro.medium.com/max/2392/0*1-i9w0e4kklVQl5B.jpg">

It is often estimated that about 80% of all data is **unstructured**.

And **unstructured** data is basically **text data**!

Text is present in every major business process, from support tickets to product feedback and customer interactions.

There is no doubt that text analysis has a broad range of business applications and use cases:
* Understanding customers
* Risk management
* Prediction and prevention of crime
* Personalized advertising
* ...

## Introduction

Regular Expressions, often abbreviated as Regex, are essential tools in data analytics for processing and manipulating textual data. Regex provides a method for searching and manipulating strings using a specialized syntax that defines patterns.

## What is Regex?

Regex is a sequence of characters that forms a search pattern. It can be used for performing various text processing tasks such as:

- **Pattern Matching**: Searching for specific patterns within text.
- **Data Validation**: Ensuring that data is in a correct format.
- **Data Extraction**: Extracting specific portions from text based on patterns.
- **Text Substitution**: Replacing parts of text using pattern matching.
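
As a rough illustration of the four tasks listed above, here is a minimal sketch using Python's `re` module. The sample text, the order-id format, and the variable names are invented for this example.

```python
import re

# Hypothetical sample text, invented for this illustration
text = "Order 1234 shipped on 2022-12-25 to anna@example.com"

# Pattern matching: is there a run of digits anywhere in the text?
has_digits = re.search(r"\d+", text) is not None

# Data validation: does the whole string look like a 4-digit order id?
is_valid_id = re.fullmatch(r"\d{4}", "1234") is not None

# Data extraction: pull out every date written as yyyy-mm-dd
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# Text substitution: mask every digit with '#'
masked = re.sub(r"\d", "#", text)

print(has_digits, is_valid_id, dates)
print(masked)
```

Each task maps to a different function (`search`, `fullmatch`, `findall`, `sub`), but all of them take the same kind of pattern string.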

## Historical Context of Regular Expressions in Data Analytics

Regular Expressions (Regex) have a rich history dating back to the 1950s, with foundational concepts developed by the American mathematician Stephen Kleene. This section aims to provide a deeper historical insight into Regex and its evolution in the context of data analytics.

## The Origin of Regular Expressions

- **Stephen Kleene's Contribution**: In 1956, Stephen Kleene introduced the concept of Regular Expressions in his paper "Representation of Events in Nerve Nets and Finite Automata" in the book "Automata Studies", edited by Claude Shannon and John McCarthy. This formalized the description of regular languages.
- **Unix and Regex**: The widespread use of Regex began with Unix text processing utilities like `ed`, an editor, and `grep` (global regular expression print), a filter. These tools made Regex a fundamental part of text processing and pattern recognition in computing.

## Regex in Computing

- **Regular Expression Processor**: A classical regular expression processor translates a regular expression into a nondeterministic finite automaton (NFA), where several states can result from a given state and symbol. This automaton is then made deterministic (with only one possible state transition for a particular symbol) and is used to recognize substrings that match the regular expression. Not every engine works this way: Python's `re` module, used throughout this notebook, relies on a backtracking matcher instead.


## The Role of Regex in Data Analytics

In data analytics, Regex becomes particularly powerful in scenarios like:

- **Text Data Preprocessing**: Cleaning and standardizing text data for analysis.
- **Log File Analysis**: Parsing log files to extract relevant information.
- **Natural Language Processing**: Identifying and manipulating linguistic patterns.
- **Data Scraping**: Extracting information from unstructured data sources.
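
To make the log file scenario concrete, the sketch below parses two made-up log lines into dictionaries. The log format (date, time, level, message) and all names are assumptions chosen for illustration; real logs will need their own pattern.

```python
import re

# Hypothetical log lines; the format is an assumption made for this sketch
log_lines = [
    "2024-01-15 10:32:07 ERROR Disk quota exceeded",
    "2024-01-15 10:32:09 INFO  Backup finished",
]

# Groups: date, time, level, then the rest of the line as the message
log_pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)\s+(.*)")

records = []
for line in log_lines:
    match = log_pattern.match(line)
    if match:
        date, time, level, message = match.groups()
        records.append({"date": date, "time": time, "level": level, "message": message})

# Keep only the messages logged at ERROR level
errors = [r["message"] for r in records if r["level"] == "ERROR"]
print(errors)
```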

## Regex Syntax and Operations

Regex operations are based on a unique syntax that includes a variety of special characters and constructs. Some common elements include:

- `.`: Matches any character except a newline (by default)
- `*`: Zero or more occurrences
- `+`: One or more occurrences
- `?`: Zero or one occurrence
- `[ ]`: A set of characters
- `{ }`: A specific number of occurrences
- `( )`: Capture and group
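
The elements above can be tried out one at a time. The following sketch runs `re.findall` on a small sample for each element; the sample strings are invented for illustration.

```python
import re

# Each entry: (pattern, sample string); the samples are invented for illustration
examples = [
    (r"c.t", "cat cut ct"),        # '.' needs exactly one character between c and t
    (r"ab*", "a ab abbb"),         # '*' allows zero or more b's
    (r"ab+", "a ab abbb"),         # '+' requires at least one b
    (r"colou?r", "color colour"),  # '?' makes the u optional
    (r"[aeiou]", "regex"),         # '[ ]' matches one character from the set
    (r"\d{3}", "12 345 6789"),     # '{3}' requires exactly three digits
    (r"(\w+)@(\w+)", "ana@corp"),  # '( )' captures the two parts separately
]

results = {pattern: re.findall(pattern, sample) for pattern, sample in examples}
for pattern, found in results.items():
    print(f"{pattern!r:15} -> {found}")
```

Note that `\d{3}` finds `678` inside `6789`: without anchors or boundaries, a pattern is free to match part of a longer run.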

## Resources and Further Reading

For those new to Regex, or looking to deepen their understanding, the following resources can be invaluable:

- [Regular Expressions Quick Start](https://www.regular-expressions.info/quickstart.html)
- [RegexOne - Learn Regular Expressions with simple, interactive exercises](https://regexone.com/)
- [Python's `re` Module Documentation](https://docs.python.org/3/library/re.html)

By mastering Regex, data analysts can perform more sophisticated text analysis and data manipulation, enhancing their overall data analysis capabilities.



In [2]:
# Importing the regular expression module from Python's standard library
import re

# Strings to be searched for matching regex patterns
str1 = "varks Aard belíng to the Captain"
str2 = "Albert's famous equation, E = mc^2."
str3 = "Located at 455 Serra Mall."
str4 = "Beware of the shape-shifters!"

# Creating a list of strings to test regex patterns on
test_strings = [str1, str2, str3, str4]

In [4]:
# Looping through each string in the test_strings list
for test_string in test_strings:
    # Printing the test string for reference
    print('\nThe test string is "' + test_string + '"')
    
    # Using re.search() to find the first location where the regex pattern '[í]' matches
    # '[í]' is a regex pattern that searches for the character 'í' in the string
    match = re.search('[í]', test_string)

    # Checking if a match was found
    if match:
        # Printing the matched character if a match is found
        # match.group() returns the part of the string where there is a match
        print('- The first possible match is: ' + match.group())
    else:
        # Indicating that no match was found if the regex pattern doesn't match any part of the string
        print('- ** no match. **')


The test string is "varks Aard belíng to the Captain"
- The first possible match is: í

The test string is "Albert's famous equation, E = mc^2."
- ** no match. **

The test string is "Located at 455 Serra Mall."
- ** no match. **

The test string is "Beware of the shape-shifters!"
- ** no match. **


Let's break down the code above line by line for better understanding:

### 1. Iterating Over Each String in the List
```python
for test_string in test_strings:
```
`test_strings` is a list containing multiple strings. In this `for` loop, we iterate over each element of the list. During each iteration, `test_string` refers to the current string being processed.

### 2. Printing the Current Test String
```python
print('\nThe test string is "' + test_string + '"')
```
This line simply outputs the current string being examined in the loop. It helps in tracking which string the regex is being applied to.

### 3. Searching for a Pattern in the String
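```python
match = re.search('[í]', test_string)
```
Here, `re.search()` is used to find the first location within `test_string` where the regex pattern `[í]` matches. This character set contains a single character, so the pattern looks for the first 'í' in the string. Replacing it with `[A-Z]` would instead look for any uppercase letter from A to Z. The function returns a match object (shown as `re.Match` in current Python 3 versions and as `SRE_Match` in older ones) if a match is found, and `None` if no match is found.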

### 4. Checking for a Match and Printing the Result
```python
if match:
    print('The first possible match is: ' + match.group())
else:
    print('no match.')
```

In this section, we check whether `match` is a match object or `None`. If it is a match object, a match was found, and we print the matched substring using `match.group()`. `group()` is a method of match objects that returns the part of the string where the match was found. If `match` is `None`, no match was found, and we print a message stating so.

### 5. Understanding the Behavior of `re.search()`
- Single Character Match: A character set such as `[í]` or `[A-Z]` matches exactly one character, so `match.group()` holds a single character here.
- First Match Only: `re.search()` stops searching after finding the first match. It does not continue to look for further matches in the string.

### 6. Alternative: Finding All Matches
If the goal is to find all matches of a pattern in a string, `re.findall()` should be used instead. This function returns a list containing all the matches of the pattern in the string, not just the first one.

```python
matches = re.findall(r'[A-Z]', test_string)
```

This would return a list of all uppercase characters found in `test_string`.

In [None]:
# As a recap of the test_string
test_strings = [
    "varks Aard belíng to the Captain",
    "Albert's famous equation, E = mc^2.",
    "Located at 455 Serra Mall.",
    "Beware of the shape-shifters!"
]

# Looping through each string in the test_strings list
for string in test_strings:
    # Printing the current string
    print(string)
    
    # Using re.findall to search for all occurrences of the regex pattern in the string
    # The pattern '[A-Z]' matches any uppercase letter from A to Z
    matches = re.findall(r'[A-Z]', string)
    
    # Printing the list of matches found in the current string
    # Each match is an uppercase letter from the string
    print("-", matches, "\n")


When working with regular expressions in Python, you have the option to compile your regex patterns into pattern objects. This can improve performance, especially if you're going to use the same pattern multiple times. Precompilation turns your regex pattern into a pattern object (`re.Pattern` in current Python 3 versions, displayed as `SRE_Pattern` in older ones), which can then be used to perform match, search, and other operations.

## Why Precompile Regex?

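- **Performance**: Compiling a pattern once and using it multiple times avoids repeated work. The gain is usually modest, because module-level functions such as `re.search()` already cache recently compiled patterns, but it can add up in tight loops.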
- **Organization**: If you have multiple patterns, compiling them into objects helps keep your code organized.
- **Reusability**: Once compiled, the same pattern object can be used in multiple match/search operations without recompilation.
- **Legibility**: It allows you to assign descriptive names to your patterns, making your code more readable.

## How to Precompile Regex Patterns

Here's an example of how to precompile regex patterns in Python:

```python
import re

# Precompile the pattern to match any uppercase letters
pattern_uppercase = re.compile(r'[A-Z]')

# Now you can use pattern_uppercase to search within strings
match = pattern_uppercase.search('Hello World')
if match:
    print('Uppercase letter found:', match.group())
```

In this code, `re.compile` is used to compile the regex pattern `[A-Z]` which matches any uppercase letter. The resulting `pattern_uppercase` object can be used to search through strings without having to recompile the pattern each time, leading to more efficient execution, particularly when dealing with large amounts of text or many searches.

You can even store multiple compiled patterns in a list and iterate over them, as shown in the following example:

In [None]:
# Import the regex module
import re

# Define a list of regex patterns
patterns = [
    '[ABC]',                # Matches any one of 'A', 'B', or 'C'
    '[^ABC]',               # Matches any character except 'A', 'B', or 'C'
    '[ABC^]',               # Matches 'A', 'B', 'C', or '^'
    '[0-9]',                # Matches any single digit from '0' to '9'
    '[0-4]',                # Matches any single digit from '0' to '4'
    '[A-Z]',                # Matches any uppercase letter from 'A' to 'Z'
    '[a-z]',                # Matches any lowercase letter from 'a' to 'z'
    '[A-Za-z]',             # Matches any letter regardless of case
    '[A-Za-z0-9]',          # Matches any alphanumeric character
    '[-a-z]',               # Matches '-' or any lowercase letter
    '[- a-z]'               # Matches '-', space, or any lowercase letter
]

# Compile the patterns into pattern objects for efficient matching
compiled_patterns = [re.compile(p) for p in patterns]

# Function to find the first match of a pattern in a given string
def find_match(compiled_pattern, string):
    match = compiled_pattern.search(string)  # Perform the search using the compiled pattern
    return match.group() if match else 'no match.'  # Return the matched text or 'no match.'

# List of test strings to match against
test_strings = [
    "ABC easy as 123",
    "Simple as do re mi",
    "ABC, 123, baby, you and me girl"
]

# Iterate over each string in the test_strings list
for test_string in test_strings:
    # Print the test string for clarity
    print(f"In: \"{test_string}\"")
    # Find and print the first match for each compiled pattern
    for compiled_pattern in compiled_patterns:
        # Retrieve the pattern's string representation for the output
        pattern_text = compiled_pattern.pattern
        # Find the first match for the pattern in the current test string
        match_text = find_match(compiled_pattern, test_string)
        # Print the pattern and its first match (or 'no match.')
        print(f' - The first potential match for "{pattern_text}" \t is: {match_text}')
    # Print a newline for better separation of output in the console
    print()

This Python script demonstrates how to use regular expressions (regex) to find patterns within strings. The comments in the code will help explain each step of the process.

## Defining Regex Patterns

We define a list of regex patterns. Each pattern is a string that specifies a rule for what constitutes a match:

- `[ABC]`: Matches any one of 'A', 'B', or 'C'.
- `[^ABC]`: Matches any character except 'A', 'B', or 'C'.
- `[ABC^]`: Matches 'A', 'B', 'C', or '^'.
- `[0-9]`: Matches any single digit from '0' to '9'.
- `[0-4]`: Matches any single digit from '0' to '4'.
- `[A-Z]`: Matches any uppercase letter from 'A' to 'Z'.
- `[a-z]`: Matches any lowercase letter from 'a' to 'z'.
- `[A-Za-z]`: Matches any letter regardless of case.
- `[A-Za-z0-9]`: Matches any alphanumeric character.
- `[-a-z]`: Matches '-' or any lowercase letter.
- `[- a-z]`: Matches '-', space, or any lowercase letter.

```python
patterns = [...]
```

## Compiling the Patterns

Each pattern in the list is compiled for efficient matching. This is particularly useful when a pattern is used multiple times.

```python
compiled_patterns = [re.compile(p) for p in patterns]
```

## Defining the `find_match` Function

The `find_match` function searches for the first match of a compiled pattern within a given string. If a match is found, it returns the matched text; otherwise, it returns 'no match.'

```python
def find_match(compiled_pattern, string):
    ...
```

## Test Strings

A list of test strings is defined, which will be searched for matches against the patterns.

```python
test_strings = [...]
```

## Performing the Pattern Matching

The script iterates over each string in `test_strings`, and for each string, it applies all the compiled regex patterns. For each pattern, it prints the first match found or 'no match.' if no match is found.

```python
for test_string in test_strings:
    print(f"In: \"{test_string}\"")
    for compiled_pattern in compiled_patterns:
        ...
    print()
```

## Working with Lists of Compiled Patterns

Individual compiled patterns can also be accessed by index:

```python
print(compiled_patterns[1])
print(compiled_patterns[1].pattern)
```

In this section, we demonstrate how to access and use individual compiled patterns from the list. The `compiled_patterns[1]` expression references the second compiled pattern in the list. By printing `compiled_patterns[1].pattern`, we can see the actual regex pattern as a string. Note that `patterns` itself holds plain strings, which have no `.pattern` attribute.

This process helps in identifying which part of the string matches the given patterns and is a practical way to learn and understand regex applications in Python.
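
For readers who want to run this step in isolation, here is a small self-contained sketch. It uses a shortened, hypothetical version of the pattern list; the rest follows the names used in the script above.

```python
import re

# A shortened, hypothetical version of the pattern list used for this sketch
patterns = ['[ABC]', '[^ABC]', '[0-9]']
compiled_patterns = [re.compile(p) for p in patterns]

second = compiled_patterns[1]
print(second)          # the compiled pattern object
print(second.pattern)  # the original pattern text it was built from

# The first character that is not 'A', 'B', or 'C' is the space after "ABC"
first_match = second.search("ABC easy as 123")
print(repr(first_match.group()))
```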

In [None]:
import re  # Importing the regular expression module

# Defining the string we are going to check
needle = 'needlers'

# Python approach: Using list comprehension and any()
# This line checks if the string 'needle' ends with any of the specified suffixes ('ly', 'ed', 'ing', 'ers')
# any() returns True if at least one of the conditions is True
print(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))

# On-the-fly Regular expression in Python
# This uses regular expressions to check if 'needle' ends with the specified suffixes
# The search() function looks through 'needle' for any match to the regular expression pattern
# bool() is used to convert the result to True or False
print(bool(re.search(r'(ly|ed|ing|ers)$', needle)))

# Compiled Regular expression in Python
# Compiling the regular expression pattern for faster reuse
# This is more efficient if the pattern is used multiple times
comp = re.compile(r'(ly|ed|ing|ers)$')
print(bool(comp.search(needle)))

# The %timeit commands are used in Jupyter Notebooks to measure the execution time of small code snippets
# -n 1000 specifies that the command will run 1000 times in each loop
# -r 50 indicates that there will be 50 such loops
# This is used to get a more accurate measure of execution time by averaging over multiple runs

# %timeit for the Python approach
%timeit -n 1000 -r 50 bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))

# %timeit for the on-the-fly regular expression
%timeit -n 1000 -r 50 bool(re.search(r'(ly|ed|ing|ers)$', needle))

# %timeit for the compiled regular expression
%timeit -n 1000 -r 50 bool(comp.search(needle))
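
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard-library `timeit` module. The helper function names and the `number`/`repeat` values below are choices made for this illustration, and the absolute timings will differ from machine to machine.

```python
import re
import timeit

needle = 'needlers'
suffixes = ('ly', 'ed', 'ing', 'ers')
comp = re.compile(r'(ly|ed|ing|ers)$')

def with_endswith():
    # Plain Python approach
    return any(needle.endswith(s) for s in suffixes)

def with_search():
    # On-the-fly regular expression
    return bool(re.search(r'(ly|ed|ing|ers)$', needle))

def with_compiled():
    # Precompiled regular expression
    return bool(comp.search(needle))

# number=1000 calls per run, repeat=5 runs: arbitrary values for this sketch
timings = {}
for func in (with_endswith, with_search, with_compiled):
    runs = timeit.repeat(func, number=1000, repeat=5)
    timings[func.__name__] = min(runs) / 1000  # best observed time per call, in seconds

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.3f} microseconds per call")
```

Taking the minimum of the repeated runs is a common way to reduce noise from other processes; it is a different summary from the mean that `%timeit` reports.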

### Summary of terms for regular expressions

### Regular Expressions Terminology

Regular expressions (regex or regexp) are powerful tools for pattern matching and text manipulation. Here's a detailed explanation of some commonly used terms and symbols in regular expressions:

- **[ ] (Character Set)**: Square brackets denote a character set, where one element inside must match. For example, `[abc]` would match any of the characters 'a,' 'b,' or 'c.' You can also specify ranges like `[0-9]` to match any digit.

- **| (Pipe or Alternation)**: The pipe symbol represents an "or" element. For instance, `a|b` would match either 'a' or 'b.'

- **{ } (Interval Quantifier)**: Curly braces are used to specify an interval or the number of times a pattern should repeat. For example, `a{2,4}` would match 'aa,' 'aaa,' or 'aaaa,' where the number of 'a's falls between 2 and 4.

- **\ (Backslash)**: The backslash is an escape character. Placed before a special character, it makes that character literal: for example, `\.` matches a period (.) rather than any character. Placed before certain letters, it introduces a special sequence such as `\d` or `\s`.

- **. (Dot)**: In the default mode, the dot matches any character except a newline. For instance, `a.b` would match 'axb,' 'a#b,' or 'a$b,' where 'x,' '#,' and '$' can be any character except a newline.

- **^ (Caret)**: The caret symbol matches the start of the string. In MULTILINE mode, it also matches immediately after each newline. For example, `^abc` would match 'abc' at the beginning of a line.

- **$ (Dollar Sign)**: The dollar sign matches the end of the string or just before the newline at the end of the string. In MULTILINE mode, it also matches before a newline. For instance, `xyz$` would match 'xyz' at the end of a line.

- **\* (Asterisk)**: The asterisk causes the resulting regex to match 0 or more repetitions of the preceding pattern, as many repetitions as are possible. For example, `ab*` will match 'a' or 'ab' followed by any number of 'b's.

- **+ (Plus)**: The plus sign causes the regex to match 1 or more repetitions of the preceding pattern. For instance, `ab+` will match 'a' followed by any non-zero number of 'b's but will not match just 'a.'

- **? (Question Mark)**: The question mark causes the regex to match 0 or 1 repetitions of the preceding pattern. For example, `ab?` will match either 'a' or 'ab.'
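
Several of the symbols above are easier to grasp when seen in action. The sketch below checks anchors, the interval quantifier, escaping, and alternation on small invented samples.

```python
import re

# Invented two-line sample for illustration
text = "abc one\nabc two"

# '^' without flags only matches at the very start of the string
start_only = re.findall(r"^abc", text)
# With re.MULTILINE it also matches right after each newline
every_line = re.findall(r"^abc", text, flags=re.MULTILINE)

# '$' with re.MULTILINE matches at the end of each line
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)

# '{2,4}' is greedy: it takes as many a's as allowed, up to four
greedy = re.findall(r"a{2,4}", "a aa aaaaa")

# '\.' is a literal period, while '.' matches any character except a newline
literal_dot = re.findall(r"a\.b", "a.b axb")
any_char = re.findall(r"a.b", "a.b axb")

# '|' chooses between alternatives
either = re.findall(r"cat|dog", "cat, dog, bird")

print(start_only, every_line, last_words)
print(greedy, literal_dot, any_char, either)
```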

#### Additional Character Classes:

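- **\d (Digit)**: Matches any decimal digit. With ASCII-only matching (bytes patterns or the `re.ASCII` flag) this is equivalent to the character class [0-9]; for Python 3 `str` patterns it also matches decimal digits from other scripts.

- **\D (Non-Digit)**: Matches any character that is not a decimal digit; with ASCII-only matching this is equivalent to the character class [^0-9].

- **\s (Whitespace)**: Matches any whitespace character, including space, tab, newline, carriage return, form feed, and vertical tab. With ASCII-only matching this is equivalent to the character class [ \t\n\r\f\v]; for `str` patterns it also matches other Unicode whitespace, such as the non-breaking space.

- **\S (Non-Whitespace)**: Matches any non-whitespace character. With ASCII-only matching this is equivalent to the character class [^ \t\n\r\f\v].

- **\w (Word Character)**: Matches any alphanumeric character or underscore. With ASCII-only matching this is equivalent to the character class [a-zA-Z0-9_]; for `str` patterns it also matches letters and digits from other alphabets, such as 'í'.

- **\W (Non-Word Character)**: Matches any character that is not alphanumeric or underscore; with ASCII-only matching this is equivalent to the character class [^a-zA-Z0-9_].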
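
In Python 3, these classes are Unicode-aware by default when the pattern is a `str`, so the bracket equivalents hold exactly only with the `re.ASCII` flag (or with bytes patterns). The following sketch shows the difference on invented samples; non-ASCII characters are written as escapes to keep the snippet unambiguous.

```python
import re

# Invented sample: ASCII digits followed by two Arabic-Indic digits (U+0664, U+0662)
sample = "room 42, \u0664\u0662"

unicode_digits = re.findall(r"\d+", sample)                 # default str behaviour
ascii_digits = re.findall(r"\d+", sample, flags=re.ASCII)   # \d restricted to [0-9]

# \w also matches accented letters by default ("caf\u00e9" is "café")
unicode_words = re.findall(r"\w+", "caf\u00e9 au lait")
ascii_words = re.findall(r"\w+", "caf\u00e9 au lait", flags=re.ASCII)

# \s covers spaces, tabs, and newlines; \S is its complement
tokens = re.findall(r"\S+", "a\tb\n c")

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words, tokens)
```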

Regular expressions can be a bit complex, but they are incredibly powerful for text processing tasks. For more comprehensive and complete documentation, refer to the [Python Regular Expression Documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).

### Regular Expressions Examples

Let's explore some basic regular expression patterns and their explanations:

- **Example 1 - Matching Digits ( \d )**:
  - Pattern: `\d+`
  - Explanation: This pattern will match one or more digits (0-9) in a string. For instance, it will match '123' in 'abc123xyz'.

- **Example 2 - Matching Email Addresses**:
  - Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
  - Explanation: This pattern matches a basic email address. It looks for a sequence of characters, followed by '@,' then another sequence, a dot, and a domain with at least two letters. For example, it matches 'user@example.com.'

- **Example 3 - Matching Dates (dd/mm/yyyy)**:
  - Pattern: `\d{2}/\d{2}/\d{4}`
  - Explanation: This pattern matches a date in the format 'dd/mm/yyyy,' where 'dd' represents the day, 'mm' represents the month, and 'yyyy' represents the year. For example, it matches '25/12/2022.'

- **Example 4 - Extracting URLs from Text**:
  - Pattern: `https?://\S+`
  - Explanation: This pattern captures URLs that start with 'http://' or 'https://' and continue until the first whitespace character. For example, it matches 'https://www.example.com' within a text.

- **Example 5 - Matching Words with Hyphens**:
  - Pattern: `\w+-\w+`
  - Explanation: This pattern matches words separated by hyphens. It looks for a sequence of alphanumeric characters, a hyphen, and another sequence of alphanumeric characters. For example, it matches 'cloud-based' in 'cloud-based solution.'

Feel free to use these examples to practice regular expressions. You can adjust the patterns and test them against different strings to gain a better understanding of how regex works.
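
As a starting point for that practice, the sketch below applies the five patterns to one invented sentence. The sample text and the last pattern, `\w+(?:-\w+)+`, are additions made for this illustration and are only one possible way to handle words with several hyphens.

```python
import re

# Invented sample text that contains one instance of each kind of item
text = ("Order 123 was placed on 25/12/2022 by user@example.com; "
        "see https://www.example.com/help for our cloud-based and "
        "state-of-the-art tools.")

digits = re.findall(r"\d+", text)
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
dates = re.findall(r"\d{2}/\d{2}/\d{4}", text)
urls = re.findall(r"https?://\S+", text)
hyphenated = re.findall(r"\w+-\w+", text)

# Variation (an assumption of this sketch): allow any number of hyphenated parts
multi_hyphen = re.findall(r"\w+(?:-\w+)+", text)

for name, found in [("digits", digits), ("emails", emails), ("dates", dates),
                    ("urls", urls), ("hyphenated", hyphenated),
                    ("multi_hyphen", multi_hyphen)]:
    print(f"{name}: {found}")
```

Two details are worth noticing: `\d+` also picks up the digits inside the date, and `\w+-\w+` splits 'state-of-the-art' into 'state-of' and 'the-art' because each match covers exactly one hyphen.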

# Exercise: Keep up the good work and practice regex

In [None]:
import re

# Exercise 1 - Matching Email Addresses
# Write a regex pattern to match email addresses and find all email addresses in the given text.
text1 = "Contact us at support@example.com or info@company.net for assistance."

# Exercise 2 - Extracting Dates
# Write a regex pattern to extract dates in the format 'dd/mm/yyyy' from the given text.
text2 = "Meeting scheduled for 25/12/2022. Please RSVP by 31/12/2022."

# Exercise 3 - Finding URLs
# Write a regex pattern to find URLs that start with 'http://' or 'https://'.
text3 = "Visit our website at http://www.example.com or explore more at https://www.sample-site.org."

# Exercise 4 - Extracting Phone Numbers
# Write a regex pattern to extract phone numbers in the format '(xxx) xxx-xxxx' from the given text.
text4 = "Contact us at (123) 456-7890 or reach us at (987) 654-3210 for inquiries."

# Exercise 5 - Matching Words with Hyphens
# Write a regex pattern to match words separated by hyphens.
text5 = "This is a text with cloud-based solutions and state-of-the-art technology."

# Exercise 6 - Custom Pattern
# Write your custom regex pattern and test it with a text of your choice.

# Exercise 7 - Matching Dates (mm/dd/yyyy)
# Write a regex pattern to match dates in the format 'mm/dd/yyyy' from the given text.
text7 = "Meeting scheduled for 12/25/2022. Please RSVP by 12/31/2022."

# Exercise 8 - Extracting Hashtags
# Write a regex pattern to extract hashtags (words starting with #) from the given text.
text8 = "Join the conversation with #topic1, #discussion, and #feedback."

# Exercise 9 - Extracting Mentioned Users
# Write a regex pattern to extract mentioned usernames (words starting with @) from the given text.
text9 = "Contact @user123 for assistance or follow @officialpage for updates."

# Exercise 10 - Matching Phone Numbers (with optional country code)
# Write a regex pattern to match phone numbers with an optional country code, in the format '(+x) xxx-xxx-xxxx' or '(xxx) xxx-xxxx', from the given text.
text10 = "Contact us at (+1) 123-456-7890 or (987) 654-3210 for inquiries."

# Define regex patterns for each exercise
pattern1 = r''
pattern2 = r''
pattern3 = r''
pattern4 = r''
pattern5 = r''
pattern7 = r''
pattern8 = r''
pattern9 = r''
pattern10 = r''

# Find and print matches for each exercise
print("Exercise 1 - Matching Email Addresses:")
print(re.findall(pattern1, text1))
print()

print("Exercise 2 - Extracting Dates:")
print(re.findall(pattern2, text2))
print()

print("Exercise 3 - Finding URLs:")
print(re.findall(pattern3, text3))
print()

print("Exercise 4 - Extracting Phone Numbers:")
print(re.findall(pattern4, text4))
print()

print("Exercise 5 - Matching Words with Hyphens:")
print(re.findall(pattern5, text5))
print()

print("Exercise 7 - Matching Dates (mm/dd/yyyy):")
print(re.findall(pattern7, text7))
print()

print("Exercise 8 - Extracting Hashtags:")
print(re.findall(pattern8, text8))
print()

print("Exercise 9 - Extracting Mentioned Users:")
print(re.findall(pattern9, text9))
print()

print("Exercise 10 - Matching Phone Numbers (with optional country code):")
print(re.findall(pattern10, text10))

### Capturing Groups in Regular Expressions

Capturing groups are a powerful feature in regular expressions that allow you to extract specific parts of a matched text. In Python, you can work with capturing groups using match objects and methods like `.groups()` and `.group()`.

#### What Are Capturing Groups?

A capturing group is a portion of a regex pattern enclosed in parentheses `( )`. It serves two main purposes:

1. **Grouping:** You can use parentheses to group parts of a pattern together. This is helpful for applying quantifiers like `*`, `+`, or `?` to multiple characters or subpatterns. For example, `(ab)+` will match one or more occurrences of 'ab' as a group.

2. **Extraction:** Capturing groups allow you to extract specific portions of the matched text. Each set of parentheses creates a separate capturing group, and you can access the matched content of these groups individually.

#### Using Capturing Groups in Python

In Python, when you use regular expressions, the resulting match object provides methods for working with capturing groups:

- **`.groups()`**: This method returns a tuple containing all the captured groups, starting with group 1. The entire match is not part of this tuple; use `.group(0)` (or `.group()`) to get it.

- **`.group(n)`**: To access a specific capturing group, you pass its index `n` to the `.group()` method. The index is based on the order of opening parentheses in the regex pattern, starting from 1. Index 0 refers to the entire match.

#### Example:

Let's illustrate the concept with an example:

In [None]:
import re

# Suppose we want to extract dates in the format "dd/mm/yyyy" from a text.
text = "Meeting scheduled for 25/12/2022 and 31/12/2022."

# Define the regex pattern with capturing groups for day, month, and year.
pattern = r'(\d{2})/(\d{2})/(\d{4})'

# Search for matches using the pattern.
matches = re.finditer(pattern, text)

# Iterate through the matches and access capturing groups.
for match in matches:
    # The entire match (0th group) is accessible as match.group(0).
    print(f"Full Match: {match.group(0)}")
    
    # Access individual capturing groups using match.group(n).
    day = match.group(1)
    month = match.group(2)
    year = match.group(3)
    
    print(f"Day: {day}, Month: {month}, Year: {year}")

In this example, we use capturing groups to extract day, month, and year components of the date. Students can see how to access these components using the `.group(n)` method and understand the utility of capturing groups in extracting specific information from text matched by a regex.
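
The difference between `.group(0)` and `.groups()` can be checked directly. This short sketch reuses the text and pattern from the example and also shows how `re.findall` behaves when the pattern contains capturing groups.

```python
import re

text = "Meeting scheduled for 25/12/2022 and 31/12/2022."
pattern = r'(\d{2})/(\d{2})/(\d{4})'

match = re.search(pattern, text)  # first date only

whole = match.group(0)       # the entire match
parts = match.groups()       # groups 1 to 3, without the entire match
day_year = match.group(1, 3) # several groups requested at once

# With capturing groups in the pattern, re.findall returns one tuple per match
all_parts = re.findall(pattern, text)

print(whole, parts, day_year)
print(all_parts)
```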

### Summary of Useful Functions for Regular Expressions

When working with regular expressions in Python, you can utilize several built-in functions provided by the `re` module. Here's an overview of these functions and what they do:

- **`re.match(pattern, string)`**: This function checks if the regex pattern matches at the beginning of the input string. It returns a match object if a match is found at the start of the string or `None` otherwise.

- **`re.search(pattern, string)`**: This function scans through a string, looking for any location where the regex pattern matches. It returns a match object for the first occurrence found or `None` if no match is found.

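- **`re.findall(pattern, string)`**: This function finds all non-overlapping substrings where the regex pattern matches in the input string and returns them as a list. If the pattern contains capturing groups, the list holds the captured groups (as strings or tuples) instead of the full matches.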

- **`re.finditer(pattern, string)`**: This function finds all non-overlapping substrings where the regex pattern matches in the input string and returns them as an iterator.

These functions are essential for various text processing tasks, enabling you to search for and manipulate patterns within strings efficiently.
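
A side-by-side run makes the differences between these four functions easier to see. The sample text and pattern below are invented for this sketch.

```python
import re

text = "cat bat rat"
pattern = r"[a-z]at"

at_start = re.match(pattern, text)     # matches 'cat' at position 0
not_at_start = re.match(r"bat", text)  # None: 'bat' is not at the start
first = re.search(r"bat", text)        # scans forward and finds 'bat'
everything = re.findall(pattern, text) # list of matched strings

# finditer yields match objects, so positions are available as well
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

print(at_start.group(), not_at_start, first.start())
print(everything)
print(spans)
```

`findall` is convenient when only the matched text matters; `finditer` is the better fit when positions or groups of each match are needed.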


### Use case examples using regular expressions

In [None]:
def check_sentences(pattern, sentences):
    """
    Check sentences against the provided regular expression pattern.

    Args:
        pattern (str): The regular expression pattern to match against.
        sentences (list of tuple): A list of tuples where each tuple contains a sentence and an expected result (True or False).

    Returns:
        None

    Prints:
        Feedback for each sentence based on whether it matches the pattern or not.
    """

    for sentence, expected_result in sentences:
        # re.match anchors only at the start of the string; patterns passed in should end with '$' to cover the whole string.
        is_match = bool(re.match(pattern, sentence))

        # Determine if the result matches the expected outcome.
        is_valid = is_match == expected_result

        # Print feedback based on the match and validity.
        if is_valid:
            result_message = 'Pass'
        else:
            result_message = 'Not Pass'

        validity_message = '(Valid)' if expected_result else '(Not Valid)'
        print(f'{result_message} --> {sentence} {validity_message}')

**Explanation of the `check_sentences` Function**

The `check_sentences` function is designed to evaluate a list of sentences against a provided regular expression pattern and provide feedback on whether each sentence matches the pattern as expected.

**Function Parameters**

- `pattern`: This parameter represents the regular expression pattern to which each sentence will be compared.

- `sentences`: This parameter is a list of tuples where each tuple contains a sentence and an expected result (True or False). The expected result indicates whether the sentence is expected to match the pattern or not.

**Function Purpose**

The primary purpose of this function is to check each sentence in the `sentences` list against the provided regular expression pattern and provide feedback based on whether the match is as expected. It assists in validating whether sentences conform to a particular pattern.

**Function Execution**

1. The function iterates through each tuple in the `sentences` list, extracting the sentence and its associated expected result.

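2. For each sentence, it uses the `re.match()` function to determine if the provided regular expression pattern matches at the beginning of the sentence. Because `re.match()` does not require the match to reach the end of the string, the patterns used with this function end in `$` so that the entire sentence is covered. The result of this match is stored in a variable.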

3. The function then evaluates whether the match is valid by comparing it to the expected result. If the match and expected result match, the sentence is considered valid; otherwise, it is not valid.

4. Based on the validity of the match, the function assigns a result message such as "Pass" or "Not Pass" for feedback.

5. The function also determines whether the sentence is "Valid" or "Not Valid" based on the expected result and prepares a validity message.

6. Finally, the function prints feedback for each sentence, indicating whether it passed or not, along with the sentence itself and whether it is considered valid or not valid.

**Function Output**

The function prints feedback messages for each sentence, providing insights into whether each sentence conforms to the specified regular expression pattern. This feedback assists in validating and verifying text data against a defined pattern.

The `check_sentences` function is a valuable tool for quality control and validation tasks involving text data, enabling the assessment of data integrity against predefined patterns or rules.
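
Since `re.match()` only anchors at the start of the string, validation depends on how the pattern ends. The sketch below compares three options on invented candidates; `re.fullmatch()` is a standard `re` function that is not used in the original function and is shown here only as an alternative.

```python
import re

candidates = ['123', '123abc', 'abc123']

# No end anchor: any string that merely starts with digits passes
prefix_only = [s for s in candidates if re.match(r'\d+', s)]

# '$' forces the match to reach the end of the string
anchored = [s for s in candidates if re.match(r'\d+$', s)]

# re.fullmatch requires the whole string to match, with no anchors needed
full = [s for s in candidates if re.fullmatch(r'\d+', s)]

# Subtle difference: '$' also matches just before a trailing newline
dollar_allows_newline = bool(re.match(r'\d+$', '123\n'))
fullmatch_allows_newline = bool(re.fullmatch(r'\d+', '123\n'))

print(prefix_only, anchored, full)
print(dollar_allows_newline, fullmatch_allows_newline)
```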

#### 1. Identify files via file extensions
<p>A regular expression to check for file extensions.  </p>

In [None]:
import re

# Define a regex pattern to match file names with specific extensions (gif, jpeg, jpg, TIF).
pattern = r'[\w]+\.(gif|jpeg|jpg|TIF)$'

# Define a list of sentences to be checked against the pattern, along with their expected results.
sentences = [('test.gif', True), 
            ('image.jpeg', True),
            ('image.jpg', True),
            ('image.TIF', True),
            ('test', False),
            ('test.pdf', False),
            ('test.gif.gif', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)
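
Because the extension sits inside a capturing group, the same pattern can also be used to extract it. A small sketch, in which the names `ext_pattern` and `get_extension` are assumptions introduced for illustration:

```python
import re

# Same pattern as above, stored under a separate (hypothetical) name.
ext_pattern = r'[\w]+\.(gif|jpeg|jpg|TIF)$'

def get_extension(filename):
    """Return the captured extension, or None if the name is rejected."""
    m = re.match(ext_pattern, filename)
    return m.group(1) if m else None

names = ('image.jpeg', 'test.pdf', 'test.gif.gif', 'image.tif')
extensions = [get_extension(name) for name in names]
print(extensions)
```

Note that the pattern is case-sensitive, so a lowercase `.tif` name is rejected even though `TIF` is listed.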

### Checking for numbers
#### 2. Positive integers

In [None]:
# Define a regex pattern to match strings consisting of one or more digits.
# `\d+` requires at least one digit; `\d*` would also accept an empty string.
pattern = r'\d+$'

# Define a list of sentences to be checked against the pattern, along with their expected results.
sentences = [('123', True), 
            ('1', True),
            ('abc', False),
            ('1.1', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)
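
The choice of quantifier matters for empty input: `\d*` accepts zero digits, while `\d+` requires at least one. A quick check, using variable names chosen only for this example:

```python
import re

# `*` allows zero digits, so the empty string is accepted.
star_accepts_empty = re.match(r'\d*$', '') is not None
# `+` requires at least one digit, so the empty string is rejected.
plus_accepts_empty = re.match(r'\d+$', '') is not None

print(star_accepts_empty, plus_accepts_empty)
```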

#### 3. Negative integers

In [None]:
# Define a regular expression pattern to match strings that consist of a hyphen followed by one or more digits.
pattern = r'-\d+$'

# Create a list of sentences, each with an associated expected result (True or False).
sentences = [('-123', True),
            ('-1', True),
            ('123', False),
            ('-abc', False),
            ('-1.1', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)

#### 4. All integers

In [None]:
# Define a regular expression pattern to match strings that consist of an optional hyphen followed by one or more digits.
pattern = r'-?\d+$'

# Create a list of sentences, each with an associated expected result (True or False).
sentences = [('-123', True),
            ('-1', True),
            ('123', True),
            ('123.0', False),
            ('-abc', False),
            ('-1.1', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)

#### 5. Positive numbers

In [None]:
# Define a regular expression pattern to match non-negative numbers: an optional integer part, an optional decimal point, and at least one trailing digit.
pattern = r'\d*\.?\d+$'

# Create a list of sentences, each with an associated expected result (True or False).
sentences = [('1', True),
            ('123', True),
            ('1.234', True),
            ('0.2', True),
            ('.2', True),
            ('-123.0', False),
            ('-abc', False),
            ('-123.1', False)]

# Print the pattern being used for matching.
print("PATTERN:", pattern)

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)

#### 6. Negative numbers

In [None]:
# Define a regular expression pattern to match strings representing negative decimal numbers.
pattern = r'-\d*\.?\d+$'

# Create a list of sentences, each with an associated expected result (True or False).
sentences = [('-1', True),
            ('-123', True),
            ('-1.234', True),
            ('123', False),
            ('-abc', False),
            ('123.1', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)

#### 7. All numbers

In [None]:
# Define a regular expression pattern to match strings representing decimal numbers, with an optional leading minus sign.
pattern = r'-?\d*\.?\d+$'

# Create a list of sentences, each with an associated expected result (True or False).
sentences = [('1', True),
            ('123', True),
            ('1.234', True),
            ('-234', True),
            ('-1.234', True),
            ('a', False),
            ('-abc', False),
            ('a1', False)]

# Call the check_sentences function with the pattern and sentences to perform the checks.
check_sentences(pattern, sentences)
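
To see what each piece of the pattern contributes, it can be written with `re.VERBOSE`, which allows whitespace and comments inside the pattern. This is only an annotated sketch of the same expression; the names `number_re`, `samples`, and `outcomes` are assumptions for illustration.

```python
import re

number_re = re.compile(r'''
    -?      # optional minus sign
    \d*     # optional integer part
    \.?     # optional decimal point
    \d+     # at least one digit
    $       # nothing may follow
''', re.VERBOSE)

samples = ['-1.234', '.5', '5.', '1.2.3']
outcomes = {s: number_re.match(s) is not None for s in samples}
print(outcomes)
```

The sample `'5.'` is rejected because the pattern requires at least one digit after the optional decimal point.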

#### 8. Username validation
<p>Checking for a valid user name that has a certain minimum and maximum length.</p>
<p>Allowed characters:</p>
<ul>
<li>letters (upper- and lower-case)</li>
<li>numbers</li>
<li>dashes</li>
<li>underscores</li>
</ul>

In [None]:
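min_len = 5 # minimum length for a valid username
max_len = 15 # maximum length for a valid username

# Build the pattern from the length limits instead of hard-coding them.
# \w already covers letters, digits, and underscores, so only the dash is added.
pattern = rf'[\w-]{{{min_len},{max_len}}}$'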

sentences = [('user123',True), ('123_user', True),('Username',True),
            ('user',False),('username1234_is-way-too-long',False),('user$34354',False)]

check_sentences(pattern,sentences)

#### 9. Checking for valid email addresses
A regular expression that captures most email addresses.

In [None]:
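# The inline (?i) flag must be the first item in the pattern;
# Python 3.11+ raises an error if it appears anywhere else.
pattern = r'(?i)(^(\w+\.|\w+-)*\w+@(\w+\.|\w+-)*\w+\.[a-z]{2,3}$)'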

sentences = [('l-l.l@mail.Aom.PP',True), ('ds@mail.com', True),
            ('testmail.com',False),('test@mail.com.',False),('@testmail.com',False),('test@mailcom',False)]

check_sentences(pattern,sentences)
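
An inline `(?i)` flag is only accepted at the very start of a pattern on recent Python versions (3.11 and later raise an error otherwise). An alternative that avoids the inline flag is to pass `re.IGNORECASE` when compiling. The sketch below assumes the same email pattern; the names `email_re` and `checks` are introduced here for illustration.

```python
import re

# Same pattern without the inline flag; case-insensitivity comes from the flags argument.
email_re = re.compile(
    r'^(\w+\.|\w+-)*\w+@(\w+\.|\w+-)*\w+\.[a-z]{2,3}$',
    re.IGNORECASE,
)

candidates = ('ds@mail.com', 'l-l.l@mail.Aom.PP', 'test@mailcom')
checks = [email_re.match(s) is not None for s in candidates]
print(checks)
```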

### Use case challenges using regular expressions

#### 1. Validating dates and time
Validate dates in mm/dd/yyyy format. Note: plausibility is not checked, so a far-future year such as 2080 is not flagged as invalid.

In [None]:
pattern = r''



sentences = [('01/08/2014',True), ('12/30/2014', True),
            ('22/08/2014',False),('-123',False),('1/8/2014',False),('1/08/2014',False),('01/8/2014',False)]

check_sentences(pattern,sentences)

#### 2. 12-Hour format

In [None]:
pattern = r''



sentences = [('2:00pm',True), ('7:30 AM', True), ('12:05 am', True),
            ('22:00pm',False),('14:00',False),('3:12',False),('03:12pm',False)]

check_sentences(pattern,sentences)

#### 3. 24-Hour format

In [None]:
pattern = r''


sentences = [('14:00',True), ('00:30', True), 
            ('22:00pm',False),('4:00',False),('03:12pm',False)]

check_sentences(pattern,sentences)

#### 4. Checking for HTML/XML, etc. tags (a very simple approach)

In [None]:
pattern = r''

sentences = [('<a>',True), ('<a href="somethinG">', True),  ('</a>', True),  ('<img src>', True), 
            ('a>',False),('<a',False),('< a >',False)]

check_sentences(pattern,sentences)

#### 5. ID/Passport/NIF

In [None]:
pattern = r''


sentences = [('12345678D',True), ('X1234567F', True), 
             ('123456F',False),('X12367F',False),('123Ff456F',False)]


check_sentences(pattern,sentences)

#### 6. Website Names
Define the pattern to detect if a string corresponds to a website name. The pattern follows these rules:
* It can start with either three "w"s or directly with the domain name.
* It is followed by the domain name, which can contain any letters and numbers.
* It can have a maximum of 2 subdomains (composed of letters and numbers).
* It ends with a period followed by 2 or 3 letters.

You should detect the following cases:
* Positives:
    * www.ds.com
    * www.data.science.com
    * datascience.com
    * wab.a.com
* Negatives:
    * ww.4com
    * www.ww.a
    * www.d.s.c.d.com

In [None]:
pattern = ''


sentences = [('www.ds.com',True), ('www.data.science.com', True),  ('datascience.com', True),  ('data.sc.com', True), 
            ('ww.4com',False),('www.ww.a',False),('www.d.s.c.d.com',False)]


check_sentences(pattern,sentences)

## Solutions

In [None]:
# Define regex patterns for each exercise
pattern1 = r'\w+@\w+\.\w+'
pattern2 = r'\d{2}/\d{2}/\d{4}'
pattern3 = r'https?://\S+'
pattern4 = r'\(\d{3}\) \d{3}-\d{4}'
pattern5 = r'\w+-\w+'
pattern7 = r'\d{2}/\d{2}/\d{4}'
pattern8 = r'#\w+'
pattern9 = r'@\w+'
pattern10 = r'\(?\+\d{1,2}\)? \d{3}-\d{4}'