## Lecture 4: [1/24]

## Regular Expressions 

---

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Regular Expressions**
* A regular expression is a sequence of characters that defines a search pattern.  It's like a mini-language for describing text structures.  You use them to:

* **Match**: Check whether a pattern matches at the start of a string.
* **Search**: Locate the first occurrence of a pattern anywhere in a string.
* **Replace**: Substitute parts of a string that match a pattern.
* **Split**: Divide a string into substrings based on a pattern.
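
The four uses listed above map onto four functions in Python's `re` module. The following is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "cat bat rat"

# Match: succeeds only if the pattern matches at the start of the string.
starts_with_cat = bool(re.match(r"cat", text))
starts_with_bat = bool(re.match(r"bat", text))
print(starts_with_cat, starts_with_bat)   # True False

# Search: finds the first occurrence anywhere in the string.
first = re.search(r"[br]at", text)
print(first.group())                      # bat

# Replace: substitute every match with new text.
replaced = re.sub(r"[cbr]at", "hat", text)
print(replaced)                           # hat hat hat

# Split: break the string wherever the pattern matches.
parts = re.split(r"\s+", text)
print(parts)                              # ['cat', 'bat', 'rat']
```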

**Key Concepts and Components:**
* **Characters**:  Most characters match themselves literally.  For example, the regex "hello" will match the string "hello".
* **Special Characters** (Metacharacters): These characters have special meanings and allow for more complex pattern matching.  Some common ones include:
    * `.` (dot): Matches any character except a newline.
    * `*` (asterisk): Matches zero or more occurrences of the preceding character or group.
    * `+` (plus): Matches one or more occurrences of the preceding character or group.   
    * `?` (question mark): Matches zero or one occurrence of the preceding character or group.   
    * `[]` (square brackets): Defines a character class, matching any character within the brackets. For example, [aeiou] matches any vowel.
    * `()` (parentheses): Groups expressions together.
    * `|` (pipe): Acts as an "or" operator.
    * `^` (caret): Matches the beginning of a string.
    * `$` (dollar sign): Matches the end of a string.
    * `\` (backslash): Escapes special characters or represents special character classes (e.g., `\d` for digits, `\s` for whitespace).
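
To see the metacharacters above in action, here is a small sketch that runs `re.findall` on one sample string per pattern. The patterns and sample strings are illustrative only and are not part of the lecture.

```python
import re

# (pattern, sample text) pairs; the sample strings are invented for illustration.
examples = [
    (r"h.t", "hat hot hit"),         # .  any single character
    (r"ab*", "a ab abbb"),           # *  zero or more b's
    (r"ab+", "a ab abbb"),           # +  one or more b's
    (r"colou?r", "color colour"),    # ?  optional u
    (r"[aeiou]", "regex"),           # [] any one vowel
    (r"cat|dog", "dog and cat"),     # |  either alternative
    (r"^The", "The end. The"),       # ^  only at the start of the string
    (r"end$", "end to end"),         # $  only at the end of the string
    (r"3\.14", "3.14 3x14"),         # \. a literal dot
]

results = {pattern: re.findall(pattern, text) for pattern, text in examples}
for pattern, found in results.items():
    print(f"{pattern!r:12} -> {found}")

# () groups a sub-pattern so that a quantifier applies to the whole group.
laugh = re.search(r"(ha)+", "hahaha!").group()
print(laugh)  # hahaha
```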

**Character Classes:  Shorthand notations for common character sets:**
* `\d`: Matches any digit (0-9).
* `\w`: Matches any word character (letters, numbers, and underscore).
*  `\s`: Matches any whitespace character (spaces, tabs, newlines).
* `r'\d'`: the digit pattern written as a Python raw string.
* `r''`: the raw-string prefix, which stops Python from treating backslashes as escape characters before the pattern reaches `re`.
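
A short sketch of the shorthand classes and of why the raw-string prefix matters. The sample string is made up for illustration.

```python
import re

sample = "Room 42\tis open_now"   # contains two spaces and one tab

digits = re.findall(r"\d", sample)    # individual digits
words = re.findall(r"\w+", sample)    # runs of letters, digits, underscores
spaces = re.findall(r"\s", sample)    # each whitespace character

print(digits)   # ['4', '2']
print(words)    # ['Room', '42', 'is', 'open_now']
print(spaces)   # [' ', '\t', ' ']

# A raw string keeps the backslash as a literal character.
same_pattern = r"\d" == "\\d"
print(same_pattern)            # True

# Without the r prefix, "\b" is a single backspace character;
# with it, r"\b" is the two characters that regex reads as a word boundary.
lengths = (len("\b"), len(r"\b"))
print(lengths)                 # (1, 2)
```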

**Quantifiers: Specify how many times a character or group can occur**
* `{n}`: Matches exactly n occurrences.
* `{n,}`: Matches n or more occurrences.
*  `{n,m}`: Matches between n and m occurrences.
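
The three quantifier forms can be compared on one string. This sketch also uses `\b` (a word boundary, not covered above) so that each run of letters is tested as a whole; the sample text is made up.

```python
import re

text = "a aa aaa aaaa"

# \b on both sides forces the whole run of a's to satisfy the quantifier.
exactly_two = re.findall(r"\ba{2}\b", text)
two_or_more = re.findall(r"\ba{2,}\b", text)
two_to_three = re.findall(r"\ba{2,3}\b", text)

print(exactly_two)    # ['aa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa']
print(two_to_three)   # ['aa', 'aaa']
```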


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Importing re module**

`import:`  In Python, the import statement is used to bring in external modules or libraries of code that provide additional functionality beyond the basic built-in features of Python.

`re:` This specifically refers to the "regular expression" module.  Regular expressions are a powerful way to search, match, and manipulate text based on patterns.

In [2]:
import re

## I. Begins-With Foo (No RegEx)

**Case Sensitive**

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Approaches & Case Sensitivity**

* **If-Else (Case Sensitive)**:  This version explicitly checks the first 3 characters using slicing (`s[:3]`) and compares them to "foo". If they match exactly (case-sensitive), it returns `True`, otherwise `False`.

* **Index Comparison (Case Sensitive)**:  This is a more concise way to do the same thing. It directly returns the result of the comparison `s[:3] == 'foo'`.  Since comparisons evaluate to `True` or `False`, the `if-else` is unnecessary.

* **startswith() (Case Sensitive)**: This is the most Pythonic and recommended way.  The built-in string method `s.startswith('foo')` directly checks if the string `s` starts with "foo" (case-sensitive).

* **casefold() and startswith() (Case Insensitive)** : This version first converts the string `s` to lowercase using `s.casefold()` and then uses `startswith('foo')`.  `casefold()` handles more Unicode characters than `lower()`, making it a more robust way to do a case-insensitive comparison.
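
For the prefix "foo" itself, `lower()` and `casefold()` give the same answer. The difference only shows up with certain non-ASCII characters. A minimal sketch, using a German word as an assumed example (it is not from the lecture):

```python
# "Stra\u00dfe" is the German word for "street", written with an escape
# for the sharp-s character.
word = "Stra\u00dfe"

lowered = word.lower()      # the sharp s is left unchanged
folded = word.casefold()    # the sharp s is folded to "ss"

print(folded)                            # strasse
print(lowered.startswith("strasse"))     # False
print(folded.startswith("strasse"))      # True
```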


**Variation Using `If-Else` [Case Sensitive]**

In [6]:
def begins_with_foo(s):   # define a function; the parameter s is the string to test
    if s[:3] == 'foo':    # s[:3] slices out the first 3 characters and compares them to 'foo'
        return True
    else:
        return False
begins_with_foo("Fool")   # returns False because the comparison is case-sensitive

False

**Variation Using `index ==` [Case Sensitive]**

In [7]:
def begins_with_foo(s):
    """Return True if s begins with substring `foo` (case-sensitive)"""
    return s[:3] == 'foo'
begins_with_foo("Fool") #returned false since case sensitive

False

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 


**Variation Using `.startswith` [Case Sensitive]**

In [10]:
def begins_with_foo(s):
    """Return True if s begins with substring foo, False otherwise"""
    return s.startswith('foo')
begins_with_foo("Fool")

False

**Variation Using `.startswith` & `s.lower()` [Case Insensitive]**

In [11]:
def begins_with_foo_insensitive(s):
    s = s.lower()
    return s.startswith('foo')
begins_with_foo_insensitive("FOoL")

True

**Variation Using COMBO `.startswith` & `s.lower()` [Case Insensitive]**

In [12]:
def begins_with_foo_insensitive(s):
    return s.lower().startswith('foo')
begins_with_foo_insensitive("FOOLERS")

True

**Variation Using `.casefold` [Case Insensitive]**


In [13]:
def begins_with_foo_insensitive(s):
    return s.casefold().startswith('foo')
begins_with_foo_insensitive("FOOLMETOO")

True

## `re.match` & `re.search` (Begins-With)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.match`**
* The pattern must match at the beginning of the string.
* Returns a match object, or `None` if there is no match.

**`re.search`**
* The pattern can match anywhere in the string.
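
A quick side-by-side of the two functions on a string where "foo" appears in the middle. The sample string is made up for illustration.

```python
import re

s = "seafood"

at_start = re.match("foo", s)     # "foo" is not at index 0, so this is None
anywhere = re.search("foo", s)    # finds "foo" inside the string

print(at_start)           # None
print(anywhere.span())    # (3, 6)
```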


<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.match` [Case-Sensitive]**

* `re.match('foo', s)`: The `re.match()` function attempts to match the regular expression `'foo'` at the beginning of the string `s`. It's case-sensitive. If the string starts with "foo", it returns a match object; otherwise, it returns `None`.
* `bool(...)`: The `bool()` function converts the result of `re.match()` to a boolean value. A match object is considered "truthy" (evaluates to `True` in a boolean context), while `None` is "falsy" (evaluates to False).
* `begins_with_foo("foodie")`: This calls the function with the string "foodie". Since it starts with "foo", `re.match()` returns a match object, which is converted to `True`.

In [17]:
def begins_with_foo(s):
    return bool(re.match('foo', s))

begins_with_foo("foodie")  

True

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.match` with `re.IGNORECASE` (Case Insensitive)**


* `re.match('foo', s, flags=re.IGNORECASE)`: This is similar to the first example, but the `re.IGNORECASE` flag is added. This flag makes the regular expression matching case-insensitive.
* `begins_with_foo_insensitive("foodies")`: This calls the function with the string "foodies". The string starts with "foo", so the result is `True`; with the `re.IGNORECASE` flag, differently capitalized inputs such as "Foodies" or "FOODIES" would also match.

In [18]:
def begins_with_foo_insensitive(s):
    return bool(re.match('foo', s, flags=re.IGNORECASE))

begins_with_foo_insensitive("foodies")  # Output: True

True

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.search` with `^` (Case Sensitive, Beginning of String)**

* `re.search('^foo', s)`: The `re.search()` function searches for the pattern anywhere in the string. However, the `^` character in the regular expression specifically anchors the match to the beginning of the string. So, even though `re.search()` can search anywhere, the `^` forces it to only consider matches at the start. This is functionally similar to `re.match()` in this specific case.

* `begins_with_foo("foodless")`: This calls the function with "foodless". Since "foodless" starts with "foo", the match is successful, and the function returns `True`.



In [19]:
def begins_with_foo(s):
    return bool(re.search('^foo', s))

begins_with_foo("foodless")  # Output: True

True
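
`re.search('^foo', s)` and `re.match('foo', s)` stop being interchangeable once the `re.MULTILINE` flag is involved, because that flag lets `^` match at the start of every line. A small sketch with an assumed two-line string:

```python
import re

text = "bar\nfoo"   # "foo" starts the second line, not the string

default_search = re.search("^foo", text)
multiline_search = re.search("^foo", text, flags=re.MULTILINE)
multiline_match = re.match("foo", text, flags=re.MULTILINE)

print(default_search)            # None: ^ means start of string by default
print(multiline_search.span())   # (4, 7): ^ now also matches after a newline
print(multiline_match)           # None: re.match still checks only index 0
```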

## II. Ends-With Foo (No RegEx)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Approaches & Case Sensitivity**

* **Index Comparison (Case Sensitive)**:  This version checks if the last three characters of the string (`s[-3:]`) are exactly equal to "foo". It's case-sensitive, so "Foo" or "FOO" wouldn't match.

* **`.endswith()` (Case Sensitive)** : This is the standard and most Pythonic way. The built-in string method `s.endswith('foo')` directly checks if the string `s` ends with "foo" (case-sensitive).

* **`.endswith()` with `.lower()` (Case Insensitive)**: This converts the string `s` to lowercase using `s.lower()` before checking with `endswith('foo')`. This makes the check case-insensitive for ordinary text, but `lower()` leaves some Unicode characters (such as the German "ß") unchanged that a caseless comparison should normalize.

* **`.endswith()` with `.casefold()` (Case Insensitive)**: This is the most robust way for case-insensitive comparisons. It uses `s.casefold()`, which handles a wider range of Unicode characters than `lower()`, and then checks with `endswith('foo')`.

**Variation Using `index==` [Case Sensitive]**

In [20]:
def ends_with_foo(s):
    return s[-3:] == 'foo'
ends_with_foo("mafoo")

True

**Variation Using `.endswith`[Case Sensitive]**

In [22]:
def ends_with_foo(s):
    return s.endswith('foo')
ends_with_foo("chefoo")

True

**Variation Using COMBO `.endswith` & `s.lower()` [Case Insensitive]**

In [24]:
def ends_with_foo_insensitive(s):
    return s.lower().endswith('foo')
ends_with_foo_insensitive("Foolboy")   # False: the string begins with "Foo" but ends with "boy"

False

**Variation Using COMBO `.endswith` & `.casefold` [Case Insensitive]**

In [26]:
def ends_with_foo_insensitive(s):
    return s.casefold().endswith('foo')
ends_with_foo_insensitive("GEEFOO")

True

## `re.match` & `re.search` (Ends-With)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.match`**
* The pattern must match at the beginning of the string.
* Returns a match object, or `None` if there is no match.

**`re.search`**
* The pattern can match anywhere in the string.

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Case-Insensitive Check**

* `re.search('foo$', s, flags=re.I)`:
    * `re.search()`: This function searches for the pattern anywhere in the string.
    * `'foo$'`: This is the regular expression pattern. `foo` matches the literal characters "foo". The `$` symbol is an anchor that means "end of the string" (by default it also matches just before a single trailing newline). So, this pattern matches "foo" only if it occurs at the end of the string.
    * `s`: This is the input string being searched.
    * `flags=re.I`: This is the crucial part. The `re.I` flag (or `re.IGNORECASE`) makes the search case-insensitive. So, "Foo", "FOO", "foo", etc., will all match.

* `bool(...)`: The `bool()` function converts the result of `re.search()` to a boolean (True or False). If a match is found, `re.search()` returns a match object (which is "truthy"), otherwise it returns `None` (which is "falsy").

* `ends_with_foo_insensitive("mafoo")`: This calls the function with the string "mafoo". Because the search is case-insensitive and "mafoo" ends with "foo", the function returns `True`.

In [31]:
def ends_with_foo_insensitive(s):
    return bool(re.search('foo$', s, flags=re.I))

ends_with_foo_insensitive("mafoo") 

True
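
One detail where `'foo$'` and `.endswith('foo')` can disagree is a trailing newline, which is common when reading lines from a file. The sketch below uses an assumed input string and the `\Z` anchor, which matches only at the very end of the string.

```python
import re

line = "mafoo\n"   # e.g. a line read from a file, newline still attached

dollar = bool(re.search(r"foo$", line))     # $ also matches just before a final newline
strict = bool(re.search(r"foo\Z", line))    # \Z matches only at the true end
method = line.endswith("foo")               # the string method is strict as well

print(dollar, strict, method)   # True False False
```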

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Case-Sensitive Check**

* `re.search('foo$', s)`: This is almost the same as the previous example, except there is no `re.I` flag. Therefore, the search is case-sensitive.
* `ends_with_foo("mefoo")`: This calls the function with the string "mefoo". Because the search is case-sensitive and "mefoo" ends with "foo", the function returns `True`.


In [32]:
def ends_with_foo(s):
    """Return True if s ends with substring foo, False otherwise"""
    return bool(re.search('foo$', s))

ends_with_foo("mefoo") 

True

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Key Difference**
* The core difference is the presence or absence of the `re.I` flag. The first function will return `True` for strings ending with "foo", "Foo", "FOO", etc., while the second function will only return `True` for strings ending with "foo" (lowercase).

**Which to use:**
* Use the case-insensitive version `(re.I)` if you want to match "foo" regardless of capitalization. Use the case-sensitive version if you need an exact, literal match of "foo".

## III. Has Foo (No RegEx) 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Case Sensitive Check** 
* `'foo' in s`: This is the core of the check. The `in` operator in Python checks for substring presence. It returns `True` if the substring "foo" is found anywhere within the string `s`, and `False` otherwise. Critically, this check is case-sensitive.

* `has_foo("fool")`: This calls the function with the string "fool". Since "foo" is present in "fool", the function returns `True`.

**Variation Using `in` [Case Sensitive]**

In [33]:

def has_foo(s):
    return 'foo' in s
has_foo("fool")

True

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Case Insensitive Check**

* `s.casefold()`: This part is the key difference. The `.casefold()` method converts the string `s` to lowercase before the `in` operator checks for the substring. `casefold()` is a more robust way to lowercase strings for comparison than `.lower()` as it handles a wider range of Unicode characters.
* `'foo' in s.casefold()`: Now, the `in` operator is checking if "foo" is present in the lowercased version of `s`. This makes the check case-insensitive.
* `has_foo_insensitive("MAFOOSA")`: This calls the function with the string "MAFOOSA". `"MAFOOSA".casefold()` becomes "mafoosa". Since "foo" is in "mafoosa", the function returns `True`.

**Variation Using `.casefold` [Case-Insensitive]**


In [34]:
def has_foo_insensitive(s):
    return 'foo' in s.casefold()
has_foo_insensitive("MAFOOSA")

True

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**Key Difference**: 

* The essential difference is the use of `.casefold()` in the second snippet.  The first snippet will only return `True` if "foo" is found in the exact case. The second snippet will return `True` if "foo", "Foo", "FOO", "fOo", etc., are found because the case is ignored.

**Which to use**:
* Use the case-sensitive version if you need an exact, literal match. Use the case-insensitive version (with `.casefold()`) if you want to find the substring regardless of capitalization.  The case-insensitive version is often more user-friendly when dealing with text input where you don't want to be too strict about capitalization.

## `re.match` & `re.search` (Has Foo)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

**`re.match`**
* The pattern must match at the beginning of the string.
* Returns a match object, or `None` if there is no match.

**`re.search`**
* The pattern can match anywhere in the string.

**Variation Using `re.search` [Case Sensitive]**


In [36]:
def has_foo(s):
    return bool(re.search('foo', s))
has_foo("mafoosa")

True

**Variation Using `re.search` & `re.I` [Case Insensitive]**

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `def has_foo_insensitive(s):`: This line defines a function named `has_foo_insensitive` that takes one argument, `s`, which is expected to be a string.
* `return bool(re.search('foo', s, flags=re.I))`: This is the core of the function. Let's analyze it piece by piece:
    * `re.search('foo', s, flags=re.I)`: This part uses the `re.search()` function from the `re` (regular expression) module to search for the substring "foo" within the string `s`.
        * `'foo'` is the regular expression pattern we are searching for.
        * `s` is the string being searched.
        * `flags=re.I` is the key for case-insensitivity. The `re.I` flag (or equivalently `re.IGNORECASE`) makes the search case-insensitive, so it will find "foo", "Foo", "FOO", etc.
    * `bool(...)`: The `re.search()` function returns a match object if the pattern is found, and `None` if it's not found. The `bool()` function converts this result to a boolean value: `True` if a match object is returned (meaning "foo" was found), and `False` if `None` is returned (meaning "foo" was not found).
    * `return ...:` The function returns the boolean value (`True` or `False`) indicating whether the substring "foo" was found in the string s (case-insensitively).
* `has_foo_insensitive("food")`: This line calls the function with the string "food" as input. Since "food" contains "foo", the function will return `True`.

In [37]:
def has_foo_insensitive(s):
    return bool(re.search('foo', s, flags=re.I))

has_foo_insensitive("food")

True

## IV. Begins with Number (No RegEx)

**Variation Using `.isnumeric`**

In [38]:
def begins_with_a_number(s):
    # s[:1] is a slice, so an empty string gives False instead of raising IndexError
    return s[:1].isnumeric()
begins_with_a_number("21czarina")

True

## `re.match` & `re.search` (Begins-With Number)

**Variation Using  `re.match` & `r'\d'`**

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `r'\d'`  is the regular expression pattern.
* `r''` denotes a raw string, which is a best practice when defining regular expressions to avoid issues with backslashes.
* `\d` is a special sequence that matches any digit (0-9).

In [60]:
def begins_with_a_number(s):
    return bool(re.match(r'\d', s))
begins_with_a_number("2czarina")

True
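
The string methods and `\d` do not agree on every character. The sketch below compares them on a few single characters chosen for illustration (they are not from the lecture); `re.ASCII` is the standard flag that restricts `\d` to `0-9`.

```python
import re

# Labelled single characters, written with escapes so the source stays ASCII.
samples = {
    "ascii 7": "7",
    "arabic-indic 3": "\u0663",
    "superscript 2": "\u00b2",
    "fraction 1/2": "\u00bd",
}

table = {}
for label, ch in samples.items():
    table[label] = (
        ch.isdecimal(),
        ch.isdigit(),
        ch.isnumeric(),
        bool(re.match(r"\d", ch)),                  # Unicode decimal digits
        bool(re.match(r"\d", ch, flags=re.ASCII)),  # only 0-9
    )

print("label            isdecimal isdigit isnumeric \\d  \\d+ASCII")
for label, row in table.items():
    print(f"{label:16}", *row)
```

In this comparison `\d` lines up with `isdecimal()`, while `isnumeric()` is the most permissive of the three string methods.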

## Begins With Number then Letter (No RegEx)

**Variation Using COMBO `.isdigit` & `.isalpha`**

In [53]:
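def begins_with_number_then_letter(s):
    # s[:1] and s[1:2] are slices, so strings shorter than 2 characters give False instead of IndexError
    return s[:1].isdigit() and s[1:2].isalpha()

begins_with_number_then_letter("1jas")  # returns True if s begins with a digit followed by a letter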

True

## `re.match` & `re.search` (Begins with Number then Letter)

**Variation Using `re.match` & `r'\d[a-zA-Z]'`** 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `[a-zA-Z]`: the next character is any lowercase or uppercase ASCII letter.

In [58]:
def begins_with_number_then_letter(s):
    return bool(re.match(r'\d[a-zA-Z]', s))
begins_with_number_then_letter("2Zzarina") #Return True if s begins with a number followed by a letter

True
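
The class `[a-zA-Z]` covers ASCII letters only, while `.isalpha()` also accepts accented and non-Latin letters. A sketch of the difference follows; the three function names and the test strings are hypothetical, and `[^\W\d_]` is one common way to say "any letter" with `re`.

```python
import re

def by_methods(s):
    """String-method check; slices avoid IndexError on short strings."""
    return s[:1].isdigit() and s[1:2].isalpha()

def by_regex_ascii(s):
    """Digit followed by an ASCII letter."""
    return bool(re.match(r"\d[a-zA-Z]", s))

def by_regex_unicode(s):
    """Digit followed by a word character that is not a digit or underscore."""
    return bool(re.match(r"\d[^\W\d_]", s))

# "1\u00e9t\u00e9" is a 1 followed by the French word for "summer".
checks = {}
for s in ["1jas", "1\u00e9t\u00e9", "12", "1"]:
    checks[s] = (by_methods(s), by_regex_ascii(s), by_regex_unicode(s))
    print(ascii(s), checks[s])
```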

**Variation Using `re.search` & `^` with `\d`**

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `^` = begins with
* `\d` = digit (one digit)


In [None]:
def begins_with_a_number(s):
    # '^' anchors the search to the start of s; '\d' then requires one digit there
    return bool(re.search(r'^\d', s))
begins_with_a_number("8foo")

## Match and Search 

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `lazydog = '''...'''`: This line assigns a multiline string to the variable `lazydog`. This string contains the text you'll be searching within.

In [61]:
lazydog = '''The quick brown fox jumps over the lazydog. The quick brown fox jumps over the lazy dog.
The quick brown then fox 123 abc123 999xyz jumps he over the -he lazy      she dog. The quick brown fox jumps over the lazy$!@dog.'''

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`match = re.match('The', lazydog)`: This is the core of the example:

* `re.match()`: This function attempts to match the regular expression pattern at the beginning of the string.
* `'The'`: This is the regular expression pattern. It's a literal string, meaning it will search for the exact characters "The" in that order.
* `lazydog`: This is the string you're searching within.
* The result of `re.match()` (either a match object or `None`) is stored in the variable `match`.

`match:`  This line simply displays the value of the match variable. Since "The" is found at the beginning of the `lazydog` string, `re.match()` returns a match object, which is printed. The output `<re.Match object; span=(0, 3), match='The'>` indicates that a match was found, it spans from index 0 to 3 (exclusive), and the matched text is "The".

In [63]:
match = re.match('The', lazydog)

In [64]:
match

<re.Match object; span=(0, 3), match='The'>

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `match.group()`: This extracts and returns the actual matched text, which is "The".
* `match.start()`: This returns the starting index of the match, which is 0.
* `match.end()`: This returns the ending index of the match (exclusive), which is 3.

In [66]:
match.group()

'The'

In [67]:
match.start()

0

In [68]:
match.end()

3

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `lazydog[match.start(): match.end()]`: This uses string slicing to extract the portion of the `lazydog` string that was matched. Since the match starts at index 0 and ends at index 3, it extracts the substring "The".

In [69]:
lazydog[match.start():match.end()]

'The'

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `match.span()`: This returns a tuple containing the starting and ending indices of the match: `(0, 3)`.

In [70]:
match.span()

(0, 3)

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

`re.search`
* `re.search('The', lazydog)`: The `re.search()` function searches for the first occurrence of the pattern `'The'` within the string `lazydog`. Unlike `re.match()`, which only checks the beginning of the string, `re.search()` looks for the pattern anywhere in the string.
* Output: The output is a match object. `<re.Match object; span=(0, 3), match='The'>` tells you that a match was found:
    * `span=(0, 3)`: The match occurs from index 0 up to (but not including) index 3.
    * `match='The'`: The actual text that matched is "The".

In [72]:
re.search('The', lazydog)

<re.Match object; span=(0, 3), match='The'>
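
In the cell above both functions report the same span because "The" sits at index 0. A sketch with a pattern that occurs later in the text shows the difference, along with the same match-object methods used earlier. The short sample string here is an assumption, used instead of `lazydog` so the indices are easy to count.

```python
import re

text = "The quick brown fox"

found = re.search("fox", text)      # scans forward until "fox" is found
not_found = re.match("fox", text)   # only checks index 0, so no match

print(found.group())    # fox
print(found.start())    # 16
print(found.end())      # 19
print(found.span())     # (16, 19)
print(text[found.start():found.end()])   # fox
print(not_found)        # None
```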

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `find_all(pattern, s)`: This defines a custom function to find all non-overlapping matches of a pattern in a string.   
* `while True` loop: The loop continues until `re.search()` returns `None` (no more matches are found).
* `match = re.search(pattern, s[start_index:])`: This is the key. It searches for the pattern in the remaining part of the string, starting from `start_index`.
* `start_index += match.end()`: After a match is found, the `start_index` is updated to the end of the current match. This is crucial to avoid finding the same match repeatedly.
* `matches.append(match.group())`: If a match is found, the matched text is added to the `matches` list.
* `return matches`: When no more matches are found, the function returns the `matches` list.
* `find_all('The', lazydog)`: This calls the function to find all occurrences of "The" in `lazydog`.

In [73]:
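def find_all(pattern, s):
    """Return list of all matches of pattern in s"""
    # Note: assumes pattern never matches the empty string; an empty match
    # would leave start_index unchanged and the loop would never end.
    start_index = 0
    matches = []
    while True:
        # search only the part of s that has not been consumed yet
        match = re.search(pattern, s[start_index:])
        if match:
            matches.append(match.group())
            # match.end() is relative to the slice, so add it to the offset
            start_index += match.end()
        else:
            return matches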
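
If the positions of the matches are needed as well as the text, the standard library's `re.finditer` yields one match object per non-overlapping match. A minimal sketch; the helper name `find_all_spans` and the sample string are hypothetical.

```python
import re

def find_all_spans(pattern, s):
    """Return (matched text, (start, end)) for every non-overlapping match."""
    return [(m.group(), m.span()) for m in re.finditer(pattern, s)]

text = "The cat. The dog."
spans = find_all_spans("The", text)
print(spans)   # [('The', (0, 3)), ('The', (9, 12))]
```

Because `re.finditer` works on the whole string rather than on slices, the reported indices refer to the original string, and anchors such as `^` keep their usual meaning.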

<div style="border: 2px solid black; padding: 10px; border-radius: 5px;"> 

* `re.findall('The', lazydog)`: The `re.findall()` function directly returns a list of all non-overlapping matches of the pattern in the string. It's a much simpler way to achieve the same result as the `find_all()` function in the previous block.
* Output: `['The', 'The', 'The', 'The']` - A list containing all four occurrences of "The" in lazydog.

In [74]:
re.findall('The', lazydog)

['The', 'The', 'The', 'The']

**Key Differences and When to Use Which:**

* `re.search()`: Use when you only need to find the first occurrence of a pattern.   
* `find_all()` (custom function): This demonstrates how you might implement finding all matches yourself, but it's generally better to use `re.findall()` directly.
* `re.findall()`: Use when you need all non-overlapping matches of a pattern in a string. It's the most efficient and Pythonic way to do this.   

* In most cases where you want all matches, `re.findall()` is the best choice due to its simplicity and efficiency. `re.search()` is useful when you only care about the first match. The custom `find_all()` function is educational but not typically used in practice since `re.findall()` exists.