In [None]:
import re

### RegEx Functions

The `re` module offers a set of functions that allow us to search a string for a match:

**Function: Description**

1. **findall**: Returns a list containing all matches
2. **search**: Returns a Match object for the first match anywhere in the string, or `None` if there is no match
3. **split**: Returns a list where the string has been split at each match
4. **sub**: Replaces one or many matches with a string

### Metacharacters

Metacharacters are characters with a special meaning:

1. `[]`	A set of characters
  * "[a-m]"
2. `\`	Signals a special sequence (can also be used to escape special characters)
  * "\d"
3. `.`	Any character (except newline character)
  * "he..o"
4. `^`	Starts with
  * "^hello"
5. `$`	Ends with
  * "world$"



A few related pieces used in the examples below:

1. **span()** is a Match object method that returns a tuple (start, end) giving the position of the match in the string.

2. **match()** only looks for a match at the start of the string.

3. **finditer()** returns an iterator of Match objects, one for every non-overlapping match, regardless of position.
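
The difference between these functions is easiest to see on one string. The following is a minimal sketch; the sample text is chosen for illustration only.

```python
import re

text = "rain in Spain"

# match() anchors at the start of the string only.
no_match = re.match(r"ain", text)          # None: the string starts with "r"
at_start = re.match(r"rain", text).span()  # (0, 4)

# search() scans forward and returns the first match anywhere.
first = re.search(r"ain", text).span()     # (1, 4)

# fullmatch() requires the whole string to match the pattern.
whole = re.fullmatch(r"rain", text)        # None: extra text follows "rain"

# finditer() yields one Match object per non-overlapping match.
spans = [m.span() for m in re.finditer(r"ain", text)]
print(spans)  # [(1, 4), (10, 13)]
```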

### Zero or more occurrences: "aix*"

In [None]:
pattern = r"aix*"
test_strings = ["ai", "aix", "aixxx", "ax"]

for s in test_strings:
    if re.fullmatch(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")


'ai' matches the pattern 'aix*'
'aix' matches the pattern 'aix*'
'aixxx' matches the pattern 'aix*'
'ax' does not match the pattern 'aix*'


### One or more occurrences: "aix+"

In [None]:
pattern = r"aix+"
test_strings = ["ai", "aix", "aixxx", "ax"]

for s in test_strings:
    if re.fullmatch(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")


'ai' does not match the pattern 'aix+'
'aix' matches the pattern 'aix+'
'aixxx' matches the pattern 'aix+'
'ax' does not match the pattern 'aix+'


### Exactly the specified number of occurrences: "al{2}"

In [None]:
pattern = r"al{2}"
test_strings = ["a", "al", "all", "alll"]

for s in test_strings:
    if re.fullmatch(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")


'a' does not match the pattern 'al{2}'
'al' does not match the pattern 'al{2}'
'all' matches the pattern 'al{2}'
'alll' does not match the pattern 'al{2}'


### Either or |: "falls|stays"

In [None]:
pattern = r"falls|stays"
test_strings = ["falls", "stays", "falling", "stay", "hello"]

for s in test_strings:
    if re.fullmatch(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")


'falls' matches the pattern 'falls|stays'
'stays' matches the pattern 'falls|stays'
'falling' does not match the pattern 'falls|stays'
'stay' does not match the pattern 'falls|stays'
'hello' does not match the pattern 'falls|stays'


### Capture and group: ()

In [None]:
pattern = r"(hello|world)"
test_strings = ["hello", "world", "helloworld", "goodbye"]

for s in test_strings:
    match = re.fullmatch(pattern, s)
    if match:
        print(f"'{s}' matches the pattern '{pattern}'. Captured group: {match.group(1)}")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")

'hello' matches the pattern '(hello|world)'. Captured group: hello
'world' matches the pattern '(hello|world)'. Captured group: world
'helloworld' does not match the pattern '(hello|world)'
'goodbye' does not match the pattern '(hello|world)'
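
With a single pair of parentheses around the whole pattern, the captured group and the full match are identical. A sketch with two groups shows the difference between `group(0)` and the numbered groups; the pattern and address here are illustrative assumptions, not part of the examples above.

```python
import re

# Two capturing groups: the user name and the domain.
m = re.fullmatch(r"(\w+)@(\w+\.\w+)", "alice@example.com")

whole = m.group(0)    # the entire match
user = m.group(1)     # first parenthesised group
domain = m.group(2)   # second parenthesised group

print(whole)          # alice@example.com
print(user, domain)   # alice example.com
print(m.groups())     # ('alice', 'example.com')
```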


### Special Sequences

A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

`\A`	Returns a match if the specified characters are at the beginning of the string
  * "\AThe"

`\b`	Returns a match where the specified characters are at the beginning or at the end of a word
  * r"\bain" r"ain\b"

`\B`	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
* r"\Bain" r"ain\B"

`\d`	Returns a match where the string contains digits (numbers from 0-9)

`\D`	Returns a match where the string DOES NOT contain digits

`\s`	Returns a match where the string contains a white space character

`\S`	Returns a match where the string DOES NOT contain a white space character

`\w`	Returns a match where the string contains any word characters (letters a to z and A to Z, digits 0-9, and the underscore _ character; for `str` patterns Python also counts other Unicode letters and digits unless `re.ASCII` is set)

`\W`	Returns a match where the string DOES NOT contain any word characters

`\Z`	Returns a match if the specified characters are at the end of the string
  * r"Spain\Z"

If you want to check for a match at the beginning of a string instead, you would use `^` (or `\A`).
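
`^` and `\A` (and likewise `$` and `\Z`) only differ in edge cases. The sketch below uses an illustrative two-line string and the `re.MULTILINE` flag to show where they diverge.

```python
import re

text = "The first line\nThe second line\n"

# Without flags, ^ matches at the start of the string only, like \A.
default_caret = re.findall(r"^The", text)

# With re.MULTILINE, ^ also matches right after each newline; \A does not.
multi_caret = re.findall(r"^The", text, flags=re.MULTILINE)
multi_a = re.findall(r"\AThe", text, flags=re.MULTILINE)

# $ also matches just before a trailing newline; \Z only at the very end.
dollar_hit = re.search(r"line$", text) is not None
z_hit = re.search(r"line\Z", text) is not None

print(default_caret, multi_caret, multi_a)  # ['The'] ['The', 'The'] ['The']
print(dollar_hit, z_hit)                    # True False
```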

In [None]:
pattern = r"\AThe"
test_strings = ["The quick brown fox", "Not The start", "The end"]

for s in test_strings:
    if re.search(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")

'The quick brown fox' matches the pattern '\AThe'
'Not The start' does not match the pattern '\AThe'
'The end' matches the pattern '\AThe'


In [None]:
# \b marks a word boundary. The first pattern (\brain\b) matches "rain" only as a complete word.
# The second pattern (\brain) matches "rain" at the start of a word, whatever follows it.
# Neither pattern matches the "rain" inside "brain", "brainstorm", or "train".

patterns = [r"\brain\b", r"\brain"]
test_strings = ["rain brain pain", "brainstorm", "train"]

for s in test_strings:
    for p in patterns:
        if re.search(p, s):
            print(f"'{s}' matches the pattern '{p}'")
        else:
            print(f"'{s}' does not match the pattern '{p}'")


'rain brain pain' matches the pattern '\brain\b'
'rain brain pain' matches the pattern '\brain'
'brainstorm' does not match the pattern '\brain\b'
'brainstorm' does not match the pattern '\brain'
'train' does not match the pattern '\brain\b'
'train' does not match the pattern '\brain'


In [None]:
patterns = [r"\Brain\B", r"\Brain"]
test_strings = ["brainstorm", "train brain", "brain"]

for s in test_strings:
    for p in patterns:
        if re.search(p, s):
            print(f"'{s}' matches the pattern '{p}'")
        else:
            print(f"'{s}' does not match the pattern '{p}'")

'brainstorm' matches the pattern '\Brain\B'
'brainstorm' matches the pattern '\Brain'
'train brain' does not match the pattern '\Brain\B'
'train brain' matches the pattern '\Brain'
'brain' does not match the pattern '\Brain\B'
'brain' matches the pattern '\Brain'


In [None]:
pattern = r"\d"
test_strings = ["123abc", "abcdef", "a1b2c3", "xyz"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains digits: {matches}")
    else:
        print(f"'{s}' does not contain digits")

'123abc' contains digits: ['1', '2', '3']
'abcdef' does not contain digits
'a1b2c3' contains digits: ['1', '2', '3']
'xyz' does not contain digits


In [None]:
pattern = r"\D"
test_strings = ["123abc", "abcdef", "a1b2c3", "123"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains non-digit characters: {matches}")
    else:
        print(f"'{s}' contains only digits")

'123abc' contains non-digit characters: ['a', 'b', 'c']
'abcdef' contains non-digit characters: ['a', 'b', 'c', 'd', 'e', 'f']
'a1b2c3' contains non-digit characters: ['a', 'b', 'c']
'123' contains only digits


In [None]:
# \s - White space
pattern = r"\s"
test_strings = ["Hello World", "NoSpacesHere", "   ", "\t\n\r"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains whitespace characters: {len(matches)} occurrences")
    else:
        print(f"'{s}' does not contain whitespace characters")

'Hello World' contains whitespace characters: 1 occurrences
'NoSpacesHere' does not contain whitespace characters
'   ' contains whitespace characters: 3 occurrences
'	
' contains whitespace characters: 3 occurrences


In [None]:
pattern = r"\S"
test_strings = ["Hello World", "NoSpacesHere", "   ", "\t\n\r", "a b c"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains non-whitespace characters: {len(matches)} occurrences")
    else:
        print(f"'{s}' contains only whitespace characters")


# From "Hello": 5 characters (H, e, l, l, o) 5 non-whitespace characters
# From "World": 5 characters (W, o, r, l, d) 5 non-whitespace characters
# total is 10


# ********************************
# \t: This is the escape sequence for a tab character. It adds horizontal space in text and is often used for indentation.
# \n: This is the escape sequence for a newline character (also known as a line feed). It moves the cursor to the beginning of the next line.
# \r: This is the escape sequence for a carriage return. It moves the cursor to the beginning of the current line without advancing to the next line.

'Hello World' contains non-whitespace characters: 10 occurrences
'NoSpacesHere' contains non-whitespace characters: 12 occurrences
'   ' contains only whitespace characters
'	
' contains only whitespace characters
'a b c' contains non-whitespace characters: 3 occurrences


In [None]:
pattern = r"\w"
test_strings = ["Hello123World", "abcdef", "a1b2_c3", "!@#$"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains word characters: {matches}")
    else:
        print(f"'{s}' does not contain word characters")

'Hello123World' contains word characters: ['H', 'e', 'l', 'l', 'o', '1', '2', '3', 'W', 'o', 'r', 'l', 'd']
'abcdef' contains word characters: ['a', 'b', 'c', 'd', 'e', 'f']
'a1b2_c3' contains word characters: ['a', '1', 'b', '2', '_', 'c', '3']
'!@#$' does not contain word characters


In [None]:
pattern = r"\W"
test_strings = ["Hello123World", "abcdef", "a1b2_c3", "!@#$"]

for s in test_strings:
    matches = re.findall(pattern, s)
    if matches:
        print(f"'{s}' contains non-word characters: {matches}")
    else:
        print(f"'{s}' does not contain non-word characters")

'Hello123World' does not contain non-word characters
'abcdef' does not contain non-word characters
'a1b2_c3' does not contain non-word characters
'!@#$' contains non-word characters: ['!', '@', '#', '$']


In [None]:
pattern = r"Spain\Z"
test_strings = ["I love Spain", "Spain is beautiful", "Spanish food is great"]

for s in test_strings:
    if re.search(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")

'I love Spain' matches the pattern 'Spain\Z'
'Spain is beautiful' does not match the pattern 'Spain\Z'
'Spanish food is great' does not match the pattern 'Spain\Z'


In [None]:
pattern = r"^Spain"
test_strings = ["I love Spain", "Spain is beautiful", "Spanish food is great"]

for s in test_strings:
    if re.search(pattern, s):
        print(f"'{s}' matches the pattern '{pattern}'")
    else:
        print(f"'{s}' does not match the pattern '{pattern}'")

'I love Spain' does not match the pattern '^Spain'
'Spain is beautiful' matches the pattern '^Spain'
'Spanish food is great' does not match the pattern '^Spain'


### Sets

A set is a group of characters inside a pair of square brackets [] with a special meaning:

Set	Description

`[arn]`	Returns a match where one of the specified characters (a, r, or n) is present

`[a-n]`	Returns a match for any lower case character, alphabetically between a and n

`[^arn]`	Returns a match for any character EXCEPT a, r, and n

`[0123]`	Returns a match where any of the specified digits (0, 1, 2, or 3) is present

`[0-9]`	Returns a match for any digit between 0 and 9

`[0-5][0-9]`	Returns a match for any two-digit number from 00 to 59

`[a-zA-Z]`	Returns a match for any character alphabetically between a and z, lower case OR upper case

`[+]`	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
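
A few characters keep a special role inside a set depending on their position. The sketch below, using a made-up test string, shows literal metacharacters, a `^` that is not in first position, and a trailing `-`.

```python
import re

text = "2+2=4 (a*b) ^caret a-b"

# Inside a set, +, *, ( and ) are literal characters.
ops = re.findall(r"[+*()]", text)

# ^ negates only when it is the first character; elsewhere it is literal.
carets = re.findall(r"[a^]", text)

# - defines a range between two characters; placed last it is a literal hyphen.
hyphens = re.findall(r"[=-]", text)

print(ops)      # ['+', '(', '*', ')']
print(carets)   # ['a', '^', 'a', 'a']
print(hyphens)  # ['=', '-']
```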



In [None]:
text = "The rain in Spain"
pattern = r"[arn]"
matches = re.findall(pattern, text)
print(matches)

['r', 'a', 'n', 'n', 'a', 'n']


In [None]:
text = "The rain in Spain"
pattern = r"[a-n]"
matches = re.findall(pattern, text)
print(matches)

['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']


In [None]:
text = "The rain in Spain"
pattern = r"[^arn]"
matches = re.findall(pattern, text)
print(matches)

['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']


In [None]:
text = "There are 2 apples and 3 oranges."
pattern = r"[0123]"
matches = re.findall(pattern, text)
print(matches)

['2', '3']


In [None]:
text = "My phone number is 1234567890."
pattern = r"[0-9]"
matches = re.findall(pattern, text)
print(matches)

['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']


In [None]:
# [0-5][0-9]
# Description: Returns a match for any two-digit numbers from 00 to 59.
text = "The times are 12:30 and 45:15."
pattern = r"[0-5][0-9]"
matches = re.findall(pattern, text)
print(matches)

['12', '30', '45', '15']


In [None]:
# [a-zA-Z]
# Description: Returns a match for any character alphabetically between a and z (lowercase or uppercase).

text = "Hello World!"
pattern = r"[a-zA-Z]"
matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']


In [None]:
# [+]
# Description: In sets, +, *, ., |, (), $, {} has no special meaning, so [+] means return a match for any + character in the string.

text = "This is a test + string with + signs."
pattern = r"[+]"
matches = re.findall(pattern, text)
print(matches)


['+', '+']


In [None]:
text = "This is a test > string."
pattern = r"\>"
matches = re.findall(pattern, text)
print(matches)  # Output: ['>']

['>']


In [None]:
text = "aaaab"
pattern = r"a*"
matches = re.findall(pattern, text)
print(matches)

['aaaa', '', '']


In [None]:
text = "ab ac ad"
pattern = r"a?"
matches = re.findall(pattern, text)
print(matches)

['a', '', '', 'a', '', '', 'a', '', '']


In [None]:
text = "aaaab"
pattern = r"a*?"
matches = re.findall(pattern, text)
print(matches)

['', 'a', '', 'a', '', 'a', '', 'a', '', '']


In [None]:
text = "aaaab"
pattern = r"a+?"
matches = re.findall(pattern, text)
print(matches)

['a', 'a', 'a', 'a']


In [None]:
text = "ab ac ad"
pattern = r"a??"
matches = re.findall(pattern, text)
print(matches)

['', 'a', '', '', '', 'a', '', '', '', 'a', '', '']


In [None]:
text = "aaabbbccc"
pattern = r"a{3}"
matches = re.findall(pattern, text)
print(matches)

['aaa']


In [None]:
text = "aaabbbccc"
pattern = r"a{3,}"
matches = re.findall(pattern, text)
print(matches)

['aaa']


In [None]:
text = "aaaabbbccc"
pattern = r"a{3,}"
matches = re.findall(pattern, text)
print(matches)

['aaaa']


In [None]:
text = "aaabbbccc"
pattern = r"a{3,5}"
matches = re.findall(pattern, text)
print(matches)

['aaa']


In [None]:
text = "aaabbbccc"
pattern = r"a{3,5}?"
matches = re.findall(pattern, text)
print(matches)

['aaa']


1. The pattern **r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"** is designed to match most **common email address formats.**
2. However, it is a simplification: it does not cover every address the email standards allow (for example quoted local parts or less common special characters), and it accepts some invalid ones, such as a local part with consecutive dots.
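
To see what the pattern accepts and rejects, it can be applied with `re.fullmatch()` to a few candidates. This is a rough sketch; the addresses are invented for illustration.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

candidates = [
    "support@example.com",            # plain address
    "42.first+tag@mail.example.org",  # digits first, plus tag, subdomain
    "no-at-sign.example.com",         # missing @
    "user@localhost",                 # no dot in the domain part
]

# fullmatch() checks that the whole candidate fits the pattern.
results = {c: bool(EMAIL.fullmatch(c)) for c in candidates}

for address, ok in results.items():
    print(f"{address}: {ok}")
```

The first two candidates match and the last two do not.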


In [None]:
text = "Please contact us at support@example.com or sales@example.org."
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
matches = re.findall(pattern, text)
print(matches)

['support@example.com', 'sales@example.org']


In [None]:
text = "The rain in Spain"
x = re.findall("ai", text)
print(x)

['ai', 'ai']


In [None]:
text = "The rain in Spain"
x = re.findall("Portugal", text)
print(x)

[]


In [None]:
text = "The rain in Spain"
x = re.search(r"\s", text)

print("The first white-space character is located in position:", x.start())

# In the string "The rain in Spain", the first whitespace character occurs between "The" and "rain". Its position (index) is 3 (since indexing starts at 0).

The first white-space character is located in position: 3


In [None]:
text = "The rain in Spain"
x = re.search("Portugal", text)
print(x)

None


In [None]:
text = "The rain in Spain"
x = re.split(r"\s", text)
print(x)

['The', 'rain', 'in', 'Spain']


In [None]:
text = "The rain in Spain"
x = re.split(r"\s", text, maxsplit=1)
print(x)

['The', 'rain in Spain']


In [None]:
text = "The rain in Spain"
x = re.sub(r"\s", "9", text)
print(x)

# Replace every white-space character with the number 9

The9rain9in9Spain


In [None]:
text = "The rain in Spain"
x = re.sub(r"\s", "9", text, count=2)
print(x)

The9rain9in Spain


In [None]:
text = "The rain in Spain"
x = re.search("ai", text)
print(x)  # this will print a Match object

<re.Match object; span=(5, 7), match='ai'>


The Match object has **properties and methods** used to retrieve information about the search and the result:

**1. `.span()`** (method)	returns a tuple containing the start and end positions of the match.

**2. `.string`** (property)	returns the string passed into the function.

**3. `.group()`** (method)	returns the part of the string where there was a match.
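
These attributes are consistent with each other: slicing the original string with the span gives the same text as `.group()`. A small sketch, reusing the sentence from the surrounding cells:

```python
import re

text = "The rain in Spain"
m = re.search(r"\bS\w+", text)

start, end = m.span()          # same values as m.start() and m.end()
same_slice = text[start:end]   # slice the original string with the span

print(start, end)                 # 12 17
print(same_slice)                 # Spain
print(m.group() == same_slice)    # True
print(m.string == text)           # True
```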

In [None]:
# 1.  re.search(r"\bS\w+", text) looks for a word that starts with "S" and is followed by one or more word characters.
# 2.  x.span() provides the start and end indices of this match.

text = "The rain in Spain"
x = re.search(r"\bS\w+", text)
print(x.span())

(12, 17)


In [None]:
text = "The rain in Spain"
x = re.search(r"\bS\w+", text)
print(x.string)

The rain in Spain


In [None]:
text = "The rain in Spain"
x = re.search(r"\bS\w+", text)
print(x.group())


# 1. The pattern finds "Spain" because it starts with S (capitalized) and is followed by other word characters.
# 2. The \b ensures that it's matching a whole word starting with S, not just part of another word.
# 3. The \w+ allows it to match the rest of the word after the initial S.



Spain


In [None]:
text = "The rain in spain"
x = re.search(r"\bS\w+", text)
print(x.string)


# AttributeError: 'NoneType' object has no attribute 'string'

AttributeError: 'NoneType' object has no attribute 'string'
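
Because `re.search()` returns `None` when nothing matches, it is safer to test the result before reading attributes from it. A minimal sketch; the helper name `first_capital_s_word` is hypothetical and used only for illustration.

```python
import re

def first_capital_s_word(text):
    """Return the first word starting with a capital S, or None if absent."""
    m = re.search(r"\bS\w+", text)
    # Only call .group() when a Match object was actually returned.
    return m.group() if m else None

found = first_capital_s_word("The rain in Spain")
missing = first_capital_s_word("The rain in spain")

print(found)    # Spain
print(missing)  # None
```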

In [None]:
text = "The rain in spain"
x = re.search(r"\bs\w+", text, re.IGNORECASE)
print(x.group())  # This will print "spain"

spain


### Differences between r"^" and r"[^]"

In [None]:
words = [
    "arnold",
    "arthritis",
    "arnica",
    "barnacle",
    "paranoid",
    "arnival"
]

pattern = r"^arn"

for s in words:
    if re.match(pattern, s):
        print(f"'{s}' starts with 'arn'")
    else:
        print(f"'{s}' does not start with 'arn'")

'arnold' starts with 'arn'
'arthritis' does not start with 'arn'
'arnica' starts with 'arn'
'barnacle' does not start with 'arn'
'paranoid' does not start with 'arn'
'arnival' starts with 'arn'


In [None]:
text = "The rain in Spain stays mainly in the plain."
pattern = r"[^arn]"
matches = re.findall(pattern, text)

print("Characters not 'a', 'r', or 'n':", matches)

Characters not 'a', 'r', or 'n': ['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i', ' ', 's', 't', 'y', 's', ' ', 'm', 'i', 'l', 'y', ' ', 'i', ' ', 't', 'h', 'e', ' ', 'p', 'l', 'i', '.']


### Regular Expressions in Pandas DataFrame


In [None]:
import pandas as pd

### Example 1: Extracting Phone Numbers

In [None]:
df = pd.DataFrame({'phone': ['(123) 456-7890', '(456) 789-0123', '(789) 012-3456']})
phone_column = df['phone']

area_code = phone_column.str.extract(r'\((\d{3})\)')
last_four_digits = phone_column.str.extract(r'-(\d{4})')

df['area_code'] = area_code
df['last_four_digits'] = last_four_digits

print(df)


            phone area_code last_four_digits
0  (123) 456-7890       123             7890
1  (456) 789-0123       456             0123
2  (789) 012-3456       789             3456


1. area_code = phone_column.str.extract(r'\((\d{3})\)')

* This line extracts the area code from each phone number:
  * phone_column.str.extract(): This method applies a regex extraction to each element in the series.
  * r'\((\d{3})\)': This is the regex pattern:
  * \(: Matches a literal opening parenthesis
  * (\d{3}): Captures exactly 3 digits
  * \) : Matches a literal closing parenthesis

The parentheses around \d{3} create a capturing group, which means the extracted result will be just the 3-digit area code, not including the parentheses.

2. last_four_digits = phone_column.str.extract(r'-(\d{4})')

* This line extracts the last four digits of each phone number:

  * Again, phone_column.str.extract() is used to apply the regex to each element.
  * r'-(\d{4})': This regex pattern:
  * -: Matches a literal hyphen
  * (\d{4}): Captures exactly 4 digits

The parentheses around \d{4} create a capturing group, extracting just the last four digits, not including the hyphen.
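
Both pieces can also be captured with one pattern that uses named groups. The sketch below shows the idea with the plain `re` module so it runs without pandas; the group names `area_code` and `last_four` are illustrative choices. `Series.str.extract` accepts named groups as well and uses the names as column labels.

```python
import re

# One pattern, two named capturing groups.
PHONE = re.compile(r"\((?P<area_code>\d{3})\)\s*\d{3}-(?P<last_four>\d{4})")

phones = ["(123) 456-7890", "(456) 789-0123", "not a phone"]

rows = []
for p in phones:
    m = PHONE.search(p)
    # groupdict() maps each group name to the text it captured.
    rows.append(m.groupdict() if m else {"area_code": None, "last_four": None})

for row in rows:
    print(row)
```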

### Example 2: Cleaning Text Data

In [None]:
# df = pd.DataFrame({'text': ['hello   world!', '   how are you? ', '   $%!# 123']})
# text_column = df['text']

# df

In [None]:
df = pd.DataFrame({'text': ['hello    world!', '  how are you? ', '   $%!# 123']})
text_column = df['text']

clean_text = text_column.str.replace(r'[^a-zA-Z0-9 ]+', '', regex=True).str.replace(r'\s+', ' ', regex=True).str.strip()

df['clean_text'] = clean_text

print(df)

              text   clean_text
0  hello    world!  hello world
1    how are you?   how are you
2         $%!# 123          123


**a. First replacement:**
1. **.str.replace(r'[^a-zA-Z0-9 ]+', '', regex=True)** :
  * This regex pattern matches any run of characters that are not letters, digits, or spaces.
  * It replaces each such run with an empty string, effectively removing it.
  * For example, '$%!#' is removed.

**b. Second replacement:**
2. **.str.replace(r'\s+', ' ', regex=True)** :
  * This matches one or more whitespace characters.
  * It replaces each run of consecutive whitespace characters with a single space.
  * For example, the four spaces in 'hello    world' become one.

**c. Stripping whitespace:**
3. **.str.strip()** :
  * This removes leading and trailing whitespace from each string.
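
The same three steps can be tried on single strings with the plain `re` module, which makes each stage easy to inspect. A sketch under that assumption; the function name `clean_text` is chosen here for illustration.

```python
import re

def clean_text(s):
    # Step a: drop every run of characters that is not a letter, digit, or space.
    s = re.sub(r"[^a-zA-Z0-9 ]+", "", s)
    # Step b: collapse each run of whitespace into one space.
    s = re.sub(r"\s+", " ", s)
    # Step c: trim leading and trailing whitespace.
    return s.strip()

samples = ["hello    world!", "  how are you? ", "   $%!# 123"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)  # ['hello world', 'how are you', '123']
```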

### Example 3: Filtering Rows Based on Pattern

In [None]:
# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Email': ['alice@example.com', 'bob@gmail.com', 'charlie@example.com', 'david@example.com']}
df = pd.DataFrame(data)

# Applying regex to filter rows
pattern = r'@example\.com$'
filtered_df = df[df['Email'].str.contains(pattern)]

print(filtered_df)

      Name                Email
0    Alice    alice@example.com
2  Charlie  charlie@example.com
3    David    david@example.com


**pattern = r'@example\.com$'** :
* The backslash before the dot in `\.com` makes the dot literal. Without it, `.` would act as a wildcard and match any character. The trailing `$` anchors the pattern at the end of the string, so only addresses ending with "@example.com" are kept.
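
The effect of the escaped dot shows up once the data contains a near miss. The sketch below uses plain `re` and invented addresses to compare the escaped and unescaped patterns.

```python
import re

emails = ["alice@example.com", "eve@examplexcom", "bob@example.com.au"]

# Unescaped dot: "." matches any character, so "examplexcom" slips through.
unescaped = [e for e in emails if re.search(r"@example.com$", e)]

# Escaped dot: only a literal "." is accepted before "com".
escaped = [e for e in emails if re.search(r"@example\.com$", e)]

print(unescaped)  # ['alice@example.com', 'eve@examplexcom']
print(escaped)    # ['alice@example.com']
```

In both cases `$` rejects "bob@example.com.au", because ".com" is not at the end of that string.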

In [None]:
text = 'There was further decline of the UHC'

# Find all occurrences of "the"
# The pattern \bthe\b uses word boundaries (\b) to ensure we match whole words only.
matches = re.finditer(r"\bthe\b", text, flags=re.IGNORECASE)

# Print the position of each match
for match in matches:
    print(f"'the' found at position {match.start()}")

'the' found at position 29


In [None]:
text = 'There was further decline of the UHC'
substring = "the"

index = text.lower().find(substring.lower())
while index != -1:
    print(f"The substring '{substring}' was found at position {index}.")
    index = text.lower().find(substring.lower(), index + 1)
if index == -1:
    print(f"No more occurrences of '{substring}' found.")

The substring 'the' was found at position 0.
The substring 'the' was found at position 13.
The substring 'the' was found at position 29.
No more occurrences of 'the' found.


In [None]:
re.findall("the", text, flags=re.I)

['The', 'the', 'the']

In [None]:
re.search("the", text)

<re.Match object; span=(13, 16), match='the'>

In [None]:
text = 'There was further decline of the UHC'

match_obj = re.search("the", text)
#index span of matched string
print(match_obj.span())

(13, 16)


In [None]:
#the matched string
print(match_obj.group())

the


In [None]:
#start position of match
print(match_obj.start())

13


In [None]:
#end position of match
print(match_obj.end())

16


In [None]:
re.match('the', text)

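Here `re.match('the', text)` returns `None`, so the cell shows no output: the text starts with 'The', and the pattern is case-sensitive.

We can also use an if-else statement that prints a custom message depending on whether the pattern is present.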

In [None]:
text = 'The focus is on 2022'

is_match = re.match('the', text, re.I)

if is_match:
    print(f"'{is_match.group()}' appears at '{is_match.span()}'")

else: #None
    print(is_match)


'The' appears at '(0, 3)'


In [None]:
print(f"'{is_match.group().lower()}' appears at '{is_match.span()}'")


'the' appears at '(0, 3)'


**re.finditer(pattern, text)** — This returns an iterator of match objects that we then wrap with a list to display them.

If you wanted to find all occurrences regardless of position, you'd use finditer() or modify the regex pattern.

In [None]:
text = 'There was further decline of the UHC'
match = re.finditer('the', text,
                    flags=re.I)
list(match)

[<re.Match object; span=(0, 3), match='The'>,
 <re.Match object; span=(13, 16), match='the'>,
 <re.Match object; span=(29, 32), match='the'>]

**re.sub(pattern, repl, text)** — this replaces the matched substring(s) with the 'repl' string.

In [None]:
text = 'my pin is 4444'
re.sub('4', '*', text)

'my pin is ****'
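
The replacement does not have to be a fixed string. As a sketch with illustrative inputs, `re.sub()` also accepts backreferences to captured groups, a function that builds each replacement, and a `count` limit.

```python
import re

dates = "Released 2023-04-15, patched 2023-05-02"

# \1, \2, \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", dates)

# A function receives each Match object and returns the replacement text.
titled = re.sub(r"\b\w", lambda m: m.group().upper(), "the rain in spain")

# count limits how many matches are replaced.
first_only = re.sub(r"4", "*", "my pin is 4444", count=1)

print(swapped)     # Released 15/04/2023, patched 02/05/2023
print(titled)      # The Rain In Spain
print(first_only)  # my pin is *444
```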

**re.split(pattern, text)** — This splits the text at the position of the match(es), into elements in a list.

In [None]:
text = "wow! nice! love it! bye! "
re.split("!", text)

['wow', ' nice', ' love it', ' bye', ' ']
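
The result above keeps the leading spaces and a trailing `' '` element. One way to get cleaner pieces, sketched here, is to let the pattern consume the whitespace after each `!` and to drop empty strings. A capturing group in the pattern keeps the delimiters in the result.

```python
import re

text = "wow! nice! love it! bye! "

# Split on "!" plus any following whitespace, then discard empty pieces.
parts = [p for p in re.split(r"!\s*", text) if p]

# With a capturing group, the delimiters are returned as list items too.
with_delims = re.split(r"(!)", "wow!nice!")

print(parts)        # ['wow', 'nice', 'love it', 'bye']
print(with_delims)  # ['wow', '!', 'nice', '!', '']
```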

### Regex metacharacters

In this section, we'll explore the different metacharacters, and use **re.findall()** to check for the presence of a pattern in a string and return all the matched substrings.

**Escaping regex characters with a backslash:**
* When you want to **exactly search for any of the below regex symbols** in a text, **you have to escape them with a backslash** (while also using the r raw string) so that they lose their special regex meaning too.
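
As a small sketch of escaping in practice (the sample text is made up), the metacharacters can be escaped by hand, or `re.escape()` can build the escaped pattern from a literal string.

```python
import re

text = "Total: $5.00 (approx.) + tax"

# Manual escaping: a backslash before each metacharacter.
manual = re.findall(r"\$5\.00", text)

# re.escape() escapes every special character in an arbitrary literal.
literal = "(approx.)"
auto = re.findall(re.escape(literal), text)

# Without escaping, "$" is an end-of-string anchor, so nothing matches.
unescaped = re.findall(r"$5.00", text)

print(manual)     # ['$5.00']
print(auto)       # ['(approx.)']
print(unescaped)  # []
```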

**1. `.` (The dot character, or wildcard)** —this matches and returns any character in the string, except a new line.

This could be a digit, white space, letter, or punctuation.

In [None]:
pattern = r'.'
re.findall(pattern,
        "Wow! We're now_25")

['W',
 'o',
 'w',
 '!',
 ' ',
 'W',
 'e',
 "'",
 'r',
 'e',
 ' ',
 'n',
 'o',
 'w',
 '_',
 '2',
 '5']

**2. `\w (lowercase w)`** — Any alphanumeric character (letter, digit, or underscore).

In [None]:
pattern = r'\w'
re.findall(pattern,
        "Wow! We're now_25")

['W', 'o', 'w', 'W', 'e', 'r', 'e', 'n', 'o', 'w', '_', '2', '5']

**3. `\W (uppercase w)`** — anything that is not \w such as spaces, and special characters.

In [None]:
pattern = r'\W'
re.findall(pattern,
        "Wow! We're now_25")

['!', ' ', "'", ' ']

**4. `\d`**: any digit, 0 to 9.

In [None]:
pattern = r'\d'
re.findall(pattern,
        "Wow! We're now_25")

['2', '5']

**5. `\D`** — Any non-digit. Negates \d.

In [None]:
pattern = r'\D'
re.findall(pattern,
        "Wow! now_25")

['W', 'o', 'w', '!', ' ', 'n', 'o', 'w', '_']

**6. `\s (lowercase s)`** — A white space.

In [None]:
pattern = r'\s'
re.findall(pattern,
        "Wow! We're now_25")

[' ', ' ']

In [None]:
# Define the pattern for whitespace
pattern = r'\s'

# Use re.finditer to get match objects
obj_iter_1 = re.finditer(pattern, "Wow! We're now_25")

# Print the span of each match
for match in obj_iter_1:
    print(match.span())

(4, 5)
(10, 11)


**7. `\S (uppercase s)`** — Negates \s. Returns anything that is not a white space.

In [None]:
pattern = r'\S'
re.findall(pattern,
        "Wow! Now_25")

['W', 'o', 'w', '!', 'N', 'o', 'w', '_', '2', '5']

### Character sets

**8.** `[]` matches any of the characters inside the square brackets.
* For example, **the pattern '[abc]' looks for either a or b or c in the text**,
* and can also be written as **'a|b|c'**.
* You can also define `a range inside the brackets using a dash`, instead of writing down every single character.
  * **For example, [a-fA-F]** matches any lowercase or uppercase letters from a to f. The code below returns any vowels.

In [None]:
pattern = r'[aeiou]'
re.findall(pattern,
        "Wow! We're now_25")

['o', 'e', 'e', 'o']

In [None]:
pattern = r'[aeiou]'
obj_iter_2 = re.finditer(pattern,
        "Wow! We're now_25")

# Print the span of each match
for match in obj_iter_2:
    print(match.span())

(1, 2)
(6, 7)
(9, 10)
(12, 13)


**9.** `[^]` Having a hat ^ character right after the opening square bracket negates the character set.
* It returns the **opposite of the characters or ranges inside the square brackets.**
  * The code below returns everything except the letters m to z.

In [None]:
#Any char except letters m to z
pattern = r'[^m-zM-Z]'
re.findall(pattern,
        "Wow! We're now_25")

['!', ' ', 'e', "'", 'e', ' ', '_', '2', '5']

### Repetition regex patterns
Also called quantifiers, these special characters are written **right after a pattern or character** to tell the regex engine how many times to match it.


**10.** `+(once or more)` — Matches if the previous pattern appears one or more times.
  * The code below matches 'hell' followed by one or more 'o' characters.

In [None]:
#match o in hello once or many times
text = 'hell helo  hhhelllooo hello ago helloo hellooo'
pattern = r'hello+'
re.findall(pattern, text)

['hello', 'helloo', 'hellooo']

**11.** `* (zero or more)`—Matches if the previous pattern appears zero or many times.

In [None]:
#match o in hello zero or many times
text = 'hel hell hheelloo hello ago helloo hellooo hellooou bellooo'
pattern = r'hello*'
re.findall(pattern, text)

['hell', 'hello', 'helloo', 'hellooo', 'hellooo']

**12.** `? (zero or once)`— Matches if the previous pattern appears zero or one time.

In [None]:
#match o in hello zero times or once
text = 'hell helo hello ago helloo hellooo'
pattern = r'hello?'
re.findall(pattern, text)

# 'hell': matches "hell" (zero "o"s)
# 'hello': matches "hello" (one "o")
# 'hello': matches the first five characters of "helloo"; the extra "o" is left unmatched
# 'hello': matches the first five characters of "hellooo"; the extra "oo" is left unmatched
# "helo" and "ago" produce no match because neither contains "hell"


['hell', 'hello', 'hello', 'hello']

In [None]:
text = "Helloooo World!"
pattern_lazy = r'o+?'  # Lazy match
pattern_greedy = r'o+'  # Greedy match

print(re.findall(pattern_lazy, text))
print(re.findall(pattern_greedy, text))

['o', 'o', 'o', 'o', 'o']
['oooo', 'o']


In [None]:
text = "color colour"
pattern = r'colou?r'

print(re.findall(pattern, text)) # u? makes the 'u' optional, matching both "color" and "colour".

['color', 'colour']


**13.** `{n}` —Defines the exact number of times to match the previous character or pattern.
* e.g **'d{3}' matches 'ddd'**.

In [None]:
#Extract years
text = '7.6% in 2020 now 2022/23 budget'
pattern = r'\d{4}'
re.findall(pattern, text)

['2020', '2022']

In [None]:
#Exactly three digits: each year is cut short
text = '7.6% in 2020 now 2022/23 budget'
pattern = r'\d{3}'
re.findall(pattern, text)

['202', '202']

In [None]:
#Exactly two digits: each year splits into two matches
text = '7.6% in 2020 now 2022/23 budget'
pattern = r'\d{2}'
re.findall(pattern, text)

['20', '20', '20', '22', '23']

**14.** `{min,max}` — Defines the minimum (min) and maximum (max) times to match the previous pattern.
  * e.g. 'd{2,4}' matches 'dd', 'ddd' and 'dddd'.

In [None]:
#Dot followed by 2 to 5 word chars
text = 'may@gmail.com cal@web.me ian@me.biz'
pattern = r'\.\w{2,5}'
re.findall(pattern, text)

['.com', '.me', '.biz']

In [None]:
#@ followed by 2 to 5 word chars
text = 'may@gmail.com cal@web.me ian@me.biz'
pattern = r'\@\w{2,5}'
re.findall(pattern, text)

['@gmail', '@web', '@me']

In [None]:
#2 to 5 word chars followed by @
text = 'may@gmail.com cal@web.me ian@me.biz'
pattern = r'\w{2,5}\@'
re.findall(pattern, text)

['may@', 'cal@', 'ian@']

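**15.** `{min,}`: matches the previous element at least 'min' times. Write it without a space after the comma; Python treats `{5, }` as literal text rather than as a quantifier.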

In [None]:
#Long words
text = 'universal healthcare is low'
pattern = r'\w{5,}'
re.findall(pattern, text)

['universal', 'healthcare']

Greedy quantifiers: All the quantifiers above are greedy. They consume as many characters as possible for each match, so the engine returns the longest match that still satisfies the pattern.

**For example**, re.findall('b+', 'bbbb') returns a single match, ['bbbb'], which is the longest possible one, even though four separate one-character matches ['b', 'b', 'b', 'b'] would also satisfy the pattern.

Non-greedy (lazy): You can make a quantifier non-greedy by adding a question mark (?) after it. The regex engine then returns the fewest characters that satisfy the pattern for each match. For example, re.findall('b+?', 'bbbb') returns ['b', 'b', 'b', 'b'].
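The difference between greedy and lazy matching is easiest to see when a pattern has a closing delimiter. The sketch below uses a made-up HTML-like string (an assumption chosen only for illustration) and compares `.+` with `.+?`.

```python
import re

# Hypothetical sample text, chosen only to illustrate the behavior.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? grabs as little as possible, so each match stops at the FIRST '>'.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```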

### Boundary/ anchors

**16.** `^` — matches only the start of a text, and therefore ^ is written as the first character in the pattern.

### Note that `^` is different from `[^..]` which negates the pattern enclosed in square brackets.

In [None]:
#Starts with three digits
text = '512,000 units'
pattern = r'^\d\d\d'
re.findall(pattern, text)

['512']
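To make the note about `^` versus `[^..]` concrete, here is a small sketch on a made-up sentence (the sample text is an assumption). Outside square brackets `^` anchors the match to the start of the text; as the first character inside square brackets it negates the set.

```python
import re

# Hypothetical sample text.
text = "42 apples and 7 pears"

# ^ as an anchor: digits are returned only if they sit at the very start.
starts_with_digits = re.findall(r"^\d+", text)

# ^ inside [...]: negation, so this collects runs of NON-digit characters.
non_digit_runs = re.findall(r"[^\d]+", text)

print(starts_with_digits)  # ['42']
print(non_digit_runs)      # [' apples and ', ' pears']
```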

**17.** `$` — matches the end of the string and is therefore written at the end of a pattern.

In [None]:
#Ends with two digits (no match here, because this text ends with 'units')
text = '500,000 units'
pattern = r'\d\d$'
re.findall(pattern, text)

[]

**18.** `\b (word boundary)` — Matches the boundary right before or after a word, or the empty string between a \w and a \W.

In [None]:
pattern = r'\b'
re.findall(pattern,"Wow! We're now_25")

['', '', '', '', '', '', '', '']

In [None]:
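# `text` is reused from an earlier cell; the six matches below are consistent with text = '500,000 units'
pattern = r'\b|(?=\W)|(?<=\W)'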
matches = re.findall(pattern, text)
print(matches)

['', '', '', '', '', '']


To see the boundaries, use the **re.sub()** function to replace \b with the `~` symbol.

In [None]:
pattern = r'\b'
re.sub(pattern, '~', "Wow! We're now_25")

"~Wow~! ~We~'~re~ ~now_25~"

1. The pattern r'\btest\b' uses word boundaries (\b) before and after "test".
2. This ensures that we only match "test" when it's surrounded by word boundaries.

In [None]:
text = "I'm taking a test tomorrow. But testing is fun too. Test"
pattern = r'\btest\b'

matches = re.findall(pattern, text)
print(matches)

['test']


In [None]:
text = "I'm taking a test tomorrow. But testing is fun too. Test"
pattern = r'\bt(est|esting)?\b'  # Matches "test" and "testing", but findall returns only the captured group ('est', 'esting')

matches = re.findall(pattern, text)
print(matches)


['est', 'esting']


In [None]:
text = "I'm taking a test tomorrow. But testing is fun too. Testo"
pattern = r'\bt(est|esting|esto)\b'  # Matches "test", "testing" and "testo"; findall returns only the captured group

matches = re.findall(pattern, text, flags=re.IGNORECASE)
print(matches)


['est', 'esting', 'esto']
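The two cells above return only the text inside the parentheses, because `re.findall()` returns the captured group when the pattern has one. If the whole word is wanted, one option is a non-capturing group `(?:...)`, which is covered later in this document. A minimal sketch with the same text:

```python
import re

text = "I'm taking a test tomorrow. But testing is fun too. Testo"

# (?:...) groups the alternatives without capturing them,
# so findall returns the full match instead of the group contents.
pattern = r"\bt(?:est|esting|esto)\b"

words = re.findall(pattern, text, flags=re.IGNORECASE)
print(words)  # ['test', 'testing', 'Testo']
```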


### Groups


**19.** `()` — When you write a regex pattern, you can define groups using parentheses.

This is useful for extracting and returning details from a string.

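Note that parentheses do not change what the pattern matches; they divide the match into sections that you can retrieve separately. They do change what `re.findall()` returns: when the pattern contains groups, it returns the captured groups instead of the whole match.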
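As a quick illustration of how groups affect the return value of `re.findall()`, the sketch below runs three variants of one pattern on a short sample string (the string is an assumption made for this example).

```python
import re

# Hypothetical sample text.
text = "may@gmail.com cal@web.me"

# No groups: findall returns the whole match.
no_groups = re.findall(r"\w+@\w+", text)

# One group: findall returns only what the group captured.
one_group = re.findall(r"(\w+)@\w+", text)

# Two groups: findall returns one tuple per match.
two_groups = re.findall(r"(\w+)@(\w+)", text)

print(no_groups)   # ['may@gmail', 'cal@web']
print(one_group)   # ['may', 'cal']
print(two_groups)  # [('may', 'gmail'), ('cal', 'web')]
```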

In [None]:
text = 'Yvonne worked for von'
pattern = r'(.o.)'
re.findall(pattern, text)

['von', 'wor', 'for', 'von']

In [None]:
text = 'Yvonne worked for von'
pattern = r'(.o.).'
matches = re.findall(pattern, text)
print(matches)

['von', 'wor', 'for']


In [None]:
text = 'Yvonne worked for von'
pattern = r'\b(.o.)\b'
matches = re.findall(pattern, text)
print(matches)

['for', 'von']


In [None]:
text = 'this is @sue email sue@gmail.com'
pattern = r'(\w+)@(\w+)\.(\w+)\b'
m = re.search(pattern, text)
#match object
print(m)

<re.Match object; span=(19, 32), match='sue@gmail.com'>


In [None]:
#full match
m.group(0)

'sue@gmail.com'

In [None]:
m.group(1)

'sue'

In [None]:
m.group(2)

'gmail'

In [None]:
m.group(3)

'com'

In [None]:
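# \1 is a backreference: it matches the same text that group 1 just captured,
# so (\w)\1 finds any word character that is immediately repeated
text = 'hello, we need 22 books'
pattern = r'(\w)\1'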
list(re.finditer(pattern, text))

[<re.Match object; span=(2, 4), match='ll'>,
 <re.Match object; span=(11, 13), match='ee'>,
 <re.Match object; span=(15, 17), match='22'>,
 <re.Match object; span=(19, 21), match='oo'>]

Naming and accessing captured groups using `?P<name>` and `?P=name` respectively

You can assign a name to a group to access it later.

This is better than the \grp_number notation when you have many groups, and it increases the readability of your regex.

To access matched groups, use m.group('name').

In [None]:
text = '08 Dec'
pattern = r'(?P<day>\d{2})\s(?P<month>\w{3})'  # raw string, so \d and \s reach the regex engine unchanged
m = re.search(pattern, text)
m.group('day'), m.group('month')


('08', 'Dec')
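The text above mentions `?P=name` for referring back to a named group, but the cell only shows `?P<name>`. The sketch below rewrites the earlier doubled-character pattern with a named backreference and also shows `groupdict()`, which returns all named groups at once. The group name `ch` is an arbitrary choice made for this example.

```python
import re

# (?P<ch>\w) captures one word character under the name 'ch';
# (?P=ch) then requires the same character again.
text = "hello, we need 22 books"
pattern = r"(?P<ch>\w)(?P=ch)"
doubles = [m.group(0) for m in re.finditer(pattern, text)]
print(doubles)  # ['ll', 'ee', '22', 'oo']

# groupdict() returns every named group as a dictionary.
m = re.search(r"(?P<day>\d{2})\s(?P<month>\w{3})", "08 Dec")
parts = m.groupdict()
print(parts)  # {'day': '08', 'month': 'Dec'}
```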

### Non-capturing groups

**20.** `(?:...)`: groups part of a pattern without capturing it.

Put `?:` right after the opening parenthesis of the group you do not want returned.

The code below matches numbers with percentage signs and returns the numbers only.

In [None]:
text = 'date 23 total 42% date 17 total 35%'
pattern = r'(\d+)(?:%)'
re.findall(pattern, text)

['42', '35']

In [None]:
text = "Hello world"
pattern_capturing = r'(world)'
pattern_non_capturing = r'(?:world)'

match_capture = re.search(pattern_capturing, text)
match_non_capture = re.search(pattern_non_capturing, text)

print(match_capture.group(1))  # Outputs: world
try:
    print(match_non_capture.group(1))  # Raises IndexError
except IndexError:
    print("No capturing group")

world
No capturing group


**21.** `| (or)` — This returns all matches of either one pattern or another.

In [None]:
text = 'the sunny sun shines'
re.findall(r'sun|shine', text)

['sun', 'sun', 'shine']
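Two details of `|` are worth checking by hand: it splits the entire pattern unless it is wrapped in a group, and alternatives are tried from left to right. The sketch below uses made-up sample strings to show both.

```python
import re

# Without a group, | separates the whole pattern: 'gra' OR 'ey'.
loose = re.findall(r"gra|ey", "gray grey graey")

# Inside a (non-capturing) group, | only applies to that part of the pattern.
grouped = re.findall(r"gr(?:a|e)y", "gray grey graey")

# Alternatives are tried left to right, so the shorter one can win.
short_first = re.findall(r"sun|sunny", "sunny sun")
long_first = re.findall(r"sunny|sun", "sunny sun")

print(loose)        # ['gra', 'ey', 'gra', 'ey']
print(grouped)      # ['gray', 'grey']
print(short_first)  # ['sun', 'sun']
print(long_first)   # ['sunny', 'sun']
```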

### Pandas and regular expressions

Pandas contains several functions that support pattern matching with regex, just as we saw with the re library. Below are the three major functions we'll use in this tutorial; the pandas documentation on working with text data describes the others.

  * `Series.str.contains(pattern)`: This function **checks for a pattern in a column (Series)** and returns True and False values (a mask) showing where the pattern matches. The mask can then be applied to the entire dataframe to return only the True rows.
  * `Series.str.extract(pattern, flags=..., expand=...)`: To use this function, we must define groups using parentheses inside the pattern. The function extracts the first match in each row and returns the groups as columns in a dataframe. When you have only one group in the pattern, use expand=False to return a series instead of a dataframe object.
  * `Series.str.replace(pattern, repl, flags=..., regex=True)`: Similar to re.sub(), this function replaces matches with the repl string. Pass regex=True so the pattern is treated as a regular expression; in recent pandas versions the default is regex=False, which treats it as a literal string.
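Before moving to the Titanic data, here is a minimal sketch of the three functions on a tiny hand-made Series. The sample values and variable names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample data.
s = pd.Series(["may@gmail.com", "no email here", "ian@me.biz"])

# contains: a boolean mask, True where the pattern is found.
has_email = s.str.contains(r"\w+@\w+\.\w+")

# extract: the captured group becomes the result; expand=False gives a Series.
domains = s.str.extract(r"@(\w+)\.", expand=False)

# replace: regex=True makes pandas treat the pattern as a regex.
masked = s.str.replace(r"\w+@", "***@", regex=True)

print(has_email.tolist())  # [True, False, True]
print(domains.tolist())    # ['gmail', nan, 'me']
print(masked.tolist())     # ['***@gmail.com', 'no email here', '***@me.biz']
```

Rows where `extract` finds no match come back as missing values (NaN), which is worth remembering before calling further string methods on the result.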



  In this section, we'll tackle seven regular expression tasks to perform the following actions:

  1. filter data to return rows that match certain criteria.
  2. extract substrings and other details from a column.
  3. replace values in a column.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/refs/heads/master/titanic.csv')

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


1. **Filtering a dataframe — s.str.contains(pattern)**

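Task 1: Filter the dataframe to return rows where the ticket number contains C followed by A.

Below is a list of 'C' ticket-prefix variations. The bold entries are the 'C A' forms we want to match; 'C.A./SOTON' is also matched because it contains 'C.A.', while a lone 'C' is not.

[**'CA.', 'CA', 'C.A.'**, 'C', 'C.A./SOTON']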

The regex pattern:
1. starts with capital C,
2. followed by an optional dot,
3. then capital A,
4. followed by an optional dot.
5. We have escaped the dot symbol to exactly match a period, not the wildcard regex character.

In [None]:
# This pattern breaks down as follows:

# C: Matches the literal character 'C'.
# \.: Matches a period (.) literally.
# ?: Makes the preceding element optional (in this case, the period).
# A: Matches the literal character 'A'.
# Another \. and ?, repeating the optional period.

pattern = r'C\.?A\.?'
mask = df['Ticket'].str.contains(pattern)
df[mask].head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
33,34,0,2,"Wheadon, Mr. Edward H",male,66.0,0,0,C.A. 24579,10.5,,S
56,57,1,2,"Rugg, Miss. Emily",female,21.0,0,0,C.A. 31026,10.5,,S
58,59,1,2,"West, Miss. Constance Mirium",female,5.0,1,2,C.A. 34651,27.75,,S
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S
66,67,1,2,"Nye, Mrs. (Elizabeth Ramell)",female,29.0,0,0,C.A. 29395,10.5,F33,S
70,71,0,2,"Jenkin, Mr. Stephen Curnow",male,32.0,0,0,C.A. 33111,10.5,,S
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,5,2,CA 2144,46.9,,S
93,94,0,3,"Dean, Mr. Bertram Frank",male,26.0,1,2,C.A. 2315,20.575,,S
134,135,0,2,"Sobey, Mr. Samuel James Hayden",male,25.0,0,0,C.A. 29178,13.0,,S
145,146,0,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S


This pattern matches strings that contain:
* "CA" (no periods)
* "C.A" (a period after the C only)
* "CA." (a period after the A only)
* "C.A." (a period after each letter)

Because `\.?` allows at most one period, a string such as "C..A" is not matched.
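A quick way to check what `C\.?A\.?` accepts is to run it with `re.search` on a few strings. The ticket-like strings below are hypothetical samples chosen for this check, not rows quoted from the dataset.

```python
import re

pattern = r"C\.?A\.?"

# Hypothetical ticket-like strings.
samples = ["CA 2144", "C.A. 24579", "CA. 2343",
           "C..A 1", "C 7076", "C.A./SOTON 34068"]

# re.search looks anywhere in the string, as Series.str.contains does.
results = {s: bool(re.search(pattern, s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {ok}")
# 'CA 2144': True
# 'C.A. 24579': True
# 'CA. 2343': True
# 'C..A 1': False   (\.? allows at most one period between C and A)
# 'C 7076': False   (no A after the C)
# 'C.A./SOTON 34068': True
```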


**"preceding element"**:

In regular expressions, the "preceding element" is the character or group that **comes immediately before a quantifier** (such as `*`, `+`, `?`, or `{n,m}`).

To get the **total number of rows with CA**, use `mask.sum()` which sums up all the True values which appear as 1 and False as 0.

In [None]:
mask.sum()

42

### 2. Extracting data — s.str.extract(patt)

Task 2: Extract all unique titles such as Mr, Miss, and Mrs from passenger names.

Regex pattern:
1. A white space,
2. followed by a sequence of word characters (enclosed in parentheses),
3. then a dot character.
4. We use parentheses to group the substring we want to capture and return.
5. expand=False returns a Series object, enabling us to call value_counts() on it.
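The effect of `expand=False` can be seen on a three-row Series before running the pattern on the full dataset. The names below are taken from the rows shown earlier; the variable names are assumptions made for this sketch.

```python
import pandas as pd

names = pd.Series(["Wheadon, Mr. Edward H",
                   "West, Miss. Constance Mirium",
                   "Rugg, Miss. Emily"])

pattern = r"\s(\w+)\."

# expand=False: one group comes back as a Series, so value_counts() is available.
titles = names.str.extract(pattern, expand=False)

# Default (expand=True): the same data, but as a one-column DataFrame.
as_frame = names.str.extract(pattern)

counts = titles.value_counts()

print(titles.tolist())          # ['Mr', 'Miss', 'Miss']
print(type(as_frame).__name__)  # DataFrame
print(counts["Miss"])           # 2
```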

In [None]:
# Wheadon, Mr. Edward H
# West, Miss. Constance Mirium

pattern = r'\s(\w+)\.'  # raw string, so \s and \w reach the regex engine unchanged
all_ts = df['Name'].str.extract(
                    pattern,
                    expand=False)
unique_ts = all_ts.value_counts()
unique_ts

Name,count
Mr,517
Miss,182
Mrs,125
Master,40
Dr,7
Rev,6
Mlle,2
Major,2
Col,2
Countess,1


Task 3a: From the 'Name' column, extract the titles, first names, and last names, and return them as columns in a new dataframe.

Regex pattern:
1. Each name contains a sequence of one or more word characters (last name),
2. then a comma,
3. a white space,
4. another sequence of word characters (title),
5. a period (.), captured together with the title,
6. a space,
7. another sequence of word characters (first name),
8. then zero or more other characters.
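The steps above can be checked on single strings with plain `re` before applying them to the column. The second name below is a hypothetical example, added to show that `\w+` stops at an apostrophe, so a search can begin partway through a surname.

```python
import re

pattern = r"(\w+), (\w+\.) (\w+).*"

# A name from the dataset preview.
m = re.search(pattern, "Braund, Mr. Owen Harris")
parts = m.groups()
print(parts)  # ('Braund', 'Mr.', 'Owen')

# Hypothetical name with an apostrophe: \w does not match "'",
# so the first group starts after it.
m2 = re.search(pattern, "O'Brien, Mr. Tim")
parts2 = m2.groups()
print(parts2)  # ('Brien', 'Mr.', 'Tim')
```

If surnames with apostrophes or hyphens matter for your analysis, the first group would need a wider character set, for example `[\w'-]+`.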

In [None]:
# Wheadon, Mr. Edward H
# West, Miss. Constance Mirium

pattern = r'(\w+), (\w+\.) (\w+).*'
df_names = df['Name'].str.extract(
                            pattern,
                            flags=re.I)
df_names

Unnamed: 0,0,1,2
0,Braund,Mr.,Owen
1,Cumings,Mrs.,John
2,Heikkinen,Miss.,Laina
3,Futrelle,Mrs.,Jacques
4,Allen,Mr.,William
...,...,...,...
886,Montvila,Rev.,Juozas
887,Graham,Miss.,Margaret
888,Johnston,Miss.,Catherine
889,Behr,Mr.,Karl


Task 3b: Clean up the dataframe above with named and ordered columns

As mentioned earlier, named groups are useful for capturing and accessing groups. With Pandas, this is especially convenient, as these names become the column names in our new dataframe.

The regex below is the same as above, except that:
1. the groups are named using `(?P<name>)`.
2. The code extracts the three named columns,
3. then we use df.reindex() to reorder them.

In [None]:
pattern = r'(?P<lastname>\w+), (?P<title>\w+\.) (?P<firstname>\w+).*'

df_named = df['Name'].str.extract(
                            pattern,
                            flags=re.I)
df_clean = df_named.reindex(columns = ['title','firstname',
                             'lastname'])
df_clean.head()

Unnamed: 0,title,firstname,lastname
0,Mr.,Owen,Braund
1,Mrs.,John,Cumings
2,Miss.,Laina,Heikkinen
3,Mrs.,Jacques,Futrelle
4,Mr.,William,Allen


### 4. Replacing values in a column — s.str.replace(pattern, repl)

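Task 4a: Convert all the titles to uppercase.

1. The regex pattern searches for a white space,
2. then one or many word characters (enclosed in parentheses),
3. then a period (.) followed by a space.
4. We then replace each match with its uppercase version. The lambda calls m.group(), which returns the whole match (space, title, period, space); uppercasing leaves the spaces and the period unchanged.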
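To see what the replacement function receives, the sketch below applies the same pattern to one name with plain `re`. It compares `m.group()` (the whole match) with `m.group(1)` (the captured title only); the variable names are assumptions made for illustration.

```python
import re

name = "Braund, Mr. Owen Harris"
pattern = r"\s(\w+)\. "

m = re.search(pattern, name)
whole_match = m.group()   # the full match, including the spaces and the period
title_only = m.group(1)   # just the text captured by the parentheses

# Uppercasing the whole match works because spaces and '.' have no case.
result = re.sub(pattern, lambda m: m.group().upper(), name)

print(repr(whole_match))  # ' Mr. '
print(repr(title_only))   # 'Mr'
print(result)             # Braund, MR. Owen Harris
```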

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
pattern = r'\s(\w+)\. ' # The pattern targets the title: whitespace, one or more word characters, then a period followed by a space.
df['Name'].str.replace(pattern,
            lambda m:m.group().upper(), regex=True)

Unnamed: 0,Name
0,"Braund, MR. Owen Harris"
1,"Cumings, MRS. John Bradley (Florence Briggs Th..."
2,"Heikkinen, MISS. Laina"
3,"Futrelle, MRS. Jacques Heath (Lily May Peel)"
4,"Allen, MR. William Henry"
...,...
886,"Montvila, REV. Juozas"
887,"Graham, MISS. Margaret Edith"
888,"Johnston, MISS. Catherine Helen ""Carrie"""
889,"Behr, MR. Karl Howell"


Task 4b: Capitalize only Mr. and Mrs. titles. In this case, we use | inside parentheses.

In [None]:
pattern = r'\s(Mr|Mrs)\. '
df['Name'].str.replace(pattern,
            lambda m:m.group().upper(), regex=True)

Unnamed: 0,Name
0,"Braund, MR. Owen Harris"
1,"Cumings, MRS. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, MRS. Jacques Heath (Lily May Peel)"
4,"Allen, MR. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, MR. Karl Howell"


Task 5: Clean the dates in the column below by inserting dashes to show the day, month, and year.

We want to preserve any other text in the column, so we cannot call pd.to_datetime() directly.

The pattern:
1. Search for two digits,
2. then two digits again,
3. then four digits,
4. and use parentheses to capture them in three groups.
5. For every match, the lambda function joins the groups, with a dash after the first and the second group.

In [27]:
data_list = ['01012023', '15032024', '25092025', '31062026', '07082027']

df = pd.DataFrame(data_list, columns=['date'])
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,date
0,01012023
1,15032024
2,25092025
3,31062026
4,07082027


In [30]:
pattern = r'(\d{2})(\d{2})(\d{4})'
df['date'].str.replace(pattern,
        lambda m: m.group(1)+'-'+
                   m.group(2)+'-'+
                   m.group(3), regex=True)

Unnamed: 0,date
0,01-01-2023
1,15-03-2024
2,25-09-2025
3,31-06-2026
4,07-08-2027



* m.group(1) refers to the first group (first 2 digits)
* m.group(2) refers to the second group (second 2 digits)
* m.group(3) refers to the third group (last 4 digits)

### Replacement Process
* The lambda function takes the Match object (m) for each match as input.
* It extracts the captured groups using group() method.
* It concatenates these groups with hyphens between them.
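When the replacement only rearranges or joins captured groups, a replacement string with backreferences (`\1`, `\2`, `\3`) is a possible alternative to the lambda. This is a sketch of that alternative on a shortened copy of the data, not the method used in the cell above.

```python
import pandas as pd

# A shortened copy of the date strings used above.
dates = pd.Series(["01012023", "15032024", "25092025"])

pattern = r"(\d{2})(\d{2})(\d{4})"

# \1, \2 and \3 in the replacement refer to the three captured groups.
dashed = dates.str.replace(pattern, r"\1-\2-\3", regex=True)

print(dashed.tolist())  # ['01-01-2023', '15-03-2024', '25-09-2025']
```

The lambda form remains the better fit when the replacement needs logic, such as changing case or reordering groups conditionally.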

### Example

Let's say we have a date like "01012023":
1. m.group(1) would return "01"
2. m.group(2) would return "01"
3. m.group(3) would return "2023"

The final result would be "01-01-2023".