In [None]:
Regex :
 >Regular expressions (regex) are incredibly useful for data cleaning and text processing in Python. 
 >Here's how to use them with Python's main libraries for data handling:

In [None]:
Tips

In [None]:
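Raw strings: Use raw strings (r'pattern') for regex so that backslash sequences such as \d or \b reach the regex engine unchanged

Test patterns: Use re.findall() on a few sample strings to test patterns before applying them to DataFrames

Vectorized operations: Prefer Pandas string methods (Series.str) over applying Python's re functions row by row; they are more concise, handle missing values, and are usually at least as fast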

Common patterns:

\d - digits

\w - word characters (letters, digits, underscore)

\s - whitespace

^ - start of string (or of each line with re.MULTILINE)

$ - end of string (or of each line with re.MULTILINE)

Flags (for case-insensitive, multiline, etc.):

In [None]:
re.findall(r'pattern', text, flags=re.IGNORECASE)
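
To make the effect of flags concrete, here is a small sketch. The sample text is made up for illustration; only standard re flags are used.

```python
import re

# Illustrative sample text (not from a real log).
text = "Error: disk full\nerror: retrying\nWARNING: low memory"

# Without a flag, matching is case-sensitive.
case_sensitive = re.findall(r'error', text)

# re.IGNORECASE matches 'Error' and 'error' alike.
case_insensitive = re.findall(r'error', text, flags=re.IGNORECASE)

# re.MULTILINE lets ^ match at the start of every line, not only the string.
line_starts = re.findall(r'^\w+', text, flags=re.MULTILINE)

# Flags can be combined with the | operator.
combined = re.findall(r'^error', text, flags=re.IGNORECASE | re.MULTILINE)

print(case_sensitive)    # ['error']
print(case_insensitive)  # ['Error', 'error']
print(line_starts)       # ['Error', 'error', 'WARNING']
print(combined)          # ['Error', 'error']
```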

In [None]:
Core Python Regex Module

In [None]:
First, you need to import Python's built-in re module:

In [14]:
import re

In [None]:
Basic Regex Functions

In [None]:
1. Search - Find first match

In [15]:
match = re.search(r'\d+', 'Order 12345')
if match:
    print(f"Found: {match.group()}")  # Found: 12345

Found: 12345
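
re.search() returns None when nothing matches, which is why the check above matters. The sketch below wraps that check in a helper and shows a named group; the function name extract_order_id is hypothetical and chosen only for this example.

```python
import re

def extract_order_id(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r'\d+', text)
    return match.group() if match else None

print(extract_order_id('Order 12345'))      # 12345
print(extract_order_id('No order number'))  # None

# A named group makes it clearer which part of the match is being read.
m = re.search(r'Order (?P<order_id>\d+)', 'Order 12345 shipped')
print(m.group('order_id'), m.span('order_id'))  # 12345 (6, 11)
```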


In [None]:
2. Find All - Get all matches

In [16]:
numbers = re.findall(r'\d+', 'Orders: 123, 456, 789')
# ['123', '456', '789']
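
One thing worth knowing about re.findall() is that its return shape depends on the capturing groups in the pattern. A short sketch with made-up order codes:

```python
import re

text = 'Orders: A-123, B-456, C-789'

# No capturing group: each result is the whole match.
whole = re.findall(r'[A-Z]-\d+', text)

# One capturing group: each result is only that group.
digits_only = re.findall(r'[A-Z]-(\d+)', text)

# Two capturing groups: each result is a tuple of the groups.
pairs = re.findall(r'([A-Z])-(\d+)', text)

print(whole)        # ['A-123', 'B-456', 'C-789']
print(digits_only)  # ['123', '456', '789']
print(pairs)        # [('A', '123'), ('B', '456'), ('C', '789')]
```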

In [None]:
3. Substitute - Replace matches

In [17]:
clean_text = re.sub(r'\s+', ' ', 'Too   many   spaces')
# 'Too many spaces'
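
re.sub() can do more than insert a fixed string. The sketch below shows backreferences, a replacement function, and the count argument; the sample strings and the helper name mask are illustrative only.

```python
import re

# Backreferences (\1, \2, ...) reuse captured groups in the replacement.
iso_date = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', 'Due 31/12/2024')
print(iso_date)  # Due 2024-12-31

# A function can compute the replacement from each match object.
def mask(match):
    return '*' * len(match.group())

masked = re.sub(r'\d+', mask, 'Card 1234 5678')
print(masked)  # Card **** ****

# count limits how many matches are replaced.
first_only = re.sub(r'\s+', ' ', 'Too   many   spaces', count=1)
print(first_only)  # Too many   spaces
```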

Pandas for DataFrames

In [None]:
Pandas has excellent regex support through string methods:

In [18]:
import pandas as pd

df = pd.DataFrame({
    'text': ['abc123', 'def456', 'ghi789'],
    'emails': ['a@b.com', 'invalid', 'x@y.org']
})

In [None]:
Common Pandas Regex Operations

In [None]:
1. Extract patterns

In [19]:
df['numbers'] = df['text'].str.extract(r'(\d+)', expand=False)  # expand=False returns a Series of digit strings
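
As a sketch of how .str.extract() behaves with more than one group, the example below uses named groups, which become column names. The extra 'nodigits' row is added here only to show what happens when nothing matches.

```python
import pandas as pd

df = pd.DataFrame({'text': ['abc123', 'def456', 'ghi789', 'nodigits']})

# One group with expand=False returns a Series; non-matching rows become NaN.
numbers = df['text'].str.extract(r'(\d+)', expand=False)

# Named groups produce a DataFrame with one column per group.
parts = df['text'].str.extract(r'(?P<letters>[a-z]+)(?P<digits>\d+)')

print(numbers)
print(parts)
```

The extracted values are strings; pd.to_numeric() can convert them when numbers are needed.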

In [None]:
2. Find rows matching pattern

In [20]:
valid_emails = df[df['emails'].str.contains(r'^\w+@\w+\.\w+$')]
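
Real columns often contain missing values, and .str.contains() returns a missing value for them, which cannot be used directly as a boolean mask. A minimal sketch, assuming a Series with one missing entry:

```python
import pandas as pd

emails = pd.Series(['a@b.com', 'invalid', None, 'x@y.org'])
pattern = r'^\w+@\w+\.\w+$'

# na=False treats missing values as non-matches, so the mask is purely boolean.
mask = emails.str.contains(pattern, na=False)
valid = emails[mask]

print(mask.tolist())   # [True, False, False, True]
print(valid.tolist())  # ['a@b.com', 'x@y.org']
```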

In [None]:
3. Replace using regex

In [21]:
df['clean_text'] = df['text'].str.replace(r'\d', 'X', regex=True)  # regex=True is required in pandas 2.0+
# abcXXX, defXXX, ghiXXX
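
Since pandas 2.0, .str.replace() treats the pattern as a literal string unless regex=True is passed. The sketch below contrasts the two modes on made-up version strings.

```python
import pandas as pd

s = pd.Series(['v1.5', 'v1x5'])

# regex=False: '1.5' is a literal substring, so only 'v1.5' changes.
literal = s.str.replace('1.5', 'N', regex=False)

# regex=True: '.' matches any character, so 'v1x5' changes too.
as_regex = s.str.replace('1.5', 'N', regex=True)

# Escaping the dot restores the literal meaning inside a regex.
escaped = s.str.replace(r'1\.5', 'N', regex=True)

print(literal.tolist())   # ['vN', 'v1x5']
print(as_regex.tolist())  # ['vN', 'vN']
print(escaped.tolist())   # ['vN', 'v1x5']
```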

In [None]:
4. Split strings

In [22]:
df['parts'] = df['text'].str.split(r'\d+')
# ['abc', ''], ['def', ''], ['ghi', '']
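
Splitting is often more useful when the pieces land in separate columns. A minimal sketch, assuming pandas 1.4 or later (where .str.split() accepts regex=True) and made-up date strings:

```python
import pandas as pd

dates = pd.Series(['2024-01-15', '2023/12/31'])

# Split on either '-' or '/'; expand=True spreads the pieces across columns.
parts = dates.str.split(r'[-/]', regex=True, expand=True)
parts.columns = ['year', 'month', 'day']

print(parts)
```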

In [None]:
Practical Data Cleaning Examples

In [None]:
1. Clean phone numbers

In [None]:
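df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)

In [None]:
Explanation of df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
This line of code cleans phone numbers in a Pandas DataFrame column by removing all non-digit characters (such as -, (, ), +, and spaces). It assumes df has a phone column; passing regex=True is required in pandas 2.0+, where .str.replace() treats the pattern as a literal string by default.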

Breakdown:
df['phone']

Selects the phone column from DataFrame df.

.str.replace()

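Pandas string method that replaces substrings. With regex=True the pattern is interpreted as a regular expression; since pandas 2.0 the default is regex=False (literal replacement).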

r'[^\d]'

\d matches any digit (0-9).

[^...] is a negated character class (matches anything not in the brackets).

So, [^\d] matches any character that is not a digit.

''

Replaces matched non-digit characters with an empty string (i.e., removes them).
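
The breakdown above can be checked on a small sample. This is a sketch with an illustrative DataFrame; the phone_digits column name is an assumption, and \D is used as the built-in shorthand for [^\d].

```python
import pandas as pd

df = pd.DataFrame({'phone': ['(123) 456-7890', '+1 (800) 555-1234', None]})

# \D matches any non-digit character; missing values stay missing.
df['phone_digits'] = df['phone'].str.replace(r'\D', '', regex=True)

print(df)
```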

Example Input → Output:
"(123) 456-7890" → "1234567890"

"+1 (800) 555-1234" → "18005551234"

In [None]:
df['mixed_column'] = df['mixed_column'].str.replace(r'[^\d]', '', regex=True)
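
On mixed text, stripping every non-digit also removes decimal points. The sketch below compares that with two alternatives, keeping the dot or extracting the first number; the sample values and variable names are illustrative.

```python
import pandas as pd

s = pd.Series(['Price: $99.99', 'Order #12345', 'free'])

# Removes everything except digits, including the decimal point.
digits_only = s.str.replace(r'[^\d]', '', regex=True)

# Adding '.' to the negated class keeps decimal points.
kept_dot = s.str.replace(r'[^\d.]', '', regex=True)

# Extracting the first number and converting gives a numeric column (NaN if absent).
amount = pd.to_numeric(s.str.extract(r'(\d+(?:\.\d+)?)', expand=False))

print(digits_only.tolist())  # ['9999', '12345', '']
print(kept_dot.tolist())     # ['99.99', '12345', '']
print(amount)
```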

In [None]:
Example Input → Output:
"Order #12345" → "12345"

"Price: $99.99" → "9999" (Note: Removes . too)

In [None]:
2. Extract hashtags from text

In [None]:
df['hashtags'] = df['tweet'].str.findall(r'#\w+')

In [None]:
Explanation of df['hashtags'] = df['tweet'].str.findall(r'#\w+')
This line extracts all hashtags (e.g., #Python, #DataScience) from a DataFrame column (tweet) and stores them as a list of strings in a new column (hashtags).

Breakdown:
df['tweet']

Selects the tweet column containing text (e.g., tweets or social media posts).

.str.findall()

Pandas string method to find all occurrences of a regex pattern in each row.

r'#\w+'

#: Matches the literal hashtag symbol #.

\w+: Matches word characters (letters, numbers, underscores) after #.

+ means "1 or more occurrences".

Example matches: #AI, #100DaysOfCode, #hello_world.

Result

Each row in df['hashtags'] becomes a list of hashtags (or an empty list if none are found).
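
A runnable sketch of the hashtag extraction, using a small illustrative DataFrame. The n_hashtags column is an addition for this example and counts the items in each list.

```python
import pandas as pd

df = pd.DataFrame({'tweet': [
    'Learning #Python is fun! #Coding',
    'No hashtags here',
    '#DataScience #AI #ML',
]})

# Each row becomes a list of every '#word' found in the tweet.
df['hashtags'] = df['tweet'].str.findall(r'#\w+')

# .str.len() also works on lists, giving the number of hashtags per row.
df['n_hashtags'] = df['hashtags'].str.len()

print(df[['hashtags', 'n_hashtags']])
```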

Example Input → Output:
tweet	                            hashtags
"Learning #Python is fun! #Coding"	['#Python', '#Coding']
"No hashtags here"	                []
"#DataScience #AI #ML"	            ['#DataScience', '#AI', '#ML']


In [None]:
Another Example: Extracting Mentions (@username)

In [None]:
df['mentions'] = df['tweet'].str.findall(r'@\w+')
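
Lists of matches are awkward to aggregate directly. One possible approach, sketched here on made-up tweets, is to flatten them with explode() and count with value_counts().

```python
import pandas as pd

tweets = pd.Series([
    'Hey @sara, check this out!',
    'No mentions here',
    '@team1 and @sara collaborating',
])

mentions = tweets.str.findall(r'@\w+')

# explode() gives one row per mention; rows with empty lists become NaN and are dropped.
counts = mentions.explode().dropna().value_counts()

print(counts)
```

Note that @\w+ would also match the part after @ in an email address, so this pattern suits text where email addresses are not expected.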

In [None]:
Example Input → Output:
tweet	                            mentions
"Hey @sara, check this out!"	    ['@sara']
"No mentions here"	                 []
"@team1 and @team2 collaborating"	['@team1', '@team2']


In [None]:
3. Validate email format

In [None]:
df['is_valid_email'] = df['email'].str.match(r'^\w+@\w+\.\w+$')

In [None]:
Explanation of df['is_valid_email'] = df['email'].str.match(r'^\w+@\w+\.\w+$')
This line checks if each email address in the email column follows a basic valid email pattern and stores True/False results in a new column is_valid_email.

Breakdown:
df['email']

Selects the column containing email addresses.

.str.match()

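Pandas method that checks whether the pattern matches at the start of each string (returns True/False). Here the trailing $ in the pattern makes the match cover the entire string; .str.fullmatch() enforces a whole-string match without anchors.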

Regex Pattern r'^\w+@\w+\.\w+$'

^ : Start of the string.

\w+ : 1+ word characters (letters, numbers, underscores) for the username.

@ : Literal @ symbol.

\w+ : 1+ word characters for the domain name (e.g., gmail).

\. : Literal . (escaped because . is a regex metacharacter).

\w+$ : 1+ word characters for the TLD (e.g., com) before the string ends ($).

Result

A Boolean column where True = valid email format, False = invalid.

Example Input → Output:
email	is_valid_email
user@example.com	True
invalid.email	False
missing@tld	False
name@sub.domain.org	False (fails for subdomains)

Limitations of This Pattern
The regex ^\w+@\w+\.\w+$ is simplistic and rejects valid emails such as:

Subdomains (user@mail.example.com).

Dots in usernames (first.last@domain.com).

Multi-part domain suffixes (user@domain.co.uk).
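
These limitations, and the difference between .str.match() and .str.fullmatch(), can be checked directly. The sketch below uses the simple pattern without anchors to make that difference visible; the sample addresses are illustrative.

```python
import pandas as pd

simple = r'\w+@\w+\.\w+'  # the simple pattern, without ^ and $

emails = pd.Series([
    'user@example.com',
    'name@sub.domain.org',
    'first.last@domain.com',
])

# fullmatch requires the whole string to match, like ^...$ with match.
strict = emails.str.fullmatch(simple)
print(strict.tolist())  # [True, False, False]

# match only anchors at the start, so trailing text slips through without $.
trailing = pd.Series(['user@example.com', 'user@example.com and more'])
print(trailing.str.match(simple).tolist())      # [True, True]
print(trailing.str.fullmatch(simple).tolist())  # [True, False]
```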

In [None]:
Improved Email Validation

In [None]:
df['is_valid_email'] = df['email'].str.match(
    r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
)

In [None]:
What’s Changed?
[a-zA-Z0-9._%+-]+ : Allows dots (.), underscores (_), percent (%), plus (+), and hyphens (-) in usernames.

[a-zA-Z0-9.-]+ : Allows dots/hyphens in domain names (e.g., sub.domain).

\.[a-zA-Z]{2,}$ : Requires a TLD with 2+ letters (e.g., .com, .io).

Improved Matches:
email	is_valid_email
first.last@domain.com	True
user+filter@sub.domain.co.uk	True
invalid@.com	False
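
A quick check of the improved pattern on a few sample addresses. This is a sketch; the last address is added here to show that the pattern is still a format check, not full validation, since it accepts consecutive dots in the domain.

```python
import pandas as pd

improved = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

emails = pd.Series([
    'first.last@domain.com',
    'user+filter@sub.domain.co.uk',
    'invalid@.com',
    'a@b..com',  # malformed domain, but still accepted by this pattern
])

is_valid = emails.str.match(improved)
print(is_valid.tolist())  # [True, True, False, True]
```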