In [None]:
Regex:
 >Regular expressions (regex) describe text patterns, which makes them useful for data cleaning and text processing in Python.
 >Here's how to use them with the built-in re module and with pandas:

In [None]:
Core Python Regex Module

In [None]:
First, you need to import Python's built-in re module:

In [14]:
import re

In [None]:
Basic Regex Functions

In [None]:
1. Search - Find first match

In [15]:
match = re.search(r'\d+', 'Order 12345')
if match:
    print(f"Found: {match.group()}")  # Found: 12345

Found: 12345


In [None]:
2. Find All - Get all matches

In [16]:
numbers = re.findall(r'\d+', 'Orders: 123, 456, 789')
# ['123', '456', '789']

In [None]:
3. Substitute - Replace matches

In [17]:
clean_text = re.sub(r'\s+', ' ', 'Too   many   spaces')
# 'Too many spaces'
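
In [None]:
The three functions above can be combined on a single string. This is a minimal sketch; the sample text in `line` is made up for illustration and does not come from a real dataset.

```python
import re

# Hypothetical input used only for illustration.
line = "Order   12345 shipped;  order 678 pending"

# re.search returns the first match object, or None if nothing matches.
first = re.search(r"\d+", line)
first_id = first.group() if first else None

# re.findall returns every non-overlapping match as a list of strings.
all_ids = re.findall(r"\d+", line)

# re.sub replaces each run of whitespace with a single space.
tidy = re.sub(r"\s+", " ", line)

print(first_id)  # 12345
print(all_ids)   # ['12345', '678']
print(tidy)      # Order 12345 shipped; order 678 pending
```

Checking for None before calling group() matters because re.search returns None when the pattern is absent.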

In [None]:
Pandas for DataFrames

In [None]:
Pandas has excellent regex support through string methods:

In [18]:
import pandas as pd

df = pd.DataFrame({
    'text': ['abc123', 'def456', 'ghi789'],
    'emails': ['a@b.com', 'invalid', 'x@y.org']
})

In [None]:
Common Pandas Regex Operations

In [None]:
1. Extract patterns

In [19]:
df['numbers'] = df['text'].str.extract(r'(\d+)')
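
In [None]:
str.extract returns NaN for rows where the pattern does not match, and it can capture several groups at once. The sketch below uses a small made-up Series (`s`) and illustrative group names (`letters`, `digits`) to show both behaviours.

```python
import pandas as pd

# Hypothetical sample data; the last row contains no digits.
s = pd.Series(["abc123", "def456", "no digits"])

# With one capture group, expand=False returns a Series instead of a DataFrame.
digits = s.str.extract(r"(\d+)", expand=False)

# Named groups become column names when several groups are captured.
parts = s.str.extract(r"(?P<letters>[a-z]+)(?P<digits>\d+)")

print(digits.tolist())         # last entry is NaN because nothing matched
print(parts.columns.tolist())  # ['letters', 'digits']
```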

In [None]:
2. Find rows matching pattern

In [20]:
valid_emails = df[df['emails'].str.contains(r'^\w+@\w+\.\w+$')]

In [None]:
3. Replace using regex

In [21]:
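# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['clean_text'] = df['text'].str.replace(r'\d', 'X', regex=True)
# abcXXX, defXXX, ghiXXX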

In [None]:
4. Split strings

In [22]:
df['parts'] = df['text'].str.split(r'\d+')
# ['abc', ''], ['def', ''], ['ghi', '']

In [None]:
Practical Data Cleaning Examples

In [None]:
1. Clean phone numbers

In [None]:
# Remove every non-digit character; regex=True is needed in pandas 2.0+
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
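
In [None]:
A self-contained version of the phone cleanup, as a sketch on made-up numbers. The DataFrame `phones` and the column `phone_digits` are illustrative names. It passes regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical phone numbers in mixed formats.
phones = pd.DataFrame(
    {"phone": ["(555) 123-4567", "555.987.6543", "+1 555 000 1111"]}
)

# \D matches any non-digit character, so only the digits remain.
phones["phone_digits"] = phones["phone"].str.replace(r"\D", "", regex=True)

print(phones["phone_digits"].tolist())
# ['5551234567', '5559876543', '15550001111']
```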

In [None]:
2. Extract hashtags from text

In [None]:
df['hashtags'] = df['tweet'].str.findall(r'#\w+')
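
In [None]:
str.findall returns one list per row, which is empty when a row has no match. A minimal sketch on invented tweets (`tweets` and `n_hashtags` are illustrative names):

```python
import pandas as pd

# Hypothetical tweets used only for illustration.
tweets = pd.DataFrame(
    {"tweet": ["Loving #python and #pandas", "no tags here", "#regex"]}
)

# Each row gets a list of every hashtag found in that tweet.
tweets["hashtags"] = tweets["tweet"].str.findall(r"#\w+")

# .str.len() also works on lists, giving the number of hashtags per row.
tweets["n_hashtags"] = tweets["hashtags"].str.len()

print(tweets["hashtags"].tolist())    # [['#python', '#pandas'], [], ['#regex']]
print(tweets["n_hashtags"].tolist())  # [2, 0, 1]
```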

In [None]:
3. Validate email format

In [None]:
df['is_valid_email'] = df['email'].str.match(r'^\w+@\w+\.\w+$')
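
In [None]:
The pattern above rejects addresses that contain dots or hyphens before the @, and missing values come back as NaN rather than False. The sketch below compares it with a somewhat looser pattern. The `looser` pattern is an illustrative assumption, not a complete email validator, and the sample addresses are made up.

```python
import pandas as pd

# Hypothetical addresses, including a missing value.
emails = pd.Series(["a@b.com", "invalid", "first.last@mail.example.org", None])

strict = r"^\w+@\w+\.\w+$"                 # pattern from the text above
looser = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # illustrative alternative

# na=False marks missing values as invalid instead of propagating NaN.
is_valid_strict = emails.str.match(strict, na=False)

# str.fullmatch requires the whole string to match, so no ^ or $ is needed.
is_valid_looser = emails.str.fullmatch(looser, na=False)

print(is_valid_strict.tolist())  # [True, False, False, False]
print(is_valid_looser.tolist())  # [True, False, True, False]
```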

In [None]:
Tips

In [None]:
Raw strings: Always use raw strings (r'pattern') for regex to avoid escaping issues

Test patterns: Use re.findall() to test patterns before applying to DataFrames

Vectorized operations: Pandas string methods apply a pattern to a whole column in one call and handle missing values, so prefer them over calling Python's re functions row by row

Common patterns:

\d - digits

\w - word characters (letters, digits, and underscore)

\s - whitespace

^ - start of string

$ - end of string

Flags (for case-insensitive, multiline, etc.):

In [None]:
re.findall(r'pattern', text, flags=re.IGNORECASE)
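
In [None]:
The tips on testing patterns and on flags can be combined: try the pattern with re.findall() on a few plain strings, then reuse the same pattern and flag in pandas. This is a sketch on made-up log messages; `pattern`, `samples`, and `logs` are illustrative names.

```python
import re
import pandas as pd

pattern = r"error \d+"
samples = ["Error 404 then error 500", "all good"]  # hypothetical log lines

# Step 1: check the pattern on a few plain strings first.
for sample in samples:
    print(re.findall(pattern, sample, flags=re.IGNORECASE))

# Step 2: apply the same pattern and flag to a whole Series.
logs = pd.Series(samples)
has_error = logs.str.contains(pattern, flags=re.IGNORECASE)
found = logs.str.findall(pattern, flags=re.IGNORECASE)

print(has_error.tolist())  # [True, False]
print(found.tolist())      # [['Error 404', 'error 500'], []]
```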