# String/Regex Cheatsheet

This notebook provides a comprehensive guide to string and regex operations in Python and Pandas, with practical examples and comparisons.

## Comparison Table: Python vs Pandas String Operations

| Operation | Python Syntax | Pandas Syntax | Key Differences |
|-----------|---------------|---------------|-----------------|
| **Case Conversion** | `text.lower()` | `df['col'].str.lower()` | Pandas works on entire Series; Python on single string |
| **String Length** | `len(text)` | `df['col'].str.len()` | Pandas returns Series; Python returns integer |
| **Substring Check** | `"abc" in text` | `df['col'].str.contains("abc")` | Pandas returns boolean Series; Python returns boolean. `str.contains` treats the pattern as a regex by default; pass `regex=False` for a literal match |
| **Start/End Check** | `text.startswith("ab")` | `df['col'].str.startswith("ab")` | Pandas vectorized across all rows |
| **Replace Text** | `text.replace("old", "new")` | `df['col'].str.replace("old", "new")` | Pandas accepts a `regex` parameter; it defaults to `False` (literal match) since pandas 2.0 |
| **Split String** | `text.split(",")` | `df['col'].str.split(",")` | Pandas can use `expand=True` for separate columns |
| **Extract Substring** | `text[0:5]` | `df['col'].str[0:5]` or `df['col'].str.slice(0,5)` | Pandas has both indexing and slice method |
| **Regex Search** | `re.search(pattern, text)` | `df['col'].str.extract(pattern)` | `str.extract` needs at least one capture group and returns one column per group (first match only) |
| **Regex Find All** | `re.findall(pattern, text)` | `df['col'].str.findall(pattern)` | Pandas returns list in each cell |
| **Regex Replace** | `re.sub(pattern, repl, text)` | `df['col'].str.replace(pattern, repl, regex=True)` | Pandas requires `regex=True` flag |
| **String Concatenation** | `str1 + str2` | `df['col1'].str.cat(df['col2'])` | Pandas has special concatenation method |
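| **Null Handling** | Manual checking required | Missing values propagate (`NaN` in, `NaN` out) | Pandas `.str` methods do not raise on missing values; boolean methods such as `str.contains` accept `na=` to fill them |
| **Performance** | Single string operation | One call over the whole column | On object-dtype columns, `.str` methods still loop over the elements internally, so speed is often comparable to a list comprehension; measure before assuming a speedup |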
| **Return Type** | String or Match object | Series or DataFrame | Pandas preserves DataFrame structure |
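
The substring-check row hides a semantic difference that is easy to miss, so here is a minimal sketch of it. The sample strings and the variable names (`text`, `s`, `regex_hits`, `literal_hits`) are made up for illustration.

```python
import pandas as pd

text = "price: 3.50"
s = pd.Series(["price: 3.50", "price: 3x50"])

# Python's `in` is always a literal substring test
print("3.5" in text)

# str.contains defaults to regex=True, so "." matches any character
regex_hits = s.str.contains("3.5")

# regex=False gives the same literal semantics as `in`
literal_hits = s.str.contains("3.5", regex=False)

print(regex_hits.tolist())    # both rows match, because "3x5" satisfies the regex
print(literal_hits.tolist())  # only the first row contains the literal "3.5"
```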

## When to Use Each Approach

### Use Python String Methods When:
- Working with individual strings or small datasets
- Need maximum performance for single string operations  
- Writing functions that process one string at a time
- Working outside of pandas DataFrame context

### Use Pandas String Methods When:
- Working with DataFrame columns containing text
- Need to apply operations to many strings at once
- Want to maintain DataFrame structure in results
- Need missing values (NaN) to pass through operations without raising errors
- Performing data cleaning and preprocessing tasks
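
To make the missing-value point concrete, the following sketch shows how `NaN` behaves on an object-dtype Series. The data is invented, and `dtype=object` is set explicitly as an assumption, since newer pandas string dtypes may fill missing results differently.

```python
import numpy as np
import pandas as pd

# dtype=object is an assumption made here to pin down the behaviour shown
s = pd.Series(["alpha", np.nan, "beta"], dtype=object)

upper = s.str.upper()                         # the missing value stays missing
mask = s.str.contains("a")                    # missing in, missing out
mask_filled = s.str.contains("a", na=False)   # treat missing as "no match"

print(upper.isna().tolist())
print(mask.isna().tolist())
print(s[mask_filled].tolist())
```

Passing `na=False` matters when the result is used for boolean indexing, because a mask containing missing values cannot be used to filter rows directly.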

## Summary

This notebook covers:

1. **Python vs Pandas String Operations**: Understanding when to use each approach
2. **Python Built-in String Methods**: Basic operations for individual strings
3. **Pandas String Operations**: Vectorized operations using the `.str` accessor
4. **Regex Operations**: Pattern matching and extraction in both Python and Pandas
5. **Method Chaining**: Tidy-style programming for readable data transformations
6. **Performance Tips**: Optimizing string operations for large datasets

Key takeaways:

- Use Python string methods for individual strings
- Use Pandas `.str` accessor for DataFrame columns
- Pandas `.str` methods propagate missing values (NaN) instead of raising errors
- Method chaining creates readable, maintainable code
- Consider performance optimizations for large datasets

## Python Built-in String Operations

Let's start with Python's built-in string methods for individual string manipulation.

In [1]:
# Basic String Methods - Case Operations
text = "Hello World"

print("Original:", text)
print("Lower:", text.lower())
print("Upper:", text.upper())
print("Title:", text.title())
print("Capitalize:", text.capitalize())
print("Swapcase:", text.swapcase())

Original: Hello World
Lower: hello world
Upper: HELLO WORLD
Title: Hello World
Capitalize: Hello world
Swapcase: hELLO wORLD


In [2]:
# String Checking Methods
text = "Hello World"

print("Starts with 'Hello':", text.startswith("Hello"))
print("Ends with 'World':", text.endswith("World"))
print("Is digit:", text.isdigit())
print("Is alpha:", text.isalpha())  # False because contains space
print("Is alphanumeric:", text.isalnum())  # False because contains space
print("Is space:", text.isspace())

Starts with 'Hello': True
Ends with 'World': True
Is digit: False
Is alpha: False
Is alphanumeric: False
Is space: False


In [3]:
# String Searching and Counting
text = "Hello World"

print("Find 'World':", text.find("World"))  # Returns index
print("Find last 'l':", text.rfind("l"))
print("Count 'l':", text.count("l"))

# Note: index() raises ValueError if the substring is not found; find() returns -1
try:
    print("Index of 'World':", text.index("World"))
except ValueError as e:
    print("Error:", e)

Find 'World': 6
Find last 'l': 9
Count 'l': 3
Index of 'World': 6
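
The cell above only exercises the case where the substring is found. A small sketch of the not-found case, using a search term ("Python") chosen purely for illustration:

```python
text = "Hello World"

# find() signals "not found" with -1, which is also a valid index (the last character)
pos = text.find("Python")
print(pos)

# index() raises instead, so a missing substring cannot be mistaken for a position
try:
    text.index("Python")
    message = None
except ValueError as exc:
    message = str(exc)
    print("Error:", message)
```

Because `-1` is a legal index, code such as `text[text.find("x")]` fails silently when the substring is absent, which is one reason to prefer `index()` or an explicit `-1` check.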


In [4]:
# String Modification
text = "  Hello World  "

print("Original:", repr(text))
print("Replace 'World' with 'Python':", text.replace("World", "Python"))
print("Strip whitespace:", repr(text.strip()))
print("Left strip:", repr(text.lstrip()))
print("Right strip:", repr(text.rstrip()))
print("Strip specific chars:", repr(text.strip().strip("Hd")))

Original: '  Hello World  '
Replace 'World' with 'Python':   Hello Python  
Strip whitespace: 'Hello World'
Left strip: 'Hello World  '
Right strip: '  Hello World'
Strip specific chars: 'ello Worl'
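
The argument to `strip()` is a set of characters, not a substring, which is a common source of surprises. The sketch below contrasts it with `removeprefix()` and `removesuffix()`, which need Python 3.9 or later; the file name is a made-up example.

```python
filename = "test_settings.txt"

# lstrip("test_") removes any leading run of the characters t, e, s and _
stripped = filename.lstrip("test_")
print(stripped)

# removeprefix/removesuffix (Python 3.9+) remove one exact substring
no_prefix = filename.removeprefix("test_")
no_suffix = filename.removesuffix(".txt")
print(no_prefix)
print(no_suffix)
```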


In [5]:
# String Splitting and Joining
text = "Hello World"

print("Split by space:", text.split())
print("Split by 'l':", text.split("l"))
print("Join with dash:", "-".join(["Hello", "World"]))

# String formatting
name, age = "John", 25
print("F-string:", f"Name: {name}, Age: {age}")
print("Format method:", "Name: {}, Age: {}".format(name, age))

Split by space: ['Hello', 'World']
Split by 'l': ['He', '', 'o Wor', 'd']
Join with dash: Hello-World
F-string: Name: John, Age: 25
Format method: Name: John, Age: 25


In [6]:
# String Slicing and Indexing
text = "Hello World"

print("First character:", text[0])
print("Last character:", text[-1])
print("First 5 chars:", text[0:5])
print("From index 6:", text[6:])
print("Reverse:", text[::-1])
print("Every 2nd char:", text[::2])

First character: H
Last character: d
First 5 chars: Hello
From index 6: World
Reverse: dlroW olleH
Every 2nd char: HloWrd


## Pandas String Operations (`.str` accessor)

Now let's explore Pandas string operations, which work on entire Series (columns).

In [7]:
import pandas as pd

# Create sample DataFrame
df = pd.DataFrame({
    'name': ['John Doe', 'jane smith', 'Bob Johnson'],
    'email': ['john@email.com', 'jane@gmail.com', 'bob@yahoo.com']
})

print("Sample DataFrame:")
print(df)

Sample DataFrame:
          name           email
0     John Doe  john@email.com
1   jane smith  jane@gmail.com
2  Bob Johnson   bob@yahoo.com


In [8]:
# Case Operations with Pandas
print("Original names:")
print(df['name'])
print("\nLowercase:")
print(df['name'].str.lower())
print("\nUppercase:")
print(df['name'].str.upper())
print("\nTitle case:")
print(df['name'].str.title())
print("\nCapitalize:")
print(df['name'].str.capitalize())

Original names:
0       John Doe
1     jane smith
2    Bob Johnson
Name: name, dtype: object

Lowercase:
0       john doe
1     jane smith
2    bob johnson
Name: name, dtype: object

Uppercase:
0       JOHN DOE
1     JANE SMITH
2    BOB JOHNSON
Name: name, dtype: object

Title case:
0       John Doe
1     Jane Smith
2    Bob Johnson
Name: name, dtype: object

Capitalize:
0       John doe
1     Jane smith
2    Bob johnson
Name: name, dtype: object


In [9]:
# String Checking with Pandas
print("Names starting with 'J':")
print(df['name'].str.startswith('J'))
print("\nEmails ending with '.com':")
print(df['email'].str.endswith('.com'))
print("\nCheck if all digits:")
print(df['name'].str.isdigit())

Names starting with 'J':
0     True
1    False
2    False
Name: name, dtype: bool

Emails ending with '.com':
0    True
1    True
2    True
Name: email, dtype: bool

Check if all digits:
0    False
1    False
2    False
Name: name, dtype: bool


In [10]:
# String Length and Counting
print("Length of names:")
print(df['name'].str.len())
print("\nCount of 'o' in names:")
print(df['name'].str.count('o'))
print("\nCount of 'o' in emails:")
print(df['email'].str.count('o'))

Length of names:
0     8
1    10
2    11
Name: name, dtype: int64

Count of 'o' in names:
0    2
1    0
2    3
Name: name, dtype: int64

Count of 'o' in emails:
0    2
1    1
2    4
Name: email, dtype: int64


In [11]:
# String Modification with Pandas
print("Replace spaces with underscores:")
print(df['name'].str.replace(' ', '_', regex=False))  # literal replacement, independent of the pandas version default
print("\nExtract first 4 characters of email:")
print(df['email'].str.slice(0, 4))
print("\nReplace first 4 characters with 'USER':")
print(df['email'].str.slice_replace(0, 4, 'USER'))

Replace spaces with underscores:
0       John_Doe
1     jane_smith
2    Bob_Johnson
Name: name, dtype: object

Extract first 4 characters of email:
0    john
1    jane
2    bob@
Name: email, dtype: object

Replace first 4 characters with 'USER':
0    USER@email.com
1    USER@gmail.com
2     USERyahoo.com
Name: email, dtype: object
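
Note that the fixed-width `slice_replace(0, 4, ...)` swallows the `@` when the username is shorter than four characters, as in the `bob` row. One possible alternative, sketched here with a regex anchored at the start of the string, replaces everything before the `@` regardless of length. The Series name `emails` is an illustrative stand-in for `df['email']`.

```python
import pandas as pd

emails = pd.Series(['john@email.com', 'jane@gmail.com', 'bob@yahoo.com'])

# ^[^@]+ matches the run of non-@ characters at the start of each string
masked = emails.str.replace(r'^[^@]+', 'USER', regex=True)
print(masked.tolist())
```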


In [12]:
# String Splitting with Pandas
print("Split names (returns lists):")
print(df['name'].str.split())
print("\nSplit names into separate columns:")
print(df['name'].str.split(' ', expand=True))
print("\nSplit emails by '@':")
print(df['email'].str.split('@', expand=True))

Split names (returns lists):
0       [John, Doe]
1     [jane, smith]
2    [Bob, Johnson]
Name: name, dtype: object

Split names into separate columns:
      0        1
0  John      Doe
1  jane    smith
2   Bob  Johnson

Split emails by '@':
      0          1
0  john  email.com
1  jane  gmail.com
2   bob  yahoo.com


In [13]:
# String Concatenation with Pandas
print("Concatenate name and email:")
print(df['name'].str.cat(df['email'], sep=' - '))
print("\nConcatenate with custom separator:")
print(df['name'].str.cat(df['email'], sep=' | Email: '))

Concatenate name and email:
0      John Doe - john@email.com
1    jane smith - jane@gmail.com
2    Bob Johnson - bob@yahoo.com
Name: name, dtype: object

Concatenate with custom separator:
0      John Doe | Email: john@email.com
1    jane smith | Email: jane@gmail.com
2    Bob Johnson | Email: bob@yahoo.com
Name: name, dtype: object


## Advanced Pandas String Operations

Let's explore more advanced features like extraction, filtering, and indexing.

In [14]:
# Extract parts of strings using regex
print("Extract username and domain from email:")
extracted = df['email'].str.extract(r'([^@]+)@([^.]+)')
print(extracted)

print("\nExtract first and last name:")
name_parts = df['name'].str.extract(r'(\w+)\s+(\w+)')
print(name_parts)

Extract username and domain from email:
      0      1
0  john  email
1  jane  gmail
2   bob  yahoo

Extract first and last name:
      0        1
0  John      Doe
1  jane    smith
2   Bob  Johnson
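
The extracted columns above are labelled `0` and `1`. Named capture groups give them readable names instead. This is a minimal sketch; the group names `username` and `domain` are chosen here for illustration.

```python
import pandas as pd

emails = pd.Series(['john@email.com', 'jane@gmail.com', 'bob@yahoo.com'])

# (?P<name>...) names a group; str.extract uses the names as column labels
parts = emails.str.extract(r'(?P<username>[^@]+)@(?P<domain>[^.]+)')
print(parts)
```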


In [15]:
# String contains for boolean indexing
print("Filter rows with 'John' in name:")
john_filter = df['name'].str.contains('John')
print(john_filter)
print("\nRows containing 'John':")
print(df[john_filter])

print("\nFilter emails containing 'gmail' or 'yahoo':")
email_filter = df['email'].str.contains('gmail|yahoo')
print(df[email_filter])

Filter rows with 'John' in name:
0     True
1    False
2     True
Name: name, dtype: bool

Rows containing 'John':
          name           email
0     John Doe  john@email.com
2  Bob Johnson   bob@yahoo.com

Filter emails containing 'gmail' or 'yahoo':
          name           email
1   jane smith  jane@gmail.com
2  Bob Johnson   bob@yahoo.com
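
Two options are often useful alongside the filters above: `case=False` for case-insensitive matching, and `re.escape` when an alternation pattern is built from literal values. The sketch below uses invented data, and `names`, `emails`, and `domains` are illustrative names.

```python
import re
import pandas as pd

names = pd.Series(['John Doe', 'jane smith', 'Bob Johnson', None], dtype=object)

# case=False ignores case; na=False turns the missing row into "no match"
has_john = names.str.contains('john', case=False, na=False)
print(has_john.tolist())

emails = pd.Series(['john@email.com', 'jane@gmail.com',
                    'bob@yahoo.com', 'x@gmailxcom.org'])
domains = ['gmail.com', 'yahoo.com']

# re.escape keeps the dots literal, so "gmailxcom" does not match "gmail.com"
pattern = '|'.join(re.escape(d) for d in domains)
in_domains = emails.str.contains(pattern)
print(in_domains.tolist())
```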


In [16]:
# String indexing and slicing with Pandas
print("First character of names:")
print(df['name'].str[0])
print("\nLast character of names:")
print(df['name'].str[-1])
print("\nFirst 4 characters of emails:")
print(df['email'].str[0:4])

First character of names:
0    J
1    j
2    B
Name: name, dtype: object

Last character of names:
0    e
1    h
2    n
Name: name, dtype: object

First 4 characters of emails:
0    john
1    jane
2    bob@
Name: email, dtype: object


In [17]:
# Padding and alignment
print("Left pad names with '*':")
print(df['name'].str.pad(width=15, side='left', fillchar='*'))
print("\nCenter align names:")
print(df['name'].str.center(15, fillchar='-'))
print("\nRight justify names:")
print(df['name'].str.rjust(15, fillchar='.'))

Left pad names with '*':
0    *******John Doe
1    *****jane smith
2    ****Bob Johnson
Name: name, dtype: object

Center align names:
0    ----John Doe---
1    ---jane smith--
2    --Bob Johnson--
Name: name, dtype: object

Right justify names:
0    .......John Doe
1    .....jane smith
2    ....Bob Johnson
Name: name, dtype: object


## Regex Operations

Let's explore regular expression operations in both Python and Pandas.

In [18]:
import re

# Python re module examples
text = "Contact: john@email.com or call 123-456-7890"

print("Original text:", text)
print("\nFind phone number:")
phone_match = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(phone_match.group() if phone_match else "Not found")

print("\nFind all email addresses:")
emails = re.findall(r'\w+@\w+\.\w+', text)
print(emails)

print("\nReplace phone number:")
masked_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)
print(masked_text)

Original text: Contact: john@email.com or call 123-456-7890

Find phone number:
123-456-7890

Find all email addresses:
['john@email.com']

Replace phone number:
Contact: john@email.com or call XXX-XXX-XXXX


In [19]:
# Groups and capturing with regex
text = "Contact: john@email.com or call 123-456-7890"

match = re.search(r'(\w+)@(\w+)\.(\w+)', text)
if match:
    print("Full match:", match.group(0))
    print("Username:", match.group(1))
    print("Domain:", match.group(2))
    print("Extension:", match.group(3))
    print("All groups:", match.groups())

Full match: john@email.com
Username: john
Domain: email
Extension: com
All groups: ('john', 'email', 'com')


In [20]:
# Pandas regex operations
df_text = pd.DataFrame({
    'text': ['Contact john@email.com', 'Call 123-456-7890', 'Visit www.example.com']
})

print("Sample text data:")
print(df_text)

print("\nExtract email addresses:")
emails = df_text['text'].str.extract(r'(\w+@\w+\.\w+)')
print(emails)

print("\nFind all numbers:")
numbers = df_text['text'].str.extractall(r'(\d+)')
print(numbers)

Sample text data:
                     text
0  Contact john@email.com
1       Call 123-456-7890
2   Visit www.example.com

Extract email addresses:
                0
0  john@email.com
1             NaN
2             NaN

Find all numbers:
            0
  match      
1 0       123
  1       456
  2      7890


In [21]:
# Boolean operations with regex in Pandas
print("Contains phone pattern:")
print(df_text['text'].str.contains(r'\d{3}-\d{3}-\d{4}'))

print("\nMatches 'Contact' from start:")
print(df_text['text'].str.match(r'Contact.*'))

print("\nFind all words:")
words = df_text['text'].str.findall(r'\w+')
print(words)

Contains phone pattern:
0    False
1     True
2    False
Name: text, dtype: bool

Matches 'Contact' from start:
0     True
1    False
2    False
Name: text, dtype: bool

Find all words:
0    [Contact, john, email, com]
1         [Call, 123, 456, 7890]
2     [Visit, www, example, com]
Name: text, dtype: object
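
The three boolean methods differ only in where the pattern is allowed to match. A short sketch, using sample strings invented for this comparison:

```python
import pandas as pd

s = pd.Series(['123-456-7890', 'Call 123-456-7890', '123-456-7890 ext. 5'])
pattern = r'\d{3}-\d{3}-\d{4}'

anywhere = s.str.contains(pattern)   # like re.search: match anywhere
at_start = s.str.match(pattern)      # like re.match: match at the start
whole = s.str.fullmatch(pattern)     # like re.fullmatch: match the entire string

print(pd.DataFrame({'contains': anywhere, 'match': at_start, 'fullmatch': whole}))
```

For validating that a cell contains a phone number and nothing else, `fullmatch` is usually the closest fit.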


In [22]:
# Replace with regex in Pandas
print("Replace phone numbers with 'PHONE':")
replaced = df_text['text'].str.replace(r'\d{3}-\d{3}-\d{4}', 'PHONE', regex=True)
print(replaced)

print("\nReplace email format using groups:")
email_replaced = df_text['text'].str.replace(r'(\w+)@(\w+)', r'\1_AT_\2', regex=True)
print(email_replaced)

Replace phone numbers with 'PHONE':
0    Contact john@email.com
1                Call PHONE
2     Visit www.example.com
Name: text, dtype: object

Replace email format using groups:
0    Contact john_AT_email.com
1            Call 123-456-7890
2        Visit www.example.com
Name: text, dtype: object
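
When the replacement depends on the matched text, `str.replace` also accepts a callable that receives the match object, as `re.sub` does. The following sketch masks all but the last four digits of a phone number; `mask_phone` is a hypothetical helper written for this example.

```python
import pandas as pd

s = pd.Series(['Call 123-456-7890', 'Visit www.example.com'])

def mask_phone(match):
    """Hypothetical helper: keep only the last four digits of the match."""
    return 'XXX-XXX-' + match.group(1)

# The callable is invoked once per match; rows without a match are unchanged
masked = s.str.replace(r'\d{3}-\d{3}-(\d{4})', mask_phone, regex=True)
print(masked.tolist())
```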


## Common Regex Patterns

Here are some useful regex patterns for common data cleaning tasks.

In [23]:
# Common regex patterns
patterns = {
    'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
    'phone_us': r'\d{3}-\d{3}-\d{4}',
    'url': r'https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?',
    'date_iso': r'\d{4}-\d{2}-\d{2}',  # YYYY-MM-DD
    'date_us': r'\d{1,2}/\d{1,2}/\d{4}',  # MM/DD/YYYY
    'integer': r'^-?\d+$',
    'float': r'^-?\d+\.?\d*$'  # also matches integers such as -123 and trailing-dot forms such as 5.
}

# Test some patterns
test_strings = [
    'john@email.com',
    '123-456-7890',
    'https://www.example.com',
    '2023-12-25',
    '12/25/2023',
    '-123',
    '3.14159'
]

for pattern_name, pattern in patterns.items():
    print(f"\n{pattern_name.upper()} pattern: {pattern}")
    for test_str in test_strings:
        if re.match(pattern, test_str):
            print(f"  ✓ Matches: {test_str}")


EMAIL pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  ✓ Matches: john@email.com

PHONE_US pattern: \d{3}-\d{3}-\d{4}
  ✓ Matches: 123-456-7890

URL pattern: https?://(?:[-\w.])+(?:\:[0-9]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:\#(?:[\w.])*)?)?
  ✓ Matches: https://www.example.com

DATE_ISO pattern: \d{4}-\d{2}-\d{2}
  ✓ Matches: 2023-12-25

DATE_US pattern: \d{1,2}/\d{1,2}/\d{4}
  ✓ Matches: 12/25/2023

INTEGER pattern: ^-?\d+$
  ✓ Matches: -123

FLOAT pattern: ^-?\d+\.?\d*$
  ✓ Matches: -123
  ✓ Matches: 3.14159
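
Several of the patterns above (`phone_us`, `date_iso`, `date_us`, `url`) are not anchored, and `re.match` only checks the start of the string. The sketch below shows how a string with trailing junk can still pass, and how `re.fullmatch` avoids that. The candidate string is invented for illustration.

```python
import re

phone_us = r'\d{3}-\d{3}-\d{4}'
candidate = '123-456-78901234'  # too many trailing digits

# re.match succeeds because the first 12 characters satisfy the pattern
prefix_ok = re.match(phone_us, candidate) is not None

# re.fullmatch requires the pattern to consume the whole string
whole_ok = re.fullmatch(phone_us, candidate) is not None

print(prefix_ok, whole_ok)
```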


## Method Chaining (Pandas Tidy Style)

Following the R tidyverse style, we can chain pandas operations for clean, readable code.

In [24]:
# Example of method chaining for string operations
df_demo = pd.DataFrame({
    'name': ['  John Doe  ', 'JANE SMITH', 'bob johnson'],
    'email': ['john@email.com', 'jane@gmail.com', 'bob@yahoo.com']
})

print("Original DataFrame:")
print(df_demo)

# Method chaining for data cleaning
result = (df_demo
    .assign(
        name_clean = lambda x: x['name'].str.strip().str.title(),
        # expand=False returns a Series for a single capture group
        domain = lambda x: x['email'].str.extract(r'@(\w+)\.', expand=False),
        has_gmail = lambda x: x['email'].str.contains('gmail')
    )
    .query('has_gmail')  # a boolean column can be used directly as the filter
    .reset_index(drop=True)
)

print("\nCleaned and filtered result:")
print(result)

Original DataFrame:
           name           email
0    John Doe    john@email.com
1    JANE SMITH  jane@gmail.com
2   bob johnson   bob@yahoo.com

Cleaned and filtered result:
         name           email  name_clean domain  has_gmail
0  JANE SMITH  jane@gmail.com  Jane Smith  gmail       True


In [25]:
# More complex chaining example
complex_result = (df_demo
    .assign(
        # Clean names
        name_clean = lambda x: (x['name']
                               .str.strip()
                               .str.lower()
                               .str.title()),
        
        # Extract email parts
        username = lambda x: x['email'].str.extract(r'([^@]+)@')[0],
        domain = lambda x: x['email'].str.extract(r'@([^.]+)\.')[0],
        
        # Create flags
        is_gmail = lambda x: x['email'].str.contains('gmail'),
        name_length = lambda x: x['name_clean'].str.len()
    )
    .filter(['name_clean', 'username', 'domain', 'is_gmail', 'name_length'])
)

print("Complex transformation result:")
print(complex_result)

Complex transformation result:
    name_clean username domain  is_gmail  name_length
0     John Doe     john  email     False            8
1   Jane Smith     jane  gmail      True           10
2  Bob Johnson      bob  yahoo     False           11
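
When the same cleaning steps recur across notebooks, one option is to move them into a function and keep the chain readable with `.pipe()`. This is only a sketch; `add_email_parts` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def add_email_parts(frame, column='email'):
    """Hypothetical helper: add username and domain columns parsed from `column`."""
    parts = frame[column].str.extract(r'(?P<username>[^@]+)@(?P<domain>[^.]+)')
    return frame.assign(username=parts['username'], domain=parts['domain'])

df_demo = pd.DataFrame({
    'name': ['  John Doe  ', 'JANE SMITH', 'bob johnson'],
    'email': ['john@email.com', 'jane@gmail.com', 'bob@yahoo.com']
})

result = (df_demo
    .assign(name_clean=lambda x: x['name'].str.strip().str.title())
    .pipe(add_email_parts)                      # reusable step in the chain
    .loc[:, ['name_clean', 'username', 'domain']]
)
print(result)
```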


## Performance Tips

Here are some tips for optimizing string operations with large datasets.

In [26]:
# Performance demonstration
import numpy as np

# Create larger sample data
np.random.seed(42)
large_df = pd.DataFrame({
    'category': np.random.choice(['Type A', 'Type B', 'Type C'], 10000),
    'text': ['Sample text ' + str(i) for i in range(10000)]
})

print("Large DataFrame shape:", large_df.shape)

# 1. Use categorical data for repeated strings
print("\nMemory usage before categorical conversion:")
print(f"Category column: {large_df['category'].memory_usage(deep=True)} bytes")

large_df['category'] = large_df['category'].astype('category')
print("Memory usage after categorical conversion:")
print(f"Category column: {large_df['category'].memory_usage(deep=True)} bytes")

Large DataFrame shape: (10000, 2)

Memory usage before categorical conversion:
Category column: 550132 bytes
Memory usage after categorical conversion:
Category column: 10405 bytes


In [27]:
# 2. Compile regex patterns for repeated use
pattern = re.compile(r'\d+')

# Demonstrate vectorized operations
# (%timeit is an IPython/Jupyter magic; it is not valid in a plain Python script)
print("Using vectorized string operations:")
%timeit large_df['text'].str.contains(pattern)

# 3. Chain operations efficiently
print("\nEfficient chaining:")
%timeit (large_df['text'].str.lower().str.strip().str.replace(r'\s+', ' ', regex=True))

# 4. Filter before operations when possible
print("\nFiltering before operations:")
filtered_df = large_df[large_df['text'].notna()]
print(f"Filtered DataFrame shape: {filtered_df.shape}")

Using vectorized string operations:
2.97 ms ± 60.4 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Efficient chaining:
6.2 ms ± 225 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Filtering before operations:
Filtered DataFrame shape: (10000, 2)
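
Outside a notebook, the standard-library `timeit` module can stand in for `%timeit`. The sketch below compares a `.str` call with an equivalent list comprehension on the same data. It makes no claim about which is faster, since that depends on the pandas version and the column's dtype. The function names are illustrative.

```python
import timeit
import pandas as pd

texts = pd.Series(['Sample text ' + str(i) for i in range(10000)])

def with_str_accessor():
    return texts.str.upper()

def with_comprehension():
    return pd.Series([t.upper() for t in texts], index=texts.index)

# Check that both approaches produce the same values before timing them
same = with_str_accessor().tolist() == with_comprehension().tolist()
print("Same result:", same)

runs = 20
for label, func in [('.str accessor', with_str_accessor),
                    ('list comprehension', with_comprehension)]:
    seconds = timeit.timeit(func, number=runs)
    print(f"{label}: {seconds / runs * 1000:.2f} ms per call")
```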
