# ✍️ Part 8: Pure Pandas String Processing and Feature Engineering

**Goal:** To transform raw, messy text columns (`Name`, `Ticket`, `Cabin`) into high-signal numerical and categorical features using **100% native Pandas string methods (`.str`) and regular expressions**.

---
### Key Learning Objectives
1.  Master string manipulation: `.str.split()`, `.str.len()`, `.str.contains()`, `.str.replace()`.
2.  Use the **`.str.extract()`** method with regex capture groups to pull specific features (e.g., `Title`).
3.  Create complex social features (`Family_Name`, `Title_Grouped`, `Shared_Ticket_Count`).
4.  Implement safeguards for arithmetic operations involving categorical columns.

In [1]:
import pandas as pd
import re  # Built-in Python regular expressions

# Set pandas display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

print("=== PURE PANDAS STRING PROCESSING & FEATURE ENGINEERING ===")

# Load the Titanic dataset (continuing from previous notebooks)
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic_df = pd.read_csv(titanic_url)

# Apply all prior enhancements (Imputation and Type Optimization) using pandas-only methods
print("🔧 Applying prior enhancements using pandas-only...")

# 1. Missing value fixes (Group-based Age Imputation, Mode Embarked Imputation)
age_by_group = titanic_df.groupby(['Pclass', 'Sex'])['Age'].transform('median')
titanic_df['Age'] = titanic_df['Age'].fillna(age_by_group)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# 2. Data type optimizations (Categorical, Boolean)
titanic_df['Pclass'] = titanic_df['Pclass'].astype('category')
titanic_df['Sex'] = titanic_df['Sex'].astype('category')
titanic_df['Embarked'] = titanic_df['Embarked'].astype('category')
titanic_df['Survived'] = titanic_df['Survived'].astype('bool')

print("✅ Dataset enhanced with prior improvements using pandas-only!")
print(f"Dataset shape: {titanic_df.shape}")

=== PURE PANDAS STRING PROCESSING & FEATURE ENGINEERING ===
🔧 Applying prior enhancements using pandas-only...
✅ Dataset enhanced with prior improvements using pandas-only!
Dataset shape: (891, 12)


## 2. String Accessor (`.str`) Basics and Pattern Detection

The `.str` accessor allows us to apply string methods element-wise to every string in a Series.

* **Information:** `.str.len()`, `.str.split()`.
* **Detection:** `.str.contains()`, `.str.startswith()`.
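
As a minimal sketch of the element-wise behaviour described above, the snippet below runs a few `.str` methods on a small toy Series. The sample values are illustrative and not read from the Titanic file. It also shows how a missing value passes through `.str.len()` and how `na=False` turns it into `False` for the detection methods.

```python
import pandas as pd

# Toy Series (hypothetical values) with one missing entry
names = pd.Series(["Braund, Mr. Owen", "Heikkinen, Miss. Laina", None])

# Information: element-wise length; the missing entry stays missing
lengths = names.str.len()

# Detection: na=False makes the missing entry count as "no match"
has_miss = names.str.contains("Miss", na=False, regex=False)
starts_b = names.str.startswith("B", na=False)

print(lengths.tolist())
print(has_miss.tolist())   # [False, True, False]
print(starts_b.tolist())   # [True, False, False]
```

Without `na=False`, the detection methods would return a missing value for the third row, which would then need handling before the result can be used as a boolean mask.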

In [3]:
print("\n📊 2. Basic String Information and Pattern Detection")

# 1. Basic Information and Length Analysis
print("\n1. Basic String Information:")
titanic_df['Name_Length'] = titanic_df['Name'].str.len()
print("Name length statistics:")
print(titanic_df['Name_Length'].describe())

# Find shortest and longest names using pandas methods
shortest_idx = titanic_df['Name_Length'].idxmin()
longest_idx = titanic_df['Name_Length'].idxmax()
print(f"Shortest name: {titanic_df.loc[shortest_idx, 'Name']}")
print(f"Longest name: {titanic_df.loc[longest_idx, 'Name']}")


# 2. Pattern Detection: Parentheses and Quotes
# Check for parentheses (maiden names, nicknames) using pandas regex
titanic_df['Has_Parentheses'] = titanic_df['Name'].str.contains(r'\(.*\)', na=False, regex=True)
parentheses_count = titanic_df['Has_Parentheses'].sum()
print(f"\nNames with parentheses: {parentheses_count}")

# Check for quotes (nicknames) using pandas string methods
titanic_df['Has_Quotes'] = titanic_df['Name'].str.contains('"', na=False, regex=False)
quotes_count = titanic_df['Has_Quotes'].sum()
print(f"Names with quotes: {quotes_count}")


📊 2. Basic String Information and Pattern Detection

1. Basic String Information:
Name length statistics:
count    891.000000
mean      26.965208
std        9.281607
min       12.000000
25%       20.000000
50%       25.000000
75%       30.000000
max       82.000000
Name: Name_Length, dtype: float64
Shortest name: Lam, Mr. Ali
Longest name: Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)

Names with parentheses: 143
Names with quotes: 53


## 3. String Splitting (`.str.split()`) and Regex Extraction (`.str.extract()`)

This is where we turn a single text column into multiple structural features.

* **Splitting:** `.str.split(', ', n=1, expand=True)` splits each name at the first `', '` only, giving two columns: Last Name and the rest.
* **Extraction (Power Technique):** `.str.extract(r'pattern')` uses regular expressions to precisely capture specific text, like the title, by using **capture groups** (parentheses).
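
The two techniques can be compared side by side on a couple of sample names. This is a small sketch, assuming names follow the `"Lastname, Title. Firstname"` layout; the column name `Rest` is only a placeholder for this example.

```python
import pandas as pd

sample = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# n=1 limits the split to the first ", " so the result has exactly two columns
parts = sample.str.split(", ", n=1, expand=True)
parts.columns = ["Last_Name", "Rest"]

# One capture group: with expand=False the result is a Series, not a DataFrame
titles = sample.str.extract(r", ([^.]*)\.", expand=False)

print(parts)
print(titles.tolist())  # ['Mr', 'Mrs']
```

Splitting is enough when the delimiter is reliable. Extraction is the better fit when only a fragment in the middle of the string is wanted.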

In [4]:
print("\n📊 3. Splitting and Regex Extraction for Name")

# 1. Splitting Names into Components
# Most Titanic names follow the pattern: "Lastname, Title. Firstname"
name_parts = titanic_df['Name'].str.split(', ', expand=True, n=1)
name_parts.columns = ['Last_Name', 'First_Title']

# Extract family name and title components
titanic_df['Family_Name'] = name_parts['Last_Name']
titanic_df['First_Title_Part'] = name_parts['First_Title']

print("✅ Family names extracted using pandas")
print(f"Unique family names: {titanic_df['Family_Name'].nunique()}")


# 2. Title Extraction with Regex
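# Pattern: comma, one space, then capture everything up to (not including) the first period
title_pattern = r', ([^.]*)\.'
# expand=False returns a Series for a single capture group instead of a one-column DataFrame
titanic_df['Title_Raw'] = titanic_df['Name'].str.extract(title_pattern, expand=False)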

print("\n✅ Raw Titles extracted using pandas regex!")
print("📋 Raw Title distribution (top 15):")
print(titanic_df['Title_Raw'].value_counts().head(15))


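# 3. Advanced Title Cleaning and Grouping
# Note: the regex above captures "the Countess" (with the article), so the
# 'Countess' key below never matches and that passenger falls into 'Other'.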
title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Dr': 'Dr', 'Rev': 'Rev', 'Col': 'Officer', 'Major': 'Officer',
    'Capt': 'Officer', 'Countess': 'Royalty', 'Lady': 'Royalty', 
    'Sir': 'Royalty', 'Don': 'Royalty', 'Dona': 'Royalty', 
    'Jonkheer': 'Royalty', 'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'
}

titanic_df['Title_Grouped'] = titanic_df['Title_Raw'].str.strip().map(title_mapping)
titanic_df['Title_Grouped'] = titanic_df['Title_Grouped'].fillna('Other')

print("\n📋 Grouped title distribution (Simplified):")
print(titanic_df['Title_Grouped'].value_counts())


📊 3. Splitting and Regex Extraction for Name
✅ Family names extracted using pandas
Unique family names: 667

✅ Raw Titles extracted using pandas regex!
📋 Raw Title distribution (top 15):
Title_Raw
Mr        517
Miss      182
Mrs       125
Master     40
Dr          7
Rev         6
Col         2
Mlle        2
Major       2
Ms          1
Mme         1
Don         1
Lady        1
Sir         1
Capt        1
Name: count, dtype: int64

📋 Grouped title distribution (Simplified):
Title_Grouped
Mr         517
Miss       185
Mrs        126
Master      40
Dr           7
Rev          6
Officer      5
Royalty      4
Other        1
Name: count, dtype: int64
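
The single `Other` row above is consistent with a raw title of `the Countess`, which does not match the `Countess` key in the mapping. One possible way to handle this, shown here as a sketch on toy data with a reduced mapping, is to strip a leading article before mapping.

```python
import pandas as pd

# Toy raw titles and a subset of the mapping (illustrative only)
raw_titles = pd.Series(["Mr", "the Countess", "Jonkheer"])
mapping = {"Mr": "Mr", "Countess": "Royalty", "Jonkheer": "Royalty"}

# Remove an optional leading "the " so "the Countess" becomes "Countess"
normalized = raw_titles.str.replace(r"^the\s+", "", regex=True).str.strip()
grouped = normalized.map(mapping).fillna("Other")

print(grouped.tolist())  # ['Mr', 'Royalty', 'Royalty']
```

Applying the same normalization to `Title_Raw` would move that passenger from `Other` to `Royalty`, so the counts shown above would shift by one.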


## 4. Feature Engineering from Cabin, Ticket, and Family Data

We apply the same string processing logic to derive high-signal features from other columns.

* **Cabin:** Extract the deck letter (`Cabin_Deck`) and a `Has_Cabin` flag.
* **Ticket:** Extract `Ticket_Prefix`, then create `Shared_Ticket_Count` and a `Has_Shared_Ticket` flag.
* **Family:** Combine `SibSp` and `Parch` to capture social context.
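
Before applying these steps to the full dataset, it can help to trace them on a three-row toy frame. The values below are illustrative. The sketch uses `groupby(...).transform("size")` for the sharing count, which is one alternative to building a lookup Series and mapping it back.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46"],
    "Ticket": ["PC 17599", "113803", "PC 17599"],
})

# First character of the cabin string; missing cabins become "Unknown"
toy["Cabin_Deck"] = toy["Cabin"].str[0].fillna("Unknown")

# Drop digits, then dots/slashes/whitespace; purely numeric tickets end up empty
prefix = toy["Ticket"].str.replace(r"\d+", "", regex=True)
prefix = prefix.str.replace(r"[./\s]", "", regex=True)
toy["Ticket_Prefix"] = prefix.where(prefix != "", "NONE")

# transform("size") broadcasts each ticket's row count back onto its rows
toy["Shared_Ticket_Count"] = toy.groupby("Ticket")["Ticket"].transform("size")

print(toy)
```

Here the two `PC 17599` rows get a prefix of `PC` and a count of 2, while the numeric ticket gets `NONE` and a count of 1.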

In [5]:
# 1. Cabin Analysis and Feature Engineering
print("\n📊 Cabin Analysis and Feature Engineering")

# Extract Deck (first character) from cabin using pandas string methods
titanic_df['Cabin_Deck'] = titanic_df['Cabin'].str[0]
titanic_df['Cabin_Deck'] = titanic_df['Cabin_Deck'].fillna('Unknown')
titanic_df['Has_Cabin'] = titanic_df['Cabin'].notna().astype(int)

print(f"✅ Cabin decks extracted. Distribution:\n{titanic_df['Cabin_Deck'].value_counts()}")


# 2. Ticket Pattern Analysis
print("\n📊 Ticket Pattern Analysis")

# Extract non-numeric prefix from tickets using pandas string methods
def extract_ticket_prefix_pandas(ticket_series):
    # Remove all digits and clean up using pandas string methods
    prefix_series = ticket_series.str.replace(r'\d+', '', regex=True).str.strip()
    prefix_series = prefix_series.str.replace(r'[./\s]', '', regex=True).str.strip()
    # Replace empty strings with 'NONE' using pandas where
    prefix_series = prefix_series.where(prefix_series != '', 'NONE')
    return prefix_series

titanic_df['Ticket_Prefix'] = extract_ticket_prefix_pandas(titanic_df['Ticket'])
print(f"✅ Ticket prefixes extracted. Top 5:\n{titanic_df['Ticket_Prefix'].value_counts().head(5)}")

# Ticket sharing (same ticket number across multiple passengers)
ticket_sharing = titanic_df.groupby('Ticket').size()
titanic_df['Shared_Ticket_Count'] = titanic_df['Ticket'].map(ticket_sharing)
titanic_df['Has_Shared_Ticket'] = (titanic_df['Shared_Ticket_Count'] > 1).astype('int8')


# 3. Family Feature Engineering
print("\n📊 Family Feature Engineering")
# Build on existing SibSp and Parch using pandas arithmetic
titanic_df['Family_Size'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1
titanic_df['Is_Alone'] = (titanic_df['Family_Size'] == 1).astype(int)

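# Analyze families traveling together using pandas groupby
# Caveats: a shared surname does not guarantee a shared family, and
# Family_Survival_Rate is computed from the target (including each passenger's
# own outcome), so it leaks the label if used as a model input as-is.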
family_name_stats = titanic_df.groupby('Family_Name').agg({
    'PassengerId': 'count',
    'Survived': 'mean',
    'Fare': 'mean',
}).round(3)

# Rename the aggregated columns and merge back
family_name_stats.columns = ['Family_Count', 'Family_Survival_Rate', 'Family_Avg_Fare']
titanic_df = titanic_df.merge(family_name_stats, left_on='Family_Name', right_index=True, how='left')

print("✅ Family size and social features created and merged.")


📊 Cabin Analysis and Feature Engineering
✅ Cabin decks extracted. Distribution:
Cabin_Deck
Unknown    687
C           59
B           47
D           33
E           32
A           15
F           13
G            4
T            1
Name: count, dtype: int64

📊 Ticket Pattern Analysis
✅ Ticket prefixes extracted. Top 5:
Ticket_Prefix
NONE     661
PC        60
CA        41
A         28
STONO     18
Name: count, dtype: int64

📊 Family Feature Engineering
✅ Family size and social features created and merged.


## 5. Safe Interaction Features (Respecting Categoricals)

When creating interaction features between numeric and categorical data, **never perform arithmetic directly on a categorical column**: pandas raises a `TypeError`, even when the categories look numeric. Instead, cast the categorical column to a temporary numeric Series for the calculation and leave the original column categorical.

* **`Age_Class_Ratio`:** Age divided by a numeric representation of `Pclass`.
* **`Fare_Per_Person`:** Fare divided by `Family_Size`.
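
The safeguard can be illustrated on toy data. This is a sketch with made-up ages and classes: it shows that dividing by a categorical Series fails, while dividing by a temporary integer copy works and leaves the categorical column untouched.

```python
import pandas as pd

pclass = pd.Series([1, 3, 2], dtype="category")
age = pd.Series([22.0, 30.0, 40.0])

# Arithmetic with a categorical operand is rejected by pandas
try:
    age / pclass
    raised = False
except TypeError:
    raised = True

# Temporary numeric copy used only for the calculation
pclass_num = pclass.astype("int16")
ratio = age / pclass_num

print(raised)          # True
print(ratio.tolist())  # [22.0, 10.0, 20.0]
print(pclass.dtype)    # category (unchanged)
```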

In [10]:
print("\n📊 Safe Interaction Features")

# 1. Create a temporary numeric Series for Pclass strictly for math
# Cast Pclass (which is 'category') to an integer Series
pclass_num = titanic_df['Pclass'].astype('int16')

# 2. Create core interactions
titanic_df['Fare_Per_Person'] = (titanic_df['Fare'] / titanic_df['Family_Size']).astype('float32')

# Guard against division by zero (Pclass should only be 1-3): mask any 0 as NaN
safe_pclass_divisor = pclass_num.where(pclass_num != 0)
titanic_df['Age_Class_Ratio'] = (titanic_df['Age'] / safe_pclass_divisor).astype('float32')

# 3. Final Quality Check (Correlations)
numeric_features = [
    'Family_Size', 'Has_Cabin', 'Shared_Ticket_Count', 
    'Fare_Per_Person', 'Age_Class_Ratio', 'Name_Length'
]
# Correlate features with survival, casting Survived bool to int for correlation
corr = titanic_df[numeric_features].corrwith(titanic_df['Survived'].astype('int8')).sort_values(key=abs, ascending=False)

print("\n📈 Feature correlations with survival (abs-sorted):")
print(corr.round(3))



📊 Safe Interaction Features

📈 Feature correlations with survival (abs-sorted):
Name_Length            0.332
Has_Cabin              0.317
Fare_Per_Person        0.222
Age_Class_Ratio        0.160
Shared_Ticket_Count    0.038
Family_Size            0.017
dtype: float64


In [9]:
def session8_summary():
    # Final check on savings percentage using only the most essential features
    original_memory_cols = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch']
    final_memory_cols = ['Title_Grouped', 'Ticket_Prefix', 'Cabin_Deck', 'Family_Size', 'Has_Cabin']

    original_memory = titanic_df[original_memory_cols].memory_usage(deep=True).sum() / 1024
    final_memory = titanic_df[final_memory_cols].memory_usage(deep=True).sum() / 1024
    
    # Guard against division by zero before calculating the savings percentage
    # (computed for reference; it is not included in the summary text below)
    savings_percentage = round((original_memory - final_memory) / original_memory * 100, 1) if original_memory > 0 else 0

    lines = []
    lines.append("=== Session 8 Summary: Pure pandas String Processing & Feature Engineering ===")
    lines.append("Goal: Create interpretable, pandas-only features from Names, Tickets, and Cabins,")
    lines.append("      plus safe interactions that respect categorical dtypes.\n")
    
    lines.append("✅ SKILLS MASTERED TODAY (pandas-Only):")
    lines.append("1. String Extraction: .str.extract(r'pattern') to get Title, Ticket components.")
    lines.append("2. String Splitting: .str.split() for initial component separation.")
    lines.append("3. Categorical Logic: .str.contains() and .str.len() for binary and numeric flags.")
    lines.append("4. Social Features: Combining SibSp and Parch for Family_Size; merging Family_Survival_Rate.")
    lines.append("5. Arithmetic Safeguard: Temporarily casting categorical Pclass to int for Age_Class_Ratio.")

    lines.append("\n🔥 POWER TECHNIQUE OF THE DAY:")
    lines.append("REGEX EXTRACTION (`.str.extract`)")
    lines.append("→ Enables precise feature capture (e.g., Title, Deck letter) from unstructured text.")

    lines.append("\n💡 KEY INSIGHTS:")
    lines.append("• High-Signal Features: Title_Grouped, Fare_Per_Person, and Has_Cabin are typically strong predictors.")
    lines.append("• Memory Efficiency: String processing often increases memory initially, but prior categorical/integer optimization is key.")

    lines.append("\n📝 REUSABLE FEATURES CREATED:")
    lines.append("• Title_Grouped (Categorical, Social Status)")
    lines.append("• Cabin_Deck (Categorical, Wealth Indicator)")
    lines.append("• Fare_Per_Person (Numeric, Normalized Fare)")
    lines.append("• Family_Size, Is_Alone (Numeric, Social Context)")

    lines.append("\n✓ Session 8 completed! Complex feature engineering mastered - ready for EDA and Visualization! 🐼")
    
    return "\n".join(lines)

if __name__ == "__main__":
    print(session8_summary())

=== Session 8 Summary: Pure pandas String Processing & Feature Engineering ===
Goal: Create interpretable, pandas-only features from Names, Tickets, and Cabins,
      plus safe interactions that respect categorical dtypes.

✅ SKILLS MASTERED TODAY (pandas-Only):
1. String Extraction: .str.extract(r'pattern') to get Title, Ticket components.
2. String Splitting: .str.split() for initial component separation.
3. Categorical Logic: .str.contains() and .str.len() for binary and numeric flags.
4. Social Features: Combining SibSp and Parch for Family_Size; merging Family_Survival_Rate.
5. Arithmetic Safeguard: Temporarily casting categorical Pclass to int for Age_Class_Ratio.

🔥 POWER TECHNIQUE OF THE DAY:
REGEX EXTRACTION (`.str.extract`)
→ Enables precise feature capture (e.g., Title, Deck letter) from unstructured text.

💡 KEY INSIGHTS:
• High-Signal Features: Title_Grouped, Fare_Per_Person, and Has_Cabin are typically strong predictors.
• Memory Efficiency: String processing often increa