# 🕵️‍♂️ 1. Smart Data Wrangling & Imputation

Welcome to the first step of our **Reimagined Titanic Analysis**. 

Instead of blindly filling missing `Age` values with the overall mean (which pulls every gap toward a single value and flattens the age distribution), we will use a **GenAI-inspired 'Semantic' approach**: inferring the likely age from social status (Title) and passenger class.
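To make the difference concrete, here is a minimal sketch on a made-up six-row frame (the values are illustrative and not taken from the Titanic files). It compares a global mean fill with a per-title median fill; the column names simply mirror the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Title": ["Master", "Master", "Master", "Mr", "Mr", "Mr"],
    "Age":   [4.0, 6.0, np.nan, 30.0, 40.0, np.nan],
})

# Global mean imputation: every gap receives the same value
mean_filled = toy["Age"].fillna(toy["Age"].mean())

# Group-wise median imputation: each gap receives its own title's median
group_filled = toy["Age"].fillna(
    toy.groupby("Title")["Age"].transform("median")
)

print(mean_filled.tolist())   # [4.0, 6.0, 20.0, 30.0, 40.0, 20.0]
print(group_filled.tolist())  # [4.0, 6.0, 5.0, 30.0, 40.0, 35.0]
```

With the global mean, the missing "Master" is assigned age 20, which contradicts what the title implies; the per-title median keeps him among the children.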

In [1]:
import pandas as pd
import numpy as np
import re

# Load Data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

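# Combine train and test so titles and medians are computed over all passengers;
# ignore_index avoids duplicate row labels coming from the two files
full_data = pd.concat([train, test], sort=False, ignore_index=True)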

## 🧬 The 'Title' Hypothesis
A person's title (Mr, Mrs, Master, Dr) carries a lot of information about their age and social standing.
- **Master**: Historically used for boys, so almost always a child.
- **Miss**: Usually younger, but can be an unmarried adult.
- **Mrs**: Married, likely an adult.
- **Mr**: Adult male.

In [2]:
def extract_title(name):
    # Raw string so that \. is passed to the regex engine without an invalid-escape warning
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

full_data['Title'] = full_data['Name'].apply(extract_title)

# Simplify Titles
title_mapping = {
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", 
    "Lady": "Rare", "Countess": "Rare", "Capt": "Rare", "Col": "Rare", "Don": "Rare", 
    "Dr": "Rare", "Major": "Rare", "Rev": "Rare", "Sir": "Rare", "Jonkheer": "Rare", "Dona": "Rare"
}
full_data['Title'] = full_data['Title'].replace(title_mapping)

print(full_data['Title'].value_counts())

Mr        757
Miss      264
Mrs       198
Master     61
Rare       29
Name: Title, dtype: int64
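As a quick sanity check, the extraction pattern can be tried on a few strings in the `Surname, Title. Given names` format. This is only a sketch: the sample list is hand-picked for illustration, and the last entry is a deliberately malformed string to show the empty-string fallback.

```python
import re

def extract_title(name):
    # Same pattern as in the extraction cell: a space, a run of letters, then a period
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

# Hand-picked sample strings (illustrative)
samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",
]

titles = [extract_title(s) for s in samples]
print(titles)  # ['Mr', 'Miss', 'Countess', '']
```

Because the pattern requires the word to be immediately followed by a period, "the" is skipped and "Countess" is captured, which is why the mapping above can key on `Countess`.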


## 🧠 Smart Imputation Logic
Now we calculate the median age for each (Title, Pclass) group and use it to fill the gaps.

In [3]:
# Calculate median age by Title and Pclass for better accuracy
age_medians = full_data.groupby(['Title', 'Pclass'])['Age'].median()

def fill_age(row):
    if pd.isnull(row['Age']):
        return age_medians[row['Title'], row['Pclass']]
    return row['Age']

full_data['Age'] = full_data.apply(fill_age, axis=1)

# Check missing values
print("Missing Age values:", full_data['Age'].isnull().sum())

Missing Age values: 0
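The row-wise lookup works here because every (Title, Pclass) group has at least one known age. On other data a group might have none, in which case its median is `NaN` and the gap would stay unfilled. A possible vectorized variant with fallbacks, sketched on a made-up frame (the fallback order of title-and-class, then title, then overall median is an assumption, not part of the original analysis):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration: the ("Rare", 3) group has no known ages
df = pd.DataFrame({
    "Title":  ["Mr", "Mr", "Mr", "Miss", "Miss", "Rare"],
    "Pclass": [3, 3, 3, 1, 1, 3],
    "Age":    [20.0, 30.0, np.nan, 18.0, np.nan, np.nan],
})

# Per-row medians at decreasing levels of specificity
by_title_class = df.groupby(["Title", "Pclass"])["Age"].transform("median")
by_title = df.groupby("Title")["Age"].transform("median")
overall = df["Age"].median()  # computed before any filling

# Fill from the most specific median available, falling back when it is NaN
df["Age"] = df["Age"].fillna(by_title_class).fillna(by_title).fillna(overall)

print(df["Age"].tolist())  # [20.0, 30.0, 25.0, 18.0, 18.0, 20.0]
```

`transform("median")` returns a Series aligned with the original rows, so no Python-level loop over rows is needed.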


In [4]:
# Export Clean Data
full_data.to_csv('titanic_clean.csv', index=False)
print("✅ Data Saved to titanic_clean.csv")

✅ Data Saved to titanic_clean.csv
