# Titanic Data Cleaning
This notebook handles loading and cleaning the Titanic dataset.

## Data cleaning and preparation

This notebook performs the following steps:

1. Load the Titanic dataset (using `seaborn` for a stable schema).
2. Inspect the raw data and identify missing values.
3. Perform simple and defensible cleaning steps:
   - Fill missing `Age` values with the median.
   - Fill missing `Embarked` values with the mode.
   - Drop `deck` (mostly missing) and the redundant `alive` and `embark_town` columns.
   - Encode `Sex` as a binary feature (male=0, female=1).
4. Save the cleaned dataset to `data/processed/cleaned_titanic.csv` (written as `../../data/processed/cleaned_titanic.csv` relative to this notebook) for subsequent analysis and modeling.

Note: The goal of this notebook is to produce a reproducible, tidy dataset for the EDA and model training that follow.
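
To make the imputation choices in step 3 concrete, here is a minimal sketch on hypothetical toy values (not real Titanic rows). It shows why the median is the more defensible fill for `Age` when a few extreme values are present, and how the mode is taken for `Embarked`. The names `ages`, `ports`, and the `*_filled` variables are illustrative only.

```python
import pandas as pd

# Hypothetical toy values for illustration; not taken from the Titanic data
ages = pd.Series([22.0, 38.0, None, 35.0, 80.0], name="Age")
ports = pd.Series(["S", "C", "S", None, "Q"], name="Embarked")

# Both statistics skip missing values by default
age_median = ages.median()  # middle of the observed ages
age_mean = ages.mean()      # pulled upward by the single large value (80)

# Fill the gap with the median rather than the mean
ages_filled = ages.fillna(age_median)

# mode() returns a Series because ties are possible; take the first entry
port_mode = ports.mode()[0]
ports_filled = ports.fillna(port_mode)

print("median:", age_median, "mean:", age_mean)
print("most frequent port:", port_mode)
```

With these toy values the mean sits well above the median, which is the effect the median fill is meant to avoid.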

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
import seaborn as sns

# Create directories if they don't exist
os.makedirs('../../data/processed', exist_ok=True)

# Load the Titanic dataset from seaborn (stable and consistent schema)
df = sns.load_dataset('titanic')

# Rename the core columns to capitalized names; other seaborn columns keep their lowercase names
rename_map = {
    'survived': 'Survived',
    'pclass': 'Pclass',
    'sex': 'Sex',
    'age': 'Age',
    'sibsp': 'SibSp',
    'parch': 'Parch',
    'fare': 'Fare',
    'embarked': 'Embarked'
}

df = df.rename(columns=rename_map)

# Preview the loaded data
print('Loaded dataset shape:', df.shape)
df.head()
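
Step 2 of the outline (inspecting the raw data for missing values) could look like the sketch below. It runs on a small hypothetical frame so it stays self-contained; `toy` and `missing_summary` are illustrative names, and in the notebook the helper would presumably be called on `df` right after loading.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded frame; values are made up
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "deck": [None, "C", None, None],
})

def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    return summary.sort_values("n_missing", ascending=False)

report = missing_summary(toy)
print(report)
```

Columns near the top of such a report are candidates for dropping, while columns with only a few gaps are candidates for imputation.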

In [None]:
# Handle missing values
# Age: fill with median (robust to outliers)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Embarked: fill with mode (most frequent boarding point)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop 'deck' (mostly missing) and the redundant 'alive' and 'embark_town',
# which duplicate 'Survived' and 'Embarked'; errors='ignore' skips absent columns
df = df.drop(columns=['deck', 'alive', 'embark_town'], errors='ignore')

# Encode categorical variables in a simple, reproducible way
# Sex: male -> 0, female -> 1
if 'Sex' in df.columns:
    df['Sex'] = df['Sex'].astype(str).str.lower().map({'male': 0, 'female': 1}).astype(int)

# Embarked: ensure string codes (C, Q, S)
df['Embarked'] = df['Embarked'].astype(str)

# Final verification
print('\nAfter cleaning:')
df.info()
print('\nMissing values per column:')
print(df.isnull().sum())
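
The printed summary above relies on reading the output by eye. As a complement, the following sketch expresses the same expectations as explicit checks. `validate_cleaned` is a hypothetical helper, not part of the notebook, and the two small frames are made-up examples. Note the string `"nan"` in the failing frame: it is what `astype(str)` produces for a missing value in an object column, so a value-set check catches gaps that `isnull()` would no longer see.

```python
import pandas as pd

def validate_cleaned(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = [f"missing column: {col}"
                for col in ("Age", "Sex", "Embarked")
                if col not in frame.columns]
    if problems:
        return problems
    if frame["Age"].isna().any():
        problems.append("Age still has missing values")
    if not frame["Sex"].isin([0, 1]).all():
        problems.append("Sex contains values other than 0/1")
    if not frame["Embarked"].isin(["C", "Q", "S"]).all():
        problems.append("Embarked contains values other than C/Q/S")
    return problems

# Hypothetical frames: one that passes, one that fails two checks
good = pd.DataFrame({"Age": [22.0, 28.0], "Sex": [0, 1], "Embarked": ["S", "C"]})
bad = pd.DataFrame({"Age": [22.0, None], "Sex": [0, 1], "Embarked": ["S", "nan"]})

print(validate_cleaned(good))
print(validate_cleaned(bad))
```

In the notebook, one option would be to call `validate_cleaned(df)` before saving and stop if the returned list is not empty.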

In [None]:
# Save cleaned data
out_path = '../../data/processed/cleaned_titanic.csv'
df.to_csv(out_path, index=False)
print(f"Data cleaning complete. Cleaned data saved to {out_path} (shape: {df.shape})")
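
Since later notebooks read this file back, a quick round-trip check can confirm that the CSV preserves the shape and columns. The sketch below uses a temporary directory and a small hypothetical frame rather than the real output path; `cleaned` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for df
cleaned = pd.DataFrame({"Age": [22.0, 28.0], "Sex": [0, 1], "Embarked": ["S", "C"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_titanic.csv")
    cleaned.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)

# Compare values only; a CSV file does not store dtypes
pd.testing.assert_frame_equal(reloaded, cleaned, check_dtype=False)
print("round trip ok:", same_shape and same_columns)
```

Because CSV carries no dtype information, categorical columns come back as plain strings after reloading, so any downstream code that depends on specific dtypes would need to restore them.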