# Titanic - Machine Learning from Disaster

## Data Processing

This notebook contains the data-processing pipeline: loading raw data, handling missing values, applying cleaning rules, and saving processed outputs to `data/processed/` for modeling.

## Setup

In [9]:
import sys
sys.path.append("..")

from src.data_processing import load_raw_data, clean_data, save_processed_data

### Loading Raw Data
We load both the training and test datasets from the raw data directory:

In [10]:
RAW_TRAIN_DATA_FILE_NAME = "train.csv"
RAW_TEST_DATA_FILE_NAME = "test.csv"

train_raw = load_raw_data(RAW_TRAIN_DATA_FILE_NAME)
test_raw = load_raw_data(RAW_TEST_DATA_FILE_NAME)

print(f"Raw Train data shape: {train_raw.shape}")
print(f"Raw Test data shape: {test_raw.shape}")

train.csv successfully loaded from /Users/boramiklosbence/Documents/GitHub/vdn8wh-kutmod/02_assignment/notebooks/../data/raw/train.csv.
test.csv successfully loaded from /Users/boramiklosbence/Documents/GitHub/vdn8wh-kutmod/02_assignment/notebooks/../data/raw/test.csv.
Raw Train data shape: (891, 12)
Raw Test data shape: (418, 11)
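
The body of `load_raw_data` is not shown in this notebook. As a rough sketch only, a loader that produces a message like the one above could look as follows. The function name `load_raw_data_sketch`, the `RAW_DATA_DIR` constant, and the `data_dir` parameter are hypothetical and are not taken from `src.data_processing`.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the raw files, relative to the notebook directory.
RAW_DATA_DIR = Path("..") / "data" / "raw"


def load_raw_data_sketch(file_name: str, data_dir: Path = RAW_DATA_DIR) -> pd.DataFrame:
    """Read one raw CSV into a DataFrame and report where it came from."""
    path = data_dir / file_name
    if not path.is_file():
        # Fail early with a clear message instead of a pandas parsing error.
        raise FileNotFoundError(f"{file_name} not found at {path}")
    df = pd.read_csv(path)
    print(f"{file_name} successfully loaded from {path}.")
    return df
```

Passing the directory as a parameter keeps the function easy to test against a temporary folder.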


### Missing Values Analysis - Training Data
We examine the missing values in the training dataset:

In [11]:
train_raw.isna().sum().sort_values(ascending=False)

Cabin          687
Age            177
Embarked         2
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Fare             0
dtype: int64

### Missing Values Analysis - Test Data
We examine the missing values in the test dataset:

In [12]:
test_raw.isna().sum().sort_values(ascending=False)

Cabin          327
Age             86
Fare             1
PassengerId      0
Pclass           0
Name             0
Sex              0
SibSp            0
Parch            0
Ticket           0
Embarked         0
dtype: int64
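
Because the two files differ in size (891 versus 418 rows), raw counts are hard to compare directly. A small helper that reports percentages alongside counts can make this easier. This is a minimal sketch; `missing_summary` is a hypothetical name, and the demo frame is illustrative rather than Titanic data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Tiny illustrative frame (not the Titanic data).
demo = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Fare": [7.25, 71.28, None, 8.05],
    "Sex": ["m", "f", "f", "m"],
})
print(missing_summary(demo))
```

Applied to the counts above, `Cabin` is missing in roughly 77% of training rows (687 of 891) and about 78% of test rows (327 of 418).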

### Data Cleaning
We apply our cleaning pipeline to both datasets. Each processed frame has one column fewer than its raw counterpart because `Cabin` is removed:

In [13]:
train_processed = clean_data(train_raw)
test_processed = clean_data(test_raw)

print(f"Processed Train data shape: {train_processed.shape}")
print(f"Processed Test data shape: {test_processed.shape}")

Data cleaning successfully completed.
Data cleaning successfully completed.
Processed Train data shape: (891, 11)
Processed Test data shape: (418, 10)
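
The cleaning rules inside `clean_data` are not shown here. The outputs only tell us that `Cabin` disappears and that `Age`, `Fare`, and `Embarked` end up with no gaps. One plausible set of rules, given purely as an assumption, is to drop `Cabin`, fill numeric gaps with the column median, and fill `Embarked` with its most frequent value. `clean_data_sketch` is a hypothetical name and may differ from the real implementation.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning rules; the real clean_data may differ."""
    out = df.copy()
    # Cabin is mostly missing, so drop it if present.
    out = out.drop(columns=["Cabin"], errors="ignore")
    # Assumed imputation: median for the numeric columns.
    for col in ("Age", "Fare"):
        if col in out.columns:
            out[col] = out[col].fillna(out[col].median())
    # Assumed imputation: most frequent value for Embarked.
    if "Embarked" in out.columns:
        out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


# Tiny illustrative frame (not the Titanic data).
demo = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Fare": [7.0, None, 9.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})
cleaned = clean_data_sketch(demo)
print(cleaned.isna().sum())
```

In this sketch each frame is imputed from its own statistics. Reusing the training-set medians for the test set is a common alternative, but nothing in this notebook indicates which approach the actual pipeline takes.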


### Post-Processing - Training Data
After processing, we verify that all missing values have been handled:

In [14]:
train_processed.isna().sum().sort_values(ascending=False)

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

### Post-Processing - Test Data
After processing, we verify that all missing values have been handled:

In [15]:
test_processed.isna().sum().sort_values(ascending=False)

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
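
Reading the zero counts by eye works in a notebook, but the same check can be made to fail loudly if a later change reintroduces gaps. A minimal sketch follows, with `assert_no_missing` as a hypothetical helper name:

```python
import pandas as pd


def assert_no_missing(df: pd.DataFrame, name: str) -> None:
    """Raise AssertionError listing any columns that still contain NaN."""
    remaining = df.isna().sum()
    remaining = remaining[remaining > 0]
    assert remaining.empty, f"{name} still has missing values: {remaining.to_dict()}"


# Illustrative usage on a small frame with no gaps.
assert_no_missing(pd.DataFrame({"Age": [22.0, 30.0]}), "demo")
```

In this notebook it could be called as `assert_no_missing(train_processed, "train")` and `assert_no_missing(test_processed, "test")`.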

### Saving Processed Data
Finally, we save the cleaned datasets to the processed data directory:

In [16]:
PROCESSED_TRAIN_DATA_FILE_NAME = "train.csv"
PROCESSED_TEST_DATA_FILE_NAME = "test.csv"

save_processed_data(train_processed, PROCESSED_TRAIN_DATA_FILE_NAME)
save_processed_data(test_processed, PROCESSED_TEST_DATA_FILE_NAME)

train.csv successfully saved to /Users/boramiklosbence/Documents/GitHub/vdn8wh-kutmod/02_assignment/notebooks/../data/processed/train.csv
test.csv successfully saved to /Users/boramiklosbence/Documents/GitHub/vdn8wh-kutmod/02_assignment/notebooks/../data/processed/test.csv
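
As with loading, the body of `save_processed_data` is not shown. A minimal sketch of a comparable writer is given below. The name `save_processed_data_sketch`, the `PROCESSED_DATA_DIR` constant, and the `data_dir` parameter are hypothetical.

```python
from pathlib import Path

import pandas as pd

# Hypothetical output location, relative to the notebook directory.
PROCESSED_DATA_DIR = Path("..") / "data" / "processed"


def save_processed_data_sketch(
    df: pd.DataFrame, file_name: str, data_dir: Path = PROCESSED_DATA_DIR
) -> Path:
    """Write a DataFrame to CSV, creating the target directory if needed."""
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / file_name
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(path, index=False)
    print(f"{file_name} successfully saved to {path}")
    return path
```

Writing without the index means the file reads back with the same columns it was saved with.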
