# Titanic - Machine Learning from Disaster

## Overview

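This notebook analyzes the Titanic dataset and trains three classifiers (Logistic Regression, Random Forest, and SVM) to predict passenger survival. It covers exploration, preprocessing, feature engineering, model comparison, hyperparameter tuning, and submission.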

## Table of Contents

1. [Task 1: Data Exploration and Visualization](#1.-Task-1:-Data-Exploration-and-Visualization)

   - Load dataset using DataLoader
   - Analyze key statistics
   - Visualize relationships using TitanicVisualizer

2. [Task 2: Data Cleaning and Preprocessing](#2.-Task-2:-Data-Cleaning-and-Preprocessing)

   - Handle missing values using FeatureProcessor
   - Encode categorical variables
   - Scale features
   - Split dataset

3. [Task 3: Feature Engineering](#3.-Task-3:-Feature-Engineering)

   - Generate new features using FeatureProcessor
   - Perform feature selection
   - Analyze feature importance

4. [Task 4: Model Selection and Training](#4.-Task-4:-Model-Selection-and-Training)

   - Train multiple models
   - Use cross-validation
   - Compare models using multiple metrics

5. [Task 5: Model Optimization](#5.-Task-5:-Model-Optimization)

   - Perform hyperparameter tuning
   - Evaluate optimized models

6. [Task 6: Testing and Submission](#6.-Task-6:-Testing-and-Submission)

   - Make predictions on test set
   - Generate submission file

## Setup

First, let's import all necessary libraries and initialize our processors.


In [1]:
# Add the parent directory (project root) to the Python path so the `src` package can be imported
import sys
sys.path.append('..')

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import StandardScaler

# Custom modules
from src.data_processing import DataLoader, FeatureProcessor, TitanicPreprocessor, ModelEvaluator
from src.visualization import TitanicVisualizer

# Initialize processors
data_loader = DataLoader()
feature_processor = FeatureProcessor()
preprocessor = TitanicPreprocessor()
visualizer = TitanicVisualizer()
model_evaluator = ModelEvaluator(visualizer)


# Set random seed for reproducibility
np.random.seed(42)

## Define Data Paths


In [2]:
# Set base path for data
base_path = '../data/raw/'

# Construct the full path to the training dataset
train_data_path = base_path + 'train.csv'

# Construct the full path to the test dataset
test_data_path = base_path + 'test.csv'


# 1. Task 1: Data Exploration and Visualization

## 1.1 Load Dataset using DataLoader


In [3]:
# Load raw data using DataLoader
train_data = data_loader.load_csv(train_data_path)
test_data = data_loader.load_csv(test_data_path)

print("Training set shape:", train_data.shape)
print("Test set shape:", test_data.shape)

# Display first few rows
display(train_data.head())

Training set shape: (712, 12)
Test set shape: (179, 11)


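|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 693 | 3 | Lam, Mr. Ali | male | NaN | 0 | 0 | 1601 | 56.4958 | NaN | S | 1 |
| 1 | 482 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0 | NaN | S | 0 |
| 2 | 528 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S | 0 |
| 3 | 856 | 3 | Aks, Mrs. Sam (Leah Rosen) | female | 18.0 | 0 | 1 | 392091 | 9.35 | NaN | S | 1 |
| 4 | 802 | 2 | Collyer, Mrs. Harvey (Charlotte Annie Tate) | female | 31.0 | 1 | 1 | C.A. 31921 | 26.25 | NaN | S | 1 |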


## 1.2 Analyze Key Statistics


In [4]:
# Display basic statistics
print("\nBasic Statistics:")
display(train_data.describe())

# Display info about data types and missing values
# (info() prints its report directly and returns None, so it is not wrapped in display())
print("\nDataset Info:")
train_data.info()

# Calculate missing values
missing_values = train_data.isnull().sum()
print("\nMissing Values:")
display(missing_values[missing_values > 0])


Basic Statistics:


|   | PassengerId | Pclass | Age | SibSp | Parch | Fare | Survived |
|---|---|---|---|---|---|---|---|
| count | 712.0 | 712.0 | 575.0 | 712.0 | 712.0 | 712.0 | 712.0 |
| mean | 444.405899 | 2.308989 | 29.807687 | 0.492978 | 0.390449 | 31.819826 | 0.383427 |
| std | 257.465527 | 0.833563 | 14.485211 | 1.06072 | 0.838134 | 48.059104 | 0.486563 |
| min | 1.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 222.75 | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 0.0 |
| 50% | 439.5 | 3.0 | 28.5 | 0.0 | 0.0 | 14.4542 | 0.0 |
| 75% | 667.25 | 3.0 | 39.0 | 1.0 | 0.0 | 31.0 | 1.0 |
| max | 891.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 |



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Pclass       712 non-null    int64  
 2   Name         712 non-null    object 
 3   Sex          712 non-null    object 
 4   Age          575 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Ticket       712 non-null    object 
 8   Fare         712 non-null    float64
 9   Cabin        160 non-null    object 
 10  Embarked     710 non-null    object 
 11  Survived     712 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 66.9+ KB



Missing Values:


Age         137
Cabin       552
Embarked      2
dtype: int64
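
Raw counts are easier to judge as a share of the 712 training rows. The short sketch below recomputes those shares from the counts printed above. Inside the notebook, `train_data.isnull().mean() * 100` should give the same figures directly.

```python
# Counts copied from the "Missing Values" output above
n_rows = 712
missing_counts = {"Age": 137, "Cabin": 552, "Embarked": 2}

# Share of rows with a missing value, in percent, rounded to one decimal
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}

print(missing_pct)  # {'Age': 19.2, 'Cabin': 77.5, 'Embarked': 0.3}
```

With roughly three quarters of Cabin missing, it is unsurprising that the later steps keep only a `HasCabin` indicator and a `Deck` category rather than imputing the column.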

## 1.3 Visualize Relationships using TitanicVisualizer


In [5]:
# Create directory for plots
import os
plots_dir = '../plots'
os.makedirs(plots_dir, exist_ok=True)

# Plot survival rates by various features
for feature in ['Sex', 'Pclass', 'Embarked']:
    visualizer.plot_survival_by_feature(train_data, feature)

# Plot age distribution
visualizer.plot_age_distribution(train_data)

# Plot correlation matrix for numerical features
numeric_data = train_data.select_dtypes(include=[np.number])
visualizer.plot_correlation_matrix(numeric_data)

# 2. Task 2: Data Cleaning and Preprocessing

## 2.1 Handle Missing Values using FeatureProcessor


In [6]:
# Handle missing values
train_clean = feature_processor.handle_missing_values(train_data)
test_clean = feature_processor.handle_missing_values(test_data)

# Compute the per-column missing counts once and show only the non-zero ones
missing_after = train_clean.isnull().sum()
print("Missing values after handling:")
print(missing_after[missing_after > 0])

Missing values after handling:
Cabin    552
dtype: int64
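
The output shows that Age and Embarked were filled while Cabin keeps its 552 gaps. `FeatureProcessor.handle_missing_values` is not shown in this notebook, so the following is only a sketch of one common approach, assuming median imputation for Age and most-frequent-value imputation for Embarked. The function name `impute_basic` and its `reference` parameter are hypothetical.

```python
import pandas as pd


def impute_basic(df, reference=None):
    """Fill Age with the median and Embarked with the mode of `reference`.

    `reference` is the frame the statistics are computed from (typically the
    training set); it defaults to `df` itself. Cabin is left untouched.
    """
    reference = df if reference is None else reference
    out = df.copy()
    out["Age"] = out["Age"].fillna(reference["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(reference["Embarked"].mode().iloc[0])
    return out


# Tiny made-up frame, only to exercise the function
sample = pd.DataFrame({
    "Age": [18.0, None, 31.0, None],
    "Embarked": ["S", None, "S", "C"],
    "Cabin": [None, "C95", None, None],
})
filled = impute_basic(sample)
print(filled)
```

Passing the training frame as `reference` when imputing the test set keeps test-set statistics out of the pipeline. Whether the notebook's processor does this is not visible here.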


## 2.2 Feature Engineering and Encoding


In [7]:
# Create new features
train_featured = feature_processor.create_features(train_clean)
test_featured = feature_processor.create_features(test_clean)

# Encode categorical variables
categorical_features = ['Sex', 'Embarked', 'Title', 'Deck', 'AgeGroup']
train_encoded = feature_processor.encode_categorical_features(train_featured, categorical_features)
test_encoded = feature_processor.encode_categorical_features(test_featured, categorical_features)

# Scale features
features_to_scale = ['Age', 'Fare', 'FarePerPerson']
train_scaled = feature_processor.scale_features(train_encoded, features_to_scale)
test_scaled = feature_processor.scale_features(test_encoded, features_to_scale)

print("Features after preprocessing:")
print(train_scaled.columns.tolist())

Features after preprocessing:
['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived', 'HasCabin', 'FamilySize', 'IsAlone', 'FarePerPerson', 'Sex_male', 'Embarked_Q', 'Embarked_S', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Deck_Unknown', 'AgeGroup_Teenager', 'AgeGroup_Young Adult', 'AgeGroup_Adult', 'AgeGroup_Senior', 'AgeGroup_Elderly']
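
Several of the listed columns (`FamilySize`, `IsAlone`, `FarePerPerson`, `HasCabin`, `Title_*`) are engineered inside `create_features`, whose code is not shown. The sketch below uses the definitions commonly applied to this dataset. They are assumptions, not necessarily what `FeatureProcessor` does, and `add_basic_features` is a hypothetical name. The sample rows are taken from the `head()` output in section 1.1.

```python
import pandas as pd


def add_basic_features(df):
    """Add a few commonly used Titanic features (assumed definitions)."""
    out = df.copy()
    # Passenger plus siblings/spouses and parents/children aboard
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    out["FarePerPerson"] = out["Fare"] / out["FamilySize"]
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    # The title is the text between the comma and the first period in Name
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    return out


rows = pd.DataFrame({
    "Name": [
        "Lam, Mr. Ali",
        "Farthing, Mr. John",
        "Aks, Mrs. Sam (Leah Rosen)",
        "Collyer, Mrs. Harvey (Charlotte Annie Tate)",
    ],
    "SibSp": [0, 0, 0, 1],
    "Parch": [0, 0, 1, 1],
    "Fare": [56.4958, 221.7792, 9.35, 26.25],
    "Cabin": [None, "C95", None, None],
})
featured = add_basic_features(rows)
print(featured[["FamilySize", "IsAlone", "FarePerPerson", "HasCabin", "Title"]])
```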


## 2.3 Prepare Final Dataset using TitanicPreprocessor


In [8]:
# Run the full preprocessing pipeline from the raw CSV files and hold out 20% of the training rows for validation
processed_data = preprocessor.prepare_data(
    train_path=train_data_path,
    test_path=test_data_path,
    test_size=0.2,
    random_state=42
)

X_train = processed_data['X_train']
X_val = processed_data['X_val']
y_train = processed_data['y_train']
y_val = processed_data['y_val']
test_processed = processed_data['test_processed']

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", test_processed.shape)

Training set shape: (569, 1746)
Validation set shape: (143, 1746)
Test set shape: (179, 1746)
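
The 1,746 columns reported here are far more than the 31 listed in section 2.2. One possible explanation, which this notebook does not confirm, is that `prepare_data` one-hot encodes a high-cardinality text column such as Name or Ticket. A quick way to check for such columns is sketched below on synthetic data; `high_cardinality_columns` is a hypothetical helper, not part of the project.

```python
import pandas as pd


def high_cardinality_columns(df, max_unique=20):
    """Return non-numeric columns with more than `max_unique` distinct values."""
    non_numeric = df.select_dtypes(exclude="number")
    counts = non_numeric.nunique()
    return counts[counts > max_unique].sort_values(ascending=False)


# Synthetic stand-in: Name and Ticket are unique per row, Sex has two values
n = 30
demo = pd.DataFrame({
    "Name": [f"Passenger {i}" for i in range(n)],
    "Ticket": [f"T{i:04d}" for i in range(n)],
    "Sex": ["male", "female"] * (n // 2),
    "Fare": [7.25 + i for i in range(n)],
})

suspects = high_cardinality_columns(demo, max_unique=10)
print(suspects)

# One-hot encoding a unique-per-row column adds one column per row
wide = pd.get_dummies(demo[["Name", "Sex"]], columns=["Name", "Sex"])
print(wide.shape)  # (30, 32)
```

Running the helper on the frame that enters the encoder would show whether columns like these should be dropped or reduced (for example to `Title`) before encoding.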


# 3. Task 3: Feature Engineering

## 3.1 Analyze Feature Importance


In [9]:
# Train a Random Forest for feature importance
rf_model = RandomForestClassifier(random_state=42)

# Separate the numeric columns from the categorical ones that may still need encoding
categorical_columns = ['AgeGroup', 'Title', 'Deck', 'Sex', 'Embarked']
numeric_columns = X_train.select_dtypes(include=['int64', 'float64']).columns

# Keep numeric columns as is
X_train_final = X_train[numeric_columns].copy()

# Create dummies for categorical columns
for column in categorical_columns:
    if column in X_train.columns:  # Only process if column exists
        # Create dummy variables and drop the first category to avoid multicollinearity
        dummies = pd.get_dummies(X_train[column], prefix=column, drop_first=True)
        # Add the dummy columns to the dataset
        X_train_final = pd.concat([X_train_final, dummies], axis=1)

# Fit the model (y_train holds the 0/1 Survived labels, so no further encoding is needed)
rf_model.fit(X_train_final, y_train)

# Create feature importance DataFrame
feature_importance = pd.DataFrame({
    'Feature': X_train_final.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot feature importance
visualizer.plot_feature_importance(feature_importance)
print("Check plots in the 'plots' directory.")

Check plots in the 'plots' directory.


# 4. Task 4: Model Selection and Training


In [10]:

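# Define models with hand-picked starting parameters (systematic tuning follows in Task 5)
custom_models = {
    'Logistic Regression': LogisticRegression(
        random_state=42,
        max_iter=1000,  # Raised from the default of 100 to help convergence
        C=0.1  # Stronger regularization than the default C=1.0
    ),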
    'Random Forest': RandomForestClassifier(
        random_state=42,
        n_estimators=200,
        max_depth=10
    ),
    'SVM': SVC(
        probability=True,
        random_state=42,
        C=1.0,
        kernel='rbf',
        class_weight='balanced'
    )
}

# Train and evaluate all models using custom configurations
results_df = model_evaluator.evaluate_all_models(
    X_train, X_val, y_train, y_val, 
    models=custom_models
)

print("\nModel Performance:")
display(results_df)
print("Check plots in the 'plots' directory.")

Training Logistic Regression...
Cross-validation scores: [0.84210526 0.84210526 0.8245614  0.80701754 0.80530973]
Mean CV score: 0.824 (+/- 0.032)
Training Random Forest...
Cross-validation scores: [0.83333333 0.78947368 0.81578947 0.78070175 0.82300885]
Mean CV score: 0.808 (+/- 0.040)
Training SVM...
Cross-validation scores: [0.83333333 0.84210526 0.78947368 0.78947368 0.83185841]
Mean CV score: 0.817 (+/- 0.046)

Model Performance:


|   | Logistic Regression | Random Forest | SVM |
|---|---|---|---|
| accuracy | 0.832 | 0.804 | 0.804 |
| precision | 0.792 | 0.865 | 0.729 |
| recall | 0.764 | 0.582 | 0.782 |
| f1 | 0.778 | 0.696 | 0.754 |
| roc_auc | 0.85 | 0.842 | 0.858 |


Check plots in the 'plots' directory.
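
Two of the figures above can be checked by hand. The "+/-" after each mean CV score matches two population standard deviations of the five fold scores, and the validation metrics follow from confusion-matrix counts. The counts used below (42 true positives, 11 false positives, 13 false negatives, 77 true negatives) are inferred as one set consistent with the Logistic Regression column and the 143 validation rows. They are not printed by the notebook, and both function names are hypothetical.

```python
import statistics


def summarize_cv(scores):
    """Return the mean and two population standard deviations of fold scores."""
    return statistics.fmean(scores), 2 * statistics.pstdev(scores)


def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall
        "f1": 2 * precision * recall / (precision + recall),
    }


# Fold scores copied from the Logistic Regression output above
lr_cv_scores = [0.84210526, 0.84210526, 0.8245614, 0.80701754, 0.80530973]
cv_mean, cv_spread = summarize_cv(lr_cv_scores)
print(f"Mean CV score: {cv_mean:.3f} (+/- {cv_spread:.3f})")
# Mean CV score: 0.824 (+/- 0.032)

# Inferred counts (an assumption) that reproduce the Logistic Regression column
lr_metrics = {k: round(v, 3) for k, v in metrics_from_counts(42, 11, 13, 77).items()}
print(lr_metrics)
# {'accuracy': 0.832, 'precision': 0.792, 'recall': 0.764, 'f1': 0.778}
```

Read this way, Random Forest's high precision (0.865) paired with low recall (0.582) means it predicts survival rarely but is usually right when it does, which is why its F1 trails the other two models.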


# 5. Task 5: Model Optimization


In [11]:
# No models are passed below: perform_grid_search uses the defaults
# set in data_processing.py, shown here for reference:
# custom_models = {
#     'Logistic Regression': LogisticRegression(random_state=42),
#     'Random Forest': RandomForestClassifier(random_state=42),
#     'SVM': SVC(probability=True, random_state=42)
# }


optimized_df, optimized_models = model_evaluator.perform_grid_search(
    X_train, X_val, y_train, y_val,
)

print("\nOptimized Model Performance:")
display(optimized_df)


Optimizing Logistic Regression...
Best parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best cross-validation score: 0.835

Optimizing Random Forest...
Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}
Best cross-validation score: 0.840

Optimizing SVM...
Best parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Best cross-validation score: 0.837

Optimized Model Performance:


|   | Logistic Regression | Random Forest | SVM |
|---|---|---|---|
| accuracy | 0.818 | 0.825 | 0.832 |
| precision | 0.764 | 0.778 | 0.792 |
| recall | 0.764 | 0.764 | 0.764 |
| f1 | 0.764 | 0.771 | 0.778 |
| roc_auc | 0.892 | 0.86 | 0.903 |
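
`perform_grid_search` hides the search itself. Judging by the parameter names in the output (`C`, `solver`, `n_estimators`, and so on), it plausibly wraps scikit-learn's `GridSearchCV`, but that is an assumption. The sketch below shows the general pattern on synthetic data; the parameter grid is illustrative and is not the grid used in `data_processing.py`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Illustrative grid: every combination is scored with 5-fold cross-validation
param_grid = {
    "C": [0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation score: {search.best_score_:.3f}")

# By default the best estimator is refit on all of X, y and ready to predict
best_model = search.best_estimator_
```

The best cross-validation score is computed on the training folds only, so it can differ from the validation-set metrics in the table above, as it does for all three models here.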


# 6. Task 6: Testing and Submission


In [12]:

# Get best model and make predictions
best_model_name, submission = model_evaluator.get_best_model_and_predict(
    optimized_df, optimized_models, test_processed
)

# Save submission file
filename = 'Prince2_submission.csv'
model_evaluator.save_submission(submission, filename=filename)

Best model: SVM
Submission file saved to: ..\submissions\Prince2_submission.csv

Sample predictions:
   PassengerId  Survived
0        566.0         0
1        161.0         0
2        554.0         0
3        861.0         0
4        242.0         0


WindowsPath('../submissions/Prince2_submission.csv')
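
The sample predictions show `PassengerId` as floats (`566.0`), which usually means the column passed through a step that upcast it. Kaggle's Titanic submission format expects integer ids, so casting before writing is a cheap safeguard. This is a minimal sketch, assuming the submission frame has the two columns shown above; `to_submission_csv` is a hypothetical helper, not the project's `save_submission`.

```python
import io

import pandas as pd


def to_submission_csv(predictions):
    """Return CSV text with integer PassengerId and Survived columns."""
    out = predictions[["PassengerId", "Survived"]].copy()
    out["PassengerId"] = out["PassengerId"].astype(int)
    out["Survived"] = out["Survived"].astype(int)
    buffer = io.StringIO()
    out.to_csv(buffer, index=False)  # no index column in the submission file
    return buffer.getvalue()


# First rows of the sample predictions printed above
sample_predictions = pd.DataFrame({
    "PassengerId": [566.0, 161.0, 554.0],
    "Survived": [0, 0, 0],
})
csv_text = to_submission_csv(sample_predictions)
print(csv_text)
```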