# Titanic - Machine Learning from Disaster

## Overview
This notebook analyzes the Titanic dataset and trains several machine learning models to predict passenger survival. The workflow covers data exploration, preprocessing, feature engineering, model selection, hyperparameter optimization, and generating a submission file.

## Table of Contents
1. [Data Exploration and Visualization](#1.-Data-Exploration-and-Visualization)
2. [Data Cleaning and Preprocessing](#2.-Data-Cleaning-and-Preprocessing)
3. [Feature Engineering](#3.-Feature-Engineering)
4. [Model Selection and Training](#4.-Model-Selection-and-Training)
5. [Model Optimization](#5.-Model-Optimization)
6. [Testing and Submission](#6.-Testing-and-Submission)

## Setup
First, let's import all necessary libraries and set up our environment.

In [1]:
# Add src directory to Python path
import sys
sys.path.append('..')

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Custom modules
from src.data_processing import (
    load_data,
    handle_missing_values,
    encode_categorical_features,
    create_features,
    scale_features,
    prepare_data
)
from src.visualization import (
    set_plotting_style,
    plot_survival_by_feature,
    plot_age_distribution,
    plot_correlation_matrix,
    plot_feature_importance,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_model_comparison,
    create_analysis_plots
)

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report, confusion_matrix

# Settings
import warnings
warnings.filterwarnings('ignore')
set_plotting_style()
%matplotlib inline

# Set random seed for reproducibility
np.random.seed(42)


# 1. Data Exploration and Visualization

In this section, we'll:
1. Load the dataset
2. Analyze basic statistics
3. Visualize relationships between features
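
As a rough illustration of the "basic statistics" step, the sketch below runs a few pandas summaries on a tiny hand-made DataFrame. The values are invented and the column names (`Survived`, `Pclass`, `Sex`, `Age`) are assumed to follow the standard Kaggle Titanic layout; the same calls would apply to `train_df`.

```python
import pandas as pd

# Tiny hand-made stand-in for the training data (illustrative values only).
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0, 27.0, None],
})

# Count missing values per column to see what will need imputation later.
missing_counts = toy_df.isnull().sum()

# The mean of a 0/1 column is the fraction of survivors in each group.
survival_by_sex = toy_df.groupby("Sex")["Survived"].mean()

print(toy_df.describe())
print(missing_counts)
print(survival_by_sex)
```

Grouping by other categorical columns such as `Pclass` works the same way and gives a quick numeric counterpart to the survival plots.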

In [None]:
# Load the datasets
train_df, test_df = load_data('../data/raw/train.csv', '../data/raw/test.csv')

print("Training set shape:", train_df.shape)
print("Test set shape:", test_df.shape)

# Display first few rows
train_df.head()

In [None]:
# Create comprehensive analysis plots
create_analysis_plots(train_df)

# 2. Data Cleaning and Preprocessing

Now we'll prepare our data for modeling by:
1. Handling missing values
2. Encoding categorical variables
3. Scaling numerical features
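
The internals of `prepare_data` are not shown in this notebook, so the following is only a minimal sketch of what the three steps above might look like, using plain pandas and scikit-learn on invented data. The imputation strategy (median and mode), the one-hot encoding, and the column names are all assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented sample rows; column names assume the standard Kaggle layout.
toy_df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, 35.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10],
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# 1. Missing values: median for numeric columns, most frequent value for
#    categorical ones (an assumed strategy, not necessarily the project's).
age_median = toy_df["Age"].median()
toy_df["Age"] = toy_df["Age"].fillna(age_median)
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode()[0])

# 2. Encoding: one-hot encode the categorical columns as 0/1 integers.
encoded = pd.get_dummies(toy_df, columns=["Sex", "Embarked"], dtype=int)

# 3. Scaling: standardize numeric columns to zero mean and unit variance.
scaler = StandardScaler()
encoded[["Age", "Fare"]] = scaler.fit_transform(encoded[["Age", "Fare"]])

print(encoded)
```

In a real workflow the scaler would be fitted on the training split only and then applied to the validation and test sets, so that no information leaks from held-out data.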

In [None]:
# Prepare data using our custom functions
X_train, X_val, y_train, y_val, test_processed, scaler = prepare_data(train_df, test_df)

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", test_processed.shape)

# 3. Feature Engineering

Our prepare_data function has already created several new features:
1. FamilySize: Combined SibSp and Parch
2. IsAlone: Binary indicator for solo travelers
3. FarePerPerson: Fare divided by family size
4. AgeGroup: Categorized age into meaningful groups
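
The four features listed above could be derived roughly as follows. This is a sketch on invented data, not the actual `create_features` implementation: counting the passenger in `FamilySize` (the `+ 1`) and the age bin edges and labels are assumptions.

```python
import pandas as pd

# Invented sample rows; column names assume the standard Kaggle layout.
toy_df = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 0],
    "Fare":  [14.0, 8.0, 30.0, 50.0],
    "Age":   [29.0, 4.0, 15.0, 62.0],
})

# FamilySize: siblings/spouses + parents/children + the passenger (assumed +1).
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# IsAlone: 1 when the passenger travels without family, else 0.
toy_df["IsAlone"] = (toy_df["FamilySize"] == 1).astype(int)

# FarePerPerson: ticket fare split evenly across the family.
toy_df["FarePerPerson"] = toy_df["Fare"] / toy_df["FamilySize"]

# AgeGroup: bin ages into labeled ranges (bin edges are hypothetical).
toy_df["AgeGroup"] = pd.cut(
    toy_df["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["Child", "Teen", "Adult", "Senior"],
)

print(toy_df)
```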

Let's analyze the importance of these features.

In [None]:
# Train a Random Forest to get feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Create feature importance DataFrame
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

# Plot feature importance
plot_feature_importance(feature_importance)

# 4. Model Selection and Training

We'll train and evaluate three different models:
1. Logistic Regression
2. Random Forest
3. Support Vector Machine

In [None]:
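def evaluate_model(model, X_train, X_val, y_train, y_val):
    """Fit the model on the training split and return validation metrics."""
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_val)

    # ROC-AUC needs continuous scores, not hard labels: use class-1 probabilities
    # when available, otherwise the decision function (e.g. SVC without probability=True)
    if hasattr(model, 'predict_proba'):
        y_score = model.predict_proba(X_val)[:, 1]
    else:
        y_score = model.decision_function(X_val)

    # Calculate metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred)
    roc_auc = roc_auc_score(y_val, y_score)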
    
    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'ROC-AUC': roc_auc
    }

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42)
}

# Evaluate each model
results = {}
for name, model in models.items():
    print(f"\nEvaluating {name}...")
    results[name] = evaluate_model(model, X_train, X_val, y_train, y_val)

# Create and display comparison plot
results_df = pd.DataFrame(results).round(3)
plot_model_comparison(results_df)
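
ROC-AUC measures how well a model ranks positive cases above negative ones, so it is normally computed from continuous scores (probabilities or decision values) rather than hard class labels. The sketch below uses made-up labels and scores to show that the two inputs can give different values; the 0.5 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and model scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.90])

# Hard labels obtained by thresholding the scores at 0.5.
y_label = (y_score >= 0.5).astype(int)

# AUC from scores uses the full ranking; AUC from labels only sees two levels.
auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from labels: {auc_from_labels:.3f}")
```

Thresholding discards the ordering within each predicted class, which is why the label-based value differs from the score-based one here.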

# 5. Model Optimization

We'll tune the hyperparameters of each model with GridSearchCV, using 5-fold cross-validation on the training set and accuracy as the selection metric.

In [None]:
# Define parameter grids for each model
param_grids = {
    'Logistic Regression': {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    },
    'Random Forest': {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'kernel': ['rbf', 'linear'],
        'gamma': ['scale', 'auto']
    }
}

# Perform grid search for each model
optimized_models = {}
optimized_results = {}

for name, model in models.items():
    print(f"\nOptimizing {name}...")
    grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    optimized_models[name] = grid_search.best_estimator_
    optimized_results[name] = evaluate_model(grid_search.best_estimator_,
                                            X_train,
                                            X_val,
                                            y_train,
                                            y_val)
    
    print(f"Best parameters: {grid_search.best_params_}")

# Create and display optimized model comparison plot
optimized_results_df = pd.DataFrame(optimized_results).round(3)
plot_model_comparison(optimized_results_df)
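
The Random Forest grid above has 3 × 4 × 3 × 3 = 108 combinations, which means 540 model fits with 5-fold cross-validation. If that becomes too slow, `RandomizedSearchCV` (already imported in the setup cell) samples a fixed number of combinations instead. The sketch below is an alternative, not what this notebook runs: it uses synthetic data from `make_classification`, and the reduced search space, `n_iter=5`, and `cv=3` are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Candidate values to sample from (a reduced, hypothetical search space).
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Evaluate only n_iter randomly chosen combinations instead of the full grid.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

The trade-off is that a random search may miss the best grid point, but its cost is set by `n_iter` rather than by the size of the grid.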

# 6. Testing and Submission

We'll use the tuned model with the highest validation accuracy to make predictions on the test set.

In [None]:
# Find the best model based on validation accuracy
best_model_name = max(optimized_results, key=lambda k: optimized_results[k]['Accuracy'])
best_model = optimized_models[best_model_name]

print(f"Best performing model: {best_model_name}")

# Make predictions on test set
test_predictions = best_model.predict(test_processed)

# Create submission file
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': test_predictions
})

# Save predictions
submission.to_csv('../Prince_submission.csv', index=False)
print("\nSubmission file has been created!")
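
Before uploading, it can help to sanity-check the submission frame. The helper below, `check_submission`, is a hypothetical function written for this sketch (it is not part of the project's `src` package); it assumes the expected format is exactly two columns, `PassengerId` and `Survived`, with 0/1 predictions and one row per test passenger.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission does not match the assumed format."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError("columns must be exactly PassengerId, Survived")
    if submission.isnull().any().any():
        raise ValueError("submission contains missing values")
    if submission["PassengerId"].tolist() != expected_ids.tolist():
        raise ValueError("PassengerId values do not match the test set")
    if not set(submission["Survived"].unique()) <= {0, 1}:
        raise ValueError("Survived must contain only 0 or 1")


# Toy usage with invented IDs and predictions.
toy_ids = pd.Series([892, 893, 894])
toy_submission = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
check_submission(toy_submission, toy_ids)
print("Submission format looks valid.")
```

In this notebook the equivalent call would be `check_submission(submission, test_df['PassengerId'])`, run just before `to_csv`.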