# Titanic Survival Prediction Project
This notebook demonstrates an end-to-end machine learning pipeline for predicting Titanic passenger survival. It covers data cleaning, feature engineering, exploratory data analysis (EDA), model building, evaluation, and prediction filtering/sorting.

## 1. Load and Clean Data
Load the Titanic dataset and apply cleaning functions to handle missing values and perform necessary transformations.

In [None]:
# Import required libraries and custom modules
import os
import pandas as pd
from data_loader import load_data, clean_data
from feature_engineering import add_family_size, extract_title, encode_features
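# Path to the training data (expects titanic/train.csv relative to the notebook)
data_path = os.path.join('titanic', 'train.csv')
# Load the raw data, then apply the cleaning steps summarized below
df = load_data(data_path)
df_clean = clean_data(df)
df_clean.head()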

### Data Cleaning Summary
- Missing values in 'Age' are filled with the median age.
- Missing values in 'Embarked' are filled with the mode.
- Missing values in 'Cabin' are marked as 'Unknown', and a 'HasCabin' indicator feature is created.
- Columns not used for modeling ('Ticket', 'PassengerId') are dropped.
- Fare inconsistencies are fixed by replacing zeros and missing values with the median fare.
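
The cleaning steps above could look roughly like the following. This is a minimal sketch, not the actual contents of `data_loader.clean_data`, which is not shown in this notebook. The function name `clean_data_sketch`, the choice to compute the median fare after excluding zeros, and the tiny example frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data; mirrors the summary bullets above."""
    out = df.copy()
    # Fill missing ages with the median age
    out['Age'] = out['Age'].fillna(out['Age'].median())
    # Fill missing ports with the most frequent port (mode)
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode().iloc[0])
    # Flag whether a cabin was recorded, then mark missing cabins as 'Unknown'
    out['HasCabin'] = out['Cabin'].notna().astype(int)
    out['Cabin'] = out['Cabin'].fillna('Unknown')
    # Assumption: treat zero fares as missing, then fill with the median of the rest
    fare = out['Fare'].replace(0, np.nan)
    out['Fare'] = fare.fillna(fare.median())
    # Drop columns not used for modeling
    return out.drop(columns=['Ticket', 'PassengerId'])


# Made-up rows, only to exercise the function
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Age': [22.0, None, 30.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin': [None, 'C85', None, 'E46'],
    'Ticket': ['A1', 'B2', 'C3', 'D4'],
    'Fare': [7.25, 0.0, None, 80.0],
})
cleaned_toy = clean_data_sketch(toy)
```

Working on a copy keeps the raw frame intact, so `df` and `df_clean` can be compared side by side when checking what the cleaning changed.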

## 2. Feature Engineering
Apply feature engineering steps such as adding family size, extracting titles, and encoding categorical features.

In [None]:
# Apply feature engineering steps
df_fe = add_family_size(df_clean)
df_fe = extract_title(df_fe)
df_fe = encode_features(df_fe)
df_fe.head()

### Feature Engineering Summary
- Created 'FamilySize' as SibSp + Parch + 1.
- Extracted titles from 'Name' and grouped rare titles.
- Encoded categorical variables: Sex, Embarked, Title, Pclass.
- Scaled numerical features: Age, Fare, FamilySize.
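
As an illustration of the first two bullets, here is a minimal sketch of how family size and titles might be derived. The helper names with the `_sketch` suffix, the list of "common" titles, and the 'Rare' label are assumptions; the real `feature_engineering` module may group titles differently. Encoding and scaling are left out to keep the example short.

```python
import pandas as pd


def add_family_size_sketch(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Siblings/spouses + parents/children + the passenger themselves
    out['FamilySize'] = out['SibSp'] + out['Parch'] + 1
    return out


def extract_title_sketch(df: pd.DataFrame,
                         common=('Mr', 'Mrs', 'Miss', 'Master')) -> pd.DataFrame:
    out = df.copy()
    # Names look like "Surname, Title. Given names";
    # capture the text between the comma and the first period
    title = out['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
    # Assumption: every title outside `common` is grouped under 'Rare'
    out['Title'] = title.where(title.isin(list(common)), 'Rare')
    return out


# Made-up rows in the dataset's name format
toy = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Miss. Jane', 'Poe, Don. Manuel'],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 2],
})
fe_toy = extract_title_sketch(add_family_size_sketch(toy))
```

On the made-up rows, 'Don' falls outside the common titles and is therefore grouped as 'Rare'.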

## 3. Save Cleaned Dataset
Save the cleaned and preprocessed dataset to a CSV file for reference and reproducibility.

In [None]:
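# Save cleaned and preprocessed dataset (create the output folder if it does not exist yet)
output_path = os.path.join('titanic_ml', 'titanic_cleaned.csv')
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df_fe.to_csv(output_path, index=False)
print(f'Cleaned dataset saved as {output_path}')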

## 4. Exploratory Data Analysis (EDA)
Generate visualizations and insights: plot survival by gender, survival by passenger class, age distribution, survival by embarked port, and passenger class vs survival.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from eda import plot_survival_by_gender, plot_survival_by_pclass, plot_age_histogram, plot_survival_by_embarked, plot_pclass_vs_survival
plot_survival_by_gender(df_fe)
plot_survival_by_pclass(df_fe)
plot_age_histogram(df_fe)
plot_survival_by_embarked(df_fe)
plot_pclass_vs_survival(df_fe)

### EDA Insights
- **Gender:** Females had a much higher survival rate than males.
- **Passenger Class:** First class passengers had the highest survival rate, followed by second and third class.
- **Age:** Children had higher survival rates than older passengers.
- **Embarked Port:** Passengers who embarked at Cherbourg (C) had higher survival rates than those from the other ports.
- **Class vs Survival:** Survival rates varied significantly across passenger classes and age groups.
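
The insights above can be backed with numbers by computing survival rates per group. The sketch below assumes a 0/1 'Survived' column and unencoded 'Sex' and 'Pclass' columns (for example `df_clean` rather than `df_fe`); the helper `survival_rate_by` is hypothetical, and the rows are made up, so the resulting rates do not reflect the real dataset.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the 0/1 'Survived' column per group, i.e. the survival rate."""
    return df.groupby(column)['Survived'].mean().sort_values(ascending=False)


# Made-up rows, only to show the shape of the output
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 3, 3, 1, 3, 2],
    'Survived': [1, 1, 0, 1, 0, 0],
})
rate_by_sex = survival_rate_by(toy, 'Sex')
rate_by_class = survival_rate_by(toy, 'Pclass')
```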

## 5. Modeling
Split the data into training and test sets, train several machine learning models (e.g., Logistic Regression, Random Forest), and evaluate them with metrics such as the confusion matrix and ROC curve. Then run cross-validation and hyperparameter tuning on the Random Forest.

In [None]:
from modeling import split_data, train_models, evaluate_models, cross_validate_model, hyperparameter_tuning
X_train, X_test, y_train, y_test = split_data(df_fe)
models = train_models(X_train, y_train)
evaluate_models(models, X_test, y_test)
# Cross-validation and hyperparameter tuning for Random Forest
cross_validate_model(models['RandomForest'], X_train, y_train)
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, None]}
best_rf = hyperparameter_tuning(models['RandomForest'], param_grid, X_train, y_train)

### Model Evaluation Summary
- Multiple models trained: Logistic Regression, Random Forest, SVM.
- Evaluation metrics include accuracy, precision, recall, F1-score, confusion matrix, and ROC-AUC.
- Cross-validation and hyperparameter tuning performed for Random Forest.
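
For readers without access to the `modeling` module, the workflow summarized above might be approximated with scikit-learn as follows. This is a sketch under stated assumptions: the data is synthetic (generated with `make_classification`, not real passenger data), the SVM is omitted for brevity, and the split ratio, fold counts, and scoring metric are illustrative choices rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the Titanic features; not real passenger data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Column 1 holds the predicted probability of the positive class
    y_prob = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_prob),
        'confusion_matrix': confusion_matrix(y_test, y_pred),
    }

# 5-fold cross-validation on the training split only
cv_scores = cross_val_score(models['RandomForest'], X_train, y_train, cv=5)

# Grid search over the same grid used in the notebook cell above
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='accuracy'
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
```

Cross-validation and the grid search both run on the training split only, so the held-out test split stays untouched until the final evaluation.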

## 6. Filtering and Sorting Predictions
Filter predictions by passenger class, age range, and gender, and sort them by predicted survival probability. The same helpers can later back a Streamlit user interface.

In [None]:
from prediction_utils import filter_by_pclass, filter_by_age_range, filter_by_gender, sort_by_survival_probability
# Generate predictions using the best Random Forest model
# Column 1 of predict_proba is the probability of the positive class (Survived = 1)
y_prob = best_rf.predict_proba(X_test)[:, 1]
results = X_test.copy()
results['Survival_Prob'] = y_prob
results['Survived'] = y_test.values
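# Example filtering and sorting
# Note: these calls assume 'Pclass', 'Age', and 'Sex' in `results` still hold raw values;
# if encode_features encoded or scaled them, filter on the unencoded columns instead.
filtered_pclass = filter_by_pclass(results, 1)
filtered_age = filter_by_age_range(results, 20, 30)
filtered_gender = filter_by_gender(results, 'male')
sorted_results = sort_by_survival_probability(results)
filtered_pclass.head(), filtered_age.head(), filtered_gender.head(), sorted_results.head()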
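
The helpers imported from `prediction_utils` are not shown in this notebook. A minimal sketch of what they might do, assuming `results` keeps raw 'Pclass', 'Age', and 'Sex' columns next to 'Survival_Prob', is given below; the `_sketch` function names and the example rows are hypothetical.

```python
import pandas as pd


def filter_by_pclass_sketch(df, pclass):
    return df[df['Pclass'] == pclass]


def filter_by_age_range_sketch(df, min_age, max_age):
    # Series.between is inclusive on both ends by default
    return df[df['Age'].between(min_age, max_age)]


def filter_by_gender_sketch(df, gender):
    return df[df['Sex'] == gender]


def sort_by_survival_probability_sketch(df, ascending=False):
    # Highest predicted survival probability first by default
    return df.sort_values('Survival_Prob', ascending=ascending)


# Made-up predictions, only to exercise the helpers
toy = pd.DataFrame({
    'Pclass': [1, 3, 2, 1],
    'Age': [29.0, 41.0, 20.0, 35.0],
    'Sex': ['female', 'male', 'male', 'male'],
    'Survival_Prob': [0.91, 0.12, 0.35, 0.48],
})
young_first_class = filter_by_age_range_sketch(filter_by_pclass_sketch(toy, 1), 20, 30)
ranked = sort_by_survival_probability_sketch(toy)
```

Because each helper returns a DataFrame, the filters can be chained in any order, as in `young_first_class`.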

### Streamlit Integration
A Streamlit app can be developed to provide an interactive user interface for filtering and sorting predictions by passenger class, age range, and gender. This enables users to explore model results dynamically.
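
One possible shape for such an app is sketched below. It is an assumption-laden outline, not an existing part of this project: the function names `apply_filters` and `run_app`, the widget labels, and the age bounds are all hypothetical, and `results` is assumed to contain raw 'Pclass', 'Age', 'Sex', and 'Survival_Prob' columns.

```python
import pandas as pd


def apply_filters(results, pclasses, age_range, genders):
    """Pure filtering logic, kept separate from the UI so it can be tested on its own."""
    mask = (
        results['Pclass'].isin(pclasses)
        & results['Age'].between(*age_range)
        & results['Sex'].isin(genders)
    )
    return results[mask].sort_values('Survival_Prob', ascending=False)


def run_app(results):
    # Imported here so apply_filters stays usable without Streamlit installed
    import streamlit as st

    st.title('Titanic Survival Predictions')
    # Sidebar widgets for the three filters described above
    pclasses = st.sidebar.multiselect('Passenger class', [1, 2, 3], default=[1, 2, 3])
    age_range = st.sidebar.slider('Age range', 0, 80, (0, 80))
    genders = st.sidebar.multiselect('Gender', ['male', 'female'],
                                     default=['male', 'female'])
    # Show the filtered rows, highest survival probability first
    st.dataframe(apply_filters(results, pclasses, age_range, genders))
```

To try it, the code could be saved in a script (say, a hypothetical `app.py`) that loads a `results` DataFrame, calls `run_app(results)` at the bottom, and is launched with `streamlit run app.py`.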

**Note:** The `titanic` folder contains separate train.csv and test.csv files, so a separate train/test split is not strictly required. train.csv is used for training and validation (the split in Section 5 serves as the validation hold-out), and test.csv is reserved for final predictions.