# 🏁 Day 15: Capstone Mini Project

## 🎯 Objective
Apply what you've learned during the internship to a real-world classification problem by building an end-to-end ML pipeline.

## 💡 Project: Predict Titanic Survivors
We'll use the Titanic dataset to predict whether a passenger survived from features such as age, sex, passenger class, and fare.

## ✅ Workflow:
1. Load dataset
2. Clean and preprocess data
3. Encode categorical features
4. Split into train-test sets
5. Scale numerical features (fit the scaler on the training set only)
6. Train multiple models (Logistic Regression, Random Forest, SVM)
7. Compare model performance
8. Visualize ROC curves
9. Export the best model

## 📦 Step 1 – Load & Explore Data

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head()
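
Before cleaning, it helps to check which columns contain missing values and how balanced the target is. The following is a minimal sketch on a small hand-made frame (`sample` and its values are made up for illustration, not taken from the dataset); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the Titanic frame (illustrative values only)
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count missing values per column to decide what needs imputing
missing_counts = sample.isnull().sum()
print(missing_counts)

# Class balance of the target, as proportions
class_balance = sample['Survived'].value_counts(normalize=True)
print(class_balance)
```

On the real data, `df.isnull().sum()` and `df['Survived'].value_counts(normalize=True)` give the same two summaries.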

## 🔍 Step 2 – Data Cleaning & Feature Engineering

In [None]:
# Drop unnecessary columns
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Fill missing Age with the median (assign back: chained inplace fillna may not update df in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing Embarked with the most frequent port
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

## 🔠 Step 3 – Encode Categorical Features

In [None]:
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
df.head()
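
To see what `drop_first=True` does, here is a small sketch on made-up rows (the `sample` frame is hypothetical, and `dtype=int` is added only so the output shows 0/1 instead of booleans).

```python
import pandas as pd

# Made-up rows with the two categorical columns encoded in this step
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# drop_first=True drops the alphabetically first level of each column,
# so 'female' and 'C' become the implicit baseline categories
encoded = pd.get_dummies(sample, columns=['Sex', 'Embarked'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per column avoids perfectly collinear dummy columns, which matters most for linear models such as logistic regression.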

## ⚙️ Step 4 – Split and Scale Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

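# Separate features from the target
X = df.drop('Survived', axis=1)
y = df['Survived']
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)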
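
The split-then-scale order matters because the scaler should learn its mean and standard deviation from the training data alone. A minimal sketch with made-up single-feature arrays (`train`, `test`, and `scaler_demo` are illustrative names) shows that the test set is transformed with the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature arrays standing in for X_train and X_test
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses the training mean and std

print(scaler_demo.mean_, scaler_demo.scale_)
print(test_scaled.ravel())
```

The scaled training data has mean zero, while the scaled test data generally does not, which is expected.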

## 🧪 Step 5 – Train Multiple Models

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

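models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    # Fixed seeds keep results reproducible between runs
    'Random Forest': RandomForestClassifier(random_state=42),
    # probability=True enables predict_proba, which the ROC step needs
    'SVM': SVC(probability=True, random_state=42)
}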

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f'{name} Accuracy:', accuracy_score(y_test, y_pred))
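
Accuracy on a single test split can vary noticeably with the split itself. One possible complement is k-fold cross-validation, sketched below on synthetic data from `make_classification` (a stand-in, not the Titanic set) so the example runs on its own; `X_demo`, `y_demo`, and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set) so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Putting the scaler in a pipeline refits it inside each fold, avoiding leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})')
```

The same pattern should apply to the Titanic features by passing `X` and `y` in place of the synthetic arrays.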

## 📊 Step 6 – Compare ROC Curves

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(8, 6))
for name, model in models.items():
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.2f})')

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Model ROC Comparison')
plt.legend()
plt.grid(True)
plt.show()
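
As an aid to reading the AUC values in the legend: AUC can be interpreted as the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below checks this on four made-up predictions by counting correctly ranked pairs and comparing with `roc_auc_score`.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the fraction of (positive, negative) pairs ranked correctly,
# counting ties as half
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(manual_auc, roc_auc_score(y_true, y_score))
```

Three of the four positive-negative pairs are ranked correctly here, so both values come out to 0.75.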

## 💾 Step 7 – Save the Best Model

In [None]:
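import joblib
from sklearn.metrics import roc_auc_score

# Pick the model with the highest test AUC instead of hard-coding a choice
best_name = max(models, key=lambda n: roc_auc_score(y_test, models[n].predict_proba(X_test_scaled)[:, 1]))
# Save the scaler with the model: predictions need identically scaled inputs
joblib.dump({'model': models[best_name], 'scaler': scaler}, 'titanic_model.pkl')
print(f'{best_name} saved as titanic_model.pkl')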
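
One thing to watch when reloading a saved model: the models in this notebook were trained on scaled features, so the fitted scaler has to be available at prediction time too. A minimal sketch, using made-up data and a hypothetical temporary file `demo_model.pkl`, bundles both objects in one file and reloads them:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data standing in for the Titanic features
X_demo = np.array([[1.0, 20.0], [3.0, 40.0], [1.0, 35.0], [3.0, 22.0]])
y_demo = np.array([1, 0, 1, 0])

scaler_demo = StandardScaler().fit(X_demo)
model_demo = LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Persist the scaler together with the model so new inputs get the same scaling
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
joblib.dump({'scaler': scaler_demo, 'model': model_demo}, path)

# Later, possibly in another process: load both and predict on raw features
bundle = joblib.load(path)
new_rows = np.array([[1.0, 30.0]])
reloaded_pred = bundle['model'].predict(bundle['scaler'].transform(new_rows))
print(reloaded_pred)
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.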