# Project 01: Titanic Survival Prediction

**Difficulty**: ⭐ Beginner

**Estimated Time**: 20-25 hours

**Project Type**: Binary Classification

**Dataset**: Titanic from Kaggle

## Learning Objectives

By the end of this project, you will be able to:
1. Perform comprehensive exploratory data analysis (EDA)
2. Handle missing data using multiple strategies
3. Engineer features from existing variables
4. Compare multiple classification algorithms
5. Evaluate models using appropriate metrics
6. Create a professional portfolio project

## Problem Statement

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

**Goal**: Build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, etc.).

## Prerequisites

- Machine Learning Fundamentals (Module 05)
- Data Manipulation with Pandas (Module 02)
- Data Visualization (Module 03)
- Feature Engineering basics

## 1. Setup and Imports

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization defaults
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")

## 2. Load and Inspect Data

For this project, we'll use the Titanic dataset. You can download it from:
- **Kaggle**: https://www.kaggle.com/c/titanic/data
- **Seaborn**: Built-in dataset

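We'll use the seaborn built-in dataset for easy access. It contains the 891 passengers of Kaggle's training set with lowercase column names and a few derived columns, but it omits the `PassengerId`, `Name`, `Ticket`, and `Cabin` columns. Note that `sns.load_dataset` downloads the file on first use, so it needs an internet connection.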

In [None]:
# Load Titanic dataset from seaborn
titanic_data = sns.load_dataset('titanic')

# Display basic information
print(f"Dataset shape: {titanic_data.shape}")
print(f"Number of rows: {len(titanic_data)}")
print(f"Number of columns: {len(titanic_data.columns)}")
print("\n" + "="*50)

# Display first few rows
titanic_data.head()

In [None]:
# Display column information
titanic_data.info()

In [None]:
# Statistical summary
titanic_data.describe()

### Data Dictionary

| Variable | Definition | Key |
|----------|------------|-----|
| survived | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | male/female |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard | |
| parch | # of parents / children aboard | |
| fare | Passenger fare | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
| class | Same as pclass but categorical | |
| who | man, woman, or child | |
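| deck | Deck letter (first character of the cabin number) | A to G |
| adult_male | Whether the passenger is an adult male | True/False |
| embark_town | Port of embarkation, full name | Cherbourg, Queenstown, Southampton |
| alive | Same as survived but as text | yes/no |
| alone | Whether the passenger had no family aboard (sibsp + parch = 0) | True/False |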

## 3. Exploratory Data Analysis (EDA)

Let's explore the data to understand patterns and relationships.

### 3.1 Missing Values Analysis

In [None]:
# Check for missing values
missing_values = titanic_data.isnull().sum()
missing_percentage = (missing_values / len(titanic_data)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage
})

# Show only columns with missing values
missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
missing_cols = missing_df[missing_df['Missing Count'] > 0].sort_values('Percentage', ascending=False)

plt.barh(missing_cols.index, missing_cols['Percentage'], color='coral')
plt.xlabel('Percentage of Missing Values')
plt.title('Missing Values by Feature')
plt.xlim(0, 100)

# Add percentage labels
for i, v in enumerate(missing_cols['Percentage']):
    plt.text(v + 1, i, f'{v:.1f}%', va='center')

plt.tight_layout()
plt.show()

print("üìä Key Observations:")
print("- Deck has 77.2% missing values (may need to drop)")
print("- Age has 19.9% missing values (will impute)")
print("- Embarked has minimal missing values (will impute)")

### 3.2 Survival Rate Analysis

In [None]:
# Overall survival rate
survival_rate = titanic_data['survived'].mean()
print(f"Overall Survival Rate: {survival_rate:.2%}")
print(f"Survived: {titanic_data['survived'].sum()} passengers")
print(f"Died: {len(titanic_data) - titanic_data['survived'].sum()} passengers")

# Visualize survival distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
survival_counts = titanic_data['survived'].value_counts()
axes[0].pie(survival_counts, labels=['Died', 'Survived'], autopct='%1.1f%%',
            colors=['#FF6B6B', '#4ECDC4'], startangle=90)
axes[0].set_title('Survival Distribution')

# Bar chart
sns.countplot(data=titanic_data, x='survived', palette=['#FF6B6B', '#4ECDC4'], ax=axes[1])
axes[1].set_xlabel('Survived')
axes[1].set_ylabel('Count')
axes[1].set_title('Survival Count')
axes[1].set_xticklabels(['Died (0)', 'Survived (1)'])

plt.tight_layout()
plt.show()

### 3.3 Survival by Gender

In [None]:
# Survival rate by gender
gender_survival = titanic_data.groupby('sex')['survived'].agg(['sum', 'count', 'mean'])
gender_survival.columns = ['Survived', 'Total', 'Survival Rate']
print("Survival by Gender:")
print(gender_survival)
print(f"\nFemales were {gender_survival.loc['female', 'Survival Rate'] / gender_survival.loc['male', 'Survival Rate']:.1f}x more likely to survive")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=titanic_data, x='sex', hue='survived', palette=['#FF6B6B', '#4ECDC4'], ax=axes[0])
axes[0].set_title('Survival Count by Gender')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Count')
axes[0].legend(['Died', 'Survived'])

# Survival rate
gender_survival['Survival Rate'].plot(kind='bar', color=['#FF6B6B', '#4ECDC4'], ax=axes[1])
axes[1].set_title('Survival Rate by Gender')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xticklabels(['Female', 'Male'], rotation=0)
axes[1].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(gender_survival['Survival Rate']):
    axes[1].text(i, v + 0.02, f'{v:.1%}', ha='center')

plt.tight_layout()
plt.show()

### 3.4 Survival by Passenger Class

In [None]:
# Survival rate by class
class_survival = titanic_data.groupby('pclass')['survived'].agg(['sum', 'count', 'mean'])
class_survival.columns = ['Survived', 'Total', 'Survival Rate']
print("Survival by Passenger Class:")
print(class_survival)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
sns.countplot(data=titanic_data, x='pclass', hue='survived', palette=['#FF6B6B', '#4ECDC4'], ax=axes[0])
axes[0].set_title('Survival Count by Passenger Class')
axes[0].set_xlabel('Passenger Class')
axes[0].set_ylabel('Count')
axes[0].legend(['Died', 'Survived'])

# Survival rate
class_survival['Survival Rate'].plot(kind='bar', color=['#4ECDC4', '#95E1D3', '#FF6B6B'], ax=axes[1])
axes[1].set_title('Survival Rate by Passenger Class')
axes[1].set_xlabel('Passenger Class')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xticklabels(['1st Class', '2nd Class', '3rd Class'], rotation=0)
axes[1].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(class_survival['Survival Rate']):
    axes[1].text(i, v + 0.02, f'{v:.1%}', ha='center')

plt.tight_layout()
plt.show()

print("\nüìä Key Insight: Higher class passengers had much better survival rates!")

### 3.5 Survival by Age

In [None]:
# Age distribution by survival
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
titanic_data[titanic_data['survived'] == 0]['age'].hist(bins=30, alpha=0.7, label='Died',
                                                         color='#FF6B6B', ax=axes[0])
titanic_data[titanic_data['survived'] == 1]['age'].hist(bins=30, alpha=0.7, label='Survived',
                                                         color='#4ECDC4', ax=axes[0])
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Age Distribution by Survival')
axes[0].legend()

# Box plot
sns.boxplot(data=titanic_data, x='survived', y='age', palette=['#FF6B6B', '#4ECDC4'], ax=axes[1])
axes[1].set_xlabel('Survived')
axes[1].set_ylabel('Age')
axes[1].set_title('Age Distribution by Survival')
axes[1].set_xticklabels(['Died', 'Survived'])

plt.tight_layout()
plt.show()

# Age statistics
print("Age Statistics by Survival:")
print(titanic_data.groupby('survived')['age'].describe())

### 3.6 Combined Analysis: Gender + Class

In [None]:
# Survival rate by gender and class
gender_class_survival = titanic_data.groupby(['sex', 'pclass'])['survived'].mean().unstack()
print("Survival Rate by Gender and Class:")
print(gender_class_survival)

# Visualize
gender_class_survival.plot(kind='bar', figsize=(10, 6), color=['#4ECDC4', '#95E1D3', '#FF6B6B'])
plt.title('Survival Rate by Gender and Passenger Class')
plt.xlabel('Gender')
plt.ylabel('Survival Rate')
plt.legend(['1st Class', '2nd Class', '3rd Class'])
plt.xticks(rotation=0)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

print("\nüìä Key Insight: 1st class females had >95% survival rate, while 3rd class males had <15%!")

### 3.7 Correlation Analysis

In [None]:
# Select numeric columns for correlation
numeric_cols = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
correlation_matrix = titanic_data[numeric_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Show correlations with survival
print("\nCorrelations with Survival:")
print(correlation_matrix['survived'].sort_values(ascending=False))

## 4. Data Preprocessing

Now we'll prepare the data for machine learning.

### 4.1 Feature Selection and Engineering

In [None]:
# Create a copy for preprocessing
df = titanic_data.copy()

# Drop columns with too many missing values or not useful
columns_to_drop = ['deck', 'embark_town', 'alive', 'class', 'who', 'adult_male', 'alone']
df = df.drop(columns=columns_to_drop, errors='ignore')

print(f"Remaining columns: {df.columns.tolist()}")
print(f"Dataset shape: {df.shape}")

In [None]:
# Feature Engineering: Create new features

# 1. Family Size
df['family_size'] = df['sibsp'] + df['parch'] + 1

# 2. Is Alone
df['is_alone'] = (df['family_size'] == 1).astype(int)

# 3. Fare per person
df['fare_per_person'] = df['fare'] / df['family_size']

# 4. Age categories
df['age_category'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100],
                            labels=['Child', 'Teenager', 'Adult', 'Middle-aged', 'Senior'])

print("✅ New features created:")
print("- family_size: Total family members aboard")
print("- is_alone: Whether passenger traveled alone")
print("- fare_per_person: Fare divided by family size")
print("- age_category: Age grouped into categories")

# Display sample
df[['survived', 'family_size', 'is_alone', 'fare_per_person', 'age_category']].head()

### 4.2 Handle Missing Values

In [None]:
# Check current missing values
print("Missing values before imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])

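# 1. Fill Age with the median of each sex/pclass group (usually a closer estimate than the overall median).
#    Note: these medians are computed on the full dataset, before the train-test split, so a little
#    test-set information leaks into training. For a stricter evaluation, compute them on the training rows only.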
df['age'] = df.groupby(['sex', 'pclass'])['age'].transform(
    lambda x: x.fillna(x.median())
)

# 2. Fill Embarked with mode (most common port)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# 3. Fill fare_per_person defensively ('fare' has no missing values in the seaborn dataset, so this is normally a no-op)
df['fare_per_person'] = df['fare_per_person'].fillna(df['fare_per_person'].median())

# 4. Recompute age_category now that age has no missing values
df['age_category'] = pd.cut(df['age'], bins=[0, 12, 18, 35, 60, 100],
                            labels=['Child', 'Teenager', 'Adult', 'Middle-aged', 'Senior'])

print("\nMissing values after imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])
print("\n✅ All missing values handled!")
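
The imputation above learns its group medians from every row, including rows that later end up in the test set. A stricter variant, sketched here under the assumption that the split happens first, learns the medians from the training rows only and applies them to both splits. The helper names `fit_group_medians` and `apply_group_medians` and the toy frames are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_group_medians(train, keys, target):
    """Learn per-group medians of `target` from the training rows only."""
    return train.groupby(keys)[target].median().rename("group_median").reset_index()

def apply_group_medians(frame, medians, keys, target, fallback):
    """Fill missing `target` values from learned medians; use `fallback` for unseen groups."""
    # A left merge keeps the row order of `frame`, so positions line up
    merged = frame.merge(medians, on=keys, how="left")
    filled = frame[target].to_numpy(dtype=float, copy=True)
    missing = np.isnan(filled)
    filled[missing] = merged["group_median"].fillna(fallback).to_numpy()[missing]
    out = frame.copy()
    out[target] = filled
    return out

# Toy data standing in for a real train/test split (hypothetical values)
train = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male"],
    "pclass": [3, 3, 1, 1, 3],
    "age": [20.0, 30.0, 40.0, np.nan, np.nan],
})
test = pd.DataFrame({"sex": ["male", "female"], "pclass": [3, 1], "age": [np.nan, np.nan]})

keys = ["sex", "pclass"]
medians = fit_group_medians(train, keys, "age")
fallback = train["age"].median()  # overall training median for groups never seen in training

train_filled = apply_group_medians(train, medians, keys, "age", fallback)
test_filled = apply_group_medians(test, medians, keys, "age", fallback)
print(test_filled)
```

The same fit-on-train, apply-to-both pattern is what `StandardScaler` follows in section 5.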

### 4.3 Encode Categorical Variables

In [None]:
# Encode categorical variables

# 1. Sex: male=1, female=0
df['sex_encoded'] = df['sex'].map({'male': 1, 'female': 0})

# 2. Embarked: one-hot encoding
embarked_dummies = pd.get_dummies(df['embarked'], prefix='embarked')
df = pd.concat([df, embarked_dummies], axis=1)

# 3. Age category: ordinal encoding
age_cat_mapping = {'Child': 0, 'Teenager': 1, 'Adult': 2, 'Middle-aged': 3, 'Senior': 4}
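# Mapping a categorical column returns a categorical; cast to int so the feature is plainly numeric
df['age_cat_encoded'] = df['age_category'].map(age_cat_mapping).astype(int)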

print("✅ Categorical variables encoded")
print(f"Dataset shape after encoding: {df.shape}")

# Display sample
df[['sex', 'sex_encoded', 'embarked', 'embarked_C', 'embarked_Q', 'embarked_S']].head()

### 4.4 Select Final Features

In [None]:
# Select features for modeling
feature_columns = [
    'pclass', 'sex_encoded', 'age', 'sibsp', 'parch', 'fare',
    'family_size', 'is_alone', 'fare_per_person',
    'embarked_C', 'embarked_Q', 'embarked_S', 'age_cat_encoded'
]

X = df[feature_columns]
y = df['survived']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures used: {feature_columns}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"Survival rate: {y.mean():.2%}")

## 5. Train-Test Split and Scaling

In [None]:
# Split data into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} ({X_train.shape[0]/len(X):.1%})")
print(f"Test set size: {X_test.shape[0]} ({X_test.shape[0]/len(X):.1%})")
print(f"\nTraining set survival rate: {y_train.mean():.2%}")
print(f"Test set survival rate: {y_test.mean():.2%}")

# Scale features (important for some algorithms like SVM, KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n✅ Data split and scaled successfully!")

## 6. Model Training and Evaluation

We'll train and compare multiple classification algorithms.

### 6.1 Train Multiple Models

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(random_state=42, probability=True),
    'K-Nearest Neighbors': KNeighborsClassifier()
}

# Train and evaluate each model
results = {}

print("Training models...\n")
for name, model in models.items():
    # Use scaled data for models that benefit from it
    if name in ['Logistic Regression', 'SVM', 'K-Nearest Neighbors']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'ROC AUC': roc_auc,
        'model': model,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"✅ {name}: Accuracy={accuracy:.4f}, F1={f1:.4f}, ROC AUC={roc_auc:.4f}")

print("\n✅ All models trained successfully!")

### 6.2 Compare Model Performance

In [None]:
# Create comparison dataframe
results_df = pd.DataFrame({
    name: {metric: values[metric] for metric in ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC']}
    for name, values in results.items()
}).T

print("Model Performance Comparison:")
print(results_df.sort_values('Accuracy', ascending=False))

# Find best model
best_model_name = results_df['Accuracy'].idxmax()
print(f"\nüèÜ Best Model: {best_model_name} with {results_df.loc[best_model_name, 'Accuracy']:.2%} accuracy")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
colors = ['#4ECDC4', '#FF6B6B', '#95E1D3', '#FFD93D']

for idx, (metric, color) in enumerate(zip(metrics, colors)):
    ax = axes[idx // 2, idx % 2]
    sorted_results = results_df.sort_values(metric, ascending=True)
    sorted_results[metric].plot(kind='barh', ax=ax, color=color)
    ax.set_xlabel(metric)
    ax.set_title(f'Model Comparison: {metric}')
    ax.set_xlim(0, 1)
    
    # Add value labels
    for i, v in enumerate(sorted_results[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center')

plt.tight_layout()
plt.show()

### 6.3 ROC Curve Comparison

In [None]:
# Plot ROC curves for all models
plt.figure(figsize=(12, 8))

for name, values in results.items():
    fpr, tpr, _ = roc_curve(y_test, values['probabilities'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {values['ROC AUC']:.3f})")

# Plot diagonal line (random classifier)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.500)')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves - Model Comparison')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 6.4 Confusion Matrix - Best Model

In [None]:
# Get best model predictions
best_predictions = results[best_model_name]['predictions']

# Create confusion matrix
cm = confusion_matrix(y_test, best_predictions)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', square=True,
            xticklabels=['Died', 'Survived'],
            yticklabels=['Died', 'Survived'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title(f'Confusion Matrix - {best_model_name}')
plt.tight_layout()
plt.show()

# Print detailed classification report
print(f"\nDetailed Classification Report - {best_model_name}:")
print(classification_report(y_test, best_predictions, target_names=['Died', 'Survived']))
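
To connect the heatmap with the report: for binary labels, scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so `ravel()` yields TN, FP, FN, TP in that order. The sketch below uses made-up labels (not the notebook's predictions) to derive precision, recall, and accuracy by hand and compare them with the library functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration: 1 = survived, 0 = died
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many really survived
recall = tp / (tp + fn)     # of the actual survivors, how many the model found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, Accuracy={accuracy:.2f}")

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```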

### 6.5 Feature Importance (Random Forest)

In [None]:
# Get Random Forest model
rf_model = results['Random Forest']['model']

# Get feature importances
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print(feature_importance)

# Plot feature importance
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='#4ECDC4')
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nüìä Key Insights:")
top_3 = feature_importance.head(3)['Feature'].tolist()
print(f"Top 3 most important features: {', '.join(top_3)}")

## 7. Model Optimization (Hyperparameter Tuning)

Let's tune Random Forest, usually one of the stronger performers on this dataset, to see whether performance improves.

In [None]:
# Hyperparameter tuning for Random Forest (one of the best performers)
print("Performing hyperparameter tuning for Random Forest...\n")

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\n✅ Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate on test set
y_pred_tuned = grid_search.predict(X_test)
tuned_accuracy = accuracy_score(y_test, y_pred_tuned)

print(f"\nTest set accuracy (before tuning): {results['Random Forest']['Accuracy']:.4f}")
print(f"Test set accuracy (after tuning): {tuned_accuracy:.4f}")
print(f"Improvement: {(tuned_accuracy - results['Random Forest']['Accuracy']):.4f}")
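
The grid above has 3 × 4 × 3 × 3 = 108 combinations, so a 5-fold search fits 540 forests plus one final refit. If that is too slow, a randomized search over the same space is one common alternative. The sketch below assumes synthetic data from `make_classification` in place of the Titanic features, and `n_iter=5` and `cv=3` are arbitrary illustrative budgets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Count the combinations an exhaustive grid search would try
n_combinations = len(ParameterGrid(param_grid))
print(f"Grid combinations: {n_combinations}, fits with cv=5: {n_combinations * 5}")

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Sample only a handful of combinations instead of trying all of them
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=5,
    cv=3,
    scoring='accuracy',
    random_state=42
)
search.fit(X_demo, y_demo)
print(f"Best parameters found: {search.best_params_}")
```

A randomized search trades completeness for speed, so it may miss the best grid point.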

## 8. Cross-Validation Analysis

In [None]:
# Perform cross-validation on all models
print("Performing 5-fold cross-validation on all models...\n")

cv_results = {}

for name, model_info in results.items():
    model = model_info['model']
    
    # Use scaled data for models that benefit from it
    if name in ['Logistic Regression', 'SVM', 'K-Nearest Neighbors']:
        scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    else:
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    
    cv_results[name] = {
        'Mean': scores.mean(),
        'Std': scores.std(),
        'Min': scores.min(),
        'Max': scores.max()
    }
    
    print(f"{name}:")
    print(f"  Mean Accuracy: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    print(f"  Range: [{scores.min():.4f}, {scores.max():.4f}]\n")

# Create visualization
cv_df = pd.DataFrame(cv_results).T
cv_df = cv_df.sort_values('Mean', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(cv_df.index, cv_df['Mean'], xerr=cv_df['Std'], color='#4ECDC4', alpha=0.7)
plt.xlabel('Mean Cross-Validation Accuracy')
plt.title('5-Fold Cross-Validation Results')
plt.xlim(0, 1)

# Add value labels
for i, (idx, row) in enumerate(cv_df.iterrows()):
    plt.text(row['Mean'] + 0.01, i, f"{row['Mean']:.3f}", va='center')

plt.tight_layout()
plt.show()
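
One caveat for the scaled models: `X_train_scaled` was standardized using all training rows, so every validation fold has already influenced the scaler. A common remedy, sketched here on synthetic data rather than the notebook's variables, is to wrap the scaler and the model in a `Pipeline` so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is fit on the training part of each fold only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')

print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

In the notebook, the equivalent would be passing such a pipeline and the unscaled `X_train` to `cross_val_score`.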

## 9. Key Insights and Conclusions

### Summary of Findings

**1. Survival Patterns:**
- Overall survival rate was 38.4%
- Women had 74.2% survival rate vs 18.9% for men ("Women and children first")
- 1st class passengers had 63.0% survival rate vs 24.2% for 3rd class
- Children (age ≤ 12, the `Child` bin used above) had higher survival rates than adults

**2. Most Important Features:**
- Gender (sex_encoded) - Most predictive feature
- Passenger class (pclass) - Strong indicator of survival
- Fare - Correlates with class and survival
- Age - Younger passengers more likely to survive
- Family size - Being alone or in very large families reduced survival

**3. Model Performance:**
- Best model achieved ~80-85% accuracy
- Random Forest and Gradient Boosting performed best
- Simple models (Logistic Regression) performed surprisingly well
- Cross-validation showed consistent performance across folds

**4. Business Insights:**
- Socioeconomic status was strongly associated with survival (consistent with 1st class passengers having better access to lifeboats)
- Gender was the strongest predictor ("Women and children first" protocol)
- Location on ship (deck) would have been valuable but had too much missing data
- Family connections affected survival (traveling alone was disadvantageous)

## 10. Next Steps and Improvements

**Potential Improvements:**
1. **Feature Engineering:**
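   - Extract titles from names (Mr., Mrs., Master, etc.); this needs the `Name` column from Kaggle's `train.csv`, which the seaborn dataset omits
   - Create cabin deck features from cabin numbers (Kaggle's `Cabin` column; the seaborn dataset already exposes the deck letter as `deck`)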
   - Interaction features (e.g., sex * class)

2. **Advanced Modeling:**
   - Try ensemble methods (stacking, blending)
   - Use XGBoost or LightGBM
   - Neural networks for comparison

3. **Model Deployment:**
   - Create a web app with Streamlit
   - Deploy as REST API with FastAPI
   - Containerize with Docker

4. **Further Analysis:**
   - Analyze prediction errors (false positives/negatives)
   - SHAP values for model interpretability
   - Cost-sensitive learning (different costs for different errors)
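
As a small illustration of the interaction-feature idea from item 1, the sketch below builds two variants on a toy frame: a numeric product and a one-hot encoded combination. The column names `sex_x_pclass` and `sex_pclass_group` are hypothetical.

```python
import pandas as pd

# Toy rows using the notebook's encoding: sex_encoded is 1 for male, 0 for female
demo = pd.DataFrame({"sex_encoded": [1, 0, 1, 0], "pclass": [1, 1, 3, 3]})

# Variant 1: numeric product (zero for all women, equal to pclass for men)
demo["sex_x_pclass"] = demo["sex_encoded"] * demo["pclass"]

# Variant 2: one column per sex/class pair, so each pair gets its own effect
demo["sex_pclass_group"] = demo["sex_encoded"].astype(str) + "_" + demo["pclass"].astype(str)
interaction_dummies = pd.get_dummies(demo["sex_pclass_group"], prefix="grp", dtype=int)

print(demo)
print(interaction_dummies)
```

Tree-based models can learn such interactions on their own, so the explicit features tend to matter most for linear models such as Logistic Regression.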

## Exercises

Try these exercises to deepen your understanding:

### Exercise 1: Title Extraction
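Extract titles (Mr., Mrs., Miss, etc.) from passenger names and use them as a feature. The seaborn dataset has no name column, so load Kaggle's `train.csv` (column `Name`) for this exercise. Do titles improve model performance?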
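
A possible starting point, assuming names follow the Kaggle format `Surname, Title. Given names`. The sample names stand in for the real column, and grouping everything outside four common titles into `Rare` is an illustrative choice rather than a fixed rule.

```python
import pandas as pd

# Sample values in the Kaggle `Name` format (stand-ins for the real column)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Collapse infrequent titles into one bucket to avoid very sparse categories
common_titles = {"Mr", "Mrs", "Miss", "Master"}
titles_grouped = titles.where(titles.isin(common_titles), "Rare")

print(pd.DataFrame({"title": titles, "title_grouped": titles_grouped}))
```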

### Exercise 2: Ensemble Methods
Create a voting classifier that combines the top 3 performing models. Does it improve accuracy?
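
One way to begin is scikit-learn's `VotingClassifier`. The sketch below assumes synthetic data and picks Logistic Regression, Random Forest, and Gradient Boosting as the three members. Which models count as the top 3 depends on your own results table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Soft voting averages predicted probabilities; the pipeline scales inputs for the linear model only
voter = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",
)
voter.fit(X_tr, y_tr)
ensemble_accuracy = voter.score(X_te, y_te)

print(f"Voting ensemble accuracy: {ensemble_accuracy:.4f}")
```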

### Exercise 3: Cost-Sensitive Learning
In a real scenario, false negatives (predicting death when survived) might be worse than false positives. Implement class weights to penalize false negatives more heavily.
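
Many scikit-learn classifiers accept a `class_weight` argument for this. The sketch below assumes synthetic, mildly imbalanced data, and the weight of 3 on the positive class is an arbitrary value to experiment with. Upweighting the positive class typically raises recall at the cost of precision, but check this on your own data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 38% positives (assumption: not the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.62, 0.38], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: every training error costs the same
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Weighted: errors on class 1 (survived) count three times as much during training
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 3}).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
precision_plain = precision_score(y_te, plain.predict(X_te))
precision_weighted = precision_score(y_te, weighted.predict(X_te))

print(f"Plain:    recall={recall_plain:.3f}, precision={precision_plain:.3f}")
print(f"Weighted: recall={recall_weighted:.3f}, precision={precision_weighted:.3f}")
```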

### Exercise 4: SHAP Analysis
Use SHAP (SHapley Additive exPlanations) to explain individual predictions. Why did the model predict survival/death for specific passengers?

### Exercise 5: Deployment
Create a simple Streamlit app that takes passenger information as input and predicts survival probability.

## Project Checklist

✅ **Completed:**
- [x] Data loading and initial exploration
- [x] Comprehensive EDA with visualizations
- [x] Missing value handling
- [x] Feature engineering
- [x] Data preprocessing and encoding
- [x] Train-test split
- [x] Multiple model training and comparison
- [x] Model evaluation with multiple metrics
- [x] Hyperparameter tuning
- [x] Cross-validation analysis
- [x] Feature importance analysis
- [x] Clear insights and conclusions

📋 **For Portfolio:**
- [ ] Create professional README.md
- [ ] Add requirements.txt
- [ ] Clean and organize code
- [ ] Add comments and documentation
- [ ] Create presentation slides
- [ ] Deploy as web app (optional)
- [ ] Write blog post about project (optional)