# Module 11: Final Project - Kaggle-Style Competition

**Difficulty**: ⭐⭐⭐
**Estimated Time**: 90-120 minutes
**Prerequisites**: All previous modules (00-10)

## Project Overview

In this final project, you'll complete an end-to-end machine learning workflow using ensemble methods. This simulates a real Kaggle competition environment where you'll:

1. **Perform Exploratory Data Analysis (EDA)**
2. **Engineer features** to improve model performance
3. **Build and compare** multiple ensemble models
4. **Tune hyperparameters** for optimal performance
5. **Create a stacked ensemble** for final predictions
6. **Prepare a submission** in competition format

## Learning Objectives
By completing this project, you will:
1. Apply all ensemble methods learned in this course
2. Conduct thorough exploratory data analysis
3. Create meaningful features through feature engineering
4. Build and compare multiple models systematically
5. Optimize hyperparameters effectively
6. Create a production-ready ensemble solution
7. Document your work clearly and professionally

## Dataset: Titanic Survival Prediction

**Task**: Predict whether a passenger survived the Titanic disaster

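**Features** (Kaggle column names; the seaborn copy loaded in Part 1 uses lowercase names such as `pclass` and `sibsp` and does not include `PassengerId`, `Name`, `Ticket`, or `Cabin`):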
- PassengerId: Unique ID
- Pclass: Ticket class (1, 2, 3)
- Name: Passenger name
- Sex: Gender
- Age: Age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C, Q, S)

**Target**: Survived (0 = No, 1 = Yes)

## Part 1: Setup and Data Loading

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    VotingClassifier,
    StackingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve
)

# Boosting libraries
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
try:
    from catboost import CatBoostClassifier
    CATBOOST_AVAILABLE = True
except ImportError:
    CATBOOST_AVAILABLE = False

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

print("Libraries loaded successfully!")

In [None]:
# Load Titanic dataset
# Note: Download from https://www.kaggle.com/c/titanic/data or use seaborn
try:
    # Try loading from seaborn
    df = sns.load_dataset('titanic')
    print("Dataset loaded from seaborn")
except Exception:
    # If seaborn dataset not available, create sample data
    print("Creating sample Titanic-like dataset")
    np.random.seed(42)
    n_samples = 891
    
    df = pd.DataFrame({
        'survived': np.random.binomial(1, 0.38, n_samples),
        'pclass': np.random.choice([1, 2, 3], n_samples, p=[0.24, 0.21, 0.55]),
        'sex': np.random.choice(['male', 'female'], n_samples, p=[0.65, 0.35]),
        'age': np.random.normal(30, 14, n_samples).clip(0.4, 80),
        'sibsp': np.random.poisson(0.5, n_samples),
        'parch': np.random.poisson(0.4, n_samples),
        'fare': np.random.exponential(32, n_samples),
        'embarked': np.random.choice(['S', 'C', 'Q'], n_samples, p=[0.72, 0.19, 0.09]),
        'class': np.random.choice(['First', 'Second', 'Third'], n_samples, p=[0.24, 0.21, 0.55])
    })
    # Add some missing values
    df.loc[np.random.choice(n_samples, 177, replace=False), 'age'] = np.nan

print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()
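
If you download the Kaggle CSV instead of using seaborn, its column names are capitalized (`Pclass`, `SibSp`, ...) while the rest of this notebook expects lowercase names. The sketch below shows one way to bridge that gap. The helper `normalize_kaggle_columns` and the tiny `kaggle_like` frame are hypothetical names introduced for illustration, and the row values are placeholders rather than a real extract.

```python
import pandas as pd

def normalize_kaggle_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase column names (Pclass -> pclass, SibSp -> sibsp)."""
    out = raw.copy()
    out.columns = [col.lower() for col in out.columns]
    return out

# Stand-in for a few columns of Kaggle's train.csv (illustrative values only)
kaggle_like = pd.DataFrame({
    'PassengerId': [1, 2],
    'Survived': [0, 1],
    'Pclass': [3, 1],
    'SibSp': [1, 1],
})

df_kaggle = normalize_kaggle_columns(kaggle_like)
print(df_kaggle.columns.tolist())
```

With a real download you would pass the result of `pd.read_csv(...)` to the helper; the file path depends on where you saved the data.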

## Part 2: Exploratory Data Analysis (EDA)

In [None]:
# Basic information
print("Dataset Info:")
print("=" * 60)
print(df.info())
print("\nBasic Statistics:")
print("=" * 60)
print(df.describe())

In [None]:
# Missing values analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Percentage', ascending=False)

print("\nMissing Values Analysis:")
print("=" * 60)
print(missing_df[missing_df['Missing Count'] > 0])

# Visualize missing values (create the figure only when there is something to plot)
missing_cols = missing_df[missing_df['Missing Count'] > 0]
if len(missing_cols) > 0:
    plt.figure(figsize=(10, 6))
    plt.barh(missing_cols.index, missing_cols['Percentage'])
    plt.xlabel('Percentage Missing')
    plt.title('Missing Values by Feature')
    plt.grid(axis='x', alpha=0.3)
    plt.show()
else:
    print("No missing values found!")

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot (sort_index puts class 0 first so the labels below always line up)
survival_counts = df['survived'].value_counts().sort_index()
axes[0].bar(['Did not survive', 'Survived'], survival_counts.values, 
           color=['#e74c3c', '#2ecc71'], alpha=0.7)
axes[0].set_ylabel('Count')
axes[0].set_title('Survival Distribution')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(survival_counts.values):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(survival_counts.values, labels=['Did not survive', 'Survived'],
           autopct='%1.1f%%', colors=['#e74c3c', '#2ecc71'], startangle=90)
axes[1].set_title('Survival Rate')

plt.tight_layout()
plt.show()

print(f"Survival Rate: {(df['survived'].sum() / len(df) * 100):.2f}%")

In [None]:
# Feature relationships with survival
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

# 1. Sex vs Survival
if 'sex' in df.columns:
    pd.crosstab(df['sex'], df['survived'], normalize='index').plot(
        kind='bar', ax=axes[0], color=['#e74c3c', '#2ecc71'], alpha=0.7
    )
    axes[0].set_title('Survival by Sex')
    axes[0].set_xlabel('Sex')
    axes[0].set_ylabel('Survival Rate')
    axes[0].legend(['Did not survive', 'Survived'])
    axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

# 2. Pclass vs Survival
if 'pclass' in df.columns:
    pd.crosstab(df['pclass'], df['survived'], normalize='index').plot(
        kind='bar', ax=axes[1], color=['#e74c3c', '#2ecc71'], alpha=0.7
    )
    axes[1].set_title('Survival by Passenger Class')
    axes[1].set_xlabel('Class')
    axes[1].set_ylabel('Survival Rate')
    axes[1].legend(['Did not survive', 'Survived'])
    axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)

# 3. Age distribution
if 'age' in df.columns:
    df[df['survived']==0]['age'].hist(bins=20, ax=axes[2], alpha=0.6, 
                                      color='#e74c3c', label='Did not survive')
    df[df['survived']==1]['age'].hist(bins=20, ax=axes[2], alpha=0.6, 
                                      color='#2ecc71', label='Survived')
    axes[2].set_title('Age Distribution by Survival')
    axes[2].set_xlabel('Age')
    axes[2].set_ylabel('Count')
    axes[2].legend()

# 4. Fare distribution
if 'fare' in df.columns:
    df[df['survived']==0]['fare'].hist(bins=30, ax=axes[3], alpha=0.6, 
                                       color='#e74c3c', label='Did not survive')
    df[df['survived']==1]['fare'].hist(bins=30, ax=axes[3], alpha=0.6, 
                                       color='#2ecc71', label='Survived')
    axes[3].set_title('Fare Distribution by Survival')
    axes[3].set_xlabel('Fare')
    axes[3].set_ylabel('Count')
    axes[3].legend()
    axes[3].set_xlim(0, 200)  # Limit x-axis for better visualization

# 5. SibSp vs Survival
if 'sibsp' in df.columns:
    pd.crosstab(df['sibsp'], df['survived'], normalize='index').plot(
        kind='bar', ax=axes[4], color=['#e74c3c', '#2ecc71'], alpha=0.7
    )
    axes[4].set_title('Survival by Siblings/Spouses')
    axes[4].set_xlabel('Number of Siblings/Spouses')
    axes[4].set_ylabel('Survival Rate')
    axes[4].legend(['Did not survive', 'Survived'])
    axes[4].set_xticklabels(axes[4].get_xticklabels(), rotation=0)

# 6. Embarked vs Survival
if 'embarked' in df.columns:
    pd.crosstab(df['embarked'].fillna('Unknown'), df['survived'], 
               normalize='index').plot(
        kind='bar', ax=axes[5], color=['#e74c3c', '#2ecc71'], alpha=0.7
    )
    axes[5].set_title('Survival by Embarkation Port')
    axes[5].set_xlabel('Port')
    axes[5].set_ylabel('Survival Rate')
    axes[5].legend(['Did not survive', 'Survived'])
    axes[5].set_xticklabels(axes[5].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.show()

## Part 3: Feature Engineering

In [None]:
# Create a copy for feature engineering
df_fe = df.copy()

print("Feature Engineering Steps:")
print("=" * 60)

# 1. Family size
if 'sibsp' in df_fe.columns and 'parch' in df_fe.columns:
    df_fe['family_size'] = df_fe['sibsp'] + df_fe['parch'] + 1
    print("✓ Created: family_size = sibsp + parch + 1")

    # 2. Is alone (depends on family_size, so it stays inside the same if-block)
    df_fe['is_alone'] = (df_fe['family_size'] == 1).astype(int)
    print("✓ Created: is_alone (binary indicator)")

# 3. Age groups
if 'age' in df_fe.columns:
    df_fe['age_group'] = pd.cut(df_fe['age'], bins=[0, 12, 18, 35, 60, 100],
                                labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
    print("✓ Created: age_group (5 categories)")

# 4. Fare per person
if 'fare' in df_fe.columns and 'family_size' in df_fe.columns:
    df_fe['fare_per_person'] = df_fe['fare'] / df_fe['family_size']
    print("✓ Created: fare_per_person")

# 5. Encode categorical variables
if 'sex' in df_fe.columns:
    df_fe['sex_encoded'] = (df_fe['sex'] == 'male').astype(int)
    print("✓ Encoded: sex (male=1, female=0)")

if 'embarked' in df_fe.columns:
    df_fe['embarked'] = df_fe['embarked'].fillna('S')  # Fill missing with 'S', the most common port
    embarked_dummies = pd.get_dummies(df_fe['embarked'], prefix='embarked')
    df_fe = pd.concat([df_fe, embarked_dummies], axis=1)
    print("✓ One-hot encoded: embarked")

print(f"\nNew shape after feature engineering: {df_fe.shape}")
print(f"\nNew features created:")
new_cols = set(df_fe.columns) - set(df.columns)
for col in sorted(new_cols):
    print(f"  - {col}")

In [None]:
# Handle missing values
print("\nHandling Missing Values:")
print("=" * 60)

# Fill age with median
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas)
if 'age' in df_fe.columns:
    median_age = df_fe['age'].median()
    df_fe['age'] = df_fe['age'].fillna(median_age)
    print(f"✓ Filled age missing values with median: {median_age:.1f}")

# Fill fare with median
if 'fare' in df_fe.columns:
    median_fare = df_fe['fare'].median()
    df_fe['fare'] = df_fe['fare'].fillna(median_fare)
    print(f"✓ Filled fare missing values with median: {median_fare:.2f}")

print(f"\nRemaining missing values: {df_fe.isnull().sum().sum()}")
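
The medians above are computed on the full dataset before the train-test split, so rows that later land in the test set influence the fill values. A common way to avoid this is to put the imputer inside a scikit-learn `Pipeline` so it is fitted on training folds only. The following is a minimal sketch on synthetic stand-in data; `demo_X`, `demo_y`, and `pipe` are hypothetical names, and logistic regression is used only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: two numeric features with gaps in 'age'
rng = np.random.default_rng(42)
n = 200
demo_X = pd.DataFrame({
    'age': rng.normal(30, 14, n).clip(0.4, 80),
    'fare': rng.exponential(32, n),
})
demo_X.loc[rng.choice(n, 40, replace=False), 'age'] = np.nan
demo_y = rng.binomial(1, 0.38, n)

# The imputer is refitted inside every training fold, so held-out rows
# never contribute to the median that is used to fill them
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, demo_X, demo_y, cv=cv, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On a dataset this small the difference in scores is usually minor, but the pipeline form carries over unchanged to `GridSearchCV` and to the ensembles built later.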

In [None]:
# Select features for modeling
# Define feature columns (numerical and encoded categorical)
feature_cols = []

# Add numerical features
for col in ['pclass', 'age', 'sibsp', 'parch', 'fare', 'family_size', 
            'is_alone', 'fare_per_person', 'sex_encoded']:
    if col in df_fe.columns:
        feature_cols.append(col)

# Add one-hot encoded features
for col in df_fe.columns:
    if col.startswith('embarked_'):
        feature_cols.append(col)

X = df_fe[feature_cols].fillna(0)  # Fill any remaining NaN with 0
y = df_fe['survived']

print(f"Final feature set: {len(feature_cols)} features")
print(f"Features: {feature_cols}")
print(f"\nX shape: {X.shape}")
print(f"y shape: {y.shape}")

## Part 4: Train-Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining set survival rate: {y_train.mean():.2%}")
print(f"Test set survival rate: {y_test.mean():.2%}")

## Part 5: Model Building and Comparison

In [None]:
# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
}

if CATBOOST_AVAILABLE:
    models['CatBoost'] = CatBoostClassifier(iterations=100, random_state=42, verbose=False)

# Train and evaluate all models
results = []

print("Training models...\n")
for name, model in models.items():
    print(f"Training {name}...", end=' ')
    
    # Train
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Scores
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    
    results.append({
        'Model': name,
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Overfit': train_acc - test_acc
    })
    
    print(f"Done! Test Acc: {test_acc:.4f}")

# Create results DataFrame
results_df = pd.DataFrame(results).sort_values('CV Mean', ascending=False)

print("\n" + "="*80)
print("MODEL COMPARISON RESULTS")
print("="*80)
print(results_df.to_string(index=False))

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Accuracy comparison
x = np.arange(len(results_df))
width = 0.35
axes[0].bar(x - width/2, results_df['Train Acc'], width, label='Train', alpha=0.8)
axes[0].bar(x + width/2, results_df['Test Acc'], width, label='Test', alpha=0.8)
axes[0].set_xlabel('Model')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Train vs Test Accuracy')
axes[0].set_xticks(x)
axes[0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Cross-validation scores
axes[1].bar(results_df['Model'], results_df['CV Mean'], alpha=0.7, color='green')
axes[1].errorbar(results_df['Model'], results_df['CV Mean'], 
                yerr=results_df['CV Std'], fmt='none', color='black', capsize=5)
axes[1].set_xlabel('Model')
axes[1].set_ylabel('CV Accuracy')
axes[1].set_title('Cross-Validation Performance (5-fold)')
axes[1].set_xticks(range(len(results_df)))  # fix tick positions before relabeling
axes[1].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Part 6: Hyperparameter Tuning

In [None]:
# Select top 3 models for hyperparameter tuning
top_3_models = results_df.head(3)['Model'].tolist()
print(f"Tuning top 3 models: {top_3_models}\n")

# Define parameter grids
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7, None],
        'min_samples_split': [2, 5, 10]
    },
    'XGBoost': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    },
    'LightGBM': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    },
    'Gradient Boosting': {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    },
    'CatBoost': {
        'iterations': [50, 100, 200],
        'depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.3]
    }
}

# Tune each of the top models
tuned_models = {}

for model_name in top_3_models:
    if model_name in param_grids:
        print(f"\nTuning {model_name}...")
        
        # Get model and param grid
        model = models[model_name]
        param_grid = param_grids[model_name]
        
        # Grid search
        grid_search = GridSearchCV(
            model, param_grid, cv=5, scoring='accuracy', 
            n_jobs=-1, verbose=0
        )
        grid_search.fit(X_train, y_train)
        
        # Store best model
        tuned_models[model_name] = grid_search.best_estimator_
        
        print(f"  Best parameters: {grid_search.best_params_}")
        print(f"  Best CV score: {grid_search.best_score_:.4f}")
        print(f"  Test accuracy: {grid_search.score(X_test, y_test):.4f}")
    else:
        # Use default model if no param grid defined
        tuned_models[model_name] = models[model_name]
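
Grid search cost grows multiplicatively with each parameter list, so it helps to count the fits before launching a search. The sketch below does this for the Random Forest grid defined above using scikit-learn's `ParameterGrid`; the variable names `rf_grid`, `n_candidates`, and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the 'Random Forest' entry in param_grids
rf_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}

n_candidates = len(ParameterGrid(rf_grid))  # 3 * 4 * 3 combinations
n_folds = 5
n_cv_fits = n_candidates * n_folds          # GridSearchCV then refits the best one once more

print(f"{n_candidates} candidates x {n_folds} folds = {n_cv_fits} cross-validation fits")
```

If the count gets too large, `RandomizedSearchCV` with a fixed `n_iter` is one way to cap the budget.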

## Part 7: Ensemble Creation (Voting & Stacking)

In [None]:
# Create voting ensemble
voting_clf = VotingClassifier(
    estimators=[(name, model) for name, model in tuned_models.items()],
    voting='soft'
)

voting_clf.fit(X_train, y_train)
voting_acc = voting_clf.score(X_test, y_test)

print("Voting Ensemble Results:")
print(f"Test Accuracy: {voting_acc:.4f}")

# Create stacking ensemble
stacking_clf = StackingClassifier(
    estimators=[(name, model) for name, model in tuned_models.items()],
    final_estimator=LogisticRegression(random_state=42),
    cv=5
)

stacking_clf.fit(X_train, y_train)
stacking_acc = stacking_clf.score(X_test, y_test)

print("\nStacking Ensemble Results:")
print(f"Test Accuracy: {stacking_acc:.4f}")

# Compare with best individual model
best_individual_acc = max([model.score(X_test, y_test) 
                          for model in tuned_models.values()])

print("\n" + "="*60)
print("ENSEMBLE COMPARISON")
print("="*60)
print(f"Best Individual Model: {best_individual_acc:.4f}")
print(f"Voting Ensemble:       {voting_acc:.4f} ({voting_acc-best_individual_acc:+.4f})")
print(f"Stacking Ensemble:     {stacking_acc:.4f} ({stacking_acc-best_individual_acc:+.4f})")
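
One variation on the soft-voting ensemble above is to weight each base model by its cross-validated accuracy through the `weights` parameter of `VotingClassifier`. This is a minimal sketch on synthetic data, not the tuned models from Part 6; `Xd`, `yd`, `base`, `cv_weights`, and `weighted_voter` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
Xd, yd = make_classification(n_samples=300, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.2, random_state=42, stratify=yd
)

base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
]

# One weight per base model: its mean 5-fold CV accuracy on the training set
cv_weights = [
    cross_val_score(est, Xd_train, yd_train, cv=5, scoring='accuracy').mean()
    for _, est in base
]

weighted_voter = VotingClassifier(estimators=base, voting='soft', weights=cv_weights)
weighted_voter.fit(Xd_train, yd_train)

print("Weights:", [round(w, 3) for w in cv_weights])
print(f"Weighted voting test accuracy: {weighted_voter.score(Xd_test, yd_test):.4f}")
```

Raw accuracies tend to be close to each other, so these weights differ only slightly from uniform voting; rescaling them (for example, by rank) is a design choice to experiment with.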

## Part 8: Final Model Evaluation

In [None]:
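# Select best model (stacking or voting)
# Note: choosing by test accuracy makes the test metrics below slightly optimistic;
# comparing cross-validated scores on the training set is the stricter approach.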
if stacking_acc > voting_acc:
    final_model = stacking_clf
    final_model_name = "Stacking Ensemble"
else:
    final_model = voting_clf
    final_model_name = "Voting Ensemble"

print(f"\n{'='*60}")
print(f"FINAL MODEL: {final_model_name}")
print(f"{'='*60}\n")

# Get predictions
y_pred_final = final_model.predict(X_test)
y_pred_proba = final_model.predict_proba(X_test)[:, 1]

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_final, 
                          target_names=['Did not survive', 'Survived']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_final)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
           xticklabels=['Did not survive', 'Survived'],
           yticklabels=['Did not survive', 'Survived'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title(f'{final_model_name} - Confusion Matrix')
plt.show()

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'{final_model_name} - ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

print(f"\nROC AUC Score: {roc_auc:.4f}")
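
To prepare a submission in competition format, predictions are written to a CSV with one row per test passenger. The sketch below assumes the Kaggle Titanic layout of two columns, `PassengerId` and `Survived`. The seaborn dataset has no `PassengerId`, so the IDs and predictions here are placeholders, and `make_submission` is a hypothetical helper.

```python
import io

import numpy as np
import pandas as pd

def make_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Pair each ID with an integer 0/1 prediction in a two-column frame."""
    return pd.DataFrame({
        'PassengerId': np.asarray(passenger_ids),
        'Survived': np.asarray(predictions).astype(int),
    })

# Placeholder IDs and predictions; in practice these come from the
# competition's test file and from final_model.predict(...)
placeholder_ids = np.arange(892, 897)
placeholder_preds = np.array([0, 1, 0, 0, 1])

submission = make_submission(placeholder_ids, placeholder_preds)

# Write to an in-memory buffer here; pass a file path to to_csv for a real file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` matters: an extra index column is a common reason for a rejected submission file.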

## Part 9: Project Summary and Key Learnings

### Project Summary:

In this final project, you completed a full machine learning workflow:

1. **Exploratory Data Analysis**
   - Analyzed missing values
   - Visualized feature distributions
   - Identified relationships with target variable

2. **Feature Engineering**
   - Created family_size and is_alone features
   - Engineered age_group categories
   - Calculated fare_per_person
   - Encoded categorical variables

3. **Model Building**
   - Compared 7+ models: two baselines (logistic regression, decision tree) and five or more ensemble methods
   - Used cross-validation for robust evaluation
   - Identified top performing models

4. **Hyperparameter Tuning**
   - Grid search on top 3 models
   - Optimized key parameters
   - Improved model performance

5. **Ensemble Creation**
   - Built voting ensemble
   - Built stacking ensemble
   - Selected best final model

### Key Learnings:

- **Feature engineering significantly impacts performance**
- **Different ensemble methods have different strengths**
- **Hyperparameter tuning provides incremental improvements**
- **Stacking (an ensemble of ensembles) often, but not always, achieves the best results**
- **Cross-validation gives a more reliable basis for model selection than a single train/test split**

### Next Steps:

1. **Try more feature engineering**:
   - Extract title from name (Mr., Mrs., etc.)
   - Create deck feature from cabin
   - Interaction features (e.g., sex × pclass)

2. **Advanced ensembling**:
   - Multi-level stacking
   - Weighted voting based on CV scores
   - Include more diverse models

3. **Production considerations**:
   - Model serialization (pickle/joblib)
   - Prediction API
   - Monitoring and retraining
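
The feature engineering ideas in step 1 can be sketched as follows. This assumes Kaggle-style `Name`, `Cabin`, `Sex`, and `Pclass` columns, which the seaborn copy does not provide, so the example builds a tiny illustrative frame by hand. The column names `title`, `deck`, and `sex_pclass`, and the `'U'` marker for an unknown deck, are choices made here.

```python
import pandas as pd

# Illustrative rows in the Kaggle column format
demo = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley', 'Heikkinen, Miss. Laina'],
    'Cabin': [None, 'C85', None],
    'Sex': ['male', 'female', 'female'],
    'Pclass': [3, 1, 3],
})

# Title: the text between the comma and the first period
demo['title'] = demo['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Deck: first letter of the cabin, with 'U' marking a missing cabin
demo['deck'] = demo['Cabin'].str[0].fillna('U')

# Interaction feature: one category per sex/class combination
demo['sex_pclass'] = demo['Sex'] + '_' + demo['Pclass'].astype(str)

print(demo[['title', 'deck', 'sex_pclass']])
```

All three new columns are categorical, so they would still need encoding (for example with `pd.get_dummies`) before being passed to the models.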

Congratulations on completing the Ensemble Methods course! You now have the skills to build production-ready ensemble models for real-world problems.

## Bonus: Save Final Model

In [None]:
import pickle

# Save the final model
# Uncomment to save:
# with open('titanic_ensemble_model.pkl', 'wb') as f:
#     pickle.dump(final_model, f)
# print("Model saved successfully!")

# To load later:
# with open('titanic_ensemble_model.pkl', 'rb') as f:
#     loaded_model = pickle.load(f)
# predictions = loaded_model.predict(X_test)

print("Model saving code ready (uncomment to use)")
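
As an alternative to `pickle`, `joblib` is often used for scikit-learn models. The sketch below saves and reloads a small stand-in model in a temporary directory, so it leaves no files behind. `small_model`, `restored`, and the file name `demo_model.joblib` are hypothetical; substitute `final_model` and your own path.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model so the example runs on its own
Xs, ys = make_classification(n_samples=100, n_features=5, random_state=42)
small_model = LogisticRegression(max_iter=1000).fit(Xs, ys)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(small_model, model_path)   # serialize to disk
    restored = joblib.load(model_path)     # read it back

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(Xs) == small_model.predict(Xs)).all())
print(f"Predictions match after reload: {same_predictions}")
```

As with `pickle`, only load model files from sources you trust, and load them with the same library versions used for saving.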