# Boosting Trees Assignment

In this assignment, you'll compare three popular boosting algorithms:
- **AdaBoost**: Adaptive Boosting, the classic algorithm that reweights training samples toward earlier mistakes
- **XGBoost**: Extreme Gradient Boosting
- **LightGBM**: Light Gradient Boosting Machine

## Learning Objectives
- Understand how different boosting algorithms work
- Compare performance and training time
- Visualize model performance and feature importance

## Setup and Installation

Install required packages:

In [None]:
%pip install scikit-learn xgboost lightgbm matplotlib seaborn pandas numpy

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import time

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load and Explore Dataset

We'll use the Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn. It is small enough that every model trains quickly, which makes it convenient for comparing boosting algorithms side by side.

**Dataset Info:**
- 569 samples
- 30 features (computed from breast mass images)
- Binary classification: malignant (0) or benign (1)
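
The integer encoding of the two classes can be read directly from the loaded dataset object, which is worth doing before interpreting precision, recall, or ROC results. A minimal sketch, using only the `target` and `target_names` attributes of the object returned by `load_breast_cancer()`; the names `label_map` and `counts` are illustrative and not part of the assignment:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name for the integer label i in data.target
label_map = {int(i): str(name) for i, name in enumerate(data.target_names)}
print(label_map)

# Number of samples per class name
counts = {label_map[i]: int((data.target == i).sum()) for i in label_map}
print(counts)
```

Whichever class is encoded as 1 is the one that `predict_proba(...)[:, 1]` and the ROC curves later in the notebook treat as the positive class.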

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print(f"Dataset shape: {X.shape}")
print("\nFirst few rows:")
print(X.head())
print("\nTarget distribution:")
print(pd.Series(y).value_counts())

## 2. Train-Test Split

Split the data into training and testing sets.
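
The split in the next cell passes `stratify=y`, which keeps the class proportions of the full dataset in both subsets. A small sketch of how that could be checked, assuming the same `test_size=0.2` and `random_state=42` as the cell below; it loads the data as plain arrays via `return_X_y=True` so it runs on its own:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of class 1 in each subset; stratification keeps these nearly equal
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: n={len(labels):3d}, class-1 fraction={np.mean(labels):.3f}")
```

Without `stratify`, a small test set can end up with a noticeably different class balance, which makes accuracy harder to compare across runs.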

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

## 3. Train Boosting Models

### Task: Train three boosting models

**Hints:**
1. **AdaBoost**:
   - Use `AdaBoostClassifier` from sklearn
   - Key parameters: `n_estimators`, `learning_rate`, `random_state`
   - Try `n_estimators=100`, `learning_rate=1.0`
   - Experiment with different learning rates

2. **XGBoost**:
   - Use `XGBClassifier`
   - Key parameters: `n_estimators`, `learning_rate`, `max_depth`, `random_state`
   - Try `n_estimators=100`, `learning_rate=0.1`, `max_depth=3`
   - Experiment with different `max_depth` values

3. **LightGBM**:
   - Use `LGBMClassifier`
   - Key parameters: `n_estimators`, `learning_rate`, `max_depth`, `random_state`
   - Try `n_estimators=100`, `learning_rate=0.1`, `max_depth=3`
   - Set `verbose=-1` to suppress output
   - Experiment with different learning rates

**TODO:** Create three model instances and train them. Track training time for each model.
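
The timing pattern described above can be sketched on synthetic data. This is an illustration under stated assumptions, not the assignment solution: `make_classification` supplies stand-in data, scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` and `LGBMClassifier` so the block runs with scikit-learn alone, the names `candidates`, `fitted`, and `fit_seconds` are hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidates; GradientBoostingClassifier is a stand-in for
# XGBClassifier / LGBMClassifier, which follow the same fit() interface
candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

fitted = {}
fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()  # monotonic, high-resolution clock
    model.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start
    fitted[name] = model

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.4f} seconds")
```

On a dataset this small, single-run timings are noisy; repeating each fit a few times and averaging gives a fairer comparison.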

In [None]:
models = {}
training_times = {}

# TODO: Initialize AdaBoost model
# ada_model = ...

# TODO: Initialize XGBoost model
# xgb_model = ...

# TODO: Initialize LightGBM model
# lgbm_model = ...

# TODO: Train each model and record training time
# Example pattern:
# start_time = time.time()
# model.fit(X_train, y_train)
# training_times['ModelName'] = time.time() - start_time
# models['ModelName'] = model

print("Training complete!")
print("\nTraining times:")
for name, t in training_times.items():
    print(f"{name}: {t:.4f} seconds")

## 4. Evaluate Models

### Task: Make predictions and evaluate each model

**Hints:**
- Use `.predict()` to get class predictions
- Use `.predict_proba()` to get probability predictions (for ROC curves)
- Calculate accuracy using `accuracy_score()`
- Print classification reports using `classification_report()`

**TODO:** Generate predictions and calculate accuracy for each model.
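
To make the difference between `.predict()` and `.predict_proba()` concrete, here is a rough sketch on synthetic data. It assumes a single `AdaBoostClassifier` trained on `make_classification` output; the variable names (`X_tr`, `X_te`, `proba`, and so on) are illustrative and deliberately differ from the assignment's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the assignment dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

y_pred = model.predict(X_te)        # hard class labels, shape (n_samples,)
proba = model.predict_proba(X_te)   # class probabilities, shape (n_samples, n_classes)
y_pred_proba = proba[:, 1]          # probability of class 1, the input for ROC curves

acc = accuracy_score(y_te, y_pred)
print(f"Accuracy: {acc:.4f}")
print(classification_report(y_te, y_pred))
```

The columns of `predict_proba` follow the order of `model.classes_`, so column 1 is the probability of the class labeled 1.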

In [None]:
results = {}

# TODO: For each model, make predictions and calculate accuracy
# Example pattern:
# y_pred = model.predict(X_test)
# y_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
# acc = accuracy_score(y_test, y_pred)
# print(classification_report(y_test, y_pred, target_names=data.target_names))
# results['ModelName'] = {'accuracy': acc, 'y_pred': y_pred, 'y_pred_proba': y_pred_proba}

# Print results
print("\n" + "="*50)
print("Model Performance Summary")
print("="*50)
for name, result in results.items():
    print(f"\n{name}:")
    print(f"Accuracy: {result['accuracy']:.4f}")
    print(f"Training Time: {training_times[name]:.4f}s")

## 5. ROC Curves Comparison

**TODO:** Plot ROC curves for all three models on the same plot.

**Hints:**
- Use `roc_curve()` to calculate false positive rate and true positive rate
- Use `auc()` to calculate area under curve
- Plot all curves on the same figure for comparison
- Include AUC score in the legend
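
As a quick illustration of what `roc_curve()` and `auc()` return, the following sketch runs them on four toy scores. The labels and scores are hypothetical values chosen for readability, not results from the assignment data.

```python
from sklearn.metrics import auc, roc_curve

# Toy labels and scores (hypothetical values, not from the assignment data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold over the scores and returns
# one (fpr, tpr) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # trapezoidal area under the (fpr, tpr) points

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {roc_auc:.3f}")  # 0.750 for these four points
```

The AUC of 0.75 matches the ranking view of the metric: of the four positive/negative pairs, three have the positive sample scored higher than the negative one.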

In [None]:
plt.figure(figsize=(10, 8))

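# TODO: For each model, calculate and plot ROC curve
# Example pattern (loop over the `results` dict built in the evaluation step):
# for model_name, result in results.items():
#     fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
#     roc_auc = auc(fpr, tpr)
#     plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})')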

# Plot diagonal line
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True)
plt.show()