<a href="https://colab.research.google.com/github/DhimanTarafdar/california-housing-regression-xgboost/blob/main/Module_22_XGBoost_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Module 22: XGBoost (Practice Notebook)

### Instructions for Students
- This is a **practice notebook**.
- Complete all **TODO** sections.
- Read the markdown explanations carefully.
- Do not skip evaluation and reflection questions.

The dataset used here is **California Housing (Regression)**.



## 1. Import Required Libraries


In [2]:
# TODO: Import necessary libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

from xgboost import XGBRegressor


## 2. Load Dataset (California Housing)


In [3]:
# TODO: Load dataset

from sklearn.datasets import fetch_openml

data = fetch_openml(name="california_housing", version=1, as_frame=True)
X = data.data
y = data.target


## 3. Train-Test Split


In [4]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## 4. Baseline XGBoost Regressor


In [5]:
# TODO: Train baseline model
model = XGBRegressor(random_state=42, enable_categorical=True)
model.fit(X_train, y_train)


## 5. Evaluate Baseline Model


In [6]:
# TODO: Evaluate baseline
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Baseline Model Performance:")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

Baseline Model Performance:
RMSE: 48550.7154
R² Score: 0.8201



## 6. Hyperparameter Tuning with GridSearchCV


In [7]:
# TODO: Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'subsample': [0.8],
    'colsample_bytree': [0.8]
}


### Base Model for Grid Search


In [8]:
# TODO: Base model
xgb_base = XGBRegressor(random_state=42, enable_categorical=True)



### Run GridSearchCV


In [10]:
# TODO: Run GridSearchCV
grid = GridSearchCV(
    estimator=xgb_base,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits



## 7. Evaluate Tuned Model


In [11]:
# TODO: Evaluate tuned model
print(f"Best Parameters: {grid.best_params_}")
print(f"Best CV Score (neg MSE): {grid.best_score_:.4f}")

# Get best model
best_model = grid.best_estimator_

# Predict on test set
y_pred_tuned = best_model.predict(X_test)

# Calculate metrics
mse_tuned = mean_squared_error(y_test, y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)
r2_tuned = r2_score(y_test, y_pred_tuned)

print("\nTuned Model Performance:")
print(f"RMSE: {rmse_tuned:.4f}")
print(f"R² Score: {r2_tuned:.4f}")

# Compare with baseline
print("\nImprovement:")
print(f"RMSE reduced by: {rmse - rmse_tuned:.4f}")
print(f"R² improved by: {r2_tuned - r2:.4f}")

Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
Best CV Score (neg MSE): -2339246933.3333

Tuned Model Performance:
RMSE: 49320.5132
R² Score: 0.8144

Improvement:
RMSE reduced by: -769.7978
R² improved by: -0.0057
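
The best CV score printed above is a negated MSE in squared dollars, which is hard to compare with the RMSE values. A minimal sketch of the conversion, using only numbers already printed in this notebook:

```python
import math

# Value reported by GridSearchCV with scoring='neg_mean_squared_error'.
best_cv_score = -2339246933.3333

# Flip the sign to recover the mean MSE across folds, then take the square root.
cv_rmse = math.sqrt(-best_cv_score)

baseline_test_rmse = 48550.7154  # baseline RMSE from Section 5
tuned_test_rmse = 49320.5132     # tuned RMSE from the output above

print(f"CV RMSE (sqrt of mean fold MSE): {cv_rmse:,.2f}")
print(f"Tuned test RMSE minus CV RMSE:   {tuned_test_rmse - cv_rmse:,.2f}")
```

This gives a CV RMSE of about $48,366. Note that the square root of a mean of fold MSEs is not exactly the mean of the fold RMSEs, so treat it as an approximate comparison.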


## Model Performance Analysis & Findings

### Dataset Overview
- **Dataset**: California Housing (Regression Problem)
- **Task**: Predict median house prices
- **Features**: 8 numerical features + 1 categorical (ocean_proximity)
- **Target**: Median house value

---

### Results Summary

#### Baseline XGBoost Model (Default Parameters)
- **RMSE**: 48,550.72
- **R² Score**: 0.8201
- **Interpretation**: The baseline model explains about 82% of the variance in test-set house prices, with a root mean squared error (RMSE) of ~$48,551. RMSE is not a plain average error: it weights large misses more heavily.
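
RMSE and the mean absolute error (MAE) can differ substantially when a few predictions miss badly. A minimal sketch with made-up error values (not taken from this notebook's model) illustrates the gap:

```python
import numpy as np

# Hypothetical prediction errors in dollars, invented for illustration only.
errors = np.array([10_000, -10_000, 10_000, -10_000, 100_000], dtype=float)

mae = np.mean(np.abs(errors))          # every miss counts in proportion to its size
rmse = np.sqrt(np.mean(errors ** 2))   # squaring lets the single large miss dominate

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Here the MAE is 28,000 while the RMSE is roughly 45,600, driven almost entirely by the one 100,000 miss. Reporting both metrics for the real model would show how much of the ~$48,551 comes from a minority of large errors.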

#### Tuned Model (After GridSearchCV)
- **RMSE**: 49,320.51
- **R² Score**: 0.8144
- **Best Parameters**:
  - `n_estimators`: 200
  - `max_depth`: 5
  - `learning_rate`: 0.1
  - `subsample`: 0.8
  - `colsample_bytree`: 0.8

---

### Key Observations

**1. Unexpected Outcome:**
- The tuned model performed **slightly worse** than the baseline!
- RMSE increased by $770
- R² score decreased by 0.0057

**2. Why did this happen?**
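- **Search space excludes the baseline configuration**: The grid fixes `subsample` and `colsample_bytree` at 0.8 and stops at `max_depth=5`, while the baseline ran with XGBoost's defaults (`max_depth=6`, `learning_rate=0.3`, no row or column subsampling). The search could never select the baseline itself.
- **CV score vs. test score**: GridSearchCV picks the candidate with the best cross-validation score, which does not guarantee the best score on one particular test split
- **Small margin**: The gap is about 1.6% of the baseline RMSE. When the baseline is already strong (82% R²), differences this small can come from the split and not from the parameters
- **Random variation**: With `cv=3`, fold-to-fold variance is higher, so the ranking of close candidates is less reliable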
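
One way to probe the search-space explanation is to check whether the baseline configuration is among the grid's candidates at all. The sketch below expands the grid from Section 6 and compares it with the defaults listed in the XGBoost documentation (an assumption worth verifying against your installed version):

```python
from itertools import product

# Grid copied from Section 6.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.1, 0.3],
    'subsample': [0.8],
    'colsample_bytree': [0.8],
}

# Assumed XGBRegressor defaults for these five parameters (check your version).
baseline_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.3,
    'subsample': 1.0,
    'colsample_bytree': 1.0,
}

# Expand the grid into every candidate combination.
keys = list(param_grid)
candidates = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]

baseline_in_grid = baseline_params in candidates
mismatched = sorted(k for k in keys if baseline_params[k] not in param_grid[k])

print(f"Candidates: {len(candidates)}")
print(f"Baseline in grid: {baseline_in_grid}")
print(f"Parameters whose default is not searched: {mismatched}")
```

Under these assumed defaults the baseline is not one of the 8 candidates, so the search was choosing among models that all differ from it in `max_depth`, `subsample`, and `colsample_bytree`.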

**3. What does this tell us?**
- **Default XGBoost parameters are already quite good** for this dataset
- Hyperparameter tuning doesn't always guarantee improvement
- The baseline model may have better generalization for unseen data
- More extensive tuning (larger grid, more CV folds) might be needed
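
Whether a gap of this size is meaningful can be estimated by resampling the test set. The following is a rough sketch of a paired bootstrap, not a formal test. The helper name `bootstrap_rmse_diff` and the synthetic demo arrays are hypothetical stand-ins for `y_test`, `y_pred`, and `y_pred_tuned`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def bootstrap_rmse_diff(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """95% percentile interval for RMSE(pred_b) - RMSE(pred_a) over resampled rows."""
    y_true, pred_a, pred_b = (np.asarray(a, dtype=float) for a in (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample test rows with replacement
        diffs[i] = rmse(y_true[idx], pred_b[idx]) - rmse(y_true[idx], pred_a[idx])
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)

# Synthetic demo data (invented; replace with the notebook's arrays).
demo_rng = np.random.default_rng(42)
y_demo = demo_rng.normal(200_000, 100_000, size=500)
pred_a_demo = y_demo + demo_rng.normal(0, 48_000, size=500)
pred_b_demo = y_demo + demo_rng.normal(0, 49_000, size=500)

low, high = bootstrap_rmse_diff(y_demo, pred_a_demo, pred_b_demo)
print(f"95% interval for RMSE difference: [{low:,.0f}, {high:,.0f}]")
```

With the notebook's variables the call would be `bootstrap_rmse_diff(y_test, y_pred, y_pred_tuned)`. An interval that straddles zero would suggest the $770 gap is within sampling noise.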

---

### Conclusion & Recommendations

**Best Model for Production**: **Baseline Model** (RMSE: 48,550.72, R²: 0.8201)

**Why?**
- Better test performance
- Faster training (no tuning overhead)
- Simpler to maintain

**Next Steps to Improve:**
1. **Try RandomizedSearchCV** with larger parameter space
2. **Increase CV folds** (5 or 10) for more stable evaluation
3. **Feature Engineering**: Create interaction features or polynomial features
4. **Ensemble Methods**: Combine XGBoost with other models
5. **Use more data** if available for better generalization
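
As a starting point for the first two steps, a randomized search draws candidates from distributions instead of a fixed grid. The sketch below only samples candidates with scikit-learn's `ParameterSampler` so it runs quickly. The ranges are illustrative assumptions, not tuned recommendations.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search ranges; they include the XGBoost defaults the grid left out.
param_distributions = {
    'n_estimators': randint(100, 601),       # integers from 100 to 600
    'max_depth': randint(3, 9),              # integers from 3 to 8
    'learning_rate': loguniform(0.01, 0.3),  # log-scaled between 0.01 and 0.3
    'subsample': uniform(0.6, 0.4),          # floats in [0.6, 1.0]
    'colsample_bytree': uniform(0.6, 0.4),   # floats in [0.6, 1.0]
}

# Draw 20 random candidates; random_state makes the draw reproducible.
candidates = list(ParameterSampler(param_distributions, n_iter=20, random_state=42))

for params in candidates[:3]:
    print(params)
```

The same dictionary can be passed as `param_distributions` to `RandomizedSearchCV`, for example with `n_iter=20` and `cv=5`, in place of the `GridSearchCV` call in Section 6.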

**Key Takeaway**: Sometimes simpler is better! The baseline XGBoost with default parameters outperformed the tuned model on this test split of the California Housing dataset. Always compare tuned models against the baseline before deployment.


## 8. Reflection Questions

1. Did GridSearch improve performance?
2. Which parameter had the biggest effect?
3. What happens if learning_rate is too high?
4. Would you deploy this model? Why?


## Reflection Questions & Answers

### 1. Did GridSearch improve performance?

**Answer**: No, GridSearch did not improve performance in this case.

**Explanation**:
- Baseline RMSE: 48,550.72 vs Tuned RMSE: 49,320.51
- The tuned model performed slightly worse (~$770 higher error)
- This happened because:
  - XGBoost default parameters were already well-suited for this dataset
  - Limited grid search space (reduced for speed) may have missed optimal combinations
  - Cross-validation score doesn't always translate to better test performance
  - Possible overfitting to validation folds during tuning

**Conclusion**: On this test split, the baseline model generalized slightly better than the tuned one.

---

### 2. Which parameter had the biggest effect?

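**Answer**: Based on the best parameters found, **`max_depth=5`** and **`n_estimators=200`** likely had the biggest effect, although confirming this requires comparing scores across the whole grid in `grid.cv_results_`, not just the winning combination.

**Why?**
- **`max_depth`**: Controls tree complexity. Deeper trees capture more patterns but risk overfitting. The search chose 5, one level shallower than the default of 6.
- **`n_estimators`**: More trees (200 vs default 100) give boosting more iterations to correct residual errors.
- **`learning_rate=0.1`**: Lower than XGBoost's default of 0.3, which the baseline used. A smaller step per tree usually needs more trees, consistent with the search picking `n_estimators=200`.
- **`subsample` and `colsample_bytree`**: Both were fixed at 0.8 in the grid, so the search could not measure their effect. The baseline used the default of 1.0 for each.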

**General Rule**: In XGBoost, `n_estimators`, `max_depth`, and `learning_rate` typically have the most significant impact on performance.
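
To go beyond a guess for a specific search, the per-candidate scores in `grid.cv_results_` can be grouped by parameter value. The helper below is a hypothetical sketch. It relies only on the `params` and `mean_test_score` entries that `GridSearchCV` exposes and is demonstrated on invented toy scores, not on this notebook's results.

```python
from collections import defaultdict

def score_spread_by_param(cv_results):
    """For each parameter, average the CV score per value and return max - min."""
    per_value = defaultdict(lambda: defaultdict(list))
    for params, score in zip(cv_results['params'], cv_results['mean_test_score']):
        for name, value in params.items():
            per_value[name][value].append(float(score))

    spreads = {}
    for name, by_value in per_value.items():
        means = [sum(scores) / len(scores) for scores in by_value.values()]
        spreads[name] = max(means) - min(means)
    return spreads

# Toy stand-in for grid.cv_results_ (scores are made up for illustration).
toy_results = {
    'params': [
        {'max_depth': 3, 'learning_rate': 0.1},
        {'max_depth': 3, 'learning_rate': 0.3},
        {'max_depth': 5, 'learning_rate': 0.1},
        {'max_depth': 5, 'learning_rate': 0.3},
    ],
    'mean_test_score': [-10.0, -9.0, -6.0, -5.0],
}

print(score_spread_by_param(toy_results))
```

In the toy data, `max_depth` shifts the average score by 4.0 and `learning_rate` by 1.0, so depth matters more there. Calling `score_spread_by_param(grid.cv_results_)` would give the same summary for the real search. Parameters with a single grid value, such as `subsample`, always show a spread of 0.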

---

### 3. What happens if learning_rate is too high?

**Answer**: If `learning_rate` is too high, each tree's correction is applied at nearly full strength, which causes several problems.

**Negative Effects**:
- **Overshooting**: Steps are too large and can jump past the optimum
- **Unstable training**: Loss may fluctuate or fail to converge
- **Poor generalization**: The model learns too aggressively from each tree and misses subtle patterns
- **Overfitting**: Individual trees get too much weight, so their noise is not averaged out

**Example**:
- `learning_rate = 0.01`: Slow but stable, needs more trees (500+)
- `learning_rate = 0.1`: Balanced and a common starting point (XGBoost's own default is 0.3)
- `learning_rate = 0.5+`: Aggressive and likely to perform poorly

**Best Practice**: Use smaller learning rates (0.01-0.1) with more estimators for better results.
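
The link between learning rate and tree count can be seen in an idealized model. Assume squared-error loss and that every tree fits the current residual exactly (real trees do not). Each round then removes a fraction `learning_rate` of the remaining residual. The function name below is hypothetical.

```python
def remaining_residual(learning_rate, n_rounds):
    """Fraction of the initial training residual left after n_rounds of idealized boosting."""
    return (1.0 - learning_rate) ** n_rounds

# Compare how quickly different learning rates shrink the training residual.
for lr, rounds in [(0.01, 100), (0.01, 500), (0.1, 100), (0.3, 100)]:
    left = remaining_residual(lr, rounds)
    print(f"learning_rate={lr:<5} rounds={rounds:<4} residual left: {left:.2e}")
```

At `learning_rate=0.01`, about 37% of the residual remains after 100 rounds, which is why small rates need several hundred trees. This toy model only tracks training error, so it says nothing about overfitting. That still has to be checked on validation data.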

---

### 4. Would you deploy this model? Why?

**Answer**: Yes, I would deploy the **baseline model**, but with some considerations.

**Why Deploy?**
- **Good Performance**: R² = 0.82 means the model explains 82% of price variance
- **Reasonable Error**: RMSE of ~$48,551 is acceptable for California housing prices (typical prices $100K-$500K+)
- **Simple & Fast**: Baseline model is easier to maintain and faster to retrain
- **Production Ready**: XGBoost is industry-proven and scalable

**Deployment Considerations**:
- **Model Monitoring**: Track prediction errors over time (concept drift)
- **Confidence Intervals**: Provide prediction ranges, not just point estimates
- **Feature Updates**: Ensure new data has same features and distributions
- **A/B Testing**: Deploy alongside current system to validate real-world performance
- **Interpretability**: Use SHAP values to explain predictions to stakeholders

**Improvements Before Deployment**:
1. Test on more recent data to check temporal stability
2. Add feature importance analysis
3. Implement early stopping to prevent overfitting
4. Set up automated retraining pipeline
5. Create fallback mechanisms for edge cases

**Final Decision**: Deploy the baseline model in a monitored environment with the ability to rollback if performance degrades.