<a href="https://colab.research.google.com/github/Ovizero01/Machine-Leaning/blob/main/022_XGBoost/022_XGBoost%20Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Module 22: XGBoost (Practice Notebook)

### Instructions for Students
- This is a **practice notebook**.
- Complete all **TODO** sections.
- Read the markdown explanations carefully.
- Do not skip evaluation and reflection questions.

The dataset used here is **California Housing (Regression)**.



## 1. Import Required Libraries


In [36]:
# TODO: Import necessary libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score  # regression metrics only

from xgboost import XGBRegressor


## 2. Load Dataset (California Housing)


In [37]:
# TODO: Load dataset

from sklearn.datasets import fetch_openml

data = fetch_openml(name="california_housing", version=1, as_frame=True)
X = data.data
y = data.target
X = pd.get_dummies(X, columns=["ocean_proximity"], drop_first=True)
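To see what the `pd.get_dummies(..., drop_first=True)` call does, here is a minimal sketch on a hypothetical three-row frame. The values are made up for illustration and are not taken from the dataset. With `drop_first=True`, the first category level is dropped and becomes the implicit reference level, which avoids a redundant column.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "median_income": [8.3, 5.6, 3.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

# For plain string columns the levels are sorted, so "INLAND" comes first,
# is dropped, and acts as the reference level.
print(list(encoded.columns))
```

A row with all remaining dummy columns set to zero therefore stands for the dropped level.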


## 3. Train-Test Split


In [38]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)


## 4. Baseline XGBoost Regressor


In [39]:
# TODO: Train baseline model
model = XGBRegressor(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    random_state=42
)

model.fit(X_train, y_train)



## 5. Evaluate Baseline Model


In [40]:
# TODO: Evaluate baseline
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(float(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²  : {r2:.4f}")

MSE : 3209475584.0000
RMSE: 56652.2337
R²  : 0.7551
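How the three metrics relate to each other can be sketched by computing them by hand. The arrays below are hypothetical values chosen for easy arithmetic, not model output. MSE is in squared target units, RMSE is its square root and so is in the same units as the target, and R² compares the squared error against that of always predicting the mean.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_hat

mse = np.mean(residuals ** 2)   # mean squared error, in squared target units
rmse = np.sqrt(mse)             # root of MSE, in the same units as the target
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot      # 1.0 is a perfect fit, 0.0 matches the mean predictor

print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2  : {r2:.4f}")
```

Because the target here is a house value in dollars, RMSE is usually the easier number to interpret: it reads as a typical prediction error in dollars.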



## 6. Hyperparameter Tuning with GridSearchCV


In [41]:
# TODO: Define parameter grid
param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0]
}


### Base Model for Grid Search


In [42]:
# TODO: Base model
base_model = XGBRegressor(
    objective="reg:squarederror",
    random_state=42
)

grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
    verbose=1,
    n_jobs=-1
)


### Run GridSearchCV


In [43]:
# TODO: Run GridSearchCV
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits
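The numbers in this log line follow directly from the grid: four parameters with two values each give 16 combinations, and each one is fitted once per fold. A minimal sketch of that count, using only the standard library (the `cv_folds` name is introduced here for illustration):

```python
from itertools import product

param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 3  # mirrors cv=3 in the GridSearchCV call

# Each combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

Adding a third value to any one parameter would raise the count to 24 candidates and 72 fits, which is why grid size grows quickly. With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set after the cross-validation fits.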



## 7. Evaluate Tuned Model


In [44]:
# TODO: Evaluate tuned model
best_search = grid_search.best_estimator_
y_pred_best = best_search.predict(X_test)


mse = mean_squared_error(y_test, y_pred_best)
rmse = np.sqrt(float(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred_best)

print(f"MSE : {mse:.4f}")
print(f"RMSE : {rmse:.4f}")
print(f"R²   : {r2:.4f}")

MSE : 2562362624.0000
RMSE : 50619.7849
R²   : 0.8045
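The test metrics alone do not show which settings the search chose or how much each parameter mattered. One way to look into this is through `best_params_` and `cv_results_`. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` and scikit-learn's `GradientBoostingRegressor` as a lightweight stand-in for `XGBRegressor`, with a smaller grid, so the numbers it prints say nothing about the housing model.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data and a stand-in model (assumptions, not the notebook's setup).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

demo_grid = {"max_depth": [2, 4], "learning_rate": [0.05, 0.1]}
demo_search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=42),
    param_grid=demo_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Best CV RMSE   :", -demo_search.best_score_)  # scores are negated RMSE

# Average CV score per value of each parameter; a larger spread between
# values suggests that parameter had a larger effect within this grid.
results = pd.DataFrame(demo_search.cv_results_)
for name in demo_grid:
    per_value = results.groupby(f"param_{name}")["mean_test_score"].mean()
    print(name, "spread:", per_value.max() - per_value.min())
```

The same attribute names should apply to the notebook's own `grid_search` object after it has been fitted, which gives a data-backed answer to the question of which parameter had the biggest effect.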



## 8. Reflection Questions

1. Did GridSearch improve performance?
2. Which parameter had the biggest effect?
3. What happens if learning_rate is too high?
4. Would you deploy this model? Why?


1. Yes. GridSearch improved test performance: R² rose from 0.7551 to 0.8045, and MSE fell from about 3.21 × 10⁹ to about 2.56 × 10⁹.

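2. max_depth most likely had the biggest effect, because deeper trees can capture more interactions between features; learning_rate matters too, but probably less within this grid. This is a judgment rather than a measured result: confirming it requires comparing the mean cross-validation scores for each parameter value in grid_search.cv_results_.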

3. If learning_rate is too high, each new tree's correction is applied at nearly full strength, so the model can overshoot, fit noise in the training data within a few rounds, and become less accurate on new data.

4. No, not yet. The model is still essentially a baseline: it works, but it has only been tuned over a small grid and has not been validated or tested on real-world cases. Deploying it now could give unreliable predictions.