
# XGBoost (Practice)

### Instructions for Students
- This is a **practice notebook**.
- Complete all **TODO** sections.
- Read the markdown explanations carefully.
- Do not skip evaluation and reflection questions.

The dataset used here is **California Housing (Regression)**.



## 1. Import Required Libraries


In [19]:
# TODO: Import necessary libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

from xgboost import XGBRegressor


## 2. Load Dataset (California Housing)


In [8]:
# TODO: Load dataset

from sklearn.datasets import fetch_openml

data = fetch_openml(name="california_housing", version=1, as_frame=True)
X = data.data
y = data.target

In [9]:
# Perform one-hot encoding on the 'ocean_proximity' column
X = pd.get_dummies(X, columns=['ocean_proximity'], drop_first=True)
X.head()


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | True | False | False | False |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | True | False | False | False |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | True | False | False | False |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | True | False | False | False |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | True | False | False | False |
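
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy frame (the values are illustrative and not taken from the dataset). One category gets no column of its own and is represented by all remaining dummy columns being false.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6, 2.4],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"],
})

# One dummy column per category, minus the first category.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)

# With plain string values the categories are sorted alphabetically,
# so "INLAND" is the dropped (reference) category in this toy example.
dummy_cols = [c for c in encoded.columns if c.startswith("ocean_proximity_")]
print(list(encoded.columns))
print(encoded[dummy_cols])
```

Rows whose category was dropped have every dummy column false. For tree-based models the dropped column is not strictly required, but it keeps the feature count down.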



## 3. Train-Test Split


In [26]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                      random_state=42)
print(X_train.shape)
print(y_train.shape)

(16512, 12)
(16512,)



## 4. Baseline XGBoost Regressor


In [11]:
# TODO: Train baseline model
model = XGBRegressor(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    eval_metric="rmse",   # evaluation metric for monitored validation data
    random_state=42
)


model.fit(X_train, y_train)

In [14]:
# Predict
y_pred = model.predict(X_test)


## 5. Evaluate Baseline Model


In [18]:
# TODO: Evaluate baseline
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MSE:", mse)
print("R2 Score:", r2)

MSE: 3209475584.0
R2 Score: 0.7550783157348633
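
The MSE above is large because it is measured in squared target units. As a minimal sketch of what the two metrics compute, the following reproduces MSE, RMSE and R² by hand with NumPy on a tiny hypothetical example (the arrays are made up for illustration).

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals (squared target units).
mse_manual = np.mean(residuals ** 2)

# RMSE: square root of MSE (same units as the target).
rmse_manual = np.sqrt(mse_manual)

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```

RMSE is usually easier to interpret than MSE because it is on the same scale as the values being predicted.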



## 6. Hyperparameter Tuning with GridSearchCV


In [20]:
# TODO: Define parameter grid
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.7, 0.8, 1.0]
}


### Base Model for Grid Search


In [27]:
# TODO: Base model
xgb_base = XGBRegressor(
    objective="reg:squarederror",
    eval_metric="rmse",
    colsample_bytree=0.8,
    random_state=42,
    # use_label_encoder is a classifier-only option, so it is omitted for this regressor
)


### Run GridSearchCV


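#### Why “neg”?

In scikit-learn:
- `GridSearchCV` always selects the candidate with the highest score.
- Error metrics such as MSE and RMSE are better when they are lower.
- scikit-learn therefore exposes them as negated scorers (for example `neg_root_mean_squared_error`), so the highest score corresponds to the lowest error. Negate `best_score_` to recover the RMSE.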
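
The sign convention can be checked on a small example. This is a minimal sketch on synthetic data with a `LinearRegression` stand-in (both are assumptions for illustration, not part of the notebook); it uses `cross_val_score` with the same scoring string.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Each fold's score is the RMSE multiplied by -1, so every value is <= 0.
scores = cross_val_score(
    LinearRegression(), X_toy, y_toy,
    scoring="neg_root_mean_squared_error", cv=5,
)

# Flip the sign to report the error in its usual, positive form.
rmse_per_fold = -scores
print(scores)
print(rmse_per_fold.mean())
```

A score closer to zero is better here, which is exactly what "maximize the score" means for a negated error.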

In [28]:
# TODO: Run GridSearchCV
grid = GridSearchCV(
    estimator=xgb_base,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=7,
    n_jobs=-1,
    verbose=1 # Show me basic progress while training
)

grid.fit(X_train, y_train)

Fitting 7 folds for each of 81 candidates, totalling 567 fits
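
The counts in this message follow directly from the grid size and the number of folds. A minimal sketch of the arithmetic, repeating the grid from section 6:

```python
import math

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.7, 0.8, 1.0],
}
cv_folds = 7

# Every combination of values is one candidate: 3 * 3 * 3 * 3.
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is trained once per fold.
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 81 567
```

Adding one more value to any list multiplies the cost, so grids grow quickly. With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, which is not part of this count.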



## 7. Evaluate Tuned Model


In [35]:
# TODO: Evaluate tuned model
print("Best Parameters:", grid.best_params_)
print("Best CV RMSE:", -grid.best_score_)
print()
best_model = grid.best_estimator_

y_pred_tuned = best_model.predict(X_test)

print("Test R2 Score (Tuned):", r2_score(y_test, y_pred_tuned))
print("Test MSE (Tuned):", mean_squared_error(y_test, y_pred_tuned))

Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 1.0}
Best CV RMSE: 46988.27678571428

Test R2 Score (Tuned): 0.8251931071281433
Test MSE (Tuned): 2290685440.0
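
The last value printed above comes from `mean_squared_error`, so it is a squared error and cannot be compared directly with the CV RMSE. A minimal sketch that takes the square root of the printed test values and puts all three numbers on the same scale:

```python
import math

# Values copied from the printed outputs in sections 5 and 7.
baseline_mse = 3209475584.0
tuned_mse = 2290685440.0
cv_rmse = 46988.27678571428

baseline_rmse = math.sqrt(baseline_mse)
tuned_rmse = math.sqrt(tuned_mse)

# Relative reduction in test RMSE from baseline to tuned model.
relative_drop = 1.0 - tuned_rmse / baseline_rmse

# Difference between test RMSE and cross-validated RMSE of the tuned model.
cv_gap = tuned_rmse - cv_rmse

print(f"Baseline test RMSE: {baseline_rmse:.1f}")
print(f"Tuned test RMSE:    {tuned_rmse:.1f}")
print(f"Relative drop:      {relative_drop:.1%}")
print(f"Test minus CV RMSE: {cv_gap:.1f}")
```

The tuned test RMSE comes out at roughly 47.9k against roughly 56.7k for the baseline, and it sits close to the CV RMSE, which suggests the cross-validation estimate was a reasonable guide to test performance.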



## 8. Reflection Questions

1. Did GridSearch improve performance?
  > Yes. The test R² improved and the test MSE dropped.
  >- Baseline: R² = 0.755, MSE ≈ 3.21 × 10⁹
  >- Tuned: R² = 0.825, MSE ≈ 2.29 × 10⁹
2. Which parameter had the biggest effect?
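  > The search only reports the best combination: 'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 300, 'subsample': 1.0. Finding the most influential parameter requires comparing scores across values in `grid.cv_results_`. Note that `max_depth`, `n_estimators` and `subsample` all landed on the largest value in the grid, so a wider grid may be worth trying.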
3. What happens if learning_rate is too high?
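  >- A lower learning_rate shrinks each tree's contribution, so more trees are needed.
  >- A higher learning_rate needs fewer trees, but each step is larger, so the model can overshoot and is more likely to overfit.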
4. Would you deploy this model? Why?
  > I would deploy it only if the model is not overfitting, for example if the training and test errors are close.
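
For question 2, one rough way to compare parameters is to average the CV score for each value of a parameter and look at how far those averages spread. The sketch below is a hypothetical helper (`score_spread_by_param` is not part of scikit-learn) that reads the `params` and `mean_test_score` entries of a `cv_results_`-style dictionary. It is demonstrated on a small made-up stand-in rather than the real search results.

```python
from collections import defaultdict


def score_spread_by_param(cv_results):
    """Return, per parameter, the max-min spread of the average score per value."""
    per_value = defaultdict(lambda: defaultdict(list))
    for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
        for name, value in params.items():
            per_value[name][value].append(score)

    spread = {}
    for name, by_value in per_value.items():
        means = [sum(s) / len(s) for s in by_value.values()]
        spread[name] = max(means) - min(means)
    return spread


# Hypothetical stand-in for grid.cv_results_; the scores are made up.
toy_results = {
    "params": [
        {"max_depth": 2, "learning_rate": 0.05},
        {"max_depth": 2, "learning_rate": 0.1},
        {"max_depth": 5, "learning_rate": 0.05},
        {"max_depth": 5, "learning_rate": 0.1},
    ],
    "mean_test_score": [-60.0, -58.0, -50.0, -48.0],
}

spread = score_spread_by_param(toy_results)
print(spread)  # {'max_depth': 10.0, 'learning_rate': 2.0}
```

In the notebook this would be called as `score_spread_by_param(grid.cv_results_)`. A larger spread hints at a more influential parameter, but averaging over the other parameters ignores interactions between them, so treat it as a first look rather than a conclusion.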