# XGBoost for Regression: A Powerful and Efficient Alternative ⚡️

This notebook explores the use of **XGBoost (eXtreme Gradient Boosting)** for a regression problem and compares its performance against both a standard Linear Regression model and `scikit-learn`'s native Gradient Boosting implementation.

**Gradient Boosting** is a powerful ensemble technique that builds models sequentially. Each new model (typically a decision tree) is trained to correct the errors of the ones before it, allowing the overall model to improve iteratively.
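To make the "correct the errors of the ones before it" idea concrete, here is a minimal toy sketch of boosting for squared error. It is not how XGBoost or `scikit-learn` implement it: it assumes a single input feature, uses one-split "stumps" instead of full decision trees, and all names (`fit_stump`, `boost`, the toy data) are made up for illustration.

```python
# Toy gradient boosting for squared error with one-feature decision stumps.
# Illustrative only: real libraries use deeper trees, regularization, and more.

def fit_stump(x, residuals):
    """Find the threshold on x that best splits the residuals into two means."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    base = sum(y) / len(y)           # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    history = [mse(y, preds)]        # training MSE after each round
    for _ in range(n_rounds):
        # For squared error, the residuals are what the ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Add a damped version of the new stump's prediction to the ensemble.
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
        history.append(mse(y, preds))
    return base, stumps, history

# Made-up data: y = x^2 on [0, 4.9]
x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]

base, stumps, history = boost(x, y)
print(f"Training MSE before boosting: {history[0]:.3f}")
print(f"Training MSE after {len(stumps)} rounds: {history[-1]:.3f}")
```

Each round fits a new stump to the current residuals, so the training error shrinks step by step. The `learning_rate` damps each correction, which is the same role the parameter of that name plays in the libraries used below.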

**XGBoost** is a highly optimized and parallelized implementation of this concept. It is renowned in the machine learning community for its key advantages:
* **Higher Predictive Accuracy:** It often achieves state-of-the-art results on structured data.
* **Faster Training Speed:** It leverages parallel processing and algorithmic optimizations to train much more quickly than standard implementations.

---

## 1. Predicting Housing Prices in California

We will use the well-known California Housing dataset, available directly from `scikit-learn`, to predict the median house value (`MedHouseVal`) in a district based on several demographic and geographic features.

First, let's load the dataset and convert it into a `pandas` DataFrame for easier handling.


In [13]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

# Create a DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |:--- |
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


After splitting the data, we will compare our three models on **R-squared ($R^2$) Score**, **Mean Squared Error (MSE)**, and **Training Time**.
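As a quick reference for the two accuracy metrics, the sketch below computes them by hand using the standard definitions, $\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. The function names and the four-point toy data are made up for illustration; the notebook itself uses `mean_squared_error` and `r2_score` from `sklearn.metrics`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot: the fraction of target variance the model explains."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation
    return 1 - ss_res / ss_tot

# Made-up toy values, not taken from the housing data
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

toy_mse = mean_squared_error_manual(y_true, y_pred)
toy_r2 = r2_score_manual(y_true, y_pred)
print(f"MSE: {toy_mse}")   # 0.125
print(f"R2:  {toy_r2}")    # 0.975
```

Lower MSE is better, and an $R^2$ closer to 1 is better. A model that always predicts the mean of the targets scores an $R^2$ of 0.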


In [14]:
from sklearn.model_selection import train_test_split
import time

X = df.drop('MedHouseVal', axis='columns')
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

## 2. Comparing Model Performance and Speed

### a) Baseline Model: Linear Regression
This simple, fast model serves as our initial performance benchmark.

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

start = time.time()
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
end = time.time()

y_pred = model_lr.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"Time: {end - start} seconds")

R2 Score: 0.6009790143129106
MSE: 0.5444842122132874
Time: 0.0035076141357421875 seconds


### b) Scikit-Learn's Gradient Boosting Regressor
Next, we'll use `scikit-learn`'s native Gradient Boosting implementation. We expect a significant improvement in accuracy.


In [16]:
from sklearn.ensemble import GradientBoostingRegressor

start = time.time()
model_gbm = GradientBoostingRegressor()
model_gbm.fit(X_train, y_train)
end = time.time()

y_pred = model_gbm.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"Time: {end - start} seconds")

R2 Score: 0.7852161384744023
MSE: 0.2930833861720792
Time: 3.253821611404419 seconds


As expected, the R² score jumped significantly, from **60.1% to 78.5%**. However, notice that the training time also increased substantially, from a few milliseconds to over 3 seconds.
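A single wall-clock measurement like the ones in this notebook can be noisy, since it depends on the machine and on whatever else is running. One possible way to get steadier numbers is to repeat the call and take the median. The helper below is a hypothetical sketch (`time_call` and `dummy_workload` are not part of the notebook) and uses `time.perf_counter`, which is designed for measuring short durations.

```python
import statistics
import time

def time_call(fn, repeats=5):
    """Run fn() several times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def dummy_workload():
    # Stand-in for a model fit so that the sketch runs without any ML library.
    sum(i * i for i in range(100_000))

median_seconds = time_call(dummy_workload)
print(f"Median time over 5 runs: {median_seconds:.6f} seconds")
```

In the notebook, the same helper could be called as `time_call(lambda: GradientBoostingRegressor().fit(X_train, y_train))`, at the cost of fitting the model several times.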


### c) XGBoost Regressor

Finally, we'll train the `XGBRegressor` on the same data.

In [17]:
from xgboost import XGBRegressor

start = time.time()
model_xgb = XGBRegressor()
model_xgb.fit(X_train, y_train)
end = time.time()

y_pred = model_xgb.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred)}")
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"Time: {end - start} seconds")

R2 Score: 0.8365188672017998
MSE: 0.22307823146216027
Time: 0.12464714050292969 seconds


The XGBoost model achieves the best scores of the three (R² of about 0.837 and MSE of about 0.223) and trains in roughly 0.12 seconds, dramatically faster than the standard Gradient Boosting implementation.


## 3. Conclusion

| Model | R² Score | Mean Squared Error (MSE) | Training Time |
|:--- |:--- |:--- |:--- |
| Linear Regression | 60.1% | 0.544 | ~0.004 s |
| Gradient Boosting (sklearn) | 78.5% | 0.293 | ~3.254 s |
| **XGBoost** | **83.7%** | **0.223** | **~0.125 s** |

This comparison illustrates why XGBoost is so popular. On this dataset, with default settings for all three models, it provided a clear boost in predictive accuracy over both other models while also being **over 20 times faster** to train than scikit-learn's standard `GradientBoostingRegressor`. This combination of strong performance and high efficiency makes it an excellent choice for a wide range of regression tasks.
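The speedup and accuracy gains can be checked directly from the numbers printed in the cells above. The short sketch below copies those printed values (the variable names are only for illustration) and computes the training-time ratio and the relative MSE reduction of XGBoost compared with `GradientBoostingRegressor`. Timings will differ from machine to machine, so the ratio is indicative rather than exact.

```python
# Values copied from the cell outputs above
gbm_time = 3.253821611404419      # GradientBoostingRegressor training time (s)
xgb_time = 0.12464714050292969    # XGBRegressor training time (s)
gbm_mse = 0.2930833861720792
xgb_mse = 0.22307823146216027

speedup = gbm_time / xgb_time                  # how many times faster XGBoost trained
mse_reduction = (gbm_mse - xgb_mse) / gbm_mse  # relative drop in test MSE

print(f"Training speedup: {speedup:.1f}x")
print(f"Relative MSE reduction: {mse_reduction:.1%}")
```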