## Using Different Base Learners in XGBoost

XGBoost is a gradient boosting framework that builds an ensemble of weak learners, added one at a time, to form a strong predictive model.
By default, XGBoost uses **decision trees** as its base learners, but it also supports **linear models** as an alternative.

### What is a Base Learner?
A base learner (or weak learner) is the individual model added to the ensemble in each boosting round.
Each new learner is fit to the errors the current ensemble still makes (for squared error, these are the residuals), so the ensemble improves round by round.

### Common Base Learners in XGBoost

#### 1. **Decision Tree (Default)**
- Works by creating a series of decision rules.
- Each new tree corrects the errors of the previous trees.
- Controlled by parameters like `max_depth`, `min_child_weight`, and `gamma`.

#### 2. **Linear Models (Logistic/Linear Regression)**
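- Uses a linear function of the features instead of a tree structure.
- With a regression objective such as `reg:squarederror`, the boosted model behaves like linear regression.
- With a classification objective such as `binary:logistic`, it behaves like logistic regression.
- Because a sum of linear functions is still linear, it cannot capture non-linear relationships or feature interactions the way trees can.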
- Set with `booster='gblinear'`.

### How XGBoost Uses Base Learners
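- Base learners are trained sequentially with **gradient boosting**: each round adds one learner to the ensemble built so far.
- Each round reduces a loss function chosen through `objective` (e.g., log loss for classification, squared error for regression).
- The final prediction is the sum of the outputs of all learners, each scaled by the learning rate (for classification, this sum is then converted to a probability).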
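
The sequential procedure above can be sketched by hand. The following is a simplified illustration of plain gradient boosting with squared error, using scikit-learn's `DecisionTreeRegressor` as the base learner; it is not XGBoost's actual implementation, which also uses second-order gradient information and regularization. The synthetic data and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_rounds = 20

# Start from a constant prediction: the mean of the target.
base_score = y_demo.mean()
pred = np.full(y_demo.shape, base_score)
trees = []
mse_history = [np.mean((y_demo - pred) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is proportional to the residual.
    residuals = y_demo - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add the new learner's output, shrunk by the learning rate.
    pred = pred + learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - pred) ** 2))


def boosted_predict(X_new):
    """Base score plus the shrunken outputs of every fitted tree."""
    out = np.full(X_new.shape[0], base_score)
    for fitted_tree in trees:
        out = out + learning_rate * fitted_tree.predict(X_new)
    return out


print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each tree only has to model what the previous trees got wrong, which is why shallow trees are enough to drive the training error down over the rounds.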

### Key Parameters
- **`booster`**
  - Specifies the type of base learner.
  - Options: `'gbtree'` (default), `'gblinear'`, `'dart'` (dropout trees).

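- **`learning_rate`**
  - Scales the contribution of each learner (called `eta` in the native API). Smaller values make each round more conservative and usually require more rounds.

- **`n_estimators`**
  - Number of boosting rounds (number of base learners to train) in the scikit-learn API; the native `xgb.train` API calls this `num_boost_round`.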


In [7]:
import pandas as pd
import numpy as np

In [8]:
import os

# Load the preprocessed Ames housing data from the local datasets folder.
base_dir = "datasets"
file_name = "ames_housing_trimmed_processed.csv"
housing = pd.read_csv(os.path.join(base_dir, file_name))

housing.head()

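| Unnamed: 0 | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | Remodeled | GrLivArea | BsmtFullBath | BsmtHalfBath | ... | HouseStyle_1.5Unf | HouseStyle_1Story | HouseStyle_2.5Fin | HouseStyle_2.5Unf | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl | PavedDrive_P | PavedDrive_Y | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 0 | 1710 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 208500 |
| 1 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 0 | 1262 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 181500 |
| 2 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 1 | 1786 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 223500 |
| 3 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1 | 1717 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 140000 |
| 4 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 0 | 2198 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 250000 |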


In [9]:
# Features are every column except the last; the target (SalePrice) is the last column.
X, y = housing.iloc[:, :-1], housing.iloc[:, -1]

### Decision Trees As Base Learners

Train an XGBoost model to predict house prices using the Ames housing dataset loaded above.
The features (`X`) contain information about the houses and their locations, while the target (`y`) is the sale price (`SalePrice`).

Because decision trees are XGBoost's default base learner, no `booster` argument is needed here.
The steps are: split the data into training and test sets, instantiate the XGBoost regressor with the desired parameters, fit it on the training data, then predict on the test data and evaluate the predictions with the root mean squared error (RMSE).


In [10]:
# 1. Split X and y into training and test sets, holding out 20% for testing (random_state=123).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# 2. Instantiate the XGBRegressor as xg_reg with a "reg:squarederror" objective and 10 trees.
#    random_state=123 fixes the seed. booster="gbtree" is the default, so it is not specified.
import xgboost as xgb
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=10, random_state=123)

# 3. Fit xg_reg to the training data and predict the prices of the test set.
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

# 4. Compute the RMSE as the square root of the mean squared error.
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (RMSE))




RMSE: 31292.976337
