In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

In [27]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [28]:
from sklearn.metrics import mean_absolute_error

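def maePredict(my_model):
    # Score the fitted model on the held-out validation set
    predictions = my_model.predict(X_valid)
    # scikit-learn's signature is mean_absolute_error(y_true, y_pred)
    return mean_absolute_error(y_valid, predictions)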


In [29]:
maePredict(my_model)

234449.40684370397

* n_estimators
Specifies how many times to go through the modeling cycle, i.e. how many models are added to the ensemble.
- Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
- Too high a value causes overfitting, which causes accurate predictions on training data, but inaccurate predictions on test data (which is what we care about).
- Typical values range from 100-1000, though this depends a lot on the learning_rate parameter discussed below.
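
To make the "modeling cycle" concrete, here is a toy sketch in plain Python. It is not XGBoost's actual algorithm: the helpers `fit_stump`, `boost` and `mae` and the eight data points are made up for illustration. Each round fits a one-split stump to the current residuals and adds it to the ensemble, so more rounds fit the training data more closely.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_estimators):
    """Return training predictions after n_estimators boosting rounds."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # Add the new stump's output to the running prediction
        preds = [p + (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return preds


def mae(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


# Made-up training data (y = x squared)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [float(x * x) for x in xs]

for n in (1, 5, 50):
    print(n, mae(ys, boost(xs, ys, n_estimators=n)))
```

The printed numbers are training error only. They keep shrinking as rounds are added, which is exactly why a validation set is needed to notice overfitting.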

In [30]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

maePredict(my_model)

246002.3690226896

* early_stopping_rounds
Offers a way to automatically find a good value for n_estimators.
- Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators.
- It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.
- early_stopping_rounds = 5 is a reasonable value. Thus, we stop after 5 straight rounds without an improvement in the validation score.
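
The stopping rule itself can be sketched in a few lines of plain Python. This is only an illustration of the idea, not XGBoost's implementation: the function name `early_stop` and the list of validation errors are hypothetical.

```python
def early_stop(valid_errors, early_stopping_rounds):
    """Return (best_round, stopped_at) for a sequence of per-round validation errors.

    Training stops once `early_stopping_rounds` rounds have passed
    without beating the best error seen so far.
    """
    best_error = float("inf")
    best_round = -1
    stopped_at = -1
    for round_idx, err in enumerate(valid_errors):
        stopped_at = round_idx
        if err < best_error:
            best_error, best_round = err, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            break  # patience exhausted
    return best_round, stopped_at


# Hypothetical validation errors: improve until round 3, then get worse
errors = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0, 7.5, 9.0]
best_round, stopped_at = early_stop(errors, early_stopping_rounds=5)
print(best_round, stopped_at)
```

Here the best round is 3 and iteration stops at round 8, five rounds later, even though the list (standing in for n_estimators) allows more rounds.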

In [31]:
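my_model = XGBRegressor(n_estimators=500)
# Note: XGBoost 1.6 and later accept early_stopping_rounds as an
# XGBRegressor constructor argument; passing it to fit() is deprecated there.
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)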

maePredict(my_model)



237243.49972846097

* learning_rate
Scales the predictions from each model before they are added to the ensemble.
- A small value makes each tree we add to the ensemble less impactful, so we can set a higher value for n_estimators without overfitting.
- Typical values range from 0.01 to 0.2.
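
A rough way to see the trade-off with n_estimators: assume an idealised tree that predicts the current residual exactly (real trees do not). Each round then removes a fraction learning_rate of the remaining error. The function names and the starting error of 100 below are made up for illustration.

```python
def remaining_error(initial_error, learning_rate, n_rounds):
    """Error left after n_rounds when every tree predicts the residual exactly."""
    error = initial_error
    for _ in range(n_rounds):
        correction = error                    # idealised tree output
        error -= learning_rate * correction   # only a fraction is applied
    return error


def rounds_needed(learning_rate, initial_error=100.0, tolerance=1.0):
    """Count rounds until the remaining error drops below tolerance."""
    error, rounds = initial_error, 0
    while error >= tolerance:
        error -= learning_rate * error
        rounds += 1
    return rounds


for lr in (0.5, 0.2, 0.05):
    print(lr, rounds_needed(lr))
```

Smaller learning rates need more rounds to reach the same error, which is why a low learning_rate is usually paired with a high n_estimators and early stopping.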

In [32]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)

maePredict(my_model)



239733.01253681886

* n_jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster.
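- This parameter tells XGBoost how many CPU cores to use in parallel while building the model.
- The MAE below is identical to the previous run: n_jobs changes how fast the model trains, not what it learns.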
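
One possible way to pick a value, sketched with the standard library only. Leaving one core free is an arbitrary choice made for this example, not an XGBoost recommendation.

```python
import os

# os.cpu_count() can return None when the count cannot be determined
available_cores = os.cpu_count() or 1

# Leave one core free for the rest of the system (arbitrary choice)
n_jobs = max(1, available_cores - 1)
print(n_jobs)
```

The resulting number can then be passed as the n_jobs argument of XGBRegressor.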

In [33]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=3)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)

maePredict(my_model)



239733.01253681886