BOSTON HOUSES PREDICTION WITH SCIKIT (PART 3)

In [1]:
# We continue working with the housing dataset from the previous parts (loaded below from melb_data.csv), trying to improve on the earlier results.
# This time we'll look at cross-validation, XGBoost and some ideas about data leakage.

1. Cross-validation

In [1]:
# In the previous code we used train_test_split to keep 80% of the data for training and the remaining 20% for validation.
# That gives a single split, and the rows that end up in the validation set are picked at random, so the error we measure depends
# partly on which rows happened to be held out. With cross-validation we split the data into several folds (5 here) and use each fold
# once as validation data while training on the rest, which gives one error per fold and a more reliable estimate of model quality.
# On big datasets it might not be a good idea, because the model is trained once per fold and that can be really slow.

import pandas as pd
url = 'https://raw.githubusercontent.com/AleGL92/Scikit-Learn/main/melb_data.csv'  # named url to avoid shadowing the built-in dir()
melb_data = pd.read_csv(url)
cols = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = melb_data[cols]
y = melb_data.Price

# Then, we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

my_pipeline = Pipeline(steps = [
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators = 50, random_state = 0))
])

# Multiply by -1 since sklearn calculates *negative* MAE
MAE = -1 * cross_val_score(my_pipeline, X, y, cv = 5, scoring = 'neg_mean_absolute_error')
print("MAEs using cross-validation:\n", MAE)

MAEs using cross-validation:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]
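
To make the fold mechanics explicit, here is a minimal sketch of what cross_val_score does with cv = 5 for a regressor: an unshuffled KFold split, one fit per fold, one MAE per fold. It runs on synthetic data from make_regression (an assumption, chosen so the example needs no download), and the names X_demo, y_demo and fold_maes are hypothetical, not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: used only so the sketch is self-contained)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# For a regressor, cv = 5 corresponds to KFold with 5 consecutive, unshuffled folds
kf = KFold(n_splits=5)

fold_maes = []
for train_idx, val_idx in kf.split(X_demo):
    # Train on four folds, validate on the remaining one
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[val_idx])
    fold_maes.append(mean_absolute_error(y_demo[val_idx], preds))

fold_maes = np.array(fold_maes)
print("MAE per fold:", fold_maes)
print("Mean MAE across folds:", fold_maes.mean())
```

Averaging the per-fold errors, as in the last line, gives a single number that is easier to compare between models than the five separate values.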


2. XGBoost 

In [4]:
# Gradient boosting is a method that goes through cycles to iteratively add models to an ensemble.
# In each cycle, the current ensemble makes predictions and a loss function is calculated from them. That loss is then used to fit a new model
# (chosen so that it reduces the loss), and the new model is added to the ensemble.
# With enough iterations, we can sometimes get better predictions than with a Decision Tree or a Random Forest.
# We'll be using the XGBoost library for this. First we split the data into training and validation sets.
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

X_train, X_val, y_train, y_val = train_test_split(X, y)
my_model = XGBRegressor()
my_model.fit(X_train, y_train)

preds = my_model.predict(X_val)
print("MAE using XGBoost: ", mean_absolute_error(y_val, preds))
# We got about 244252, which is the best result so far (the exact value changes between runs because the split is not seeded).

MAE using XGBoost:  244251.88284931885
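
The boosting cycle described above can be sketched by hand. This is a simplified illustration under stated assumptions, not how XGBoost is implemented: it uses squared-error loss (so each new tree is fit to the current residuals), small scikit-learn decision trees, synthetic data, and it measures error on the training data only. All names here (X_demo, ensemble_pred, n_rounds and so on) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: used only so the sketch is self-contained)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1  # shrink each tree's contribution before adding it
n_rounds = 50        # number of boosting cycles, i.e. trees in the ensemble

# Start from a naive prediction: the mean of the target
ensemble_pred = np.full(len(y_demo), y_demo.mean())
initial_mae = mean_absolute_error(y_demo, ensemble_pred)

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the residuals are the negative gradient of the loss
    residuals = y_demo - ensemble_pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Add the new model's (shrunken) predictions to the ensemble
    ensemble_pred += learning_rate * tree.predict(X_demo)
    trees.append(tree)

final_mae = mean_absolute_error(y_demo, ensemble_pred)
print("Training MAE before boosting:", initial_mae)
print("Training MAE after boosting: ", final_mae)
```

Because the error here is computed on the training data, it says nothing about overfitting; a held-out validation set is still needed to judge the model.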


In [6]:
# There are some parameters that affect the results when using XGBoost:

# n_estimators specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.
# Values that are too low for n_estimators cause underfitting, and values that are too high cause overfitting.

# early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score 
# stops improving. Since random chance sometimes causes a single round where validation scores don't improve, we need to specify a number for how many rounds of straight 
# deterioration to allow before stopping.

# Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number 
# (known as the learning rate) before adding them in. This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without 
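# overfitting. If we use early stopping, the appropriate number of trees will be determined automatically. The default learning rate depends on the XGBoost
# version (0.1 in older releases of the scikit-learn wrapper, 0.3 in recent ones), so it is safer to set it explicitly.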

# On larger datasets where runtime is a consideration, we can use parallelism to build the models faster. It's common to set the parameter n_jobs equal to the number of 
# cores on the machine. On smaller datasets, this won't help.

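# Note: since XGBoost 1.6, early_stopping_rounds is a constructor argument; passing it to fit() no longer works in XGBoost 2.0 and later.
my_model = XGBRegressor(n_estimators = 500, learning_rate = 0.05, n_jobs = 4,
                        early_stopping_rounds = 5)
my_model.fit(X_train, y_train,
             eval_set = [(X_val, y_val)],
             verbose = False)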

preds = my_model.predict(X_val)
print("MAE using XGBoost, after tuning parameters: ", mean_absolute_error(preds, y_val))
# We got about 240909, which is even better than the previous result.

MAE using XGBoost, after tuning parameters:  240908.63307713548
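
The early stopping rule described above (stop once the validation score has failed to improve for a set number of rounds) can be emulated to see how it picks the number of trees. This sketch swaps in scikit-learn's GradientBoostingRegressor and its staged_predict method instead of XGBoost, purely so it runs without xgboost installed. Unlike real early stopping, it trains all 500 trees first and only then scans the validation error. The data is synthetic and names such as patience and best_n_trees are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: used only so the sketch is self-contained)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
gbr.fit(X_tr, y_tr)

patience = 5  # plays the role of early_stopping_rounds
best_mae = float("inf")
best_n_trees = 0
rounds_without_improvement = 0
val_maes = []

# staged_predict yields the ensemble's predictions after 1 tree, 2 trees, 3 trees, ...
for n_trees, preds in enumerate(gbr.staged_predict(X_va), start=1):
    mae = mean_absolute_error(y_va, preds)
    val_maes.append(mae)
    if mae < best_mae:
        best_mae, best_n_trees = mae, n_trees
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            break  # validation MAE has not improved for `patience` rounds in a row

print("Trees kept by the stopping rule:", best_n_trees)
print("Validation MAE at that point:   ", best_mae)
```

A small patience stops quickly but can be fooled by a short noisy stretch; a larger one is more robust at the cost of extra rounds.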
