# Machine Learning using Cross Validation

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('./tmp/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, random_state=0)

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=80, random_state=0))
])

Scikit-learn follows a convention in which every scoring metric is defined so that a higher number is better. Reporting negative MAE keeps the metric consistent with that convention, even though negative MAE is rarely seen elsewhere. For details, see the [scikit-learn model evaluation documentation](https://scikit-learn.org/stable/modules/model_evaluation.html).
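
To make the sign convention concrete, here is a minimal sketch in plain Python. The helper `mae`, the two model names, and all the numbers are hypothetical, chosen only for illustration; this is not scikit-learn's implementation. It shows that picking the highest negative MAE selects the same model as picking the lowest MAE.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions from two made-up models
y_true = [100.0, 200.0, 300.0]
predictions = {
    "model_a": [110.0, 190.0, 310.0],
    "model_b": [150.0, 150.0, 350.0],
}

# Negate the error so that "higher is better", as in the 'neg_mean_absolute_error' scorer
neg_mae = {name: -mae(y_true, preds) for name, preds in predictions.items()}

# Maximizing the negated error picks the model with the smallest error
best = max(neg_mae, key=neg_mae.get)
print(neg_mae, best)
```

Multiplying the scores by -1 afterwards, as the next cell does, simply recovers the ordinary MAE for reporting.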

In [11]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates negative MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print(scores)

[301620.65761152 302450.87845854 286030.87454361 235172.22588306
 259620.74982788]


In [12]:
print(scores.mean())

276979.0772649233
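
The five scores above come from five folds: each fold is held out once for validation while the model is trained on the remaining data, and the mean summarizes them. The sketch below shows one way such a split can be built, assuming contiguous, unshuffled folds. The function name `k_fold_indices` and the sample count are hypothetical and used only for illustration.

```python
def k_fold_indices(n_samples, n_splits):
    """Return a list of (train_indices, valid_indices) pairs, one per fold."""
    # Spread the samples as evenly as possible; the first folds absorb any remainder
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    folds, start = [], 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, valid_idx))
        start += size
    return folds

# Hypothetical example: 10 rows split into 5 folds
folds = k_fold_indices(n_samples=10, n_splits=5)
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every row lands in a validation fold exactly once, which is why the averaged score uses all of the data for evaluation.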


## XGBoost 

XGBoost has a few parameters that can dramatically affect accuracy and training speed. The first parameters you should understand are:

- **n_estimators** specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.
  - Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
  - Too high a value causes overfitting, which leads to accurate predictions on training data but inaccurate predictions on test data (which is what we care about).
  - Typical values range from 100 to 1000, though this depends a lot on the learning_rate parameter discussed below.

- **early_stopping_rounds** offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

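- **learning_rate** controls how much each component model contributes. Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in. Each added model then changes the overall prediction less, so a higher value of n_estimators can be used with less risk of overfitting.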

- **n_jobs** On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.
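
The interplay of n_estimators, learning_rate, and early_stopping_rounds can be sketched with a deliberately tiny boosting loop. This is a toy under stated assumptions, not XGBoost's algorithm: each "component model" here is just a constant (the mean of the current training residuals), and the function `toy_boost` and its data are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def toy_boost(y_train, y_valid, n_estimators, learning_rate, early_stopping_rounds):
    """Toy additive ensemble: returns (best_round, best_error, rounds_run)."""
    train_pred = [0.0] * len(y_train)
    valid_pred = [0.0] * len(y_valid)
    best_error, best_round, rounds_run = float("inf"), 0, 0

    for round_number in range(1, n_estimators + 1):
        rounds_run = round_number
        # "Fit" a component model to the residuals (toy assumption: a constant)
        residuals = [t - p for t, p in zip(y_train, train_pred)]
        step = sum(residuals) / len(residuals)
        # Shrink the new model's contribution by the learning rate before adding it
        train_pred = [p + learning_rate * step for p in train_pred]
        valid_pred = [p + learning_rate * step for p in valid_pred]

        # Early stopping: track the best validation error seen so far
        error = mae(y_valid, valid_pred)
        if error < best_error:
            best_error, best_round = error, round_number
        elif round_number - best_round >= early_stopping_rounds:
            break  # no improvement for early_stopping_rounds rounds

    return best_round, best_error, rounds_run

# Hypothetical data: a high n_estimators acts only as an upper bound
best_round, best_error, rounds_run = toy_boost(
    y_train=[1, 2, 3, 10], y_valid=[2, 3, 4],
    n_estimators=1000, learning_rate=0.5, early_stopping_rounds=2,
)
print(best_round, best_error, rounds_run)
```

Although n_estimators is 1000, the loop stops after only a few rounds because the validation error stops improving, which is the behavior the early_stopping_rounds bullet describes.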

In [33]:
from xgboost import XGBRegressor
my_model = XGBRegressor(n_estimators=5000, learning_rate=0.005, n_jobs=10)
my_model.fit(X_train, y_train, early_stopping_rounds=10, eval_set=[(X_valid, y_valid)], verbose=False)


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.005, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=5000,
             n_jobs=10, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [34]:
from sklearn.metrics import mean_absolute_error

# Predict on the validation set and report MAE (true values first, then predictions)
preds_valid = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, preds_valid)))

Mean Absolute Error: 250072.45741669735
