First, we read the data:

In [1]:
import pandas as pd
data = pd.read_csv('../input/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

Then we specify a pipeline of our modeling steps. (It can be very difficult to do cross-validation properly if you aren't using [pipelines](https://www.kaggle.com/dansbecker/pipelines).)

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
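# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
# Fill in missing values (column means by default), then fit a random forest
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())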

Finally get the cross-validation scores:

In [3]:
from sklearn.model_selection import cross_val_score
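# cv=3 matches the three fold scores printed below; recent scikit-learn versions default to 5 folds
scores = cross_val_score(my_pipeline, X, y, cv=3, scoring='neg_mean_absolute_error')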
print(scores)

[-312633.79946035 -300616.9300481  -299723.34101318]
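
To make concrete what `cross_val_score` does with those folds, here is a minimal pure-Python sketch of k-fold cross-validation. It is not scikit-learn's implementation: the helper names (`k_fold_indices`, `cross_val_mae`), the toy prices, and the "model" that simply predicts the training mean are all assumptions made up for illustration.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mae(y, n_folds=3):
    """Hold out each fold in turn and score a toy mean-predicting model on it."""
    scores = []
    for test_idx in k_fold_indices(len(y), n_folds):
        held_out = set(test_idx)
        train = [v for i, v in enumerate(y) if i not in held_out]
        test = [y[i] for i in test_idx]
        prediction = sum(train) / len(train)  # "fit": remember the training mean
        mae = sum(abs(v - prediction) for v in test) / len(test)
        scores.append(-mae)  # negated, mirroring the neg_mean_absolute_error scoring used above
    return scores


toy_prices = [100.0, 120.0, 90.0, 110.0, 130.0, 95.0, 105.0]  # made-up values
fold_scores = cross_val_mae(toy_prices, n_folds=3)
print(fold_scores)  # one score per fold
```

Every row is used for testing exactly once and for training in all the other folds, which is why no separate train-test split is needed.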


You may notice that we specified an argument for *scoring*.  This specifies what measure of model quality to report.  The docs for scikit-learn show a [list of options](http://scikit-learn.org/stable/modules/model_evaluation.html).  

It is a little surprising that we specify *negative* mean absolute error in this case. Scikit-learn has a convention where all metrics are defined so a high number is better.  Using negatives here allows them to be consistent with that convention, though negative MAE is almost unheard of elsewhere.
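
A small sketch of why the sign flip is convenient: once every score follows "higher is better", the best model is always the one with the maximum score. The two functions below are hand-written stand-ins for illustration (not scikit-learn's own code), and the model names and predictions are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain MAE: lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mean_absolute_error(y_true, y_pred):
    """Negated MAE: higher (closer to zero) is better."""
    return -mean_absolute_error(y_true, y_pred)


y_true = [10.0, 20.0, 30.0]
# Hypothetical predictions from two candidate models
candidates = {
    'model_a': [12.0, 18.0, 33.0],  # absolute errors 2, 2, 3
    'model_b': [10.0, 25.0, 30.0],  # absolute errors 0, 5, 0
}

scores = {name: neg_mean_absolute_error(y_true, preds)
          for name, preds in candidates.items()}

# With "higher is better", picking the winner is always max()
best = max(scores, key=scores.get)
print(best, -scores[best])  # negate again to report the ordinary MAE
```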

You typically want a single measure of model quality to compare between models.  So we take the average of the scores across the folds.

In [4]:
# Negate the mean score to report an ordinary (positive) MAE, rounded to two decimals
print('Mean Absolute Error %.2f' % (-1 * scores.mean()))

Mean Absolute Error 304324.69


# Conclusion

Using cross-validation gave us a more reliable measure of model quality, with the added benefit of cleaning up our code (we no longer need to keep track of separate train and test sets).  So, it's a good win.

# Your Turn
1. Convert the code for your ongoing project from train-test split to cross-validation.  Make sure to remove all code that divides your dataset into training and testing datasets.  Leaving code you don't need anymore would be sloppy.

2. Add or remove a predictor from your models.  Compute the cross-validation score with both sets of predictors, and compare the two scores to see which set performs better.
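
One possible way to approach the second exercise is sketched below. It assumes a recent scikit-learn (with `SimpleImputer`) and NumPy; the helper `cv_mae` is a hypothetical name, and the data is synthetic stand-in data rather than the Melbourne dataset, so only the pattern carries over to your project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def cv_mae(X, y, n_folds=3):
    """Mean cross-validated MAE for an imputer + random forest pipeline."""
    pipeline = make_pipeline(
        SimpleImputer(),
        RandomForestRegressor(n_estimators=50, random_state=0),
    )
    scores = cross_val_score(pipeline, X, y, cv=n_folds,
                             scoring='neg_mean_absolute_error')
    return -scores.mean()  # flip the sign back to an ordinary MAE


# Synthetic data (an assumption for illustration): one useful predictor, one pure-noise predictor
rng = np.random.default_rng(0)
informative = rng.normal(size=300)
noise_feature = rng.normal(size=300)
y = 3.0 * informative + rng.normal(scale=0.1, size=300)

X_both = np.column_stack([informative, noise_feature])
X_noise_only = noise_feature.reshape(-1, 1)

mae_both = cv_mae(X_both, y)
mae_noise_only = cv_mae(X_noise_only, y)
print('MAE with both predictors: %.3f' % mae_both)
print('MAE with noise only:      %.3f' % mae_noise_only)
```

Because both predictor sets are scored with the same folds and the same pipeline, the set with the lower MAE is the better choice under this measure.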