In this tutorial, you will learn how to use **cross-validation** for better measures of model performance.

# Introduction

Machine learning is an iterative process. 

You will face choices about which predictive variables to use, what types of models to use, what arguments to supply to those models, and so on. We make these choices in a data-driven way by measuring the model quality of each alternative.

# Shortcomings of Train-Test Split

Imagine you have a dataset with 5000 rows.  The `train_test_split` function has an argument for `test_size` that you can use to decide how many rows go to the training set and how many go to the test set. 

You will typically keep about 20% of the data as a test dataset (`test_size = 0.2`).  But even with 1000 rows in the test set, there's some random chance in determining model scores.  A model might do well on one set of 1000 rows, even if it would be inaccurate on a different 1000 rows.  The larger the test set, the less randomness (aka "noise") there is in our measure of model quality (and the more reliable it will be!).

At an extreme, you could imagine having only 1 row of data in the test set.  If you compare alternative models, which one makes the best predictions on a single data point will be mostly a matter of luck.  

But we can only get a large test set by removing data from our training data, and smaller training datasets mean worse models.  In fact, the ideal modeling decisions on a small dataset typically aren't the best modeling decisions on large datasets. 
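
The link between test-set size and noise described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `simulated_mae` is a hypothetical helper, and the per-row absolute errors are drawn from an exponential distribution with mean 1.0 rather than taken from a real model. It scores the same imaginary model on many random test sets and reports how much the MAE varies for each test-set size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable


def simulated_mae(n_rows):
    """MAE of an imaginary model on a random test set of n_rows rows.

    Assumption for illustration: each row's absolute error is drawn from an
    exponential distribution with mean 1.0. This is not real model output.
    """
    errors = [random.expovariate(1.0) for _ in range(n_rows)]
    return statistics.fmean(errors)


spread = {}
for n_rows in (1, 10, 100, 1000):
    # Score the same imaginary model on 200 different random test sets.
    maes = [simulated_mae(n_rows) for _ in range(200)]
    spread[n_rows] = statistics.stdev(maes)
    print(f"test rows: {n_rows:>4}   std. dev. of MAE: {spread[n_rows]:.3f}")
```

The standard deviation should shrink as the test set grows, which is the "less noise" effect in numeric form.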

# What is cross-validation?

In **cross-validation**, we run our modeling process on different subsets of the data to get multiple measures of model quality. 

For example, we could begin by dividing the data into 5 pieces, each being 20% of the full dataset.  In this case, we say that we have broken the data into 5 "**folds**".  

![cross-validation-graphic](https://i.stack.imgur.com/1fXzJ.png)

Procedure:
- In experiment 1, we use the first fold as a holdout set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set, much like the one we got from the simple train-test split.
- In experiment 2, we hold out the second fold and use everything except that fold to train the model. This gives us a second estimate of model quality.
- We repeat this process, using every fold once as the holdout.  Putting this together, 100% of the data is used as a holdout at some point.  

Returning to our example above from train-test split, if we have 5000 rows of data, we end up with a measure of model quality based on 5000 rows of holdout (even if we don't use all 5000 rows simultaneously).
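
The procedure above can be sketched in plain Python to show which rows play which role in each experiment. `k_fold_indices` is a hypothetical helper written for this illustration, and it assumes the number of rows divides evenly by the number of folds. In practice scikit-learn's `KFold` class handles this splitting for you.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, holdout_idx) pairs, one per fold (hypothetical helper)."""
    fold_size = n_rows // n_folds  # assumes n_rows divides evenly, for simplicity
    for fold in range(n_folds):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        holdout_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, holdout_idx


held_out = []
for i, (train_idx, holdout_idx) in enumerate(k_fold_indices(5000, 5), start=1):
    print(f"experiment {i}: train on {len(train_idx)} rows, "
          f"hold out {len(holdout_idx)} rows")
    held_out.extend(holdout_idx)

# Across the 5 experiments, every row is used as holdout exactly once.
print(sorted(held_out) == list(range(5000)))
```

Each experiment trains on 4000 rows and holds out 1000, and the final check confirms that the holdout sets together cover all 5000 rows without overlap.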

# Trade-offs Between Cross-Validation and Train-Test Split

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions.  However, it can take longer to run, because it fits multiple models (one for each **fold**).

So, given these trade-offs, when should you use each approach?
- _For small datasets_, where extra computational burden isn't a big deal, you should run cross-validation.
- _For larger datasets_, a simple train-test split is sufficient.  It will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

There's no simple threshold for what constitutes a large vs. small dataset.  But if your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation.

Alternatively, you can run cross-validation and see if the scores for each experiment seem close.  If each experiment yields similar results, a train-test split is probably sufficient.
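
One way to check whether the experiments agree is to look at the spread of the per-fold scores. The sketch below uses scikit-learn's `cross_val_score` (covered in the example later in this tutorial) on synthetic data from `make_regression`. The data, the `LinearRegression` model, and the names `X_demo`, `y_demo`, and `fold_mae` are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=500, n_features=5,
                                 noise=10.0, random_state=0)

# scikit-learn reports *negative* MAE, so negate to get ordinary MAE.
fold_mae = -1 * cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                                scoring='neg_mean_absolute_error')

print("MAE per fold:", fold_mae.round(2))
print("mean:", fold_mae.mean().round(2))
print("std. dev.:", fold_mae.std().round(2))
```

A standard deviation that is small relative to the mean suggests the folds tell a consistent story. How small is "small enough" remains a judgment call.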

# Example

We'll work with the same data as in the previous tutorial.  We load the input data in `X` and the output data in `y`.

In [None]:
import pandas as pd

# read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# select target
y = data.Price

Then, we define a **pipeline** that uses an **imputer** to fill in missing values and a **random forest** model to make predictions.  

While it's _possible_ to do cross-validation without pipelines, it is quite difficult!  Using a pipeline will make the code remarkably straightforward.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
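
To see what such a pipeline does, the sketch below fits one on a tiny made-up array with a single missing value. The data and the names `X_toy`, `y_toy`, and `toy_pipeline` are assumptions for illustration, and `n_estimators` and `random_state` are set only to keep the run small and repeatable. `make_pipeline` names each step after its lowercased class name, which is how the fitted imputer is looked up here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset with one missing value (illustration only).
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 40.0]])
y_toy = np.array([100.0, 200.0, 300.0, 400.0])

toy_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# fit() first fits the imputer, then fits the forest on the imputed data.
toy_pipeline.fit(X_toy, y_toy)

# SimpleImputer's default strategy is the column mean, learned during fit.
fill_values = toy_pipeline.named_steps['simpleimputer'].statistics_
print("fill values per column:", fill_values)

# predict() imputes the new row with those same fill values before predicting.
prediction = toy_pipeline.predict(np.array([[2.5, np.nan]]))
print("prediction:", prediction)
```

Because the imputer's fill values are learned inside `fit`, refitting the pipeline on each fold's training rows keeps holdout rows from influencing the imputation.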

We obtain the cross-validation scores with just a single line of code.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(my_pipeline, X, y,
                         cv=5,  # number of folds, stated explicitly
                         scoring='neg_mean_absolute_error')

print("MAE scores:")
# multiply by -1 (since scikit-learn calculates *negative* MAE)
print(-1 * scores)

We specify an argument for the `scoring` parameter to choose a measure of model quality to report: in this case, we chose negative mean absolute error (MAE).  The docs for scikit-learn show a [list of options](http://scikit-learn.org/stable/modules/model_evaluation.html).  

It is a little surprising that we specify *negative* MAE. Scikit-learn has a convention where all metrics are defined so a high number is better.  Using negatives here allows them to be consistent with that convention, though negative MAE is almost unheard of elsewhere. 
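
To make the sign convention and the role of the pipeline concrete, here is a rough sketch of the loop that `cross_val_score` carries out, written by hand with `KFold`. It runs on synthetic data with some values blanked out, and the names `X_demo`, `y_demo`, `demo_pipeline`, and `manual_mae` are assumptions for illustration. A fixed `random_state` is used so that the manual loop and `cross_val_score` fit identical models.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with some values blanked out (illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=5.0, random_state=0)
X_demo[::10, 0] = np.nan

demo_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

manual_mae = []
for train_idx, holdout_idx in KFold(n_splits=5).split(X_demo):
    model = clone(demo_pipeline)                     # fresh, unfitted copy per fold
    model.fit(X_demo[train_idx], y_demo[train_idx])  # imputer sees training rows only
    preds = model.predict(X_demo[holdout_idx])
    manual_mae.append(mean_absolute_error(y_demo[holdout_idx], preds))

neg_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')

print("manual MAE per fold:     ", np.round(manual_mae, 2))
print("-1 * cross_val_score:    ", np.round(-1 * neg_scores, 2))
```

The two printed rows should match: the values returned by `cross_val_score` are the per-fold MAE values with the sign flipped, so that a higher score always means a better model.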

We typically want a single measure of model quality to compare alternative models.  So we take the average across experiments.

In [None]:
print("Average MAE score (across experiments):")
print(-1 * scores.mean())

# Conclusion

Using cross-validation yields a much better measure of model quality, with the added benefit of cleaning up our code: note that we no longer need to keep track of separate train and test sets.  So, especially for small datasets, it's a good improvement!

# Your Turn
