# Cross-Validation and the Test Set

In the last lecture, we saw how keeping some data hidden from our model could help us to get a clearer understanding of whether or not the model was overfitting. 

We saw that in order to fit the data well, but not overfit, the key was selecting the right value for the complexity parameter (e.g. `max_depth`). This time, we'll introduce a common automated framework for handling this task, called **cross-validation**. We'll also incorporate a designated **test set**, which we won't touch until the very end of our analysis to get an overall view of the performance of our model.

In [1]:
#standard imports


__Before we get started:__ Make sure the file `titanic.csv` is in the same folder as this notebook.
    
Now, let's read in the data.

We are again going to use the `train_test_split` function to divide our data in two. This time, however, we are not going to use the holdout data to determine the model complexity. Instead, we are going to hide the holdout data until the very end of our analysis, and use a different technique for handling the model complexity. __In practice, you should never touch your test data until the very end, so that your test is an honest measure of your model's performance.__
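The split itself might look something like the sketch below. The toy rows and the columns `Pclass` and `Age` are assumptions made so that the example runs on its own; in the notebook, the DataFrame would come from `pd.read_csv("titanic.csv")` instead. The `test_size` and `random_state` values are also illustrative choices, not values taken from the lecture.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in rows so the sketch is self-contained.
# In the notebook, use instead: titanic = pd.read_csv("titanic.csv")
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 3, 2, 2],
    "Name":     [f"Passenger {i}" for i in range(10)],
    "Sex":      ["male", "female", "female", "male", "female",
                 "male", "male", "female", "female", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, 54.0, 2.0, 14.0, 4.0, 58.0],
})

# Hold out 20% of the rows as the test set (an assumed fraction).
# Fixing random_state makes the split reproducible.
train, test = train_test_split(titanic, test_size=0.2, random_state=0)
```

From this point on, `test` stays untouched until the final evaluation.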

We will clean our data with the same preprocessing as earlier.

In [4]:
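from sklearn import preprocessing

def prep_titanic_data(data_df):
    """Encode Sex as integers, drop Name, and split features from the target."""
    df = data_df.copy()  # leave the caller's DataFrame unmodified
    # Encode the Sex column as integers
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])
    df = df.drop(['Name'], axis=1)  # Name is not used as a feature

    X = df.drop(['Survived'], axis=1).values  # feature matrix
    y = df['Survived'].values                 # target vector

    return X, y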

In [5]:
X_train, y_train = prep_titanic_data(train)
X_test,  y_test  = prep_titanic_data(test)

## K-fold Cross-Validation

The idea of k-fold cross-validation is to take a small piece of our training data, say 20%, and use that as a mini test set. We train the model on the remaining 80%, and then evaluate on the 20%. We then take a *different* 20%, train on the remaining 80%, and so on, until each piece has served as the mini test set exactly once (five times, in this example). Finally, we average the results to get an overall picture of how the model might be expected to perform on the real test set. Cross-validation is a widely used tool for estimating the optimal complexity of a model.

<figure class="image" style="width:100%">
  <img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" alt="Illustration of k-fold cross validation. The training data is sequentially partitioned into 'folds', each of which is used as mini-testing data exactly once. The image shows five-fold validation, with four boxes of training data and one box of testing data. The diagram then indicates a final evaluation against additional testing data not used in cross-validation." width="600px">
    <br>
    <caption><i>K-fold cross-validation. Source: scikit-learn docs.</i></caption>
</figure>
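To make the mechanics concrete, here is a minimal hand-rolled sketch of five-fold cross-validation. It uses a synthetic dataset from `make_classification` as a stand-in for the Titanic training data, and the choices `k = 5` and `max_depth=3` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (an assumption, not the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))   # shuffle the row indices once
folds = np.array_split(indices, k)  # k disjoint pieces of 20% each

scores = []
for i in range(k):
    val_idx = folds[i]  # this fold plays the role of the mini test set
    # every other fold is used for training
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

# The average over the k folds is the cross-validation estimate.
mean_score = np.mean(scores)
```

Each row is used for validation exactly once and for training in the other four rounds.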

The good folks at `scikit-learn` have implemented a function called `cross_val_score` which automates this entire process. It repeatedly selects holdout data, trains the model, and scores the model against the holdout data. While exceptions apply, you can often use `cross_val_score` as a plug-and-play replacement for `model.fit()` and `model.score()` during your model selection phase.

Let's test this out with a decision tree.
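A call might look like the following sketch. The synthetic data from `make_classification` stands in for `X_train` and `y_train`, and both `max_depth=3` and `cv=5` are assumed values chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train, y_train (an assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5 requests five folds; the result is an array with one score per fold.
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
```

Note that `cross_val_score` fits clones of the model internally, so `tree` itself is still unfitted after the call.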

We are really interested in the average of the CV scores.

Now, let's use this to find the best depth.
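One way to sketch the search is to loop over candidate depths and record the mean CV score for each. The depth range of 1 to 10 and the synthetic stand-in data are assumptions; in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train, y_train (an assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

depths = range(1, 11)  # assumed candidate depths
mean_scores = []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0)
    # average the five fold scores into a single number per depth
    mean_scores.append(cross_val_score(tree, X_demo, y_demo, cv=5).mean())

# The depth with the highest mean CV score is our estimate of the best depth.
best_depth = depths[int(np.argmax(mean_scores))]
```

Only the training data is involved here; the test set plays no part in choosing the depth.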

Now, let's look at our results.

Now that we have a reasonable estimate of the optimal depth, we can try evaluating against the unseen testing data. 
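The final step can be sketched as follows: refit a single model on all of the training data at the chosen depth, then score it once on the test set. The synthetic data and the value `best_depth = 3` are placeholders, so the number this sketch produces says nothing about the Titanic result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and split (assumptions, not the Titanic data).
X_all, y_all = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

best_depth = 3  # hypothetical; in practice, the depth selected by cross-validation

# Refit on the full training set, then evaluate exactly once on the test set.
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_tree.fit(X_tr, y_tr)
test_accuracy = final_tree.score(X_te, y_te)
```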

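Great! On the Titanic data, we even got slightly higher accuracy on the test set than we did in cross-validation. This can happen by chance, since both numbers are estimates computed from finite samples.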

# Machine Learning Workflow: The Big Picture

We now have all of the elements that we need to execute the core machine learning workflow. At a high level, here's what should go into a machine learning task:

1. Separate out the test set from your data. 
2. Clean and prepare your data if needed. It is best practice to clean your training and test data separately. It's convenient to write a function for this. 
3. Identify a set of candidate models (e.g. decision trees with depth up to 30, or logistic models with between 1 and 3 variables). 
4. Use a validation technique (k-fold cross-validation is usually sufficient) to estimate how your models will perform on the unseen test data. Select the best model as measured by validation. 
5. Finally, score the best model against the test set and report the result. 
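The five steps above can be sketched end to end as follows. This is a self-contained illustration under stated assumptions: the data is synthetic (so step 2 has nothing to clean), and the candidate set is limited to decision trees with depths 1 through 10.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: separate out the test set (synthetic stand-in data, an assumption).
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Step 2: cleaning would go here; the synthetic data needs none.

# Step 3: identify candidate models (assumed: trees with depth 1 to 10).
candidates = {
    d: DecisionTreeClassifier(max_depth=d, random_state=1) for d in range(1, 11)
}

# Step 4: estimate each candidate's performance with 5-fold cross-validation,
# using the training data only, and select the best one.
cv_means = {
    d: cross_val_score(model, X_train, y_train, cv=5).mean()
    for d, model in candidates.items()
}
best_depth = max(cv_means, key=cv_means.get)

# Step 5: refit the selected model on all training data, then score it
# once against the test set.
best_model = candidates[best_depth].fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
```

The test set appears only in the last line, which is what keeps the final score an honest estimate.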

Of course, this isn't all there is to data science: you still need to do exploratory analysis, interpret your model, and so on.

We'll discuss model interpretation further in a coming lecture. 