# Using `train_val_test_split` for Rigorous Modeling
Thanks to [this article](https://towardsdatascience.com/automatic-hyperparameter-tuning-with-sklearn-gridsearchcv-and-randomizedsearchcv-e94f53a518ee) for providing some of the code used below for automatic hyperparameter tuning.

First, we load example data from sklearn into our X and y arrays, where X holds the features and y is the response (also called the target).

In [1]:
from sklearn import datasets
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

Next we split the data into training, validation, and test sets. We will use the training and validation sets to develop our model, and use the test set only at the very end to see how well the tuning worked.

We will stick with random sampling for this example, which we can specify by adding `sampler="random"` to the call to `train_val_test_split` (`'random'` is also the default option). There are many other samplers available -- check out [this documentation](https://github.com/JacksonBurns/astartes#implemented-sampling-algorithms) for a complete table.

In [2]:
from astartes import train_val_test_split
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    diabetes_X,
    diabetes_y,
    sampler="random",
    random_state=42,
)

By default, `train_val_test_split` will use a validation size of 0.1 (10% of the dataset), test size of 0.1, and train size of 0.8. You can override these with `val_size`, `test_size`, and `train_size` keyword arguments.
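To make the 80/10/10 proportions concrete, here is a minimal sketch of what a random three-way split does at the index level. It is not how `astartes` is implemented; `random_three_way_split` is a hypothetical helper, the rounding rule is an assumption, and the sample count of 442 is the size of the sklearn diabetes dataset.

```python
import random


def random_three_way_split(n_samples, train_size=0.8, val_size=0.1,
                           test_size=0.1, random_state=42):
    """Hypothetical helper: shuffle the indices, then cut them into three blocks."""
    if abs(train_size + val_size + test_size - 1.0) > 1e-9:
        raise ValueError("train_size, val_size, and test_size must sum to 1")
    indices = list(range(n_samples))
    # a seeded shuffle makes the split reproducible
    random.Random(random_state).shuffle(indices)
    # assumption: round to the nearest sample; a real library may round differently
    n_train = round(train_size * n_samples)
    n_val = round(val_size * n_samples)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # whatever is left over
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = random_three_way_split(442)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 354 44 44
```

The three index lists are disjoint and together cover every sample, which is the property that keeps the test set untouched while the model is being developed.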

Now we create a baseline model without tuning it for better performance on our data:

In [3]:
from sklearn.ensemble import RandomForestRegressor
rfr_baseline = RandomForestRegressor(n_estimators=5)
rfr_baseline.fit(X_train, y_train)

To judge how good this baseline model is, we use the `score` method, which returns the coefficient of determination for the inputs using the model as a predictor:

In [4]:
rfr_baseline.score(X_val, y_val)

0.36233207787448163
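For reference, the coefficient of determination is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below spells that out in plain Python; `r2_manual` is a hypothetical name used only for illustration, and in practice you would keep calling `score`.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
print(r2_manual(y_true, [1.0, 2.0, 3.0]))  # perfect predictions: 1.0
print(r2_manual(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean: 0.0
print(r2_manual(y_true, [3.0, 2.0, 1.0]))  # worse than the mean: -3.0
```

A score of 1.0 is the best possible, 0.0 matches a model that always predicts the mean, and negative values are possible for models that do worse than that.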

Now we try to find better model parameters by tuning the model, in this case with an automatic tuner:

In [5]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

rfr_tuned = RandomForestRegressor()

# candidate values for the search: 5, 10, ..., 45 trees and depths 2, 4, ..., 18 or unlimited
n_estimators = np.arange(5, 50, step=5)
max_depth = list(np.arange(2, 20, step=2)) + [None]

param_grid = {
    "n_estimators": n_estimators,
    "max_depth": max_depth,
}

random_cv = RandomizedSearchCV(
    rfr_tuned, param_grid, cv=3, n_iter=50, scoring="r2", n_jobs=-1, verbose=1, random_state=1
)
random_cv.fit(X_train, y_train)
random_cv.score(X_val, y_val)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


0.44237143491088005

In the real world, you might have done this by hand: change a hyperparameter a bit, retrain the model, and evaluate how it works on the validation set. Repeating that loop tunes the model to work _specifically_ on the validation set itself.
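A manual tuning loop of that kind might look roughly like the sketch below. `manual_tune` and its default list of candidate values are hypothetical, chosen only for illustration; the point is that the validation score alone decides which setting is kept.

```python
from sklearn.ensemble import RandomForestRegressor


def manual_tune(X_train, y_train, X_val, y_val, n_estimators_options=(5, 10, 20, 40)):
    """Hypothetical helper: choose n_estimators by validation-set R^2."""
    scores = {}
    for n in n_estimators_options:
        # fixed seed so that the only thing changing between runs is n_estimators
        model = RandomForestRegressor(n_estimators=n, random_state=0)
        model.fit(X_train, y_train)
        # the validation set is what picks the winner
        scores[n] = model.score(X_val, y_val)
    best_n = max(scores, key=scores.get)
    return best_n, scores
```

Calling `manual_tune(X_train, y_train, X_val, y_val)` with the arrays from the split above would return the best-scoring `n_estimators` along with every validation score. Because the winner is selected on `X_val`, its validation score is an optimistic estimate, which is why the test set is held back.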

Before celebrating our substantial improvement, let's make sure that the model performs well on the test set:

In [6]:
random_cv.score(X_test, y_test)

0.3867129657498729

The performance is lower on the test set than on the validation set, which is a sign that the results might not generalize. If future measurements were taken, we could not be sure how this trained model would perform on them. We should try to improve our model further or re-evaluate our modeling approach!

_Side Note:_
For completeness, we can also look at how the baseline model performs on the test set:

In [7]:
rfr_baseline.score(X_test, y_test)

0.2761820693760284

Tuning the hyperparameters did improve performance on the test set (from about 0.28 to about 0.39), but the tuned model's test score is still below its validation score (about 0.44), so the tuned model is not completely generalizable.
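The comparison can be laid out as simple arithmetic. The four scores below are copied from the cell outputs above; rerunning the notebook will likely give somewhat different numbers, since the baseline model is created without a fixed `random_state`.

```python
# R^2 scores copied from the outputs above
scores = {
    "baseline": {"val": 0.36233207787448163, "test": 0.2761820693760284},
    "tuned": {"val": 0.44237143491088005, "test": 0.3867129657498729},
}

# how much tuning helped on each held-out set
val_gain = scores["tuned"]["val"] - scores["baseline"]["val"]
test_gain = scores["tuned"]["test"] - scores["baseline"]["test"]

# how far each model's test score falls below its validation score
gaps = {name: s["val"] - s["test"] for name, s in scores.items()}

print(f"gain from tuning: val {val_gain:.3f}, test {test_gain:.3f}")
for name, gap in gaps.items():
    print(f"{name}: val - test = {gap:.3f}")
```

With these numbers, both models score lower on the test set than on the validation set, which is the gap to watch when judging whether a model generalizes.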