# Training, Validation, and Test Sets

**Training:**
Data that is used to directly train the model. 

**Validation:**
Data that is used to select between different models, or between different values of hyperparameters for the same model.

**Test:**
Data that is used only at the very end, once a final model has been chosen. This will give us the best estimate of how well our model will perform on new, unseen data.

In this notebook you will gain some experience with splitting a data set into *training* and *validation* sets. We will assume that a *test* data set has already been removed. Some of the things you will explore: 

  * splitting data without shuffling
  * splitting data with shuffling
  * how to use the `random_state` parameter

### Create some feature data

Here we will create a simple feature array with 10 samples and 2 features. The rows represent *samples* and the columns represent the *features*.

In [None]:
import numpy as np

X = np.arange(20).reshape(10, 2)  # np.arange already returns a NumPy array
X

### Create some target data

We now need to create some target data to correspond to our feature data. Remember that the first element of our target array corresponds to the outcome/target/answer for the first sample, etc. 

In [None]:
y = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0])
y

### Put data into dataframe format

We can use Pandas to put the data into the dataframe format. On occasion, this may be the more natural way to store the data. For now, it will allow you an easy way to see what's happening when we do or don't shuffle the data. 

In [None]:
import pandas as pd

df = pd.DataFrame({"feature_1":X[:, 0], "feature_2":X[:, 1], "target":y})
df

### Split the data without shuffling

We will take advantage of many of the *convenience* functions that come with the `sklearn` package. Because splitting data into training and test sets is an essential part of any supervised learning problem, `sklearn` contains a function that allows us to do this easily: `train_test_split`. 

In [None]:
# import the function from the sklearn package so that we can use it
from sklearn.model_selection import train_test_split  

# given two input arrays, 'train_test_split' returns 4 arrays, in this order:
#   feature data for training, feature data for validation,
#   target data for training, target data for validation

# NOTE: the order is important

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.75, shuffle=False)

#### Take a look at the split data

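By default, `train_test_split` uses 75% of the data for training and 25% for testing; above we requested the same proportions explicitly with `train_size=0.75`. The exact percentages depend on the size of your dataset, because each set must contain a whole number of samples. With our 10 samples a 75/25 split is not possible, so we get a 70/30 split (7 training samples and 3 validation samples). Similarly, if you have only 3 samples in total, you will get a 66.7/33.3 split.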
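
To see how the split sizes round for small datasets, the sketch below calls `train_test_split` with its default sizes on 10 samples and on 3 samples. The helper name `default_split_sizes` is made up for this illustration and is not part of `sklearn`.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def default_split_sizes(n_samples):
    """Return (n_train, n_valid) produced by the default 75/25 request."""
    data = np.arange(n_samples)  # dummy 1-D data; only its length matters here
    train, valid = train_test_split(data, shuffle=False)
    return len(train), len(valid)


sizes_10 = default_split_sizes(10)
sizes_3 = default_split_sizes(3)
print(sizes_10)  # (7, 3): a 70/30 split
print(sizes_3)   # (2, 1): a 66.7/33.3 split
```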

Take a look at the `X_train` array: 

In [None]:
print(f"The training data is a 2 dimensional array: \n {X_train}\n")

# .shape gives us the number of rows and columns
print(f"The shape of the training data array is {X_train.shape}.")  
print(f"The shape of the complete data array is {X.shape}.\n")  

# calculate percent of data included in training data
pct = (X_train.shape[0] / X.shape[0]) * 100  # .shape[0] selects the number of rows/samples
print(f"The percent of data contained in the training data array is {pct}.")

> Note how the data was split in order; that is, the training set is made up of the first elements of the original data. Use `df` above to verify. 

Since we are doing supervised learning, we need the correct target values for our training data. Take a look at `y_train` to verify that it contains the targets corresponding to the samples in `X_train`. Use `df` above to verify. 

In [None]:
y_train # this corresponds to the first 7 rows of the dataframe 'df' we created above
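
Instead of checking by eye, you can confirm that an unshuffled split is just a slice of the original arrays. The following is a minimal, self-contained sketch that rebuilds the same toy data; the names `same_features` and `same_targets` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.75, shuffle=False
)

n_train = len(X_train)  # 7 of the 10 samples

# without shuffling, the training set is the first n_train rows
# and the validation set is everything after them
same_features = np.array_equal(X_train, X[:n_train]) and np.array_equal(X_valid, X[n_train:])
same_targets = np.array_equal(y_train, y[:n_train]) and np.array_equal(y_valid, y[n_train:])
print(n_train, same_features, same_targets)
```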

### Split the data with shuffling

The default setting for `train_test_split` is to first *shuffle* the data and then do the split. We will come back to why this is important; for now let's focus on the basics. *Shuffling* the data means that the *rows* (which represent the samples) are randomly re-ordered. 

To see how this works, first reprint the training array from the unshuffled split using the code cell below:

In [None]:
X_train

In [None]:
from sklearn.model_selection import train_test_split  

# Remember that shuffle=True is the default so we don't normally need to specify this
# we include it here for clarity
X_train_new, X_valid_new, y_train_new, y_valid_new = train_test_split(X, y, shuffle = True)

#### Take a look at the split data after shuffling

Take a look at the `X_train_new` array: 
-  What percent of the original data is in your training set?
    - *Answer*: still 70%
-  Note the order of the split data

In [None]:
X_train_new

Make sure `y_train_new` contains the correct targets for your shuffled training data. Use `df` above to verify. 

In [None]:
y_train_new # in the original dataframe 'df', the row [0, 1] should have target = 1, the row [10, 11] should have target 1, etc
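
If you would rather check the pairing programmatically, here is one possible sketch. It relies on a property of this toy array only: row `i` of `X` is `[2*i, 2*i + 1]`, so the first column divided by 2 recovers each sample's original row index. The names `train_rows` and `targets_match` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0])

X_train_new, X_valid_new, y_train_new, y_valid_new = train_test_split(X, y, shuffle=True)

# recover the original row index of every shuffled training sample
train_rows = X_train_new[:, 0] // 2

# look up the original targets for those rows and compare with y_train_new
targets_match = np.array_equal(y[train_rows], y_train_new)
print(train_rows)
print(targets_match)  # True: features and targets were shuffled together
```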

#### Shuffling is random 

Rerun the last three code cells and note how each time you get a slightly different training set. This is because `train_test_split` **randomly** re-orders the data. 
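
Conceptually, a shuffled split can be sketched with NumPy alone: draw a random permutation of the row indices, apply it to both arrays, then slice. This is only an illustration of the idea and not necessarily how `sklearn` implements it internally; the `*_sketch` names are made up for this example.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0])

rng = np.random.default_rng()    # unseeded, so each run gives a new order
order = rng.permutation(len(X))  # a random re-ordering of the row indices 0..9

# index features and targets with the SAME permutation so rows stay paired
X_shuffled, y_shuffled = X[order], y[order]

n_train = 7
X_train_sketch, X_valid_sketch = X_shuffled[:n_train], X_shuffled[n_train:]
y_train_sketch, y_valid_sketch = y_shuffled[:n_train], y_shuffled[n_train:]
print(X_train_sketch)
print(y_train_sketch)
```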

#### Controlled randomness

While we want the shuffling procedure to randomly re-order our data, this can cause unnecessary problems when we are debugging our code or need reproducible results: every time you run your code the training set will change, which means your machine learning model will change, etc. To avoid this problem, we can set the `random_state` parameter. This parameter still gives a random re-ordering, but the same re-ordering will occur every time you run your code. 

Let's see this in action. 

In [None]:
from sklearn.model_selection import train_test_split  

# Remember that shuffle=True is the default so we don't normally need to specify this
# we include it here for increased clarity
X_train_rs, X_valid_rs, y_train_rs, y_valid_rs = train_test_split(X, y, shuffle = True, random_state = 12)

print(X_train_rs)
print(y_train_rs)

Run the above code a few times to convince yourself that we get the same re-ordering every time we run the code. 

**NOTE**: It doesn't matter which number you use with the `random_state` parameter. Try changing the `12` above to some other integer and rerun the code a few times. 

  - Did you get the same ordering as you did with `random_state = 12`?
      - *Answer*: You should see a different order
  - Did you get the same ordering every time you ran the code with a new number?
      - *Answer*: You should see the same order if you keep `random_state` fixed
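
Rather than comparing printed arrays by eye, the reproducibility can also be checked in code. The sketch below makes two independent calls with the same `random_state` and compares the four returned arrays pairwise; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0])

# two independent calls with the same seed
first = train_test_split(X, y, random_state=12)
second = train_test_split(X, y, random_state=12)

# each call returns [X_train, X_valid, y_train, y_valid]; compare them pairwise
identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print(identical)  # True
```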

### Why does this matter?

To get a feel for why we would want to shuffle our training data, we will build a kNN classification model for the Iris dataset. For now, all we need to know is that this dataset has 4 features, with the target being which species of Iris each sample is. The task is to build a model that uses the 4 features to predict the species. (You'll get more details about this dataset in a later notebook.)

In [None]:
# load the data 
from sklearn.datasets import load_iris

iris_data = load_iris()

X_iris = iris_data.data
y_iris = iris_data.target

#### Split the data without shuffling

In [None]:
from sklearn.model_selection import train_test_split 

X_iris_train, X_iris_valid, y_iris_train, y_iris_valid = train_test_split(X_iris, y_iris, shuffle = False, train_size = 2/3)

#### Build kNN model with `k=3`

In [None]:
from sklearn.neighbors import KNeighborsClassifier  

iris_clf = KNeighborsClassifier(n_neighbors=3)   
iris_clf.fit(X_iris_train, y_iris_train)   

iris_acc_no_shuffle = iris_clf.score(X_iris_valid, y_iris_valid)

print("Iris validation set accuracy without shuffling: {:.2f}".format(iris_acc_no_shuffle))  

#### Split the data WITH shuffling

In [None]:
from sklearn.model_selection import train_test_split 

X_iris_shuffle_train, X_iris_shuffle_valid, y_iris_shuffle_train, y_iris_shuffle_valid = train_test_split(X_iris, y_iris, shuffle = True, train_size = 2/3, random_state=31)

#### Build kNN model with `k=3`

In [None]:
from sklearn.neighbors import KNeighborsClassifier  

iris_shuffle_clf = KNeighborsClassifier(n_neighbors=3)   
iris_shuffle_clf.fit(X_iris_shuffle_train, y_iris_shuffle_train)   

# score the model that was trained on the SHUFFLED data
iris_acc_shuffle = iris_shuffle_clf.score(X_iris_shuffle_valid, y_iris_shuffle_valid)

print("Iris validation set accuracy with shuffling: {:.2f}".format(iris_acc_shuffle))  
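
For a side-by-side comparison, the two experiments can be wrapped in one small function. This is a condensed sketch of the cells above, not a new method; the helper name `knn_validation_accuracy` is made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_validation_accuracy(shuffle, random_state=None):
    """Fit a 3-nearest-neighbour model on 2/3 of Iris and score it on the rest."""
    X_iris, y_iris = load_iris(return_X_y=True)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_iris, y_iris, train_size=2/3, shuffle=shuffle, random_state=random_state
    )
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    return clf.score(X_va, y_va)


acc_no_shuffle = knn_validation_accuracy(shuffle=False)
acc_shuffle = knn_validation_accuracy(shuffle=True, random_state=31)
print(f"without shuffling: {acc_no_shuffle:.2f}")
print(f"with shuffling:    {acc_shuffle:.2f}")
```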

#### Challenge

See if you can figure out why there is such a difference in accuracy when all we did was shuffle the data before splitting it into training and validation sets. (The code cells below may help you see what's going on more easily, although you can also just look at the NumPy arrays yourself to figure it out.) 

Hint: 
 - When we do not shuffle the data, what does the data look like that is being used to train the kNN classifier? 
     - *Answer*: In the 3 code cells below, note how the complete data has 3 classes (0, 1, 2) but our training data contains only 2 classes (0, 1). This is because the complete data comes with the classes in order: the first 50 rows correspond to class 0, the next 50 to class 1, and the last 50 to class 2. Without shuffling, the first 100 rows became the training set, so we trained a model by showing it only 2 of the classes. The model therefore only knows about 2 classes and has learned to predict whether a new data point is class 0 or class 1. We then asked the model to make predictions on our validation data, which contains only samples of class 2, a class the model knows nothing about because it never saw an example of it. It gets an accuracy of 0 because it can only predict class 0 or 1, and those predictions are always incorrect when every validation sample belongs to class 2. (See the target arrays below for the classes contained in the complete, training, and validation data.)


In [None]:
# target array for the complete data
y_iris

In [None]:
# training target array for the UNSHUFFLED data
y_iris_train

In [None]:
# validation target array for the UNSHUFFLED data
y_iris_valid

 - How is this different to the shuffled training data?
     - *Answer*: In the 3 code cells below, note how the complete, training, and validation data sets all contain the 3 classes (0, 1, 2). This means that when we trained the model, we showed it examples of all 3 classes, so when we asked it to make predictions on the validation data it did much better than before because it could now predict class 0, 1, or 2. 
 

In [None]:
# target array for the complete data
y_iris

In [None]:
# training target array for the SHUFFLED data
y_iris_shuffle_train

In [None]:
# validation target array for the SHUFFLED data
y_iris_shuffle_valid
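
Scanning long target arrays by eye is error-prone, so it can help to count how many samples of each class land in each set. The sketch below repeats the two Iris splits from above and summarizes them with `np.unique`; the shortened variable names are only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# same two splits as above: unshuffled, then shuffled with random_state=31
_, _, y_tr_plain, y_va_plain = train_test_split(
    X_iris, y_iris, shuffle=False, train_size=2/3
)
_, _, y_tr_shuf, y_va_shuf = train_test_split(
    X_iris, y_iris, shuffle=True, train_size=2/3, random_state=31
)

for name, targets in [
    ("unshuffled train", y_tr_plain),
    ("unshuffled valid", y_va_plain),
    ("shuffled train", y_tr_shuf),
    ("shuffled valid", y_va_shuf),
]:
    # classes present in this set and how many samples each one has
    classes, counts = np.unique(targets, return_counts=True)
    print(f"{name:>16}: {dict(zip(classes.tolist(), counts.tolist()))}")
```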

- Should there be any connection between the *train*, *test*, and corresponding *real world* data?
     - *Answer*: Ideally we want the train and test data to be as representative of the real-world data as possible so that our model can generalize. In the UNSHUFFLED example above, we saw what can happen when the training data and the validation data aren't good representations: our model does not generalize well and performed poorly on the validation set. Beyond this, we also don't want any sort of bias in our train or test data. For example, if we want to build a model that predicts house prices for Windsor and our train and test data only contain houses that cost more than a million dollars, then our model will not perform well when asked to make a prediction for a typical house in Windsor. 

### Creating all 3 data sets

In [None]:
from sklearn.model_selection import train_test_split

X_train_valid, X_test, y_train_valid, y_test = train_test_split(X_iris, y_iris, shuffle = True, test_size = 0.15, random_state = 12)

In [None]:
X_train_valid.shape

In [None]:
X_test.shape

In [None]:
# fraction of the original 150 samples held out as the test set
23 / 150

In [None]:
# 20% validation (of the original 150 samples) = 0.2 * 150 = 30 samples
# That is about 24% of the 127 samples in X_train_valid, so we round to test_size = 0.25
(0.2 * 150) / 127

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, shuffle = True, test_size = 0.25, random_state = 12)

In [None]:
X_valid.shape
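
To tie the arithmetic together, the following sketch repeats the two-step split and reports the size of each set as a share of the original 150 samples. It also shows one alternative, passing an integer `test_size`, which `train_test_split` treats as an absolute number of samples; the `*_30` names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

# step 1: hold out 15% of the 150 samples as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X_iris, y_iris, test_size=0.15, random_state=12
)

# step 2: split the remainder into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.25, random_state=12
)

n_total = len(X_iris)
for name, part in [("train", X_train), ("valid", X_valid), ("test", X_test)]:
    print(f"{name}: {len(part)} samples ({len(part) / n_total:.1%} of the original data)")

# an integer test_size is an absolute count: exactly 30 validation samples
X_train_30, X_valid_30, y_train_30, y_valid_30 = train_test_split(
    X_train_valid, y_train_valid, test_size=30, random_state=12
)
print(len(X_train_30), len(X_valid_30))
```

Because fractional sizes are rounded to whole samples, `test_size=0.25` gives a validation set slightly larger than the 30 samples targeted in the comment above; the integer form is one way to hit the count exactly.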