# 1.4. TRAINING, VALIDATION AND TEST DATA
## INTRODUCTION
Machine learning workflows commonly divide the available data into three sets: training, validation and test. Each set plays a different role in creating and evaluating a predictive model.

## TRAINING DATA SET
A **training data set is a data set of examples used to fit the parameters of the model**, such as the weights of the connections in an artificial neural network. In supervised learning, the model is fitted to the training data set with an optimization method such as gradient descent. The training data set usually consists of pairs of an input vector and the corresponding output vector, also known as the target or label. The goal of the training process is to minimize the error between the model’s predictions and the targets on the training data set.
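As a minimal sketch of what "fitting the parameters" means, the example below fits a one-variable linear model by gradient descent. The synthetic data, the learning rate and the number of steps are assumptions chosen for illustration, not values from this section.

```python
import numpy as np

# Synthetic training set (assumed for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=100)
y_train = 3.0 * X_train + 2.0 + rng.normal(0.0, 0.1, size=100)

# Parameters to be fitted: weight w and bias b
w, b = 0.0, 0.0
learning_rate = 0.1  # a hyperparameter, fixed by hand here

for _ in range(500):
    error = (w * X_train + b) - y_train
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * X_train)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

train_mse = np.mean((w * X_train + b - y_train) ** 2)
print(f"w={w:.2f}, b={b:.2f}, training MSE={train_mse:.4f}")
```

The fitted `w` and `b` should land close to the values used to generate the data, and the training error only says how well the model matches examples it has already seen.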

## VALIDATION DATA SET
A **validation data set is a data set of examples used to tune the hyperparameters of the model, such as the learning rate or the number of hidden layers**. It is kept separate from the training data set and is never used to fit the parameters of the model. Instead, each candidate model is evaluated on it to estimate how well that candidate generalizes to data it was not trained on. Comparing these validation scores is how the best model is selected among the candidates.
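The selection step can be sketched as follows. The choice of a k-nearest-neighbours classifier and the candidate values of `n_neighbors` are assumptions made for illustration. Any model with a hyperparameter would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the rest
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=8, stratify=y_rest)

# Candidate hyperparameter values (assumed for this sketch)
candidate_k = [1, 3, 5, 7, 9]
val_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)               # parameters come from the training set
    val_scores[k] = model.score(X_val, y_val)  # comparison uses the validation set

best_k = max(val_scores, key=val_scores.get)
print(val_scores)
print("best k:", best_k)
```

The test set is created but deliberately left untouched here, so the choice of `best_k` cannot leak into the final evaluation.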

## TEST DATA SET
A **test data set is a data set of examples used to assess the final performance of the model after it has been trained and validated**. It is kept separate from both the training and validation data sets and, ideally, is used only once, at the end of the machine learning process. Because no training or tuning decision depends on it, it provides an unbiased estimate of the model’s accuracy and error rate on unseen data. For that estimate to be meaningful, the test data set should reflect the real-world data that the model will encounter in practice.
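A final evaluation might look like the sketch below. It assumes the hyperparameters have already been chosen on a validation set (`C=1.0` is a hypothetical choice) and that the final model is refitted on the training and validation data combined. That refit is one common option, not something this section prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# X_dev holds the training and validation data; X_test stays untouched until now
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=8, stratify=y)

# Hypothetical final model with hyperparameters assumed to be already selected
final_model = LogisticRegression(C=1.0, max_iter=1000)
final_model.fit(X_dev, y_dev)

# Single evaluation on the held-out test set
test_accuracy = final_model.score(X_test, y_test)
test_error = 1.0 - test_accuracy
print(f"test accuracy: {test_accuracy:.3f}, test error: {test_error:.3f}")
```

If the result is then used to go back and change the model, the test set has effectively become a second validation set and its estimate is no longer unbiased.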

## CONCLUSION
Training, validation and test data sets are essential for building and evaluating machine learning models. They have different roles and functions in the machine learning process:
- **Training data set**: used to fit the parameters of the model.
- **Validation data set**: used to tune the hyperparameters of the model and select the best model.
- **Test data set**: used to test the final performance of the model on unseen data.

## HANDS-ON: SPLITTING DATA INTO TRAINING, VALIDATION AND TEST SETS

### 1. USING scikit-learn's `train_test_split`

In [2]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [3]:
# Load the iris dataset as a feature table (X) and a target column (y)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["target"])

In [4]:
# Set aside 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, shuffle=True, random_state=8)

# Apply the same function again to carve a validation set out of the remaining 80%
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
    test_size=0.25, random_state=8)  # 0.25 x 0.8 = 0.2 of the full data set


print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_train shape: {}".format(y_train.shape))
print("y_test shape: {}".format(y_test.shape))
print("X_val shape: {}".format(X_val.shape))
print("y_val shape: {}".format(y_val.shape))

X_train shape: (90, 4)
X_test shape: (30, 4)
y_train shape: (90, 1)
y_test shape: (30, 1)
X_val shape: (30, 4)
y_val shape: (30, 1)
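The two calls above can be wrapped in a small helper. The sketch below is a hypothetical convenience function, not part of scikit-learn. It also passes `stratify`, an assumption added here so that each split keeps roughly the same class proportions as the full data set.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper: stratified train/validation/test split.

    val_size and test_size are fractions of the full data set.
    """
    n_samples = len(X)
    # Convert fractions to absolute counts to avoid rescaling the fractions
    n_test = int(round(test_size * n_samples))
    n_val = int(round(val_size * n_samples))

    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state,
        stratify=y_rest)
    return X_train, X_val, X_test, y_train, y_val, y_test


X, y = load_iris(return_X_y=True)
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, random_state=8)

print(len(X_train), len(X_val), len(X_test))
print(Counter(y_val.tolist()))  # class counts in the validation set
```

Stratifying matters most for small or imbalanced data sets, where a purely random split can leave a class badly under-represented in one of the three parts.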


### 2. USING NumPy

In [6]:
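import numpy as np
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/'
                   'master/iris.csv')

# Shuffle the row positions, then cut them at 60% and 80% of the data
# (a fixed seed makes the split reproducible)
n = len(iris)
shuffled = np.random.default_rng(8).permutation(n)
train_idx, val_idx, test_idx = np.split(shuffled, [int(.6 * n), int(.8 * n)])
train, validation, test = iris.iloc[train_idx], iris.iloc[val_idx], iris.iloc[test_idx]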

In [7]:
# The last column holds the label; all other columns are features
# Assign the train split
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]
# Assign the test split
X_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]
# Assign the validation split
X_val = validation.iloc[:, :-1]
y_val = validation.iloc[:, -1]

# Print the sets data shapes
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_train shape: {}".format(y_train.shape))
print("y_test shape: {}".format(y_test.shape))
print("X_val shape: {}".format(X_val.shape))
print("y_val shape: {}".format(y_val.shape))

X_train shape: (90, 4)
X_test shape: (30, 4)
y_train shape: (90,)
y_test shape: (30,)
X_val shape: (30, 4)
y_val shape: (30,)
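A useful sanity check for any manual split is that the three parts are disjoint and together cover every row. The sketch below shows one way to verify this on a 60/20/20 split. It loads iris from scikit-learn instead of the CSV above so that it runs offline, and it cuts the shuffled frame with `iloc` slices. Both choices are assumptions made for illustration.

```python
from sklearn.datasets import load_iris

# Iris as a single DataFrame: four feature columns plus a "target" column
iris = load_iris(as_frame=True).frame

# Shuffle once, then cut at 60% and 80% of the rows
shuffled = iris.sample(frac=1, random_state=8)
n = len(shuffled)
train = shuffled.iloc[:int(0.6 * n)]
validation = shuffled.iloc[int(0.6 * n):int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]

# No row label may appear in more than one split
no_overlap = (
    train.index.intersection(validation.index).empty
    and train.index.intersection(test.index).empty
    and validation.index.intersection(test.index).empty
)

# Together the three splits must contain every original row exactly once
all_labels = train.index.union(validation.index).union(test.index)
full_coverage = len(all_labels) == n == len(train) + len(validation) + len(test)

print("disjoint:", no_overlap, "| covers all rows:", full_coverage)
print(len(train), len(validation), len(test))
```

A failed check usually points to overlapping slice boundaries or to shuffling the features and labels separately.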

