


1. Splitting the dataset lets us assess how well the trained model generalizes to new, unseen data.
2. The split can be done with the train_test_split function from scikit-learn.
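
To illustrate why held-out data is needed, here is a minimal sketch that compares accuracy on the training data with accuracy on data the model never saw. The synthetic dataset, the DecisionTreeClassifier, and all variable names are assumptions chosen for illustration; they are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 200 samples, 5 features.
# The label depends on the first feature plus random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree can memorize its training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Testing accuracy:  {test_acc:.2f}")
```

The training accuracy alone says little about generalization, because the model has already seen those samples. Only the score on the held-out samples estimates performance on new data.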



# Training Set
1. This is the data the model is actually trained on.
2. The model sees and learns from this data in order to predict outcomes or make the right decisions.
3. Typically, the training set makes up more than 60% of the total data available for the project.

In [None]:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Build a dummy feature matrix x: the integers 0-15
# reshaped into 8 rows (samples) and 2 columns (features)
x = np.arange(16).reshape((8, 2))

# y holds the numbers 0-7, standing in for the target variable
y = range(8)

# Split the dataset with a nominal 80-20 ratio, i.e.
# training set is 80% and testing set is 20% of the total data.
# With only 8 samples this rounds to 6 training and 2 testing rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)

Training set x:  [[ 0  1]
 [14 15]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
Training set y:  [0, 7, 2, 4, 3, 6]
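
The output above contains six training rows out of eight, which is 75% rather than 80%. A minimal sketch in plain NumPy shows where that rounding comes from. It shuffles indices and slices them by hand. It is not scikit-learn's implementation, and the seed and the *_manual variable names are illustrative assumptions, so the selected rows will differ from those printed above; only the set sizes match.

```python
import numpy as np

x = np.arange(16).reshape((8, 2))
y = np.arange(8)

# Shuffle the row indices with a fixed seed (the seed value is arbitrary)
rng = np.random.default_rng(42)
indices = rng.permutation(len(x))

# 0.8 * 8 = 6.4, which is rounded down to 6 training samples
n_train = int(0.8 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]

x_train_manual, y_train_manual = x[train_idx], y[train_idx]
x_test_manual, y_test_manual = x[test_idx], y[test_idx]

print("Training samples:", len(x_train_manual))                      # 6
print("Testing samples: ", len(x_test_manual))                       # 2
print("Actual training fraction:", len(x_train_manual) / len(x))     # 0.75
```

On realistic dataset sizes the rounding is negligible, but on toy examples like this one the actual ratio can differ noticeably from the requested one.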


# Testing Set
1. This dataset is independent of the training set.
2. It should follow a similar probability distribution of classes as the training set.
3. It is used as a benchmark to evaluate the model.
4. It is used only after training of the model is complete.

In [None]:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Build a dummy feature matrix x: the integers 0-15
# reshaped into 8 rows (samples) and 2 columns (features)
x = np.arange(16).reshape((8, 2))

# y holds the numbers 0-7, standing in for the target variable
y = range(8)

# Split the dataset with a nominal 80-20 ratio, i.e.
# training set is 80% and testing set is 20% of the total data.
# With only 8 samples this rounds to 6 training and 2 testing rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)

Testing set x:  [[ 2  3]
 [10 11]]
Testing set y:  [1, 5]
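
A purely random split does not guarantee that the testing set has the same class distribution as the training set, especially with imbalanced classes. One way to enforce it is the stratify parameter of train_test_split. The sketch below uses hypothetical imbalanced labels (80 samples of class 0 and 20 of class 1) that are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 samples of class 1
X = np.arange(200).reshape((100, 2))
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# For 0/1 labels, the mean is the share of class 1
print("Class-1 share, full data:", y.mean())
print("Class-1 share, training: ", y_train.mean())
print("Class-1 share, testing:  ", y_test.mean())
```

All three shares come out at 0.2 here, because the class counts divide evenly between the subsets. Stratification applies to classification targets; it would not work on the toy y above, where every value occurs only once.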


# Validation Set
1. The validation set is used to fine-tune the hyperparameters of the model.
2. It is considered part of the model training process.
3. The model is evaluated on this data but does not learn its parameters from it.
4. This gives a more objective evaluation than the training data during tuning. Because hyperparameters are chosen using it, the final unbiased estimate should still come from the testing set.

In [None]:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split

# Build a dummy feature matrix x: the integers 0-23
# reshaped into 8 rows (samples) and 3 columns (features)
x = np.arange(24).reshape((8, 3))

# y holds the numbers 0-7, standing in for the target variable
y = range(8)

# Split the dataset with a nominal 80-20 ratio, i.e.
# training set is 80% of the total data and the combined
# testing + validation set is 20% (here 6 and 2 rows)
x_train, x_Combine, y_train, y_Combine = train_test_split(x, y, train_size=0.8, random_state=42)

# Split the combined set 50-50, i.e.
# testing set is 50% of the combined set and
# validation set is 50% of the combined set (1 row each)
x_val, x_test, y_val, y_test = train_test_split(x_Combine, y_Combine, test_size=0.5, random_state=42)

# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
print("  ")

# Testing set
print("Testing set x: ",x_test)
print("Testing set y: ",y_test)
print("  ")

# Validation set
print("Validation set x: ",x_val)
print("Validation set y: ",y_val)

Training set x:  [[ 0  1  2]
 [21 22 23]
 [ 6  7  8]
 [12 13 14]
 [ 9 10 11]
 [18 19 20]]
Training set y:  [0, 7, 2, 4, 3, 6]
  
Testing set x:  [[15 16 17]]
Testing set y:  [5]
  
Validation set x:  [[3 4 5]]
Validation set y:  [1]
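
To show how the three sets work together, here is a minimal sketch of hyperparameter tuning with a validation set. The synthetic data, the KNeighborsClassifier, the candidate values for k, and the 60-20-20 ratio are all assumptions chosen for illustration; the original notebook does not train a model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): 300 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First split: 60% training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=42
)
# Second split: held-out data divided 50-50 into validation and testing
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# Fit on the training set only; compare candidates on the validation set
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# The testing set is touched once, after the hyperparameter is fixed
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print("Best k on validation set:", best_k)
print(f"Validation accuracy: {best_val_acc:.2f}")
print(f"Testing accuracy:    {test_acc:.2f}")
```

The validation accuracy of the chosen k tends to be slightly optimistic, since k was picked because it scored best there. The testing accuracy is the number to report.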
