# Train and test datasets

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is called the training dataset. The second subset is not used to train the model. Instead, its input elements are provided to the model, predictions are made, and those predictions are compared with the expected values. This second subset is called the test dataset.
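
The steps above can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook: the synthetic dataset from `make_classification`, the choice of `LogisticRegression`, and the 70/30 split are all assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# Hold out 30% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit on the training dataset only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict from the test inputs and compare with the expected values.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test dataset: {test_accuracy:.3f}")
```

The score printed at the end is the estimate of how the model might perform on data it was not trained on.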

#### The train-test procedure is appropriate when there is a sufficiently large dataset available.

The idea of “sufficiently large” is specific to each predictive modeling problem. It means that there is enough data to split the dataset into train and test datasets, and that each of them is a suitable representation of the problem domain. This requires that the original dataset is also a suitable representation of the problem domain.

Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance.

Some models are very costly to train, and in that case, repeated evaluation used in other procedures is intractable. An example might be deep neural network models. In this case, the train-test procedure is commonly used.

Samples from the original dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.

### The procedure to split train and test datasets

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or test dataset. For example, a test size of 0.3 means that 30% of the rows go to the test dataset.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

* Computational cost in training the model.
* Computational cost in evaluating the model.
* Training set representativeness.
* Test set representativeness.

The most common split percentages are:

* Train: 80%, Test: 20%
* Train: 67%, Test: 33%
* Train: 50%, Test: 50%
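
As a rough sketch of what these percentages mean in rows, the snippet below applies each of them to a placeholder array of 1000 rows. The array and the variable names (`data`, `split_sizes`) are hypothetical and serve only to show how `test_size` maps to subset sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 placeholder rows (assumption for illustration).
data = np.arange(1000).reshape(-1, 1)

split_sizes = {}
for test_fraction in (0.20, 0.33, 0.50):
    train_rows, test_rows = train_test_split(
        data, test_size=test_fraction, random_state=1
    )
    # Record how many rows ended up in each subset.
    split_sizes[test_fraction] = (len(train_rows), len(test_rows))
    print(f"test_size={test_fraction:.2f}: "
          f"train={len(train_rows)}, test={len(test_rows)}")
```

`test_size` also accepts an int, in which case scikit-learn treats it as an absolute number of test samples rather than a fraction.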

## Practice

To split data into train and test datasets in Python, we can use the **scikit-learn** library.

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

Documentation of the function **train_test_split**: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('gender_classification.csv')
type(df)

pandas.core.frame.DataFrame

In [3]:
df.head()

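| | long_hair | forehead_width_cm | forehead_height_cm | nose_wide | nose_long | lips_thin | distance_nose_to_lip_long | gender |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11.8 | 6.1 | 1 | 0 | 1 | 1 | Male |
| 1 | 0 | 14.0 | 5.4 | 0 | 0 | 1 | 0 | Female |
| 2 | 0 | 11.8 | 6.3 | 1 | 1 | 1 | 1 | Male |
| 3 | 0 | 14.4 | 6.1 | 0 | 1 | 1 | 1 | Male |
| 4 | 1 | 13.5 | 5.9 | 0 | 0 | 0 | 0 | Female |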


In [4]:
train, test = train_test_split(df, test_size=0.3, train_size=0.7, random_state=42)

We can split the original dataset into input (X) and output (y) columns, then call the function passing both arrays and have them split appropriately into train and test subsets.

In [14]:
# Output column (target)
Y = df['gender']
# Input columns: everything except the target
X = df.drop(columns='gender')
X.head()

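| | long_hair | forehead_width_cm | forehead_height_cm | nose_wide | nose_long | lips_thin | distance_nose_to_lip_long |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 11.8 | 6.1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 14.0 | 5.4 | 0 | 0 | 1 | 0 |
| 2 | 0 | 11.8 | 6.3 | 1 | 1 | 1 | 1 |
| 3 | 0 | 14.4 | 6.1 | 0 | 1 | 1 | 1 |
| 4 | 1 | 13.5 | 5.9 | 0 | 0 | 0 | 0 |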


In [18]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, train_size=0.7, random_state=42)
type(x_train)

pandas.core.frame.DataFrame

To know the amount of data in each dataset, we can do:

In [19]:
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(3500, 7) (1501, 7) (3500,) (1501,)


### Repeatable Train-Test Splits

Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset.
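
A possible way to apply this when comparing algorithms is to create the split once, with a fixed seed, and reuse it for every model. The sketch below assumes a synthetic dataset and two arbitrarily chosen classifiers (`LogisticRegression` and `DecisionTreeClassifier`); neither comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

# One split, fixed by the seed, shared by every model being compared.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    # Every model is fit and evaluated on exactly the same rows.
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Because both models see the same train and test rows, a difference in score reflects the models rather than a lucky or unlucky split.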


#### Pseudorandom

We do not need true randomness in machine learning. Instead, we can use pseudorandomness. A pseudorandom sequence is a sequence of numbers that looks random but is generated by a deterministic process.

Shuffling data and initializing coefficients with random values both rely on pseudorandom number generators. A generator is usually exposed as a function that returns a random number each time it is called.

The numbers are generated in a sequence. The sequence is deterministic and is seeded with an initial number. If you do not explicitly seed the pseudorandom number generator, then it may use the current system time in seconds or milliseconds as the seed.

The value of the seed does not matter. Choose anything you wish. What does matter is that the same seeding of the process will result in the same sequence of random numbers.
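
The effect of seeding can be seen with Python's built-in `random` module. This is only a small demonstration; the seed values 7 and 8 are arbitrary.

```python
import random

# Seeding twice with the same value replays the same sequence.
random.seed(7)
first_run = [random.random() for _ in range(3)]

random.seed(7)
second_run = [random.random() for _ in range(3)]

# A different seed starts a different sequence.
random.seed(8)
other_run = [random.random() for _ in range(3)]

print(first_run == second_run)
print(first_run == other_run)
```

The first comparison holds because the generator was reset to the same starting point before each run.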

The parameter **random_state** controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

**None (default)**: Use the global random state instance from numpy.random. Calling the function multiple times will reuse the same instance and will produce different results.

An **integer**: Use a new random number generator seeded by the given integer. Using an int will produce the same results across different calls. However, it may be worthwhile to check that your results are stable across a number of distinct random seeds.
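
The behavior of an integer **random_state** can be checked directly. The sketch below uses a hypothetical array of 20 row indices, and the seeds 42, 0, 1, and 2 are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 placeholder row indices (assumption for illustration).
rows = np.arange(20)

# Same integer seed: both calls select the same test rows.
_, test_a = train_test_split(rows, test_size=0.25, random_state=42)
_, test_b = train_test_split(rows, test_size=0.25, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)

# Several distinct seeds, to see how much the split varies between them.
test_sets_by_seed = {
    seed: sorted(
        train_test_split(rows, test_size=0.25, random_state=seed)[1].tolist()
    )
    for seed in (0, 1, 2)
}

print(same_seed_matches)
print(test_sets_by_seed)
```

Repeating an evaluation over a handful of seeds like this is one simple way to check that a result does not depend on a single split.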

### Stratified Train-Test Splits

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the **stratify** argument to the y component of the original dataset. The train_test_split() function then ensures that both the train and test sets have the same proportion of examples in each class as the provided y array.



In [26]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1, stratify = Y)

In [27]:
from collections import Counter
print(Counter(Y))
print(Counter(y_train))
print(Counter(y_test))

Counter({'Female': 2501, 'Male': 2500})
Counter({'Female': 1750, 'Male': 1750})
Counter({'Female': 751, 'Male': 750})


### Example with an imbalanced dataset

In [32]:
# split imbalanced dataset into train and test sets without stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify = y)
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})


## Difference between Test and Validation Datasets

**Training Dataset**: The sample of data used to fit the model.

**Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

**Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
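
One common way to obtain all three datasets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split of 1000 placeholder rows; the proportions and variable names are illustrative choices, not a recommendation from the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 placeholder rows and labels (assumption for illustration).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Step 1: set aside 20% of the rows as the test dataset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Step 2: take 25% of the remaining 80% as the validation dataset,
# which is 20% of the original rows.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_train), len(X_val), len(X_test))
```

The validation dataset is used while tuning hyperparameters, and the test dataset is touched only once, for the final evaluation.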

A separate validation dataset is not the only option. A popular alternative is to tune model hyperparameters with k-fold cross-validation on the training dataset.

In that case, there is no separate “validation dataset”: each fold of the training dataset serves as validation data in turn.
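
A minimal sketch of this approach uses scikit-learn's `GridSearchCV`. The synthetic dataset, the `DecisionTreeClassifier`, and the `max_depth` grid are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

# The test dataset is held out before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# 5-fold cross-validation on the training dataset replaces a fixed
# validation dataset: each fold takes a turn as the validation data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")

# score() uses the best model, refit on the whole training dataset.
test_accuracy = search.score(X_test, y_test)
print(f"Accuracy on the test dataset: {test_accuracy:.3f}")
```

The test dataset plays no part in choosing `max_depth`, so the final score remains an evaluation on data the tuning process never saw.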