# Train-test split

In this tutorial we will explore how to split a dataset into training and test sets from scratch in Python.

In [1]:
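from random import randrange

def train_test_split(dataset, split):
    """Randomly split dataset into train and test lists; split is the train fraction."""
    train = list()
    # Target number of training rows; if this is fractional,
    # the loop below stops at the next whole number of rows.
    train_size = split * len(dataset)
    # Work on a copy so the caller's dataset is left unchanged.
    dataset_copy = list(dataset)
    
    # Move randomly chosen rows from the copy into train.
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    
    # Whatever remains in the copy is the test set.
    return train, dataset_copy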

## Example of how this could be used on a dataset


In [3]:
from random import seed
from random import randrange

In [4]:
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset, 0.6)
print("Train dataset is: ", train)
print("Test dataset is: ", test)

Train dataset is:  [[3], [2], [7], [1], [8], [9]]
Test dataset is:  [[4], [5], [6], [10]]
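
As a point of comparison, the same kind of split can be sketched by shuffling a copy of the rows and slicing it. This is an alternative to the pop-based loop, not the function defined in this tutorial: `shuffled_split` and its `rng_seed` parameter are hypothetical names, and because it draws random numbers differently it will generally not select the same rows for the same seed.

```python
from random import Random


def shuffled_split(dataset, split, rng_seed=None):
    """Shuffle a copy of dataset and cut it into train and test lists.

    Hypothetical helper for illustration; rng_seed is an assumed parameter.
    """
    # A private generator keeps the global random state untouched.
    rng = Random(rng_seed)
    rows = list(dataset)
    rng.shuffle(rows)
    # Round to a whole number of training rows.
    cut = round(split * len(rows))
    return rows[:cut], rows[cut:]


dataset = [[i] for i in range(1, 11)]
train, test = shuffled_split(dataset, 0.6, rng_seed=1)
print(len(train), len(test))  # 6 4
```

Slicing after a shuffle avoids repeated `pop` calls from the middle of a list, which may matter on larger datasets.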


# K-fold cross validation split

1. Divide the dataset into k subsets or folds of roughly equal size.
2. For each fold, train the model on the remaining k-1 folds and evaluate its performance on the current fold.
3. Calculate the performance metric for each fold, such as accuracy, precision, recall, or F1 score.
4. Average the performance metric across all folds to obtain an estimate of the model's generalization performance.

In [6]:
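def cross_validation_split(dataset, folds):
    """Randomly split dataset into `folds` folds of equal size."""
    dataset_split = list()
    # Work on a copy so the caller's dataset is left unchanged.
    dataset_copy = list(dataset)
    # Integer fold size: rows left over after filling every fold are not used.
    fold_size = len(dataset) // folds
    for i in range(folds):
        fold = list()
        # Draw rows without replacement until this fold is full.
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    
    return dataset_split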
    

## Example of this on a small dataset

In [10]:
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 3)
print(folds)

[[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]
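
The split is only step 1 of the procedure listed earlier. The remaining steps (train on k-1 folds, score on the held-out fold, average) can be sketched roughly as follows. Everything here beyond the folds themselves is an assumption made for illustration: `evaluate_folds`, `fit_mean`, and `mean_abs_error` are hypothetical names, the "model" simply predicts the mean of the last column, and mean absolute error stands in for whichever metric suits the real task.

```python
def evaluate_folds(folds, fit, score):
    """Run k-fold evaluation over pre-built folds (hypothetical helper)."""
    scores = []
    for i, held_out in enumerate(folds):
        # Step 2: train on every fold except the held-out one.
        train_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train_rows)
        # Step 3: score the model on the held-out fold.
        scores.append(score(model, held_out))
    # Step 4: average the per-fold scores.
    return scores, sum(scores) / len(scores)


# Assumed stand-in model: predict the mean of the last column.
def fit_mean(rows):
    return sum(row[-1] for row in rows) / len(rows)


# Assumed metric: mean absolute error (lower is better).
def mean_abs_error(prediction, rows):
    return sum(abs(row[-1] - prediction) for row in rows) / len(rows)


# Step 1 is already done: these are the folds printed above.
folds = [[[3], [2], [7]], [[1], [8], [9]], [[10], [6], [5]]]
scores, mean_score = evaluate_folds(folds, fit_mean, mean_abs_error)
print("Per-fold error:", scores)
print("Mean error:", mean_score)
```

Passing `fit` and `score` as functions keeps the loop independent of any particular model or metric.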


# Downsides of using K-fold cross validation

1. It is time-consuming to run.
2. It requires *k* different models to be trained and evaluated.
3. On a large dataset, both of these costs grow even larger.
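
To give a rough sense of scale for these costs, the sketch below counts how many models are trained and how many training rows are processed in total. It assumes equal fold sizes with leftover rows dropped, as in `cross_validation_split`; `kfold_training_cost` is a hypothetical helper, and row counts are only a proxy for real training time.

```python
def kfold_training_cost(n_rows, k):
    """Return (models trained, total training rows processed) for k-fold CV.

    Hypothetical helper; assumes equal folds and ignores leftover rows.
    """
    fold_size = n_rows // k
    # Each model trains on all folds except the one held out.
    rows_per_fit = fold_size * (k - 1)
    return k, k * rows_per_fit


print(kfold_training_cost(10, 3))     # (3, 18)
print(kfold_training_cost(1000, 10))  # (10, 9000)
```

By comparison, a single 60/40 train-test split of the 10-row dataset trains one model on 6 rows.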