# Train-test split & train-validation split
*by Max* 

This notebook shows how to perform the train-test split and the subsequent train-validation split.

We start by importing the required modules. The most important one is our own module in the src folder, which contains the train-test split function. We also set a random seed for the whole notebook.

In [1]:
# import the needed modules
import numpy as np
import pandas as pd

# import own modules from the src folder
import sys
sys.path.append('../src/')
from train_test_function import train_test_split_fields

# set a random seed
RSEED = 42
np.random.seed(RSEED)

The next step is to load the data.

In [2]:
# set the directory of the data 
OUTPUT_DIR = '../data'
# load the base data from the CSV files
df = pd.read_csv(f'{OUTPUT_DIR}/mean_band_perField_perDate.csv')

After this, you can fine-tune the data or perform the split right away. The split is done with the function train_test_split_fields from src/train_test_function.py. The same function is then applied to the training set to obtain the validation split.

In [3]:
# train-test split: 70 % of the data for training, 30 % for testing
df_train, df_test = train_test_split_fields(df, train_size=0.7, random_state=RSEED)
# validation split: split the training set again (70 % training, 30 % validation)
df_train_val, df_test_val = train_test_split_fields(df_train, train_size=0.7, random_state=RSEED)
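
The implementation of train_test_split_fields is not shown in this notebook. Its name and the per-field input file suggest that it splits on the level of fields rather than single rows, so that all observations of one field end up on the same side. The following is only a minimal sketch of that idea under this assumption, not the actual function: the helper split_by_group, the column name field_id and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, train_size=0.7, random_state=None):
    """Hypothetical group-wise split: all rows of a group land on the same side."""
    rng = np.random.default_rng(random_state)
    # shuffle the unique group ids, not the rows
    groups = rng.permutation(df[group_col].unique())
    n_train = int(round(train_size * len(groups)))
    # rows whose group id is among the first n_train shuffled ids go to train
    in_train = df[group_col].isin(groups[:n_train])
    return df[in_train], df[~in_train]


# toy data: 10 fields with 3 observations each ("field_id" is an assumed column name)
toy = pd.DataFrame({
    "field_id": np.repeat(np.arange(10), 3),
    "value": np.arange(30),
})
toy_train, toy_test = split_by_group(toy, "field_id", train_size=0.7, random_state=42)

# no field appears in both parts
shared_fields = set(toy_train["field_id"]) & set(toy_test["field_id"])
```

Because the fraction is applied to the groups and not to the rows, the row counts of such a split only approximate the requested train_size when the groups differ in size.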

In [4]:
print("---"*23)
print(f"The number of observations in the whole data set: {len(df)}")
print("---"*23)
print(f"The number of observations in the train data set: {len(df_train)}")
print(f"The number of observations in the test data set: {len(df_test)}")
print("---"*23)
print(f"The number of observations in the train-validation data set: {len(df_train_val)}")
print(f"The number of observations in the test-validation data set: {len(df_test_val)}")
print("---"*23)

---------------------------------------------------------------------
The number of observations in the whole data set: 4301227
---------------------------------------------------------------------
The number of observations in the train data set: 3011081
The number of observations in the test data set: 1290146
---------------------------------------------------------------------
The number of observations in the train-validation data set: 2108931
The number of observations in the test-validation data set: 902150
---------------------------------------------------------------------
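
As a quick sanity check, the printed counts can be turned into shares. The sketch below only reuses the numbers from the output above; the variable names are chosen for illustration. Since the validation split takes 70 % of the 70 % training part, the three final parts make up roughly 49 % (training), 21 % (validation) and 30 % (test) of all observations.

```python
# counts copied from the printed output above
n_total = 4301227
n_train = 3011081
n_test = 1290146
n_train_val = 2108931  # training part after the validation split
n_test_val = 902150    # validation part

# both splits partition their input completely
assert n_train + n_test == n_total
assert n_train_val + n_test_val == n_train

# shares relative to the whole data set
share_train_val = n_train_val / n_total
share_test_val = n_test_val / n_total
share_test = n_test / n_total

print(f"training:   {share_train_val:.1%}")
print(f"validation: {share_test_val:.1%}")
print(f"test:       {share_test:.1%}")
```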
