# Data splitting

In this file we split the data into training, validation and test sets that are compatible with all models used in the project thesis, regardless of whether the models were written in Python or R.

This procedure is run a total of $4$ times, once for each trait in each of the datasets 180k and 70k.

The data will be partitioned into training/validation/test sets in the proportion 80/10/10 in each case.
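
As a point of reference, an 80/10/10 partition can be sketched with two successive calls to train_test_split. This is only a minimal illustration under stated assumptions, not the split used for the thesis results: the ID series example_ids and the variable names are hypothetical stand-ins for the real ringnr column, and integer set sizes are passed so that the proportions come out exact.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real ringnr column: 1000 unique IDs
example_ids = pd.Series([f"R{i:04d}" for i in range(1000)], name="ringnr")

# Use integer set sizes (10% each) to avoid floating-point rounding surprises
n_holdout = len(example_ids) // 10

# Step 1: hold out 10% of all IDs as the test set
train_val_ids, test_ids = train_test_split(
    example_ids, test_size=n_holdout, random_state=42
)

# Step 2: take another 10% of all IDs from the remainder as the validation set
train_ids, val_ids = train_test_split(
    train_val_ids, test_size=n_holdout, random_state=42
)

split_sizes = (len(train_ids), len(val_ids), len(test_ids))
print(split_sizes)
```

Splitting on IDs rather than on rows keeps all records of one individual in the same set, provided the IDs are unique.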

Import the libraries used in this file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

Import the processed data (two-step method), from which environmental effects have been removed.

In [15]:
data = pd.read_feather("C:/Users/gard_/Documents/MasterThesis/ProjectThesis/MyPipeline/Data/Processed/wingBV.feather")

In [16]:
ringnrs = data["ringnr"]
print("Check the dim of ringnrs: \n",ringnrs.shape, "\n")

# Are there any duplicated ringnrs?
print("Any duplicated ringnrs?")
ringnrs.duplicated().any()

Check the dim of ringnrs: 
 (1912,) 

Any duplicated ringnrs?


False

Next, we create 10 CV folds for the training and validation sets. The folds are built from the ringnrs series loaded above.

In [18]:
# Initialize KFold with the desired number of splits (10 folds)
kf = KFold(n_splits = 10, shuffle = True, random_state = 42)

# Create an empty list to hold the different folds with actual data values
folds = []

# Loop through the kf.split method which yields train and test indices
for fold_index, (train_indices, test_indices) in enumerate(kf.split(ringnrs)):
    #print(f"Fold {fold_index+1}:")
    
    # Select the ringnrs belonging to the training and test parts of this fold
    train_data = ringnrs.iloc[train_indices]
    test_data = ringnrs.iloc[test_indices]
    
    # Append the fold containing training and testing data
    folds.append({
        'train': train_data,
        'test': test_data
    })

# Convert the list of folds into a numpy array for convenience
folds_array = np.array(folds, dtype=object)

# Print the folds array to inspect the data stored
#print(folds_array)
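
A quick sanity check on such folds is that the test parts are mutually disjoint and together cover every ringnr exactly once. The sketch below illustrates this on a hypothetical series of 1912 synthetic IDs (the same length as ringnrs above); example_ids, test_sizes and seen_in_test are illustrative names, not part of the original pipeline.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for ringnrs: 1912 unique IDs
example_ids = pd.Series([f"R{i:04d}" for i in range(1912)], name="ringnr")

kf = KFold(n_splits=10, shuffle=True, random_state=42)

test_sizes = []
seen_in_test = []
for train_idx, test_idx in kf.split(example_ids):
    # Within one fold, the training and test indices must not overlap
    assert set(train_idx).isdisjoint(test_idx)
    test_sizes.append(len(test_idx))
    seen_in_test.extend(example_ids.iloc[test_idx])

# Across all folds, every ID should land in a test part exactly once
each_id_tested_once = sorted(seen_in_test) == sorted(example_ids)
print(test_sizes, each_id_tested_once)
```

Since 1912 is not divisible by 10, the folds cannot all have the same size; KFold gives the first 1912 % 10 = 2 folds one extra sample.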

Now we want to save the training/validation folds in a CSV file so that they can be read from both Python and R. To do this, we first convert them to a pd.DataFrame.

In [19]:
# Prepare the data to be saved in a structured format (list of dictionaries)
csv_data = []
for i, fold in enumerate(folds_array):
    for ringnr in fold['train']:
        csv_data.append({'Fold': i + 1, 'Set': 'train', 'ringnr': ringnr})
    for ringnr in fold['test']:
        csv_data.append({'Fold': i + 1, 'Set': 'test', 'ringnr': ringnr})

# Convert to DataFrame and save as CSV
df = pd.DataFrame(csv_data)
# df.to_csv('cv_folds.csv', index=False)
#print(df)

In [20]:
# Save CSV-file
# Train/val
df.to_csv('cv_folds_wing_180k.csv', index=False)

# Test
#ring_test.to_csv('ringnr_test.csv', index = False)
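
To use the saved folds in a model script, the long-format file (columns Fold, Set, ringnr) can be read back and filtered per fold. The following is a minimal sketch under assumptions: the in-memory csv_text is a hypothetical miniature of the real file, and get_fold is an illustrative helper, not a function from the project.

```python
import io
import pandas as pd

# Hypothetical miniature of the saved file, with the same three columns
csv_text = """Fold,Set,ringnr
1,train,A2
1,train,A3
1,test,A1
2,train,A1
2,train,A3
2,test,A2
"""

# For the real file this would be pd.read_csv('cv_folds_wing_180k.csv')
cv_folds = pd.read_csv(io.StringIO(csv_text))


def get_fold(cv_folds, fold):
    """Return (train_ids, test_ids) for the given 1-based fold number."""
    in_fold = cv_folds[cv_folds["Fold"] == fold]
    train_ids = in_fold.loc[in_fold["Set"] == "train", "ringnr"].tolist()
    test_ids = in_fold.loc[in_fold["Set"] == "test", "ringnr"].tolist()
    return train_ids, test_ids


train_ids, test_ids = get_fold(cv_folds, fold=1)
print(train_ids, test_ids)
```

The returned ID lists can then be used to select the matching rows of the phenotype or genotype data, for example with data[data["ringnr"].isin(train_ids)].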