# Create CV folds for simulated data

In this file we split the data into training, validation and test sets in a format that is compatible with all models used in the master thesis, regardless of whether they were written in Python or R.

This file is used for all $10$ simulation architectures (Chiara), as defined in the handwritten notes.

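The data are partitioned with 10-fold cross-validation: in each fold, 90% of the individuals form the training set and the remaining 10% form the test set. This file stores only the train/test assignment of each fold.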
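As a rough check of the 90/10 proportion, the sketch below reproduces the fold sizes that scikit-learn's `KFold` documents for `n_splits = 10`: the first `n_samples % n_splits` folds receive one extra test sample. The helper name `kfold_sizes` is hypothetical, and `n_samples = 3032` is taken from the shape of `ringnrs` printed in the output of this notebook.

```python
def kfold_sizes(n_samples, n_splits=10):
    """Return (train_size, test_size) per fold, following KFold's documented sizing rule."""
    base, extra = divmod(n_samples, n_splits)
    sizes = []
    for fold in range(n_splits):
        # The first `extra` folds hold one more test sample than the rest
        test_size = base + 1 if fold < extra else base
        sizes.append((n_samples - test_size, test_size))
    return sizes


sizes = kfold_sizes(3032)
for fold, (n_train, n_test) in enumerate(sizes, start=1):
    print(f"Fold {fold}: train={n_train}, test={n_test}, test share={n_test / 3032:.3f}")
```

Shuffling only permutes which individuals land in which fold, so these sizes should hold with `shuffle = True` as well.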

The script is generalized so that CV folds can be created from simulations on either the 180k or the 70k dataset.
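One possible way to keep the dataset switch in a single place is a small path helper, sketched below. The function `sim_paths` and the base folder `DATA_ROOT` are hypothetical and not part of the original script; only the file-name patterns are taken from the code in this notebook.

```python
from pathlib import PurePosixPath

# Hypothetical base folder; replace with the local data directory.
DATA_ROOT = PurePosixPath("Data")


def sim_paths(dset, arch, root=DATA_ROOT):
    """Return (phenotype_csv, cv_folds_csv) for one dataset and architecture."""
    if dset not in ("180k", "70k"):
        raise ValueError(f"dset must be '180k' or '70k', got {dset!r}")
    # Input file with the simulated phenotypes
    pheno = root / "Phenotypes" / dset / f"Sim_pheno_{dset}_arch_{arch}.csv"
    # Output file with the CV fold assignment
    folds = root / "CVfolds" / f"Sim_{dset}" / f"cv_folds_sim_{dset}_arch_{arch}.csv"
    return pheno, folds


pheno_path, folds_path = sim_paths("70k", 3)
print(pheno_path)
print(folds_path)
```

Validating `dset` up front means a typo such as `"18k"` fails immediately instead of producing a missing-file error inside the loop.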

Import the libraries used in this file.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

Run the script for all $10$ architectures in a for-loop.

In [2]:
# Set dataset 180k or 70k
dset = "180k"
for j in range(10):
    # Architecture
    arch = j + 1
    # Simulated:
    data = pd.read_csv("c:/Users/gard_/Documents/MasterThesis/Code/Data/Phenotypes/" + dset + "/Sim_pheno_" + dset + "_arch_" + str(arch) + ".csv")

    ringnrs = data["ringnr"]
    print("Check the dim of ringnrs: \n",ringnrs.shape, "\n")

    # Are there any duplicated ringnrs?
    print("Any duplicated ringnrs?")
    print(ringnrs.duplicated().any())

    # Initialize KFold with the desired number of splits (10 folds)
    kf = KFold(n_splits = 10, shuffle = True, random_state = 42)

    # Create an empty list to hold the different folds with actual data values
    folds = []

    # Loop through the kf.split method which yields train and test indices
    for fold_index, (train_indices, test_indices) in enumerate(kf.split(ringnrs)):
        #print(f"Fold {fold_index+1}:")
        
        # Store the actual training and test data values in each fold
        train_data = ringnrs.iloc[train_indices]
        test_data = ringnrs.iloc[test_indices]
        
        # Append the fold containing training and testing data
        folds.append({
            'train': train_data,
            'test': test_data
        })

    # Convert the list of folds into a numpy array for convenience
    folds_array = np.array(folds, dtype=object)


    # Prepare the data to be saved in a structured format (list of dictionaries)
    csv_data = []
    for i, fold in enumerate(folds_array):
        for ringnr in fold['train']:
            csv_data.append({'Fold': i + 1, 'Set': 'train', 'ringnr': ringnr})
        for ringnr in fold['test']:
            csv_data.append({'Fold': i + 1, 'Set': 'test', 'ringnr': ringnr})

    # Convert to DataFrame and save as CSV
    df = pd.DataFrame(csv_data)
    # df.to_csv('cv_folds.csv', index=False)
    #print(df)

    # Data path to CV-folder
    CV_path = "C:/Users/gard_/Documents/MasterThesis/Code/Data/CVfolds/Sim_" + dset + "/"
    # Save CSV-file
    df.to_csv(CV_path + 'cv_folds_sim_' + dset + '_arch_' + str(arch) +  '.csv', index=False)

Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
Check the dim of ringnrs: 
 (3032,) 

Any duplicated ringnrs?
False
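The saved files use a long format with the columns `Fold`, `Set` and `ringnr`, which is what makes them readable from both Python and R. The sketch below shows one way to read such a file back and sanity-check it, using only the standard library. The helper names `read_folds` and `check_folds` are hypothetical, and the toy ringnrs are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_folds(handle):
    """Map fold number -> {'train': [...], 'test': [...]} from a long-format CSV."""
    folds = defaultdict(lambda: {"train": [], "test": []})
    for row in csv.DictReader(handle):
        folds[int(row["Fold"])][row["Set"]].append(row["ringnr"])
    return dict(folds)


def check_folds(folds):
    """Verify train/test are disjoint per fold and no ringnr is tested twice."""
    tested = []
    for parts in folds.values():
        assert not set(parts["train"]) & set(parts["test"])
        tested.extend(parts["test"])
    assert len(tested) == len(set(tested))
    return True


# Toy file with made-up ringnrs, two folds only, for illustration.
toy_csv = io.StringIO(
    "Fold,Set,ringnr\n"
    "1,train,A1\n1,train,A2\n1,test,A3\n1,test,A4\n"
    "2,train,A3\n2,train,A4\n2,test,A1\n2,test,A2\n"
)
folds = read_folds(toy_csv)
print(folds[1]["test"])
print(check_folds(folds))
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`, where `path` points to one of the `cv_folds_sim_..._arch_....csv` files written above.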
