:warning:**IMPORTANT NOTICE**:warning:\
*Since the k-fold splitting is a stochastic process, this notebook is not fully reproducible.*

*Re-running this script is therefore not recommended, as it will overwrite the original validation data used in the work presented here.
This script serves solely as documentation.*
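
One way to guard against the accidental overwriting mentioned above is to refuse to write a file that already exists. The following is a minimal sketch only; the helper `save_if_absent` is hypothetical and not part of the original notebook, and the demonstration writes to a temporary directory rather than to the real output folder.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_if_absent(file, array):
    """Write `array` as CSV unless `file` already exists (hypothetical helper).

    Returns True if the file was written, False if it was left untouched.
    """
    file = Path(file)
    if file.exists():
        return False
    file.parent.mkdir(parents=True, exist_ok=True)
    np.savetxt(file, array, delimiter=",")
    return True


# Demonstration in a temporary directory (no real data is touched)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "kfold_datasets", "train_data_0.csv")
    first = save_if_absent(target, np.ones((3, 2)))    # file is created
    second = save_if_absent(target, np.zeros((3, 2)))  # existing file is kept
    kept = np.loadtxt(target, delimiter=",")

print(first, second)
```

With such a guard, a second run would leave the previously saved splits in place instead of replacing them.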

# Generate K-fold cross-validation splits

This notebook generates five validation sets, each with a matching training set, which can then be used to evaluate the performance of different models during model selection.

In [1]:
import numpy as np
import pandas as pd

from pathlib import Path
from sklearn.model_selection import ShuffleSplit


In [2]:
# load excel file
data = pd.read_excel(Path("Metapelite-Database_Bt_CLEAN_2024-02-03.xlsx"))

biotite_composition = np.zeros(shape=(len(data), 6))
biotite_composition[:, 0] = data["Bt-Si"]
biotite_composition[:, 1] = data["Bt-Ti"]
biotite_composition[:, 2] = data["Bt-Al"]
biotite_composition[:, 3] = data["Bt-FeTot"]
biotite_composition[:, 4] = data["Bt-Mn"]
biotite_composition[:, 5] = data["Bt-Mg"]

# also save XMg
biotite_XMg = np.array([data["Bt-XMg"]]).T

# extract one-hot encoded minerals in the following order: Chl, Grt, Crd, And, St, Ky, Sil, Kfs
index_minerals = np.zeros(shape=(len(data), 8))
index_minerals[:, 0] = data["Chl"]
index_minerals[:, 1] = data["Grt"]
index_minerals[:, 2] = data["Crd"]
index_minerals[:, 3] = data["And"]
index_minerals[:, 4] = data["St"]
index_minerals[:, 5] = data["Ky"]
index_minerals[:, 6] = data["Sil"]
index_minerals[:, 7] = data["Kfs"]

# Some mineral columns (Chl, Grt, St) contain NaN values; replace them with 0.
# These are most likely samples with regional or metastable phases (not verified).
index_minerals = np.nan_to_num(index_minerals, nan=0)

# combine biotite composition and one-hot encoded minerals
biotite_composition_idxmin = np.concatenate((biotite_composition, biotite_XMg, index_minerals), axis=1)

pt = np.zeros(shape=(len(data), 2))
pt[:, 0] = data["Pressure estimate random uniform"] * 1000  # convert kbar to bar
pt[:, 1] = data["Temperature random ordered after Ti-in-Bt"]

ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Split the data into train and test sets
for i, (train_index, test_index) in enumerate(ss.split(biotite_composition_idxmin, pt)):

    training_x = biotite_composition_idxmin[train_index]
    training_y = pt[train_index]

    test_x = biotite_composition_idxmin[test_index]
    test_y = pt[test_index]

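    # save training and test data as csv files (features first, then P and T)
    train_file = Path("kfold_datasets", f"train_data_{i}.csv")
    test_file = Path("kfold_datasets", f"test_data_{i}.csv")
    train_file.parent.mkdir(exist_ok=True)  # np.savetxt fails if the folder is missing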

    np.savetxt(train_file, np.concatenate((training_x, training_y), axis=1), delimiter=",")
    np.savetxt(test_file, np.concatenate((test_x, test_y), axis=1), delimiter=",")
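
Each saved CSV holds 15 feature columns (6 biotite elements, XMg, and 8 index minerals) followed by the 2 target columns (pressure in bar and temperature). A minimal sketch of how such a file could be read back and separated again is given below; the helper `load_split` is hypothetical, and the demonstration uses a synthetic file with the same column layout rather than the real splits.

```python
import tempfile
from pathlib import Path

import numpy as np

N_FEATURES = 15  # 6 biotite elements + XMg + 8 index minerals


def load_split(csv_file):
    """Load one saved split and separate features from P-T targets (hypothetical helper)."""
    table = np.loadtxt(csv_file, delimiter=",", ndmin=2)
    return table[:, :N_FEATURES], table[:, N_FEATURES:]


# Demonstrate on a synthetic file with the same column layout (not real data)
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp, "train_data_0.csv")
    rng = np.random.default_rng(0)
    np.savetxt(demo_file, rng.random((10, N_FEATURES + 2)), delimiter=",")
    x, y = load_split(demo_file)

print(x.shape, y.shape)
```

For the real data, the same call would be applied to the files in `kfold_datasets`, for example `load_split(Path("kfold_datasets", "train_data_0.csv"))`.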