# Leave-One-Out Cross-Validation (LOOCV)

LOOCV is a special case of k-fold cross-validation in which k equals the number of samples, n. In each iteration, a single sample is held out as the test set and the remaining n - 1 samples make up the training set used to build the predictive model; model performance is then estimated from the n held-out predictions (Kuo et al., 2020).
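
As a minimal sketch of the splitting scheme described above, the snippet below compares `LeaveOneOut` with `KFold` where the number of folds equals the number of samples. The array `X_toy` is illustrative toy data, not the EUCT-NS dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy feature matrix with n = 5 samples (illustrative data only)
X_toy = np.arange(10).reshape(5, 2)
n_samples = X_toy.shape[0]

loo_splits = list(LeaveOneOut().split(X_toy))
kfold_splits = list(KFold(n_splits=n_samples).split(X_toy))

# Each LOOCV iteration trains on n - 1 samples and tests on exactly one
for train_idx, test_idx in loo_splits:
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")

# Without shuffling, k-fold with k = n yields the same partitions as LOOCV
same_splits = all(
    np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
    for (tr_a, te_a), (tr_b, te_b) in zip(loo_splits, kfold_splits)
)
print("Identical to KFold(n_splits=n):", same_splits)
```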

Classifiers: Given the non-linear and imbalanced nature of the primary endpoint types used in clinical trial protocols, we evaluated four machine learning algorithms: Complement Naïve Bayes (CNB) (Rennie et al., 2003), Multi-layer Perceptron (MLP) (Kruse et al., 2022), Random Forest (RF) (Ho, 1995) and Support Vector Machine (SVM) (Hearst et al., 1998).

Word embedding techniques: Term Frequency-Inverse Document Frequency (TF-IDF) (Bafna et al., 2016), Global Vectors for Word Representation (GloVe) (Pennington et al., 2014) and SciBERT (Beltagy et al., 2019).

Parameter tuning: We used GridSearchCV from scikit-learn (version 1.6.1) in Python to perform an exhaustive search over the specified parameter values for each classifier, with the aim of assessing potential predictive performance.

Benefits of LOOCV
1. Reduced bias: Each observation is used for validation exactly once and every model is trained on nearly the full dataset, so LOOCV produces an almost unbiased estimate of model performance.
2. Maximised training data: Every data point is included in the training set for all but one fold, so nearly the whole dataset informs model fitting in each iteration.
3. Enhanced robustness: LOOCV is particularly suited to small datasets such as the EUCT-NS dataset, because it avoids the overly small training sets that can arise with other cross-validation methods.

In [None]:
# Basic packages and Libraries
import pandas as pd
from pandas import read_csv
import numpy as np

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
# Import chosen model(s)

from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

import matplotlib.pyplot as plt

In [None]:
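euct_path = "euct_df.csv"  # placeholder: replace with the path to your preprocessed CSV file
euct_df = pd.read_csv(euct_path)
# Data already preprocessed during the prior clustering task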

In [None]:
# Fill NaN values with an empty string before concatenating;
# otherwise a single missing field turns the whole concatenated row into NaN
text_cols = ['Title', 'Objective', 'pr_endpoint', 'endpoint_description']
euct_df['concat_corpus'] = euct_df[text_cols].fillna('').astype(str).agg(' '.join, axis=1)
euct_df.head()

Generate the embeddings for the corpus
For TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
For GloVe: https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/DOCUMENT_POOL_EMBEDDINGS.md
For SciBERT: https://github.com/allenai/scibert?tab=readme-ov-file

These are the word embeddings I used but there are other options to explore.

In [None]:
# Split the data into X and y
X = data.data  # placeholder: replace `data.data` with the feature matrix produced by the embedding step
y = euct_df['manual_label'].values  # target variable

Grid Search:
An exhaustive search strategy that evaluates every combination of the specified hyperparameter values, scoring each combination with cross-validation.
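
To make the idea concrete, here is a small sketch of a grid search on synthetic data. The dataset from `make_classification`, the choice of `RandomForestClassifier`, and the grid values are all illustrative assumptions, not the settings used for the EUCT-NS experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the embedded corpus (assumption)
X_toy, y_toy = make_classification(n_samples=60, n_features=8, random_state=42)

pipeline = make_pipeline(RandomForestClassifier(random_state=42))

# Keys follow the '<step name>__<parameter>' convention;
# make_pipeline names each step after the lowercased class name
param_grid = {
    "randomforestclassifier__n_estimators": [10, 25],
    "randomforestclassifier__max_depth": [2, None],
}

# The grid contains 2 x 2 = 4 candidate combinations
n_candidates = len(ParameterGrid(param_grid))

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print(n_candidates, search.best_params_)
```

The number of model fits grows as the product of the list lengths times the number of folds, so keeping each list short matters when the grid has many parameters.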

In [None]:
# Define the model
mdl = Model(random_state=42) # Replace Model with your chosen model, e.g., RandomForestClassifier()

In [None]:
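from sklearn.model_selection import GridSearchCV
# Define the hyperparameters and their values for grid search. I looked at the scikit-learn documentation
# for my chosen models to find the appropriate parameters.
# The grid must be a dict mapping parameter names to lists of values. Because the model is wrapped by
# make_pipeline, prefix each name with the lowercased class name and a double underscore,
# e.g. 'randomforestclassifier__n_estimators'.
parameters = {
    # '<step_name>__<parameter>': [value_1, value_2],
}

mdl = Model()  # Replace Model with your chosen model
pipeline = make_pipeline(mdl)
grid_search = GridSearchCV(pipeline, parameters, cv=10, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X, y)

print("Best parameters found: ", grid_search.best_params_)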

In [None]:
# Initialise LOOCV and the best-performing model
# We perform Leave-One-Out Cross-Validation (LOOCV) to evaluate each model with its best parameters
loo = LeaveOneOut()
mdl = Model()  # Replace Model with your chosen model and pass the best parameters found by GridSearchCV
predictions = []
actuals = []

In [None]:
# Perform Leave-One-Out Cross-Validation
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    mdl.fit(X_train, y_train)

    y_pred = mdl.predict(X_test)

    predictions.append(y_pred[0])
    actuals.append(y_test[0])
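
The manual loop above can also be expressed with `cross_val_predict`, which collects one out-of-fold prediction per sample. The sketch below checks that both routes agree on synthetic data; `LogisticRegression` and the data from `make_classification` are stand-ins chosen for illustration, not the classifiers or dataset used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(n_samples=30, n_features=5, random_state=0)

# Manual LOOCV loop, mirroring the cell above
clf = LogisticRegression(max_iter=1000)
manual_preds = []
for train_idx, test_idx in LeaveOneOut().split(X_toy):
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    manual_preds.append(clf.predict(X_toy[test_idx])[0])

# Same procedure in one call: one held-out prediction per sample
cv_preds = cross_val_predict(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=LeaveOneOut()
)

loo_accuracy = float(np.mean(cv_preds == y_toy))
print("Predictions match:", np.array_equal(manual_preds, cv_preds))
print("LOOCV accuracy:", loo_accuracy)
```

The explicit loop is still useful when per-fold artefacts (for example, fitted models) need to be kept; `cross_val_predict` returns only the predictions.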

Evaluation metrics

In [None]:
# Evaluation metrics
classification_metrics = classification_report(actuals, predictions, output_dict=True)
confusion_mat = confusion_matrix(actuals, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=mdl.classes_)
disp.plot()  # draw the confusion matrix
plt.show()
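
As a small worked example of reading these outputs, the sketch below uses hypothetical label lists (`toy_actuals`, `toy_predictions`) in place of the LOOCV results. With `output_dict=True`, the report is a nested dictionary keyed by class label, plus `accuracy`, `macro avg` and `weighted avg`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for the LOOCV actuals and predictions
toy_actuals = ["a", "a", "b", "b"]
toy_predictions = ["a", "b", "b", "b"]

report = classification_report(toy_actuals, toy_predictions, output_dict=True)
matrix = confusion_matrix(toy_actuals, toy_predictions)

# Rows are true classes, columns are predicted classes (sorted label order)
print(matrix)
print("Accuracy:", report["accuracy"])
print("Recall for class 'a':", report["a"]["recall"])
print("Macro-averaged F1:", report["macro avg"]["f1-score"])
```

For imbalanced classes, the per-class and macro-averaged values are usually more informative than overall accuracy.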