# Active Learning

Active learning is a semi-supervised machine learning approach that combines human insight with natural language processing (NLP) algorithms to annotate data (Settles, 2009).

Query: We used a predefined query function, least-confidence sampling, to determine which cases were labelled in each round of active learning. The initial uncertainty scores were obtained by applying the classifier previously trained on the EUCT-NS dataset to the NS-HRA dataset. We labelled seven cases in each round of active learning.
Annotate: The first author labelled each queried case in response to the prompt “Enter label for this case (0, 1, 2):”.
Append: The newly labelled cases were removed from the unlabelled dataset (U) and added to the labelled set (L); the model was then retrained on L and applied to U to obtain new uncertainty scores. The active learning loop was repeated until the stop criterion was met: twenty loops of active learning (140 cases labelled) or a consistent (across five loops) weighted recall of ≥ 0.7.
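
The stop criterion described above can be sketched as a small helper. This is a minimal illustration rather than the code used in the study: the function name `should_stop`, its parameter names and the example recall values are hypothetical, and "consistent across five loops" is assumed to mean five consecutive loops.

```python
def should_stop(recall_history, max_loops=20, target_recall=0.7, patience=5):
    """Return True once either stop criterion from the text is met.

    recall_history: weighted recall after each completed loop (oldest first).
    """
    # Criterion 1: the maximum number of loops has been completed.
    if len(recall_history) >= max_loops:
        return True
    # Criterion 2: the last `patience` loops all reached the target recall.
    recent = recall_history[-patience:]
    return len(recent) == patience and all(r >= target_recall for r in recent)


# Hypothetical recall values, for illustration only.
history = [0.55, 0.62, 0.71, 0.74, 0.72, 0.70, 0.73]
print(should_stop(history[:6]))  # False: only four consecutive loops at >= 0.7
print(should_stop(history))      # True: the last five loops are all >= 0.7
```

Calling such a helper at the end of each loop keeps the stopping decision separate from the labelling code.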

Performance metrics
To better evaluate the performance of our multiclass task, we utilised weighted accuracy, weighted precision, recall and F1-score, the area under the receiver operating characteristic curve (AUROC) and the precision-recall curve.

AUROC, weighted accuracy, precision, recall and F1-score all take values between 0 and 1, with values nearer to 1 indicating better performance (Kuo et al., 2020).
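
To make the "weighted" averaging concrete, the sketch below computes support-weighted precision, recall and F1-score by hand with NumPy. It is an illustration under assumptions: the function name `weighted_scores` and the toy labels are hypothetical, and the class coding (0 = PFO, 1 = IO, 2 = SO) follows the plot labels used later in this notebook.

```python
import numpy as np

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls in np.unique(y_true):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        predicted = np.sum(y_pred == cls)   # TP + FP
        support = np.sum(y_true == cls)     # TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        weight = support / len(y_true)      # class share of the true labels
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals

# Toy labels (0 = PFO, 1 = IO, 2 = SO), for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]
scores = weighted_scores(y_true, y_pred)
print({name: round(float(value), 3) for name, value in scores.items()})
```

These are the quantities reported in the `weighted avg` row of scikit-learn's `classification_report`. By construction, support-weighted recall coincides with overall accuracy.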

For our study, recall was the most important metric because our aim was to identify all cases where a clinical trial protocol uses a surrogate outcome as the primary endpoint, even at the cost of occasionally misclassifying other instances. Whilst this approach may increase false positives, it helps to capture all potential use of surrogate primary endpoints, which is essential for understanding the broader implications of poor reporting of surrogate outcomes in clinical trials. A binary classification of “surrogate vs. non-surrogate” would not have distinguished intermediate outcomes: primary endpoints that are not surrogate outcomes but also do not directly measure final patient health outcomes (Griffiths et al., 2022). We considered the trade-off between false positives and recall carefully and judged that the cost of additional misclassifications was outweighed by the importance of minimising false negatives in this context.

In [None]:
# Basic packages and libraries
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, classification_report,
    roc_curve, precision_recall_curve, precision_score, recall_score, f1_score,
)
# Import your classifier here
# Import feature-extraction tools (e.g., TfidfVectorizer)

Apply the pretrained model to the unlabelled dataset:
We used the model trained on the EUCT-NS dataset to generate class probabilities for our second dataset (the NS-HRA dataset), then selected the lowest-confidence (highest-uncertainty) predictions as the initial queries for our active-learning loop.
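
As a minimal sketch of least-confidence sampling, the uncertainty of a case can be taken as one minus the probability of its most likely class, and the cases with the highest uncertainty are queried. The function name `least_confidence_query` and the toy probabilities below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def least_confidence_query(probs, n_queries):
    """Indices of the n_queries rows whose top class probability is lowest."""
    probs = np.asarray(probs)
    uncertainty = 1.0 - probs.max(axis=1)  # 0 means fully confident
    # argsort is ascending, so the last n_queries entries are the most uncertain;
    # reverse them so the most uncertain case comes first.
    return np.argsort(uncertainty)[-n_queries:][::-1]

# Toy class probabilities for four cases and three classes (illustrative only).
probs = np.array([
    [0.90, 0.05, 0.05],   # confident
    [0.40, 0.35, 0.25],   # very uncertain
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain
])
print(least_confidence_query(probs, n_queries=2))  # [3 1]
```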

In [None]:
# Define the unlabelled second dataset
X2_unlabeled = X2_df.copy()

# Obtain class probabilities and calculate least-confidence uncertainty scores:
# 1 minus the probability of the most likely class (higher = more uncertain)
probs = model.predict_proba(X2_unlabeled)
uncertainty_scores = 1 - probs.max(axis=1)

# Select the N most uncertain samples. We initially selected 7 per iteration
# (~1% of the dataset); as an alternative we also tried 1 sample per iteration,
# which is more efficient for a small dataset.
n_samples = 7
most_uncertain_indices = uncertainty_scores.argsort()[-n_samples:]

# Extract the corresponding Unique_IDs and preprocessed concatenated text.
# The unique IDs identify the samples in the original dataset, whose text has not
# been preprocessed. We used this original dataset, the health outcome framework
# from Manyara et al., 2022 and a systematic search in the Core Outcome Measures
# in Effectiveness Trials (COMET) database to determine primary endpoint type.
X_initial_raw = ns_hra.iloc[most_uncertain_indices]

def manually_label_samples(selected_rows):
    labels = []
    for idx, row in selected_rows.iterrows():
        print(f"Unique_ID: {row['Unique_ID']}")
        print(f"Preprocessed Text: {row['concat_corpus']}\n")
        label = input("Enter label for this sample (e.g., 0, 1, 2): ")
        labels.append(int(label))
    return labels

# Manually label the selected samples
y_initial = manually_label_samples(X_initial_raw)
X_initial = X_initial_raw['concat_corpus'].tolist()

In [None]:
# Ensure X_train is always a list before extending
if 'X_train' not in locals() or not isinstance(X_train, list):
    X_train = list(X_initial)  # Convert to list explicitly
    y_train = np.array(y_initial)
else:
    X_train.extend(X_initial)
    y_train = np.hstack([y_train, y_initial])

# Remove labeled samples from unlabeled pool
X2_unlabeled_reset = X2_unlabeled.drop(index=most_uncertain_indices).reset_index(drop=True)

Train the new classifier:
We performed another grid search (see LOOCV) to determine if the best classifier was still the same as the one we had previously selected. 
We found that the best classifier was still the same, so we used it to train the model on the new data.
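
For readers unfamiliar with this step, a grid search scored by leave-one-out cross-validation can be sketched as follows. Everything specific here is an assumption made for illustration: "LOOCV" is read as leave-one-out cross-validation, and the `TfidfVectorizer` plus `LogisticRegression` pipeline, the parameter grid and the toy corpus are hypothetical stand-ins rather than the classifier or data used in the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import Pipeline

# Toy corpus and labels (0 = PFO, 2 = SO), for illustration only.
texts = [
    "overall survival at five years", "all cause mortality", "death from any cause",
    "change in blood pressure", "serum cholesterol level", "hba1c reduction",
]
labels = [0, 0, 0, 2, 2, 2]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),  # assumed classifier
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}        # hypothetical grid

# Each fold holds out exactly one sample, so accuracy per fold is 0 or 1.
search = GridSearchCV(pipe, param_grid, cv=LeaveOneOut(), scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 2))
```

Comparing `search.best_params_` with the previously selected settings is one way to check whether the best configuration has changed after new labels are added.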

In [None]:
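from sklearn.pipeline import make_pipeline

# `feature_extractor` is a placeholder for your feature-extraction step (e.g., TfidfVectorizer());
# `mdl` is the classifier class and `best_hyperparameters` the dict of settings chosen by the grid search.
pipeline = make_pipeline(
    feature_extractor,
    mdl(**best_hyperparameters),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X2_unlabeled_reset)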

In [None]:
# We applied cross-validation to the training set to obtain out-of-fold predictions.
# The number of folds increased as the sample size grew: 3 folds for the initial
# sample of 7, 5 folds at iteration 4 and 10 folds at iteration 13.
y_pred_cv = cross_val_predict(pipeline, X_train, y_train, cv=3)
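
The fold schedule mentioned in the comment above can be captured in a small helper so that `cv` does not have to be edited by hand in each iteration. The function name `cv_folds_for_iteration` is hypothetical, and the sketch assumes that each increase applies from the stated iteration onwards.

```python
def cv_folds_for_iteration(iteration):
    """Number of cross-validation folds for a given active-learning iteration."""
    if iteration >= 13:
        return 10
    if iteration >= 4:
        return 5
    return 3

print([cv_folds_for_iteration(i) for i in (1, 3, 4, 12, 13, 20)])  # [3, 3, 5, 5, 10, 10]
```

The result could then be passed as `cv=cv_folds_for_iteration(i)` to `cross_val_predict` in iteration `i`.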

In [None]:
iter1 = classification_report(y_train, y_pred_cv, output_dict=True)

In [None]:
# Repeat the active learning loop with the new model and updated dataset
X2_unlabeled = X2_unlabeled_reset.copy()
...
# Manually label the selected samples.
# Use a new variable name in each iteration to avoid confusion with the previous iteration.
y2_initial = manually_label_samples(X_initial_raw)
X2_initial = X_initial_raw['concat_corpus'].tolist()

In [None]:
X2_initial = list(X2_initial)
y2_initial = np.array(y2_initial)  # keep the new labels separate; do not overwrite y_train

In [None]:
# Add the newly labelled samples to the existing training set; repeat for each iteration until the stop criterion is met.
X_train = X_train + X2_initial
y_train = np.hstack([y_train, y2_initial])
len(X_train), len(y_train)

In [None]:
# Remove labelled samples from unlabelled pool
X2_unlabeled_reset = X2_unlabeled.drop(index=most_uncertain_indices).reset_index(drop=True)

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X2_unlabeled_reset)

In [None]:
y_pred_cv = cross_val_predict(pipeline, X_train, y_train, cv=3)

In [None]:
iter2 = classification_report(y_train, y_pred_cv, output_dict=True)
print(iter2)

We repeated this loop 20 times in total (see README).
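
The repeated cells can also be expressed as a single loop. The sketch below is a simplified illustration under several assumptions, not the procedure used in the study: synthetic data from `make_classification` stand in for the vectorised text, a small stratified seed set replaces the pretrained EUCT-NS model, `LogisticRegression` is an assumed classifier, and the true labels play the role of the human annotator. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic three-class data (illustrative only).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# Seed set: the first 10 cases of each class; everything else is the pool U.
labelled = [i for c in range(3) for i in np.flatnonzero(y == c)[:10].tolist()]
labelled_set = set(labelled)
pool = [i for i in range(len(X)) if i not in labelled_set]

model = LogisticRegression(max_iter=1000)  # assumed classifier
batch_size, n_loops = 7, 5
recalls = []  # one entry per loop, instead of iter1, iter2, ...

for loop in range(n_loops):
    model.fit(X[labelled], y[labelled])
    # Query: least-confidence sampling over the remaining pool.
    probs = model.predict_proba(X[pool])
    uncertainty = 1.0 - probs.max(axis=1)
    queried = [pool[p] for p in np.argsort(uncertainty)[-batch_size:]]
    # Annotate: the true labels y[queried] stand in for the human annotator.
    # Append: move the queried cases from the pool to the labelled set.
    labelled.extend(queried)
    queried_set = set(queried)
    pool = [i for i in pool if i not in queried_set]
    # Evaluate with cross-validated predictions on the labelled set.
    y_cv = cross_val_predict(model, X[labelled], y[labelled], cv=3)
    recalls.append(recall_score(y[labelled], y_cv, average="weighted"))

print(len(labelled), len(pool), [round(float(r), 2) for r in recalls])
```

Storing the per-loop results in a list also avoids having to create a new variable name for every iteration.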

Plot

In [None]:
# Step 1: Gather all classification reports (iter1 ... iter20)
n_iterations = 20
classification_reports = [globals()[f'iter{i}'] for i in range(1, n_iterations + 1)]

# Step 2: Extract accuracy and per-class F1-scores
accuracy_list = [report['accuracy'] for report in classification_reports]
f1_score_0 = [report['0']['f1-score'] for report in classification_reports]
f1_score_1 = [report['1']['f1-score'] for report in classification_reports]
f1_score_2 = [report['2']['f1-score'] for report in classification_reports]

iterations = np.arange(1, n_iterations + 1)

# Step 3: Plot accuracy
plt.figure(figsize=(10, 5))
plt.plot(iterations, accuracy_list, marker='o', linestyle='-', color='b', label='Accuracy')
# baseline_accuracy must be defined beforehand (the accuracy of the baseline model)
plt.axhline(y=baseline_accuracy, color='r', linestyle='--', label='Baseline Accuracy')
plt.xlabel('Iteration')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. Iteration')
plt.ylim(0, 1)
plt.legend()
plt.grid()
plt.show()

# Step 4: Plot F1 scores per class
plt.figure(figsize=(10, 5))
plt.plot(iterations, f1_score_0, marker='o', linestyle='-', color='r', label='F1-score Class 0 (PFO)')
plt.plot(iterations, f1_score_1, marker='s', linestyle='--', color='g', label='F1-score Class 1 (IO)')
plt.plot(iterations, f1_score_2, marker='^', linestyle='-.', color='m', label='F1-score Class 2 (SO)')
# baseline_weighted_f1 must be defined beforehand (the weighted F1-score of the baseline model)
plt.axhline(y=baseline_weighted_f1, color='k', linestyle='--', label='Baseline Weighted F1-score')
plt.xlabel('Iteration')
plt.ylabel('F1-Score')
plt.title('F1-Score per Class vs. Iteration')
plt.ylim(0, 1)
plt.legend()
plt.grid()
plt.show()


# Additional work: Precision-recall curve and threshold selection

In [None]:
y_pred_proba = pipeline.predict_proba(X_train)

In [None]:
# Target recall threshold
desired_recall = 0.70
selected_thresholds = {}

# For each class (assuming classes 0, 1, 2)
classes = [0, 1, 2]

for cls in classes:
    print(f"\n--- Class {cls} ---")
    
    # Binary ground truth: 1 if current class, 0 otherwise
    y_true_bin = (y_train == cls).astype(int)
    
    # Predicted probability scores for the current class
    y_score = y_pred_proba[:, cls]
    
    # Compute precision-recall pairs
    precision, recall, thresholds = precision_recall_curve(y_true_bin, y_score)
    
    # precision_recall_curve returns thresholds in increasing order and recall in
    # decreasing order; precision and recall have one more element than thresholds
    # (the final point, recall = 0, has no threshold).
    
    # Find the highest threshold that still achieves the desired recall
    try:
        index = np.where(recall >= desired_recall)[0][-1]
        
        # Guard against the final curve point, which has no corresponding threshold
        selected_threshold = thresholds[min(index, len(thresholds) - 1)]
        
        selected_thresholds[cls] = selected_threshold
        print(f"Threshold @ Recall ≥ {desired_recall:.2f}: {selected_threshold:.4f}")
        print(f"  Recall: {recall[index]:.2f}, Precision: {precision[index]:.2f}")
        
    except IndexError:
        print(f"No threshold found for class {cls} reaching recall ≥ {desired_recall}")
        selected_thresholds[cls] = None
    
    # Plot the Precision-Recall curve
    plt.figure()
    plt.plot(recall, precision, marker='.', label=f'Class {cls}')
    plt.axvline(x=desired_recall, color='red', linestyle='--', label=f'Target Recall = {desired_recall:.2f}')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'Precision-Recall Curve for Class {cls}')
    plt.legend()
    plt.grid(True)
    plt.show()

print("\nSelected thresholds per class:")
for cls, thresh in selected_thresholds.items():
    print(f"Class {cls}: {thresh}")


Biasing the model for the surrogate outcome (class 2)

In [None]:
threshold_class_2 = selected_thresholds[2]

# New adjusted predictions: predict class 2 whenever its probability reaches the
# threshold, otherwise keep the normal (argmax) prediction
adjusted_preds = np.where(
    y_pred_proba[:, 2] >= threshold_class_2, 2, y_pred_proba.argmax(axis=1)
)


In [None]:
print(classification_report(y_train, adjusted_preds))
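
The thresholded prediction rule used above (predict the favoured class whenever its probability reaches a threshold, otherwise fall back to the argmax) can be wrapped in a reusable function. This is a minimal sketch; the name `predict_with_class_threshold` and the toy probabilities are hypothetical.

```python
import numpy as np

def predict_with_class_threshold(probs, cls, threshold):
    """Predict `cls` when its probability reaches `threshold`, else the argmax class."""
    probs = np.asarray(probs)
    return np.where(probs[:, cls] >= threshold, cls, probs.argmax(axis=1))

# Toy probabilities for classes 0, 1 and 2 (illustrative only).
probs = np.array([
    [0.50, 0.20, 0.30],   # argmax is 0, but class 2 reaches the threshold
    [0.70, 0.20, 0.10],   # stays 0
    [0.10, 0.30, 0.60],   # already 2
    [0.25, 0.50, 0.25],   # class 2 is below the threshold, so stays 1
])
print(predict_with_class_threshold(probs, cls=2, threshold=0.30))  # [2 0 2 1]
```

A helper of this kind could replace the repeated inner loops in the threshold sweeps that follow.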

In [None]:
thresholds = np.linspace(0, 1, 100)

# Store precision and recall at each threshold
precisions = []
recalls = []

for thresh in thresholds:
    adjusted_preds = []
    
    for prob in y_pred_proba:
        if prob[2] >= thresh:
            adjusted_preds.append(2)
        else:
            adjusted_preds.append(np.argmax(prob))
    
    adjusted_preds = np.array(adjusted_preds)
    
    # Precision and recall class 2
    precision = precision_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    recall = recall_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    
    precisions.append(precision)
    recalls.append(recall)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions, label='Precision', color='blue')
plt.plot(thresholds, recalls, label='Recall', color='green')
plt.axhline(0.7, color='red', linestyle='--', label='Target Recall = 0.70')
plt.axvline(threshold_class_2, color='purple', linestyle='--', label=f'Selected Threshold = {threshold_class_2:.2f}')
plt.xlabel('Threshold for Class 2')
plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold (Class 2)')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
thresholds = np.linspace(0, 1, 500)

best_f1 = 0
best_threshold = 0
best_precision = 0
best_recall = 0

for thresh in thresholds:
    adjusted_preds = []
    
    for prob in y_pred_proba:
        if prob[2] >= thresh:
            adjusted_preds.append(2)
        else:
            adjusted_preds.append(np.argmax(prob))
    
    adjusted_preds = np.array(adjusted_preds)
    
    # Calculate precision, recall, f1 for class 2 only
    precision = precision_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    recall = recall_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    f1 = f1_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = thresh
        best_precision = precision
        best_recall = recall

print(f"Best threshold for Class 2 = {best_threshold:.4f}")
print(f"  F1-score = {best_f1:.4f}")
print(f"  Precision = {best_precision:.4f}")
print(f"  Recall = {best_recall:.4f}")

In [None]:
f1_scores = []

for thresh in thresholds:
    adjusted_preds = []
    
    for prob in y_pred_proba:
        if prob[2] >= thresh:
            adjusted_preds.append(2)
        else:
            adjusted_preds.append(np.argmax(prob))
    
    adjusted_preds = np.array(adjusted_preds)
    
    f1 = f1_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    f1_scores.append(f1)

# Plot F1 vs Threshold
plt.figure(figsize=(10,6))
plt.plot(thresholds, f1_scores, color='blue')
plt.axvline(best_threshold, color='red', linestyle='--', label=f'Best Threshold = {best_threshold:.2f}')
plt.xlabel('Threshold for Class 2')
plt.ylabel('F1-score (Class 2)')
plt.title('F1-Score vs Threshold (Class 2)')
plt.grid(True)
plt.legend()
plt.show()

In [None]:
desired_recall = 0.7  
thresholds = np.linspace(0, 1, 500)

# To store candidates
valid_thresholds = []
valid_precisions = []
valid_recalls = []
valid_f1s = []

for thresh in thresholds:
    adjusted_preds = []
    
    for prob in y_pred_proba:
        if prob[2] >= thresh:
            adjusted_preds.append(2)
        else:
            adjusted_preds.append(np.argmax(prob))
    
    adjusted_preds = np.array(adjusted_preds)
    
    # Evaluate class 2 only
    precision = precision_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    recall = recall_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    f1 = f1_score(y_train, adjusted_preds, labels=[2], average='micro', zero_division=0)
    
    # Only accept thresholds meeting the recall constraint
    if recall >= desired_recall:
        valid_thresholds.append(thresh)
        valid_precisions.append(precision)
        valid_recalls.append(recall)
        valid_f1s.append(f1)

# Select best precision among valid thresholds
if valid_thresholds:
    best_idx = np.argmax(valid_precisions)
    best_threshold = valid_thresholds[best_idx]
    best_precision = valid_precisions[best_idx]
    best_recall = valid_recalls[best_idx]
    best_f1 = valid_f1s[best_idx]
    
    print(f"Best threshold (recall ≥ {desired_recall}): {best_threshold:.4f}")
    print(f"  Precision: {best_precision:.4f}")
    print(f"  Recall: {best_recall:.4f}")
    print(f"  F1-score: {best_f1:.4f}")
else:
    print(f"No threshold found achieving recall ≥ {desired_recall}.")
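
The selection rule in the cell above (among thresholds that reach the target recall, keep the one with the highest precision) can be sketched as a standalone function using NumPy only. The function name `best_threshold_under_recall` and the toy data are hypothetical; ties are resolved in favour of the first (lowest) threshold, mirroring `np.argmax`.

```python
import numpy as np

def best_threshold_under_recall(y_true, probs, cls, min_recall, thresholds):
    """Return (precision, recall, threshold) with the highest precision for `cls`
    among thresholds whose recall is at least `min_recall`, or None."""
    y_true, probs = np.asarray(y_true), np.asarray(probs)
    best = None
    for t in thresholds:
        preds = np.where(probs[:, cls] >= t, cls, probs.argmax(axis=1))
        tp = np.sum((preds == cls) & (y_true == cls))
        predicted = np.sum(preds == cls)
        actual = np.sum(y_true == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        if recall >= min_recall and (best is None or precision > best[0]):
            best = (float(precision), float(recall), float(t))
    return best

# Toy labels and probabilities (illustrative only).
y_true = [2, 2, 2, 2, 0, 0, 1, 1]
probs = np.array([
    [0.1, 0.1, 0.8], [0.2, 0.2, 0.6], [0.4, 0.2, 0.4], [0.6, 0.2, 0.2],
    [0.5, 0.2, 0.3], [0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.1, 0.8, 0.1],
])
result = best_threshold_under_recall(y_true, probs, cls=2, min_recall=0.7,
                                     thresholds=[0.2, 0.3, 0.4, 0.5])
print(result)  # (1.0, 0.75, 0.4)
```

In this toy example the threshold 0.5 gives perfect precision but only 0.5 recall, so it is excluded by the recall constraint.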


In [None]:
if valid_thresholds:
    plt.figure(figsize=(10,6))
    plt.plot(valid_thresholds, valid_precisions, marker='o', label='Precision')
    plt.plot(valid_thresholds, valid_recalls, marker='x', label='Recall')
    plt.plot(valid_thresholds, valid_f1s, marker='^', label='F1-score')
    plt.axhline(desired_recall, color='red', linestyle='--', label=f'Target Recall = {desired_recall}')
    plt.xlabel('Threshold for Class 2')
    plt.ylabel('Score')
    plt.title('Valid Thresholds (Recall ≥ Target)')
    plt.legend()
    plt.grid(True)
    plt.show()
