# Train a classifier to predict primary endpoint type in the EUCT-NS dataset

We trained the classifier using the word embedding technique, classifier and parameters that performed best in the previous LOOCV step: Complement Naive Bayes with TF-IDF embeddings.
Train/Test Split: We divided the EUCT-NS dataset into an 80/20 train/test split.
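
As a rough illustration of the setup described above, the sketch below runs TF-IDF features, an 80/20 split and Complement Naive Bayes on a tiny made-up corpus. The texts, the two toy labels and the default parameters are assumptions for demonstration only; they are not EUCT-NS data and not the parameters selected in LOOCV.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

# Toy corpus, invented for illustration (not taken from EUCT-NS)
texts = [
    "overall survival at 12 months",
    "all-cause mortality at one year",
    "death from any cause",
    "survival time after randomisation",
    "time to death or hospitalisation",
    "change in tumour size on imaging",
    "reduction in blood pressure at week 8",
    "serum cholesterol level at 6 months",
    "change in HbA1c from baseline",
    "tumour response rate on CT scan",
]
# Toy labels, invented for illustration
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Turn the texts into a sparse TF-IDF feature matrix
X = TfidfVectorizer().fit_transform(texts)

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Complement Naive Bayes with default parameters (an assumption here)
clf = ComplementNB().fit(X_train, y_train)
pred = clf.predict(X_test)

print(X_train.shape[0], "training rows,", X_test.shape[0], "test rows")
```

With ten documents, `test_size=0.2` holds out two of them for testing and leaves eight for training.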

In [None]:
# Basic packages and libraries
import pandas as pd
import numpy as np 

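from sklearn.model_selection import train_test_split
# Import the model (Complement Naive Bayes, selected in the LOOCV step)
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, classification_report, roc_curve, precision_recall_curve
import joblib

import matplotlib.pyplot as plt
# Import feature-extraction tools (TF-IDF, selected in the LOOCV step)
from sklearn.feature_extraction.text import TfidfVectorizer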

Train the classifier

In [None]:
euct_ns = pd.read_csv('c:\\Users\\s2421127\\Documents\\NLP Project\\ObuayaO\\NLP project\\Chapter 3\\euct_ns.csv', encoding='unicode_escape')

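# Fill missing values per column before concatenating; otherwise a single NaN
# field makes the whole concatenated string NaN and the other fields are lost
text_cols = ['Title', 'Objective', 'pr_endpoint', 'endpoint_description']
euct_ns['concat_corpus'] = euct_ns[text_cols].fillna('').astype(str).agg(' '.join, axis=1)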

Generate the embeddings for the corpus.
For TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
For GloVe: https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/DOCUMENT_POOL_EMBEDDINGS.md
For SciBERT: https://github.com/allenai/scibert?tab=readme-ov-file

These are the word embeddings I used but there are other options to explore.
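
For the TF-IDF option, a minimal sketch of the idea is shown here: the vectorizer is fitted once on a corpus, and the same fitted object is then reused to transform unseen text, so both matrices share one vocabulary. The example sentences and the default `TfidfVectorizer()` settings are assumptions for illustration, not the configuration chosen in LOOCV.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts (not taken from EUCT-NS)
corpus = [
    "overall survival at 12 months",
    "change in blood pressure at week 8",
    "survival without disease progression",
]

# Learn the vocabulary and IDF weights from the corpus, then transform it
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Reuse the fitted vectorizer on new text: transform only, no refitting,
# so the columns line up with those of X
new_X = vectorizer.transform(["blood pressure at 12 months"])

print(X.shape, new_X.shape)
```

Calling `transform` rather than `fit_transform` on new text is what keeps the feature columns aligned with those the classifier was trained on.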

In [None]:
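# Fit the vectorizer and split the data into X and y
# (set the TfidfVectorizer parameters to the best ones found in LOOCV)
embedded_transformed_features = TfidfVectorizer()
X = embedded_transformed_features.fit_transform(euct_ns['concat_corpus']) # feature matrix after embedding transformation
y = euct_ns['manual_label'].values # target variable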

In [None]:
joblib.dump(embedded_transformed_features, "feature_embeddings.pkl")
# Save the fitted vectorizer (not the feature matrix) so the same transformation can be applied later to the second dataset

In [None]:
# Define the model
mdl = ComplementNB() # Complement Naive Bayes, as selected in LOOCV; pass the best parameters found by GridSearchCV here

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Train the model on the training data
mdl.fit(X_train, y_train)

# Make predictions on the test data
y_pred = mdl.predict(X_test)

In [None]:
joblib.dump(mdl, "model.pkl")
# Save the trained model to a file for later use

In [None]:
# Evaluation metrics
target_names = ['class 0', 'class 1', 'class 2'] # Patient Final Outcome class, Intermediate Outcome Class and Surrogate Outcome Class
print(classification_report(y_test, y_pred, target_names=target_names))

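# Calculate AUROC score using predicted probabilities
y_pred_proba = mdl.predict_proba(X_test)
auroc_weighted = roc_auc_score(y_test, y_pred_proba, average='weighted', multi_class='ovr')
print(f'Weighted one-vs-rest AUROC: {auroc_weighted:.3f}')

# Compute ROC curve for each class (one-vs-rest); columns of y_pred_proba follow mdl.classes_
for i, cls in enumerate(mdl.classes_):
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba[:, i], pos_label=cls)
    roc_auc = roc_auc_score(y_test == cls, y_pred_proba[:, i])
    plt.plot(fpr, tpr, lw=2, label=f'Class {cls} (AUC = {roc_auc:.2f})')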

# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')

# Customize the plot
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Test ROC Curve Comparison Between Classes')
plt.legend(loc="lower right")
plt.show()

# Compute precision-recall curve for each class on the training data
y_prob_train = mdl.predict_proba(X_train) # columns follow mdl.classes_
for i, cls in enumerate(mdl.classes_):
    precision, recall, thresholds = precision_recall_curve(y_train == cls, y_prob_train[:, i])
    plt.plot(recall, precision, lw=2, label=f'Class {cls}')

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Training Precision-Recall curve")
plt.legend(loc="best")
plt.show()

# Compute confusion matrix and display it
conf_matrix = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=mdl.classes_)
disp.plot()
plt.show()

Apply the model trained on the EUCT-NS dataset to a second dataset:
Due to a data sharing agreement (see LOOCV), we cannot share the NS-HRA dataset. The NS-HRA dataset was formed in a similar way to the EUCT-NS dataset, but by parsing XML rather than HTML. One could form a comparable second dataset through this approach. The method for forming specialised datasets from clinical trial documentation will be shared at a later date.
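
Since the NS-HRA data and its parsing method are not shown here, the following is only a hypothetical sketch of how XML records could be turned into a table with the columns used in the cells that follow. The XML layout and the tag names (`trial`, `title`, `objective`, `primary_endpoint`) are invented for illustration and should not be read as the actual NS-HRA schema.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Invented XML snippet; the real source documents will have a different schema
xml_text = """<trials>
  <trial>
    <title>Example trial A</title>
    <objective>Assess drug X in condition Y</objective>
    <primary_endpoint>Overall survival at 12 months</primary_endpoint>
  </trial>
  <trial>
    <title>Example trial B</title>
    <objective>Compare device Z with usual care</objective>
  </trial>
</trials>"""

root = ET.fromstring(xml_text)

rows = []
for trial in root.iter("trial"):
    # findtext returns the default when a tag is missing, which avoids NaN later
    rows.append({
        "Title": trial.findtext("title", default=""),
        "Objective": trial.findtext("objective", default=""),
        "1ry_endpoint": trial.findtext("primary_endpoint", default=""),
    })

example_ds = pd.DataFrame(rows)
print(example_ds)
```

A table built this way could be written out with `example_ds.to_csv(...)` and then read back in the same way as the second dataset in the next cell.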

In [None]:
second_ds = pd.read_csv(second_ds_path) # second_ds_path is a placeholder: set it to the path of your second dataset's CSV file first

In [None]:
# Fill missing values per column before concatenating (a single NaN field would otherwise blank the whole row)
second_cols = ['Title', 'Objective', '1ry_endpoint']
second_ds['concat_corpus'] = second_ds[second_cols].fillna('').astype(str).agg(' '.join, axis=1)

In [None]:
embedded_transformed_features = joblib.load("feature_embeddings.pkl")
# Load the fitted vectorizer saved during training

In [None]:
X2 = embedded_transformed_features.transform(second_ds['concat_corpus'])

In [None]:
mdl = joblib.load('model.pkl')
# Load the trained model

In [None]:
# Apply the model to the second dataset
y_pred = mdl.predict(X2)