<a href="https://colab.research.google.com/github/LorraineWong/WQD7005-Data-Mining-S2152880/blob/main/Notebook/Project_health_deterioration_model_clear_output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🛠️ 0. Setup and Configuration**

**Step 1: Install Core Dependencies**

Install all required Python packages for AI-powered feature engineering, modeling, and NLP tasks.

In [None]:
# Install Core Dependencies
!pip install -q numpy pandas matplotlib seaborn scikit-learn xgboost transformers datasets imbalanced-learn tqdm openai tabulate

**Step 2: Set Working Directory**

All generated data files (e.g., raw outputs, datasets, summaries) will be saved to this path for clarity and version control.

In [None]:
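# Set Working Directory in Colab/Drive
import os
from google.colab import drive

# Mount Google Drive so the path below points at Drive rather than local Colab storage
drive.mount("/content/drive")

my_file_path = "/content/drive/MyDrive/UM Data Science Course Information/WQD7005/Assignment Project/"
os.makedirs(my_file_path, exist_ok=True)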

**Step 3: Authenticate Hugging Face**

To access transformer models (e.g., MiniLM, BERT-tiny), log in to the Hugging Face Hub first.

In [None]:
# Setup Hugging Face Token
from huggingface_hub import notebook_login
notebook_login()

**Step 4: Securely Load Azure API Credentials**

Using secrets.json avoids hardcoding sensitive information. This supports secure API usage and easier sharing of my notebook.

In [None]:
# Securely Load Azure API Credentials
# Azure endpoint and keys
import json

# Load secrets.json after upload; fail early with a clear message if it is missing
secrets_file = os.path.join(my_file_path, "secrets.json")
if not os.path.exists(secrets_file):
    raise FileNotFoundError(f"secrets.json not found at {secrets_file}")

with open(secrets_file, "r") as f:
    secrets = json.load(f)

endpoint = secrets["AZURE_ENDPOINT"]
subscription_key = secrets["AZURE_KEY"]

**Step 5: Configure Azure OpenAI Client and Define GPT Prompt Wrapper**

This step sets up the Azure OpenAI client and defines a reusable function for sending prompts to GPT-4o, enabling automated clinical text generation and interpretation.




In [None]:
# Import Supporting Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from openai import AzureOpenAI
import random
import time

# Configure Azure OpenAI Client
api_version = "2024-12-01-preview"
deployment = "gpt-4o"
client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    api_key=subscription_key,
)
# Prompt execution wrapper for reuse
def model_prompt(prompt, system_prompt="Act as a professional clinician.", temperature=0.7, max_tokens=4096):
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content

**Step 6: Single Sample Data Generation via GPT (Validation Prompt)**

Verify the response structure of GPT-4o by generating a realistic single-patient daily monitoring record, ensuring that the output conforms to the expected JSON schema for later batch generation.

In [None]:
# Single Sample Data Generation via model
data_prompt = """
Generate a single, realistic patient monitoring record for one randomly selected adult patient.

Provide the following fields:
- oxygen_saturation (in %)
- heart_rate (in bpm)
- temperature (in °C)
- blood_pressure (systolic/diastolic, e.g. "120/80")
- weight (in kg)
- blood_glucose (in mg/dL)

At the end, include a brief clinical_note (1–2 sentences, max 30 words) summarizing the patient status based on the values above. Use professional clinical tone with realistic variation (e.g. stable, recovering, mild concerns).

Output as a valid JSON object with keys:
oxygen_saturation, heart_rate, temperature, blood_pressure, weight, blood_glucose, clinical_note.

Constraints:
- Only output one JSON object.
- No markdown or explanation.
- Include realistic variation across different health conditions (e.g. fatigue, post-op, dietary changes, stress).
- Ensure all fields are complete, no missing values.
"""

print(model_prompt(data_prompt))
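
Printing the reply only allows a visual check. A minimal sketch of a programmatic schema check follows, assuming the model returns bare JSON as the prompt requests; `parse_patient_record`, `EXPECTED_KEYS`, and the sample reply are illustrative names and data, not part of the original notebook.

```python
import json

# Keys requested in the prompt above
EXPECTED_KEYS = {
    "oxygen_saturation", "heart_rate", "temperature",
    "blood_pressure", "weight", "blood_glucose", "clinical_note",
}


def parse_patient_record(raw_text):
    """Parse a model reply and confirm that it matches the expected schema."""
    record = json.loads(raw_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(record, dict):
        raise ValueError("Expected a single JSON object")
    missing = EXPECTED_KEYS - set(record)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    return record


# Hand-written example reply (not real model output)
sample_reply = """{"oxygen_saturation": 97, "heart_rate": 78, "temperature": 36.8,
 "blood_pressure": "122/79", "weight": 70.5, "blood_glucose": 98,
 "clinical_note": "Vitals within normal limits; patient stable."}"""

record = parse_patient_record(sample_reply)
print(sorted(record))
```

Failing loudly on a missing key at this stage is cheaper than discovering incomplete records after a batch generation run.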

# **🧬 1. Dataset Simulation and Feature Engineering**

A synthetic patient dataset with labeled **note_status** was already generated and preprocessed in the previous assignment, so this section focuses on loading the prepared dataset, checking its basic structure, verifying label quality, and ensuring readiness for AI-driven feature engineering and modeling.

**Step 1: Load Preprocessed Dataset**

Load the previously prepared patient dataset containing all required features and labels.

In [None]:
import pandas as pd

# Load preprocessed dataset from previous assignment
df = pd.read_csv(my_file_path + "preprocessing_generate_patient_dataset.csv")

# Display basic info (df.info() prints directly and returns None, so no print() is needed)
df.info()

In [None]:
print(df.head())

In [None]:
print(df.describe())

**Step 2: Data Structure and Integrity Check**

Check data types, confirm absence of missing values, and review main variables.

In [None]:
# Check data info and missing values
df.info()
print(df.isnull().sum())

**Step 3: Check Label Distribution**

Review the distribution of the clinical status label (note_status) to ensure it is suitable for modeling.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
ax = sns.countplot(
    data=df,
    x="note_status",
    order=df["note_status"].value_counts().index,
    palette="pastel"
)

plt.title("Distribution of Clinical Note Status Labels", fontsize=14)
plt.xlabel("Note Status", fontsize=12)
plt.ylabel("Number of Records", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()

# Add count labels on each bar
for p in ax.patches:
    count = int(p.get_height())
    ax.annotate(f"{count}", (p.get_x() + p.get_width() / 2, p.get_height()),
                ha="center", va="bottom", fontsize=11, color="black")

plt.show()

**Step 4: Standardize Note Status Label Encoding**

Map the clinical status label (note_status) to standardized integer codes: 0 for Stable, 1 for Recovering, 2 for Deteriorating, and 3 for Critical. Replace the existing note_status_encoded with these values for consistency in downstream modeling.

In [None]:
# Define standardized label mapping
note_status_mapping = {
    "Stable": 0,
    "Recovering": 1,
    "Deteriorating": 2,
    "Critical": 3
}

# Apply mapping to the note_status column and overwrite note_status_encoded
df['note_status_encoded'] = df['note_status'].map(note_status_mapping)

# Preview mapping results
print(df[['note_status', 'note_status_encoded']].head())
print(df['note_status_encoded'].value_counts().sort_index())
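
One caveat with `Series.map` and a dictionary: any label that is missing from the dictionary silently becomes `NaN`. The sketch below illustrates a stricter encoding in plain Python; `encode_labels` and `status_codes` are hypothetical names introduced only for this example.

```python
def encode_labels(labels, mapping):
    """Map labels to integer codes, failing loudly on labels missing from mapping."""
    unknown = sorted({label for label in labels if label not in mapping})
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [mapping[label] for label in labels]


status_codes = {"Stable": 0, "Recovering": 1, "Deteriorating": 2, "Critical": 3}
encoded = encode_labels(["Stable", "Critical", "Recovering"], status_codes)
print(encoded)
```

A label with different casing or stray whitespace (e.g. `"stable"`) would raise here instead of producing a missing target value.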

**Summary**

The preprocessed patient dataset has been successfully loaded and validated. All core variables, including vital sign z-scores, clinical notes, and coded clinical status labels, are present with no missing values. The distribution of the “note_status” label has been visualized, revealing the class imbalance issue that will be addressed in the modeling phase. After confirming the dataset structure and quality, we are ready for AI-driven NLP feature engineering and predictive modeling.

# **🤖 2. NLP Feature Engineering**

**Step 1: Sentiment Analysis on Clinical Notes**

Extract sentiment label and score from each clinical note using a transformer-based sentiment analysis model.

In [None]:
from transformers import pipeline
from tqdm import tqdm
import pandas as pd

sentiment_pipe = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

def get_sentiment(text):
    result = sentiment_pipe(text[:512])[0]
    return pd.Series([result['label'], result['score']])

tqdm.pandas(desc="Sentiment Analysis")
df[['sentiment_label', 'sentiment_score']] = df['clinical_note'].progress_apply(get_sentiment)

In [None]:
# Preview sentiment features
print(df[['sentiment_label', 'sentiment_score']].head())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
ax = sns.countplot(
    data=df,
    x="sentiment_label",
    order=df["sentiment_label"].value_counts().index,
    palette="pastel"
)

plt.title("Distribution of Sentiment Labels", fontsize=14)
plt.xlabel("Sentiment Label", fontsize=12)
plt.ylabel("Number of Records", fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()

# Add count labels on each bar
for p in ax.patches:
    count = int(p.get_height())
    ax.annotate(f"{count}", (p.get_x() + p.get_width() / 2, p.get_height()),
                ha="center", va="bottom", fontsize=11, color="black")

plt.show()

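To check for ambiguous clinical notes, I analyze the distribution of sentiment scores. The pipeline reports the probability of the predicted label, so for this two-class model every score lies between 0.5 and 1. Scores near 0.5 suggest uncertainty, while scores close to 1 indicate confident classification.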

In [None]:
# Count the number of potential "neutral" cases (sentiment_score between 0.4 and 0.6)
neutral_count = ((df['sentiment_score'] >= 0.4) & (df['sentiment_score'] <= 0.6)).sum()
total = len(df)
percent_neutral = 100 * neutral_count / total

print(f"Potential 'neutral' cases (score between 0.4 and 0.6): {neutral_count} ({percent_neutral:.2f}%)")

In [None]:
# Visualize the distribution of sentiment scores for all clinical notes
import matplotlib.pyplot as plt

plt.figure(figsize=(6,4))
plt.hist(df['sentiment_score'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of Sentiment Scores")
plt.xlabel("Sentiment Score")
plt.ylabel("Count")
plt.show()

In [None]:
# Preview up to five random samples of "neutral" sentiment cases
neutral_mask = (df['sentiment_score'] >= 0.4) & (df['sentiment_score'] <= 0.6)
neutral_samples = df[neutral_mask].sample(n=min(5, int(neutral_mask.sum())), random_state=42)
print(neutral_samples[['clinical_note', 'note_status','sentiment_label', 'sentiment_score']])

The sentiment scores are strongly polarized, with very few notes falling in the ambiguous range (0.4–0.6). Only 2.8% of clinical notes show unclear sentiment, confirming that binary sentiment labels are appropriate for this dataset.
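
The score attached to each sentiment label is the probability of the predicted class, which for a two-class softmax can never fall below 0.5. The sketch below reproduces that behavior in plain Python; `top_label_score` and the logits are made up for illustration and do not come from the model used above.

```python
import math


def top_label_score(logit_negative, logit_positive):
    """Return (label, score) the way a two-class softmax classifier would."""
    # Softmax over two logits, shifted by the max for numerical stability
    m = max(logit_negative, logit_positive)
    exp_neg = math.exp(logit_negative - m)
    exp_pos = math.exp(logit_positive - m)
    p_pos = exp_pos / (exp_neg + exp_pos)
    if p_pos >= 0.5:
        return "POSITIVE", p_pos
    return "NEGATIVE", 1.0 - p_pos


# Hypothetical logits, chosen only for illustration
for logits in [(-3.0, 3.0), (0.1, -0.1), (2.0, -2.0)]:
    label, score = top_label_score(*logits)
    print(label, round(score, 3))
```

In practice this means the 0.4 to 0.6 window used above only ever captures scores between 0.5 and 0.6.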

**Step 2: Encode Sentiment Labels**

Encode the sentiment labels as binary numeric values (1 = POSITIVE, 0 = NEGATIVE) for use as model input.

In [None]:
# Encode sentiment label as binary (POSITIVE=1, NEGATIVE=0)
df['sentiment_label_encoded'] = df['sentiment_label'].map({'POSITIVE': 1, 'NEGATIVE': 0})

# Preview encoded results
print(df[['sentiment_label', 'sentiment_label_encoded']].head())
print(df['sentiment_label_encoded'].value_counts())

**Step 3: Generate MiniLM Embeddings for Clinical Notes**

Convert each clinical note into a dense vector using the MiniLM model, creating numerical features that capture the semantic meaning of the text.

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from tqdm import tqdm

# Load MiniLM model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Function to get mean-pooled sentence embedding
def get_embedding(text):
    inputs = tokenizer(text[:512], return_tensors="pt", truncation=True, padding=True, max_length=64)
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
    return emb

# Generate embeddings with progress bar
embeddings = []
for note in tqdm(df['clinical_note'], desc="Generating MiniLM Embeddings"):
    embeddings.append(get_embedding(note))
embeddings = np.vstack(embeddings)
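
Because `get_embedding` encodes one note at a time, no padding tokens are present and a plain mean over tokens is sufficient. If the notes were batched for speed, padded positions would need to be excluded from the average. The sketch below shows that masked mean in plain Python, with lists standing in for tensors; `masked_mean_pool` and the toy vectors are assumptions for illustration only.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where attention_mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: three real tokens followed by one padding token (2-D vectors)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

pooled_masked = masked_mean_pool(tokens, mask)          # padding ignored
pooled_naive = masked_mean_pool(tokens, [1, 1, 1, 1])   # padding included, mean is diluted
print(pooled_masked, pooled_naive)
```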

In [None]:
# Convert to DataFrame and merge with main df
embeddings_df = pd.DataFrame(embeddings, columns=[f'embedding_{i+1}' for i in range(embeddings.shape[1])])
df = pd.concat([df.reset_index(drop=True), embeddings_df], axis=1)

# Preview embedding features
print(embeddings_df.shape)
print(embeddings_df.head())

# **🏷️ 3. Feature and Target Assignment**

**Step 1: Select Features for Modeling**

Combine structured vital signs, sentiment analysis features, and MiniLM embeddings to form the initial input feature set.

In [None]:
# List of structured vital sign features
vital_features = [
    'temperature_zscore', 'heart_rate_zscore', 'blood_glucose_zscore',
    'oxygen_saturation_zscore', 'systolic_bp_zscore', 'diastolic_bp_zscore', 'weight_zscore'
]

# NLP features (sentiment + embeddings)
nlp_features = ['sentiment_label_encoded', 'sentiment_score'] + [f'embedding_{i+1}' for i in range(embeddings_df.shape[1])]

# Combine all features for model input
feature_cols = vital_features + nlp_features
X = df[feature_cols]
print("Feature matrix shape:", X.shape)

**Step 2: Define Target Variable**

Set the encoded clinical status label as the prediction target for multi-class classification.

In [None]:
# Target variable (multi-class clinical status)
y = df['note_status_encoded']
print("Target distribution:\n", y.value_counts().sort_index())

# **🔀 4. Train-Test Split**


**Step 1: Split the Dataset**

Split the dataset into training and test sets, stratifying by the target variable to maintain class distribution.

In [None]:
from sklearn.model_selection import train_test_split

# 80% for training, 20% for testing, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Training target distribution:\n", y_train.value_counts().sort_index())
print("Test target distribution:\n", y_test.value_counts().sort_index())
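
The printed distributions can also be compared numerically. The sketch below computes the largest difference in class share between two splits; `class_proportions`, `max_proportion_gap`, and the toy labels are hypothetical and serve only to illustrate what stratification is expected to preserve.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: share of samples} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


def max_proportion_gap(train_labels, test_labels):
    """Largest absolute difference in class share between two splits."""
    train_p = class_proportions(train_labels)
    test_p = class_proportions(test_labels)
    classes = set(train_p) | set(test_p)
    return max(abs(train_p.get(c, 0.0) - test_p.get(c, 0.0)) for c in classes)


# Made-up labels with an 80/20 split that keeps a 3:1 class ratio
toy_train = [0] * 60 + [1] * 20
toy_test = [0] * 15 + [1] * 5
gap = max_proportion_gap(toy_train, toy_test)
print(gap)
```

With a stratified split the gap should be close to zero; a large gap would suggest the split was not stratified as intended.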

# **⚖️ 5. Class Imbalance Handling (SMOTE)**

**Step 1: Balance the Training Set with SMOTE**

Apply SMOTE to the training set to generate synthetic minority samples and balance class distribution. The process may take some time for high-dimensional data.

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter
import time
from tqdm import tqdm

# Optional: label mapping for pretty output
note_status_mapping = {0: "Stable", 1: "Recovering", 2: "Deteriorating", 3: "Critical"}

# Print original class distribution (with labels)
print("Original training class distribution:")
for k, v in Counter(y_train).items():
    print(f"  {note_status_mapping[k]} ({k}): {v}")

# Start timer
start = time.time()

# Run SMOTE with overall progress feel
print("Applying SMOTE to balance classes...")
for _ in tqdm(range(1), desc="SMOTE Oversampling"):
    sm = SMOTE(random_state=42)
    X_train_bal, y_train_bal = sm.fit_resample(X_train, y_train)

# End timer
end = time.time()
print(f"SMOTE completed in {end-start:.2f} seconds.")

In [None]:
# Print balanced class distribution (with labels)
print("Balanced training class distribution:")
for k, v in Counter(y_train_bal).items():
    print(f"  {note_status_mapping[k]} ({k}): {v}")

print("Balanced training set shape:", X_train_bal.shape)
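
SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step on made-up 2-D points; it is a simplified illustration, not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical name.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the line segment between x and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
minority_point = [1.0, 2.0]      # made-up minority-class sample
nearest_neighbor = [3.0, 6.0]    # made-up neighbor from the same class

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Since every synthetic point lies between two existing points, SMOTE does not add information beyond the region the minority class already occupies, which is one reason it is applied to the training set only.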

# **🚀 6. Model Development and Evaluation**

**Step 1: Define the Traditional Model Evaluation Function**

Define a general-purpose evaluation function for traditional models, reporting metrics, confusion matrix, and classification report.

In [None]:
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name="Model"):
    """
    Evaluate a classification model with standard metrics and visualize the confusion matrix.
    Returns a summary dictionary for results table, including raw evaluation artifacts for LLM analysis.
    """
    from sklearn.metrics import (
        accuracy_score, f1_score, precision_score, recall_score,
        classification_report, confusion_matrix
    )
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Predict
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)

    # Metrics
    acc = accuracy_score(y_test, y_pred)
    f1_macro = f1_score(y_test, y_pred, average="macro")
    prec_macro = precision_score(y_test, y_pred, average="macro")
    recall_macro = recall_score(y_test, y_pred, average="macro")
    f1_train = f1_score(y_train, y_train_pred, average="macro")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Print results
    print(f"\n===== {model_name} Evaluation =====")
    print(f"Accuracy: {acc:.4f} | Macro F1: {f1_macro:.4f} | Precision: {prec_macro:.4f} | Recall: {recall_macro:.4f}")

    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
    plt.title(f"{model_name} Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, digits=3))

    # Add these for LLM-friendly result summaries
    cm_list = cm.tolist()  # For JSON/prompt/LLM
    report_dict = classification_report(y_test, y_pred, digits=3, output_dict=True)  # For AI parsing
    report_str = classification_report(y_test, y_pred, digits=3)  # For human reading

    # Return summary dict
    return {
        "Model": model_name,
        "Train F1": round(f1_train, 4),
        "Test F1": round(f1_macro, 4),
        "Accuracy": round(acc, 4),
        "Precision": round(prec_macro, 4),
        "Recall": round(recall_macro, 4),
        "Confusion Matrix": cm_list,
        "Classification Report (dict)": report_dict,
        "Classification Report (str)": report_str
    }
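
Macro averaging computes F1 per class and then takes an unweighted mean, so minority classes count as much as the majority class. The sketch below derives macro F1 directly from a confusion matrix to make that explicit; `macro_f1_from_confusion` and the toy matrix are illustrative assumptions, and classes with no true or predicted samples are scored as 0.

```python
def macro_f1_from_confusion(cm):
    """Compute macro-averaged F1 from a square confusion matrix (rows = actual)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # actual k, predicted other
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, actual other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Made-up 2-class confusion matrix
toy_cm = [[8, 2],
          [1, 9]]
toy_macro_f1 = macro_f1_from_confusion(toy_cm)
print(round(toy_macro_f1, 4))
```

The same function could be applied to the `"Confusion Matrix"` list returned by `evaluate_model` as a cross-check on the reported `"Test F1"`.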

**Step 2: Train and Evaluate Random Forest**

Train a Random Forest classifier. Evaluate its test set performance using key metrics and visualize the confusion matrix.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import RandomForestClassifier

results_list = []

# Train Random Forest on balanced training set
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=10,
    min_samples_split=8,
    min_samples_leaf=4,
    max_features='sqrt',
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train_bal, y_train_bal)

# Evaluate model performance on test set
rf_results = evaluate_model(rf, X_train_bal, y_train_bal, X_test, y_test, model_name="Random Forest")
results_list.append(rf_results)

**Step 3: Train and Evaluate XGBoost**

Train an XGBoost classifier using the same features and balanced data. Evaluate its performance using the same metrics for fair comparison.

In [None]:
from xgboost import XGBClassifier

# Train XGBoost on balanced training set
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.07,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=6,
    gamma=2,
    eval_metric='mlogloss',
    random_state=42,
    n_jobs=-1
)
xgb.fit(X_train_bal, y_train_bal)

# Evaluate model performance on test set
xgb_results = evaluate_model(xgb, X_train_bal, y_train_bal, X_test, y_test, model_name="XGBoost")
results_list.append(xgb_results)

**Step 4: Train and Evaluate MLP Neural Network**

Train a Multi-Layer Perceptron (MLP) neural network. Evaluate and visualize its classification performance.

In [None]:
from sklearn.neural_network import MLPClassifier

# Train Multi-layer Perceptron on balanced training set
mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    alpha=0.01,
    batch_size=64,
    learning_rate_init=0.002,
    max_iter=200,
    early_stopping=True,
    random_state=42
)
mlp.fit(X_train_bal, y_train_bal)

# Evaluate model performance on test set
mlp_results = evaluate_model(mlp, X_train_bal, y_train_bal, X_test, y_test, model_name="MLP Neural Network")
results_list.append(mlp_results)

**Step 5: Define the Transformer Model Evaluation Function**

Split the clinical notes into training and test sets, then define a specialized evaluation function and a fine-tuning helper for transformer-based models, capturing all metrics and artifacts needed for LLM-assisted interpretation.

In [None]:
from sklearn.model_selection import train_test_split

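# Split data into train and test sets, reusing the seed and stratification from Section 4
# so the transformers are evaluated on the same test rows as the traditional models
train_texts, test_texts, train_labels_raw, test_labels_raw = train_test_split(
    df["clinical_note"].tolist(),
    df["note_status"].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df["note_status_encoded"]
)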

# Manual label mapping for strict order: 0=Stable, 1=Recovering, 2=Deteriorating, 3=Critical
status2id = {
    "Stable": 0,
    "Recovering": 1,
    "Deteriorating": 2,
    "Critical": 3
}
id2status = {v: k for k, v in status2id.items()}

# Map original labels to integer labels
train_labels = [status2id[x] for x in train_labels_raw]
test_labels  = [status2id[x] for x in test_labels_raw]

print("Label mapping:", status2id)
print("Train label unique values:", set(train_labels))
print("Test label unique values:", set(test_labels))


In [None]:
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
)
import matplotlib.pyplot as plt
import seaborn as sns

def evaluate_transformer_model(y_true, y_pred, model_name="Model", target_names=None):
    """
    Evaluate a transformer model's predictions and print confusion matrix and metrics.
    Returns summary dictionary for results table, including confusion matrix and classification report.
    """
    acc = accuracy_score(y_true, y_pred)
    f1_macro = f1_score(y_true, y_pred, average="macro")
    prec_macro = precision_score(y_true, y_pred, average="macro")
    recall_macro = recall_score(y_true, y_pred, average="macro")

    cm = confusion_matrix(y_true, y_pred)

    print(f"\n===== {model_name} Evaluation =====")
    print(f"Accuracy: {acc:.4f} | Macro F1: {f1_macro:.4f} | Precision: {prec_macro:.4f} | Recall: {recall_macro:.4f}")

    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")
    plt.title(f"{model_name} Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.show()

    # Classification reports
    report_dict = classification_report(y_true, y_pred, target_names=target_names, digits=3, output_dict=True)
    report_str = classification_report(y_true, y_pred, target_names=target_names, digits=3)
    print("\nClassification Report:")
    print(report_str)

    return {
        "Model": model_name,
        "Test F1": round(f1_macro, 4),
        "Accuracy": round(acc, 4),
        "Precision": round(prec_macro, 4),
        "Recall": round(recall_macro, 4),
        "Confusion Matrix": cm.tolist(),                # <-- For saving or LLM prompt
        "Classification Report (dict)": report_dict,    # <-- For LLM, code, summary
        "Classification Report (str)": report_str       # <-- For direct prompt/human reading
    }

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import os

os.environ["WANDB_DISABLED"] = "true"  # Disable external logging

def train_and_predict_transformer(model_ckpt, train_texts, train_labels, test_texts, test_labels, model_name="Transformer"):
    # 1. Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=4)

    # 2. Tokenize texts
    train_encodings = tokenizer(train_texts, truncation=True, padding=True)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True)

    train_dataset = Dataset.from_dict({**train_encodings, "label": train_labels})
    test_dataset = Dataset.from_dict({**test_encodings, "label": test_labels})

    # 3. Training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=5,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        logging_dir="./logs",
        logging_strategy="epoch",
        save_strategy="no",
        report_to="none"
    )

    # 4. Train
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset
    )
    trainer.train()

    # 5. Predict
    preds = trainer.predict(test_dataset)
    y_pred = np.argmax(preds.predictions, axis=1)
    return y_pred

**Step 6: Train and Evaluate BERT-base Transformer**

Fine-tune a BERT-base transformer on clinical note text for health status prediction. Evaluate its test performance with standard metrics and confusion matrix.

In [None]:
target_names = ['Stable', 'Recovering', 'Deteriorating', 'Critical']

# 1. BERT-base-uncased
bert_pred = train_and_predict_transformer("bert-base-uncased", train_texts, train_labels, test_texts, test_labels, model_name="BERT-base")
results_list.append(evaluate_transformer_model(test_labels, bert_pred, "BERT-base", target_names=target_names))

**Step 7: Train and Evaluate BioBERT Transformer**

Fine-tune a BioBERT transformer model on the same prediction task. Assess its results using identical metrics for comparison.

In [None]:
# 2. BioBERT
biobert_pred = train_and_predict_transformer("dmis-lab/biobert-base-cased-v1.1", train_texts, train_labels, test_texts, test_labels, model_name="BioBERT")
results_list.append(evaluate_transformer_model(test_labels, biobert_pred, "BioBERT", target_names=target_names))

**Step 8: Train and Evaluate DeBERTa Transformer**

Fine-tune a DeBERTa transformer model for the multi-class classification task. Evaluate and compare its predictive performance.

In [None]:
# 3. DeBERTa
deberta_pred = train_and_predict_transformer("microsoft/deberta-base", train_texts, train_labels, test_texts, test_labels, model_name="DeBERTa")
results_list.append(evaluate_transformer_model(test_labels, deberta_pred, "DeBERTa", target_names=target_names))

**Step 9: Summarize and Compare All Model Results**

Aggregate all results into a summary table for direct comparison across all traditional and transformer-based models.

In [None]:
import pandas as pd

summary_df = pd.DataFrame(results_list)
display(summary_df)
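
Beyond displaying the table, the result dictionaries can be ranked and checked for a train-test gap, which is only available for the traditional models because the transformer results carry no `"Train F1"` entry. The sketch below uses placeholder numbers, not results from this notebook; `rank_by_test_f1` is a hypothetical helper.

```python
def rank_by_test_f1(results):
    """Sort result dicts by 'Test F1' (descending) and add a train-test gap when available."""
    ranked = []
    for row in sorted(results, key=lambda r: r["Test F1"], reverse=True):
        train_f1 = row.get("Train F1")  # absent for the transformer results
        gap = None if train_f1 is None else round(train_f1 - row["Test F1"], 4)
        ranked.append({"Model": row["Model"], "Test F1": row["Test F1"], "F1 gap": gap})
    return ranked


# Placeholder numbers for illustration only, not results from this notebook
toy_results = [
    {"Model": "Model A", "Train F1": 0.95, "Test F1": 0.80},
    {"Model": "Model B", "Test F1": 0.85},
]
ranked = rank_by_test_f1(toy_results)
for row in ranked:
    print(row)
```

A large positive gap is a hint of overfitting, though here the training score is measured on SMOTE-balanced data, so it is not directly comparable to the test score.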

# **🧑‍🔬 7. LLM-Assisted Model Interpretation and Reporting**

Leverage a Large Language Model (LLM) such as GPT-4o to automatically interpret, compare, and summarize the predictive performance of all evaluated models. This enables objective, human-readable scientific reporting and evidence-based model selection.

**Step 1: Prepare the Model Performance Summary Table**

Convert the pandas summary table of all model results into a markdown-formatted string for easier consumption by an LLM.

In [None]:
# Select key columns for summary and convert to markdown for LLM input
summary_table_text = summary_df[["Model", "Test F1", "Accuracy", "Precision", "Recall"]].to_markdown(index=False)
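
`DataFrame.to_markdown` depends on the optional `tabulate` package. Where that package is unavailable, a pipe-delimited table can be built by hand. The sketch below is one minimal way to do so; `to_markdown_table` and the placeholder rows are assumptions for illustration, not part of the original notebook.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a pipe-delimited markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider] + body)


# Placeholder values for illustration only
toy_rows = [{"Model": "Model A", "Test F1": 0.8}, {"Model": "Model B", "Test F1": 0.85}]
table_text = to_markdown_table(toy_rows, ["Model", "Test F1"])
print(table_text)
```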

**Step 2: Define an Expert Prompt for the LLM**

Write an instruction prompt that asks the LLM to analyze and summarize the model comparison table with scientific rigor and clarity.

In [None]:
LLM_SUMMARY_PROMPT = """
You are an expert data scientist. Below is a summary table reporting key test set performance metrics (Test F1, Accuracy, Precision, Recall) for six machine learning models (three traditional and three transformer-based) on a multi-class clinical status prediction task.

Summary Table (test set results):

{summary_table}

Instructions:
1. Compare the performance of traditional machine learning models (Random Forest, XGBoost, MLP Neural Network) with transformer-based models (BERT-base, BioBERT, DeBERTa), citing specific metrics and models by name.
2. Identify the best-performing model(s) and justify your conclusion with numerical evidence.
3. Discuss interesting trends, weaknesses, or trade-offs, such as class imbalance, computational resources, and overfitting risks.
4. Briefly comment on each model's practicality for clinical deployment, considering real-world resource or interpretability constraints.
5. Suggest one area for further improvement or future research.
6. Conclude with a clear, formal academic recommendation (1-2 sentences).

Write your summary in concise, formal, and academic English, suitable for a scientific report. Use bullet points if appropriate for clarity.
"""


**Step 3: Format and Compose the Final Prompt**

Insert the model performance table into the prompt template for LLM processing.

In [None]:
# Merge the prompt with the actual model performance table
final_prompt = LLM_SUMMARY_PROMPT.format(summary_table=summary_table_text)

**Step 4: Generate an Expert Summary via LLM**

Send the formatted prompt to your LLM API (e.g., Azure, OpenAI GPT-4o) and print the summary for reporting.

In [None]:
# Call your LLM (replace with your actual LLM function, e.g., OpenAI/Azure call)
llm_response = model_prompt(final_prompt, system_prompt="You are an expert clinical data scientist. Write in academic style.")

# Output the summary for inclusion in your report
print(llm_response)