<a href="https://www.kaggle.com/code/avtnshm/clinical-modernbert-v-biomedicalmb-on-ddxplus-data?scriptVersionId=288157272" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# A comparison of two ModernBERT models: Clinical ModernBERT (CMB) and BioClinical ModernBERT (BMB)


### Introduction to the Models and Dataset used
#### ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy (arxiv.org/abs/2412.13663)

#### BioClinical ModernBERT - a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP (arxiv.org/abs/2506.10896)

#### Clinical ModernBERT - encoder pretrained on large scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT...(arxiv.org/abs/2504.03964)

#### DDXPlus - a large-scale synthetic dataset of roughly 1.3 million patients that includes a differential diagnosis, along with the ground truth pathology, symptoms and antecedents for each patient.(arxiv.org/abs/2205.09148)

### Aim of the Notebook Paper
#### The objective of this Kaggle notebook, run on a GPU T4 x2 accelerator, is to compare the performance of the two models above on DDXPlus and to tabulate and present the results in an easily readable format.

### Methodology
- First, we load the DDXPlus train, test and validation splits, and load the evidences JSON into a DataFrame.
- The models are then loaded from the transformers library and used to generate embeddings for a stratified 30 per cent sample of each split. The full dataset (about 1.2 M records) is computationally expensive to embed: the Kaggle notebook has roughly 20 GB of working memory, and the generated embeddings would need more storage than that, hence the stratified sampling.
- We then visualize the generated embeddings with UMAP and t-SNE plots.
- Next come the downstream evaluation tasks, first with a logistic regression model and then with a more intricate MLPClassifier model.
- Finally, we compare top-k scores, accuracy and free-text case predictions for both models, tabulate the results and try to interpret them.

### Loading and reading the DDXPlus Dataset

In [None]:
# Step 1: Load Data

import pandas as pd
import numpy as np
import json

train_df = pd.read_csv('/kaggle/input/mldataset/ddxplus/train.csv')
test_df = pd.read_csv('/kaggle/input/mldataset/ddxplus/test.csv')
validate_df = pd.read_csv('/kaggle/input/mldataset/ddxplus/validate.csv')

print(train_df.shape, validate_df.shape, test_df.shape)

In [None]:
with open('/kaggle/input/mldataset/ddxplus/release_evidences.json') as f:
    evidences = json.load(f)

evidences_df = pd.DataFrame.from_dict(evidences, orient='index')

#### Checking the datasets

In [None]:
train_df.head(5)

In [None]:
evidences_df.head(5)

In [None]:
train_df.info()

#### Observations - the DDXPlus train dataset has over a million rows with no null values in any column, namely age, ddx, gender, pathology, evidences and initial evidence. The evidences dataset defines how each evidence code maps to its meaning, with fields such as the question, the antecedent flag, the value meanings and the possible values, as seen below:

In [None]:
evidences_df.info()
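
The next cell calls `codes_to_text`, which is not defined anywhere in this notebook. The following is a minimal sketch of what such a helper could look like, not the author's implementation. It assumes that the `EVIDENCES` column stores each list as a string, that categorical codes take the form `E_55_@_V_29`, and that each entry of `release_evidences.json` has `question_en` and `value_meaning` fields. The names `make_codes_to_text` and `toy_evidences` are hypothetical, and the toy dictionary only stands in for the real file.

```python
import ast

# Hypothetical stand-in for release_evidences.json; the field names
# (question_en, value_meaning) are assumed to match the real file.
toy_evidences = {
    "E_91": {"question_en": "Do you have a fever?", "value_meaning": {}},
    "E_55": {
        "question_en": "Where do you feel pain?",
        "value_meaning": {"V_29": {"en": "chest"}},
    },
}


def make_codes_to_text(evidence_dict):
    """Build a codes_to_text function bound to one evidence dictionary."""

    def codes_to_text(codes):
        # The CSV stores each list as a string, so parse it if needed.
        if isinstance(codes, str):
            codes = ast.literal_eval(codes)
        parts = []
        for code in codes:
            # Split "E_55_@_V_29" into the evidence name and its value.
            name, _, value = code.partition("_@_")
            entry = evidence_dict.get(name, {})
            text = entry.get("question_en", name)  # fall back to the raw code
            if value:
                meaning = entry.get("value_meaning", {}).get(value, {})
                text += " " + str(meaning.get("en", value))
            parts.append(text)
        return " ".join(parts)

    return codes_to_text


codes_to_text = make_codes_to_text(toy_evidences)
example = codes_to_text("['E_91', 'E_55_@_V_29']")
print(example)
```

In the notebook itself, the helper would be built from the loaded file with `make_codes_to_text(evidences)` so that `.apply(codes_to_text)` works unchanged.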

In [None]:
# codes_to_text (evidence codes -> readable text) must be defined before this cell is run
train_df["TEXT"] = train_df["EVIDENCES"].apply(codes_to_text)
validate_df["TEXT"] = validate_df["EVIDENCES"].apply(codes_to_text)
test_df["TEXT"] = test_df["EVIDENCES"].apply(codes_to_text)

#### Taking a stratified 30 per cent sample of each split due to the dataset's large size and GPU constraints

In [None]:
from sklearn.model_selection import train_test_split

train_30, _ = train_test_split(
    train_df,
    test_size=0.70,
    stratify=train_df["PATHOLOGY"],
    random_state=42
)

valid_30, _ = train_test_split(
    validate_df,
    test_size=0.70,
    stratify=validate_df["PATHOLOGY"],
    random_state=42
)

test_30, _ = train_test_split(
    test_df,
    test_size=0.70,
    stratify=test_df["PATHOLOGY"],
    random_state=42
)

print(train_30.shape, valid_30.shape, test_30.shape)
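
Stratifying on `PATHOLOGY` should leave each pathology with nearly the same share of rows in the sample as in the full split. A small check along these lines can confirm that. This is a sketch with toy labels. The helper names `class_proportions` and `max_proportion_gap` are hypothetical, and in the notebook the inputs would be `train_df["PATHOLOGY"]` and `train_30["PATHOLOGY"]`.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: share of rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def max_proportion_gap(full_labels, sample_labels):
    """Largest absolute difference in class share between full data and sample."""
    full = class_proportions(full_labels)
    sample = class_proportions(sample_labels)
    return max(abs(full[c] - sample.get(c, 0.0)) for c in full)


# Toy labels (hypothetical), mimicking a 30 per cent stratified sample.
full_labels = ["URTI"] * 60 + ["Pneumonia"] * 30 + ["GERD"] * 10
sample_labels = ["URTI"] * 18 + ["Pneumonia"] * 9 + ["GERD"] * 3

gap = max_proportion_gap(full_labels, sample_labels)
print(f"largest class-share gap: {gap:.4f}")
```

A gap close to zero indicates the sample mirrors the class balance of the full split. A class missing from the sample would show up as a gap equal to its full share.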

#### Loading the models and tokenizers from the transformers library

In [None]:
!pip install -q transformers
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using:", device)

cmb_model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT').to(device)
cmb_tok = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')

bmb_model = AutoModel.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base").to(device)
bmb_tok = AutoTokenizer.from_pretrained("thomas-sounack/BioClinical-ModernBERT-base")

#### Generating the Embeddings using both the loaded Models

In [None]:
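def embed(text_list, tokenizer, model, batch_size=32):
    """Return one embedding per text: the first-token vector of the last hidden layer."""
    all_vecs = []

    for i in range(0, len(text_list), batch_size):
        batch = text_list[i:i+batch_size]
        # Pad to the longest text in the batch and truncate to 128 tokens
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)

        # Inference only, so skip gradient tracking
        with torch.no_grad():
            out = model(**inputs).last_hidden_state[:,0,:]

        all_vecs.append(out.cpu().numpy())

    # Stack the per-batch arrays into a single (n_texts, hidden_size) matrix
    return np.vstack(all_vecs)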
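
The `embed` function keeps only the first-token vector of each sequence (`last_hidden_state[:,0,:]`). For illustration, the NumPy sketch below shows what that slice selects on a toy array, next to masked mean pooling, a commonly used alternative that this notebook does not use. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def cls_pool(hidden):
    """Take the first-token vector per sequence, like last_hidden_state[:, 0, :]."""
    return hidden[:, 0, :]


def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up numbers).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

cls_vecs = cls_pool(hidden)
mean_vecs = masked_mean_pool(hidden, attention_mask)
print(cls_vecs.shape, mean_vecs.shape)
```

Both poolings return one vector per text, so either could feed the downstream classifiers. Which one works better for these encoders is an empirical question that this sketch does not answer.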

In [None]:
train_texts = train_30["TEXT"].tolist()
valid_texts = valid_30["TEXT"].tolist()
test_texts  = test_30["TEXT"].tolist()

cmb_train = embed(train_texts, cmb_tok, cmb_model)
bmb_train = embed(train_texts, bmb_tok, bmb_model)

In [None]:
cmb_valid = embed(valid_texts, cmb_tok, cmb_model)
bmb_valid = embed(valid_texts, bmb_tok, bmb_model)

cmb_test = embed(test_texts, cmb_tok, cmb_model)
bmb_test = embed(test_texts, bmb_tok, bmb_model)

In [None]:
!pip install umap-learn -q

In [None]:
import numpy as np

# Number of samples for visualization
N = 5000

# Random indices (seeded so the same points are plotted on every run)
rng = np.random.default_rng(42)
idx = rng.choice(len(train_30), size=N, replace=False)

cmb_vis = cmb_train[idx]
bmb_vis = bmb_train[idx]
labels_vis = train_30["PATHOLOGY"].iloc[idx].values

In [None]:
import umap.umap_ as umap

reducer = umap.UMAP(n_neighbors=30, min_dist=0.1, metric="cosine")

cmb_2d = reducer.fit_transform(cmb_vis)
bmb_2d = reducer.fit_transform(bmb_vis)

In [None]:
# Imported here for the top-k evaluation cells further down, which run once the classifiers have been fitted
from sklearn.metrics import top_k_accuracy_score

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels_encoded = le.fit_transform(labels_vis)

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(14,6))

plt.subplot(1,2,1)
plt.scatter(cmb_2d[:,0], cmb_2d[:,1], c=labels_encoded, cmap="tab20", s=5)
plt.title("CMB Embedding Space (UMAP)")
plt.xticks([]); plt.yticks([])

plt.subplot(1,2,2)
plt.scatter(bmb_2d[:,0], bmb_2d[:,1], c=labels_encoded, cmap="tab20", s=5)
plt.title("BMB Embedding Space (UMAP)")
plt.xticks([]); plt.yticks([])

plt.show()

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    learning_rate="auto",
    perplexity=50,
    init="pca",
    random_state=42
)

cmb_tsne = tsne.fit_transform(cmb_vis)
bmb_tsne = tsne.fit_transform(bmb_vis)

In [None]:
plt.figure(figsize=(14,6))

plt.subplot(1,2,1)
plt.scatter(cmb_tsne[:,0], cmb_tsne[:,1], c=labels_encoded, cmap="tab20", s=5)
plt.title("CMB Embeddings (t-SNE)")
plt.xticks([]); plt.yticks([])

plt.subplot(1,2,2)
plt.scatter(bmb_tsne[:,0], bmb_tsne[:,1], c=labels_encoded, cmap="tab20", s=5)
plt.title("BMB Embeddings (t-SNE)")
plt.xticks([]); plt.yticks([])

plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

y_train = train_30["PATHOLOGY"]
y_valid = valid_30["PATHOLOGY"]

clf_cmb = LogisticRegression(max_iter=200)
clf_cmb.fit(cmb_train, y_train)

# Reuse the validation embeddings computed earlier instead of re-embedding
pred_cmb = clf_cmb.predict(cmb_valid)
print("CMB Macro F1:", f1_score(y_valid, pred_cmb, average='macro'))
print("CMB Accuracy:", accuracy_score(y_valid, pred_cmb))

In [None]:
clf_bmb = LogisticRegression(max_iter=200)
clf_bmb.fit(bmb_train, y_train)

# Reuse the validation embeddings computed earlier instead of re-embedding
pred_bmb = clf_bmb.predict(bmb_valid)
print("BMB Macro F1:", f1_score(y_valid, pred_bmb, average='macro'))
print("BMB Accuracy:", accuracy_score(y_valid, pred_bmb))
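
Macro F1 averages the per-class F1 scores with equal weight, so rare pathologies count as much as common ones. The pure-Python sketch below, using toy labels, shows the computation behind `f1_score(..., average='macro')` and how it can differ from plain accuracy. The function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: class "A" is common, class "B" is rare.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro F1={score:.3f}")
```

Here one missed case from the rare class lowers macro F1 more than it lowers accuracy, which is why both metrics are worth reporting on an imbalanced label set.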

In [None]:
from sklearn.neural_network import MLPClassifier

mlp_cmb = MLPClassifier(hidden_layer_sizes=(512,256), max_iter=20)
mlp_cmb.fit(cmb_train, y_train)
pred_cmb_mlp = mlp_cmb.predict_proba(cmb_valid)

print("CMB Top-3:", top_k_accuracy_score(y_valid, pred_cmb_mlp, k=3))
print("CMB Top-5:", top_k_accuracy_score(y_valid, pred_cmb_mlp, k=5))

In [None]:
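# Note: this MLP is smaller and trained for fewer iterations than mlp_cmb above,
# so the two MLP scores are not a like-for-like comparison
mlp_bmb = MLPClassifier(hidden_layer_sizes=(256,), max_iter=5)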
mlp_bmb.fit(bmb_train, y_train)
pred_bmb_mlp = mlp_bmb.predict_proba(bmb_valid)

print("BMB Top-3:", top_k_accuracy_score(y_valid, pred_bmb_mlp, k=3))
print("BMB Top-5:", top_k_accuracy_score(y_valid, pred_bmb_mlp, k=5))

In [None]:
from sklearn.metrics import top_k_accuracy_score

# Class probabilities from the fitted logistic regression classifiers
pred_probs_cmb = clf_cmb.predict_proba(cmb_valid)
pred_probs_bmb = clf_bmb.predict_proba(bmb_valid)

# Top-1 to Top-5 accuracy for each model
for name, probs in [("CMB", pred_probs_cmb), ("BMB", pred_probs_bmb)]:
    for k in range(1, 6):
        print(f"{name} Top-{k}:", top_k_accuracy_score(y_valid, probs, k=k))
    print()
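
Top-k accuracy counts a prediction as correct when the true pathology is among the k classes with the highest predicted probability. The pure-Python sketch below, on toy probabilities, illustrates the idea. The function name `top_k_accuracy` is hypothetical, and the treatment of tied probabilities may differ from scikit-learn's.

```python
def top_k_accuracy(y_true, probs, classes, k):
    """Share of rows whose true label is among the k most probable classes."""
    hits = 0
    for label, row in zip(y_true, probs):
        # Indices of the k largest probabilities in this row
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in {classes[i] for i in top}:
            hits += 1
    return hits / len(y_true)


# Columns follow the order of `classes`, as predict_proba follows clf.classes_.
classes = ["GERD", "Pneumonia", "URTI"]
probs = [
    [0.1, 0.6, 0.3],  # true label Pneumonia: ranked 1st
    [0.5, 0.2, 0.3],  # true label URTI: ranked 2nd
    [0.2, 0.7, 0.1],  # true label URTI: ranked 3rd
    [0.7, 0.2, 0.1],  # true label GERD: ranked 1st
]
y_true = ["Pneumonia", "URTI", "URTI", "GERD"]

top1 = top_k_accuracy(y_true, probs, classes, k=1)
top2 = top_k_accuracy(y_true, probs, classes, k=2)
top3 = top_k_accuracy(y_true, probs, classes, k=3)
print(top1, top2, top3)
```

Because each larger k only adds candidates, the score can never decrease as k grows, and it reaches 1.0 once k equals the number of classes.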

In [None]:
from sklearn.metrics import classification_report
import pandas as pd

report_cmb = classification_report(
    y_valid, pred_cmb, output_dict=True, zero_division=0
)

report_bmb = classification_report(
    y_valid, pred_bmb, output_dict=True, zero_division=0
)

df_cmb = pd.DataFrame(report_cmb).T
df_bmb = pd.DataFrame(report_bmb).T

df_cmb.head()

In [None]:
df_cmb.info()
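
To put the two reports side by side, the dictionaries returned by `classification_report(..., output_dict=True)` can be merged on their labels. The sketch below uses toy report dictionaries whose numbers are made up for illustration and are not results from this notebook. The function name `compare_reports` is hypothetical.

```python
def compare_reports(report_a, report_b, name_a="CMB", name_b="BMB", metric="f1-score"):
    """Return one row per label with both models' scores and their difference."""
    rows = []
    for label, stats in report_a.items():
        # 'accuracy' is a bare float in the report dict, not a per-label entry
        if not isinstance(stats, dict) or label not in report_b:
            continue
        a, b = stats[metric], report_b[label][metric]
        rows.append({"label": label, name_a: a, name_b: b, "diff": a - b})
    return rows


# Toy reports in the shape of classification_report(..., output_dict=True).
toy_report_a = {
    "GERD": {"f1-score": 0.80, "support": 10},
    "URTI": {"f1-score": 0.60, "support": 30},
    "accuracy": 0.70,
    "macro avg": {"f1-score": 0.70, "support": 40},
}
toy_report_b = {
    "GERD": {"f1-score": 0.75, "support": 10},
    "URTI": {"f1-score": 0.65, "support": 30},
    "accuracy": 0.70,
    "macro avg": {"f1-score": 0.70, "support": 40},
}

rows = compare_reports(toy_report_a, toy_report_b)
for row in rows:
    print(f"{row['label']:<10} {row['CMB']:.2f} {row['BMB']:.2f} {row['diff']:+.2f}")
```

In the notebook, `report_cmb` and `report_bmb` would take the place of the toy dictionaries, and the rows could be passed to `pd.DataFrame` for display or sorted by `diff` to find the pathologies where the models disagree most.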

In [None]:
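def predict_topk(text, tokenizer, model, clf, k=5):
    """Embed one free-text case and return its k most probable (pathology, probability) pairs."""
    vec = embed([text], tokenizer, model)[0].reshape(1,-1)
    probs = clf.predict_proba(vec)[0]
    # argsort is ascending, so take the last k indices and reverse them
    topk_idx = probs.argsort()[-k:][::-1]
    return [(clf.classes_[i], probs[i]) for i in topk_idx]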

In [None]:
case1 = "fever, productive cough, chest pain, shortness of breath"
print("CMB:", predict_topk(case1, cmb_tok, cmb_model, clf_cmb))
print("BMB:", predict_topk(case1, bmb_tok, bmb_model, clf_bmb))

In [None]:
case2 = "severe abdominal pain in right lower quadrant, nausea, vomiting, mild fever"
print("CMB:", predict_topk(case2, cmb_tok, cmb_model, clf_cmb))
print("BMB:", predict_topk(case2, bmb_tok, bmb_model, clf_bmb))

In [None]:
import shutil
# Remove everything under /kaggle/working (the notebook's output directory)
shutil.rmtree("/kaggle/working", ignore_errors=True)