# Task 3: Pre-trained transformers

### Aim
In this homework, the aim of the Argilla's dataset is the following:

- Predicting the medical field from which a medical note comes from.

To solve this classification problem, we will use three types of algorithm:

1) Machine learning algorithm: Support vector machine and Logistic regression
2) Transformer-encoder: Medical Bert model + classification head
3) Transformer-decoder: Qwen 2.5

### Accuracy:

Accuracy is tested, as instructed, using:
- Macro-f1 metric
- Human visual inspection

## 1) Machine learning

In [4]:
import numpy as np
import sklearn
import matplotlib
import transformers
import pandas as pd
import tqdm
import torch
import spacy
import nltk
import evaluate

### 1.1 Dataset import

We re-use code from task 1 to import our argilla dataset, where we will only keep the text and the labels.

In [5]:
import sys, site
print(sys.executable)
print("USER_SITE:", site.getusersitepackages())
print("sys.path[0:5]:", sys.path[:5])

/Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/bin/python
USER_SITE: /Users/choekyelnyungmartsang/.local/lib/python3.11/site-packages
sys.path[0:5]: ['/Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/lib/python311.zip', '/Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/lib/python3.11', '/Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/lib/python3.11/lib-dynload', '', '/Users/choekyelnyungmartsang/.local/lib/python3.11/site-packages']


In [6]:

pd.set_option('display.max_colwidth', 200)

df = pd.read_parquet("hf://datasets/argilla/medical-domain/data/train-00000-of-00001-67e4e7207342a623.parquet")

def extract_label(pred):
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

df['label'] = df['prediction'].apply(extract_label)
df['text_length'] = df['metrics'].apply(lambda x: x.get('text_length') if isinstance(x, dict) else None)

# drop empty columns
df = df.drop(columns=['inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'metadata', 'status', 'event_timestamp', 'metrics'], errors='ignore')

#print(df.head)
df.head()

Unnamed: 0,text,id,label,text_length
0,"PREOPERATIVE DIAGNOSIS:, Iron deficiency anemia.,POSTOPERATIVE DIAGNOSIS:, Diverticulosis.,PROCEDURE:, Colonoscopy.,MEDICATIONS: , MAC.,PROCEDURE: , The Olympus pediatric variable colonoscope w...",00001265-03e2-47b2-b6cf-bed32dad2fa9,Gastroenterology,1085
1,"CLINICAL INDICATION: ,Normal stress test.,PROCEDURES PERFORMED:,1. Left heart cath.,2. Selective coronary angiography.,3. LV gram.,4. Right femoral arteriogram.,5. Mynx closure device.,PROCE...",0007edf0-1413-4b16-8212-3a13c2ab4e43,Surgery,1798
2,"FINDINGS:,Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.,Preliminary scout film demonstrates anterior end plate spondylosis at T1...",00097d1e-1357-4447-a39a-fe8f8b7c36ae,Radiology,1141
3,"PREOPERATIVE DIAGNOSIS: , Blood loss anemia.,POSTOPERATIVE DIAGNOSES:,1. Diverticulosis coli.,2. Internal hemorrhoids.,3. Poor prep.,PROCEDURE PERFORMED:, Colonoscopy with photos.,ANESTHESIA: ...",001622b6-0182-4fee-9881-ae15e81ce836,Surgery,1767
4,"REASON FOR VISIT: ,Elevated PSA with nocturia and occasional daytime frequency.,HISTORY: , A 68-year-old male with a history of frequency and some outlet obstructive issues along with irritative ...",0029245f-8b45-4796-ba09-7760612289c6,SOAP / Chart / Progress Notes,1519


#### 1.2 Data set formatting

The aim of this section is to:
- Split the data set in training and testing set.
- Transforming a specific word "x" in the product of :
- _Its frequency in a specific document d (TF)_
- _Total number of documents/ number of times "x" appears per document_ (IDF)

In [7]:
###################################
#0. Split data set into train/test
#################################

# This code is inspired from : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
X=df["text"]
y=df["label"]

# We create two data sets by splitting:
# - 80% of data set in training set
# - 20% in test set
# - Applying stratify=y to make sure that categories are roughly equal with train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y) # I split the text : 80% training, 20% test

############################
# 1. TF-IDF
############################

# Using sklearn TfidfVectorizer, we can directly pre-processed our text:
# - everything in lowercase
# - tokenize words
# - every feature of same length

# We finally return for each document d and word k: k frequency in document d * inverse frequency of k in all documents (TF-IDF).

## This code is adapted from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents="unicode", # I want to strip all accents
                             lowercase=True,  # I want everything lowercase
                             stop_words="english", # We delete English-stop words
                             min_df=5,  # I want words to be at least in 5 documents
                             max_df=0.8, # very frequent words are not useful to distinguish between documents
                             ngram_range=(1,2) # include also pairs of words
                             )


X_train_TDIFD = vectorizer.fit_transform(X_train)
X_test_TDIFD=vectorizer.transform(X_test) # I transform X_test according to our TfidVectorizer

### 1.3 Linear SVM

In [8]:
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

SVM=LinearSVC(random_state=0, tol=1e-5,class_weight="balanced")
SVM.fit(X_train_TDIFD,y_train)

SVM.score(X_test_TDIFD,y_test) # Accuracy

f1_score_macro_SVM=f1_score(y_test, SVM.predict(X_test_TDIFD), average='macro') # Macro F_1 score -->"harmonic mean of the precision and recall" https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
print("F1 score macro SVM: ",f1_score_macro_SVM)

F1 score macro SVM:  0.21555928764956586


In [9]:
for i in range(3):
    print("Predicted class:",SVM.predict(X_test_TDIFD[i]))
    print("Text: ", X_test.iloc[i][0:500]) ### 500 hundred first characters

Predicted class: [' Cardiovascular / Pulmonary']
Text:  PREOPERATIVE DIAGNOSIS: , Coronary artery disease.,POSTOPERATIVE DIAGNOSIS: , Coronary artery disease plus intimal calcification in the mid abdominal aorta without significant stenosis.,DESCRIPTION OF PROCEDURE:,LEFT HEART CATHETERIZATION WITH ANGIOGRAPHY AND MID ABDOMINAL AORTOGRAPHY:,Under local anesthesia with 2% lidocaine with premedication, a right groin preparation was done.  Using the percutaneous Seldinger technique via the right femoral artery, a left heart catheterization was performed
Predicted class: [' Nephrology']
Text:  EXAM: , CT of abdomen with and without contrast.  CT-guided needle placement biopsy.,HISTORY: , Left renal mass.,TECHNIQUE: , Pre and postcontrast enhanced images were acquired through the kidneys.,FINDINGS: , Comparison made to the prior MRI.  There is re-demonstration of multiple bilateral cystic renal lesions.  Several of these demonstrate high attenuation in the precontrast phase of the exam sugg

### 1.4 Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

lr = LogisticRegression(
    random_state=0,
    tol=1e-5,
    class_weight="balanced",
    solver="saga",
    max_iter=3000,
    n_jobs=-1
)

param_grid = {"C": [0.1, 1.0, 10.0]} 

grid = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    scoring="f1_macro",
    cv=3, 
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train_TDIFD, y_train)

LR = grid.best_estimator_
print("Best C:", grid.best_params_["C"])

y_pred = LR.predict(X_test_TDIFD)
print("F1 score macro LR (tuned):", f1_score(y_test, y_pred, average="macro"))
print("Accuracy (tuned):", LR.score(X_test_TDIFD, y_test))

In [17]:
for i in range(3):
    print("Predicted class:",LR.predict(X_test_TDIFD[i]))
    print("Text: ", X_test.iloc[i][0:500]) ### 500 hundred first characters

Predicted class: [' Cardiovascular / Pulmonary']
Text:  PREOPERATIVE DIAGNOSIS: , Coronary artery disease.,POSTOPERATIVE DIAGNOSIS: , Coronary artery disease plus intimal calcification in the mid abdominal aorta without significant stenosis.,DESCRIPTION OF PROCEDURE:,LEFT HEART CATHETERIZATION WITH ANGIOGRAPHY AND MID ABDOMINAL AORTOGRAPHY:,Under local anesthesia with 2% lidocaine with premedication, a right groin preparation was done.  Using the percutaneous Seldinger technique via the right femoral artery, a left heart catheterization was performed
Predicted class: [' Nephrology']
Text:  EXAM: , CT of abdomen with and without contrast.  CT-guided needle placement biopsy.,HISTORY: , Left renal mass.,TECHNIQUE: , Pre and postcontrast enhanced images were acquired through the kidneys.,FINDINGS: , Comparison made to the prior MRI.  There is re-demonstration of multiple bilateral cystic renal lesions.  Several of these demonstrate high attenuation in the precontrast phase of the exam sugg

##### _N.B: XGBoost_

Considering the high dimensionality of our data , XGboost takes too much time to run and SVM or LR are already strong baseline ML algorithm to compare our transformers to.

#### Conclusion

We select the logistic regression as our machine learning classifier, with an F1 macro score of *0.398*

The "human eye" test is convincing, with the algorithm giving meaningful labels to medical notes.

## 2. Encoder task

#### Model specification

We decided to use MedBERT Model. This is an encoder transformer, pre-trained for NER.

We will add a classification head to it (using AutoModelForSequenceClassification) in order to use it for prediction. This head has  untrained weights.

This model was used because of its pre-training on medical data, notably on  "a collection of clinical notes released in N2C2 2018 and N2C2 2022 challenges", which are highly relevant for our task.

#### 2.1 Model import

In [18]:
import sys, torch
print("python:", sys.executable)
print("torch version:", torch.__version__)
print("torch file:", torch.__file__)

python: /Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/bin/python
torch version: 2.7.1
torch file: /Users/choekyelnyungmartsang/opt/anaconda3/envs/hello/lib/python3.11/site-packages/torch/__init__.py


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed

import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"  # disables HF progress bars (fixes progress updater crashes)

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")

model = AutoModelForSequenceClassification.from_pretrained("Charangan/MedBERT",num_labels=40, resume_download=True)

# This code is adapted from https://huggingface.co/transformers/v4.2.2/training.html?

# I am freezing the encoder, but allowing to update weights of the classification head
for param in model.base_model.parameters():
    param.requires_grad = False

#### dataset formatting

In [None]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)

dataset = dataset.select_columns(['text', 'label'])

# Labels are string, I need to change them as numbers.
labels=dataset.unique("label")

# I create a dictionnary that take label as key and return a value.


label2id = {
    "Gastroenterology": 0,
    "Surgery": 1,
    "Radiology": 2,
    "SOAP / Chart / Progress Notes": 3,
    "Letters": 4,
    "Lab Medicine - Pathology": 5,
    "Consult - History and Phy.": 6,
    "Podiatry": 7,
    "General Medicine": 8,
    "Psychiatry / Psychology": 9,
    "Cardiovascular / Pulmonary": 10,
    "Urology": 11,
    "Ophthalmology": 12,
    "Physical Medicine - Rehab": 13,
    "Neurology": 14,
    "Autopsy": 15,
    "Orthopedic": 16,
    "Hematology - Oncology": 17,
    "Allergy / Immunology": 18,
    "Pediatrics - Neonatal": 19,
    "Dentistry": 20,
    "Neurosurgery": 21,
    "Pain Management": 22,
    "Nephrology": 23,
    "Emergency Room Reports": 24,
    "Obstetrics / Gynecology": 25,
    "Speech - Language": 26,
    "Diets and Nutritions": 27,
    "Endocrinology": 28,
    "IME-QME-Work Comp etc.": 29,
    "Cosmetic / Plastic Surgery": 30,
    "Discharge Summary": 31,
    "ENT - Otolaryngology": 32,
    "Chiropractic": 33,
    "Office Notes": 34,
    "Dermatology": 35,
    "Sleep Medicine": 36,
    "Rheumatology": 37,
    "Hospice - Palliative Care": 38,
    "Bariatrics": 39,
}

# function for matching key to values
# Map will gives me one row of my dataset, into a dictionnary form.
# So i want to :
# 1) extract label value from dictionnary
# 2) replace it using my dictionnary with a numerical value
def matching(example):
    label=example["label"].strip() # labels have a whitespace as first character, that i strip
    example["label"]=label2id[label]
    return example

dataset=dataset.map(matching)


Map:   0%|          | 0/4966 [00:00<?, ? examples/s]

In [None]:
from datasets import load_dataset
from datasets import ClassLabel
from datasets import DatasetDict


# We firstly rename our column  from label to labels to match trainer wrappers
dataset = dataset.rename_column("label", "labels")
#Â dataset = dataset.cast_column("labels", ClassLabel(num_classes=40))
id2label = {v: k for k, v in label2id.items()}
dataset = dataset.cast_column("labels", ClassLabel(names=[id2label[i] for i in range(40)]))

## train/test=80/20
splits = dataset.train_test_split(
    test_size=0.2,
    stratify_by_column="labels", # We stratify once again to have the same
    seed=42                      # rate of classes between to data frames
)

# split train into train/validation
train_val = splits["train"].train_test_split(
    test_size=0.125,   # 12.5% of 80% = 10% overall
    stratify_by_column="labels",
    seed=42
)

# single DatasetDict with 3 splits 70/10/20=train/val/test
final_df = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": splits["test"]
})

### Now, we need to tokenize our data set. Adapted from: https://huggingface.co/docs/datasets/use_dataset

def tokenization(example):
    return tokenizer(example["text"], truncation=True, max_length=512) # i will truncate every exmaple that are longer than 512  token. This is
                                                                       # the max input size of our model

final_df_tokenized = final_df.map(tokenization, batched=True)


final_df_tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

Casting the dataset:   0%|          | 0/4966 [00:00<?, ? examples/s]

Map:   0%|          | 0/3475 [00:00<?, ? examples/s]

Map:   0%|          | 0/497 [00:00<?, ? examples/s]

Map:   0%|          | 0/994 [00:00<?, ? examples/s]

In [None]:
print("Train, validation and test df dimensions:")
print(final_df)

Train, validation and test df dimensions:
DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 3475
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 497
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 994
    })
})


#### Define testing metrics (accuracy, f1 macro)

In [None]:
# This code is adapted from :
# https://discuss.huggingface.co/t/how-to-get-the-predicted-labels-per-epoch-or-step-for-the-huggingface-transformers-trainer/12078/2?
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

#### training arguments

In [None]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?
from transformers import TrainingArguments
from transformers import set_seed

set_seed(42)

training_args = TrainingArguments(
    output_dir='.',                  # output directory
    num_train_epochs=3,              # total # of training epochs --> small, as we only train the head
    per_device_train_batch_size=8,   # batch size per device during training --> small, as i run that on limited GPU time (Google collab set-up)
    per_device_eval_batch_size=16,   # batch size for evaluation --> small, as i run that on limited GPU time (Google collab set-up)
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    push_to_hub=False,
    data_seed=42,
    seed=42,
)


#### training loop

In [None]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                         # the instantiated ðŸ¤— Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["validation"],          # evaluation dataset
    data_collator=data_collator, # allows dynamical padding --> every batch will have the same lenghts, which is max_length of this batch
    compute_metrics=compute_metrics # added to return f1
)

trainer.train()

  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice:

 3


[34m[1mwandb[0m: You chose "Don't visualize my results"


Step,Training Loss
500,3.1339
1000,2.8504


TrainOutput(global_step=1305, training_loss=2.963610699533046, metrics={'train_runtime': 347.0344, 'train_samples_per_second': 30.04, 'train_steps_per_second': 3.76, 'total_flos': 2743868604211200.0, 'train_loss': 2.963610699533046, 'epoch': 3.0})

#### Evaluate accuracy

In [None]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 2.837169885635376,
 'eval_accuracy': 0.2625754527162978,
 'eval_f1': 0.01979835050585184,
 'eval_precision': 0.03741178912411789,
 'eval_recall': 0.03678243082143638,
 'eval_runtime': 29.0018,
 'eval_samples_per_second': 34.274,
 'eval_steps_per_second': 2.172,
 'epoch': 3.0}

### Human check

In [None]:
output=trainer.predict(final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
id2label = {v: k for k, v in label2id.items()}

In [None]:
import random
import numpy as np
import pandas as pd

def human_check_from_output(
    output,
    raw_test,
    id2label,
    n=20,
    seed=42,
    text_chars=500
):
    random.seed(seed)

    pred_ids = np.argmax(output.predictions, axis=1)

    idxs = random.sample(range(len(raw_test)), k=min(n, len(raw_test)))

    rows = []
    for i in idxs:
        gold_id = raw_test[i]["labels"]
        pred_id = int(pred_ids[i])

        rows.append({
            "idx": i,
            "correct": gold_id == pred_id,
            "gold_label": id2label[gold_id],
            "pred_label": id2label[pred_id],
            "text": raw_test[i]["text"][:text_chars].replace("\n", " ")
        })

    df = pd.DataFrame(rows).sort_values("correct")
    print(f"Human check | n={len(df)} | sample accuracy = {df['correct'].mean():.2%}")
    return df


In [None]:
hc_encoder_df = human_check_from_output(
    output=output,
    raw_test=final_df["test"],
    id2label=id2label,
    n=25,
    seed=42
)

hc_encoder_df


In [None]:
hc_encoder_df[hc_encoder_df["correct"] == False].head(20)

In [None]:
#Â immediately tells if the model collapses to a few classes
import numpy as np
from collections import Counter

preds = output.predictions.argmax(axis=1)
print(Counter(preds).most_common(10))

In [None]:
for i in range(10):
  label=output.predictions[i].argmax()
  if label==1:
      print("Predicted class: surgery")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==6:
      print("Predicted class: Consult - History and Phy.")
      print ("Text: ",final_df["test"][i]["text"][0:500])


Predicted class: surgery
Text:  PREOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebrae and T9 vertebrae.,POSTOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebra and T9 vertebra.,PROCEDURE PERFORMED:,1.  Fracture reduction with insertion of prosthetic device at T8 with kyphoplasty.,2.  Vertebroplasties at T7 and T9 with insertion of prosthetic device.,ANESTHESIA: , Local with sedation.,SPECIMEN: , Bone from the T8 vertebra.,COMPLICATIONS:,  None.,SURGICAL INDICATION
Predicted class: surgery
Text:  ADMISSION DIAGNOSES:,1.  Pneumonia, failed outpatient treatment.,2.  Hypoxia.,3.  Rheumatoid arthritis.,DISCHARGE DIAGNOSES:,1.  Atypical pneumonia, suspected viral.,2.  Hypoxia.,3.  Rheumatoid arthritis.,4.  Suspected mild stress-induced adrenal insufficiency.,HOSPITAL COURSE: , This very independent 79-year old had struggled with cough, fevers, weakness, and chills for the week prior to admission.  She was seen on multiple occasi

### Conclusion:

Without fine-tunning, our encoder transformer has a f1 macro score of 1.98%. As we can see in the "human check" part, it is predicting surgery for 80% of the texts, as "surgery" is the predominant label in argilla dataset

## Fine-tuning

We will fine-tune our model based on: https://huggingface.co/learn/llm-course/en/chapter3/3.

This fine-tuning comes down to allow weight updates from the BERT-medical encoder for our classification task. Therefore, in this setting, we allow the encoder and the head weights to be changed and tuned to our task.

### Model importation

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, set_seed

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "Charangan/MedBERT",
    num_labels=40,
    id2label=id2label,
    label2id=label2id
)

# This code is adapted from https://huggingface.co/transformers/v4.2.2/training.html?

# I am defreezing entire encoder
for param in model.base_model.parameters():
    param.requires_grad = True # I allow the weights to be updated

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Charangan/MedBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training arguments

In [None]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import TrainingArguments



training_args = TrainingArguments(
    output_dir='.',          # output directory
    num_train_epochs=3,              # total # of training epochs --> small, as we only train the head
    per_device_train_batch_size=8,  # batch size per device during training --> small, as i run that on CPU only architecture
    per_device_eval_batch_size=16,   # batch size for evaluation --> small, as i run that on CPU only architecture
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    fp16=False, # Enable mixed precision. Set to False if you run on CPU only architecture
    data_seed=42,
    seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    logging_steps=50,
    report_to="none",   # avoids wandb prompt
    learning_rate=2e-5
)


#### Training loops

In [None]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                         # the instantiated ðŸ¤— Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["validation"],          # evaluation dataset
    data_collator=data_collator, # allows dynamical padding --> every batch will have the same lenghts, which is max_length of this batch
    compute_metrics=compute_metrics # added to return f1
)

trainer.train()

Step,Training Loss
500,2.4433
1000,1.722


TrainOutput(global_step=1305, training_loss=1.9513233404049928, metrics={'train_runtime': 421.964, 'train_samples_per_second': 24.706, 'train_steps_per_second': 3.093, 'total_flos': 2743868604211200.0, 'train_loss': 1.9513233404049928, 'epoch': 3.0})

### Evaluate accuracy

In [None]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 1.7033979892730713,
 'eval_accuracy': 0.2857142857142857,
 'eval_f1': 0.15695630680273398,
 'eval_precision': 0.16528717975998833,
 'eval_recall': 0.15822482246652267,
 'eval_runtime': 9.2172,
 'eval_samples_per_second': 107.842,
 'eval_steps_per_second': 6.835,
 'epoch': 3.0}

#### Human check

In [43]:
output=trainer.predict(final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [None]:
id2label = {v: k for k, v in label2id.items()}

In [None]:
hc_encoder_fine_tuned_df = human_check_from_output(
    output=output,
    raw_test=final_df["test"],
    id2label=id2label,
    n=25,
    seed=42
)

hc_encoder_fine_tuned_df

In [None]:
hc_encoder_fine_tuned_df[hc_encoder_fine_tuned_df["correct"] == False].head(20)

In [None]:
for i in range(10):
  label=output.predictions[i].argmax()
  if label==2:
      print("Predicted class: Radiology")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==6:
      print("Predicted class: Consult - History and Phy.")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==8:
      print("Predicted class: General Medicine")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==10:
      print("Predicted class: Cardiovascular/ Pulmonary")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==12:
      print("Predicted class: Ophthalmology")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==16:
      print("Predicted class: Orthopedic")
      print ("Text: ",final_df["test"][i]["text"][0:500])
  if label==31:
      print("Predicted class: Discharge Summary")
      print ("Text: ",final_df["test"][i]["text"][0:500])

Predicted class: Orthopedic
Text:  PREOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebrae and T9 vertebrae.,POSTOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebra and T9 vertebra.,PROCEDURE PERFORMED:,1.  Fracture reduction with insertion of prosthetic device at T8 with kyphoplasty.,2.  Vertebroplasties at T7 and T9 with insertion of prosthetic device.,ANESTHESIA: , Local with sedation.,SPECIMEN: , Bone from the T8 vertebra.,COMPLICATIONS:,  None.,SURGICAL INDICATION
Predicted class: Discharge Summary
Text:  ADMISSION DIAGNOSES:,1.  Pneumonia, failed outpatient treatment.,2.  Hypoxia.,3.  Rheumatoid arthritis.,DISCHARGE DIAGNOSES:,1.  Atypical pneumonia, suspected viral.,2.  Hypoxia.,3.  Rheumatoid arthritis.,4.  Suspected mild stress-induced adrenal insufficiency.,HOSPITAL COURSE: , This very independent 79-year old had struggled with cough, fevers, weakness, and chills for the week prior to admission.  She was seen on mu

## Conclusion

Fine-tunning helped tremendously our model, taking its f1-macro score from 1.98% to 15.7%. Moreover, we can see in the "human check" section that know, the model is not always guessing the most common label.

# 3. Decoder track
Decoder-only large language models (LLMs) can be applied to classification tasks by reformulating classification as text generation. The model is prompted to generate the correct label given an input medical note.

We evaluate decoder models under these three settings:

Zero-shot prompting

Few-shot (5-shot) prompting

Parameter-Efficient Fine-Tuning (PEFT) using LoRA

All models are evaluated using macro-F1.

We use the decoder-only model **Qwen/Qwen2.5-0.5B-Instruct**, a compact instruction-tuned LLM. This model was selected as the smallest decoder model that could be reliably loaded and executed within our computational constraints, while still supporting instruction-following and text generation.

## 3.1 Model importation

In [156]:
from transformers import AutoTokenizer, AutoModelForCausalLM

set_seed(42)

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

def format_qwen_chat(user_content: str) -> str:
    messages = [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": user_content},
    ]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [157]:
print(label2id)

{'Gastroenterology': 0, 'Surgery': 1, 'Radiology': 2, 'SOAP / Chart / Progress Notes': 3, 'Letters': 4, 'Lab Medicine - Pathology': 5, 'Consult - History and Phy.': 6, 'Podiatry': 7, 'General Medicine': 8, 'Psychiatry / Psychology': 9, 'Cardiovascular / Pulmonary': 10, 'Urology': 11, 'Ophthalmology': 12, 'Physical Medicine - Rehab': 13, 'Neurology': 14, 'Autopsy': 15, 'Orthopedic': 16, 'Hematology - Oncology': 17, 'Allergy / Immunology': 18, 'Pediatrics - Neonatal': 19, 'Dentistry': 20, 'Neurosurgery': 21, 'Pain Management': 22, 'Nephrology': 23, 'Emergency Room Reports': 24, 'Obstetrics / Gynecology': 25, 'Speech - Language': 26, 'Diets and Nutritions': 27, 'Endocrinology': 28, 'IME-QME-Work Comp etc.': 29, 'Cosmetic / Plastic Surgery': 30, 'Discharge Summary': 31, 'ENT - Otolaryngology': 32, 'Chiropractic': 33, 'Office Notes': 34, 'Dermatology': 35, 'Sleep Medicine': 36, 'Rheumatology': 37, 'Hospice - Palliative Care': 38, 'Bariatrics': 39}


In [158]:
example_text = final_df["test"][0]["text"]
print(example_text)

PREOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebrae and T9 vertebrae.,POSTOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebra and T9 vertebra.,PROCEDURE PERFORMED:,1.  Fracture reduction with insertion of prosthetic device at T8 with kyphoplasty.,2.  Vertebroplasties at T7 and T9 with insertion of prosthetic device.,ANESTHESIA: , Local with sedation.,SPECIMEN: , Bone from the T8 vertebra.,COMPLICATIONS:,  None.,SURGICAL INDICATIONS:,  The patient is an 80-year-old female who had previous history of compression fractures.  She had recently undergone an additional compression fracture of the T8 vertebrae.  She was in extreme pain.  This pain interfered with activities of daily living and was unimproved with conservative treatment modalities.  She is understanding the risks, benefits, and potential complications as well as all treatment alternatives.  The patient provided informed consent.,OPERATIVE TECHNIQUE: , The patient

## 3.2 Zero shot prompting

We first evaluate the decoder model in a zero-shot prompting setting. In this setup, the model receives only a natural language instruction and the input medical note, without any examples.

We define a fixed prompt template that:
- instructs the model to act as a medical coding assistant,
- provides the complete list of possible labels,
- requires the model to output exactly one label from the list.

In [190]:
import re
import torch

set_seed(42)

LABELS = list(label2id.keys())  # or however you store your 40 labels
#LABELS_STR = "\n".join(f"- {lab}" for lab in LABELS)
LABELS_STR = "\n".join(LABELS)

def make_zeroshot_prompt(text: str) -> str:
    user_content = (
        "Choose EXACTLY ONE label from the list below that best matches the medical note.\n\n"
        f"Labels:\n{LABELS_STR}\n\n"
        f"Medical note:\n{text}\n\n"
        "Answer with exactly one label from the list and nothing else.\n"
        "Label:"
    )
    return format_qwen_chat(user_content)

In [191]:
print(LABELS_STR)

Gastroenterology
Surgery
Radiology
SOAP / Chart / Progress Notes
Letters
Lab Medicine - Pathology
Consult - History and Phy.
Podiatry
General Medicine
Psychiatry / Psychology
Cardiovascular / Pulmonary
Urology
Ophthalmology
Physical Medicine - Rehab
Neurology
Autopsy
Orthopedic
Hematology - Oncology
Allergy / Immunology
Pediatrics - Neonatal
Dentistry
Neurosurgery
Pain Management
Nephrology
Emergency Room Reports
Obstetrics / Gynecology
Speech - Language
Diets and Nutritions
Endocrinology
IME-QME-Work Comp etc.
Cosmetic / Plastic Surgery
Discharge Summary
ENT - Otolaryngology
Chiropractic
Office Notes
Dermatology
Sleep Medicine
Rheumatology
Hospice - Palliative Care
Bariatrics


Next we define a function that performs zero-shot label prediction by generating a response from the decoder model and extracting the predicted label from the generated text.

In [237]:
def predict_label_zeroshot(text: str, max_new_tokens: int = 8) -> str:
    prompt = make_zeroshot_prompt(text)

    inputs = tok(prompt, return_tensors="pt", truncation=True).to(model.device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,      # deterministic
            pad_token_id=tok.eos_token_id,
            eos_token_id=tok.eos_token_id,
        )

    decoded = tok.decode(out[0], skip_special_tokens=True)

    # take only what comes after "Label:"
    #pred_text = decoded.split("Label:")[-1].strip()
    gen_ids = out[0][input_len:]
    raw = tok.decode(gen_ids, skip_special_tokens=True).strip()

    return raw

Because decoder models may generate additional text or formatting artifacts, we apply a normalization step that maps the generated output to one of the predefined labels. This includes removing punctuation and handling minor formatting variations. If the generated output does not match with any valid label, it is mapped to an `"UNKNOWN"` category.

In [241]:
def normalize_to_label(pred_text: str, labels=LABELS) -> str:
    if pred_text is None:
        return "UNKNOWN"

    pred_text = pred_text.strip()

    # remove common bullet formatting
    pred_text = re.sub(r"^[\-\*\u2022]\s*", "", pred_text)

    # keep only first line (prevents matching label list / explanations)
    first_line = pred_text.splitlines()[0].strip()

    # clean punctuation
    cleaned = re.sub(r"[^A-Za-z0-9_\-/ ]+", "", first_line).strip()

    # exact match (case-insensitive)
    for lab in labels:
        if cleaned.lower() == lab.lower():
            return lab

    # startswith match (handles "Surgery because ...")
    for lab in labels:
        if cleaned.lower().startswith(lab.lower()):
            return lab
    
    for lab in labels:
        if lab.lower().startswith(cleaned.lower()) and len(cleaned) >= 4:
            return lab

    return "UNKNOWN"

Example of raw and normalized decoder output

In [243]:
raw = predict_label_zeroshot(example_text)
pred_label = normalize_to_label(raw)
print("RAW:", raw)
print("PRED:", pred_label)

RAW: Orthopedic. piernt
PRED: Orthopedic


In [165]:
final_df["test"]

Dataset({
    features: ['text', 'labels'],
    num_rows: 994
})

Next we apply the zero-shot prompting procedure to the entire test set, generating one predicted label per medical note and collecting gold and predicted labels for evaluation.


In [66]:
y_true, y_pred = [], []

for sample in final_df["test"]:
    text = sample["text"]
    gold = sample["labels"]

    raw = predict_label_zeroshot(text[:3000])
    pred = normalize_to_label(raw)

    y_true.append(gold)
    y_pred.append(pred)

In [67]:
print(y_true)

[16, 10, 16, 34, 8, 17, 1, 10, 6, 0, 1, 1, 30, 0, 16, 1, 1, 2, 1, 16, 25, 6, 10, 11, 8, 34, 21, 34, 17, 9, 21, 10, 1, 21, 0, 19, 11, 25, 1, 11, 8, 6, 1, 17, 3, 1, 2, 1, 32, 14, 1, 2, 14, 32, 10, 34, 0, 6, 6, 10, 16, 1, 6, 24, 31, 10, 1, 7, 1, 6, 1, 6, 16, 13, 6, 22, 1, 3, 2, 16, 1, 21, 8, 14, 14, 1, 1, 1, 16, 6, 10, 0, 16, 6, 0, 22, 0, 1, 6, 10, 31, 1, 3, 10, 16, 10, 1, 2, 6, 10, 6, 3, 16, 22, 37, 1, 32, 16, 23, 2, 6, 1, 1, 17, 14, 1, 2, 3, 16, 2, 15, 28, 20, 31, 1, 16, 8, 3, 16, 25, 2, 14, 6, 32, 32, 32, 6, 1, 9, 16, 16, 7, 10, 22, 0, 1, 1, 3, 2, 6, 11, 1, 29, 1, 23, 10, 1, 22, 9, 0, 0, 10, 32, 22, 3, 21, 10, 23, 25, 11, 32, 1, 11, 21, 12, 8, 2, 31, 39, 1, 23, 2, 1, 1, 8, 24, 10, 10, 39, 16, 31, 6, 1, 1, 6, 6, 2, 1, 30, 24, 0, 16, 2, 12, 11, 17, 19, 1, 1, 12, 10, 16, 1, 16, 3, 1, 16, 6, 14, 1, 1, 9, 6, 1, 1, 12, 16, 1, 31, 27, 11, 25, 0, 14, 32, 1, 10, 1, 10, 1, 14, 10, 8, 1, 1, 2, 3, 10, 2, 0, 35, 1, 11, 6, 1, 31, 6, 20, 17, 1, 14, 2, 8, 1, 1, 1, 20, 10, 31, 14, 10, 8, 16, 31, 8, 21,

In [68]:
id2label = {v: k for k, v in label2id.items()}

for i in range(len(y_true)):
    y_true[i] = id2label[y_true[i]]

In [69]:
y_true

['Orthopedic',
 'Cardiovascular / Pulmonary',
 'Orthopedic',
 'Office Notes',
 'General Medicine',
 'Hematology - Oncology',
 'Surgery',
 'Cardiovascular / Pulmonary',
 'Consult - History and Phy.',
 'Gastroenterology',
 'Surgery',
 'Surgery',
 'Cosmetic / Plastic Surgery',
 'Gastroenterology',
 'Orthopedic',
 'Surgery',
 'Surgery',
 'Radiology',
 'Surgery',
 'Orthopedic',
 'Obstetrics / Gynecology',
 'Consult - History and Phy.',
 'Cardiovascular / Pulmonary',
 'Urology',
 'General Medicine',
 'Office Notes',
 'Neurosurgery',
 'Office Notes',
 'Hematology - Oncology',
 'Psychiatry / Psychology',
 'Neurosurgery',
 'Cardiovascular / Pulmonary',
 'Surgery',
 'Neurosurgery',
 'Gastroenterology',
 'Pediatrics - Neonatal',
 'Urology',
 'Obstetrics / Gynecology',
 'Surgery',
 'Urology',
 'General Medicine',
 'Consult - History and Phy.',
 'Surgery',
 'Hematology - Oncology',
 'SOAP / Chart / Progress Notes',
 'Surgery',
 'Radiology',
 'Surgery',
 'ENT - Otolaryngology',
 'Neurology',
 'Surge

In [70]:
y_pred

['Orthopedic',
 'Gastroenterology',
 'Neurology',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Orthopedic',
 'Radiology',
 'UNKNOWN',
 'Speech - Language',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Surgery',
 'Gastroenterology',
 'Radiology',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Gastroenterology',
 'Emergency Room Reports',
 'Orthopedic',
 'Gastroenterology',
 'Gastroenterology',
 'Orthopedic',
 'Gastroenterology',
 'UNKNOWN',
 'Psychiatry / Psychology',
 'Neurology',
 'Cardiovascular / Pulmonary',
 'Surgery',
 'Neurology',
 'Gastroenterology',
 'Pediatrics - Neonatal',
 'Orthopedic',
 'Gastroenterology',
 'UNKNOWN',
 'Orthopedic',
 'Speech - Language',
 'Gastroenterology',
 'Gastroenterology',
 'UNKNOWN',
 'Cardiovascular / Pulmonary',
 'Gastroenterology',
 'Gastroenterology',
 'Orthopedic',
 'Gastroenterology',
 'Neurology',
 'Orthopedic',
 'Radiology',
 'Pediatrics - Neon

As a quick sanity check, we compute the exact-match accuracy over the test set.

In [71]:
number_correct = 0

for i, j in zip(y_true, y_pred):
    if i == j:
        number_correct += 1

percent_correct = number_correct / len(y_true) * 100

print(percent_correct)

14.084507042253522


In [72]:
final_df["train"]

Dataset({
    features: ['text', 'labels'],
    num_rows: 3475
})

We evaluate zero-shot predictions using the same metrics as in previous sections.

In [73]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = y_true
    preds = pred
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [74]:
compute_metrics(y_pred)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


{'accuracy': 0.14084507042253522,
 'f1': 0.07979367958792635,
 'precision': 0.13561862425731094,
 'recall': 0.10806442153805491}

In [75]:
import random
import pandas as pd

def human_check_qwen(
    final_test,           
    y_true_labels,          
    y_pred_labels,          
    predict_raw_fn=None,     
    normalize_fn=None,      
    n=25,
    seed=42,
    text_chars=500,
    show_only_errors=False
):
    random.seed(seed)
    N = len(final_test)
    idxs = random.sample(range(N), k=min(n, N))

    rows = []
    for i in idxs:
        text = final_test[i]["text"]
        gold = y_true_labels[i]
        pred = y_pred_labels[i]
        correct = (gold == pred)

        raw = None
        pred_norm = None
        if predict_raw_fn is not None and normalize_fn is not None:
            raw = predict_raw_fn(text[:3000])
            pred_norm = normalize_fn(raw)

        rows.append({
            "idx": i,
            "correct": correct,
            "gold": gold,
            "pred": pred,
            "raw": raw,
            "pred_norm": pred_norm,
            "text": text[:text_chars].replace("\n", " ")
        })

    df = pd.DataFrame(rows).sort_values(["correct", "idx"], ascending=[True, True])

    if show_only_errors:
        df = df[df["correct"] == False]

    sample_acc = (df["correct"].mean() if len(df) else 0.0)
    print(f"Human check | n={len(df)} | sample accuracy = {sample_acc:.2%}")
    return df


In [76]:
hc_qwen_raw = human_check_qwen(
    final_test=final_df["test"],
    y_true_labels=y_true,
    y_pred_labels=y_pred,
    predict_raw_fn=predict_label_zeroshot,
    normalize_fn=normalize_to_label,
    n=15,
    seed=42
)
hc_qwen_raw


Human check | n=15 | sample accuracy = 20.00%


Unnamed: 0,idx,correct,gold,pred,raw,pred_norm,text
2,25,False,Office Notes,Gastroenterology,Gastroenterology,Gastroenterology,"DIAGNOSES:,1. Juvenile myoclonic epilepsy.,2. Recent generalized tonic-clonic seizure.,MEDICATIONS:,1. Lamictal 250 mg b.i.d.,2. Depo-Provera.,INTERIM HISTORY: , The patient returns for follow..."
14,89,False,Consult - History and Phy.,Gastroenterology,Gastroenterology,Gastroenterology,"REASON FOR CONSULTATION: , Azotemia.,HISTORY OF PRESENT ILLNESS: ,The patient is a 36-year-old gentleman admitted to the hospital because he passed out at home.,Over the past week, he has been no..."
1,114,False,Rheumatology,Gastroenterology,Gastroenterology,Gastroenterology,"HISTORY: , A is a young lady, who came here with a diagnosis of seizure disorder and history of Henoch-Schonlein purpura with persistent proteinuria. A was worked up for collagen vascular disease..."
7,142,False,Consult - History and Phy.,Cardiovascular / Pulmonary,Cardiovascular / Pulmonary,Cardiovascular / Pulmonary,"REASON FOR CONSULT:, Evaluation of alcohol withdrawal and dependance as well as evaluation of anxiety.,HISTORY OF PRESENT ILLNESS: , This is a 50-year-old male who was transferred from Sugar Land..."
5,250,False,Neurology,Gastroenterology,Gastroenterology,Gastroenterology,"REASON FOR VISIT: ,This is an 83-year-old woman referred for diagnostic lumbar puncture for possible malignancy by Dr. X. She is accompanied by her daughter.,HISTORY OF PRESENT ILLNESS:, The pat..."
4,281,False,General Medicine,Gastroenterology,Gastroenterology,Gastroenterology,"CHIEF COMPLAINT: ,The patient does not have any chief complaint.,HISTORY OF PRESENT ILLNESS:, This is a 93-year-old female who called up her next-door neighbor to say that she was not feeling we..."
13,558,False,Cardiovascular / Pulmonary,Chiropractic,Chiropractic,Chiropractic,"PREOPERATIVE DIAGNOSIS: , Right hemothorax.,POSTOPERATIVE DIAGNOSIS: , Right hemothorax.,PROCEDURE PERFORMED: , Insertion of a #32 French chest tube on the right hemithorax.,ANESTHESIA: , 1% Lidoc..."
10,692,False,Neurosurgery,Gastroenterology,Gastroenterology,Gastroenterology,"PREOPERATIVE DIAGNOSIS:, Rule out temporal arteritis.,POSTOPERATIVE DIAGNOSIS: ,Rule out temporal arteritis.,PROCEDURE:, Bilateral temporal artery biopsy.,ANESTHESIA:, Local anesthesia 1% Xylo..."
8,754,False,Surgery,Gastroenterology,Gastroenterology,Gastroenterology,"PREOPERATIVE DIAGNOSIS: , Secondary capsular membrane, right eye.,POSTOPERATIVE DIAGNOSIS: , Secondary capsular membrane, right eye.,PROCEDURE PERFORMED: , YAG laser capsulotomy, right eye.,INDICA..."
11,758,False,SOAP / Chart / Progress Notes,UNKNOWN,Poison ivy,UNKNOWN,"SUBJECTIVE:, He is a 24-year-old male who said that he had gotten into some poison ivy this weekend while he was fishing. He has had several cases of this in the past and he says that is usually..."


In [77]:
hc_qwen_raw[hc_qwen_raw["correct"] == False].head(10)

Unnamed: 0,idx,correct,gold,pred,raw,pred_norm,text
2,25,False,Office Notes,Gastroenterology,Gastroenterology,Gastroenterology,"DIAGNOSES:,1. Juvenile myoclonic epilepsy.,2. Recent generalized tonic-clonic seizure.,MEDICATIONS:,1. Lamictal 250 mg b.i.d.,2. Depo-Provera.,INTERIM HISTORY: , The patient returns for follow..."
14,89,False,Consult - History and Phy.,Gastroenterology,Gastroenterology,Gastroenterology,"REASON FOR CONSULTATION: , Azotemia.,HISTORY OF PRESENT ILLNESS: ,The patient is a 36-year-old gentleman admitted to the hospital because he passed out at home.,Over the past week, he has been no..."
1,114,False,Rheumatology,Gastroenterology,Gastroenterology,Gastroenterology,"HISTORY: , A is a young lady, who came here with a diagnosis of seizure disorder and history of Henoch-Schonlein purpura with persistent proteinuria. A was worked up for collagen vascular disease..."
7,142,False,Consult - History and Phy.,Cardiovascular / Pulmonary,Cardiovascular / Pulmonary,Cardiovascular / Pulmonary,"REASON FOR CONSULT:, Evaluation of alcohol withdrawal and dependance as well as evaluation of anxiety.,HISTORY OF PRESENT ILLNESS: , This is a 50-year-old male who was transferred from Sugar Land..."
5,250,False,Neurology,Gastroenterology,Gastroenterology,Gastroenterology,"REASON FOR VISIT: ,This is an 83-year-old woman referred for diagnostic lumbar puncture for possible malignancy by Dr. X. She is accompanied by her daughter.,HISTORY OF PRESENT ILLNESS:, The pat..."
4,281,False,General Medicine,Gastroenterology,Gastroenterology,Gastroenterology,"CHIEF COMPLAINT: ,The patient does not have any chief complaint.,HISTORY OF PRESENT ILLNESS:, This is a 93-year-old female who called up her next-door neighbor to say that she was not feeling we..."
13,558,False,Cardiovascular / Pulmonary,Chiropractic,Chiropractic,Chiropractic,"PREOPERATIVE DIAGNOSIS: , Right hemothorax.,POSTOPERATIVE DIAGNOSIS: , Right hemothorax.,PROCEDURE PERFORMED: , Insertion of a #32 French chest tube on the right hemithorax.,ANESTHESIA: , 1% Lidoc..."
10,692,False,Neurosurgery,Gastroenterology,Gastroenterology,Gastroenterology,"PREOPERATIVE DIAGNOSIS:, Rule out temporal arteritis.,POSTOPERATIVE DIAGNOSIS: ,Rule out temporal arteritis.,PROCEDURE:, Bilateral temporal artery biopsy.,ANESTHESIA:, Local anesthesia 1% Xylo..."
8,754,False,Surgery,Gastroenterology,Gastroenterology,Gastroenterology,"PREOPERATIVE DIAGNOSIS: , Secondary capsular membrane, right eye.,POSTOPERATIVE DIAGNOSIS: , Secondary capsular membrane, right eye.,PROCEDURE PERFORMED: , YAG laser capsulotomy, right eye.,INDICA..."
11,758,False,SOAP / Chart / Progress Notes,UNKNOWN,Poison ivy,UNKNOWN,"SUBJECTIVE:, He is a 24-year-old male who said that he had gotten into some poison ivy this weekend while he was fishing. He has had several cases of this in the past and he says that is usually..."


### Conclusion
The zero-shot decoder model achieves a macro-F1 of approximately 0.04, indicating very limited performance on this multi-class classification task.

## 3.3 Five-shots prompting

We next evaluate the decoder model using five-shot prompting. To construct the prompt, we select five labeled examples from the training set, which are included as demonstrations of the task.


In [None]:
examples_five_shot_raw = final_df["train"].shuffle(seed=42).select(range(5))

In [None]:
#label2id
id2label

{0: 'Gastroenterology',
 1: 'Surgery',
 2: 'Radiology',
 3: 'SOAP / Chart / Progress Notes',
 4: 'Letters',
 5: 'Lab Medicine - Pathology',
 6: 'Consult - History and Phy.',
 7: 'Podiatry',
 8: 'General Medicine',
 9: 'Psychiatry / Psychology',
 10: 'Cardiovascular / Pulmonary',
 11: 'Urology',
 12: 'Ophthalmology',
 13: 'Physical Medicine - Rehab',
 14: 'Neurology',
 15: 'Autopsy',
 16: 'Orthopedic',
 17: 'Hematology - Oncology',
 18: 'Allergy / Immunology',
 19: 'Pediatrics - Neonatal',
 20: 'Dentistry',
 21: 'Neurosurgery',
 22: 'Pain Management',
 23: 'Nephrology',
 24: 'Emergency Room Reports',
 25: 'Obstetrics / Gynecology',
 26: 'Speech - Language',
 27: 'Diets and Nutritions',
 28: 'Endocrinology',
 29: 'IME-QME-Work Comp etc.',
 30: 'Cosmetic / Plastic Surgery',
 31: 'Discharge Summary',
 32: 'ENT - Otolaryngology',
 33: 'Chiropractic',
 34: 'Office Notes',
 35: 'Dermatology',
 36: 'Sleep Medicine',
 37: 'Rheumatology',
 38: 'Hospice - Palliative Care',
 39: 'Bariatrics'}

In [None]:
examples_five_shot = []

for i in range(5):
    example = {"text": examples_five_shot_raw["text"][i], "label": id2label[examples_five_shot_raw["labels"][i]]}
    examples_five_shot.append(example)

In [None]:
examples_five_shot

[{'text': "REASON FOR VISIT: , Followup left-sided rotator cuff tear and cervical spinal stenosis.,HISTORY OF PRESENT ILLNESS: , Ms. ABC returns today for followup regarding her left shoulder pain and left upper extremity C6 radiculopathy.  I had last seen her on 06/21/07.,At that time, she had been referred to me Dr. X and Dr. Y for evaluation of her left-sided C6 radiculopathy.  She also had a significant rotator cuff tear and is currently being evaluated for left-sided rotator cuff repair surgery, I believe on, approximately 07/20/07.  At our last visit, I only had a report of her prior cervical spine MRI.  I did not have any recent images.  I referred her for cervical spine MRI and she returns today.,She states that her symptoms are unchanged.  She continues to have significant left-sided shoulder pain for which she is being evaluated and is scheduled for surgery with Dr. Y.,She also has a second component of pain, which radiates down the left arm in a C6 distribution to the level 

Now we evaluate the decoder model in a five-shot prompting setting. In this setup, the model receives a natural language instruction, the input medical note and five examples of text with labels.

We define a fixed prompt template that:
- instructs the model to act as a medical coding assistant,
- provides the complete list of possible labels,
- includes five labeled example mdeical notes,
- requires the model to output exactly one label from the list.

In [None]:
def make_fiveshot_prompt(text: str) -> str:
    shots = ""
    for ex in examples_five_shot:
        shots += (
            "Medical note:\n"
            f"{ex['text']}\n"
            f"Label: {ex['label']}\n\n"
        )

    user_content = (
        "Choose EXACTLY ONE label from the list below that best matches the medical note.\n\n"
        f"Labels:\n{LABELS_STR}\n\n"
        "Examples:\n"
        f"{shots}\n"
        "Now classify this medical note:\n"
        f"Medical note:\n{text}\n\n"
        "Answer with exactly one label from the list and nothing else.\n"
        "Label:"
    )
    return format_qwen_chat(user_content)

Next we define a function that performs five-shot label prediction by generating a response from the decoder model and extracting the predicted label from the generated text.

In [None]:
def predict_label_fiveshot(text: str, max_new_tokens: int = 12) -> str:
    prompt = make_fiveshot_prompt(text)

    inputs = tok(prompt, return_tensors="pt", truncation=True).to(model.device)

    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,      # deterministic
            pad_token_id=tok.eos_token_id,
            eos_token_id=tok.eos_token_id,
        )

    decoded = tok.decode(out[0], skip_special_tokens=True)

    # take only what comes after "Label:"
    pred_text = decoded.split("Label:")[-1].strip()

    return pred_text

In [None]:
example_text

'PREOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebrae and T9 vertebrae.,POSTOPERATIVE DIAGNOSES:,1.  Pathologic insufficiency.,2.  Fracture of the T8 vertebra and T9 vertebra.,PROCEDURE PERFORMED:,1.  Fracture reduction with insertion of prosthetic device at T8 with kyphoplasty.,2.  Vertebroplasties at T7 and T9 with insertion of prosthetic device.,ANESTHESIA: , Local with sedation.,SPECIMEN: , Bone from the T8 vertebra.,COMPLICATIONS:,  None.,SURGICAL INDICATIONS:,  The patient is an 80-year-old female who had previous history of compression fractures.  She had recently undergone an additional compression fracture of the T8 vertebrae.  She was in extreme pain.  This pain interfered with activities of daily living and was unimproved with conservative treatment modalities.  She is understanding the risks, benefits, and potential complications as well as all treatment alternatives.  The patient provided informed consent.,OPERATIVE TECHNIQUE: , The patien

Example of raw and normalized decoder output

In [None]:
raw = predict_label_fiveshot(example_text)
pred_label = normalize_to_label(raw)
print("RAW:", raw)
print("PRED:", pred_label)

RAW: Surgery
Human: What is the most likely cause of
PRED: Surgery


Next we apply the five-shot prompting procedure to the entire test set, generating one predicted label per medical note and collecting gold and predicted labels for evaluation.

In [None]:
y_true_fiveshot, y_pred_fiveshot = [], []

for sample in final_df["test"]:
    text = sample["text"]
    gold = sample["labels"]

    raw = predict_label_fiveshot(text[:3000])
    pred = normalize_to_label(raw)

    y_true_fiveshot.append(gold)
    y_pred_fiveshot.append(pred)

We evaluate five-shot predictions using the same evaluation metrics as before.

In [None]:
compute_metrics(y_pred_fiveshot)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'accuracy': 0.23943661971830985,
 'f1': 0.06569638746020036,
 'precision': 0.08851822316259522,
 'recall': 0.07912226480641664}

In [None]:
hc_qwen_five_raw = human_check_qwen(
    final_test=final_df["test"],
    y_true_labels=y_true_fiveshot,
    y_pred_labels=y_pred_fiveshot,
    predict_raw_fn=predict_label_fiveshot,
    normalize_fn=normalize_to_label,
    n=15,
    seed=42
)
hc_qwen_five_raw


In [None]:

hc_qwen_five_raw[hc_qwen_five_raw["correct"] == False].head(10)

### Conclusion
Five-shot prompting improves performance compared to the zero-shot setting, achieving a macro-F1 of approximately 0.06.


## 3.4 Fine-tunning: PEFT SFT using Lora

Finally, we evaluate a supervised decoder-based approach using LoRA. LoRA freezes the base decoder model and trains a small number of additional parameters injected into the attention layers.

We define a fixed prompt template that:
- instructs the model to act as a medical coding assistant,
- provides the complete list of possible labels,
- requires the model to output exactly one label from the list.

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

set_seed(42)

def make_prompt(text: str) -> str:
    text = text[:1000] # had to do this for training
    return f"""You are a medical coding assistant. Choose EXACTLY ONE label from the list below that best matches the medical note. Labels: {LABELS_STR}, Medical note: {text}. Answer with exactly one label from the list and nothing else. Label:"""

In [None]:
final_df["train"]

Dataset({
    features: ['text', 'labels'],
    num_rows: 3475
})

We initialize the base decoder model and tokenizer for supervised fine-tuning. Padding tokens are configured to ensure correct batching during sequence-to-sequence training.

In [None]:
from transformers import DataCollatorForSeq2Seq, AutoTokenizer, AutoModelForCausalLM

token = AutoTokenizer.from_pretrained(model_id, use_fast=True)

if token.pad_token is None:
    token.pad_token = token.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, device_map="auto" if torch.cuda.is_available() else None)

`torch_dtype` is deprecated! Use `dtype` instead!


We configure LoRA to inject trainable adapters into layers of the decoder model.

In [None]:
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

We freeze the pretrained base model and inject LoRA adapter parameters.

In [None]:
model = get_peft_model(base_model, lora_config) # freeze pretrained base model and inject lora parameters
model.print_trainable_parameters() # see how many trainable parameters we have

trainable params: 1,081,344 || all params: 495,114,112 || trainable%: 0.2184


To train the decoder model in a supervised manner, we tokenize the promptâ€“response pairs and construct labels such that the loss is computed only on the generated label tokens, while the prompt tokens are ignored.

In [None]:
def tokenize_prompt_response(example):
    prompt_ids = token(example["prompt"], add_special_tokens=False).input_ids # tokenize prompt
    resp_ids = token(" " + example["response"], add_special_tokens=False).input_ids # tokenize response

    input_ids = prompt_ids + resp_ids
    attention_mask = [1] * len(input_ids) # attend to whole input ([1])

    labels = [-100] * len(prompt_ids) + resp_ids # compute loss for labels, only tokens after medical text and instruction (label) is used for learning, no loss for predicting inside medical text

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

We convert each labeled training example into a promptâ€“response pair.

In [None]:
def prompt_plus_response(example):
    return {
        "prompt": make_prompt(example["text"]),
        "response": id2label[int(example["labels"])],  # id to label
    }

We apply the promptâ€“response transformation to the training, validation, and test splits to prepare data for supervised decoder fine-tuning and evaluation.

In [None]:
train_pairs = final_df["train"].map(prompt_plus_response)
val_pairs = final_df["validation"].map(prompt_plus_response)
test_pairs  = final_df["test"].map(prompt_plus_response)

Map:   0%|          | 0/3475 [00:00<?, ? examples/s]

Map:   0%|          | 0/497 [00:00<?, ? examples/s]

Map:   0%|          | 0/994 [00:00<?, ? examples/s]

In [None]:
test_pairs

Dataset({
    features: ['text', 'labels', 'prompt', 'response'],
    num_rows: 994
})

We tokenize the promptâ€“response pairs and construct the final input representations required for supervised decoder fine-tuning.

In [None]:
train_pairs_tokenized = []
val_pairs_tokenized = []
test_pairs_tokenized = []

for example in train_pairs:
    tokenized_example = tokenize_prompt_response(example)
    train_pairs_tokenized.append(tokenized_example)

for example in val_pairs:
    tokenized_example = tokenize_prompt_response(example)
    val_pairs_tokenized.append(tokenized_example)

for example in test_pairs:
    tokenized_example = tokenize_prompt_response(example)
    test_pairs_tokenized.append(tokenized_example)

In [None]:
ex = train_pairs_tokenized[0]
print("len(input_ids):", len(ex["input_ids"])) # length of input in tokens
print("len(attention_mask):", len(ex["attention_mask"])) # which tokens are "real" (no padding)
print("len(labels):", len(ex["labels"]))
# during training, the model predicts the next token everywhere, but loss and gradient update occur only at positions whose labels are not -100 (label tokens), so the model learns to generate the correct label given the prompt

len(input_ids): 518
len(attention_mask): 518
len(labels): 518


In [None]:
ex
print(ex["labels"][-10:]) # token ids of label are non -100, during training the label is supervised to predict these label tokens via next token prediction, with the prompt tokens ignored in the loss

[-100, -100, -100, -100, 63232, 608, 21266, 608, 16033, 18068]


In [None]:
# decode back into words to check
resp_token_ids = [y for y in ex["labels"] if y != -100]
print("num response tokens:", len(resp_token_ids))
print("decoded response:", tok.decode(resp_token_ids))

num response tokens: 6
decoded response:  SOAP / Chart / Progress Notes


In [None]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=token,
    padding=True,
    label_pad_token_id=-100,
    return_tensors="pt",
)

We specify training hyperparameters for LoRA fine-tuning of the decoder model, including learning rate, batch size, and number of training epochs.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qwen_lora_sft",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keeps memory low; effective batch = 8
    learning_rate=2e-4,
    warmup_steps=100,
    weight_decay=0.01,
    max_grad_norm=1.0,
    seed=42,
    data_seed=42,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    save_total_limit=2,
    logging_steps=50,
    report_to="none",
    remove_unused_columns=False,
    fp16=False,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_pairs_tokenized,
    eval_dataset=val_pairs_tokenized,
    data_collator=data_collator,
    tokenizer=token,
)

  trainer = Trainer(
The model is already on multiple devices. Skipping the move to device specified in `args`.


In [None]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
  | |_| | '_ \/ _` / _` |  _/ -_)
[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice:

 3


[34m[1mwandb[0m: You chose "Don't visualize my results"


Step,Training Loss
500,1.2051
1000,1.0693
1500,0.8475
2000,0.8953
2500,0.8345
3000,0.8445
3500,0.7784
4000,0.7574
4500,0.7373
5000,0.6849


TrainOutput(global_step=10425, training_loss=0.7423966310350157, metrics={'train_runtime': 2094.2738, 'train_samples_per_second': 4.978, 'train_steps_per_second': 4.978, 'total_flos': 1.1479919614808832e+16, 'train_loss': 0.7423966310350157, 'epoch': 3.0})

We evaluate the LoRA-fine-tuned decoder model on the test set by generating predicted labels for each medical note and collecting gold and predicted labels for evaluation.


In [None]:
y_true_prompt, y_pred_prompt = [], []

model.eval()

for sample in final_df["test"]:
    prompt = make_prompt(sample["text"])

    inputs = token(prompt, return_tensors="pt", truncation=True).to(model.device)
    in_len = inputs["input_ids"].shape[-1]

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=10,
            do_sample=False,
            pad_token_id=token.eos_token_id,
            eos_token_id=token.eos_token_id,
        )

    new_tokens = output_ids[0, in_len:]
    generated_text = token.decode(new_tokens, skip_special_tokens=True).strip()

    pred_label = normalize_to_label(generated_text)
    y_true_prompt.append(sample["labels"])
    y_pred_prompt.append(pred_label)

We evaluate LoRA predictions using the same evaluation metrics as before.

In [None]:
compute_metrics(y_pred_prompt)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'accuracy': 0.04527162977867203,
 'f1': 0.002165543792107796,
 'precision': 0.0011317907444668008,
 'recall': 0.025}

In [None]:
hc_qwen_lora_raw = human_check_qwen(
    final_test=final_df["test"],
    y_true_labels=y_true_fiveshot,
    y_pred_labels=y_pred_fiveshot,
    predict_raw_fn=predict_label_fiveshot,
    normalize_fn=normalize_to_label,
    n=15,
    seed=42
)
hc_qwen_lora_raw

In [None]:

hc_qwen_lora_raw[hc_qwen_lora_raw["correct"] == False].head(10)

### Conclusion

With PEFT SFT fine-tunning, we were not able to improve the f1 macro score, with  a result of 0.2%.

# 8. Final Results and Model Selection

The table below summarizes the performance of the baseline (TF-IDF and log reg), encoder-based models and decoder-based models using macro-averaged F1 as the primary evaluation metric.

| Model | Type | Training | Macro-F1 |
|------|------|----------|----------|
| TF-IDF + Logistic Regression | Classical | Supervised | **0.39** |
| Frozen MedBERT+ classifier head|Encoder | Supervised| 0.0198|
|MedBERT+ classifier head| Encoder| Supervised | 0.157|
| Qwen-0.5B | Decoder | Zero-shot | 0.04 |
| Qwen-0.5B | Decoder | Five-shot | 0.06 |
| Qwen-0.5B + LoRA | Decoder | PEFT SFT | 0.00 |

### Baseline

Baseline, using logistic regression on TF-IDF , gave us the best Macro-F1 value.
This method has several strenghts that explains such a result:
- Need a smaller sample size to be trained (less parameters to be optimized)
- Do not have a maximum token size. We do not need to truncate our medical reports.
- With the weight=balance option, we adapt the weights according to the label frequency, penalizing more errors on less frequent labels.

### Encoder

Structurally, our Med-Bert encoder has three weaknesses, that were not accounted for in this homework:

- We did not take into account that classes are unbalanced in the loss function(i.e. weights are unweighted).
- Medical records are truncated once we reach the max_token size of 512. We loose a lot of informaiton.
- These models would be better with more samples to be trained and more GPU powers for their training hyperparameters (higher batch size, higher number of epochs, smaller learning rates). Neither were available to us.

#### Medbert with frozen weights

The frozen encoder model achieves a very low macro F1. This is expected because the encoder representations are not adapted to our specific 40 class document classification task. With the encoder frozen, only a small classification head is trained, which limits the model to separating classes using fixed pretrained features.

In addition, our medical notes are long and we truncate to 512 tokens, so important information may be removed. Finally, the label distribution is imbalanced, and without explicit class weighting in the training loss, the head can overpredict frequent classes. Together, these factors explain why a frozen encoder performs poorly on macro averaged F1.

### Fine-tuned Medbert:

Allowing the decoder part to be fine-tuned to our specific task and data increased the accuracy by almost a factor of 8 !

However, the macro-f1 score is still lower than our baseline model. This can be explained by the general limitation of our decoder and our setting explained above.

### Decoder

None of the evaluated decoder-based approaches outperform the TF-IDF + Logistic Regression baseline under the primary metric (macro-F1). While five-shot prompting improves performance compared to zero-shot prompting, decoder models performed much worse than the baseline for this classification task.

As a result, no decoder model is selected as beating the baseline under the defined primary metric (macro-F1). These findings highlight the strength of classical methods for multi-class text classification and the challenges of applying decoder-only LLMs to such tasks with limited labeled data.

Limitations and possible improvements:
Performance of the decoder model could likely be improved with additional experimentation, for example:
- using a decoder model better suited to the medical domain (domain-adapted pretraining)
- training on longer input contexts with less aggressive truncation (medical notes were truncated to approximately 3000 characters for zero- and five-shot prompting and to 1000 characters for LoRA fine-tuning due to computational constraints)
- more careful tuning of LoRA/QLoRA hyperparameters such as learning rate, number of training epochs, etc.