# Task 3: Pre-trained transformers

## Aim
In this task, the aim is to train different algorithms to correctly classify our transcribed medical notes. The classes are the labels extracted directly from the argilla dataset, as shown in task 1 (e.g. surgery, orthopedics, ...).

In [None]:
import numpy as np
import sklearn
import matplotlib
import transformers
import pandas as pd
import tqdm
import torch
import spacy
import nltk
import evaluate


spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## 1. Dataset import

We re-use code from task 1 to import our argilla dataset, where we will only keep the text and the labels.

In [None]:

pd.set_option('display.max_colwidth', 200)

df = pd.read_parquet("hf://datasets/argilla/medical-domain/data/train-00000-of-00001-67e4e7207342a623.parquet")

def extract_label(pred):
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

df['label'] = df['prediction'].apply(extract_label)
df['text_length'] = df['metrics'].apply(lambda x: x.get('text_length') if isinstance(x, dict) else None)

# drop empty columns
df = df.drop(columns=['inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'metadata', 'status', 'event_timestamp', 'metrics'], errors='ignore')

#print(df.head)
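
To make the behaviour of `extract_label` concrete, here is a small sketch on hand-made records. The record shapes (a list or array holding one dict with `label` and `score` keys) are an assumption inferred from the function itself, not copied from the dataset.

```python
import numpy as np

def extract_label(pred):
    # Same logic as above: take the label of the first prediction, if any.
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

# Hypothetical records, for illustration only.
examples = [
    [{"label": " Surgery", "score": 1.0}],                            # plain list
    np.array([{"label": " Radiology", "score": 1.0}], dtype=object),  # object array
    [],                                                               # empty prediction
    None,                                                             # missing prediction
]

extracted = [extract_label(p) for p in examples]
print(extracted)
```

Empty or missing predictions fall through to `None`, so rows without a usable label can be spotted afterwards with `df['label'].isna()`.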

## 2. Baseline ML algorithms

We will try the three proposed algorithms (logistic regression, linear SVM and XGBoost) and pick the best-performing one.

In [None]:
###################################
#0. Split data set into train/test
#################################
# This code is inspired from : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
X=df["text"]
y=df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42) # I split the text : 80% training, 20% test
############################
# 1. TF-IDF
############################

# Using sklearn's TfidfVectorizer, we can pre-process our text directly:
# - convert everything to lowercase
# - tokenize words
# - represent every document as a feature vector of the same length

# The result is the TF-IDF weight of each token: its frequency in the document, scaled down by the number of documents that contain it.

## This code is adapted from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents="unicode", # I want to strip all accents
                             lowercase=True,  # I want everything lowercase
                             stop_words="english", # I want to delete common stop words in english
                             min_df=5,  # I want words to be at least in 5 documents
                             max_df=0.8 # very frequent words are not useful to distinguish between documents
                            )


X_train = vectorizer.fit_transform(X_train)
X_test=vectorizer.transform(X_test) # I transform X_test using the vocabulary and document frequencies learned from X_train
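
The fit-on-train, transform-on-test pattern above can be illustrated on a toy corpus. This is a minimal sketch: the sentences and the `toy_` names are made up for illustration, and `min_df` / `max_df` are left at their defaults because the corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, for illustration only (not taken from the dataset).
train_docs = [
    "patient underwent knee surgery",
    "knee pain after surgery",
    "chest scan shows clear lungs",
]
test_docs = ["knee scan with unknownword"]

toy_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
toy_test = toy_vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(toy_train.shape, toy_test.shape)
print("unknownword" in toy_vectorizer.vocabulary_)
```

Both matrices have the same number of columns, because the vocabulary is fixed by the training documents only; this is what keeps information from the test set out of the features.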



### 2.2 Linear SVM

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

SVM=LinearSVC(random_state=0, tol=1e-5,class_weight="balanced")
SVM.fit(X_train,y_train)

SVM.score(X_test,y_test) # Accuracy

f1_score_macro_SVM=f1_score(y_test, SVM.predict(X_test), average='macro') # Macro F_1 score -->"harmonic mean of the precision and recall" https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
print("F1 score macro SVM: ",f1_score_macro_SVM)

F1 score macro SVM:  0.1643204293076342


### 2.3 Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

LR=LogisticRegression(random_state=0, tol=1e-5,class_weight="balanced") # we have 40 categories, but some are over-represented. Therefore, we weight
                                                                 # each class inversely to its frequency in the training set
LR.fit(X_train,y_train)

LR.score(X_test,y_test)

f1_score_macro_LR=f1_score(y_test, LR.predict(X_test), average='macro')
print("F1 score macro LR: ",f1_score_macro_LR)

F1 score macro LR:  0.3944886061291781
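
Macro F1 is reported here rather than accuracy alone because the classes are imbalanced. As a small sketch with made-up labels, a classifier that always predicts the majority class can reach a high accuracy while its macro F1 stays low:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels, for illustration only: class 0 dominates, classes 1 and 2 are rare.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10  # a classifier that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores never-predicted classes as 0 without raising a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.2f}, macro F1 = {macro_f1:.2f}")
```

Macro averaging gives each class the same weight, so the two rare classes (F1 of 0) pull the average down even though 80% of the predictions are correct.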


### 2.4 XGBoost

Considering the high dimensionality of our data, XGBoost takes too much time to run, and SVM and LR are already strong baseline ML algorithms to compare our transformers against.

## 3. Encoder task

#### Model specification

We decided to use the MedBERT model. This is an encoder transformer, pre-trained for NER. We will use it for a classification task.

In [50]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModelForSequenceClassification.from_pretrained("Charangan/MedBERT",num_labels=40)

# This code is adapted from https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com

# I am freezing the encoder, but allowing to update weights of the classification head
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Charangan/MedBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
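
To check that freezing behaves as intended, one can count the parameters that still require gradients. The sketch below uses a toy two-layer model as a stand-in (the `ToyClassifier` class and its layer sizes are made up for illustration, this is not MedBERT); the same counting expression should apply to `model` above.

```python
from torch import nn

# Toy stand-in for an encoder plus classification head (not MedBERT itself).
class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)     # plays the role of model.base_model
        self.classifier = nn.Linear(4, 3)  # plays the role of the classification head

    def forward(self, x):
        return self.classifier(self.encoder(x))

toy_model = ToyClassifier()

# Freeze the "encoder" only, as done above for model.base_model.
for param in toy_model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy_model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only the head's weights and biases remain trainable, which is why a head-only run is cheap but can only learn a linear map on top of fixed features.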


#### dataset formatting

In [None]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)

dataset = dataset.select_columns(['text', 'label'])

# Labels are strings; I need to convert them to numbers.
labels=dataset.unique("label")

# I create a dictionary that takes a label as key and returns a numerical value.
# I ASKED CHATGPT TO WRITE THE DICTIONARY, AS IT IS JUST REPETITIVE AND LONG

label2id = {
    "Gastroenterology": 0,
    "Surgery": 1,
    "Radiology": 2,
    "SOAP / Chart / Progress Notes": 3,
    "Letters": 4,
    "Lab Medicine - Pathology": 5,
    "Consult - History and Phy.": 6,
    "Podiatry": 7,
    "General Medicine": 8,
    "Psychiatry / Psychology": 9,
    "Cardiovascular / Pulmonary": 10,
    "Urology": 11,
    "Ophthalmology": 12,
    "Physical Medicine - Rehab": 13,
    "Neurology": 14,
    "Autopsy": 15,
    "Orthopedic": 16,
    "Hematology - Oncology": 17,
    "Allergy / Immunology": 18,
    "Pediatrics - Neonatal": 19,
    "Dentistry": 20,
    "Neurosurgery": 21,
    "Pain Management": 22,
    "Nephrology": 23,
    "Emergency Room Reports": 24,
    "Obstetrics / Gynecology": 25,
    "Speech - Language": 26,
    "Diets and Nutritions": 27,
    "Endocrinology": 28,
    "IME-QME-Work Comp etc.": 29,
    "Cosmetic / Plastic Surgery": 30,
    "Discharge Summary": 31,
    "ENT - Otolaryngology": 32,
    "Chiropractic": 33,
    "Office Notes": 34,
    "Dermatology": 35,
    "Sleep Medicine": 36,
    "Rheumatology": 37,
    "Hospice - Palliative Care": 38,
    "Bariatrics": 39,
}

# Function for matching keys to values.
# map() gives me one row of my dataset at a time, as a dictionary.
# So I want to:
# 1) extract the label value from the dictionary
# 2) replace it with its numerical value using label2id
def matching(example):
    label=example["label"].strip() # labels have a whitespace as first character, which I strip
    example["label"]=label2id[label]
    return example

dataset=dataset.map(matching)


Map:   0%|          | 0/4966 [00:00<?, ? examples/s]
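
As an alternative to a hand-written dictionary, the mapping could be derived from the labels themselves. This is a minimal sketch on made-up labels; the names `raw_labels`, `label2id_auto` and `id2label_auto` are hypothetical, and the resulting ids follow alphabetical order, so they would differ from the ids in `label2id` above.

```python
# Hypothetical raw labels, with the leading whitespace mentioned above.
raw_labels = [" Surgery", " Radiology", " Surgery", " Urology"]

# Strip, deduplicate, and sort so that the mapping is reproducible across runs.
clean_labels = sorted({label.strip() for label in raw_labels})
label2id_auto = {label: idx for idx, label in enumerate(clean_labels)}
id2label_auto = {idx: label for label, idx in label2id_auto.items()}

print(label2id_auto)
print(id2label_auto)
```

With the real dataset, `raw_labels` would be the list returned by `dataset.unique("label")`. Keeping the inverse mapping is convenient for turning predicted ids back into readable category names.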

In [None]:
from datasets import load_dataset

dataset = dataset.rename_column("label", "labels") # the Trainer expects the target column to be named labels

final_df=dataset.train_test_split(test_size=0.2) # 80/20 split

### Now, we need to tokenize our data set. Adapted from: https://huggingface.co/docs/datasets/use_dataset

def tokenization(example):
    return tokenizer(example["text"], truncation=True, max_length=512) # I truncate every example longer than 512 tokens, which is
                                                                       # the maximum input size of our model

final_df_tokenized = final_df.map(tokenization, batched=True)


final_df_tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

Map:   0%|          | 0/3972 [00:00<?, ? examples/s]

Map:   0%|          | 0/994 [00:00<?, ? examples/s]
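
The tokenizer call above truncates but does not pad, so sequences keep different lengths until they are batched. The sketch below mimics both steps on made-up token ids in plain Python. It is a simplification: `truncate` and `pad_batch` are hypothetical helpers, and the real tokenizer also handles special tokens, which are ignored here.

```python
def truncate(ids, max_length):
    # Keep at most max_length token ids.
    return ids[:max_length]

def pad_batch(batch, pad_id=0):
    # Pad every sequence to the longest one in this batch (dynamic padding).
    longest = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return padded, mask

# Hypothetical token ids; max_length=4 stands in for the 512 used above.
sequences = [[11, 12, 13, 14, 15, 16], [21, 22, 23]]
truncated = [truncate(ids, max_length=4) for ids in sequences]
padded, attention_mask = pad_batch(truncated)

print(padded)
print(attention_mask)
```

Padding per batch rather than to a global maximum avoids spending computation on padding tokens when a batch happens to contain only short notes.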

#### Define testing metrics (accuracy, f1 macro)

In [55]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
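
`compute_metrics` can be tried outside the `Trainer` by passing any object that exposes `predictions` and `label_ids`. The sketch below uses a `SimpleNamespace` with made-up logits as a stand-in for the object the `Trainer` passes in; the function body is repeated only so the example runs on its own.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    # Same logic as above.
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}

# Hypothetical logits for 4 examples and 3 classes.
fake_pred = SimpleNamespace(
    predictions=np.array([[2.0, 1.0, 0.0],
                          [0.0, 3.0, 1.0],
                          [1.0, 0.0, 4.0],
                          [5.0, 0.0, 0.0]]),
    label_ids=np.array([0, 1, 2, 1]),
)

metrics = compute_metrics(fake_pred)
print(metrics)
```

Here the argmax predictions are `[0, 1, 2, 0]`, so three of the four examples are correct and the single mistake lowers the per-class scores of classes 0 and 1.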

#### training arguments

In [56]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import TrainingArguments



training_args = TrainingArguments(
    output_dir='.',          # output directory
    num_train_epochs=3,              # total # of training epochs --> small, as we only train the head
    per_device_train_batch_size=8,  # batch size per device during training --> small, as I run this on a CPU-only machine
    per_device_eval_batch_size=16,   # batch size for evaluation --> small, as I run this on a CPU-only machine
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    push_to_hub=True,
)


#### training loop

In [57]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["test"],          # evaluation dataset
    data_collator=data_collator, # dynamic padding --> every sequence in a batch is padded to the length of the longest one in that batch
    compute_metrics=compute_metrics # added to return f1
)

trainer.train()

Step,Training Loss
500,2.7855
1000,2.7304


It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use `HfApi().upload_large_folder(...)`/`hf upload-large-folder` instead. For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#upload-a-large-folder.

TrainOutput(global_step=1491, training_loss=2.742554038423248, metrics={'train_runtime': 152.9953, 'train_samples_per_second': 77.885, 'train_steps_per_second': 9.745, 'total_flos': 3136301034799104.0, 'train_loss': 2.742554038423248, 'epoch': 3.0})
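
The reported `global_step=1491` can be cross-checked against the training arguments. This is a small arithmetic sketch, assuming a single device and no gradient accumulation (neither is set explicitly above):

```python
import math

train_examples = 3972  # size of the training split, as shown in the Map output
batch_size = 8         # per_device_train_batch_size
epochs = 3             # num_train_epochs

# Assumption: a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The result, 497 steps per epoch and 1491 steps in total, matches the value in `TrainOutput`, which suggests that the run completed all three epochs.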

#### Evaluate accuracy

In [58]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 2.7441234588623047,
 'eval_accuracy': 0.2806841046277666,
 'eval_f1': 0.02032156368221942,
 'eval_precision': 0.013436852441587464,
 'eval_recall': 0.04184335886418831,
 'eval_runtime': 7.4364,
 'eval_samples_per_second': 133.668,
 'eval_steps_per_second': 8.472,
 'epoch': 3.0}

### Fine-tuning

We will fine-tune our model based on: https://huggingface.co/learn/llm-course/en/chapter3/3

In [59]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModelForSequenceClassification.from_pretrained("Charangan/MedBERT",num_labels=40)

# This code is adapted from https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com

# I am unfreezing the entire encoder
for param in model.base_model.parameters():
    param.requires_grad = True # I allow the weights to be updated

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Charangan/MedBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [60]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import TrainingArguments



training_args = TrainingArguments(
    output_dir='.',          # output directory
    num_train_epochs=3,              # total # of training epochs --> same value as in the head-only run
    per_device_train_batch_size=8,  # batch size per device during training --> small, as I run this on a CPU-only machine
    per_device_eval_batch_size=16,   # batch size for evaluation --> small, as I run this on a CPU-only machine
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    fp16=True,                       # Enable mixed precision
)


In [61]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["test"],          # evaluation dataset
    data_collator=data_collator, # dynamic padding --> every sequence in a batch is padded to the length of the longest one in that batch
    compute_metrics=compute_metrics # added to return f1
)

trainer.train()

Step,Training Loss
500,2.4786
1000,1.7749


TrainOutput(global_step=1491, training_loss=1.9178988727285589, metrics={'train_runtime': 374.4726, 'train_samples_per_second': 31.821, 'train_steps_per_second': 3.982, 'total_flos': 3136301034799104.0, 'train_loss': 1.9178988727285589, 'epoch': 3.0})

In [62]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 1.766693353652954,
 'eval_accuracy': 0.2837022132796781,
 'eval_f1': 0.12326781145347301,
 'eval_precision': 0.13015887484648142,
 'eval_recall': 0.13069268530119285,
 'eval_runtime': 8.8064,
 'eval_samples_per_second': 112.873,
 'eval_steps_per_second': 7.154,
 'epoch': 3.0}

## Adapting loss function

**STILL NEEDS TO BE FINISHED**

We will adapt our loss function to penalize mistakes on low-frequency labels more heavily. With this, we want to counterbalance the fact that our dataset is heavily biased towards surgery.

We adapted this code from the Hugging Face forum to handle this task: https://discuss.huggingface.co/t/create-a-weighted-loss-function-to-handle-imbalance/138178/3?utm_source=chatgpt.com

In [67]:
# custom loss
from torch import nn

loss = nn.CrossEntropyLoss()
def nll_loss(logits, labels):
    return loss(logits, labels)

# subclass trainer
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False,**kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss = nll_loss(logits, labels)

        return (loss, outputs) if return_outputs else loss
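
The loss above is still the plain, unweighted cross-entropy. One possible way to weight it is sketched below on a toy 3-class problem. The "balanced" heuristic (`n_samples / (n_classes * count)`) is borrowed from scikit-learn's `class_weight="balanced"` and is an assumption here, not necessarily the scheme the finished notebook will use; the sketch also assumes that every class appears at least once in the training labels.

```python
from collections import Counter

import torch
from torch import nn

# Hypothetical integer labels for a 3-class toy problem (class 0 dominates).
train_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
num_classes = 3

# "Balanced" weights: n_samples / (n_classes * count of the class).
counts = Counter(train_labels)
weights = torch.tensor(
    [len(train_labels) / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)

# reduction="none" keeps one loss value per example, to make the effect visible.
weighted_loss = nn.CrossEntropyLoss(weight=weights, reduction="none")

logits = torch.zeros(2, num_classes)  # uninformative predictions for two examples
labels = torch.tensor([0, 2])         # one frequent-class and one rare-class example
per_example = weighted_loss(logits, labels)

print(weights)
print(per_example)
```

With identical logits, the mistake on the rare class costs six times more than the one on the frequent class. Inside `CustomTrainer.compute_loss`, such a criterion (with the default mean reduction) could take the place of `nll_loss`, with the weight tensor moved to the same device as the logits.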

In [68]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = CustomTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["test"],          # evaluation dataset
    data_collator=data_collator, # dynamic padding --> every sequence in a batch is padded to the length of the longest one in that batch
    compute_metrics=compute_metrics # added to return f1

)

trainer.train()

Step,Training Loss
500,1.12
1000,1.214


TrainOutput(global_step=1491, training_loss=1.135620199058937, metrics={'train_runtime': 399.7978, 'train_samples_per_second': 29.805, 'train_steps_per_second': 3.729, 'total_flos': 3136301034799104.0, 'train_loss': 1.135620199058937, 'epoch': 3.0})

In [69]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


{'eval_loss': 2.1866230964660645,
 'eval_accuracy': 0.1267605633802817,
 'eval_f1': 0.08226783212630015,
 'eval_precision': 0.08296233562727673,
 'eval_recall': 0.08729211407499064,
 'eval_runtime': 7.2151,
 'eval_samples_per_second': 137.767,
 'eval_steps_per_second': 8.732,
 'epoch': 3.0}