# Task 3: Pre-trained transformers

### Aim
In this task, the aim is to train different algorithms to correctly classify our transcribed medical notes. The classes are the labels extracted directly from the argilla dataset, as shown in task 1 (e.g. surgery, orthopedics, ...).

## Libraries

In [3]:
import numpy as np
import sklearn
import matplotlib 
import transformers 
import pandas as pd
import tqdm 
import torch 
import spacy 
import nltk 
import langdetect
import evaluate

spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## 1. Dataset import

We re-use code from task 1 to import our argilla dataset, where we will only keep the text and the labels.

In [38]:

pd.set_option('display.max_colwidth', 200)

df = pd.read_parquet("hf://datasets/argilla/medical-domain/data/train-00000-of-00001-67e4e7207342a623.parquet")

def extract_label(pred):
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

df['label'] = df['prediction'].apply(extract_label)
df['text_length'] = df['metrics'].apply(lambda x: x.get('text_length') if isinstance(x, dict) else None)

# drop empty columns
df = df.drop(columns=['inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'metadata', 'status', 'event_timestamp', 'metrics'], errors='ignore')

#print(df.head)

## 2. Baseline ML algorithms

We will try the three proposed algorithms (logistic regression, linear SVM and XGBoost) and pick the best-performing one.

### 2.1 Text pre-processing

In [53]:
###################################
#0. Split data set into train/test
#################################
# This code is inspired from : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
X=df["text"]
y=df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42) # I split the text : 80% training, 20% test
############################
# 1. TF-IDF
############################

# Using sklearn's TfidfVectorizer, we can pre-process our text directly:
# - convert everything to lowercase
# - tokenize the text into words
# - represent every document as a vector of the same length (the vocabulary size)

# The result is one TF-IDF weight per token: its frequency in the document, scaled by its inverse document frequency across all documents.

## This code is adapted from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents="unicode", # I want to strip all accents
                             lowercase=True,  # I want everything lowercase
                             stop_words="english", # I want to delete common stop words in english
                             min_df=5,  # I want words to be at least in 5 documents
                             max_df=0.8 # very frequent words are not useful to distinguish between documents
                            ) 


X_train = vectorizer.fit_transform(X_train)
X_test=vectorizer.transform(X_test) # I transform X_test with the vocabulary and IDF weights learned on X_train
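
The fit/transform split used above can be illustrated on a toy corpus. This is only a minimal sketch: the documents and the names `train_docs`, `test_docs` and `toy_vectorizer` are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, not taken from the argilla dataset).
train_docs = [
    "patient underwent knee surgery",
    "knee pain after surgery",
    "chest pain and shortness of breath",
]
test_docs = ["knee arthroscopy with chest pain"]

toy_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")

# fit_transform learns the vocabulary and the IDF weights from the training texts only.
toy_train = toy_vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary: a word never seen in training ("arthroscopy") is ignored.
toy_test = toy_vectorizer.transform(test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train.shape, toy_test.shape)  # same number of columns in both matrices
```

Fitting on the training split only keeps information from the test documents out of the vocabulary and the IDF weights.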



### 2.2 Linear SVM

In [51]:
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

SVM=LinearSVC(random_state=0, tol=1e-5,class_weight="balanced")
SVM.fit(X_train,y_train)

SVM.score(X_test,y_test) # Accuracy

f1_score_macro_SVM=f1_score(y_test, SVM.predict(X_test), average='macro') # Macro F_1 score -->"harmonic mean of the precision and recall" https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
print("F1 score macro SVM: ",f1_score_macro_SVM)


F1 score macro SVM:  0.1643204293076342
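
To see why macro F1 can be much lower than accuracy on imbalanced classes, here is a small worked sketch. The labels below are invented for illustration; they are not predictions from the SVM above.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced toy labels: 8 "Surgery" notes and 2 "Urology" notes.
y_true = ["Surgery"] * 8 + ["Urology"] * 2
# A classifier that always predicts the majority class.
y_pred = ["Surgery"] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class that is never predicted as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("accuracy:", acc)
print("macro F1:", macro_f1)
```

Macro F1 averages the per-class F1 scores with equal weight, so a class that is never predicted pulls the score down regardless of how rare it is.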


### 2.3 Logistic regression

In [52]:
from sklearn.linear_model import LogisticRegression

LR=LogisticRegression(random_state=0, tol=1e-5,class_weight="balanced") # we have 40 categories, but some are over-represented. Therefore, we weight
                                                                 # each class inversely to its frequency in the training set
LR.fit(X_train,y_train)

LR.score(X_test,y_test)

f1_score_macro_LR=f1_score(y_test, LR.predict(X_test), average='macro')
print("F1 score macro LR: ",f1_score_macro_LR)

F1 score macro LR:  0.3944886061291781
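
The effect of `class_weight="balanced"` can be checked by hand. The sketch below uses invented toy labels (`y_toy`) and compares scikit-learn's `compute_class_weight` with the formula `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: one frequent class, one medium class, one rare class.
y_toy = np.array(["Surgery"] * 6 + ["Radiology"] * 3 + ["Autopsy"] * 1)
classes = np.unique(y_toy)  # sorted: Autopsy, Radiology, Surgery

# Weights computed by scikit-learn for the "balanced" setting.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights computed by hand: n_samples / (n_classes * count of each class).
counts = np.array([np.sum(y_toy == c) for c in classes])
manual = len(y_toy) / (len(classes) * counts)

for name, w in zip(classes, weights):
    print(name, round(float(w), 3))
```

Rare classes receive larger weights, so their misclassifications cost more in the loss.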


### 2.4 XGBoost

Given the high dimensionality of our data, XGBoost takes too long to run. SVM and LR are already strong baseline ML algorithms to compare our transformers against.

## 3. Encoder task

#### Model specification

We decided to use the MedBERT model. This is an encoder transformer, pre-trained for NER. We will use it for a classification task.

In [63]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModelForSequenceClassification.from_pretrained("Charangan/MedBERT",num_labels=40)

# This code is adapted from https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com

# I am freezing the encoder, but allowing to update weights of the classification head
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Charangan/MedBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
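
A quick way to check what freezing does is to count the trainable parameters. The sketch below uses a tiny stand-in network (`ToyClassifier`, with hypothetical layer sizes) rather than MedBERT, so it runs without downloading any checkpoint. The same two sums could be applied to `model.parameters()`.

```python
import torch.nn as nn

# Toy stand-in for an encoder plus classification head (hypothetical sizes, not MedBERT).
class ToyClassifier(nn.Module):
    def __init__(self, hidden=16, num_labels=40):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(32, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))

toy_model = ToyClassifier()

# Freeze the encoder: its weights no longer receive gradient updates.
for param in toy_model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy_model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only the head's weights and biases remain trainable, which is why a small number of epochs is affordable on CPU.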


#### Data set formatting

In [76]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)

dataset = dataset.select_columns(['text', 'label'])

# Labels are strings, so I need to convert them to numbers.
labels=dataset.unique("label")

# I create a dictionary that takes a label as key and returns a numeric id.
# I HAVE ASKED CHATGPT TO WRITE THE DICTIONARY, AS IT IS JUST REPETITIVE AND LONG

label2id = {
    "Gastroenterology": 0,
    "Surgery": 1,
    "Radiology": 2,
    "SOAP / Chart / Progress Notes": 3,
    "Letters": 4,
    "Lab Medicine - Pathology": 5,
    "Consult - History and Phy.": 6,
    "Podiatry": 7,
    "General Medicine": 8,
    "Psychiatry / Psychology": 9,
    "Cardiovascular / Pulmonary": 10,
    "Urology": 11,
    "Ophthalmology": 12,
    "Physical Medicine - Rehab": 13,
    "Neurology": 14,
    "Autopsy": 15,
    "Orthopedic": 16,
    "Hematology - Oncology": 17,
    "Allergy / Immunology": 18,
    "Pediatrics - Neonatal": 19,
    "Dentistry": 20,
    "Neurosurgery": 21,
    "Pain Management": 22,
    "Nephrology": 23,
    "Emergency Room Reports": 24,
    "Obstetrics / Gynecology": 25,
    "Speech - Language": 26,
    "Diets and Nutritions": 27,
    "Endocrinology": 28,
    "IME-QME-Work Comp etc.": 29,
    "Cosmetic / Plastic Surgery": 30,
    "Discharge Summary": 31,
    "ENT - Otolaryngology": 32,
    "Chiropractic": 33,
    "Office Notes": 34,
    "Dermatology": 35,
    "Sleep Medicine": 36,
    "Rheumatology": 37,
    "Hospice - Palliative Care": 38,
    "Bariatrics": 39,
}

# Function for mapping keys to values.
# map passes me one row of my dataset at a time, in dictionary form.
# So I want to:
# 1) extract the label value from the dictionary
# 2) replace it with a numerical value using my dictionary
def matching(example):
    label=example["label"].strip() # labels have a whitespace as first character, which I strip
    example["label"]=label2id[label]
    return example
    
dataset=dataset.map(matching)
        

Map: 100%|██████████| 4966/4966 [00:00<00:00, 7223.14 examples/s]
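
As an alternative to a hand-written dictionary, the mapping could be built from the labels themselves. This is a sketch on a made-up list (`raw_labels`). In the notebook, the list returned by `dataset.unique("label")` would play that role. Note that sorting gives different ids from the dictionary above, so the two mappings should not be mixed.

```python
# Hypothetical raw labels, with the leading whitespace seen in the dataset.
raw_labels = [" Surgery", " Radiology", " Surgery", " Urology"]

# Strip whitespace, deduplicate, and sort so the mapping is reproducible across runs.
unique_labels = sorted({label.strip() for label in raw_labels})

toy_label2id = {label: idx for idx, label in enumerate(unique_labels)}
toy_id2label = {idx: label for label, idx in toy_label2id.items()}

print(toy_label2id)
print(toy_id2label)
```

The inverse dictionary is convenient later for turning predicted ids back into readable category names.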


In [77]:
from datasets import load_dataset

dataset = dataset.rename_column("label", "labels") # the Trainer expects the target column to be named labels, so I rename label

final_df=dataset.train_test_split(test_size=0.2) # 80/20 split

### Now we need to tokenize our dataset. Adapted from: https://huggingface.co/docs/datasets/use_dataset

def tokenization(example):
    return tokenizer(example["text"], truncation=True, max_length=512) # I truncate every example longer than 512 tokens, which is
                                                                       # the max input size of our model

final_df_tokenized = final_df.map(tokenization, batched=True)




Map: 100%|██████████| 3972/3972 [00:05<00:00, 784.46 examples/s]
Map: 100%|██████████| 994/994 [00:01<00:00, 617.60 examples/s]


In [78]:
final_df_tokenized.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])

#### Define metric computation

In [None]:
### Now I will evaluate the model on our dataset using the accuracy metric. The code is adapted from: https://huggingface.co/docs/transformers/tasks/sequence_classification
import numpy as np
import evaluate
accuracy = evaluate.load("accuracy")



def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

#trainer.evaluate(eval_dataset=final_df_tokenized["test"])
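
Since the baselines are scored with macro F1, the transformer could report the same metric to make the comparison direct. The function below is a sketch of one way to do that with scikit-learn. The name `compute_metrics_with_f1` and the toy logits are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_with_f1(eval_pred):
    """Return accuracy and macro F1 from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # most likely class per example
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
    }

# Toy check with hypothetical logits for 4 examples and 3 classes.
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.2, 0.1],
    [0.1, 0.2, 3.0],
])
toy_labels = np.array([0, 1, 1, 2])
toy_metrics = compute_metrics_with_f1((toy_logits, toy_labels))
print(toy_metrics)
```

Passing such a function as `compute_metrics` would make `trainer.evaluate` return both values in one call.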

#### Training arguments

In [81]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import TrainingArguments
#from huggingface_hub import login
#login() Login to huggingface to store token


training_args = TrainingArguments(
    output_dir='.',          # output directory
    num_train_epochs=3,              # total number of training epochs --> small, as we only train the head
    per_device_train_batch_size=8,  # batch size per device during training --> small, as I run this on a CPU-only machine
    per_device_eval_batch_size=16,   # batch size for evaluation --> small, as I run this on a CPU-only machine
    warmup_steps=100,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    push_to_hub=True,
)
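
To get a feel for what `warmup_steps=100` means here, the sketch below reproduces a linear warmup followed by a linear decay in plain Python. It assumes the Trainer defaults (a linear schedule and a base learning rate of 5e-5), a single device, and no gradient accumulation. The helper `linear_warmup_decay` is written for illustration only.

```python
import math

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: rises linearly to 1, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

train_examples = 3972   # size of the training split reported by the tokenization step
batch_size = 8          # per_device_train_batch_size
epochs = 3              # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

base_lr = 5e-5  # assumption: default learning rate, not set explicitly in the notebook
for step in (0, 50, 100, total_steps // 2, total_steps):
    lr = base_lr * linear_warmup_decay(step, 100, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers only a small fraction of the optimizer steps.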

#### Training loop

In [82]:
# This code is adapted from : https://huggingface.co/transformers/v4.2.2/training.html?utm_source=chatgpt.com
from transformers import Trainer
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=final_df_tokenized["train"],         # training dataset
    eval_dataset= final_df_tokenized["test"],          # evaluation dataset
    data_collator=data_collator, # allows dynamic padding --> every sequence in a batch is padded to the longest sequence of that batch
    compute_metrics=compute_metrics # added to return accuracy
)

trainer.train()



Step,Training Loss


KeyboardInterrupt: 
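
The dynamic padding mentioned in the data collator comment can be sketched in plain Python. The function `pad_batch` and the token ids below are invented for illustration, and the padding id of 0 is an assumption. The real collator also returns tensors rather than lists.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical batches with different maximum lengths.
batch_a = pad_batch([[101, 7592, 102], [101, 102]])
batch_b = pad_batch([[101, 2023, 2003, 1037, 102], [101, 3231, 102]])

print(batch_a)
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))
```

Each batch is padded only to its own longest sequence, which saves computation compared with padding everything to 512 tokens.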

#### Evaluate model accuracy

In [None]:
trainer.evaluate(eval_dataset=final_df_tokenized["test"])

### 3.2 Supervised fine-tuning

### 3.3 Continual pre-training 