# Task 3: Pre-trained transformers

### Aim
In this task, the aim is to train different algorithms to correctly classify our transcribed medical notes. The classes are labels extracted directly from the Argilla dataset, as shown in task 1 (e.g. surgery, orthopedics, ...).

## Libraries

In [9]:
import numpy as np
import sklearn
import matplotlib 
import transformers 
import pandas as pd
import tqdm 
import torch 
import spacy 
import nltk 
import langdetect

spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## 1. Dataset import

We re-use code from task 1 to import our argilla dataset, where we will only keep the text and the labels.

In [11]:

pd.set_option('display.max_colwidth', 200)

df = pd.read_parquet("hf://datasets/argilla/medical-domain/data/train-00000-of-00001-67e4e7207342a623.parquet")

def extract_label(pred):
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

df['label'] = df['prediction'].apply(extract_label)
df['text_length'] = df['metrics'].apply(lambda x: x.get('text_length') if isinstance(x, dict) else None)

# drop columns that are empty or no longer needed once label and text_length are extracted
df = df.drop(columns=['inputs', 'prediction', 'prediction_agent', 'annotation', 'annotation_agent', 'multi_label', 'explanation', 'metadata', 'status', 'event_timestamp', 'metrics'], errors='ignore')

#print(df.head)

<bound method NDFrame.head of                                                                                                                                                                                                          text  \
0     PREOPERATIVE DIAGNOSIS:,  Iron deficiency anemia.,POSTOPERATIVE DIAGNOSIS:,  Diverticulosis.,PROCEDURE:,  Colonoscopy.,MEDICATIONS: , MAC.,PROCEDURE: , The Olympus pediatric variable colonoscope w...   
1     CLINICAL INDICATION:  ,Normal stress test.,PROCEDURES PERFORMED:,1.  Left heart cath.,2.  Selective coronary angiography.,3.  LV gram.,4.  Right femoral arteriogram.,5.  Mynx closure device.,PROCE...   
2     FINDINGS:,Axial scans were performed from L1 to S2 and reformatted images were obtained in the sagittal and coronal planes.,Preliminary scout film demonstrates anterior end plate spondylosis at T1...   
3     PREOPERATIVE DIAGNOSIS: , Blood loss anemia.,POSTOPERATIVE DIAGNOSES:,1.  Diverticulosis coli.,2.  Internal hemorrhoids.,3.  Poo
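
To make the behaviour of `extract_label` concrete, here is a minimal sketch on a few hand-made records. The sample values (including the `score` key) are hypothetical and only imitate the structure we assume the `prediction` column has: a list or array of dictionaries with a `label` entry.

```python
import numpy as np

def extract_label(pred):
    # Same logic as in the import cell: return the label of the first prediction, if any
    if isinstance(pred, (list, np.ndarray)) and len(pred) > 0 and isinstance(pred[0], dict):
        return pred[0].get("label")
    return None

# Hypothetical records imitating the 'prediction' column
samples = [
    [{"label": "Surgery", "score": 1.0}],                             # list of dicts
    np.array([{"label": "Orthopedic", "score": 1.0}], dtype=object),  # array of dicts
    [],                                                               # empty prediction
    None,                                                             # missing prediction
]

labels = [extract_label(s) for s in samples]
print(labels)  # ['Surgery', 'Orthopedic', None, None]
```

Rows without a usable prediction end up with `None` as label, so it may be worth checking `df['label'].isna().sum()` before training.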

## 2. Baseline ML algorithms

We will try the 3 proposed algorithms (logistic regression, linear SVM and XGBoost) and pick the best-performing one.

### 2.1 Text pre-processing

In [46]:
###################################
#0. Split data set into train/test
#################################
# This code is inspired from : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

from sklearn.model_selection import train_test_split
X=df["text"]
y=df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42) # I split the text : 80% training, 20% test
############################
# 1. TF-IDF
############################

# With sklearn's TfidfVectorizer, we can pre-process our text directly:
# - convert everything to lowercase
# - tokenize into words
# - represent every document as a vector of the same length (the vocabulary size)

# The result is the TF-IDF weight of each token: its frequency in the document,
# scaled down by the number of documents that contain it.

## This code is adapted from https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents="unicode", # I want to strip all accents
                             lowercase=True,  # I want everything lowercase
                             stop_words="english", # I want to delete common stop words in english
                             min_df=5,  # I want words to be at least in 5 documents
                             max_df=0.8 # very frequent words are not useful to distinguish between documents
                            ) 


X_train = vectorizer.fit_transform(X_train)
X_test=vectorizer.transform(X_test) # I transform X_test with the vocabulary and document frequencies learned from X_train



### 2.2 Linear SVM

In [51]:
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

SVM=LinearSVC(random_state=0, tol=1e-5,class_weight="balanced")
SVM.fit(X_train,y_train)

SVM.score(X_test,y_test) # Accuracy

f1_score_macro_SVM=f1_score(y_test, SVM.predict(X_test), average='macro') # Macro F_1 score -->"harmonic mean of the precision and recall" https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
print("F1 score macro SVM: ",f1_score_macro_SVM)


F1 score macro SVM:  0.1643204293076342
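
Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below uses hypothetical labels (not our dataset) to show how a classifier can reach a high accuracy while still getting a low macro F1.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: one frequent class and two rare ones
y_true = ["surgery"] * 8 + ["urology", "dental"]
y_pred = ["surgery"] * 10  # a classifier that always predicts the majority class

class_names = ["surgery", "urology", "dental"]
per_class = f1_score(y_true, y_pred, average=None, labels=class_names, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = np.mean(np.array(y_true) == np.array(y_pred))

for name, score in zip(class_names, per_class):
    print(f"{name}: F1 = {score:.3f}")
print(f"macro F1 = {macro:.3f}, accuracy = {accuracy:.3f}")
```

This is why we report macro F1 next to accuracy: with 40 unbalanced categories, accuracy alone would mostly reflect the largest classes.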


### 2.3 Logistic regression

In [52]:
from sklearn.linear_model import LogisticRegression

LR=LogisticRegression(random_state=0, tol=1e-5,class_weight="balanced") # we have 40 categories, but some are over-represented. Therefore, we weight
                                                                 # each class inversely to its frequency in the training set
LR.fit(X_train,y_train)

LR.score(X_test,y_test)

f1_score_macro_LR=f1_score(y_test, LR.predict(X_test), average='macro')
print("F1 score macro LR: ",f1_score_macro_LR)

F1 score macro LR:  0.3944886061291781
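
To see what `class_weight="balanced"` does, the weights can be computed explicitly with `compute_class_weight`. The sketch below uses a hypothetical label vector `y_toy`; scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count)`, which the manual computation reproduces.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical unbalanced labels: 6 surgery, 3 urology, 1 dental
y_toy = np.array(["surgery"] * 6 + ["urology"] * 3 + ["dental"] * 1)
classes = np.unique(y_toy)  # sorted class names

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Manual computation: n_samples / (n_classes * number of samples in the class)
counts = np.array([(y_toy == c).sum() for c in classes])
manual = len(y_toy) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f"{c}: weight = {w:.3f}")
```

The rarest class gets the largest weight, so errors on it cost more during training.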


### 2.4 XGBoost

Given the high dimensionality of our data, XGBoost takes too long to run, and SVM and LR are already strong baseline ML algorithms to compare our transformers to.