## Set up environment

First, we install the libraries we'll use: Hugging Face Transformers, Datasets, Evaluate and Accelerate.

In [None]:
!pip install -q -U datasets evaluate transformers[sentencepiece]
!pip install -q -U accelerate


## Hyperparameters

In [None]:
model_checkpoint = "bert-base-uncased"
max_length  = 256
batch_size = 8
metric_name = "f1"
num_train_epochs = 20

### Preprocessing

In [None]:
import pandas as pd
raw_data = pd.read_csv("multi_label_reports.csv")

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(raw_data, test_size=0.1, random_state=1)

train, val = train_test_split(train, test_size=0.15, random_state=1)  # 0.15 x 0.9 = 0.135 of all rows

In [None]:
train.to_csv("train_bert.csv",index = False)
val.to_csv("val_bert.csv",index = False)
test.to_csv("test_bert.csv",index = False)
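The two `train_test_split` calls are nested, so the second fraction applies only to the rows left after the first split. The sketch below works through that arithmetic. It assumes that the held-out part is rounded up to a whole row, and the total of 2051 rows is inferred from the split sizes printed later in this notebook; `nested_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def nested_split_sizes(n_rows, test_size, val_size):
    """Row counts for a test split followed by a validation split of the remainder."""
    n_test = math.ceil(n_rows * test_size)        # held-out part rounded up (assumption)
    n_remaining = n_rows - n_test
    n_val = math.ceil(n_remaining * val_size)     # fraction of the remainder, not of the total
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = nested_split_sizes(2051, test_size=0.1, val_size=0.15)
print(n_train, n_val, n_test)          # 1568 277 206
print(round(n_val / 2051, 3))          # 0.135
```

The validation set is therefore about 13.5% of the full data, not 15%.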

## Load dataset

In [None]:
from datasets import load_dataset

# dataset = load_dataset("csv", data_files = "multi_label_reports.csv")

data_files = {"train": "train_bert.csv", "validation": "val_bert.csv", "test": "test_bert.csv"}
dataset = load_dataset("csv", data_files=data_files)


As we can see, the dataset contains 3 splits: one for training, one for validation and one for testing.

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'CT 3D Reconstruction -PE', 'CT Abd', 'CT Abdomen', 'CT Chest', 'CT Chest Pulmonary Embolus -CH', 'CT Dissection -CH -AB', 'CT Extremities', 'CT Head', 'CT Head Special', 'CT Hematuria Protocol', 'CT Neck', 'CT Pelvis', 'CT Spine', 'CT Stone Protocol', 'CT Stone Protocol Abd Pelvis NonEnh -AB', 'CT Thorax', 'CV Coronaries Cs Only -CA -XA', 'ECHO 2D M Qual Dop -EC', 'ECHO 2D M Quan Dop', 'ECHO 2D M Quan Dop -EC', 'GU Retrograde Pyelogram', 'GU Urology OR Procedure -GU', 'IR Non-Vascular Intervention', 'MR Brain', 'NM Lung Aerosol -CH', 'NM Lung Perfusion -CH', 'US Abdomen ', 'US Appendix -AB', 'US Doppler', 'US Doppler Abd/Pel/Obs -AB', 'US Kidneys -AB', 'US Miscellaneous Small Parts -EX', 'US Obs 1st Trimester -OB', 'US Obs 2nd/3rd Trimester -OB', 'US Scrotum/Testes', 'US Scrotum/Testes -AB', 'US Soft Tissue Masses -EX', 'US Transabdominal Pelvis -PE', 'US Transvag Combined', 'US Transvag Combined-PE', 'XR AC Joints -SH', 'X

Let's check the first example of the training split:

In [None]:
example = dataset['train'][0]
example

{'text': 'Left ankle: The ankle mortise is maintained. No joint effusion or fracture isidentified. There is a large plantar surface calcaneal spur.',
 'CT 3D Reconstruction -PE': 0,
 'CT Abd': 0,
 'CT Abdomen': 0,
 'CT Chest': 0,
 'CT Chest Pulmonary Embolus -CH': 0,
 'CT Dissection -CH -AB': 0,
 'CT Extremities': 0,
 'CT Head': 0,
 'CT Head Special': 0,
 'CT Hematuria Protocol': 0,
 'CT Neck': 0,
 'CT Pelvis': 0,
 'CT Spine': 0,
 'CT Stone Protocol': 0,
 'CT Stone Protocol Abd Pelvis NonEnh -AB': 0,
 'CT Thorax': 0,
 'CV Coronaries Cs Only -CA -XA': 0,
 'ECHO 2D M Qual Dop -EC': 0,
 'ECHO 2D M Quan Dop': 0,
 'ECHO 2D M Quan Dop -EC': 0,
 'GU Retrograde Pyelogram': 0,
 'GU Urology OR Procedure -GU': 0,
 'IR Non-Vascular Intervention': 0,
 'MR Brain': 0,
 'NM Lung Aerosol -CH': 0,
 'NM Lung Perfusion -CH': 0,
 'US Abdomen ': 0,
 'US Appendix -AB': 0,
 'US Doppler': 0,
 'US Doppler Abd/Pel/Obs -AB': 0,
 'US Kidneys -AB': 0,
 'US Miscellaneous Small Parts -EX': 0,
 'US Obs 1st Trimester -

The dataset consists of radiology reports, labeled with 131 examinations. 

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [None]:
labels = [label for label in dataset['train'].features.keys() if label not in ['text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['CT 3D Reconstruction -PE',
 'CT Abd',
 'CT Abdomen',
 'CT Chest',
 'CT Chest Pulmonary Embolus -CH',
 'CT Dissection -CH -AB',
 'CT Extremities',
 'CT Head',
 'CT Head Special',
 'CT Hematuria Protocol',
 'CT Neck',
 'CT Pelvis',
 'CT Spine',
 'CT Stone Protocol',
 'CT Stone Protocol Abd Pelvis NonEnh -AB',
 'CT Thorax',
 'CV Coronaries Cs Only -CA -XA',
 'ECHO 2D M Qual Dop -EC',
 'ECHO 2D M Quan Dop',
 'ECHO 2D M Quan Dop -EC',
 'GU Retrograde Pyelogram',
 'GU Urology OR Procedure -GU',
 'IR Non-Vascular Intervention',
 'MR Brain',
 'NM Lung Aerosol -CH',
 'NM Lung Perfusion -CH',
 'US Abdomen ',
 'US Appendix -AB',
 'US Doppler',
 'US Doppler Abd/Pel/Obs -AB',
 'US Kidneys -AB',
 'US Miscellaneous Small Parts -EX',
 'US Obs 1st Trimester -OB',
 'US Obs 2nd/3rd Trimester -OB',
 'US Scrotum/Testes',
 'US Scrotum/Testes -AB',
 'US Soft Tissue Masses -EX',
 'US Transabdominal Pelvis -PE',
 'US Transvag Combined',
 'US Transvag Combined-PE',
 'XR AC Joints -SH',
 'XR Abdomen',
 'XR Ank
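To make the two mappings concrete, here is a small sketch on a made-up set of columns (the real list is built from the dataset features, as in the cell above). It also shows that label names are matched exactly, so a trailing space such as the one in `'US Abdomen '` is part of the key.

```python
# Hypothetical column names; only the construction mirrors the notebook.
columns = ["text", "CT Head", "US Abdomen ", "XR Abdomen"]

labels = [name for name in columns if name != "text"]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

# The two dictionaries are inverses of each other.
round_trip_ok = all(label2id[id2label[i]] == i for i in range(len(labels)))
print(round_trip_ok)                   # True

# Keys are compared character by character, including whitespace.
print(label2id["US Abdomen "])         # 1
print("US Abdomen" in label2id)        # False
```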

## Preprocess data

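We use the tokenizer that belongs to the model checkpoint to convert each report into BERT's input format (`input_ids`, `token_type_ids` and `attention_mask`), padding or truncating to `max_length` tokens. The label columns are gathered into one float vector per report, of length `len(labels)`.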



In [None]:
from transformers import AutoTokenizer
import numpy as np

# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def preprocess_data(examples):
  # take a batch of texts
  text = examples["text"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=max_length)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding
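The label-handling part of `preprocess_data` can be sketched without NumPy or a tokenizer. The batch below is made up, and `build_label_matrix` is a hypothetical helper that mirrors the column-by-column fill: each label column contributes one position in every example's label vector.

```python
def build_label_matrix(batch, label_names):
    """Turn per-label columns into one row of floats per example."""
    n_examples = len(batch["text"])
    return [
        [float(batch[name][i]) for name in label_names]
        for i in range(n_examples)
    ]


# A batched `map` call passes a dict of columns, each a list of length batch_size.
batch = {
    "text": ["report a", "report b"],
    "CT Head": [1, 0],
    "XR Chest": [0, 1],
    "XR Abdomen": [0, 1],
}
matrix = build_label_matrix(batch, ["CT Head", "XR Chest", "XR Abdomen"])
print(matrix)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
```

The values are floats because the multi-label loss expects float targets.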

In [None]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)


In [None]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [None]:
# example

In [None]:
tokenizer.decode(example['input_ids'])

'[CLS] left ankle : the ankle mortise is maintained. no joint effusion or fracture isidentified. there is a large plantar surface calcaneal spur. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P
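The long run of `[PAD]` tokens comes from `padding="max_length"`: every report is brought to exactly `max_length` ids. The sketch below mimics that behaviour for a single sequence with made-up token ids; `pad_or_truncate` is a hypothetical helper, and only the special-token ids (0, 101, 102) are taken from the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    body = token_ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
print(short_ids)    # [101, 7, 8, 9, 102, 0, 0, 0]
print(short_mask)   # [1, 1, 1, 1, 1, 0, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=8)
print(long_ids)     # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

The attention mask tells the model to ignore the padded positions.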

In [None]:
str(example['labels'])

'[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]'

In [None]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['XR Ankle', ' LT -LEX']
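The decoded result has two entries for what looks like a single examination name. One possible explanation, offered here as an assumption and not something the notebook confirms, is that examination names containing a comma were written to the CSV header without quotes, so one name became two label columns. The sketch below uses Python's `csv` module on a made-up header to show the difference that quoting makes.

```python
import csv
import io

# Hypothetical header lines; only the exam name is taken from the output above.
unquoted_header = "text,XR Ankle, LT -LEX\n"
quoted_header = 'text,"XR Ankle, LT -LEX"\n'

unquoted_columns = next(csv.reader(io.StringIO(unquoted_header)))
quoted_columns = next(csv.reader(io.StringIO(quoted_header)))

print(unquoted_columns)  # ['text', 'XR Ankle', ' LT -LEX']
print(quoted_columns)    # ['text', 'XR Ankle, LT -LEX']
```

If this is what happened, the two halves always fire together and the model still learns them, but the number of label columns is larger than the number of distinct examinations.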

Finally, we set the format of our data to PyTorch tensors. This will turn the training, validation and test sets into standard PyTorch datasets.

In [None]:
encoded_dataset.set_format("torch")

In [None]:
import joblib
joblib.dump(encoded_dataset, 'dataset.pkl')

['dataset.pkl']

In [None]:
encoded_dataset = joblib.load('dataset.pkl')

In [None]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1568
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 277
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 206
    })
})

## Define model

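We load the model from the pretrained checkpoint and add a classification head with one output per label. Setting `problem_type="multi_label_classification"` makes the model treat every label as an independent yes/no decision.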

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)
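With `problem_type="multi_label_classification"`, the Transformers sequence-classification head trains with binary cross-entropy on the raw logits (`BCEWithLogitsLoss`), which is why the labels were stored as floats. A dependency-free sketch of that loss for a single example, with made-up logits, looks like this:

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from raw logits."""
    per_label = [
        # Equivalent to -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))]
        max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
        for x, t in zip(logits, targets)
    ]
    return sum(per_label) / len(per_label)


targets = [1.0, 0.0, 0.0]
loss_uninformed = bce_with_logits([0.0, 0.0, 0.0], targets)   # every label at p = 0.5
loss_confident = bce_with_logits([4.0, -4.0, -4.0], targets)  # confident and correct

print(round(loss_uninformed, 4))         # 0.6931
print(loss_confident < loss_uninformed)  # True
```

Because the loss is averaged over all labels and most labels are 0 for any given report, the loss can become small long before the model predicts any positive label.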

## Train the model!

We are going to train the model using HuggingFace's Trainer for this task.

In [None]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions (logits), which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed on the thresholded predictions, as in the source;
    # passing `probs` instead would give a threshold-independent score
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    # on multi-label input, accuracy_score is exact-match (subset) accuracy
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
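A dependency-free sketch of the same thresholding logic, with made-up logits, can help when reading the training tables. It shows that when no logit passes the threshold there are no positive predictions, so micro F1 is 0, and exact-match accuracy is 0 as well whenever every report has at least one label. The function names are hypothetical and the F1 is defined as 0 when there is nothing to count.

```python
import math


def predict_labels(logits, threshold=0.5):
    """Sigmoid each logit, then threshold to 0/1."""
    return [[1 if 1.0 / (1.0 + math.exp(-x)) >= threshold else 0 for x in row]
            for row in logits]


def micro_f1(y_true, y_pred):
    """F1 over all (example, label) pairs pooled together."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Share of examples whose whole label vector is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [[1, 0, 0], [0, 1, 1]]
early = predict_labels([[-3.0, -3.0, -3.0], [-3.0, -3.0, -3.0]])  # nothing passes 0.5
later = predict_labels([[2.0, -3.0, -3.0], [-3.0, 1.5, -0.2]])    # one label missed

print(micro_f1(y_true, early), subset_accuracy(y_true, early))    # 0.0 0.0
print(micro_f1(y_true, later), subset_accuracy(y_true, later))    # 0.8 0.5
```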

Let's start training!

## 512 tokens and batch size 16

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1568
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1960
  Number of trainable parameters = 109588362
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


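| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.220919 | 0.0 | 0.5 | 0.0 |
| 2 | No log | 0.104548 | 0.0 | 0.5 | 0.0 |
| 3 | No log | 0.077199 | 0.0 | 0.5 | 0.0 |
| 4 | No log | 0.067003 | 0.0 | 0.5 | 0.0 |
| 5 | No log | 0.062103 | 0.0 | 0.5 | 0.0 |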


***** Running Evaluation *****
  Num examples = 277
  Batch size = 16
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-98
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-98/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-98/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-98/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-98/special_tokens_map.json
Deleting older checkpoint [bert-finetuned-sem_eval-english/checkpoint-1568] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 277
  Batch size = 16
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-196
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-196/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-196/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoin
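The step counts in the training log follow directly from the dataset size, the batch size and the number of epochs. A minimal sketch of that arithmetic, assuming a single device and no gradient accumulation (as the log reports); `optimization_steps` is a hypothetical helper:

```python
import math


def optimization_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total optimization steps)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # last partial batch still counts
    return steps_per_epoch, steps_per_epoch * epochs


print(optimization_steps(1568, 16, 20))   # (98, 1960)
print(optimization_steps(1568, 8, 20))    # (196, 3920)
print(optimization_steps(1568, 8, 40))    # (196, 7840)
```

Since checkpoints are saved once per epoch, the checkpoint names are multiples of the steps per epoch: `checkpoint-98` is the end of epoch 1 with batch size 16, and `checkpoint-196` is the end of epoch 1 with batch size 8.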

### Evaluate

After training, we evaluate our model on the validation set.

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 277
  Batch size = 8




{'eval_loss': 0.030985243618488312,
 'eval_f1': 0.6572438162544169,
 'eval_roc_auc': 0.757471054075311,
 'eval_accuracy': 0.4332129963898917}

## 256 tokens and batch size 8

In [None]:

model = AutoModelForSequenceClassification.from_pretrained("./models/multi-label-train-1/checkpoint-3724", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
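The training logs in this notebook report 109588362 trainable parameters. As a sanity check, that number can be reproduced from the architecture. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary of 30522, hidden size 768, 12 layers, intermediate size 3072, 512 positions) plus a single linear classification layer; `bert_classifier_params` is a hypothetical helper.

```python
def bert_classifier_params(num_labels, vocab=30522, hidden=768, layers=12,
                           intermediate=3072, max_pos=512, type_vocab=2):
    """Parameter count of a BERT encoder with pooler and a linear classifier."""
    layer_norm = 2 * hidden                                   # weight + bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # Q, K, V, output
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


encoder_only = bert_classifier_params(num_labels=0)
print(encoder_only)                           # 109482240

# Each label adds hidden + 1 = 769 parameters, so the logged total implies:
implied_labels = (109588362 - encoder_only) // 769
print(implied_labels)                         # 138
print(bert_classifier_params(implied_labels)) # 109588362
```

Under these assumptions the classifier has 138 outputs, that is, 138 label columns, compared with the 131 examinations mentioned above.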


In [None]:
trainer.train()

***** Running training *****
  Num examples = 1568
  Num Epochs = 20
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3920
  Number of trainable parameters = 109588362


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.032235 | 0.616845 | 0.739738 | 0.382671 |
| 2 | No log | 0.029194 | 0.650575 | 0.760942 | 0.440433 |
| 3 | 0.027000 | 0.028652 | 0.656682 | 0.762844 | 0.458484 |
| 4 | 0.027000 | 0.026628 | 0.676404 | 0.777552 | 0.483755 |
| 5 | 0.027000 | 0.025854 | 0.704619 | 0.80232 | 0.530686 |
| 6 | 0.021200 | 0.025635 | 0.688196 | 0.784945 | 0.494585 |
| 7 | 0.021200 | 0.024784 | 0.72051 | 0.812499 | 0.534296 |
| 8 | 0.017600 | 0.023889 | 0.711974 | 0.804248 | 0.509025 |
| 9 | 0.017600 | 0.023715 | 0.743215 | 0.828211 | 0.606498 |
| 10 | 0.017600 | 0.023556 | 0.752844 | 0.835591 | 0.599278 |



TrainOutput(global_step=3920, training_loss=0.016417638866268857, metrics={'train_runtime': 1597.4287, 'train_samples_per_second': 19.632, 'train_steps_per_second': 2.454, 'total_flos': 4130619050557440.0, 'train_loss': 0.016417638866268857, 'epoch': 20.0})
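The throughput figures in `TrainOutput` can be cross-checked against the run settings. This is a small sketch of the arithmetic, using the numbers printed above (1568 training examples, 20 epochs, 3920 steps, 1597.4287 seconds); `throughput` is a hypothetical helper.

```python
def throughput(n_examples, epochs, global_steps, runtime_seconds):
    """Return (samples per second, optimization steps per second)."""
    samples_per_second = n_examples * epochs / runtime_seconds
    steps_per_second = global_steps / runtime_seconds
    return samples_per_second, steps_per_second


samples_per_s, steps_per_s = throughput(1568, 20, 3920, 1597.4287)
print(f"{samples_per_s:.2f} samples/s, {steps_per_s:.2f} steps/s")
```

Both values agree with the reported `train_samples_per_second` (19.632) and `train_steps_per_second` (2.454) to within rounding, which confirms that this run covered all 20 epochs at batch size 8.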

### Evaluate

## 1st training run: epochs 1 to 20

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 277
  Batch size = 8


{'eval_loss': 0.03397705405950546,
 'eval_f1': 0.6215235792019347,
 'eval_roc_auc': 0.7371383368849229,
 'eval_accuracy': 0.4043321299638989,
 'eval_runtime': 3.9729,
 'eval_samples_per_second': 69.722,
 'eval_steps_per_second': 8.81,
 'epoch': 20.0}

## 2nd training run: epochs 21 to 40

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 277
  Batch size = 8


{'eval_loss': 0.021440846845507622,
 'eval_f1': 0.7879396984924624,
 'eval_roc_auc': 0.8614694432911009,
 'eval_accuracy': 0.6606498194945848,
 'eval_runtime': 3.9053,
 'eval_samples_per_second': 70.93,
 'eval_steps_per_second': 8.962,
 'epoch': 20.0}

## 3rd training run

In [None]:

model = AutoModelForSequenceClassification.from_pretrained("./models/multilabel-train-2nd-3920-66%-ac/checkpoint-3724", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=40,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1568
  Num Epochs = 40
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7840
  Number of trainable parameters = 109588362


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.022023 | 0.779352 | 0.855 | 0.642599 |
| 2 | No log | 0.021454 | 0.772455 | 0.856689 | 0.65704 |
| 3 | 0.011500 | 0.021561 | 0.791296 | 0.868757 | 0.66787 |
| 4 | 0.011500 | 0.022558 | 0.773227 | 0.856702 | 0.66065 |
| 5 | 0.011500 | 0.021675 | 0.780392 | 0.866763 | 0.67148 |
| 6 | 0.009900 | 0.021938 | 0.791296 | 0.868757 | 0.67509 |
| 7 | 0.009900 | 0.021041 | 0.791016 | 0.873272 | 0.67509 |
| 8 | 0.008300 | 0.021658 | 0.793682 | 0.870605 | 0.6787 |
| 9 | 0.008300 | 0.021084 | 0.791708 | 0.869668 | 0.67509 |
| 10 | 0.008300 | 0.019968 | 0.800785 | 0.876151 | 0.68231 |



TrainOutput(global_step=7840, training_loss=0.005927907599478352, metrics={'train_runtime': 3205.2737, 'train_samples_per_second': 19.568, 'train_steps_per_second': 2.446, 'total_flos': 8261238101114880.0, 'train_loss': 0.005927907599478352, 'epoch': 40.0})

## 4th training run

In [None]:

model = AutoModelForSequenceClassification.from_pretrained("./models/MLT-3rd-5820-77_2%-ac/checkpoint-7840", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=40,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1568
  Num Epochs = 40
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7840
  Number of trainable parameters = 109588362


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.022628 | 0.81268 | 0.889921 | 0.707581 |
| 2 | No log | 0.024056 | 0.798086 | 0.884243 | 0.707581 |
| 3 | 0.004400 | 0.02423 | 0.787821 | 0.881351 | 0.696751 |
| 4 | 0.004400 | 0.024057 | 0.806084 | 0.890713 | 0.729242 |
| 5 | 0.004400 | 0.023035 | 0.804533 | 0.892495 | 0.722022 |
| 6 | 0.003900 | 0.024309 | 0.798095 | 0.886052 | 0.714801 |
| 7 | 0.003900 | 0.024068 | 0.798113 | 0.889669 | 0.722022 |
| 8 | 0.003300 | 0.023348 | 0.813623 | 0.896271 | 0.740072 |
| 9 | 0.003300 | 0.024711 | 0.797336 | 0.886038 | 0.725632 |
| 10 | 0.003300 | 0.024527 | 0.805687 | 0.89161 | 0.732852 |



TrainOutput(global_step=7840, training_loss=0.0025748017110994886, metrics={'train_runtime': 3201.6817, 'train_samples_per_second': 19.59, 'train_steps_per_second': 2.449, 'total_flos': 8261238101114880.0, 'train_loss': 0.0025748017110994886, 'epoch': 40.0})

## 5th training run

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./models/MLT-4th-1568-77_4%-ac/checkpoint-7840", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=100,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 1568
  Num Epochs = 100
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 19600
  Number of trainable parameters = 109588362
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.02581 | 0.803419 | 0.889762 | 0.722022 |
| 2 | No log | 0.026399 | 0.795455 | 0.88691 | 0.722022 |
| 3 | 0.002200 | 0.026631 | 0.800755 | 0.89062 | 0.740072 |
| 4 | 0.002200 | 0.025868 | 0.809524 | 0.891677 | 0.732852 |
| 5 | 0.002200 | 0.025169 | 0.805268 | 0.894317 | 0.732852 |
| 6 | 0.002100 | 0.024445 | 0.810606 | 0.894409 | 0.740072 |
| 7 | 0.002100 | 0.024794 | 0.812441 | 0.897156 | 0.743682 |
| 8 | 0.002000 | 0.026591 | 0.801136 | 0.889722 | 0.740072 |
| 9 | 0.002000 | 0.026524 | 0.797721 | 0.886949 | 0.718412 |
| 10 | 0.002000 | 0.02642 | 0.802632 | 0.893366 | 0.740072 |



KeyboardInterrupt: 

## Evaluation

### Evaluation of the 5th-run model

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./models/MLT-5th-1372-77_4%-ac/checkpoint-1372", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    f"bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=100,
    weight_decay=0.01,
    save_total_limit = 2,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

loading configuration file ./models/MLT-5th-1372-77_4%-ac/checkpoint-1372/config.json
Model config BertConfig {
  "_name_or_path": "./models/MLT-5th-1372-77_4%-ac/checkpoint-1372",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "CT 3D Reconstruction -PE",
    "1": "CT Abd",
    "2": "CT Abdomen",
    "3": "CT Chest",
    "4": "CT Chest Pulmonary Embolus -CH",
    "5": "CT Dissection -CH -AB",
    "6": "CT Extremities",
    "7": "CT Head",
    "8": "CT Head Special",
    "9": "CT Hematuria Protocol",
    "10": "CT Neck",
    "11": "CT Pelvis",
    "12": "CT Spine",
    "13": "CT Stone Protocol",
    "14": "CT Stone Protocol Abd Pelvis NonEnh -AB",
    "15": "CT Thorax",
    "16": "CV Coronaries Cs Only -CA -XA",
    "17": "ECHO 2D M Qual Dop -EC",
    "18": "ECHO 2D M Qu

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 277
  Batch size = 8


{'eval_loss': 0.024794353172183037,
 'eval_f1': 0.8124410933081999,
 'eval_roc_auc': 0.8971555728645645,
 'eval_accuracy': 0.7436823104693141,
 'eval_runtime': 4.0027,
 'eval_samples_per_second': 69.204,
 'eval_steps_per_second': 8.744}
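
The `compute_metrics` function is defined earlier in the notebook and is not shown in this section. As a rough illustration of why the accuracy can be lower than the F1 score in a multi-label setting, the sketch below assumes micro-averaged F1 and exact-match (subset) accuracy; both choices are assumptions, and the helper names `micro_f1` and `exact_match_accuracy` as well as the toy matrices are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, label) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def exact_match_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted correctly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Toy data (hypothetical): 2 examples, 3 labels
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0]]

toy_f1 = micro_f1(toy_true, toy_pred)
toy_acc = exact_match_accuracy(toy_true, toy_pred)
print(toy_f1, toy_acc)
```

Under these assumptions, a single missed label makes the whole example count as wrong for accuracy, while F1 still gives credit for the labels that were predicted correctly.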

In [None]:
predictions = trainer.predict(encoded_dataset["test"])

***** Running Prediction *****
  Num examples = 206
  Batch size = 8


In [None]:
predictions.label_ids

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [None]:
# Collect the ground-truth label names for every test example
true_label = []
for i in range(len(encoded_dataset["test"]['labels'])):
    true_labels_i = [id2label[idx] for idx, label in enumerate(encoded_dataset["test"]['labels'][i]) if label == 1.0]
    true_label.append(true_labels_i)

In [None]:
true_label_dt = pd.DataFrame(true_label, columns = ['true_exam 1', 'true_exam 2', 'true_exam 3'])
true_label_dt.fillna('No examination', inplace=True)
true_label_dt.head()

Unnamed: 0,true_exam 1,true_exam 2,true_exam 3
0,XR Shoulder,RT -SH,No examination
1,XR Chest,Non Dedicated Unit -CH,No examination
2,XR Chest,Non Dedicated Unit -CH,No examination
3,XR Cervical Spine -VC,No examination,No examination
4,XR Abdomen,1 View -AB,No examination


In [None]:
import numpy as np
import pandas as pd
import torch

# Note: predictions.label_ids holds the ground truth, so the predicted labels
# must be derived from the logits in predictions.predictions instead.
sigmoid = torch.nn.Sigmoid()
predict_labels = []
for i in range(len(predictions.predictions)):
    # logits -> probabilities -> 0/1 decisions at a threshold of 0.5
    probs = sigmoid(torch.from_numpy(predictions.predictions[i].squeeze()))
    predictions_current = np.zeros(probs.shape)
    predictions_current[np.where(probs >= 0.5)] = 1
    predicted_label = [id2label[idx] for idx, label in enumerate(predictions_current) if label == 1.0]
    predict_labels.append(predicted_label)

predict_label_dt = pd.DataFrame(predict_labels, columns = ['pred_exam 1', 'pred_exam 2','pred_exam_3','pred_exam_4'])

predict_label_dt.fillna('No examination', inplace=True)

predict_label_dt.head()


Unnamed: 0,pred_exam 1,pred_exam 2,pred_exam_3,pred_exam_4
0,XR Shoulder,RT -SH,No examination,No examination
1,XR Chest,Non Dedicated Unit -CH,No examination,No examination
2,XR Chest,Non Dedicated Unit -CH,No examination,No examination
3,XR Cervical Spine -VC,No examination,No examination,No examination
4,XR Abdomen,Multi View -AB,No examination,No examination


In [None]:
def max_print_out(pattern=False):
    '''Configure pandas display options.

    Show all rows when `pattern` is True, otherwise at most 10 rows;
    show up to 50 columns and print floats with 2 decimal places.'''
    number = None if pattern else 10
    # Set options to avoid truncation when displaying a dataframe
    pd.set_option("display.max_rows", number)
    pd.set_option("display.max_columns", 50)
    # Set floating point numbers to be displayed with 2 decimal places
    pd.set_option('display.float_format', '{:.2f}'.format)

In [None]:
whole_dt = pd.concat([true_label_dt, 
                      predict_label_dt], axis = 1)
max_print_out(True)

In [None]:
whole_dt.columns

Index(['true_exam 1', 'true_exam 2', 'true_exam 3', 'pred_exam 1',
       'pred_exam 2', 'pred_exam_3', 'pred_exam_4'],
      dtype='object')

In [None]:
report_1 = whole_dt[whole_dt['true_exam 1'] != whole_dt['pred_exam 1']] 

In [None]:
report_2 = whole_dt[whole_dt['true_exam 2'] != whole_dt['pred_exam 2']] 

In [None]:
report_3 = whole_dt[whole_dt['true_exam 3'] != whole_dt['pred_exam_3']] 

In [None]:
report_3.head()

Unnamed: 0,true_exam 1,true_exam 2,true_exam 3,pred_exam 1,pred_exam 2,pred_exam_3,pred_exam_4
13,CT Abd,Pelvis,Enhanced -AB,CT Stone Protocol Abd Pelvis NonEnh -AB,No examination,No examination,No examination
14,CT Abd,Pelvis,Enhanced -AB,CT Abd,No examination,No examination,No examination
35,US Abdomen,>3 Organ -AB,No examination,US Abdomen,<3 Organ -AB,>3 Organ -AB,No examination
48,CT Abd,Pelvis,Non Enhanced -AB,CT Abd,Pelvis,Enhanced -AB,No examination
62,XR Kidneys,Ureters,Bladder -AB,CT Stone Protocol Abd Pelvis NonEnh -AB,No examination,No examination,No examination


In [None]:
reports = pd.concat([report_1, report_2, report_3], axis = 0)

In [None]:
reports.drop_duplicates(inplace= True)

In [None]:
reports.sort_index(inplace = True)

In [None]:
print("Total wrong predictions", len(reports))

Total wrong predictions 48


In [None]:
print("Total test set", len(encoded_dataset["test"]['labels']))

Total test set 206
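
The counts above give 48 mismatching rows out of 206 test examples, roughly 23.3%. The comparison checks only the first three prediction columns, so a differing `pred_exam_4` entry on its own would not be flagged. A possible alternative, sketched here on toy data, is to compare the full label sets per example. The helper name `find_mismatches` and the toy lists are hypothetical; in the notebook, the lists `true_label` and `predict_labels` built earlier could be passed in instead.

```python
def find_mismatches(true_labels, pred_labels):
    """Return the indices of examples whose predicted label set differs from the true set."""
    return [
        i
        for i, (true_row, pred_row) in enumerate(zip(true_labels, pred_labels))
        if set(true_row) != set(pred_row)
    ]


# Toy data (hypothetical), one list of label names per example
toy_true = [
    ["XR Chest", "Non Dedicated Unit -CH"],
    ["CT Head"],
    ["XR Abdomen", "1 View -AB"],
]
toy_pred = [
    ["Non Dedicated Unit -CH", "XR Chest"],  # same set, different order
    ["CT Head", "CT Neck"],                  # one extra label
    ["XR Abdomen", "Multi View -AB"],        # one wrong label
]

wrong = find_mismatches(toy_true, toy_pred)
error_rate = len(wrong) / len(toy_true)
print("Total wrong predictions", len(wrong))
print("Error rate", error_rate)
```

Comparing sets makes the result independent of column position and of the number of `pred_exam` columns.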


#### Reports with wrong predictions

In [None]:
reports

Unnamed: 0,true_exam 1,true_exam 2,true_exam 3,pred_exam 1,pred_exam 2,pred_exam_3,pred_exam_4
4,XR Abdomen,1 View -AB,No examination,XR Abdomen,Multi View -AB,No examination,No examination
11,XR Chest,Non Dedicated Unit -CH,No examination,XR Shoulder,RT -SH,No examination,No examination
13,CT Abd,Pelvis,Enhanced -AB,CT Stone Protocol Abd Pelvis NonEnh -AB,No examination,No examination,No examination
14,CT Abd,Pelvis,Enhanced -AB,CT Abd,No examination,No examination,No examination
15,XR Hip,LT 2 Views -PE,No examination,No examination,No examination,No examination,No examination
20,XR Ankle,LT -LEX,No examination,XR Chest,Non Dedicated Unit -CH,No examination,No examination
21,ECHO 2D M Quan Dop -EC,No examination,No examination,ECHO 2D M Quan Dop,Mobile -EC,No examination,No examination
25,XR Foot,LT -LEX,No examination,XR Lumbar Spine -VC,No examination,No examination,No examination
27,US Kidneys -AB,No examination,No examination,US Transvag Combined-PE,No examination,No examination,No examination
28,XR Chest,1 View -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination,No examination


In [None]:
whole_dt

Unnamed: 0,true_exam 1,true_exam 2,true_exam 3,pred_exam 1,pred_exam 2,pred_exam_3
0,XR Shoulder,RT -SH,No examination,XR Shoulder,RT -SH,No examination
1,XR Chest,Non Dedicated Unit -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination
2,XR Chest,Non Dedicated Unit -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination
3,XR Cervical Spine -VC,No examination,No examination,XR Cervical Spine -VC,No examination,No examination
4,XR Abdomen,1 View -AB,No examination,XR Abdomen,Multi View -AB,No examination
5,XR Chest,Non Dedicated Unit -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination
6,CT Head,Non Enhanced -NE,No examination,CT Head,Non Enhanced -NE,No examination
7,XR Chest,Non Dedicated Unit -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination
8,XR Thoracic Spine -VC,No examination,No examination,XR Thoracic Spine -VC,No examination,No examination
9,XR Chest,Non Dedicated Unit -CH,No examination,XR Chest,Non Dedicated Unit -CH,No examination


## Inference

Let's test the model on a new report:

In [None]:
text = '''
As compared to the 
previous radiograph, there is no relevant change. The monitoring and support 
devices are constant. No evidence of pneumothorax. No other acute interval 
changes.
'''
###
# CT Head, Non Enhanced -NE
##
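# Tokenize the text and move the tensors to the model's device
encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}

# Forward pass without gradient tracking, since this is inference only
with torch.no_grad():
    outputs = trainer.model(**encoding)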

The logits that come out of the model have shape `(batch_size, num_labels)`. Because we forward only a single text through the model, `batch_size` equals 1. The logits tensor contains the unnormalized scores for every individual label.

In [None]:
logits = outputs.logits
logits.shape

torch.Size([1, 138])

To turn them into predicted labels, we first apply a sigmoid function independently to every score, which maps each score to a number between 0 and 1. This number can be interpreted as a "probability" of how certain the model is that a given label applies to the input text.

Next, we apply a threshold (typically 0.5) to turn every probability into either a 1 (we predict the label for the given example) or a 0 (we don't predict it).

In [None]:
# apply sigmoid + threshold
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(logits.squeeze().cpu())
predictions = np.zeros(probs.shape)
predictions[np.where(probs >= 0.5)] = 1
# turn predicted ids into actual label names
predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]
print(predicted_labels)

['XR Chest', ' Non Dedicated Unit -CH']
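
The sigmoid and threshold steps can also be written as a small helper that works for any batch size. The following is a minimal sketch in plain Python, written for readability rather than speed; the function name `logits_to_labels`, the toy label map, and the toy logits are hypothetical and not taken from the model above.

```python
import math


def sigmoid(score):
    """Logistic function: maps a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def logits_to_labels(batch_logits, id2label, threshold=0.5):
    """Convert logits of shape (batch_size, num_labels) into lists of label names."""
    batch_labels = []
    for row in batch_logits:
        probs = [sigmoid(score) for score in row]
        # keep every label whose probability reaches the threshold
        batch_labels.append(
            [id2label[idx] for idx, prob in enumerate(probs) if prob >= threshold]
        )
    return batch_labels


# Toy inputs (hypothetical): 2 examples, 3 labels
toy_id2label = {0: "XR Chest", 1: "CT Head", 2: "XR Abdomen"}
toy_logits = [[2.0, -1.0, 1.5], [-3.0, 0.5, -0.2]]

toy_labels = logits_to_labels(toy_logits, toy_id2label)
print(toy_labels)
```

Because each label is thresholded independently, an example can receive several labels, exactly one, or none at all. Raising `threshold` trades recall for precision.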


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
