<a href="https://colab.research.google.com/github/jlopetegui98/Creation-of-a-synthetic-dataset-for-French-NER-in-clinical-trial-texts/blob/main/NER-chia-dataset/model_ner_chia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Multilingual NER model trained over [Chia dataset](https://figshare.com/articles/dataset/Chia_Annotated_Datasets/11855817)**

We are going to train a BERT based multilingual language model over the Chia dataset in english and then we will use this model to create the synthetic version of the dataset in French. Our idea is supported by the experiments already done with [multiNERD](https://huggingface.co/datasets/Babelscape/multinerd) dataset for multilingual NER in English and French.

**Entities selection**

Among all the entities in the dataset, we are going to focus for this project on the most represented ones. Then, we are just going to consider those entities with more than 1000 samples in total.

In [1]:
# uncomment if working in colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U datasets
!pip install -q -U wandb
!pip install -q -U git+https://github.com/huggingface/accelerate.git

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [3]:
!pip install seqeval



In [4]:
!pip install -q -U evaluate

In [5]:
# imports
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
import os
# from preprocessing_dataset import *
import numpy as np
import pandas as pd
from datasets import Dataset, DatasetDict, load_metric
import json
from datasets.features import ClassLabel
import wandb

In [6]:
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mjlopetegui98[0m ([33mjavier-lopetegui-gonzalez[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [7]:
# dict for the entities (entity to int value)
sel_ent = {
    "O": 0,
    "B-Condition": 1,
    "I-Condition": 2,
    "B-Value": 3,
    "I-Value": 4,
    "B-Drug": 5,
    "I-Drug": 6,
    "B-Procedure": 7,
    "I-Procedure": 8,
    "B-Measurement": 9,
    "I-Measurement": 10,
    "B-Temporal": 11,
    "I-Temporal": 12,
    "B-Observation": 13,
    "I-Observation": 14,
    "B-Person": 15,
    "I-Person": 16
}
entities_list = list(sel_ent.keys())
sel_ent_inv = {v: k for k, v in sel_ent.items()}

In [8]:
# data paths
root_path = './' # comment if working on colab
root_path = './drive/MyDrive/HandsOn-NLP'
data_path = f'{root_path}/data'
chia_bio_path = f"{data_path}/chia_bio"
chia_prep_path = f"{data_path}/chia_prep"
models_path = f"{root_path}/models"

In [9]:
# preprocessing dataset to get the data in the right format for dataset entity creation
# preprocessing_dataset(chia_bio_path, output_path=chia_prep_path, labels2int = sel_ent, int2labels = sel_ent_inv)

In [10]:
# read the data after preprocessing
files = os.listdir(chia_prep_path)
files[:5]

['NCT02893293_exc.bio.json',
 'NCT03475589_exc.bio.json',
 'NCT01997580_inc.bio.json',
 'NCT01993836_exc.bio.json',
 'NCT02589691_inc.bio.json']

In [11]:
sentences = []

for file in files:
    with open(f"{chia_prep_path}/{file}", "r") as f:
        stc = json.load(f)
        sentences.extend(stc["sentences"])

In [12]:
# create the dataset
chia_eng_dataset = Dataset.from_pandas(pd.DataFrame(sentences))

In [13]:
chia_eng_dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 12423
})

In [14]:
chia_eng_train_test = chia_eng_dataset.train_test_split(test_size=0.2)
chia_eng_test_val = chia_eng_train_test["test"].train_test_split(test_size=0.5)
chia_eng_dataset = DatasetDict({
    "train": chia_eng_train_test["train"],
    "test": chia_eng_test_val["test"],
    "validation": chia_eng_test_val["train"]
})

In [15]:
chia_eng_dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 9938
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1243
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1242
    })
})

**Model Implementation**

In [16]:
model_name = 'xlm-roberta-base'

In [17]:
# get xlm-roberta tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# check the tokenizer
tokens_ = tokenizer("The AI master at Université Paris-Saclay is very good").tokens()
print(tokens_)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


['<s>', '▁The', '▁AI', '▁master', '▁at', '▁', 'Université', '▁Paris', '-', 'S', 'ac', 'lay', '▁is', '▁very', '▁good', '</s>']


In [18]:
# tokenize and align the labels in the dataset
def tokenize_and_align_labels(sentence, flag = 'I'):
    """
    Tokenize the sentence and align the labels
    inputs:
        sentence: dict, the sentence from the dataset
        flag: str, the flag to indicate how to deal with the labels for subwords
            - 'I': use the label of the first subword for all subwords but as intermediate (I-ENT)
            - 'B': use the label of the first subword for all subwords as beginning (B-ENT)
            - None: use -100 for subwords
    outputs:
        tokenized_sentence: dict, the tokenized sentence now with a field for the labels
    """
    tokenized_sentence = tokenizer(sentence['tokens'], is_split_into_words=True, truncation=True)

    labels = []
    for i, labels_s in enumerate(sentence['ner_tags']):
        word_ids = tokenized_sentence.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # if the word_idx is None, assign -100
            if word_idx is None:
                label_ids.append(-100)
            # if it is a new word, assign the corresponding label
            elif word_idx != previous_word_idx:
                label_ids.append(labels_s[word_idx])
            # if it is the same word, check the flag to assign
            else:
                if flag == 'I':
                    if entities_list[labels_s[word_idx]].startswith('I'):
                      label_ids.append(labels_s[word_idx])
                    else:
                      label_ids.append(labels_s[word_idx] + 1)
                elif flag == 'B':
                    label_ids.append(labels_s[word_idx])
                elif flag == None:
                    label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_sentence['labels'] = labels
    return tokenized_sentence

In [19]:
type(chia_eng_dataset)

datasets.dataset_dict.DatasetDict

In [20]:
# apply the function to the dataset
chia_eng_dataset = chia_eng_dataset.map(tokenize_and_align_labels, batched=True)
chia_eng_dataset

Map:   0%|          | 0/9938 [00:00<?, ? examples/s]

Map:   0%|          | 0/1243 [00:00<?, ? examples/s]

Map:   0%|          | 0/1242 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 9938
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1243
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1242
    })
})

In [21]:
# import the model
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(entities_list), label2id=sel_ent, id2label=sel_ent_inv)
print(model)

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


XLMRobertaForTokenClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bi

In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [23]:
model.to(device)

XLMRobertaForTokenClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bi

In [24]:
# define the training arguments
args = TrainingArguments(
    report_to = 'wandb',
    run_name = 'chia_multilingual_ner',
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    overwrite_output_dir = True,
    eval_steps=50,
    save_steps=1000,
    output_dir = 'chia_multilingual_ner'
)

In [25]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [26]:
metric = load_metric("seqeval")

  metric = load_metric("seqeval")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [27]:
def compute_metrics(p):
    """
    Compute the metrics for the model
    inputs:
        p: tuple, the predictions and the labels
    outputs:
        dict: the metrics
    """
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [entities_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [entities_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [28]:
# define the trainer
trainer = Trainer(
    model,
    args,
    train_dataset=chia_eng_dataset["train"],
    eval_dataset=chia_eng_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [29]:
wandb.init(project = "Multilingual-NER-Chia_dataset")

In [30]:
outputs_train = trainer.train()

Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
50,1.8394,1.351312,0.477763,0.449352,0.463122,0.634022
100,1.1527,0.985659,0.568227,0.581708,0.574888,0.717313
150,0.9782,0.837526,0.625247,0.642422,0.633718,0.763859
200,0.8848,0.764733,0.65117,0.690807,0.670403,0.781874
250,0.7616,0.701565,0.671458,0.707195,0.688864,0.797191
300,0.6984,0.701523,0.674905,0.72452,0.698833,0.800762
350,0.6982,0.648963,0.669412,0.710473,0.689331,0.803024
400,0.6257,0.642735,0.667887,0.74044,0.702295,0.807666
450,0.6403,0.606399,0.685415,0.756985,0.719424,0.81715
500,0.6201,0.59457,0.698856,0.753395,0.725101,0.820007


  _warn_prf(average, modifier, msg_start, len(result))
Checkpoint destination directory chia_multilingual_ner/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


In [31]:
print(outputs_train)

TrainOutput(global_step=1866, training_loss=0.5896146777869549, metrics={'train_runtime': 791.996, 'train_samples_per_second': 37.644, 'train_steps_per_second': 2.356, 'total_flos': 1201190963469432.0, 'train_loss': 0.5896146777869549, 'epoch': 3.0})


In [32]:
# outputs_eval = trainer.evaluate(chia_eng_dataset["test"])

In [33]:
# print(outputs_eval)

In [34]:
model.to('cpu')

XLMRobertaForTokenClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bi

In [35]:
torch.save(model, f"{models_path}/chia-multilingual-ner.pt")

In [36]:
wandb.finish()

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▆▇▇▇▇▇▇▇▇▇▇██████████████████████
eval/f1,▁▄▅▆▆▆▆▆▇▇▇▇▇▇▇█▇████████████████████
eval/loss,█▅▄▃▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
eval/precision,▁▃▅▅▆▆▆▆▆▇▆▇▇▇▇▇▇██▇█▇███████████████
eval/recall,▁▄▅▆▆▆▆▇▇▇▇▇▇▇██████▇████████████████
eval/runtime,▁▂▅▂▆▂▃▂▃▅▂▆▂▆▂▅▂▄▂▃▅▃▃▂▂▆▂▆▂▃▂▃█▂▄▂▃
eval/samples_per_second,█▇▄▆▃▇▆▇▆▃▆▃▇▃▇▃▇▄▇▆▄▆▆▇▆▃▇▃▇▆▆▆▁▆▄▇▆
eval/steps_per_second,█▇▄▆▃▇▆▇▆▃▆▃▇▃▇▃▇▄▇▆▄▆▆▇▆▃▇▃▇▆▆▆▁▆▄▇▆
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████

0,1
eval/accuracy,0.84199
eval/f1,0.77282
eval/loss,0.53065
eval/precision,0.74683
eval/recall,0.80069
eval/runtime,5.1891
eval/samples_per_second,239.349
eval/steps_per_second,30.063
train/epoch,3.0
train/global_step,1866.0
