<a href="https://colab.research.google.com/github/jlopetegui98/Creation-of-a-synthetic-dataset-for-French-NER-in-clinical-trial-texts/blob/main/Multilingual-NER-Model/french_corpus_training_multinerd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Experiments training over french corpus to compare with cross-lingual approach**

We are going to train the same model (*xml-roberta-base*) as we did for english multinerd corpus, now over the french corpus, increasing the size of the training set each time. Then we are going to compare the results obtained over the test dataset in french for each case.

In [1]:
# uncomment to install required dependencies in colab
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets
!pip install seqeval
!pip install -q -U wandb

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import load_dataset, load_metric
import torch
import accelerate
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import wandb

In [4]:
labels_vocab = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-ANIM": 7,
    "I-ANIM": 8,
    "B-BIO": 9,
    "I-BIO": 10,
    "B-CEL": 11,
    "I-CEL": 12,
    "B-DIS": 13,
    "I-DIS": 14,
    "B-EVE": 15,
    "I-EVE": 16,
    "B-FOOD": 17,
    "I-FOOD": 18,
    "B-INST": 19,
    "I-INST": 20,
    "B-MEDIA": 21,
    "I-MEDIA": 22,
    "B-MYTH": 23,
    "I-MYTH": 24,
    "B-PLANT": 25,
    "I-PLANT": 26,
    "B-TIME": 27,
    "I-TIME": 28,
    "B-VEHI": 29,
    "I-VEHI": 30,
}

label_list = list(labels_vocab.keys())
labels_vocab_reverse = {v:k for k,v in labels_vocab.items()}

In [5]:
model_name = "xlm-roberta-base"

In [6]:
dataset = load_dataset("Babelscape/multinerd")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/20 [00:00<?, ?it/s]

In [7]:
# get split of the dataset
data_train = dataset['train']
data_test = dataset['test']
data_val = dataset['validation']

In [8]:
# check the format of the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'lang'],
        num_rows: 2678400
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'lang'],
        num_rows: 334800
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'lang'],
        num_rows: 335986
    })
})

In [9]:
# now we are going to take just the french part of the dataset
# French
data_train_fr = data_train.filter(lambda example: example['lang'] == 'fr')
data_test_fr = data_test.filter(lambda example: example['lang'] == 'fr')
data_val_fr = data_val.filter(lambda example: example['lang'] == 'fr')
print(f"Distribution of French data:\nTrain: {len(data_train_fr)}\nTest: {len(data_test_fr)}\nVal: {len(data_val_fr)}")

Distribution of French data:
Train: 281760
Test: 35390
Val: 35220


In [10]:
# get xlm-roberta tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [11]:
# tokenize and align the labels in the dataset
def tokenize_and_align_labels(sentence, flag = 'I'):
    """
    Tokenize the sentence and align the labels
    inputs:
        sentence: dict, the sentence from the dataset
        flag: str, the flag to indicate how to deal with the labels for subwords
            - 'I': use the label of the first subword for all subwords but as intermediate (I-ENT)
            - 'B': use the label of the first subword for all subwords as beginning (B-ENT)
            - None: use -100 for subwords
    outputs:
        tokenized_sentence: dict, the tokenized sentence now with a field for the labels
    """
    tokenized_sentence = tokenizer(sentence['tokens'], is_split_into_words=True, truncation=True)

    labels = []
    for i, labels_s in enumerate(sentence['ner_tags']):
        word_ids = tokenized_sentence.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # if the word_idx is None, assign -100
            if word_idx is None:
                label_ids.append(-100)
            # if it is a new word, assign the corresponding label
            elif word_idx != previous_word_idx:
                label_ids.append(labels_s[word_idx])
            # if it is the same word, check the flag to assign
            else:
                if flag == 'I':
                    if label_list[labels_s[word_idx]].startswith('I'):
                      label_ids.append(labels_s[word_idx])
                    else:
                      label_ids.append(labels_s[word_idx] + 1)
                elif flag == 'B':
                    label_ids.append(labels_s[word_idx])
                elif flag == None:
                    label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_sentence['labels'] = labels
    return tokenized_sentence

In [12]:
# tokenize the dataset and align the labels
tokenized_train_fr = data_train_fr.map(tokenize_and_align_labels, batched=True)
tokenized_test_fr = data_test_fr.map(tokenize_and_align_labels, batched=True)
tokenized_val_fr = data_val_fr.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/35220 [00:00<?, ? examples/s]

In [13]:
# import the model
# model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list), label2id=labels_vocab, id2label=labels_vocab_reverse)
# print(model)

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [15]:
device

device(type='cuda')

In [16]:
# model.to(device)

In [17]:
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mjlopetegui98[0m ([33mjavier-lopetegui-gonzalez[0m). Use [1m`wandb login --relogin`[0m to force relogin


True

In [18]:
wandb.init(project = "Multilingual-NER-multinerd_french_tr")

In [19]:
args = TrainingArguments(
    report_to = 'wandb',
    run_name = "multinerd-multilingual-ner_french_training",
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=False,
    logging_steps=100,
    eval_steps=100,
    save_steps=10000,
    output_dir = "multinerd-multilingual-ner_french_training"
)

In [20]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [21]:
metric = load_metric("seqeval")

  metric = load_metric("seqeval")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [22]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [23]:
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list), label2id=labels_vocab, id2label=labels_vocab_reverse)
model.to(device)
# get 5 % of the training data in french
data_train =  tokenized_train_fr.train_test_split(test_size=0.95)['train']
print(data_train)
trainer = Trainer(
    model,
    args,
    train_dataset=data_train,
    eval_dataset=tokenized_test_fr,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
outputs_train = trainer.train()
print(outputs_train)
outputs_eval = trainer.evaluate()
print(outputs_eval)
del model
del trainer
del data_train
torch.cuda.empty_cache()

Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Dataset({
    features: ['tokens', 'ner_tags', 'lang', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14088
})


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
100,0.8522,0.266788,0.902543,0.929292,0.915722,0.945001
200,0.227,0.191619,0.936209,0.942775,0.939481,0.958392
300,0.1805,0.153811,0.94113,0.95041,0.945747,0.963969
400,0.1495,0.139461,0.937323,0.947143,0.942208,0.964206
500,0.1349,0.127956,0.941746,0.961347,0.951446,0.967682
600,0.1151,0.118365,0.94843,0.955808,0.952105,0.968926
700,0.1043,0.119919,0.947129,0.952982,0.950047,0.96622
800,0.1055,0.107856,0.952645,0.957354,0.954994,0.970399
900,0.0983,0.103902,0.952442,0.95906,0.955739,0.971682
1000,0.09,0.103816,0.953978,0.960746,0.95735,0.972052


  _warn_prf(average, modifier, msg_start, len(result))


Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
100,0.8522,0.266788,0.902543,0.929292,0.915722,0.945001
200,0.227,0.191619,0.936209,0.942775,0.939481,0.958392
300,0.1805,0.153811,0.94113,0.95041,0.945747,0.963969
400,0.1495,0.139461,0.937323,0.947143,0.942208,0.964206
500,0.1349,0.127956,0.941746,0.961347,0.951446,0.967682
600,0.1151,0.118365,0.94843,0.955808,0.952105,0.968926
700,0.1043,0.119919,0.947129,0.952982,0.950047,0.96622
800,0.1055,0.107856,0.952645,0.957354,0.954994,0.970399
900,0.0983,0.103902,0.952442,0.95906,0.955739,0.971682
1000,0.09,0.103816,0.953978,0.960746,0.95735,0.972052


TrainOutput(global_step=1762, training_loss=0.14714499337177947, metrics={'train_runtime': 4532.5802, 'train_samples_per_second': 6.216, 'train_steps_per_second': 0.389, 'total_flos': 1071519995359632.0, 'train_loss': 0.14714499337177947, 'epoch': 2.0})


{'eval_loss': 0.09033374488353729, 'eval_precision': 0.9568052685950413, 'eval_recall': 0.9638181099565237, 'eval_f1': 0.9602988861555792, 'eval_accuracy': 0.974618782339586, 'eval_runtime': 232.9246, 'eval_samples_per_second': 151.938, 'eval_steps_per_second': 18.993, 'epoch': 2.0}


In [24]:
wandb.finish()

VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▆▇▆▇▇▇▇▇▇▇████
eval/f1,▁▅▆▅▇▇▆▇▇█▇▇▇▇████
eval/loss,█▅▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁
eval/precision,▁▅▆▅▆▇▇▇▇█▇███████
eval/recall,▁▄▅▅▇▆▆▇▇▇█▇▇▇▇▇██
eval/runtime,█▂▂▂▂▂▄▂▄▄▃▂▂▁▂▁▁▂
eval/samples_per_second,▁▇▇▇▇▇▅▇▅▅▆▇▇█▇██▇
eval/steps_per_second,▁▇▇▇▇▇▅▇▅▅▆▇▇█▇██▇
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇████
train/global_step,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇████

0,1
eval/accuracy,0.97462
eval/f1,0.9603
eval/loss,0.09033
eval/precision,0.95681
eval/recall,0.96382
eval/runtime,232.9246
eval/samples_per_second,151.938
eval/steps_per_second,18.993
train/epoch,2.0
train/global_step,1762.0


In [25]:
print("Training finished")

Training finished
