# DeBERTa

DeBERTa is a pre-trained language model proposed by Microsoft in 2020. It improves on the BERT and RoBERTa models, developed by Google and Facebook respectively.

[DeBERTa documentation](https://huggingface.co/docs/transformers/model_doc/deberta)

In [None]:
!pip install transformers torch -q

In [None]:
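from transformers import AutoTokenizer, DebertaForQuestionAnswering
import torch

In [None]:
model_names = {
    "base": "microsoft/deberta-base",
    "large": "microsoft/deberta-large",
    "xlarge": "microsoft/deberta-xlarge",
    "xxlarge": "microsoft/deberta-xxlarge",
}

In [None]:
# Initialize tokenizer and model.
# These are DeBERTa (v1) checkpoints, so they are loaded with DebertaForQuestionAnswering
# rather than the DebertaV2 class. The QA head is newly initialized at this point, so
# predictions are not meaningful until the model has been fine-tuned.
tokenizer = AutoTokenizer.from_pretrained(model_names["base"])
model = DebertaForQuestionAnswering.from_pretrained(model_names["base"])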

In [None]:
contexts = [
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ('Norman' comes from 'Norseman') raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
    "The further decline of Byzantine state-of-affairs paved the road to a third attack in 1185, when a large Norman army invaded Dyrrachium, owing to the betrayal of high Byzantine officials. Some time later, Dyrrachium—one of the most important naval bases of the Adriatic—fell again to Byzantine hands.",
    "Many locals and tourists frequent the southern California coast for its popular beaches, and the desert city of Palm Springs is popular for its resort feel and nearby open spaces."
]

questions = [
    "In what country is Normandy located?",
    "When did the Normans attack Dyrrachium?",
    "Which region of California is Palm Springs located in?"
]

In [None]:
# Process each question-context pair
for question, text in zip(questions, contexts):
    inputs = tokenizer(question, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Find the start and end of the answer in the input sequence
    answer_start_index = outputs.start_logits.argmax()  # Get the most likely beginning of answer
    answer_end_index = outputs.end_logits.argmax()  # Get the most likely end of answer

    # Convert tokens to the answer string
    answer_tokens = inputs.input_ids[0, answer_start_index:answer_end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)  # Use skip_special_tokens to clean up the answer

    print(f"Question: {question}")
    print(f"Predicted answer: {answer}\n")
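
Taking the argmax of the start and end logits independently, as the loop above does, can yield an end index that comes before the start index, which decodes to an empty string. A minimal sketch of a joint search over valid spans follows; the function name `best_span` and the `max_answer_len` limit are illustrative assumptions, not part of the original notebook or of the `transformers` API.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end.

    start_logits / end_logits are plain lists of floats, one score per token.
    max_answer_len is a hypothetical cap on the span length, in tokens.
    """
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy logits: independent argmax would give start=3 and end=1 (an empty slice).
start_logits = [0.1, 0.2, 1.5, 4.0, 0.3]
end_logits = [0.0, 3.0, 0.5, 2.0, 2.5]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

With a real model output, the two lists could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.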

# DeBERTa Fine-Tuning

[Reference notebook for fine-tuning on question answering](https://github.com/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)

In [None]:
! pip install datasets accelerate -q

In [None]:
squad_v2 = True  # The SQuAD v2 dataset (with unanswerable questions) is loaded below
model_checkpoint = "microsoft/deberta-large"
epochs_num = 3
batch_size = 64
max_length = 384  # The maximum length of a feature (question and context)
doc_stride = 128  # The allowed overlap between two parts of the context when splitting is performed.
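
To make the roles of `max_length` and `doc_stride` concrete, here is a simplified sketch of how a long context is cut into overlapping windows. It is an illustration only: the helper `split_with_stride` is hypothetical, and it ignores the fact that in the real tokenizer the question tokens and special tokens also count against `max_length`.

```python
def split_with_stride(tokens, window, stride):
    """Split tokens into windows of at most `window` items,
    where consecutive windows share `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks


# Toy context of 10 "tokens", window of 4, overlap of 2.
context_tokens = list(range(10))
chunks = split_with_stride(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last `stride` tokens of the previous one, so an answer that straddles a window boundary still appears whole in at least one feature, provided it is not longer than the overlap.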

In [None]:
from datasets import load_dataset
from google.colab import drive
from transformers import DebertaForQuestionAnswering, TrainingArguments, Trainer, AutoTokenizer

In [None]:
datasets = load_dataset("squad_v2")

In [None]:
# from transformers import DebertaForQuestionAnswering, AutoTokenizer

# model_path = "/content/drive/MyDrive/MyModel/" + model_checkpoint

# # Load the model
# model = DebertaForQuestionAnswering.from_pretrained(model_path)

# # Load the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This can result
    # in one example yielding several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # Truncate only the context, never the question
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
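
The core of the labelling step above is the conversion from a character span to a token span using the offset mapping. The sketch below isolates that logic on a toy example; the function name `char_span_to_token_span` and the hand-written offsets are assumptions for illustration and do not come from a real tokenizer.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to an inclusive (start_token, end_token) pair.

    offsets holds one (char_start, char_end) tuple per context token.
    Returns None when the answer is not fully inside this window, which is
    the case the notebook labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    tok_start = 0
    while tok_start < len(offsets) and offsets[tok_start][0] <= start_char:
        tok_start += 1
    tok_end = len(offsets) - 1
    while tok_end >= 0 and offsets[tok_end][1] >= end_char:
        tok_end -= 1
    return tok_start - 1, tok_end + 1


context = "Normandy is a region in France."
# Hypothetical word-level offsets for the sentence above.
offsets = [(0, 8), (9, 11), (12, 13), (14, 20), (21, 23), (24, 30), (30, 31)]

answer = "France"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)
```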

In [None]:
tokenized_datasets = datasets.map(
    prepare_train_features, batched=True, remove_columns=datasets["train"].column_names
)

In [None]:
# from transformers import DebertaForQuestionAnswering, AutoTokenizer

# # drive.mount('/content/drive')

# model_path = "/content/drive/MyDrive/MyModel/" + model_checkpoint

# # Load the model
# model = DebertaForQuestionAnswering.from_pretrained(model_path)

# # Load the tokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = "deberta-large-finetuned-squad-v2-ep-3"

from huggingface_hub import notebook_login

notebook_login()

In [None]:
model = DebertaForQuestionAnswering.from_pretrained(model_checkpoint)

# Warm up over 8% of the optimizer steps. One optimizer step consumes
# batch_size * gradient_accumulation_steps features (single-device assumption),
# and warmup_steps must be an integer.
gradient_accumulation_steps = 4
total_steps = len(tokenized_datasets["train"]) // (batch_size * gradient_accumulation_steps) * epochs_num
warmup_steps = int(total_steps * 0.08)

training_args = TrainingArguments(
    output_dir=push_to_hub_model_id,
    push_to_hub=True,
    num_train_epochs=epochs_num,
    warmup_steps=warmup_steps,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    gradient_accumulation_steps=gradient_accumulation_steps,  # Accumulate gradients to simulate a larger batch size
    per_device_train_batch_size=batch_size,  # Lower this to save GPU memory
    per_device_eval_batch_size=batch_size,  # Lower this to save GPU memory
    fp16=True,  # Enable if your GPU supports FP16, for faster and less memory-hungry training
    gradient_checkpointing=True,  # Enable gradient checkpointing to save memory
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

trainer.train()
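
The relationship between batch size, gradient accumulation and the number of optimizer steps can be checked with a little arithmetic. The sketch below is an approximation under stated assumptions: a single device, a hypothetical feature count of 131,800 (the notebook does not print the exact number), and floor division of batches by the accumulation factor, which may differ slightly between `transformers` versions.

```python
import math


def optimizer_steps_per_epoch(num_features, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Approximate number of optimizer updates in one epoch."""
    batches = math.ceil(num_features / (per_device_batch_size * num_devices))
    return max(batches // grad_accum_steps, 1)


num_features = 131_800  # hypothetical; depends on max_length and doc_stride
steps = optimizer_steps_per_epoch(num_features, per_device_batch_size=64,
                                  grad_accum_steps=4)
effective_batch = 64 * 4          # features consumed per optimizer step
warmup = int(steps * 3 * 0.08)    # 8% of the steps of a 3-epoch run

print(steps, effective_batch, warmup)
```

Under these assumptions one epoch corresponds to 515 optimizer steps, which is in line with the `global_step=515` reported at `epoch: 1.0` in the training output.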

Some weights of DebertaForQuestionAnswering were not initialized from the model checkpoint at microsoft/deberta-large and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Step,Training Loss
100,2.6928
200,0.8358
300,0.7137
400,0.6758
500,0.6256




TrainOutput(global_step=515, training_loss=1.094858652874104, metrics={'train_runtime': 4350.3655, 'train_samples_per_second': 30.296, 'train_steps_per_second': 0.118, 'total_flos': 1.0708513180267622e+17, 'train_loss': 1.094858652874104, 'epoch': 1.0})

In [None]:
trainer.evaluate()

{'eval_loss': 0.678312361240387,
 'eval_runtime': 143.6207,
 'eval_samples_per_second': 75.115,
 'eval_steps_per_second': 1.177,
 'epoch': 1.0}

In [None]:
contexts = [
    "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ('Norman' comes from 'Norseman') raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
    "The further decline of Byzantine state-of-affairs paved the road to a third attack in 1185, when a large Norman army invaded Dyrrachium, owing to the betrayal of high Byzantine officials. Some time later, Dyrrachium—one of the most important naval bases of the Adriatic—fell again to Byzantine hands.",
    "Many locals and tourists frequent the southern California coast for its popular beaches, and the desert city of Palm Springs is popular for its resort feel and nearby open spaces."
]

questions = [
    "In what country is Normandy located?",
    "When did the Normans attack Dyrrachium?",
    "Which region of California is Palm Springs located in?"
]

In [None]:
import torch

model.eval()

for question, context in zip(questions, contexts):
    # Tokenize the input question-context pair
    inputs = tokenizer(question, context, return_tensors='pt', max_length=512, truncation=True, padding='max_length')

    # Send inputs to the same device as your model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        # Forward pass, get model outputs
        outputs = model(**inputs)

    # Extract the start and end positions of the answer in the tokens
    answer_start_scores, answer_end_scores = outputs.start_logits, outputs.end_logits
    answer_start_index = torch.argmax(answer_start_scores)  # Most likely start of answer
    answer_end_index = torch.argmax(answer_end_scores) + 1  # Most likely end of answer; +1 for inclusive slicing

    # Convert token indices to the actual answer text
    answer_tokens = inputs['input_ids'][0, answer_start_index:answer_end_index]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    print(f"Question: {question}")
    print(f"Predicted answer: {answer}\n")
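
Because the training features label unanswerable questions with the CLS token index, a model fine-tuned this way can also signal "no answer" at inference time. The following is a minimal sketch of one common way to use that signal: compare the score of the CLS position with the best non-empty span. The function `pick_answer`, the `null_threshold` parameter and its default value are illustrative assumptions, not something the notebook or the library defines.

```python
def pick_answer(start_logits, end_logits, cls_index=0,
                null_threshold=0.0, max_answer_len=30):
    """Return (start, end) of the best span, or None if "no answer" wins.

    The null score is the sum of the start and end logits at the CLS position.
    null_threshold is a hypothetical margin that would need tuning on a
    validation set.
    """
    null_score = start_logits[cls_index] + end_logits[cls_index]
    best_span, best_score = None, float("-inf")
    for s in range(len(start_logits)):
        if s == cls_index:
            continue
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            if e == cls_index:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_span, best_score = (s, e), score
    if best_span is None or null_score - best_score > null_threshold:
        return None
    return best_span


# Toy logits where the CLS position (index 0) dominates: no answer.
unanswerable = pick_answer([5.0, 0.1, 0.2], [5.0, 0.3, 0.1])
# Toy logits where a real span clearly wins.
answerable = pick_answer([0.5, 3.0, 0.2], [0.5, 0.1, 4.0])
print(unanswerable, answerable)
```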

Question: In what country is Normandy located?
Predicted answer:  France

Question: When did the Normans attack Dyrrachium?
Predicted answer:  1185

Question: Which region of California is Palm Springs located in?
Predicted answer:  desert



In [None]:
# drive.mount('/content/drive')

# trainer.save_model("/content/drive/MyDrive/MyModel/" + model_checkpoint)  # Save the model to the given path
# tokenizer.save_pretrained("/content/drive/MyDrive/MyModel" + model_checkpoint)  # Save the tokenizer; note the missing "/" after MyModel, so it lands in a different directory (MyModelmicrosoft/...)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


('/content/drive/MyDrive/MyModelmicrosoft/deberta-large/tokenizer_config.json',
 '/content/drive/MyDrive/MyModelmicrosoft/deberta-large/special_tokens_map.json',
 '/content/drive/MyDrive/MyModelmicrosoft/deberta-large/vocab.json',
 '/content/drive/MyDrive/MyModelmicrosoft/deberta-large/merges.txt',
 '/content/drive/MyDrive/MyModelmicrosoft/deberta-large/added_tokens.json',
 '/content/drive/MyDrive/MyModelmicrosoft/deberta-large/tokenizer.json')

# Deploy the Fine-Tuned DeBERTa to Hugging Face

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/Samuel-DD/deberta-large-finetuned-squad-v2/commit/c5bfdf006db3b7725f1512729a51150d393d73ff', commit_message='End of training', commit_description='', oid='c5bfdf006db3b7725f1512729a51150d393d73ff', pr_url=None, pr_revision=None, pr_num=None)