# 1.  Loading a dataset for fine-tuning

<h3>Purpose of the dataset</h3>
MS MARCO is designed to evaluate and improve machine reading comprehension systems. It is a large-scale dataset of questions asked by Bing users, with answers generated from real documents. It is often used for:

<li>Information retrieval (IR).</li>
<li>Reading comprehension (RC).</li>
<li>Question answering (Q/A).</li>
<li>Developing and fine-tuning language models such as GPT-2, BERT, T5, etc.</li>

In [2]:
!pip install datasets



In [3]:
from datasets import load_dataset

# Load MS MARCO
dataset = load_dataset("ms_marco", "v2.1")


In [4]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101093
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 808731
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101092
    })
})

In [5]:
# Display one example
example = dataset['train'][0]
print(example)

{'answers': ['The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'], 'passages': {'is_selected': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'passage_text': ['The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.', 'Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this projec
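As the printed example suggests, each record stores its candidate passages as two parallel lists, `passage_text` and `is_selected`. The sketch below shows one way to pull out only the flagged passages. The `selected_passages` helper and the toy record are made up for illustration; they only mirror the keys seen above.

```python
def selected_passages(record):
    """Return the passages whose is_selected flag equals 1."""
    passages = record["passages"]
    return [
        text
        for text, flag in zip(passages["passage_text"], passages["is_selected"])
        if flag == 1
    ]


# Toy record (hypothetical content) with the same keys as an MS MARCO row
toy_record = {
    "query": "toy question",
    "answers": ["toy answer"],
    "passages": {
        "is_selected": [1, 0, 0],
        "passage_text": ["passage A", "passage B", "passage C"],
    },
}

print(selected_passages(toy_record))  # ['passage A']
```

On a real row, the same helper could be called as `selected_passages(dataset['train'][0])`.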

In [6]:
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

In [7]:
train_dataset

Dataset({
    features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
    num_rows: 808731
})

In [8]:
# Display the first 3 rows
for i in range(3):
    print(f"answers: {train_dataset[i]['answers']}")
    print(f"passages: {train_dataset[i]['passages']}")
    print(f"query: {train_dataset[i]['query']}")
    print(f"query_id: {train_dataset[i]['query_id']}")
    print(f"query_type: {train_dataset[i]['query_type']}")
    print(f"wellFormedAnswers: {train_dataset[i]['wellFormedAnswers']}\n")

answers: ['The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.']
passages: {'is_selected': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'passage_text': ['The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.', 'Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this proje

In [9]:
# Split the training set into 4 shards and keep the first one (index 0)
train_dataset = train_dataset.shard(num_shards=4, index=0)

# Split the validation set into 4 shards and keep the first one (index 0)
validation_dataset = validation_dataset.shard(num_shards=4, index=0)

# Split the test set into 4 shards and keep the first one (index 0)
test_dataset = test_dataset.shard(num_shards=4, index=0)
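As a rough check on how many rows survive the `shard` calls, the sketch below reproduces the arithmetic in plain Python. It assumes shard `index` receives every `num_shards`-th row starting at `index`; a contiguous split would give the same count for index 0. `shard_size` is a hypothetical helper, not part of the `datasets` library.

```python
def shard_size(num_rows, num_shards, index):
    """Rows in shard `index` when rows are dealt out one at a time."""
    return len(range(index, num_rows, num_shards))


# Split sizes reported when the dataset was loaded
splits = {"train": 808731, "validation": 101093, "test": 101092}

for name, num_rows in splits.items():
    print(name, shard_size(num_rows, num_shards=4, index=0))
# train 202183
# validation 25274
# test 25273
```

Keeping one shard out of four therefore leaves roughly a quarter of each split.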

In [10]:
def preprocess_data(example):
    # Extract the context, question, and answer from the dataset columns
    context = " ".join(example['passages']['passage_text'])  # Merge the passages into a single context
    question = example['query']  # The 'query' column holds the question
    answer = example['answers'][0] if example['answers'] else "No answer available"  # First available answer, or a default text

    # Build the input and output texts
    input_text = f"Question: {question}\nContext: {context}\nAnswer:"
    output_text = answer  # The answer is used as the output text
    return {"input_text": input_text, "output_text": output_text}


# Apply the transformation to the sharded training set
train_dataset = train_dataset.map(preprocess_data, remove_columns=train_dataset.column_names)

# Apply the transformation to the sharded validation set
validation_dataset = validation_dataset.map(preprocess_data, remove_columns=validation_dataset.column_names)

# Apply the transformation to the sharded test set
test_dataset = test_dataset.map(preprocess_data, remove_columns=test_dataset.column_names)


In [11]:
print(train_dataset)
print(validation_dataset)
print(test_dataset)

Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 202183
})
Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 25274
})
Dataset({
    features: ['input_text', 'output_text'],
    num_rows: 25273
})
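To see what the two new columns contain, the preprocessing logic can be tried on a hand-made record. The sketch below repeats the body of `preprocess_data` so that it runs on its own; the toy record is invented for illustration and is not taken from MS MARCO.

```python
def preprocess_data(example):
    # Same logic as the notebook cell: join passages, build prompt and target
    context = " ".join(example["passages"]["passage_text"])
    question = example["query"]
    answer = example["answers"][0] if example["answers"] else "No answer available"
    input_text = f"Question: {question}\nContext: {context}\nAnswer:"
    return {"input_text": input_text, "output_text": answer}


# Toy record (hypothetical content) with the columns the function reads
toy = {
    "query": "what color is the sky",
    "answers": ["Blue"],
    "passages": {"passage_text": ["The sky is blue.", "Grass is green."]},
}

result = preprocess_data(toy)
print(result["input_text"])
print(result["output_text"])  # Blue

# A record without answers falls back to the default text
print(preprocess_data(dict(toy, answers=[]))["output_text"])  # No answer available
```

Note that every passage is concatenated into the context, so real prompts can be much longer than this toy one.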


# Load the Model From Hugging Face

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the pretrained model (e.g., GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


<h3>Usage example</h3>

In [13]:
import torch

# Example question and context
context = "I'm having an issue with the Microsoft Surface. The screen is flickering intermittently, and I have tried restarting it, but the problem persists."
question = "What is the product having an issue?"

# Build the input as a text-generation prompt
input_text = f"Context: {context}\nQuestion: {question}\nAnswer:"

# Tokenize the text
inputs = tokenizer(input_text, return_tensors="pt")

# Generate the answer; passing the attention mask and an explicit pad token id
# avoids the warnings generate() otherwise emits for GPT-2
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.eos_token_id)

# Decode the generated answer
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the answer
print(f"Generated answer: {generated_text}")


Generated answer: Context: I'm having an issue with the Microsoft Surface. The screen is flickering intermittently, and I have tried restarting it, but the problem persists.
Question: What is the product having an issue?
Answer: The Surface Pro 3
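The decoded text repeats the prompt because `generate` returns the prompt tokens followed by the continuation. A minimal sketch for keeping only the answer is shown below; `extract_answer` is a hypothetical helper, and the generated string is rebuilt from the output printed above rather than produced by the model.

```python
def extract_answer(generated_text, prompt):
    """Drop the echoed prompt and keep the first non-empty generated line."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    completion = completion.strip()
    return completion.splitlines()[0].strip() if completion else ""


context = (
    "I'm having an issue with the Microsoft Surface. The screen is flickering "
    "intermittently, and I have tried restarting it, but the problem persists."
)
question = "What is the product having an issue?"
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

# Stand-in for tokenizer.decode(...): prompt plus the continuation shown above
generated_text = prompt + " The Surface Pro 3"

print(extract_answer(generated_text, prompt))  # The Surface Pro 3
```

Slicing the generated token ids at the prompt length before decoding would be another way to get the same result.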


<h3>Tokenization</h3>

In [14]:
# Add a padding token if none is defined
if tokenizer.pad_token is None:
    print('Not defined!')
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Resize the model embeddings to include the new token
model.resize_token_embeddings(len(tokenizer))

Not defined!


Embedding(50258, 768)

In [15]:
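def tokenize_function(examples, max_length=512):
    # GPT-2 is a causal LM: labels are compared position by position with
    # input_ids, so prompt and answer must share a single sequence.
    prompts = tokenizer(examples["input_text"], truncation=True, max_length=max_length)["input_ids"]
    answers = tokenizer(examples["output_text"], truncation=True, max_length=max_length // 2)["input_ids"]
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for prompt_ids, answer_ids in zip(prompts, answers):
        answer_ids = answer_ids + [tokenizer.eos_token_id]
        # Shorten the prompt so that the answer always fits in max_length
        prompt_ids = prompt_ids[: max_length - len(answer_ids)]
        n_pad = max_length - len(prompt_ids) - len(answer_ids)
        batch["input_ids"].append(prompt_ids + answer_ids + [tokenizer.pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * (len(prompt_ids) + len(answer_ids)) + [0] * n_pad)
        # -100 is ignored by the loss, so only the answer tokens are supervised
        batch["labels"].append([-100] * len(prompt_ids) + answer_ids + [-100] * n_pad)
    return batch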



tokenized_train_datasets = train_dataset.map(tokenize_function, batched=True)
tokenized_val_datasets = validation_dataset.map(tokenize_function, batched=True)
tokenized_test_datasets = test_dataset.map(tokenize_function, batched=True)


In [16]:
print(tokenized_train_datasets)
print(tokenized_val_datasets)
print(tokenized_test_datasets)

Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 202183
})
Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 25274
})
Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 25273
})
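With a causal LM such as GPT-2, the loss compares `labels` with `input_ids` position by position (the model shifts them by one internally). A common layout therefore puts prompt and answer in one sequence and sets the labels of prompt and padding positions to -100. The toy sketch below shows that layout with made-up token ids; `build_causal_example`, `pad_id=0`, and `eos_id=99` are illustrative assumptions, not values from the GPT-2 tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def build_causal_example(prompt_ids, answer_ids, max_length, pad_id, eos_id):
    """Lay out prompt + answer in one sequence and mask non-answer labels."""
    answer_ids = answer_ids[: max_length - 1] + [eos_id]
    # Shorten the prompt so that the answer always fits
    prompt_ids = prompt_ids[: max_length - len(answer_ids)]
    n_real = len(prompt_ids) + len(answer_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": prompt_ids + answer_ids + [pad_id] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [IGNORE_INDEX] * n_pad,
    }


example = build_causal_example([11, 12, 13], [21, 22], max_length=8, pad_id=0, eos_id=99)
print(example["input_ids"])       # [11, 12, 13, 21, 22, 99, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
print(example["labels"])          # [-100, -100, -100, 21, 22, 99, -100, -100]
```

All three lists have the same length, which is what lets the loss line up each label with the token at the same position.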


In [17]:
# Fixed-size subsets
subset_train_size = 1000  # For example, reduce to 1000 rows
subset_val_size = 500     # Reduce to 500 rows for validation
subset_test_size = 500    # Reduce to 500 rows for testing

reduced_train_dataset = tokenized_train_datasets.select(range(subset_train_size))
reduced_val_dataset = tokenized_val_datasets.select(range(subset_val_size))
reduced_test_dataset = tokenized_test_datasets.select(range(subset_test_size))

print(f"Reduced Train Dataset: {len(reduced_train_dataset)} rows")
print(f"Reduced Validation Dataset: {len(reduced_val_dataset)} rows")
print(f"Reduced Test Dataset: {len(reduced_test_dataset)} rows")

Reduced Train Dataset: 1000 rows
Reduced Validation Dataset: 500 rows
Reduced Test Dataset: 500 rows
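`select(range(n))` keeps the first `n` rows. If a random subset is preferred, one option is to draw the indices with a fixed seed so the selection stays reproducible. The sketch below is plain Python; `sample_indices` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def sample_indices(total_rows, subset_size, seed=42):
    """Draw `subset_size` distinct row indices, reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), subset_size))


# 202183 is the size of the sharded training set reported above
train_indices = sample_indices(total_rows=202183, subset_size=1000)
print(len(train_indices))       # 1000
print(len(set(train_indices)))  # 1000, every index is distinct
```

The resulting list could then be passed to `select` in place of `range(subset_train_size)`.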


In [18]:
print(reduced_train_dataset)
print(reduced_val_dataset)
print(reduced_test_dataset)

Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})
Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 500
})
Dataset({
    features: ['input_text', 'output_text', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 500
})


In [None]:
# Iterating over a Dataset yields one example at a time, not a batch
for example in tokenized_train_datasets:
    print("Input IDs length:", len(example["input_ids"]))
    print("Labels length:", len(example["labels"]))
    break

<h3>Training Arguments</h3>

In [20]:
from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
    fp16=torch.cuda.is_available(),  # Enable FP16 when a GPU is available
)
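These settings determine how long training runs. As a back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and the 1000-row training subset built earlier, the number of optimizer steps can be estimated as follows; `total_optimizer_steps` is a hypothetical helper, not a Transformers function.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=1000, batch_size=4, epochs=3)
print(steps)  # 750
```

With `evaluation_strategy="epoch"`, an evaluation pass would run after each block of 250 steps under the same assumptions.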



<h3>Trainer class</h3>

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=reduced_train_dataset,
    eval_dataset=reduced_val_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()
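After training, `trainer.evaluate()` returns a dictionary of metrics that includes an `eval_loss` entry. For a language model this loss is a mean cross-entropy, so it is often reported as perplexity. The sketch below shows the conversion; the loss value is a placeholder chosen for illustration, not a result from this notebook.

```python
import math


def perplexity(eval_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


hypothetical_eval_loss = 2.0  # placeholder value, not a measured result
print(f"{perplexity(hypothetical_eval_loss):.2f}")  # 7.39
```

In practice the argument would come from something like `trainer.evaluate()["eval_loss"]`.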
