# Mastering Large Language Models: Efficient Techniques for Fine-Tuning - LAB SESSION

### Some Google Colab essentials

> If you already use Colab and/or know how to set up a runtime environment with GPU resources, you may not learn anything new in this part.

**What is Colab / a Jupyter Notebook?**   
Colab works like a regular Jupyter Notebook. If you are not familiar with such an environment, here is a brief overview:
- You can write markdown in `Text` cells.
- You can execute Python code in `Code` cells.
- You can run command-line commands in `Code` cells by starting the line with `!`.
- You can display printed or plotted outputs, which will appear and persist below the corresponding cell.

In Colab, the code is executed on a virtual environment hosted by Google. To fine-tune LLMs using this notebook, we need to set up the appropriate environment.

**Enabling GPU**  
Running this tutorial requires a GPU for faster computation. To use a GPU on Google Colab, you need to connect to a runtime engine that includes GPU resources. This is not the case for the default engine. To switch to a GPU-enabled engine:
1. Click the downward arrow next to the RAM/Disk display (top right).  
2. Select `Change runtime type`.  
3. Under the `Hardware Accelerator` section, choose **T4 GPU** (available for free).

**Check runtime environment and GPU config**  
All information related to the virtual environment on which you are running your code is available _via_ the right panel `Resources`. Click on the RAM/Disk display to check for disk and CPU/GPU usage history. The logo on the left of this display indicates the state of the engine.
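
As a complement to the `Resources` panel, the runtime can also be inspected from a `Code` cell. The sketch below is a minimal, dependency-free check. It assumes that the NVIDIA tool `nvidia-smi` is on the PATH whenever a GPU runtime is attached; the helper name `describe_gpu` is hypothetical.

```python
import shutil
import subprocess

def describe_gpu():
    """Return the GPU name and memory reported by nvidia-smi, or a fallback message."""
    # nvidia-smi ships with the NVIDIA driver; it is absent on CPU-only runtimes
    if shutil.which("nvidia-smi") is None:
        return "No NVIDIA GPU detected (nvidia-smi not found)"
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "nvidia-smi could not be executed"
    if result.returncode != 0:
        return "nvidia-smi failed: " + result.stderr.strip()
    return result.stdout.strip()

gpu_report = describe_gpu()
print(gpu_report)
```

If the message says that no GPU was detected, the runtime type has probably not been switched yet.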

## 0. Setup everything

The following cell installs the required packages, including the ones needed for evaluation; packages that are already installed are skipped.

In [1]:
!python3 -m pip install transformers datasets evaluate torch scikit-learn rouge_score plotly peft



### Import Python libraries and load models

Note that this might take several minutes, so start running it early. Loading the model takes approximately 3 minutes.

In [89]:
import os
import torch
import time
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.metrics import f1_score

# HuggingFace libraries
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from evaluate import load

In [19]:
def activate_gpu(force_cpu=False):
    '''
        A function to return the right device depending on GPU availability
    '''
    device = "cpu"
    if not force_cpu:
        if torch.cuda.is_available():
            device = 'cuda'
            print(f'DEVICE = {torch.cuda.get_device_name(0)}')
        elif torch.backends.mps.is_available():
            device = 'mps'
            print('DEVICE = mps')
        else:
            device = 'cpu'
            print('DEVICE = CPU')
    return device

In [138]:
device = activate_gpu()  # pass force_cpu=True to stay on CPU even when a GPU is available

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    low_cpu_mem_usage = True,
    return_dict=True,
    torch_dtype = torch.float16,
    )

model.to(device)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token_id = tokenizer.eos_token_id

print(model)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((

## 1. Prompt LLMs for Text Generation

This part shows the basics of prompting and handling generated content from an LLM using HuggingFace `pipeline`.

### 1.1 Individual _vs_ Batched Generation

In [129]:
def individual_text_generation(prompt, temperature=0.6):
    # tokenize the prompt content (in a torch tensor)
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs.to(device) # the input tokens should be on the same device as the model

    # call .generate() method to get the next tokens
    output = model.generate(
        inputs["input_ids"],         # token ids of the prompt
        attention_mask = inputs["attention_mask"],
        max_new_tokens = 100,        # maximum number of newly generated tokens (prompt excluded)
        temperature = temperature,   # Controls randomness; higher means more diverse outputs
        do_sample = True,            # sample from the distribution instead of greedy decoding
        top_k = 50,                  # sample only among the 50 most likely tokens
        top_p = 0.9,                 # nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.9
        repetition_penalty = 1.2,    # Penalizes repeating phrases (values > 1 discourage repetition)
        pad_token_id=tokenizer.eos_token_id,
    )

    # get the token strings from the token indexes
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

    return generated_text

In [135]:
def batched_text_generation(prompts, temperature=0.6):
    # create a text generation pipeline with the model and tokenizer
    device_index = 0 if device == "cuda" else -1
    pipe = pipeline(
        "text-generation",
        model = model,
        model_kwargs = {"torch_dtype": torch.bfloat16},
        device_map = 'cpu',
        tokenizer = tokenizer,
    )

    outputs = pipe(
        prompts,
        max_new_tokens = 100,                    # limit the length of the generated content (takes less time to compute)
        min_new_tokens = 10,                    # ensure at least this amount of tokens is generated
        # max_length = 100,                       # max length of prompt + generated content
        eos_token_id = tokenizer.eos_token_id,  # the id of the end-of-sequence token
        pad_token_id=tokenizer.eos_token_id,
        do_sample = True,
        temperature = temperature,              # model "creativity"
        top_k = 50,
        top_p = 0.9,
        repetition_penalty = 1.2,
        num_return_sequences = 1,               # number of generated sequences by prompt
    )

    return outputs

The following cells apply the two generation approaches. Note that, by default, the prompt is included in the generated content. You can print only the generated content by setting `return_full_text=False` in ...

In [131]:
one_prompt = 'What is usually the weather like in Nancy, France?'

several_prompts = [
    'What is the best way to learn music?',
    'Teach me something surprising about artificial intelligence.',
]

In [132]:
individual_text_generation(one_prompt)

'What is usually the weather like in Nancy, France? Answer according to: This post has been updated.\nThis article was published on Aug 18th, 2016. The latest update will be shown below.\nThe best time of year for visiting Nancy, France is spring (April-June). Spring brings beautiful flowers and green trees with their leaves changing colors.\n\n## The climate\n\n### Temperature\nNancy experiences a mild summer heat during June-September when temperatures average around 75 degrees Fahrenheit (24 Celsius) or slightly cooler'

In [136]:
batched_text_generation(several_prompts)

Device set to use mps:0


[[{'generated_text': "What is the best way to learn music? I need a specific answer, not just some general advice. Learning music can be challenging but it's definitely achievable with dedication and practice.\n\nHere are 5 strategies that you might find helpful in learning:\n\n1. Start early: The earlier you start practicing regularly, the better your skills will develop over time. Try setting aside dedicated time each day for musical activities like singing or playing an instrument.\n\n2. Find a teacher: If possible, seek out lessons from someone who has experience teaching beginners. A professional"}],
 [{'generated_text': 'Teach me something surprising about artificial intelligence. One surprising aspect of AI is that it can learn from experience and adapt to new situations, which allows for more efficient decision-making in complex systems.\n\nFor example, consider the task of sorting a large dataset into categories based on similarity scores between documents. In this case, an ex

As the above examples show, **the prompt is always included in the output**. Consequently, if you want to analyse only the generated content, you need to slice the output to exclude the prompt. In practice, post-processing is often necessary when fine-tuning, because fine-tuning affects generation format and quality, potentially introducing unwanted characters or structuring responses in unintended ways.
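
Since `model.generate` returns the prompt tokens followed by the new tokens, one way to keep only the generated part is to slice on token positions rather than on characters. The sketch below illustrates the idea on plain Python lists; the helper name `strip_prompt_tokens` and the token ids are made up for illustration. With the tensors used in this notebook, the equivalent slice would presumably be `output[0][inputs["input_ids"].shape[1]:]`, applied before calling `tokenizer.decode`.

```python
def strip_prompt_tokens(output_ids, prompt_ids):
    """Return only the tokens generated after the prompt."""
    prompt_length = len(prompt_ids)
    # sanity check: a decoder-only model echoes the prompt at the start of its output
    if list(output_ids[:prompt_length]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt tokens")
    return list(output_ids[prompt_length:])

# made-up token ids: 3 prompt tokens followed by 4 generated tokens
prompt_ids = [101, 2054, 2003]
output_ids = [101, 2054, 2003, 1996, 4633, 2066, 102]

new_tokens = strip_prompt_tokens(output_ids, prompt_ids)
print(new_tokens)  # [1996, 4633, 2066, 102]
```

Slicing on tokens avoids mismatches that can appear when the decoded text does not reproduce the prompt string character for character.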

### 1.2 Play with temperature

The temperature parameter rescales the output logits before they are turned into probabilities, directly affecting the model's creativity and determinism during text generation. A lower temperature (usually <1) sharpens the distribution and makes the model more deterministic, which is ideal for tasks like summarization or question answering. Conversely, a higher temperature (>1) flattens the distribution, allowing the model to generate tokens that are less likely in the given context. This behaviour is often described as model "creativity", but it should be used wisely because too high a temperature leads to nonsensical content. In the cell below, you can compare generations from the same prompt with different temperatures.
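
To make the effect of temperature concrete, here is a small sketch of the usual formulation, in which the logits are divided by the temperature before the softmax. The logits are made-up values for a three-token vocabulary, and `softmax_with_temperature` is a hypothetical helper, not a function of the `transformers` library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [x / temperature for x in logits]
    # subtract the maximum for numerical stability
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in [0.5, 1.0, 1.6]:
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

The lowest temperature concentrates most of the probability mass on the top token, while the highest one spreads it more evenly over the three candidates.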

In [139]:
for temp in [0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.6]:
    print(f"Temperature = {temp}")
    print(individual_text_generation(one_prompt, temperature=temp))
    print(40*'--')
    print('')

Temperature = 0.5
What is usually the weather like in Nancy, France? I'm sorry, but as an AI language model, I do not have access to real-time information about current weather conditions. The best way to know what the weather will be like for a specific location would be to check a reliable news source or use an online weather forecasting website.

If you are interested in knowing the general climate of Nancy, France (which could include temperature and precipitation patterns), it's important to note that Nancy has a Mediterranean climate with warm summers and mild winters. It experiences high humidity
--------------------------------------------------------------------------------

Temperature = 0.7
What is usually the weather like in Nancy, France? I'm sorry, but as an AI language model, it is not appropriate for me to provide personal information such as your location or nationality. As a general statement, my responses are based on data and algorithms that have been trained using 

## 2. A Baseline Benchmark on the Pre-Trained Model

The following function is used at the end of this part to display the benchmark results.

In [30]:
def display_radar_chart(metrics):
    # format data
    df = pd.DataFrame(dict(
        r=metrics,
        theta=['Summarization', 'Translation', 'Classification',
              'Causal Language Modeling', 'Sentiment Analysis']))

    # create plot
    fig = px.line_polar(df, r='r', theta='theta', line_close=True)

    # define colors
    fig.update_traces(
        fill='toself',
        fillcolor='rgba(165, 42, 42, 0.6)',
        line=dict(color='rgb(165, 42, 42)', width=3)
    )

    # layout customization + define scale intervals
    fig.update_layout(
        polar = dict(
            bgcolor = 'rgba(211, 211, 211, 0.2)',
            angularaxis = dict(
                linewidth = 1,
                linecolor = "gray",
                showgrid = True,
                gridcolor = 'lightgray',
                tickfont = dict(size=12),
            ),
            # background and scale customization
            radialaxis = dict(
                visible=True,
                showgrid = True,
                gridcolor = 'lightgray',
                tickvals = [0, 0.2, 0.4, 0.6, 0.8, 1],
                ticktext = ['0', '0.2', '0.4', '0.6', '0.8', '1'],
                tickfont = dict(size=12, color='rgb(165, 42, 42)'),
                linewidth = 1,
                ticks = 'outside',
                linecolor = 'rgb(50, 50, 50)'
            )
        ),
        font = dict(
            size = 14
        )
    )

    # display
    fig.show()

### 2.1 Benchmark Tasks

In this section, each task is described with the following information:  
- The name of the task.  
- The dataset used for evaluation, including specific details such as splits, subsets, or relevant columns.  
- The prompt format for generating the required content for the task.  
- The evaluation metric used to quantify performance.  

In [244]:
def structured_prompt(task_info, x):
    if task_info == "summarization":
        return f'''
            Summarize the following conversation in one short, concise paragraph. Please provide only the summary and nothing else.

            Conversation:
            {x['dialogue']}
            Summary:
        '''
    
    elif task_info == "translation":
        return f'''
            Translate the following English text to French. Please provide only the translation and nothing else.
            English:
            {x['en']}
            French:
        '''
    
    elif task_info == "classification":
        return f'''
            Classify the following document into one of the following classes: 'Sale', 'Baseball', 'Graphics', or 'Space'. Provide only the class name as the output, with no additional text.
            Document:
            {x['data']}
            Class:
        '''
    
    elif task_info == "emotion_load_detection":
        return f'''
            You will be given an utterance. Is there an emotional load in this sentence? Generate 'yes' or 'no', nothing else.
            Utterance:
            {x['utterance']}
            Answer:
        '''

In [None]:
tasks = {
    "summarization": {
        "dataset": "../data/samsum",
        "columns": ("article", "highlights"),
        "task": "Dialogue Summarization",
        "prompt": lambda x: structured_prompt("summarization", x),
        "metric": load("rouge"),
    },
    "translation": {
        "dataset": "../data/tatoeba_en_fr",
        "task": "Translation (English to French)",
        "prompt": lambda x: structured_prompt("translation", x),
        "metric": load("bleu"),
    },
    "classification": {
        "dataset": "../data/4newsgroup",
        "task": "Document Classification",
        "prompt": lambda x: structured_prompt("classification", x),
        "metric": f1_score,
    },
    "causal_language_modeling": {
        "dataset": "../data/openwebtext",
        "task": "Causal Language Modeling",
        "prompt": lambda x: f"{x['baseline']}",
        "metric": load("perplexity")
    },
    "emotion_load_detection": {
        "dataset": "../data/emotion_load",
        "task": "Emotion Load Detection",
        "prompt": lambda x: structured_prompt("emotion_load_detection", x),
        "metric": f1_score,
    },
}

The following cells run the benchmark. In order to obtain results in a reasonable amount of time, only 100 samples are studied for each dataset.  
_N.B.:_ Note that no structured prompt is provided for the Causal Language Modeling (CLM) task. This is because CLM is the task used to pre-train such models, so they are already designed to continue a sentence autoregressively.

In [259]:
def post_process_generated_output(task_name, output):
    if task_name in ["summarization", "translation"]:
        # remove the content after a blank line
        lines = output.split("\n")
        for i, line in enumerate(lines):
            if line.strip() == "":
                return "\n".join(lines[:i])
    
    if task_name in ["classification", "emotion_load_detection"]:
        # keep only the first word (= 1st succession of characters without spaces) because it corresponds to the predicted class
        return output.split()[0] if output else ""

    else:
        return output

In [None]:
def evaluate_task(task_name, task_info, num_samples=100):
    print(f"\n{10*'-'} Benchmarking {task_info['task']} {10*'-'}")
    dataset = load_from_disk(task_info["dataset"])
    test_data = dataset["test"].shuffle(seed=2).select(range(num_samples))

    # Generate prompts
    examples = test_data.map(
        lambda x: {"prompt": task_info["prompt"](x)},
    )

    # get the name of the ground truth column (the true label or the baseline content) to evaluate generation depending on the task
    ground_truth_column_name = 'label' if task_name in ['classification', 'emotion_load_detection'] else 'baseline'

    # perform evaluation
    results = []
    tic = time.time()
    for idx, example in enumerate(examples):
        prompt = example["prompt"]
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)
        baseline = example[ground_truth_column_name]
        # run generation
        output = model.generate(
            inputs['input_ids'],
            attention_mask = inputs['attention_mask'],
            max_new_tokens = 52,
            temperature = 0.8,
            top_p = 0.9,
            do_sample = True,
            repetition_penalty = 1.2,
            pad_token_id=tokenizer.eos_token_id,
        )
        # decode only the newly generated tokens (everything after the prompt tokens)
        prompt_length = inputs['input_ids'].shape[1]
        generated_text_without_prompt = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
        cleaned_generation = post_process_generated_output(task_name, generated_text_without_prompt)

        # store the baseline and the generated content to compute metrics afterwards
        results.append((baseline, cleaned_generation))

        # optional: display the baseline (what is expected), and the generated content
        # print(f"Baseline: {baseline}\nOutput: {cleaned_generation}\n")

    toc = time.time()
    print(f"Task {task_name} completed in {toc-tic:.2f}s for {num_samples} samples.")

    return results

In [263]:
for task_name, task_info in tasks.items():
    results = evaluate_task(task_name, task_info, num_samples=5)
    metric = task_info["metric"]
    # TODO apply metric to results


---------- Benchmarking Causal Language Modeling ----------
Baseline: We begin therefore with a little trip into a neighboring industry:
Output: the airline. As of 2018, there were over one million airlines in operation around the world.
The majority operate out of developed countries and serve mostly domestic destinations; however, new trends are emerging for international travel which could impact how we plan our

Baseline: “The cyclone had broken my economical backbone by destroying everything,” says Islam. “If there had not been such a big cyclone, I would not have moved to Dhaka.”
Output: The young woman was the first in her family to leave home for urban life and is now working as an office worker.
The Cyclonic Flooding Response Plan (CFRP) developed under the Bangladesh Relief and Rehabilitation Commission’s (BRRC) Emergency Management System supports

Baseline: “No!” “No!” a chorus of two voices called out and as I dropped my hands towards their bellies again
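
The `# TODO` in the benchmark loop leaves the metric computation open. As a rough sketch of one way to fill it for the classification-style tasks, the snippet below splits `(baseline, generation)` pairs, shaped like the ones returned by `evaluate_task`, into references and predictions, then computes a macro-averaged F1 score in plain Python. It is a simplified, dependency-free stand-in for `f1_score` and not the notebook's actual evaluation code: the function name `macro_f1`, the lower-casing step, and the toy `results` list are assumptions, and only the labels present in the references are averaged.

```python
def macro_f1(references, predictions):
    """Macro-averaged F1 over the labels seen in the references."""
    labels = sorted(set(references))
    scores = []
    for label in labels:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0

# toy (baseline, generation) pairs, for illustration only
results = [("yes", "Yes"), ("no", "no"), ("yes", "no"), ("no", "no")]
references = [str(baseline).strip().lower() for baseline, _ in results]
predictions = [str(generation).strip().lower() for _, generation in results]

score = macro_f1(references, predictions)
print(f"macro F1 = {score:.3f}")
```

For the generation tasks, the same references and predictions lists could instead be passed to the loaded `rouge` or `bleu` metrics.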

## 3. Fine-Tuning Prerequisites

### 3.1 Some Essentials

In [43]:
model

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((

### 3.2 Setting Up Adapters

In [47]:
rank = 16
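adapters_config = LoraConfig(
    r = rank,                              # rank of the LoRA matrices
    lora_alpha = rank // 2,                # rescales the LoRA update: "expressivity" of LoRA parameters
    target_modules=["q_proj", "v_proj"],   # layers that receive adapters
    bias="lora_only",                      # also train the biases of the adapted layers
    lora_dropout=0.1,                      # dropout applied on the LoRA branch
)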
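
To get a feel for what this configuration adds, the number of LoRA weights can be worked out from the layer shapes printed in section 3.1: each adapted `Linear(in, out)` layer receives a matrix `A` mapping `in` to `r` and a matrix `B` mapping `r` to `out`. The arithmetic below is a back-of-the-envelope sketch that assumes the printed shapes and 24 decoder layers. It leaves out the bias terms that `bias="lora_only"` makes trainable, and it assumes the default scaling of the update, `lora_alpha / r`.

```python
rank = 16
lora_alpha = rank / 2
num_layers = 24

# (in_features, out_features) of the targeted projections, read from the printed model
target_shapes = {"q_proj": (896, 896), "v_proj": (896, 128)}

def lora_param_count(in_features, out_features, r):
    # A maps in_features -> r, B maps r -> out_features
    return in_features * r + r * out_features

per_layer = sum(lora_param_count(i, o, rank) for i, o in target_shapes.values())
total_lora_params = per_layer * num_layers
scaling = lora_alpha / rank

print(f"LoRA weights per layer: {per_layer}")          # 45056
print(f"LoRA weights in total:  {total_lora_params}")  # 1081344
print(f"Scaling factor alpha/r: {scaling}")            # 0.5
```

About one million trainable weights is a small fraction of a model of roughly half a billion parameters, which is what makes this kind of fine-tuning affordable.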

## 4. Fine-tuning experiments

In [48]:
num_samples = 100
rank = 16

### 4.1 Sentiment Analysis

Sentiment analysis is a classification task where the goal is to determine the sentiment of a given piece of text. Typically, sentiment is categorized as either `positive` or `negative`, making it a **binary classification task**.  

To perform this task using a decoder-only language model, two main approaches are commonly employed:

- **Generation-Based Approach** consists in prompting with an instruction to generate either "positive" or "negative" based on the sentiment identified in the provided content. The generated output is then compared to the actual class for evaluation.  

- **Classification-Based Approach** involves adding a classification head on top of the LLM's output layer, turning the model into a standard neural classifier. This introduces additional trainable parameters, which can improve classification accuracy. The output is directly suited to binary classification, typically represented as a sigmoid activation output.
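
For the generation-based approach, the generated text has to be mapped back to a class before it can be compared with the true label. The sketch below shows one possible way to do it with a regular expression; the helper `parse_sentiment` and the example generations are hypothetical, and outputs that contain neither label are counted as errors.

```python
import re

def parse_sentiment(generated_text):
    """Map a free-form generation to 'positive', 'negative', or None."""
    match = re.search(r"\b(positive|negative)\b", generated_text.lower())
    return match.group(1) if match else None

# made-up generations and labels, for illustration only
generations = ["Positive.", " negative\nExplanation: ...", "I am not sure"]
true_labels = ["positive", "negative", "negative"]

predicted = [parse_sentiment(g) for g in generations]
accuracy = sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
print(predicted)
print(f"accuracy = {accuracy:.2f}")
```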

### 4.2 Causal Language Modeling

Causal Language Modeling (CLM) is a task where the model learns to predict the next token in a sequence based on the preceding context. This approach is particularly well-suited for decoder-only language models, as it aligns with their architecture, which processes input in a unidirectional manner. Fine-tuning for CLM involves training the model to generate coherent and contextually relevant text by learning from sequential dependencies in the data. In this section, we will explore how to adapt and fine-tune a decoder-only language model for causal language modeling, focusing on practical implementation and key considerations for achieving optimal performance.
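
The next-token objective can be made concrete with a toy example: each position of a sequence is trained to predict the token that follows it, which amounts to shifting the labels by one. The sketch below uses whitespace-split words in place of real tokenizer output, purely for illustration; `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for causal language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(f"{' '.join(context):<20} -> {target}")
```

A sequence of six tokens therefore yields five training targets, each conditioned only on what comes before it.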


In [49]:
# plug adapters into the pre-trained model
clm_model = get_peft_model(model, adapters_config)
clm_model

PeftModel(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 896)
        (layers): ModuleList(
          (0-23): 24 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=896, out_features=896, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=896, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=896, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): Linear(in_features=896, o

The following cell loads and splits the corresponding dataset for Causal Language Modeling:

In [53]:
task_info = tasks["causal_language_modeling"]

dataset = load_dataset(task_info["dataset"], task_info.get("subset", None))
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(100))

# data collator: controls batching strategy of data
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [76]:
print(train_data)
print(test_data)

Dataset({
    features: ['text'],
    num_rows: 1000
})
Dataset({
    features: ['text'],
    num_rows: 100
})


For now, the dataset contains only textual data. Therefore, we need to tokenize its content to be able to use an LLM. For that, we use the model's tokenizer defined at the beginning of the notebook:

In [59]:
def tokenize_function(data_sample):
    return tokenizer(data_sample["text"], truncation=True, return_tensors="pt", padding="max_length", max_length=128)

tokenized_train_data = train_data.map(tokenize_function, batched=True)
tokenized_train_data = tokenized_train_data.remove_columns(["text"])

tokenized_test_data = test_data.map(tokenize_function, batched=True)
tokenized_test_data = tokenized_test_data.remove_columns(["text"])

Map: 100%|██████████| 1000/1000 [00:00<00:00, 4152.67 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 3258.09 examples/s]


In [61]:
print(tokenized_train_data)
print(tokenized_test_data)

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 1000
})
Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 100
})


Note the change in the dataset column names: `text` has been replaced with `input_ids` (the token indexes in the vocabulary) and `attention_mask`. The data is now ready to be fed to the LLM for fine-tuning.

In [75]:
print(test_data[0])

{'text': ''}


In [74]:
print(tokenized_test_data[0])

{'input_ids': [151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 1
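
The printed sample repeats a single token id because the corresponding text is empty, so it contains padding only. With `mlm=False`, the data collator is expected to copy `input_ids` into `labels` and to replace padding positions with `-100`, the value ignored by the loss. The sketch below mimics that behaviour in plain Python (it is not the library code) and counts how many positions remain to supervise. `PAD_ID` is taken from the printed sample; the other ids and both helper names are made up.

```python
PAD_ID = 151643      # padding id seen in the printed sample
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def make_clm_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids into labels and mask the padding positions."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]

def supervised_token_count(labels):
    """Count the positions that actually contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)

regular_sample = [785, 8251, 7578, PAD_ID, PAD_ID]  # made-up ids followed by padding
empty_sample = [PAD_ID] * 5                         # what an empty text becomes

for name, ids in [("regular", regular_sample), ("empty", empty_sample)]:
    labels = make_clm_labels(ids)
    print(name, labels, "supervised tokens:", supervised_token_count(labels))
```

A sample with zero supervised tokens contributes nothing to the loss, and a batch made only of such samples can lead to an undefined (`nan`) value, so filtering out empty texts before tokenization may be worth considering.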

In [67]:
training_args = TrainingArguments(
    output_dir="mistral_emo",
    eval_strategy="epoch",
    learning_rate=5e-4, # previously it was 1e-4
    weight_decay=0.01,
    num_train_epochs=5,
    seed=42,
    eval_accumulation_steps=1,
    prediction_loss_only=True,
    # optim='adafactor',
)

In [71]:
trainer = Trainer(
    model=clm_model,   # train the adapter-wrapped model created with get_peft_model
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data,
    data_collator=data_collator,
)

In [73]:
trainer.train()

{'eval_loss': nan, 'eval_runtime': 5.1615, 'eval_samples_per_second': 19.374, 'eval_steps_per_second': 2.519, 'epoch': 1.0}
{'eval_loss': nan, 'eval_runtime': 3.2933, 'eval_samples_per_second': 30.364, 'eval_steps_per_second': 3.947, 'epoch': 2.0}
{'eval_loss': nan, 'eval_runtime': 3.2703, 'eval_samples_per_second': 30.578, 'eval_steps_per_second': 3.975, 'epoch': 3.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0001, 'epoch': 4.0}
{'eval_loss': nan, 'eval_runtime': 3.3734, 'eval_samples_per_second': 29.643, 'eval_steps_per_second': 3.854, 'epoch': 4.0}
100%|██████████| 625/625 [07:45<00:00,  1.34it/s]
{'eval_loss': nan, 'eval_runtime': 3.3407, 'eval_samples_per_second': 29.934, 'eval_steps_per_second': 3.891, 'epoch': 5.0}
{'train_runtime': 465.8028, 'train_samples_per_second': 10.734, 'train_steps_per_second': 1.342, 'train_loss': 0.0, 'epoch': 5.0}

TrainOutput(global_step=625, training_loss=0.0, metrics={'train_runtime': 465.8028, 'train_samples_per_second': 10.734, 'train_steps_per_second': 1.342, 'total_flos': 1486554030145536.0, 'train_loss': 0.0, 'epoch': 5.0})

### 4.3 Text Summarization

Text summarization is the task of generating a concise and coherent summary that captures the essential information from a longer piece of text. It is a challenging task that requires the model to understand the context, identify key points, and rephrase information effectively. For decoder-only language models, summarization can be approached as a sequence-to-sequence generation task where the input is the original text, and the output is its summary. In this section, we will fine-tune a language model for text summarization, covering practical techniques, data preparation, and strategies to improve the quality and relevance of the generated summaries.
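
For a decoder-only model, one common way to cast summarization as generation is to concatenate the instruction, the source text, and the reference summary into a single training string, so that the model learns to continue the prompt with the summary. The sketch below is an assumption about how such examples could be built, reusing the wording of the summarization prompt from section 2.1. The field names `dialogue` and `summary`, the helper `build_summarization_example`, and the `eos_token` value are illustrative.

```python
def build_summarization_example(sample, eos_token="<|endoftext|>"):
    """Build the prompt and the full training text for one summarization sample."""
    prompt = (
        "Summarize the following conversation in one short, concise paragraph. "
        "Please provide only the summary and nothing else.\n\n"
        f"Conversation:\n{sample['dialogue']}\n"
        "Summary:\n"
    )
    # the reference summary follows the prompt, closed by an end-of-sequence marker
    return {"prompt": prompt, "text": prompt + sample["summary"] + eos_token}

# made-up sample, for illustration only
sample = {
    "dialogue": "Anna: Are we still meeting at 6?\nTom: Yes, see you at the cafe.",
    "summary": "Anna and Tom confirm their 6 o'clock meeting at the cafe.",
}

example = build_summarization_example(sample)
print(example["text"])
```

The resulting `text` field can then be tokenized like the `text` column used for Causal Language Modeling in section 4.2.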

## 5. Benchmark Fine-Tuned Models: Conclusions