<a href="https://colab.research.google.com/github/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/7_1_knowledge_distillation_Llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
    <h1>Large Language Models Projects</h1>
    <h3>Apply and Implement Strategies for Large Language Models</h3>
    <h2>Knowledge Distillation Llama 3.2.</h2>
    <h3>KD is used to extract knowledge from a superior model.</h3>
</div>

by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

_______
Models: meta-llama/Llama-3.2-1B

Colab Environment: GPU A100.

Keys:
* Knowledge Distillation
* Pruning

_______
**Disclaimer: The pruning and knowledge distillation sections were created after the first edition of the book was published. They are not part of the book’s original content; they are intended to supplement and expand on the topics it covers.**

This is the unofficial repository for the book:
        <a href="https://amzn.to/4eanT1g"> <b>Large Language Models:</b> Apply and Implement Strategies for Large Language Models</a> (Apress).
        The book is based on the content of this repository, but the notebooks are being updated, and I am incorporating new examples and chapters.
        If you are looking for the official repository for the book, with the original notebooks, you should visit the
        <a href="https://github.com/Apress/Large-Language-Models-Projects">Apress repository</a>, where you can find all the notebooks in their original format as they appear in the book.

______
# INTRODUCTION
This notebook is part of the pruning section in the Large Language Models course. We will use Knowledge Distillation to restore the reasoning ability lost by a pruned model.



Starting with the Llama-3.2-1B model, three pruned versions were created with different pruning percentages.

In this notebook, we will work with the model that had 40% of the neurons in its MLP layers removed through pruning. This reduction resulted in a smaller model but led to a noticeable performance loss in some areas, as observed in the various benchmarks conducted.

One of the most affected benchmarks was Lambada, a test in which the model is asked to guess the last word of a paragraph. It is a demanding task that tests the model’s capability in complex language modeling.

To address this, we will employ Knowledge Distillation, a technique where the pruned model (the "student") learns from the larger, unpruned model (the "teacher"). By transferring the teacher’s knowledge, we aim to help the pruned model regain its lost capabilities and improve its performance on tasks like Lambada.

* Teacher Model: [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
* Student Model: [oopere/pruned40-llama-3.2-1B](https://huggingface.co/oopere/pruned40-llama-3.2-1B)

___________

* Previous notebook: [6_3_pruning_structured_llama3.2-1b_OK](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb)

* Article Explaining the pruning process: [How to Prune LLaMA 3.2 and Similar Large Language Models](https://medium.com/towards-data-science/how-to-prune-llama-3-2-and-similar-large-language-models-cf18e9a2afb6?sk=af4c5e40e967437325050f019b3ae606)

* Paper: [Exploring GLU expansion ratios: Structured pruning in Llama-3.2 models](https://doi.org/10.31219/osf.io/qgxea)
______

# Install libraries & Configure variables.

In [None]:
!pip install -q transformers==4.47.1
!pip install -q datasets==3.2.0
!pip install -q torch==2.5.1
!pip install -q lm-eval==0.4.7

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from datasets import load_dataset
from torch.nn import functional as F
from torch.utils.data import DataLoader

# Download the Models.
The teacher model will be the same model used as the base to create the pruned model we are going to train.

* Teacher model: "meta-llama/Llama-3.2-1B"
* Student Model: "oopere/pruned40-llama-3.2-1B"

We could have chosen any other larger model, like Llama-3.2-3B, but since both models need to fit into memory and this is an example notebook running on Google Colab, I decided not to use a larger model.

There are also scenarios where the teacher model must be the same model used to create the pruned version. Imagine you have a model that works perfectly and has been trained with proprietary company data, thus containing specific knowledge. In this case, if the goal is to replicate the behavior of this model in a smaller one, it wouldn’t make sense to use a larger model that hasn’t been trained on the same data.

In [None]:
# Load teacher and student models and their tokenizers
teacher_model_name = "meta-llama/Llama-3.2-1B"
student_model_name = "oopere/pruned40-llama-3.2-1B"

In [None]:
# Initialize tokenizer - we can use the same tokenizer for both models since they're both Llama-based
tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
tokenizer.pad_token = tokenizer.eos_token



In [None]:
# Load models
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name)
student_model = AutoModelForCausalLM.from_pretrained(student_model_name)


# Load the DATA


The dataset to be used will largely depend on the results we aim to achieve through the Knowledge Distillation process.

During the pruning process, the model inevitably lost some capabilities, as expected, which was reflected in the benchmarks. For more details, I recommend reading the paper: [Exploring GLU expansion ratios: Structured pruning in Llama-3.2 models](https://doi.org/10.31219/osf.io/qgxea).

One of the benchmarks that showed the most degradation was *Lambada*, both in its standard and OpenAI versions. This benchmark evaluates the model's ability to predict the last word of a text. However, these are not simple texts; the model must pay close attention since the last word needs to be inferred by considering the entire story, requiring understanding of broader context, coherence, and fluency.

I have decided to use only a small portion of the `ptb_text_only` dataset. It may not be the best dataset for improving performance on a benchmark like *Lambada*, but constraints on both time and memory led me to choose this dataset for the example. Other options, likely more suitable, could include the Lambada dataset or BookCorpus, among others.


In [None]:
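# Data Loading
# The ptb_text_only repository ships a loading script, so trust_remote_code=True
# is passed to skip the interactive confirmation prompt.
dataset = load_dataset("ptb_text_only", "penn_treebank", split="train", trust_remote_code=True)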
# Take a subset for faster training/testing
original_dataset = dataset
dataset = dataset.select(range(1000))


The `tokenize_function` is where the real preprocessing magic happens. This function transforms raw text into a format that our models can understand. Let's break down its components:

The function expects a dictionary input with a `sentence` key that contains the text to be tokenized.
The text is processed with the tokenizer loaded earlier, using the following parameters:

* `padding`="max_length": Ensures all sequences have the same length by adding padding tokens
* `truncation`=True: Cuts off sequences that are too long
* `max_length`=128: Sets the maximum sequence length, suitable for Llama models
* `return_tensors`="pt": Returns PyTorch tensors instead of lists

Then the function prepares output:
* Creates input_ids: The numerical representation of our tokens
* Creates labels: In this case, identical to input_ids (clone) for language modeling
* Returns attention masks to indicate which tokens are padding vs. real content

In knowledge distillation, we're trying to transfer knowledge from a larger teacher model to a smaller student model. The quality of this process heavily depends on how we prepare our data. The careful padding and truncation ensure that:

* All sequences are properly formatted for both teacher and student models
* The attention masks help models focus on relevant parts of the input
* The consistent sequence length (128 tokens) optimizes memory usage while maintaining enough context for learning

A key detail to note is that we're setting up for language modeling specifically, which is why our labels are identical to our inputs. In language modeling, the task is to predict the next token given the previous tokens, so each input sequence serves as its own target.
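As a minimal sketch of what "each input sequence serves as its own target" means, the snippet below pairs every token with the token that follows it and marks padded positions so that a loss function could skip them. The helper names and the toy token ids are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, and Hugging Face causal LM models perform the one-position shift internally when they receive `labels`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def next_token_pairs(token_ids):
    """Pair each token with the token the model should predict after it."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Toy sequence: three real tokens followed by two padding tokens (hypothetical ids).
input_ids = [11, 22, 33, 0, 0]
attention_mask = [1, 1, 1, 0, 0]

pairs = next_token_pairs(input_ids[:3])
labels = mask_padding_labels(input_ids, attention_mask)
print(pairs)   # [(11, 22), (22, 33)]
print(labels)  # [11, 22, 33, -100, -100]
```

Because this notebook sets the pad token equal to the EOS token, the attention mask, rather than the token id, is the reliable way to tell padding from real content.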

In [None]:
# Create a tokenization function
def tokenize_function(examples):
    # Tokenize with padding and truncation
    tokenized = tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,  # Adjusted for Llama models
        return_tensors="pt"
    )

    # Create input_ids and labels for language modeling
    input_ids = tokenized["input_ids"]
    labels = input_ids.clone()

    return {
        "input_ids": input_ids,
        "attention_mask": tokenized["attention_mask"],
        "labels": labels
    }

Time to use the `tokenize_function` to tokenize the Dataset. This code uses the datasets `map` function, which is specially designed for processing large datasets. Let's understand each parameter:

* tokenize_function: The previously defined function that tokenizes and formats data.
* batched=True: Processes multiple examples in batches for efficiency.
* batch_size=32: Specifies the size of each batch during mapping. A smaller batch size ensures compatibility with memory constraints.
* remove_columns=dataset.column_names: Removes original columns after tokenization to avoid redundancy and reduce memory usage.
* num_proc=4: Enables parallel processing with four processes, speeding up the operation on large datasets.
* desc="Processing examples": Displays a description in the progress bar for better clarity.
* load_from_cache_file=False: Disables caching to ensure fresh processing of the dataset, which is helpful during debugging.

The `tokenized_datasets` object contains the preprocessed data with input_ids, attention_mask, and labels. These fields are now ready for use in model training.

In [None]:
# Process the dataset with progress bar
print("Tokenizing dataset...")
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=32,  # Smaller batch size for mapping
    remove_columns=dataset.column_names,
    num_proc=4,  # Parallel processing with four processes
    desc="Processing examples",
    load_from_cache_file=False  # Disable caching for debugging
)

Tokenizing dataset...



In [None]:
# Convert to PyTorch format so the DataLoader yields tensors instead of Python lists
tokenized_datasets.set_format("torch")

This is where we set up how data will be fed into our models during training. The DataLoader is a PyTorch utility that efficiently handles batching and iteration over our dataset.
* `batch_size=4`: Specifies the number of samples per batch. A smaller batch size is used here due to the memory constraints of large models like Llama.
* `shuffle=True`: Randomizes the order of data samples in each epoch. This improves the model’s generalization by reducing the likelihood of learning spurious patterns from data order.

Such a small batch size (4) may make training slower, but it ensures that both models fit in memory. On an A100 GPU you may be able to use a `batch_size` of 6.

In [None]:
# Create DataLoader
dataloader = DataLoader(
    tokenized_datasets,
    batch_size=4,  # Reduced batch size due to model size
    shuffle=True
)

# Knowledge Distillation.

Start by moving the models to a CUDA device (GPU), if one is available.

In [None]:
# Move models to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
teacher_model.to(device)
student_model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=4916, bias=False)
          (up_proj): Linear(in_features=2048, out_features=4916, bias=False)
          (down_proj): Linear(in_features=4916, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

The next line is crucial for knowledge distillation. It puts the teacher model into evaluation mode, which:

* Disables dropout layers
* Freezes batch normalization statistics (a general effect of `eval()`; Llama uses RMSNorm, which keeps no running statistics)
* Ensures consistent outputs for the same inputs

Why It's Important for Knowledge Distillation:

* The teacher model should provide stable, consistent predictions to guide the student
* We're not training the teacher model anymore - it's only being used to generate "soft targets"
* Any randomness (like dropout) would make the knowledge transfer less reliable
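To make the idea of "soft targets" concrete, here is a minimal, pure-Python sketch of a softmax with a temperature parameter. The logits and the 3-token vocabulary are hypothetical, chosen only for illustration; the point is that dividing the logits by a temperature above 1 spreads probability mass onto the less likely tokens, which gives the student more information than a near one-hot distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, after dividing each by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 2.0, 0.0]  # hypothetical logits for a 3-token vocabulary
hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=2.0)

print([round(p, 3) for p in hard])  # [0.867, 0.117, 0.016]
print([round(p, 3) for p in soft])  # [0.665, 0.245, 0.09]
```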

In [None]:
# Set teacher model to evaluation mode
teacher_model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):
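
The two model summaries printed in this notebook differ only in the width of the MLP projections: 4916 in the student and 8192 in the teacher. As a rough sanity check, the sketch below counts the MLP weights implied by those printed shapes (16 layers, hidden size 2048, three bias-free projections per layer). The function name is hypothetical, and the count covers the MLP blocks only, not attention or embeddings.

```python
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
TEACHER_INTERMEDIATE = 8192
STUDENT_INTERMEDIATE = 4916

def mlp_parameters(hidden_size, intermediate_size, num_layers):
    """Weights in gate_proj, up_proj and down_proj across all layers (no biases)."""
    per_layer = 3 * hidden_size * intermediate_size
    return per_layer * num_layers

teacher_mlp = mlp_parameters(HIDDEN_SIZE, TEACHER_INTERMEDIATE, NUM_LAYERS)
student_mlp = mlp_parameters(HIDDEN_SIZE, STUDENT_INTERMEDIATE, NUM_LAYERS)
removed_fraction = 1 - student_mlp / teacher_mlp

print(f"Teacher MLP weights: {teacher_mlp:,}")          # 805,306,368
print(f"Student MLP weights: {student_mlp:,}")          # 483,262,464
print(f"Removed fraction:    {removed_fraction:.1%}")   # 40.0%
```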

## Optimizer & Training Loop.
The training process in knowledge distillation involves transferring knowledge from a larger teacher model to a smaller student model, with the idea that the student model mimics the behaviour of the teacher model.

The optimizer updates the student model's parameters to minimize the loss, improving its ability to replicate the teacher's outputs. AdamW is the de facto standard optimizer for transformer-based models.

Hyperparameters' Role:

* `temperature`: Controls how "soft" the teacher's predictions are. A higher temperature (2.0 here) smooths out the probability distributions.
* `alpha`: Balances the importance of matching the teacher's predictions versus the ground truth. It is set to 1 here, and the training loop below uses only the soft (distillation) loss, so the variable is never actually referenced.
* `accumulation_steps`: Allows for larger effective batch sizes without increasing memory usage.
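
For readers who want `alpha` to actually weigh two losses, the following is a sketch of one common formulation, not the loss used in this notebook's training loop. It combines a per-token KL divergence on temperature-scaled logits (rescaled by T², as proposed by Hinton et al.) with a next-token cross-entropy on the labels, and it skips padded positions using the attention mask. The function name `distillation_loss`, the default `alpha=0.5`, and the toy tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, attention_mask,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) loss and a hard (label) loss."""
    # Soft loss: KL(teacher || student) per token on temperature-scaled logits.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    kl_per_token = F.kl_div(
        student_log_probs, teacher_log_probs, reduction="none", log_target=True
    ).sum(dim=-1)  # shape: [batch, seq_len]

    # Average over real tokens only, then rescale by T^2.
    mask = attention_mask.to(kl_per_token.dtype)
    soft_loss = (kl_per_token * mask).sum() / mask.sum().clamp(min=1)
    soft_loss = soft_loss * temperature ** 2

    # Hard loss: next-token cross-entropy, ignoring padded targets.
    shift_logits = student_logits[:, :-1, :]
    shift_labels = labels[:, 1:].clone()
    shift_labels[attention_mask[:, 1:] == 0] = -100
    hard_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy tensors standing in for real model outputs (hypothetical shapes).
torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 11
student_logits = torch.randn(batch, seq_len, vocab)
teacher_logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
attention_mask[1, 3:] = 0  # last two positions of the second sequence are padding

loss = distillation_loss(student_logits, teacher_logits, labels, attention_mask)
print(f"combined loss: {loss.item():.4f}")
```

With `alpha=1.0` the hard loss drops out and only the teacher-matching term remains, which is the spirit of the configuration used in this notebook.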



In [None]:
# Define optimizer for student model
optimizer = AdamW(student_model.parameters(), lr=1e-5)  # Reduced learning rate for Llama

# Training loop
num_epochs = 10
temperature = 2.0  # Increased temperature for Llama
alpha = 1  # Weight for soft loss

accumulation_steps = 8  # Gradient accumulation for larger effective batch size




In [None]:

for epoch in range(num_epochs):
    ### 1 - Model Preparation ###
    #initializes each training epoch,
    #putting the student model in training mode
    student_model.train()
    #Initializes total_loss to track the cumulative loss for the epoch.
    total_loss = 0

    for batch_idx, batch in enumerate(dataloader):
        ### 2 - Data processing. ###
        #Moves the batch data to the appropriate device (CPU/GPU) for processing.
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        ### 3 - Teacher Model Inference ###
        #Disables gradient computation to save memory and speed up inference.
        with torch.no_grad():
            teacher_outputs = teacher_model(
                input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True
            )
            #Applies temperature scaling to soften the teacher's predictions
            teacher_logits = teacher_outputs.logits / temperature

        ### 4 - Student Model Inference. ###
        #The student model generates logits for the same input data.
        student_outputs = student_model(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        student_logits = student_outputs.logits

        ### 5 - Compute loss ###
        # Converts logits to probabilities using softmax
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        #Computes the KL Divergence between the teacher's probabilities and the student's log probabilities.
        #KL Divergence measures how well the student's predictions match the teacher's.
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction='batchmean')
        #The loss is divided by accumulation_steps to balance gradient updates across accumulated batches.
        loss = loss / accumulation_steps

        ### 6- Backward pass ###
        loss.backward()

        ### 7 - Optimization Gradient Accumulation ###
        # Accumulates gradients over multiple batches
        # Updates model parameters when enough gradients are accumulated
        # Resets gradients after update
        if ((batch_idx + 1) % accumulation_steps == 0) or (batch_idx + 1 == len(dataloader)):
            optimizer.step()
            optimizer.zero_grad()

        ### 8 - Loss Tracking ###
        # Scales the loss back up by multiplying it with accumulation_steps to reflect the actual batch contribution.
        total_loss += loss.item() * accumulation_steps

        #if (batch_idx + 1) % 100 == 0:
        #    print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx+1}, Loss: {loss.item():.4f}")

    ### 9 - Epoch-Level Reporting
    # Computes the average loss for the epoch by dividing the total loss by the number of batches.
    avg_loss = total_loss / len(dataloader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")


Epoch 1/10, Average Loss: 27.5265
Epoch 2/10, Average Loss: 7.8998
Epoch 3/10, Average Loss: 6.0441
Epoch 4/10, Average Loss: 5.1618
Epoch 5/10, Average Loss: 4.6350
Epoch 6/10, Average Loss: 4.3303
Epoch 7/10, Average Loss: 4.0552
Epoch 8/10, Average Loss: 3.8007
Epoch 9/10, Average Loss: 3.6275
Epoch 10/10, Average Loss: 3.4588
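
The schedule of optimizer updates in the loop above can be checked with a little arithmetic. This sketch uses the values set in this notebook (1000 examples, `batch_size=4`, `accumulation_steps=8`); the variable names are only illustrative.

```python
import math

num_examples = 1000       # dataset.select(range(1000))
batch_size = 4            # DataLoader batch size
accumulation_steps = 8    # batches accumulated per optimizer step

batches_per_epoch = math.ceil(num_examples / batch_size)
optimizer_steps = math.ceil(batches_per_epoch / accumulation_steps)
effective_batch = batch_size * accumulation_steps
leftover_batches = batches_per_epoch % accumulation_steps

print(batches_per_epoch)  # 250
print(optimizer_steps)    # 32
print(effective_batch)    # 32
print(leftover_batches)   # 2
```

Since 250 is not a multiple of 8, the last update of each epoch accumulates only 2 batches. Each batch loss is still divided by `accumulation_steps`, so that final update is computed from a proportionally smaller gradient.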


# Store the Model.
At the end of the training loop we have a model that we can store and even upload to Hugging Face.

In [None]:
student_model_name = "pruned_distilgpt2_kd_gem"

In [None]:
# Save the fine-tuned student model
student_model.save_pretrained(student_model_name)
tokenizer.save_pretrained(student_model_name)

('pruned_distilgpt2_kd_gem/tokenizer_config.json',
 'pruned_distilgpt2_kd_gem/special_tokens_map.json',
 'pruned_distilgpt2_kd_gem/tokenizer.json')

In [None]:
student_model.push_to_hub(student_model_name,
                  private=False,
                  use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2_kd_gem/commit/040512093d2fa0e228f747ddb5121047922c96d9', commit_message='Upload LlamaForCausalLM', commit_description='', oid='040512093d2fa0e228f747ddb5121047922c96d9', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/pruned_distilgpt2_kd_gem', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/pruned_distilgpt2_kd_gem'), pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub(student_model_name,
                      private=False,
                      use_temp_dir=False)


CommitInfo(commit_url='https://huggingface.co/oopere/pruned_distilgpt2_kd_gem/commit/6c8f0fd1253fdc2e09758261484d3c7c1d4b06ea', commit_message='Upload tokenizer', commit_description='', oid='6c8f0fd1253fdc2e09758261484d3c7c1d4b06ea', pr_url=None, repo_url=RepoUrl('https://huggingface.co/oopere/pruned_distilgpt2_kd_gem', endpoint='https://huggingface.co', repo_type='model', repo_id='oopere/pruned_distilgpt2_kd_gem'), pr_revision=None, pr_num=None)

# Evaluating the model
Once the model is ready, we can evaluate it with the lm-eval library and check whether KD has had any positive influence.

In [None]:
!pip install -q lm-eval
from lm_eval import evaluator, tasks, models

In [None]:
def evaluate_hf_model(model_name, tasks=['arc_easy'], num_fewshot=0):
    """
    Calls the evaluator to evaluate a model available on Hugging Face.

    Args:
    - model_name: The model name on Hugging Face.
    - tasks: Tasks to evaluate.
    - num_fewshot: Number of few-shot examples.

    Returns:
    - metrics.
    """
    model_args = f"pretrained={model_name},device=cuda"

    results = evaluator.simple_evaluate(
      model="hf",
      model_args=model_args,
      tasks=tasks,
      num_fewshot=num_fewshot,  # Number of few-shot samples.
      limit=None,  # Use all the samples in the evaluation dataset.
      bootstrap_iters=10
    )

    metrics = results.get('results', {})
    return metrics

In [None]:
# Select tasks to evaluate.
tasks = ['lambada']

In [None]:
metrics_pruned_kd = evaluate_hf_model("oopere/pruned_distilgpt2_kd_gem", tasks=tasks)

INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'oopere/pruned_distilgpt2_kd_gem', 'device': 'cuda'}
INFO:lm-eval:Using device 'cuda'



In [None]:
metrics_pruned_kd

{'lambada_openai': {'alias': 'lambada_openai',
  'perplexity,none': 51.62386404069125,
  'perplexity_stderr,none': 2.791449257565601,
  'acc,none': 0.3021540849990297,
  'acc_stderr,none': 0.00639743788967891},
 'lambada_standard': {'alias': 'lambada_standard',
  'perplexity,none': 81.67148592401875,
  'perplexity_stderr,none': 3.9495010691528614,
  'acc,none': 0.25383271880457986,
  'acc_stderr,none': 0.006063229044159063}}

# Conclusions.
We can obtain the results of the pruned model from the previous notebook:

* Lambada-OpenAI: 0.299
* Lambada Standard: 0.248.

The results after pruning + KD are:
* Lambada-OpenAI: 0.302
* Lambada Standard: 0.253.

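This is a slight improvement, but one that suggests the KD process works. Both gains fall within roughly one standard error of the accuracies reported above (about 0.006), so they should be read as indicative rather than conclusive. Keep in mind that only a small portion of a very small dataset was used, and that dataset is not ideal for improving the Lambada benchmark; even so, the model's results improved.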
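
As a quick way to put these numbers in perspective, the sketch below compares each accuracy gain with the standard error that lm-eval reported for the distilled model. The baseline values are the rounded figures listed above, so the comparison is approximate; the helper name `accuracy_gains` is hypothetical.

```python
# Accuracies of the pruned model (rounded, from the previous notebook).
baseline = {"lambada_openai": 0.299, "lambada_standard": 0.248}

# Accuracy and standard error of the pruned + KD model, as printed by lm-eval.
distilled = {
    "lambada_openai": {"acc": 0.3021540849990297, "stderr": 0.00639743788967891},
    "lambada_standard": {"acc": 0.25383271880457986, "stderr": 0.006063229044159063},
}

def accuracy_gains(baseline, distilled):
    """Return {task: (absolute gain, gain expressed in standard errors)}."""
    rows = {}
    for task, base_acc in baseline.items():
        gain = distilled[task]["acc"] - base_acc
        rows[task] = (gain, gain / distilled[task]["stderr"])
    return rows

gains = accuracy_gains(baseline, distilled)
for task, (gain, ratio) in gains.items():
    print(f"{task}: gain={gain:+.4f} ({ratio:.2f} standard errors)")
```

Both gains come out below one standard error, which is why a larger and better-suited dataset would be needed to draw firmer conclusions.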

## When to use KD versus other forms of fine-tuning?
If you have followed the entire [Large Language Models course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/tree/main), or at least the part dedicated to [fine-tuning models](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/tree/main), you will have seen that there are efficient ways to introduce knowledge into a model, such as LoRA and QLoRA. They serve a different purpose than KD.

KD helps us imitate a model:  we might have a model that has already been fine-tuned with our data and gone through a Pruning process. To recover the lost capacity, the best approach is to perform a KD process from the base model.

If we just want to include general information, we could use LoRA or QLoRA to improve the model's performance, and we would benefit from the reduction in trainable weights that these two techniques bring.

# Author's Note.

In addition to creating content like this notebook and offering it under the MIT license, I have also contributed to repositories such as those of Hugging Face and Google Gemini.

I am especially proud of my book: [Large Language Models: Apply and Implement Strategies for Large Language Models (Apress)](https://amzn.to/3DSepLb).

You can find it on both [Amazon](https://amzn.to/3DSepLb) and [Springer](https://link.springer.com/book/10.1007/979-8-8688-0515-8), where they often have good deals on the purchase price.

If you take a look and end up purchasing it, keep in mind that you can reach out with any questions via the Discussions section of this same repository or on any of my social media channels. I’ll do my best to respond as quickly as possible.