# SmolLM: Implementation, Fine-tuning and Alignment

This work covers three main topics:

1. Defining a custom implementation of the SmolLM-135M model to do text generation.
2. Fine-tuning the SmolLM-135M model for grammatical error correction using the Grammarly CoEdIT dataset.
3. Aligning the fine-tuned model through RLAIF using Direct Preference Optimization.

---

## Model Overview

- **SmolLM-135M**: A small language model available on [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

---



## Introduction

This notebook demonstrates how to implement, fine-tune, and align the **SmolLM-135M** language model for **grammatical error correction (GEC)** using the Grammarly CoEdIT dataset. The notebook is divided into six parts:

1. **Setup**: Preparing the environment and loading necessary libraries and resources.
2. **Custom SmolLM Implementation**: Defining the custom model architecture and components.
3. **Testing the Model**: Verifying the correctness of the implementation.
4. **Supervised Fine-Tuning (SFT)**: Fine-tuning the model on the GEC task.
5. **Creating a Preference Optimization Dataset**: Generating and annotating outputs for preference learning.
6. **Direct Preference Optimization (DPO)**: Further training the model using preference optimization.

---

## 1. Setup and Configuration

First, we set up the environment: we check the available GPU, initialize Git Large File Storage (LFS), clone the repository that holds the pre-trained weights for the bare-bones SmolLM-135M implementation, and import the essential libraries.



In [27]:
!nvidia-smi

Sat Oct 19 13:49:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0              64W / 400W |  11697MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [1]:
# Install Git Large File Storage
!git lfs install

# Clone the repository containing the pre-trained model weights
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135

# Move the model weights to the current directory
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./

# List the files in the current directory to verify
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (6/6), 2.11 KiB | 2.11 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  sample_data


In [2]:
# Importing necessary libraries for model development and evaluation
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model checkpoint
checkpoint = "HuggingFaceTB/SmolLM-135M"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the reference model for comparison
__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

# Display the reference model architecture
print(__reference_model)

# Configuration class for SmolLM
class smolConfig:
    vocab_size = 49152
    hidden_size = 576
    intermediate_size = 1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads = 3
    max_batch_size = 1
    max_seq_len = 512

# Instantiate the configuration
config = smolConfig()


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm



---

## 2. Custom SmolLM Implementation

In this section, we define the custom components of the SmolLM model, including functions for rotary embeddings, multi-layer perceptrons (MLP), normalization layers, and attention mechanisms.

### 2.1 Model Components

#### **Rotary Embedding Functions**

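Rotary position embeddings (RoPE) encode position by rotating pairs of query and key dimensions through an angle proportional to the token's position. Attention scores then depend on the relative offset between tokens, and no absolute positional embedding has to be added to the input.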



In [3]:
def rotate_half(x):
    """
    Rotates the last dimension of the tensor by splitting it in half and swapping the halves.
    """
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    """
    Applies rotary positional embeddings to the query and key tensors.
    """
    cos = cos[None, None, :, :]  # Reshape for broadcasting
    sin = sin[None, None, :, :]  # Reshape for broadcasting
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
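
To see why this rotation encodes position, it can help to check numerically that the dot product of a rotated query and key depends only on the distance between their positions. The sketch below is plain Python (no PyTorch) and repeats the rotate-half arithmetic for a single head vector; `rope` and `dot` are illustrative helper names that do not appear in the notebook, and the vectors are made-up values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate one head vector by position `pos` using the rotate-half layout."""
    dim = len(vec)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        angle = pos / (base ** (2 * i / dim))  # position * inv_freq[i]
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s         # first half:  x * cos + (-x2) * sin
        out[i + half] = x2 * c + x1 * s  # second half: x * cos + x1 * sin
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # made-up query vector (head_dim = 4)
k = [1.1, 0.4, -0.6, 0.9]   # made-up key vector

near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3, offset 2
far = dot(rope(q, 105), rope(k, 103))   # positions 105 and 103, same offset
print(abs(near - far) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged up to floating-point error, which is the property that lets the attention layer work with relative positions.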

#### **Key-Value Repetition Function**

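SmolLM-135M uses grouped-query attention: 9 query heads share 3 key/value heads. This function repeats each key/value head `n_rep` times so that the key and value tensors match the number of query heads.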

In [4]:
def repeat_kv(hidden_states, n_rep):
    """
    Repeats the key and value tensors to match the number of attention heads.
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    result = hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
    return result
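
The head layout produced by the expand-and-reshape step can be illustrated without tensors. The sketch below uses plain Python lists; `repeat_kv_heads` is a hypothetical stand-in for `repeat_kv` that works on labels instead of tensors, and the head counts are the ones from `smolConfig`.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each KV head n_rep times, keeping copies of the same head adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]

num_heads, num_kv_heads = 9, 3        # values from smolConfig
n_rep = num_heads // num_kv_heads     # 3 query heads share each KV head

shared = repeat_kv_heads(["kv0", "kv1", "kv2"], n_rep)
kv_index_for_query = [h // n_rep for h in range(num_heads)]

print(shared)
print(kv_index_for_query)
```

Query head `h` therefore attends with key/value head `h // n_rep`, which is why the key and value projections only need 192 output features instead of 576.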

#### **Multi-Layer Perceptron (MLP)**

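The MLP is the gated feed-forward network inside each transformer block: a SiLU-activated gate projection is multiplied elementwise with an up projection, and the product is projected back down to the hidden size.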

In [5]:
class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size

        # Linear transformations
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)

        # Activation function
        self.act_fn = nn.SiLU()

    def forward(self, x):
        """
        Forward pass through the MLP.
        """
        gate_output = self.act_fn(self.W_gate(x))
        down_proj = self.W_down(gate_output * self.W_up(x))
        return down_proj
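
The gating arithmetic can be traced by hand on a tiny example. The following sketch is plain Python with made-up 2-to-3-to-2 weight matrices; `silu`, `matvec`, and `gated_mlp` are illustrative names, not part of the notebook.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_mlp(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # activated gate branch
    up = matvec(W_up, x)                          # linear up branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # project back to hidden size

x = [1.0, 2.0]
W_gate = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # gate pre-activations: 0, 1, 2
W_up = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]     # every up value is 3
W_down = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]

out = gated_mlp(x, W_gate, W_up, W_down)
print(out)
```

A gate pre-activation of zero switches its channel off completely, since `silu(0) = 0`, while large positive gates pass the up-projection through almost unchanged.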

#### **RMSNorm Layer**

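Root Mean Square Layer Normalization (RMSNorm) rescales each hidden-state vector by the reciprocal of its root mean square and applies a learned per-dimension weight. Unlike LayerNorm, it subtracts no mean and adds no bias.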

In [6]:
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        """
        Forward pass through the RMSNorm layer.
        """
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states / torch.sqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
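
As a quick sanity check of the formula, the same normalization can be written for a single vector in plain Python. This is a sketch with an illustrative helper name (`rms_norm`); `eps` is set to zero in the example only to make the resulting unit RMS exact.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then apply the per-dimension weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for w, v in zip(weight, x)]

y = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)
```

With unit weights the output has a root mean square of 1, and the direction of the input vector is preserved because no mean is subtracted.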

#### **Rotary Embedding Class**

Computes the cosine and sine tables for a given set of positions; `apply_rotary_pos_emb` then uses them to rotate the queries and keys.

In [7]:
class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()
        self.dim = dim
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, x, seq_len=None, position_ids=None):
        """
        Generates cosine and sine embeddings based on input positions.
        """
        if position_ids is None:
            seq_len = seq_len or x.shape[-2]
            position_ids = torch.arange(seq_len, dtype=self.inv_freq.dtype, device=x.device)
        else:
            position_ids = position_ids.to(dtype=self.inv_freq.dtype, device=x.device)
        freqs = torch.einsum('i,j->ij', position_ids, self.inv_freq)  # Compute frequencies
        emb = torch.cat([freqs, freqs], dim=-1)  # Concatenate for full dimension
        cos_emb = emb.cos()
        sin_emb = emb.sin()
        return cos_emb, sin_emb

#### **RopeAttention Class**

Implements causal self-attention with rotary position embeddings, grouped-query key/value heads, and an optional key-value (KV) cache for incremental decoding.

In [8]:
class RopeAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size // self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0

        # Linear projections for query, key, and value
        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=False)

        # Rotary embedding layer
        self.rotary_emb = RotaryEmbedder(dim=self.head_dim, base=self.rope_theta)

    def forward(self, hidden_states, past_key_value=None, use_cache=False):
        """
        Forward pass through the attention mechanism.
        """
        bsz, q_len, _ = hidden_states.size()

        # Linear projections
        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        # Reshape and transpose for multi-head attention
        q_states = q_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(bsz, q_len, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(bsz, q_len, self.kv_heads, self.head_dim).transpose(1, 2)

        # Compute position_ids considering past length
        if past_key_value is None:
            past_length = 0
        else:
            past_length = past_key_value[0].size(2)
        position_ids = torch.arange(past_length, past_length + q_len, dtype=torch.long, device=hidden_states.device)

        # Apply rotary embeddings with correct position_ids
        cos, sin = self.rotary_emb(v_states, position_ids=position_ids)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin)

        # Repeat key and value heads to match the number of query heads
        kv_groups = self.num_heads // self.kv_heads
        k_states = repeat_kv(k_states, kv_groups)
        v_states = repeat_kv(v_states, kv_groups)

        # Concatenate past key and value states if they exist
        if past_key_value is not None:
            k_states = torch.cat([past_key_value[0], k_states], dim=2)
            v_states = torch.cat([past_key_value[1], v_states], dim=2)

        # Update present key and value for KV cache
        if use_cache:
            present_key_value = (k_states, v_states)

        # Compute attention weights
        attn_weights = torch.matmul(q_states, k_states.transpose(-1, -2)) / math.sqrt(self.head_dim)

        # Create causal mask to prevent attending to future tokens
        key_len = k_states.size(2)
        causal_mask = torch.triu(
            torch.full((q_len, key_len), float('-inf'), device=hidden_states.device), diagonal=1 + past_length
        )
        attn_weights = attn_weights + causal_mask.unsqueeze(0).unsqueeze(0)

        attn_weights = F.softmax(attn_weights, dim=-1)

        # Compute attention output
        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(bsz, q_len, -1)

        attn_output = self.W_output(attn_output)

        if use_cache:
            return attn_output, present_key_value
        return attn_output
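
The effect of `diagonal=1 + past_length` on the causal mask is easiest to see on small shapes. The sketch below rebuilds the visibility pattern in plain Python, with `True` meaning "may attend"; `causal_mask` is an illustrative helper, not part of the notebook.

```python
def causal_mask(q_len, past_length):
    """Row i is the i-th new query; column j is the j-th key (cached keys first)."""
    key_len = past_length + q_len
    # torch.triu(..., diagonal=1 + past_length) masks keys with j - i >= 1 + past_length,
    # so a key stays visible exactly when j <= i + past_length.
    return [[j <= i + past_length for j in range(key_len)] for i in range(q_len)]

def show(mask):
    for row in mask:
        print("".join("x" if visible else "." for visible in row))

show(causal_mask(q_len=3, past_length=0))   # prompt pass: lower-triangular pattern
print()
show(causal_mask(q_len=1, past_length=4))   # cached decoding step: one query sees all 5 keys
```

During cached decoding the single new query sits at the end of the sequence, so every cached key is in its past and nothing is masked.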

### 2.2 Model Architecture

#### **LlamaDecoder Class**

Defines a single transformer decoder layer, consisting of self-attention and MLP components with residual connections and normalization.


In [9]:
class LlamaDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-5)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, hidden_states, past_key_value=None, use_cache=False):
        """
        Forward pass through a single decoder layer.
        """
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)

        # Self-attention layer
        attention_output = self.self_attn(
            hidden_states=hidden_states,
            past_key_value=past_key_value,
            use_cache=use_cache,
        )

        if use_cache:
            attention_output, present_key_value = attention_output
        else:
            present_key_value = None

        # Residual connection after attention
        hidden_states = attention_output + residual
        residual = hidden_states

        # MLP layer with normalization
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs

#### **smolModel Class**

Stacks multiple decoder layers to form the full transformer model.

In [10]:
class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(
            num_embeddings=config.vocab_size,
            embedding_dim=config.hidden_size
        )
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids=None, past_key_values=None, use_cache=False):
        """
        Forward pass through the entire model.
        """
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds

        if past_key_values is None:
            past_key_values = [None] * len(self.layers)

        presents = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            layer_outputs = decoder_layer(
                hidden_states,
                past_key_value=past_key_values[idx],
                use_cache=use_cache,
            )

            hidden_states = layer_outputs[0]

            if use_cache:
                presents += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        if use_cache:
            return [hidden_states, presents]
        return [hidden_states]

#### **smolLM Class**

Adds a language modeling head on top of the transformer model for generating logits over the vocabulary.

In [11]:
class smolLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.model = smolModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids, past_key_values=None, use_cache=False):
        """
        Forward pass through the language model.
        """
        outputs = self.model(
            input_ids=input_ids,
            past_key_values=past_key_values,
            use_cache=use_cache,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()

        if use_cache:
            return {'logits': logits, 'past_key_values': outputs[1]}
        return {'logits': logits}

### 2.3 Model Initialization and Loading

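We initialize the custom SmolLM model and load the pre-trained weights. Because `load_state_dict` is called with `strict=False`, any keys that are missing from or unexpected in the checkpoint do not raise an error; the language-model head is then taken directly from the reference model so both models share the same output projection.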

In [12]:
# Initialize the custom SmolLM model
__test_model = smolLM(config)

# Load the pre-trained weights into the model
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)

# Align the language model head with the reference model for consistency
__test_model.lm_head = __reference_model.lm_head
__test_model.eval()



smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=False)
          (W_key): Linear(in_features=576, out_features=192, bias=False)
          (W_value): Linear(in_features=576, out_features=192, bias=False)
          (W_output): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=False)
          (W_up): Linear(in_features=576, out_features=1536, bias=False)
          (W_down): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
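
The layer shapes in this printout are enough to estimate the model's parameter count. The sketch below does the arithmetic in plain Python. It assumes the output head shares its weight matrix with the token embedding; the notebook reuses the reference model's `lm_head`, so this is an assumption about the checkpoint rather than something the printout shows.

```python
vocab_size, hidden, intermediate = 49152, 576, 1536
layers, heads, kv_heads = 30, 9, 3
head_dim = hidden // heads                    # 64

embed = vocab_size * hidden                   # token embedding (assumed tied with lm_head)
attn = (
    hidden * heads * head_dim                 # W_query
    + 2 * hidden * kv_heads * head_dim        # W_key and W_value
    + hidden * hidden                         # W_output
)
mlp = 3 * hidden * intermediate               # W_gate, W_up, W_down
norms = 2 * hidden                            # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden   # plus the final RMSNorm
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under that assumption the total comes to roughly 134.5 million parameters, in line with the "135M" in the model name.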



---


## 3. Testing the Model

This section includes helper functions to generate text using the model and compare its outputs with the reference model. The testing ensures that the custom implementation behaves as expected.

### 3.1 Helper Functions for Generation and Evaluation

#### **Generation Function**

Generates text token by token with greedy decoding (always picking the highest-probability token), optionally reusing the KV cache between steps.

In [13]:
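@torch.no_grad()  # inference only: skip gradient tracking
def __generate(model, inputs, num_tokens, use_cache=True):
    """Greedily decode up to `num_tokens` new tokens and return them as text."""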
    collect = []
    past_key_values = None
    input_ids = inputs['input_ids']

    for _ in range(num_tokens):
        outputs = model(
            input_ids=input_ids,
            past_key_values=past_key_values,
            use_cache=use_cache
        )
        logits = outputs['logits']

        if use_cache:
            past_key_values = outputs.get('past_key_values')

        # Select the token with the highest probability
        output_id = torch.argmax(logits[0, -1]).item()
        collect.append(output_id)

        if output_id == tokenizer.eos_token_id:
            break

        if use_cache:
            # When using cache, input only the last generated token
            input_ids = torch.tensor([[output_id]], device=input_ids.device)
        else:
            # When not using cache, append the generated token to the sequence
            input_ids = torch.cat(
                [input_ids, torch.tensor([[output_id]], device=input_ids.device)], dim=-1
            )

    return tokenizer.decode(collect)
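
The benefit of the cache shows up when counting how many token positions the model has to process over a whole generation. The sketch below mirrors the two branches of the loop in plain Python; `tokens_processed` is an illustrative helper, the prompt length of 6 is a hypothetical value, and early stopping on EOS is ignored.

```python
def tokens_processed(prompt_len, num_new_tokens, use_cache):
    """Count token positions fed to the model across all decoding steps."""
    total = 0
    seq_len = prompt_len
    for step in range(num_new_tokens):
        if use_cache and step > 0:
            total += 1          # only the newest token is fed in
        else:
            total += seq_len    # the whole sequence so far is re-encoded
        seq_len += 1
    return total

with_cache = tokens_processed(prompt_len=6, num_new_tokens=50, use_cache=True)
without_cache = tokens_processed(prompt_len=6, num_new_tokens=50, use_cache=False)
print(with_cache, without_cache)
```

Without the cache the work grows quadratically with the number of generated tokens; with it, each step after the first processes a single token.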

#### **Solution Checker Function**

Compares the outputs of the custom model and the reference model.

In [14]:
def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")

    # Model A Generation (Reference Model)
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation (Reference Model)\n{'<'*30}")
    print(__generate(model_A, model_inputs, num_tokens, use_cache=False))
    print("\n\n")

    # Model B Generation without KV-Cache
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation without KV-Cache (Custom Model)\n{'<'*30}")
    print(__generate(model_B, model_inputs, num_tokens, use_cache=False))
    print("\n\n")

    # Model B Generation with KV-Cache
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation with KV-Cache (Custom Model)\n{'<'*30}")
    print(__generate(model_B, model_inputs, num_tokens, use_cache=True))

### 3.2 Testing with Sample Prompts

We perform tests using sample prompts to ensure that the custom model's outputs align with the reference model.

In [16]:
# Testing Prompts

# Single-Token Quick Test
check_solution(
    prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n10/10 would not recommend!\n\n",
    num_tokens=1,
    model_A=__reference_model,
    model_B=__test_model
)

# Multi-Token Quick Test
check_solution(
    prompt="Where is the Nile located?",
    num_tokens=50,
    model_A=__reference_model,
    model_B=__test_model
)


>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Given the following film movie by a critic, rate it out of 10. Respond in a single number.

The movie started off extremely well, but just got worse after that.
The storyline was all over the place and everyone acted terribly.
10/10 would not recommend!




>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation (Reference Model)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation without KV-Cache (Custom Model)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation with KV-Cache (Custom Model)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1

>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Where is the Nile located?


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation (Reference Model)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The Nile River is located in the Nile Delta in the Nile River Basin, which is a region of Africa. It is the longest river in the world, with a length of 4,330 miles (6,900 k

---

## 4. Fine-Tuning SmolLM for Grammatical Error Correction

In this section, we fine-tune the SmolLM-135M model using the Grammarly CoEdIT dataset to perform grammatical error correction.

### 4.1 Installing Required Libraries

We begin by installing the necessary libraries for data handling, model training, and evaluation.

In [17]:
# Install required libraries
!pip install datasets transformers evaluate trl tqdm --quiet


### 4.2 Importing Libraries and Setting Up the Environment

In [18]:
import os
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import evaluate
from tqdm import tqdm
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

# Check for GPU Availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


### 4.3 Loading and Filtering the Dataset

We load the Grammarly CoEdIT dataset (69,071 training and 1,712 validation examples) and keep only the examples whose `task` field is `gec`, which leaves 19,823 training and 485 validation samples.

In [19]:
# Load the full dataset
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_val_ds = load_dataset("grammarly/coedit", split="validation")

# Filter the dataset for GEC tasks
def filter_gec(example):
    return example['task'] == 'gec'

train_ds = full_train_ds.filter(filter_gec)
val_ds = full_val_ds.filter(filter_gec)

print(f"Number of training samples: {len(train_ds)}")
print(f"Number of validation samples: {len(val_ds)}")


Number of training samples: 19823
Number of validation samples: 485


### 4.4 Loading the Pre-trained Model and Tokenizer

In [20]:
# Load the pre-trained model and tokenizer
model_name = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'

### 4.5 Defining Templates for Training

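We define templates to structure the input and output for the model during training. The task template is left empty, since the instruction arrives as part of the input text itself (as in the inference example later on, where the input begins with "Correct grammar mistakes:"). The response template marks where the corrected text begins.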

In [21]:
# Define task and response templates
task_template = ""
response_template = "### Response:"

### 4.6 Data Preparation and Tokenization

We format the dataset to align with the task and tokenize it for model ingestion.

In [22]:
# Data formatting function
def format_data(example):
    text = f"{task_template}\n{example['src']}\n{response_template}\n{example['tgt']}{tokenizer.eos_token}"
    return {"text": text}

# Apply formatting to the datasets
train_ds_formatted = train_ds.map(format_data, remove_columns=train_ds.column_names)
val_ds_formatted = val_ds.map(format_data, remove_columns=val_ds.column_names)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    )

# Tokenize the formatted datasets
tokenized_train_ds = train_ds_formatted.map(
    tokenize_function,
    batched=True,
    remove_columns=train_ds_formatted.column_names
)
tokenized_val_ds = val_ds_formatted.map(
    tokenize_function,
    batched=True,
    remove_columns=val_ds_formatted.column_names
)


### 4.7 Custom Data Collator

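We define a custom data collator on top of `DataCollatorForCompletionOnlyLM`. The parent class sets the label of every position it wants excluded from the loss (the prompt before the response template, and padding) to -100. The subclass then overwrites each -100 label with the EOS token id, so those positions are no longer ignored and instead contribute to the loss with EOS as their target.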

In [23]:
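# Custom data collator built on completion-only masking
class CustomDataCollatorForLanguageModeling(DataCollatorForCompletionOnlyLM):
    def __call__(self, examples):
        # Parent collator: labels before the response template (and padding) become -100
        batch = super().__call__(examples)
        labels = batch['labels']
        eos_token_id = self.tokenizer.eos_token_id
        # NOTE: this replaces the ignore index with EOS, so masked positions are trained
        # to predict EOS; remove the next line to keep standard completion-only masking.
        labels[labels == -100] = eos_token_id
        batch['labels'] = labels
        return batch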

collator = CustomDataCollatorForLanguageModeling(response_template, tokenizer=tokenizer)
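
The difference between plain completion-only masking and the override above can be shown on a toy label sequence. The sketch below is plain Python with hypothetical token ids (including an EOS id of 0); `mask_prompt` and `overwrite_ignored` are illustrative helpers and not TRL functions.

```python
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def mask_prompt(input_ids, response_start):
    """Completion-only labels: ignore every position before the response begins."""
    return [IGNORE_INDEX if i < response_start else tok
            for i, tok in enumerate(input_ids)]

def overwrite_ignored(labels, eos_token_id):
    """Mirror of the subclass override: ignored positions become EOS targets."""
    return [eos_token_id if tok == IGNORE_INDEX else tok for tok in labels]

# Hypothetical ids: 3 prompt tokens, 2 response tokens, then EOS (id 0)
input_ids = [11, 12, 13, 21, 22, 0]

completion_only = mask_prompt(input_ids, response_start=3)
overridden = overwrite_ignored(completion_only, eos_token_id=0)
print(completion_only)
print(overridden)
```

In the first list only the response tokens carry a training signal; in the second, every prompt position is additionally trained toward EOS, which is the behaviour the custom collator produces.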

### 4.8 Training Configuration and Execution

We configure the training parameters and initiate the fine-tuning process using the `SFTTrainer`.

In [26]:
# Define the path to save/load the fine-tuned model
FINE_TUNED_MODEL_PATH = "./gec_smollm_finetuned"

# Check if Fine-Tuned Model Exists
if os.path.exists(FINE_TUNED_MODEL_PATH):
    print(f"Loading fine-tuned model from {FINE_TUNED_MODEL_PATH}")
    model = AutoModelForCausalLM.from_pretrained(FINE_TUNED_MODEL_PATH).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = 'left'
else:
    # Training Configuration
    training_args = SFTConfig(
        output_dir="./gec_smollm_finetuned_checkpoints",
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        eval_strategy="steps",
        eval_steps=100,
        logging_steps=100,
        gradient_accumulation_steps=4,
        num_train_epochs=1,  # You may increase this for better performance
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        learning_rate=3e-5,
        save_steps=100,
        push_to_hub=False,
        max_seq_length=256,
        report_to="none"
    )

    # Initialize Trainer
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_ds,
        eval_dataset=tokenized_val_ds,
        data_collator=collator
    )

    # Fine-tune the model
    trainer.train()

    # Save the fine-tuned model
    trainer.save_model(FINE_TUNED_MODEL_PATH)
    tokenizer.save_pretrained(FINE_TUNED_MODEL_PATH)

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.6714        | 0.233148        |
| 200  | 0.0375        | 0.219506        |
| 300  | 0.0362        | 0.216986        |
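
The step numbers in this log follow from the batch settings in the training configuration. The sketch below works through the arithmetic in plain Python and evaluates a linear-warmup-plus-cosine schedule. The step counts are approximate (the trainer's handling of the last partial accumulation step may differ), a single GPU is assumed, and `lr_at` is an illustrative approximation of the `cosine` scheduler, not the library's code.

```python
import math

num_samples = 19823                       # GEC training samples reported earlier
per_device_batch, grad_accum = 16, 4      # from the SFTConfig (single GPU assumed)
peak_lr = 3e-5

effective_batch = per_device_batch * grad_accum          # sequences per optimizer step
total_steps = math.ceil(num_samples / effective_batch)   # approximate steps in one epoch
warmup_steps = math.ceil(total_steps / 10)               # warmup_ratio = 0.1

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, 100, 200, 300, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With about 310 optimizer steps in the epoch and evaluation every 100 steps, the log contains exactly the three evaluation rows shown.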




### 4.9 Inference Function

We define a function to perform inference using the fine-tuned model.

In [28]:
# Function to format input text
def format_sample(text):
    formatted_text = f"{task_template}\n{text}\n{response_template}\n"
    return formatted_text

# Inference function
def infer(model, tokenizer, text):
    text = format_sample(text)

    # Tokenize the input text
    inputs = tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        padding=True,
        max_length=256,
        add_special_tokens=False
    ).to(device)

    # Generate the output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            do_sample=False,        # Use greedy decoding
            early_stopping=True
        )

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the corrected text
    split_generated = generated_text.split(response_template)
    if len(split_generated) >= 2:
        corrected_text = split_generated[1].strip()
    else:
        corrected_text = generated_text.strip()

    return corrected_text

# Test the model with a sample input
test_sentence_with_prefix = "Correct grammar mistakes: I likes turtles"
corrected_sentence = infer(model, tokenizer, test_sentence_with_prefix)
print("Original: I likes turtles")
print(f"Corrected: {corrected_sentence}")

# Expected output: I like turtles.



Original: I likes turtles
Corrected: I like turtles.
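
The extraction step in `infer` relies on the decoded sequence containing the prompt followed by the model's answer, so the answer is whatever comes after the response marker. The sketch below isolates that logic. The template strings are placeholders (the notebook defines `task_template` and `response_template` in an earlier cell), and `extract_response` is a hypothetical helper name.

```python
def extract_response(generated_text, response_template):
    """Return the text after the first response marker, or the whole text if absent."""
    parts = generated_text.split(response_template)
    if len(parts) >= 2:
        # parts[0] is the prompt, parts[1] is the model's answer.
        return parts[1].strip()
    return generated_text.strip()

# Placeholder templates, assumed only for this example.
TASK_TEMPLATE = "### Task:"
RESPONSE_TEMPLATE = "### Response:"

decoded = (
    f"{TASK_TEMPLATE}\nCorrect grammar mistakes: I likes turtles\n"
    f"{RESPONSE_TEMPLATE}\nI like turtles."
)
print(extract_response(decoded, RESPONSE_TEMPLATE))
```

Because only `parts[1]` is kept, anything the model generates after a second occurrence of the marker is discarded, which also trims runaway generations that start a new prompt.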


### 4.10 Evaluation Function

We implement an evaluation function to assess the model's performance using the BLEU score on the validation dataset.

In [29]:
# Function for formatting prompts during evaluation
def formatting_prompts_func_evaluation(batch):
    formatted_prompts = []
    for instruction in batch['instruction']:
        formatted = f"{task_template}\n{instruction}\n{response_template}\n"
        formatted_prompts.append(formatted)
    return formatted_prompts

# Evaluation function
def evaluate_model(model, tokenizer, dataset, batch_size=32):
    model.eval()
    bleu = evaluate.load('bleu')
    all_preds = []
    all_refs = []

    # Iterate over the dataset in batches
    for i in tqdm(range(0, len(dataset), batch_size), desc="Evaluating"):
        batch = dataset[i:i+batch_size]
        instructions = batch['src']
        references = batch['tgt']

        # Format prompts
        formatted_prompts = formatting_prompts_func_evaluation({
            'instruction': instructions,
            'output': references
        })

        # Tokenize the prompts
        inputs = tokenizer(
            formatted_prompts,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=256,
            add_special_tokens=False
        ).to(device)

        # Generate predictions
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                num_return_sequences=1,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
                do_sample=False,        # Use greedy decoding
                early_stopping=True
            )

        # Decode the generated outputs
        preds = tokenizer.batch_decode(outputs, skip_special_tokens=True)

        # Extract the corrected text
        truncated_preds = []
        for pred in preds:
            split_pred = pred.split(response_template)
            if len(split_pred) >= 2:
                corrected_text = split_pred[1].strip()
            else:
                corrected_text = pred.strip()
            truncated_preds.append(corrected_text)

        # Append predictions and references
        all_preds.extend(truncated_preds)
        all_refs.extend(references)

    # Compute BLEU score
    results = bleu.compute(predictions=all_preds, references=all_refs)
    return results['bleu']

# Evaluate the model on the validation dataset
bleu_score = evaluate_model(model, tokenizer, val_ds)
print(f"BLEU score: {bleu_score}")

# Expected BLEU score after 1 epoch SFT is ~ 0.48.

Evaluating: 100%|██████████| 16/16 [01:54<00:00,  7.17s/it]


BLEU score: 0.47223217090476366
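
To make the BLEU number easier to interpret, the sketch below computes the clipped n-gram precision that BLEU is built on: the share of predicted n-grams that also occur in the reference, with each n-gram counted at most as often as it appears in the reference. This is a simplified illustration rather than the `evaluate` implementation. It uses whitespace tokenization and omits the brevity penalty and the geometric mean over n-gram orders; `ngrams` and `clipped_precision` are hypothetical helper names.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as many times as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

prediction = "I likes turtles"
reference = "I like turtles"
print("unigram precision:", clipped_precision(prediction, reference, 1))
print("bigram precision:", clipped_precision(prediction, reference, 2))
```

A single wrong word breaks every n-gram that contains it, which is why small grammatical slips can lower BLEU noticeably on short sentences.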


---

## 5. Creating a Preference Optimization Dataset

To further improve the model, we build a preference dataset. For each input we generate two candidate corrections, one with greedy decoding and one sampled at temperature 0.7, and label the candidate with the smaller edit distance to the ground-truth correction as `chosen` and the other as `rejected`.

### 5.1 Installing Additional Libraries

In [30]:
# Install the fast_edit_distance library
!pip install -q fast_edit_distance

### 5.2 Importing Libraries

In [31]:
import os
from fast_edit_distance import edit_distance
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset
import pandas as pd

### 5.3 Defining Paths and Device

In [32]:
# Define paths for saving/loading models and datasets
FINE_TUNED_MODEL_PATH = "./gec_smollm_finetuned"
PREFERENCE_TRAIN_DATASET_PATH = "./preference_optimization_train_dataset"
PREFERENCE_VAL_DATASET_PATH = "./preference_optimization_val_dataset"

# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


### 5.4 Loading the Fine-Tuned Model and Tokenizer


In [33]:
# Load the fine-tuned model and tokenizer
if os.path.exists(FINE_TUNED_MODEL_PATH):
    print(f"Loading fine-tuned model from {FINE_TUNED_MODEL_PATH}")
    fine_tuned_model = AutoModelForCausalLM.from_pretrained(FINE_TUNED_MODEL_PATH).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    fine_tuned_model.config.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = 'left'
else:
    raise FileNotFoundError(f"Fine-tuned model not found at {FINE_TUNED_MODEL_PATH}")

Loading fine-tuned model from ./gec_smollm_finetuned
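
The cell above sets `tokenizer.padding_side = 'left'`. For a decoder-only model this matters in batched generation: new tokens are appended after the last position of each row, so the real prompt tokens need to sit at the end of the row and the padding at the start. The sketch below illustrates the difference with plain lists. `pad_batch` is a hypothetical helper written for this example, not a tokenizer API, and the token IDs are made up.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length; return (padded_ids, attention_masks)."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        if side == "left":
            padded.append(pad + list(seq))
            masks.append([0] * len(pad) + [1] * len(seq))
        else:
            padded.append(list(seq) + pad)
            masks.append([1] * len(seq) + [0] * len(pad))
    return padded, masks

batch = [[5, 6, 7], [8]]  # made-up token IDs
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")
print("left :", left_ids, left_mask)
print("right:", right_ids, right_mask)
```

With left padding the final column holds a real token in every row, so generation continues directly from each prompt. With right padding the shorter prompt would be continued from a pad token.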


### 5.5 Defining the Inference Function with Decoding Parameters


In [34]:
# Function to format the input as required by the model
def format_sample(text):
    formatted_text = f"{task_template}\n{text}\n{response_template}\n"
    return formatted_text

# Inference function with decoding parameters
def infer_batch(model, tokenizer, texts, do_sample=False, temperature=1.0):
    formatted_texts = [format_sample(text) for text in texts]

    # Tokenize the input texts
    inputs = tokenizer(
        formatted_texts,
        return_tensors='pt',
        truncation=True,
        padding=True,
        max_length=256,
        add_special_tokens=False
    ).to(device)

    # Generate the outputs
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            do_sample=do_sample,        # Sampling or greedy decoding
            temperature=temperature,    # Temperature for sampling
            early_stopping=True
        )

    # Decode the generated texts
    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Extract the corrected texts
    corrected_texts = []
    for gen_text in generated_texts:
        split_generated = gen_text.split(response_template)
        if len(split_generated) >= 2:
            corrected_text = split_generated[1].strip()
        else:
            corrected_text = gen_text.strip()
        corrected_texts.append(corrected_text)

    return corrected_texts

### 5.6 Generating the Preference Dataset


In [35]:
# Function to generate the preference dataset
def generate_preference_dataset(model, tokenizer, input_sentences, ground_truths, batch_size=64):

    # Function to generate two variants for each input
    def generate_variants_batch(model, tokenizer, input_texts, batch_size=32):
        num_samples = len(input_texts)
        variants1 = []
        variants2 = []

        for i in tqdm(range(0, num_samples, batch_size), desc="Generating variants"):
            batch_texts = input_texts[i:i+batch_size]

            # Variant 1: Greedy decoding
            variant1 = infer_batch(
                model, tokenizer,
                batch_texts,
                do_sample=False,
                temperature=1.0
            )

            # Variant 2: Sampling with temperature
            variant2 = infer_batch(
                model, tokenizer,
                batch_texts,
                do_sample=True,
                temperature=0.7
            )

            variants1.extend(variant1)
            variants2.extend(variant2)

        return variants1, variants2

    # Generate variants
    print("Generating variants in batches...")
    all_variant1, all_variant2 = generate_variants_batch(
        model, tokenizer,
        input_sentences,
        batch_size=batch_size
    )

    # Compute edit distances
    print("Computing edit distances...")
    preference_data = []

    for i in tqdm(range(0, len(input_sentences), batch_size), desc="Annotating preferences"):
        batch_inputs = input_sentences[i:i+batch_size]
        batch_ground_truths = ground_truths[i:i+batch_size]
        batch_variant1 = all_variant1[i:i+batch_size]
        batch_variant2 = all_variant2[i:i+batch_size]

        # Compute edit distances for the batch
        batch_distances1 = [edit_distance(v1, gt) for v1, gt in zip(batch_variant1, batch_ground_truths)]
        batch_distances2 = [edit_distance(v2, gt) for v2, gt in zip(batch_variant2, batch_ground_truths)]

        # Determine chosen and rejected variants
        for j in range(len(batch_inputs)):
            if batch_distances1[j] < batch_distances2[j]:
                chosen = batch_variant1[j]
                rejected = batch_variant2[j]
            else:
                chosen = batch_variant2[j]
                rejected = batch_variant1[j]

            # Append the annotated data
            preference_data.append({
                'prompt': batch_inputs[j],
                'chosen': chosen,
                'rejected': rejected
            })

    # Convert to a Dataset
    preference_dataset = Dataset.from_pandas(pd.DataFrame(preference_data))

    return preference_dataset
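
The labeling rule inside `generate_preference_dataset` can be seen in isolation in the sketch below. It uses a plain-Python Levenshtein distance as a stand-in for `fast_edit_distance.edit_distance`, on the assumption that the library computes the same character-level distance. `levenshtein` and `label_pair` are hypothetical helper names, and the example sentences are made up.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def label_pair(prompt, variant1, variant2, ground_truth):
    """Mark the variant closer to the ground truth as chosen (ties go to variant2)."""
    d1 = levenshtein(variant1, ground_truth)
    d2 = levenshtein(variant2, ground_truth)
    if d1 < d2:
        chosen, rejected = variant1, variant2
    else:
        chosen, rejected = variant2, variant1
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = label_pair(
    "Correct grammar mistakes: I likes turtles",
    "I like turtles.",   # e.g. greedy output
    "I likes turtle.",   # e.g. sampled output
    "I like turtles.",   # ground truth
)
print(record)
```

Note that the strict `<` comparison means a tie, including the case where both variants are the same string, still produces a record.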

### 5.7 Saving the Preference Dataset


In [36]:
# Function to generate and save the preference dataset
def generate_and_save_preference_dataset(model, tokenizer, dataset, dataset_type='train', batch_size=64):
    input_sentences = dataset['src']
    ground_truths = dataset['tgt']

    # Generate the preference dataset
    preference_dataset = generate_preference_dataset(
        model, tokenizer,
        input_sentences,
        ground_truths,
        batch_size=batch_size
    )

    # Define the path based on dataset type
    if dataset_type == 'train':
        save_path = PREFERENCE_TRAIN_DATASET_PATH
    elif dataset_type == 'val':
        save_path = PREFERENCE_VAL_DATASET_PATH
    else:
        raise ValueError("dataset_type must be either 'train' or 'val'.")

    # Save the preference dataset
    preference_dataset.save_to_disk(save_path)
    print(f"Preference optimization dataset for {dataset_type} saved to {save_path}")

    return preference_dataset

### 5.8 Loading or Generating Preference Datasets


In [37]:
# Load or generate preference optimization dataset for training
if os.path.exists(PREFERENCE_TRAIN_DATASET_PATH):
    print(f"Loading training preference optimization dataset from {PREFERENCE_TRAIN_DATASET_PATH}")
    preference_train_dataset = Dataset.load_from_disk(PREFERENCE_TRAIN_DATASET_PATH)
else:
    print("Training preference optimization dataset not found. Generating the dataset...")
    preference_train_dataset = generate_and_save_preference_dataset(
        fine_tuned_model, tokenizer,
        train_ds,
        dataset_type='train',
        batch_size=64
    )

# Load or generate preference optimization dataset for validation
if os.path.exists(PREFERENCE_VAL_DATASET_PATH):
    print(f"Loading validation preference optimization dataset from {PREFERENCE_VAL_DATASET_PATH}")
    preference_val_dataset = Dataset.load_from_disk(PREFERENCE_VAL_DATASET_PATH)
else:
    print("Validation preference optimization dataset not found. Generating the dataset...")
    preference_val_dataset = generate_and_save_preference_dataset(
        fine_tuned_model, tokenizer,
        val_ds,
        dataset_type='val',
        batch_size=64
    )

# Visualize samples from both preference datasets
print("\nSample of the Training Preference Optimization Dataset:")
print(preference_train_dataset.select(range(5)))
print(preference_train_dataset[0])

print("\nSample of the Validation Preference Optimization Dataset:")
print(preference_val_dataset.select(range(5)))
print(preference_val_dataset[0])

Training preference optimization dataset not found. Generating the dataset...
Generating variants in batches...


Generating variants: 100%|██████████| 310/310 [19:30<00:00,  3.78s/it]


Computing edit distances...


Annotating preferences: 100%|██████████| 310/310 [00:04<00:00, 68.67it/s] 


Saving the dataset (0/1 shards):   0%|          | 0/19823 [00:00<?, ? examples/s]

Preference optimization dataset for train saved to ./preference_optimization_train_dataset
Validation preference optimization dataset not found. Generating the dataset...
Generating variants in batches...


Generating variants: 100%|██████████| 8/8 [02:08<00:00, 16.12s/it]


Computing edit distances...


Annotating preferences: 100%|██████████| 8/8 [00:00<00:00,  9.80it/s]


Saving the dataset (0/1 shards):   0%|          | 0/485 [00:00<?, ? examples/s]

Preference optimization dataset for val saved to ./preference_optimization_val_dataset

Sample of the Training Preference Optimization Dataset:
Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 5
})
{'prompt': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'chosen': 'For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'rejected': 'For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.'}

Sample of the Validation Preference Optimization Dataset:
Dataset({
    features: ['prompt', 'chosen', 'rejected'],
    num_rows: 5
})
{'prompt': 'Fix grammaticality: First of all, from you read jus
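
In the training sample printed above, `chosen` and `rejected` are the same string. Such a pair carries no preference signal, because the two log-probability terms in the DPO objective cancel. The notebook keeps these records; the sketch below shows one possible way to drop them, using plain dictionaries in the same `prompt`/`chosen`/`rejected` layout. `drop_tied_pairs` is a hypothetical helper and the records are made up.

```python
def drop_tied_pairs(records):
    """Keep only preference records whose chosen and rejected texts differ."""
    return [r for r in records if r["chosen"] != r["rejected"]]

# Made-up records in the same layout as the preference dataset.
records = [
    {"prompt": "Fix grammar: I likes turtles",
     "chosen": "I like turtles.", "rejected": "I likes turtle."},
    {"prompt": "Fix grammar: She go home",
     "chosen": "She goes home.", "rejected": "She goes home."},
]
filtered = drop_tied_pairs(records)
print(f"kept {len(filtered)} of {len(records)} pairs")
```

On a `datasets.Dataset`, the same condition could be passed to `Dataset.filter`, for example `ds.filter(lambda r: r["chosen"] != r["rejected"])`.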

---

## 6. Direct Preference Optimization (DPO)

In the final section, we apply Direct Preference Optimization (DPO) to further train the model using the preference optimization dataset.
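
As a reminder of what DPO optimizes, the sketch below computes the per-example loss from four sequence log-probabilities: the policy and the frozen reference model, each scored on the chosen and the rejected response. It is a plain-Python illustration of the standard sigmoid DPO loss, not TRL's implementation. The function name, the example log-probabilities, and `beta=0.1` are assumptions made for this example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, chosen_reward, rejected_reward, margin) for one preference pair."""
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward, margin

# Made-up log-probabilities: the policy has shifted mass toward the chosen response.
loss, r_chosen, r_rejected, margin = dpo_loss(-75.0, -98.0, -78.0, -95.0)
print(f"loss={loss:.4f} chosen={r_chosen:.3f} rejected={r_rejected:.3f} margin={margin:.3f}")
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693), so training losses below that value indicate the policy has started to separate chosen from rejected responses.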

### 6.1 Importing Necessary Libraries

In [38]:
import os
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import pandas as pd

### 6.2 Running Direct Preference Optimization (DPO)


In [40]:
# Define the path to save/load the aligned model
ALIGNED_MODEL_PATH = "./gec_smollm_aligned"

# Check if Aligned Model Exists
if os.path.exists(ALIGNED_MODEL_PATH):
    print(f"Loading aligned model from {ALIGNED_MODEL_PATH}")
    aligned_model = AutoModelForCausalLM.from_pretrained(ALIGNED_MODEL_PATH).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    aligned_model.config.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = 'left'
else:
    # Load the fine-tuned model
    FINE_TUNED_MODEL_PATH = "./gec_smollm_finetuned"
    if os.path.exists(FINE_TUNED_MODEL_PATH):
        print(f"Loading fine-tuned model from {FINE_TUNED_MODEL_PATH}")
        aligned_model = AutoModelForCausalLM.from_pretrained(FINE_TUNED_MODEL_PATH).to(device)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokenizer.pad_token = tokenizer.eos_token
        aligned_model.config.pad_token_id = tokenizer.eos_token_id
        tokenizer.padding_side = 'left'
    else:
        raise FileNotFoundError(f"Fine-tuned model not found at {FINE_TUNED_MODEL_PATH}")

    # Training Configuration for DPO
    dpo_config = DPOConfig(
        output_dir="./gec_smollm_aligned_checkpoints",
        per_device_train_batch_size=16,
        logging_steps=100,
        eval_strategy="steps",
        eval_steps=100,
        max_prompt_length=256,
        gradient_accumulation_steps=4,
        num_train_epochs=1,  # You may increase this for better performance
        weight_decay=0.1,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        learning_rate=5e-6,
        save_steps=100,
        push_to_hub=False,
        max_length=512,
        remove_unused_columns=False,
        report_to="none"
    )

    # Initialize DPO Trainer
    dpo_trainer = DPOTrainer(
        model=aligned_model,
        ref_model=fine_tuned_model,
        args=dpo_config,
        train_dataset=preference_train_dataset,
        eval_dataset=preference_val_dataset,
        tokenizer=tokenizer
    )

    # Fine-tune the model using DPO
    dpo_trainer.train()

    # Save the aligned model
    dpo_trainer.save_model(ALIGNED_MODEL_PATH)
    tokenizer.save_pretrained(ALIGNED_MODEL_PATH)

Loading fine-tuned model from ./gec_smollm_finetuned




Tokenizing train dataset:   0%|          | 0/19823 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/485 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


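| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|------|---------------|-----------------|----------------|------------------|--------------------|-----------------|----------------|--------------|-----------------|---------------|
| 100 | 0.6427 | 0.598079 | 0.049319 | -0.513273 | 0.612295 | 0.562593 | -96.790024 | -79.553162 | 8.505701 | 8.527115 |
| 200 | 0.5993 | 0.611641 | 0.756953 | -0.041106 | 0.612295 | 0.798059 | -92.068344 | -72.476837 | 8.278921 | 8.29547 |
| 300 | 0.5827 | 0.616646 | 0.235696 | -0.72099 | 0.62459 | 0.956686 | -98.86718 | -77.689392 | 8.188497 | 8.203693 |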
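
As a sanity check on the step numbers in the log above, the sketch below works out the effective batch size and the approximate number of optimizer steps per epoch from the DPO configuration (`per_device_train_batch_size=16`, `gradient_accumulation_steps=4`) and the 19823 training pairs. It assumes a single device; the helper name is hypothetical, and the exact count can differ by one depending on how the trainer handles the final partial accumulation window.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size,
                              grad_accum_steps, num_devices=1):
    """Return (effective_batch_size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, math.ceil(num_examples / effective_batch)

effective_batch, steps = optimizer_steps_per_epoch(
    num_examples=19823, per_device_batch_size=16, grad_accum_steps=4
)
print(f"effective batch size: {effective_batch}, steps per epoch: ~{steps}")
```

With one epoch of roughly this many steps and `logging_steps=100`, only three evaluation rows are expected, which matches the log.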


### 6.3 Evaluating the DPO-Optimized Model


In [41]:
# Evaluate the DPO model
print("Evaluating DPO-optimized model")
bleu_score_dpo = evaluate_model(aligned_model, tokenizer, val_ds)

# Compare with SFT baseline
print(f"BLEU score with SFT: {bleu_score}")
print(f"BLEU score with DPO: {bleu_score_dpo}")

# Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

Evaluating DPO-optimized model


Evaluating: 100%|██████████| 16/16 [01:38<00:00,  6.15s/it]


BLEU score with SFT: 0.47223217090476366
BLEU score with DPO: 0.4807825080493354


---

## Conclusion

This notebook walked through implementing the **SmolLM-135M** model, fine-tuning it for grammatical error correction on the Grammarly CoEdIT dataset, and aligning it with Direct Preference Optimization. After one epoch of supervised fine-tuning the model reached a BLEU score of about 0.472 on the validation set, and one additional epoch of DPO raised it to about 0.481. Training for more epochs in either stage may improve these results.
