# **Background**


### SmolLM Implementation and Fine-tuning

- Fine-tuned SmolLM-135M for grammatical error correction (GEC) using the Grammarly CoEdIT dataset, employing Supervised Fine-Tuning (SFT), BLEU score evaluation, and hyperparameter optimization to enhance model performance.
- Developed and applied advanced preference optimization techniques, including Direct Preference Optimization (DPO) and Contrastive Preference Optimization (CPO), to improve GEC performance by leveraging diverse output variants and human-like preferences.

**Model**: SmolLM-135M is available on [Hugging Face](https://huggingface.co/HuggingFaceTB/SmolLM-135M).


## 1. Setup [Helper Functions]

In [None]:
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (6/6), 2.11 KiB | 1.06 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  sample_data


In [None]:

# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")



## 2. Custom SmolLM (for BugFixes)

In [None]:
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()
        self.freq = 1/(base ** (torch.arange(0, dim, 2, dtype=torch.int64).float()/dim))

    @torch.no_grad()
    def forward(self,x):
        pos = torch.arange(x.shape[-2],dtype=torch.long)
        ### BUG FIX ###
        angles = torch.einsum('f,p->fp', pos.float(), self.freq).unsqueeze(dim=0)
        emb = torch.cat((angles, angles), dim=-1)
        return emb.cos(), emb.sin()


class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = torch.nn.modules.activation.SiLU()

    def forward(self, x):
        ### BUG FIX ###
        down_proj = self.W_down(self.act_fn(self.W_gate(x)) * self.W_up(x))
        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        ### BUG FIX ###
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states


class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0

        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta,
                                         dim=config.hidden_size//self.num_heads)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        cos, sin = self.rotary_emb(v_states)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin)

        ### BUG FIX ###
        __kv_groups = self.num_heads // self.kv_heads
        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)

        ### BUG FIX ###
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        ### BUG FIX ###
        if attention_mask is not None:
            attn_weights = attn_weights + attention_mask
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        ### BUG FIX ### (Remove dropout)

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)

        ### BUG FIX ###
        attn_output = self.W_output(attn_output)

        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)
        attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf')),diagonal=1)

        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual

        ### BUG FIX ###
        residual = hidden_states
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual

        outputs = (hidden_states,)

        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]

class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        ### BUG FIX ###
        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}


In [None]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()


smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=False)
          (W_key): Linear(in_features=576, out_features=192, bias=False)
          (W_value): Linear(in_features=576, out_features=192, bias=False)
          (W_output): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=False)
          (W_up): Linear(in_features=576, out_features=1536, bias=False)
          (W_down): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

## 3. Test

In [None]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)


In [None]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=50,
               model_A=__reference_model,
               model_B=__test_model)

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

# **Teach SmolLM to do grammatical error correction**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions. Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate.
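To make the BLEU metric mentioned above concrete, here is a minimal sketch of corpus-level BLEU with a brevity penalty. It is an illustration only: the helper names `ngram_counts` and `corpus_bleu` are hypothetical, it assumes one reference per prediction, and it tokenizes on whitespace, so its scores may differ from those of the `evaluate` BLEU metric used later in this notebook.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_order=4):
    """Simplified corpus BLEU: one reference per prediction, whitespace tokens (assumption)."""
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = 0
    ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            pred_ngrams = ngram_counts(pred_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: each n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((pred_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(pred_ngrams.values())
    # Without smoothing, any order with zero matches makes the geometric mean zero.
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # The brevity penalty punishes predictions that are shorter than the references.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"]))
```

Because the function takes plain lists of strings, the same call can score the SFT model and any later checkpoint, which is what makes the evaluation reusable.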

In [None]:
!pip install datasets -q

In [None]:
from datasets import load_dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")


In [None]:
# TODO: Filter examples, keeping only GEC task
def is_gec_task(example):
    return example['task'] == 'gec'

train_ds = full_train_ds.filter(is_gec_task)
test_ds = full_test_ds.filter(is_gec_task)

print(f"Original train dataset size: {len(full_train_ds)}")
print(f"Original test dataset size: {len(full_test_ds)}")

print(f"Filtered train dataset size: {len(train_ds)}")
print(f"Filtered test dataset size: {len(test_ds)}")




Original train dataset size: 69071
Original test dataset size: 1712
Filtered train dataset size: 19823
Filtered test dataset size: 485


The expected numbers of train and test samples are 19823 and 485, respectively.

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM-135M"

# TODO: Load the model and the tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")


In [None]:
!pip install trl -q

In [None]:
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.model_max_length = 350


def format_text(example):
    messages = f"{example['src']} ###>{example['tgt']}"
    example["text"] = messages

    return example

column_names = list(train_ds.features)

train_ds_formated = train_ds.map(
    format_text,
    remove_columns=column_names,
    desc="Applying chat template"
)

test_ds_formated = test_ds.map(
    format_text,
    remove_columns=column_names,
    desc="Applying chat template"
)


split_dataset = train_ds_formated.train_test_split(test_size=0.2)

train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']


In [None]:
# TRL - Transformer Reinforcement Learning -- https://huggingface.co/docs/trl/en/index
from trl import SFTConfig, SFTTrainer

# TODO: Run SFT
sft_config = SFTConfig(output_dir="./sft_out/",
                       packing = True,
                       dataset_batch_size = 12,
                       learning_rate=7e-5,
                       logging_steps=100,
                       )

trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        dataset_text_field='text',
        tokenizer=tokenizer,
        args=sft_config,
        max_seq_length=tokenizer.model_max_length,
)

train_result = trainer.train()



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.





| Step | Training Loss |
|------|---------------|
| 100 | 1.8096 |
| 200 | 1.7556 |
| 300 | 1.7337 |
| 400 | 1.623 |
| 500 | 1.6212 |
| 600 | 1.6108 |
| 700 | 1.5524 |
| 800 | 1.5532 |


In [None]:
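# Quick test to check that the model works properly.
# Note: this redefines `format_text` from the data-preparation cell; here it only builds the inference prompt.
def format_text(text: str) -> str:
    prompt = text + " ###>"
    return prompt


# Example of how to run inference on a single example
text = "Fix grammatically: I likes turtles"
# Move the inputs to whichever device the model was placed on.
inputs = tokenizer(format_text(text), return_tensors="pt", padding=True, truncation=True, max_length=128).to(model.device)
# Greedy decoding (no sampling)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0]))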

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Fix grammatically: I likes turtles ###>I like turtles.<|endoftext|>


Expected output: I like turtles.
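The round trip between the training format and inference can be sketched without a model. The snippet below mirrors the `" ###>"` separator used in this notebook; the helper names `build_training_text`, `build_prompt`, and `extract_correction` are hypothetical and only illustrate the convention.

```python
SEPARATOR = " ###>"  # separator between the instruction and the correction, as used in this notebook


def build_training_text(src: str, tgt: str) -> str:
    """Text seen during SFT: instruction, separator, then the gold correction."""
    return f"{src}{SEPARATOR}{tgt}"


def build_prompt(src: str) -> str:
    """Text given at inference time: the model continues after the separator."""
    return f"{src}{SEPARATOR}"


def extract_correction(decoded: str) -> str:
    """Keep only what follows the last separator and trim surrounding whitespace."""
    return decoded.split("###>")[-1].strip()


src = "Fix grammatically: I likes turtles"
prompt = build_prompt(src)
# Simulated decoded output: the prompt followed by the model's continuation.
decoded = prompt + "I like turtles."
print(extract_correction(decoded))
```

Keeping the separator identical at training and inference time matters, since the model only learns to start its correction after that exact string.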

In [None]:
!pip install evaluate -q

In [None]:
import evaluate

def process_in_batches(test_ds, model, tokenizer, generate_params=None, initial_batch_size=16):
    if generate_params is None:
        generate_params = {"max_new_tokens": 128, "do_sample": False}  # greedy decoding by default

    batch_size = initial_batch_size
    start = 0
    outputs = []

    while start < len(test_ds["src"]):
        end = min(start + batch_size, len(test_ds["src"]))
        batch = test_ds["src"][start:end]

        try:
            model_inputs = tokenizer(
                [format_text(text) for text in batch],
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=128
            ).to("cuda")

            generated_ids = model.generate(**model_inputs, **generate_params, pad_token_id=tokenizer.eos_token_id)
            decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            batch_outputs = [item.split("###>")[-1].strip() for item in decoded_outputs]

            outputs.extend(batch_outputs)
            print(f"Batch processed successfully with batch size: {batch_size}")
            start = end

        except torch.cuda.OutOfMemoryError:
            print(f"Out of memory at batch size {batch_size}. Reducing batch size.")
            torch.cuda.empty_cache()
            batch_size = max(1, batch_size // 2)  # Reduce by half, but ensure batch size is at least 1

    return outputs


# BLEU Score
def evaluate_model(model, tokenizer, ds):
    generate_params = {
          "max_new_tokens": 128,
          "do_sample": True,
          "num_beams": 3,
      }

    preds = process_in_batches(ds, model, tokenizer, generate_params, initial_batch_size=200)
    targets = ds["tgt"]

    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=preds, references=targets)
    return results["bleu"]

In [None]:
# TODO: Evaluate model, use the function given above
score = evaluate_model(model, tokenizer, test_ds)
score

Out of memory at batch size 1000. Reducing batch size.
Out of memory at batch size 500. Reducing batch size.
Out of memory at batch size 250. Reducing batch size.
Out of memory at batch size 125. Reducing batch size.
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62



0.476827187159293

In [None]:
score

0.476827187159293

The expected BLEU score after one epoch of SFT is approximately 0.48.

## **2.2 Create a preference optimization dataset**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Use different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Choose the settings according to the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated variant** and the **ground-truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
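The annotation rule above can be sketched in plain Python. This is an illustration under stated assumptions: `levenshtein` is a textbook character-level edit distance written for this example (not the `fast_edit_distance` package), `label_pair` is a hypothetical helper, and skipping ties is a choice made in this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def label_pair(variant_a: str, variant_b: str, target: str):
    """Return (chosen, rejected), or None when both variants are equally close."""
    dist_a = levenshtein(variant_a, target)
    dist_b = levenshtein(variant_b, target)
    if dist_a == dist_b:
        return None  # no preference signal in a tie
    return (variant_a, variant_b) if dist_a < dist_b else (variant_b, variant_a)


print(label_pair("I like turtles", "I likes turtle", "I like turtles."))
```

A pair that ties carries no information about which output is better, so dropping it keeps the preference dataset clean at the cost of some examples.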


In [None]:
!pip install fast_edit_distance -q

In [None]:
from fast_edit_distance import edit_distance

# TODO: Create preference optimization dataset
variant_1_params = {
    "max_new_tokens": 128,
    "temperature": 1.0,
    "do_sample": True,

}
variant_2_params = {
    "max_new_tokens": 128,
    "temperature": 0.5,
    "do_sample": True,
}



variation_1_preds = process_in_batches(train_ds, model, tokenizer, variant_1_params, initial_batch_size=200)
variation_2_preds = process_in_batches(train_ds, model, tokenizer, variant_2_params, initial_batch_size=200)

Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
Batch processed successfully with batch size: 200
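The two generation passes above differ only in sampling temperature (1.0 versus 0.5). As a rough illustration of why that yields different variants, the sketch below applies temperature scaling to a toy logit vector; `softmax_with_temperature` and the logit values are made up for this example and are not part of the notebook's pipeline.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical next-token scores
probs_t10 = softmax_with_temperature(toy_logits, 1.0)
probs_t05 = softmax_with_temperature(toy_logits, 0.5)

print("T=1.0:", [round(p, 3) for p in probs_t10])
print("T=0.5:", [round(p, 3) for p in probs_t05])
```

Lowering the temperature concentrates probability on the top-scoring token, so the 0.5 variant tends to stay closer to the greedy output while the 1.0 variant explores more.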


In [None]:
from datasets import Dataset

def create_dpo_dataset(prompts, targets, variants1, variants2):
    data = {
        "prompt": [],
        "chosen": [],
        "rejected": [],
        "target": [],
    }

    for prompt, target, var1, var2 in zip(prompts, targets, variants1, variants2):
        dist1 = edit_distance(var1, target)
        dist2 = edit_distance(var2, target)


        # Skip if distances are equal
        if dist1 == dist2:
            continue

        if dist1 < dist2:
            chosen, rejected = var1, var2
        else:
            chosen, rejected = var2, var1


        # Add to data dictionary
        data["prompt"].append(prompt)
        data["chosen"].append(chosen)
        data["rejected"].append(rejected)
        data["target"].append(target)

    dataset = Dataset.from_dict(data)
    return dataset


ground_truth = train_ds["tgt"]
prompts = train_ds["src"]

dpo_dataset = create_dpo_dataset(prompts, ground_truth, variation_1_preds, variation_2_preds)

In [None]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

df = dpo_dataset.to_pandas()

df.sample(5)

| index | prompt | chosen | rejected | target |
|---|---|---|---|---|
| 9406 | Fix grammaticality: Access to sevreal public transport services (such as underground, buses, tram, etc.) | Access to public transport services (such as underground, buses, tram, etc.) | Access to public transport services (such as the underground, buses, tram, etc.) | Access to different public transport services (such as underground, buses, tram, etc.) |
| 14136 | Fix grammaticality of the sentence: Frankly, asking god is no stranger to me but i seldom have a thankful heart to him. | Frankly, asking God is no stranger to me, but I seldom have a thankful heart to him. | Frankly, asking for God is no stranger to me, but I seldom have a thankful heart to him. | Frankly, asking God is no stranger to me, but I seldom have a thankful heart for him. |
| 13599 | Update to remove grammar errors: Compare with new songs nowadays, old songs were rather great and meaningful. | Compared with new songs nowadays, old songs were rather great and meaningful. | Comparing with new songs nowadays, old songs were rather great and meaningful. | Compared with new songs nowadays, old songs were rather great and meaningful. |
| 8602 | Fix grammar errors in this sentence: Kitesurfing-this sport discipline is getting more and more popular not only in countries such as Australia, Turkey or Egypt, but also in other ones, just like Poland or Germany! | Kitesurfing-this sport is now getting more and more popular not only in countries such as Australia, Turkey, or Egypt, but also in other ones, just like Poland or Germany! | Kitesurfing-this sport discipline is getting more and more popular not only in countries such as Australia, Turkey, or Egypt, but also in other ones, just like Poland or Germany! | Kitesurfing-this sport is getting more and more popular, not only in countries such as Australia, Turkey or Egypt but also in other ones, like Poland or Germany! |
| 13177 | Fix grammatical errors: Though brand perception of segment 7 is lower than other categories, products from manufacturers in this group have particular function. | Though the brand perception of segment 7 is lower than other categories, products from manufacturers in this group have particular functions. | Though brand perception of segment 7 is lower than that of the other categories, products from manufacturers in this group have particular functions. | Though the brand perception of segment 7 is lower than other categories, products from manufacturers in this group have a particular function. |


### Alternatives to edit distance
Here are some other metrics or methods for preference dataset annotation beyond edit distance.

1. **Task-specific performance metrics**:
Depending on the intended use of the text, we could develop task-specific metrics that align with the ultimate goal. For example:

  - For summaries: ROUGE scores to measure overlap with reference summaries
  - For translations: BLEU scores to assess translation quality
  - For dialogue responses: Perplexity or response relevance scores
  - For content generation: Factual accuracy checks against a knowledge base


2. **Semantic similarity**:
Instead of focusing on surface-level text differences, semantic similarity measures how close two pieces of text are in meaning. This can be more useful for preference annotation as it captures conceptual alignment rather than just textual similarity.
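Any of these alternatives can be plugged into the same chosen/rejected labeling by swapping the scoring function. The sketch below shows that pattern with a bag-of-words cosine similarity, which is only a crude lexical stand-in: a real semantic-similarity scorer would compare sentence embeddings from a trained encoder instead. The names `bow_cosine` and `label_by_similarity` are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between lowercase bag-of-words vectors (crude proxy, not truly semantic)."""
    vec_a, vec_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_by_similarity(variant_a, variant_b, target, score=bow_cosine):
    """Higher score wins; returns (chosen, rejected), or None on a tie."""
    score_a, score_b = score(variant_a, target), score(variant_b, target)
    if score_a == score_b:
        return None
    return (variant_a, variant_b) if score_a > score_b else (variant_b, variant_a)


print(label_by_similarity("old songs were great", "new songs are bad", "old songs were great"))
```

Passing the scorer as an argument keeps the labeling logic unchanged when the metric is replaced, for example by an embedding-based similarity or a task-specific score.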

## **2.3 Run Direct Preference Optimization (DPO)**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs.

In [None]:
import os
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM
from datasets import Dataset
import pandas as pd

# TODO: Run Direct Preference Optimization (DPO)
training_args = DPOConfig(
    beta=0.05,
    output_dir="/content/data/dpo/",
    max_length=tokenizer.model_max_length,
    learning_rate=5.0e-7,
    gradient_accumulation_steps=2,
    max_prompt_length=350,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    remove_unused_columns=True
)
dpo_trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
    max_prompt_length=tokenizer.model_max_length,
    loss_type="sigmoid"
)

dpo_trainer.train()


Deprecated positional argument(s) used in DPOTrainer, please use the DPOConfig to set these arguments instead.


Tokenizing train dataset:   0%|          | 0/15437 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
500,0.6749


TrainOutput(global_step=965, training_loss=0.6653812388681994, metrics={'train_runtime': 1238.6318, 'train_samples_per_second': 12.463, 'train_steps_per_second': 0.779, 'total_flos': 0.0, 'train_loss': 0.6653812388681994, 'epoch': 1.0})

In [None]:
# TODO: Evaluate model, use evaluate_model function
sft_dpo_score = evaluate_model(model, tokenizer, test_ds)

Out of memory at batch size 1000. Reducing batch size.
Out of memory at batch size 500. Reducing batch size.
Out of memory at batch size 250. Reducing batch size.
Out of memory at batch size 125. Reducing batch size.
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62


In [None]:
sft_dpo_score

0.4922954302290912

Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

### Contrastive Preference Optimization (CPO)

Contrastive Preference Optimization (CPO), introduced by Xu et al. in [Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation](https://huggingface.co/papers/2401.08417), trains Large Language Models (LLMs) to avoid producing outputs that are adequate but not perfect, a problem the authors study in Machine Translation (MT). Although it was designed for MT, CPO is a general approximation of the DPO loss and can be applied to other domains, such as chat.

CPO employs contrastive learning, comparing outputs to optimize for more preferred results. Its training format mirrors DPO, requiring three inputs:
- `prompt`
- `chosen`
- `rejected`

This makes CPO a strong alternative to DPO for preference optimization tasks.


DPO has notable drawbacks. First, it is **memory-inefficient**: it needs twice the memory to hold both the parameterized policy and the reference policy at the same time. Second, it is **speed-inefficient**: running the model sequentially for the two policies doubles the processing time. To address these inefficiencies, Contrastive Preference Optimization removes the reference model from the objective, so only the policy being trained has to be kept in memory and run.
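The reference-free objective described in the CPO paper can be sketched for a single pair as a preference term plus a negative log-likelihood term on the chosen response. The code below is plain Python with made-up log-probabilities; `cpo_loss` is a hypothetical helper, and TRL's `CPOTrainer` may weight or average these terms differently.

```python
import math

def cpo_loss(policy_chosen_logp, policy_rejected_logp, beta=0.05):
    """CPO-style loss for one pair: preference term + NLL on the chosen response."""
    # Preference term: same sigmoid form as DPO, but with no reference model
    logits = beta * (policy_chosen_logp - policy_rejected_logp)
    prefer = math.log1p(math.exp(-logits))  # -log(sigmoid(logits))
    # NLL term: keeps the policy assigning high probability to the chosen response
    nll = -policy_chosen_logp
    return prefer + nll, prefer, nll

# Assumed sequence log-probabilities under the policy being trained
total, prefer, nll = cpo_loss(policy_chosen_logp=-10.0, policy_rejected_logp=-14.0)
print(f"total: {total:.4f} (prefer: {prefer:.4f}, nll: {nll:.4f})")
```

Only the policy's own log-probabilities appear in the function signature, which is where the memory and speed savings over DPO come from.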

In [None]:
torch.cuda.empty_cache()

In [None]:
from trl import CPOTrainer, CPOConfig

model_sft = AutoModelForCausalLM.from_pretrained("./sft_out/checkpoint-879", device_map="auto")

training_args = CPOConfig(
    output_dir="./cpo_out/",
    max_length=tokenizer.model_max_length,
    max_prompt_length=tokenizer.model_max_length,
    num_train_epochs=1,
    remove_unused_columns=True,
    label_pad_token_id=tokenizer.eos_token_id,
    per_device_train_batch_size=8,
    logging_steps=100,
    save_steps=500,
    beta=0.05,
)
cpo_trainer = CPOTrainer(
    model=model_sft,
    args=training_args,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)

cpo_trainer.train()



Map:   0%|          | 0/15437 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss
100,1.2328
200,0.5804
300,0.5376
400,0.5444
500,0.5281
600,0.5305
700,0.5123
800,0.5015
900,0.5402
1000,0.5259


TrainOutput(global_step=1930, training_loss=0.5629409562738448, metrics={'train_runtime': 1054.5467, 'train_samples_per_second': 14.639, 'train_steps_per_second': 1.83, 'total_flos': 0.0, 'train_loss': 0.5629409562738448, 'epoch': 1.0})

In [None]:
# cpo_trainer updated model_sft in place, so this evaluates the SFT + CPO model
sft_cpo_score = evaluate_model(model_sft, tokenizer, test_ds)

Out of memory at batch size 1000. Reducing batch size.
Out of memory at batch size 500. Reducing batch size.
Out of memory at batch size 250. Reducing batch size.
Out of memory at batch size 125. Reducing batch size.
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62
Batch processed successfully with batch size: 62


In [None]:
sft_cpo_score

0.4658978175984099

## Comparison of Results

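The table below collects the BLEU scores on the test set: SFT applied to the base SmolLM, and DPO and CPO each applied on top of the SFT checkpoint.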

In [None]:
data = {
    "Model": ["SmolLM", "SmolLM + SFT"],
    "DPO": ["-", sft_dpo_score],
    "CPO": ["-", sft_cpo_score],
    "SFT": [score, "-"]
}

df_result = pd.DataFrame(data)
df_result

| | Model | DPO | CPO | SFT |
|---|---|---|---|---|
| 0 | SmolLM | - | - | 0.476827 |
| 1 | SmolLM + SFT | 0.492295 | 0.465898 | - |
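To put the differences in perspective, the scores can be expressed relative to the SFT baseline. This is a small arithmetic sketch using the rounded values from the table above; `relative_change` is a hypothetical helper.

```python
# BLEU scores copied from the results table (rounded to six decimals)
baseline = 0.476827  # SFT only
scores = {"SFT + DPO": 0.492295, "SFT + CPO": 0.465898}

def relative_change(score, baseline):
    """Percentage change of score with respect to baseline."""
    return 100.0 * (score - baseline) / baseline

for name, value in scores.items():
    print(f"{name}: {value:.4f} ({relative_change(value, baseline):+.2f}% vs. SFT)")
```

With these numbers DPO improves on the SFT baseline by roughly 3.2%, while CPO falls roughly 2.3% below it. Both gaps are small and come from a single run each, so they should be read as indicative rather than conclusive.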


In [None]:
sft_dir = "/content/sft_out/checkpoint-879"
dpo_dir = "/content/data/dpo/checkpoint-965"
cpo_dir = "/content/cpo_out/checkpoint-1930"


model_sft = AutoModelForCausalLM.from_pretrained(sft_dir, device_map="auto")
model_dpo = AutoModelForCausalLM.from_pretrained(dpo_dir, device_map="auto")
model_cpo = AutoModelForCausalLM.from_pretrained(cpo_dir, device_map="auto")

In [None]:
dpo_sample = dpo_dataset.shuffle(seed=10).select(range(10))


Let's compare the text generated by these models.


In [None]:
models = [model_sft, model_dpo, model_cpo]
results = []

def format_text(text: str) -> str:
    prompt = text + " ###>"
    return prompt

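# Tokenize the sampled prompts once; the same inputs are reused for every model
model_inputs = tokenizer(
    [format_text(text) for text in dpo_sample["prompt"]],
    return_tensors="pt",
    padding=True,
    truncation=True
).to("cuda")

for model in models:
    # Note: temperature only takes effect when sampling is enabled (do_sample=True)
    generated_ids = model.generate(**model_inputs, max_new_tokens=128, temperature=0.5)
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Keep only the text after the "###>" separator, i.e. the model's correction
    outputs = [item.split("###>")[-1].strip() for item in outputs]

    results.append(outputs)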


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


In [None]:
results = pd.DataFrame(results).T
results.columns = ["SmolLM + SFT", "SmolLM + DPO", "SmolLM + CPO"]
results

| | SmolLM + SFT | SmolLM + DPO | SmolLM + CPO |
|---|---|---|---|
| 0 | Anyway, I hope I'll be able to receive some appraisal of my writing. | Anyway, I hope I'll be able to receive some appraisal of my writing. | Anyway, I hope I'll be able to receive some appraisal of my writing. |
| 1 | If you find this magazine, please check it out. | If you find this magazine, please check it. | If you find this magazine, please check it. |
| 2 | After all those letters we sent each other, now we can meet for real. | After all those letters we sent each other, now we can meet for real. | After all those letters we sent each other to now we can meet for real. |
| 3 | The princess was a little scared of talking to someone she didn't know because she never got to talk to anyone before, but still, she answered him with a big smile-You are excused, my name is Sophie.- It truly was love at first sight. | The princess was a little scared of talking to someone she didn't know because she never got to talk to anyone before, but still she answered him with a big smile-You are excused, my name is Sophie.- It truly was love at first sight. | The princess was a little scared of talking to someone she didn't know because she never got to talk to anyone before, but still, she answered him with a big smile-You are excused, my name is Sophie.- It truly was love at first sight. |
| 4 | But I quit my job last month, and I have much time to study English on weekdays than on weekends because my family is at home all day on weekends. | But I quit my job last month, I have much time to study English on weekdays than on weekends because my family are at home all day on weekends. | But I quit my job last month, I have much time to study English on weekdays than on weekends because my family are at home all day on weekends. |
| 5 | For the most of men, shopping is so boring and exhausting; but for women, it can be the best way to make their mood better because they are so happy when they buy many clothes on sale, cheaper than in regular price. | For the most of men, shopping is so boring and exhausting, but for women, it can be the best way to make their mood better because they are so happy when they buy many clothes on sale, cheaper than in regular price. | For the most of men, shopping is so boring and exhausting; but for women, it can be the best way to make their mood better because they are so happy when they buy many clothes on sale, cheaper than in regular price. |
| 6 | I am so sad for this, as you know, in China, we have learned English for more than 10 years. | I am so sad for this, as you know, in China, we have learned English more than 10 years. | I am so sad for this, as you know, in China, we have learned English more than 10 years. |
| 7 | Compared to the plane, the train is relatively safe because it can only move on the land. | Compared to the plane, the train is relatively safe because it can only move on the land. | Compared to the plane, the train is relatively safe because it can move only on the land. |
| 8 | I have been studying English since junior high school! | I have been studying English since junior high school! | I have been studying English since junior high! |
| 9 | Being raised in an environment where personal autonomy is the core, children with disabilities were growing very independent and strong despite all the challenges they could face on the way. | Being raised in an environment where personal autonomy is the core, children with disabilities were growing very independent and strong despite all the challenges they could face on the way. | Being raised in an environment where personal autonomy is the core, children with disabilities were growing very independent and strong despite all the challenges they could face on the way. |


I think I prefer the responses from `SmolLM + DPO`, which also achieved the highest BLEU score on the test set.
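Since most rows differ by only a word or a punctuation mark, listing the word-level differences can make side-by-side comparisons easier to read. The sketch below uses Python's standard `difflib` on row 1 of the table above; `word_diff` is a hypothetical helper written for this example.

```python
import difflib

def word_diff(a, b):
    """List the word-level edits that turn sentence a into sentence b."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replacements, insertions and deletions
            edits.append((tag, " ".join(a_words[i1:i2]), " ".join(b_words[j1:j2])))
    return edits

# Row 1 of the comparison table
sft_out = "If you find this magazine, please check it out."
dpo_out = "If you find this magazine, please check it."

for tag, before, after in word_diff(sft_out, dpo_out):
    print(f"{tag}: '{before}' -> '{after}'")
```

Applied to every row and every pair of models, this gives a quick count of how often the models actually disagree, which helps when only ten samples are being inspected by eye.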