# **Background**

Welcome to the C4AI Scholars Program Take-Home Challenge! This exercise is designed to allow you to showcase your engineering and problem solving skills. The Challenge consists of different challenges including:

*   Identifying bugs, and getting the code working. This is designed to test your ability to grapple with real world engineering challenges.
*   Testing your ability to generate code for a specified problem.
*   An opportunity for you to attempt an optional challenge question that extends the original problem set.

These tasks were chosen as a setting to see how you think about problems, even if they are not in your own research field of interest. The tasks and dataset are not meant to be indicative of the research goals of the Scholar Program. We purposefully have selected a simple toy problem so the focus is on how you think, and does not require significant machine learning resources (can be run in this colab).

Good luck!

**How to Use and Submit this Document?**

*   **Make a copy of this document** and rename it **Firstname_Lastname_C4AIScholarsChallenge**
*   Once you have completed all tasks, save and pin your revisions
*   Submit the assignment by responding directly to this email with a link to your final document by Sunday, September 15th, 11 PM PDT.

## **Coding Challenge Part 1: Debugging custom SmolLM code [10 points]**

In this coding challenge, you are required to debug and fix a bare-bones implementation of the following model.

**Model** : SmolLM-135M can be found at [HuggingFace](https://huggingface.co/HuggingFaceTB/SmolLM-135M).

We have 10 bugs in the following implementation.
There is a `check_solution` function for your convenience to verify you have correctly identified all the bugs. If you have found all bugs, the generated outputs will match the reference model exactly.

**Rules**:
1. **Bug Definition:**
  - There are 10 bugs to be fixed.
  - A bug is *defined as **{incorrect, missing, unnecessary}** lines of code*.
  - You earn 1 point for each correctly identified and fixed bug.
2. **Fix Guidelines:**
  - You are encouraged to make the smallest possible fix, wherever possible (e.g. edit a line instead of replacing it entirely).
  - Do not optimize the code; only fix the bugs. The implementation is *intentionally* non-optimized but valid.
3. **Documentation:** Document each fix by adding a comment on the line above the fix: : `### BUG FIX ###`.
4. **Sections:** *1. Setup [Helper Functions]* and *3. Test* don't contain bugs and shouldn't be changed.
5. **Submission:** Your final submission should be the exact same file except with your proposed fixes and the respective comments as per Rule #3.

## 1. Setup [Helper Functions]

In [28]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################


# # Use gdown to get weights file(BareBones_SmolLM-135M.pt) at https://drive.google.com/file/d/1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU/view . gdown should be installed by default else use `pip install gdown`
!gdown 1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU


# [Recommended]Use HF to download the weights
!git lfs install
!git clone https://huggingface.co/dsouzadaniel/C4AI_SMOLLM135
!mv C4AI_SMOLLM135/BareBones_SmolLM-135M.pt ./
!ls

Failed to retrieve file url:

	Too many users have viewed or downloaded this file recently. Please
	try accessing the file again later. If the file you are trying to
	access is particularly large or is shared with many people, it may
	take up to 24 hours to be able to view or download the file. If you
	still can't access a file after 24 hours, contact your domain
	administrator.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=1tY46FSJEhGYRrfKRQTjJ1Cc7q9psaKUU

but Gdown can't. Please check connections and permissions.
Git LFS initialized.
Cloning into 'C4AI_SMOLLM135'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (3/3), done.[K
remote: Compressing objects: 100% (3/3), done.[K
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 3 (from 1)[K
Unpacking objects: 100% (6/6), 2.11 KiB | 1.06 MiB/s, done.
BareBones_SmolLM-135M.pt  C4AI_SMOLLM135  drive  sample_data


In [23]:

# Libraries
import torch
import torch.nn.functional as F
from torch import nn
import math
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model initialization/settings
checkpoint="HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

__reference_model = AutoModelForCausalLM.from_pretrained(checkpoint)
__reference_model.eval()

class smolConfig:
    vocab_size=49152
    hidden_size=576
    intermediate_size=1536
    num_hidden_layers = 30
    num_heads = 9
    kv_heads=3
config = smolConfig

# Helper Functions
def __generate(model, inputs, num_tokens):
    collect = []
    for _ in range(num_tokens):
        output = model(**inputs)
        output_id = torch.argmax(output['logits'][0,-1]).item()
        collect.append(output_id)
        if output_id==tokenizer.eos_token_id:
            break
        inputs['input_ids'] = torch.unsqueeze(torch.cat([inputs['input_ids'][0],torch.tensor([output_id])]),dim=0)
        inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(collect))

def check_solution(prompt, num_tokens, model_A, model_B):
    print()
    print(f"{'>'*20}\n\tPrompt\n{'<'*20}\n{prompt}\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_A Generation\n{'<'*30}\n{__generate(model_A,  model_inputs, num_tokens)}")
    print("\n\n")
    model_inputs = tokenizer(prompt, return_tensors='pt')
    print(f"{'>'*30}\n\tModel_B Generation\n{'<'*30}\n{__generate(model_B,  model_inputs, num_tokens)}")

######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################

In [20]:
__reference_model.config

LlamaConfig {
  "_name_or_path": "HuggingFaceTB/SmolLM-135M",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.44.2",
  "use_cache": true,
  "vocab_size": 49152
}

### Setup for debugging

In [24]:
%pip install icecream



In [25]:
from icecream import ic
ic.configureOutput(includeContext=True)

## 2. Custom SmolLM (for BugFixes)

In [26]:
def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def repeat_kv(hidden_states, n_rep):
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)

class RotaryEmbedder(nn.Module):
    def __init__(self, dim, base):
        super().__init__()
        self.freq = 1/(base ** (torch.arange(0, dim, 2, dtype=torch.int64).float()/dim))

    @torch.no_grad()
    def forward(self,x):
        pos = torch.arange(x.shape[-2],dtype=torch.long)
        ### BUG FIX ###
        ## position and frequency dimensions were swapped. align properly with the sequence length and head dimension
        # angles = torch.einsum('f,p->fp', self.freq, pos.float()).unsqueeze(dim=0)
        angles = torch.einsum('p,f->pf', pos.float(), self.freq).unsqueeze(dim=0)
        emb = torch.cat([angles, angles], dim=-1)
        return emb.cos(), emb.sin()




class MLP(nn.Module):
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.W_gate = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_up = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.W_down = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = torch.nn.modules.activation.SiLU()

    def forward(self, x):
        ### BUG FIX ###
        ## The activation function self.act_fn should be applied to the result of self.W_gate(x) and self.W_up(x) independently, not after multiplying them together.
        ## Apply the activation function to the output of the W_gate projection before multiplying.
        # down_proj = self.W_down(self.act_fn( (self.W_gate(x)) * self.W_up(x) ))
        down_proj = self.W_down(  self.act_fn(self.W_gate(x))   * self.W_up(x) )

        return down_proj

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)

        ### BUG FIX ###
        ## Scale hidden state by the reciprocal square root of the variance.
        # hidden_states = hidden_states * torch.sqrt(variance + self.variance_epsilon)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

        return self.weight * hidden_states


class RopeAttention(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.hidden_size=config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = config.hidden_size//self.num_heads
        self.kv_heads = config.kv_heads
        self.rope_theta = 10000.0

        self.W_query = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.W_key = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_value = nn.Linear(config.hidden_size, self.kv_heads * self.head_dim, bias=False)
        self.W_output = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
        self.rotary_emb = RotaryEmbedder(base=self.rope_theta,
                                         dim=config.hidden_size//self.num_heads)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask= None,
    ):
        b, q, _ = hidden_states.size()

        q_states = self.W_query(hidden_states)
        k_states = self.W_key(hidden_states)
        v_states = self.W_value(hidden_states)

        q_states = q_states.view(b, q, self.num_heads, self.head_dim).transpose(1, 2)
        k_states = k_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)
        v_states = v_states.view(b, q, self.kv_heads, self.head_dim).transpose(1, 2)

        cos, sin = self.rotary_emb(v_states)
        q_states, k_states = apply_rotary_pos_emb(q_states, k_states, cos, sin)

        ### BUG FIX ###
        ## Avoid float number of groups. Substitut by integer division
        # __kv_groups = self.num_heads / self.kv_heads
        __kv_groups = self.num_heads // self.kv_heads

        k_states = repeat_kv(k_states, __kv_groups)
        v_states = repeat_kv(v_states, __kv_groups)


        ### BUG FIX ###
        ## Scaling factor changed to the head dimension (key vector)
        # attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.hidden_size)
        attn_weights = torch.matmul(q_states, k_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        attn_weights = attn_weights + attention_mask
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)

        ### [OPTIONAL] BUG FIX ###
        # This dropout could be removed in inference
        attn_weights = nn.functional.dropout(attn_weights)

        attn_output = torch.matmul(attn_weights, v_states)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(b, q, -1)

        ### BUG FIX ###
        ## Missing multiplication of Output weight matrix
        attn_output = self.W_output(attn_output)

        return attn_output

class LlamaDecoder(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.self_attn = RopeAttention(config)
        self.mlp = MLP(hidden_size=config.hidden_size, intermediate_size=config.intermediate_size)
        self.pre_attn_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)
        self.pre_mlp_rmsnorm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(self,hidden_states, attention_mask):
        #  Input -> RMSNorm -> Attention -> Residual Connection -> RMSNorm -> MLP -> Residual Connection
        residual = hidden_states
        hidden_states = self.pre_attn_rmsnorm(hidden_states)

        ### BUG FIX ###
        ## Causal Attention Mask to ensure utoregressive behaviour ensuring that
        ## each position can only attend to previous positions and not future ones
        ## Substitute attention_mask.shape[-1] by hidden_states.shape[1] to get the sequence length
        # attention_mask = torch.triu(torch.full((attention_mask.shape[-1],attention_mask.shape[-1]), fill_value=float('-inf')),diagonal=1)
        attention_mask = torch.triu(torch.full((hidden_states.shape[1],hidden_states.shape[1]), fill_value=float('-inf')),diagonal=1)


        # Self Attention
        hidden_states = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )
        hidden_states += residual

        # MLP
        ### BUG FIX ###
        ## Residual update to be the input of MLP
        residual_mlp = hidden_states
        hidden_states = self.pre_mlp_rmsnorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states += residual_mlp

        outputs = (hidden_states,)

        return outputs

class smolModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(num_embeddings=config.vocab_size,
                                         embedding_dim=config.hidden_size)
        self.layers = nn.ModuleList([LlamaDecoder(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, eps=1e-05)

    def forward(
        self,
        input_ids= None,
        attention_mask= None,
    ):
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        for decoder_layer in self.layers:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
            )
            hidden_states = layer_outputs[0]
        hidden_states = self.norm(hidden_states)
        return [hidden_states]

class smolLM(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.model = smolModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self,input_ids,attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )

        ### BUG FIX ###
        ## Squeeze loss the batch size and consider the seq_len as a batch
        ## (batch_size, seq_len, hidden_size)
        ## LM_Head applies linear transformation to the last dimension so it not raise an error
        # hidden_states = outputs[0].squeeze()
        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()
        return {'logits':logits}

In [29]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('/content/drive/MyDrive/Colab Notebooks/Cohere/BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()

  __test_model.load_state_dict(torch.load('/content/drive/MyDrive/Colab Notebooks/Cohere/BareBones_SmolLM-135M.pt'), strict=False)


smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=False)
          (W_key): Linear(in_features=576, out_features=192, bias=False)
          (W_value): Linear(in_features=576, out_features=192, bias=False)
          (W_output): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=False)
          (W_up): Linear(in_features=576, out_features=1536, bias=False)
          (W_down): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

In [30]:
__test_model = smolLM(config)
__test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)
__test_model.eval()

  __test_model.load_state_dict(torch.load('BareBones_SmolLM-135M.pt'), strict=False)


smolLM(
  (model): smolModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoder(
        (self_attn): RopeAttention(
          (W_query): Linear(in_features=576, out_features=576, bias=False)
          (W_key): Linear(in_features=576, out_features=192, bias=False)
          (W_value): Linear(in_features=576, out_features=192, bias=False)
          (W_output): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): RotaryEmbedder()
        )
        (mlp): MLP(
          (W_gate): Linear(in_features=576, out_features=1536, bias=False)
          (W_up): Linear(in_features=576, out_features=1536, bias=False)
          (W_down): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (pre_attn_rmsnorm): RMSNorm()
        (pre_mlp_rmsnorm): RMSNorm()
      )
    )
    (norm): RMSNorm()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)

# 3. Test

In [31]:
######################################################################################################################
############################################## DO NOT CHANGE[START] ##################################################
######################################################################################################################

###### TESTING PROMPTS
# Single-Token Quick Test
check_solution(prompt="Given the following film movie by a critic, rate it out of 10. Respond in a single number.\n\nThe movie started off extremely well, but just got worse after that.\nThe storyline was all over the place and everyone acted terribly.\n 10/10 would not recommend! \n\n ",
               num_tokens=1,
               model_A=__reference_model,
               model_B=__test_model)



>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Given the following film movie by a critic, rate it out of 10. Respond in a single number.

The movie started off extremely well, but just got worse after that.
The storyline was all over the place and everyone acted terribly.
 10/10 would not recommend! 

 


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
1



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 palliative


In [32]:
# Multi-Token Quick Test
check_solution(prompt="Where is the Nile located?",
               num_tokens=30,
               model_A=__reference_model,
               model_B=__test_model)


######################################################################################################################
############################################### DO NOT CHANGE[END] ###################################################
######################################################################################################################


>>>>>>>>>>>>>>>>>>>>
	Prompt
<<<<<<<<<<<<<<<<<<<<
Where is the Nile located?


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_A Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

The Nile River is located in the Nile Delta in the Nile River Basin, which is a region of Africa. It is the longest river in the



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
	Model_B Generation
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
ically Rajaqueous innate misplacedBoolean killaceEdited monitoring IdeDiagnosisattaongo naked approved finance cact perpetualMoreoverrip transporting Liberation pian SpendNaturalNatural waspsNaturalreadlines


# **Coding Challenge Part 2: Teach SmolLM to do grammatical error correction [15 points]**

The goal of this part is to train the SmolLM-135M model to perform grammatical error correction (GEC) using the Grammarly CoEdIT dataset. This [dataset](https://huggingface.co/datasets/grammarly/coedit), derived from the [CoEdIT project](https://arxiv.org/abs/2305.09857), provides a rich collection of text editing instructions and examples. The task involves several key steps that mimic conventional alignment processes:




## **2.1 Supervised Fine-Tuning (SFT) on Training Data [5 points]**

* Fine-tune the [SmolLM-135M model](https://huggingface.co/HuggingFaceTB/SmolLM-135M) using the CoEdIT dataset, which includes input sentences with grammatical errors and their corrected versions.
* Use the training GEC portion of the CoEdIT dataset to teach the model how to correct grammatical errors effectively.
* Calculate the BLEU score on the validation set to evaluate the model's performance in generating grammatically correct sentences. Ensure that this evaluation process is reusable for later comparisons.
* Search for an optimal set of hyperparameters, such as the learning rate. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters. **Do not train for more than 3 epochs -- we do not expect extensive training time.**
* For Part 2, don't use additional libraries, if an imported library is missing, install it with **pip install**.

### Setup

In [1]:
%pip install -qU datasets trl wandb evaluate sacrebleu rouge_score

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.3/474.3 kB[0m [31m29.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.1/280.1 kB[0m [31m21.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.5/9.5 MB[0m [31m81.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.0/104.0 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.5 MB/s[0m eta [36m

In [2]:
from datasets import load_dataset

# Download the GEC data
full_train_ds = load_dataset("grammarly/coedit", split="train")
full_test_ds = load_dataset("grammarly/coedit", split="validation")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.88k [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/19.7M [00:00<?, ?B/s]

validation.jsonl:   0%|          | 0.00/692k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/69071 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1712 [00:00<?, ? examples/s]

We can explore the dataset in [HuggingFace](https://huggingface.co/datasets/grammarly/coedit)

In [3]:
# TODO: Filter examples, keeping only GEC task
# Explore the structure of the dataset

print(f'--> Dataset strcutrue: \n {full_train_ds.features}\n')

# Explore the different task in the dataset
print(f'--> Tasks in the dataset {set(full_train_ds["task"])}\n')

train_gec_ds = full_train_ds.filter(lambda example: example['task'] == 'gec', )
test_gec_ds = full_test_ds.filter(lambda example: example['task'] == 'gec')

# Check size of the filter data is correct
assert len(train_gec_ds) == 19823, "Wrong number of train samples"
assert len(test_gec_ds) == 485, "Wrong number of test samples"

train_gec_ds, test_gec_ds

# select a subset of 10 instances for sake of computational limitations
# toy_train_data = train_gec_ds.select(range(10))
# toy_test_data = test_gec_ds.select(range(10))

--> Dataset strcutrue: 
 {'_id': Value(dtype='string', id=None), 'task': Value(dtype='string', id=None), 'src': Value(dtype='string', id=None), 'tgt': Value(dtype='string', id=None)}

--> Tasks in the dataset {'gec', 'neutralize', 'simplification', 'paraphrase', 'clarity', 'coherence'}



Filter:   0%|          | 0/69071 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1712 [00:00<?, ? examples/s]

(Dataset({
     features: ['_id', 'task', 'src', 'tgt'],
     num_rows: 19823
 }),
 Dataset({
     features: ['_id', 'task', 'src', 'tgt'],
     num_rows: 485
 }))

Create the function to process the `src` and `tgt` fileds into a `prompt`.

In [None]:
### PROMPT EXPLORED BUT WITH NO BETTER RESULTS THAN A SIMPLE PROMPT

# def generate_prompt(example):
#     output_texts = []
#     instruction =  "Fix incorrect text and ouput only the correct sentence after ### Correct"
#     for i in range(len(example['src'])):
#         text = f'''
#         ### Instruction:
#             {instruction}

#         ### Incorrect:
#             {example['src'][i]}

#         ### Correct:
#             {example['tgt'][i]}
#         '''
#         output_texts.append(text)
#     return output_texts

In [None]:
def generate_prompt(example):
    output_texts = []
    system_info =  ""
    for i in range(len(example['src'])):
        text = f"Fix grammatically: {example['src'][i]}\n ### Correct: {example['tgt'][i]} <|endoftext|>"
        output_texts.append(text)
    return output_texts

Expected number of train and test samples are 19823 and 485, respectively.

In [None]:
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# I am using this instead of -insctruct because it was the one given in the initial code
model_name = "HuggingFaceTB/SmolLM-135M"

# TODO: Load the model and the tokenizer from huggingface
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, trust_remote_code=True,
                                          truncation=True, padding=True,
                                          return_tensors="pt")

# Check special tokens
print(f"EOS token --> {tokenizer.eos_token}")
print(f"BOS token --> {tokenizer.bos_token}")
print(f"PAD token --> {tokenizer.pad_token}")

# Adding EOS token as padding token
tokenizer.pad_token = tokenizer.unk_token
model.config.pad_token_id = tokenizer.pad_token_id
print(f"[UPDATE] PAD token --> {tokenizer.pad_token}")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

config.json:   0%|          | 0.00/724 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.69k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/831 [00:00<?, ?B/s]

EOS token --> <|endoftext|>
BOS token --> <|endoftext|>
PAD token --> None
[UPDATE] PAD token --> <|endoftext|>


### Sweep Training to explore the best hyperparameters

In [None]:
def train():
    # Initialize a new run for WandB
    wandb.init()

    # Access sweep-configured hyperparameters from WandB config
    config = wandb.config

    # Load the pre-trained model and tokenizer
    model_name = "HuggingFaceTB/SmolLM-135M"
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', use_cache=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, trust_remote_code=True,
                                              truncation=True, padding=True,
                                              return_tensors="pt")
    tokenizer.pad_token = tokenizer.unk_token
    model.config.pad_token_id = tokenizer.pad_token_id


    # Configure SFT with hyperparameters from WandB config
    sft_config = SFTConfig(
        output_dir="./output",
        overwrite_output_dir=True,
        learning_rate=config.learning_rate,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        num_train_epochs=config.epochs,
        weight_decay = config.weight_decay,
        adam_beta1 = config.adam_beta1,
        adam_beta2= config.adam_beta2,
        adam_epsilon = config.adam_epsilon,
        max_grad_norm = config.max_grad_norm,
        lr_scheduler_type = config.lr_scheduler_type,
        warmup_steps = config.warmup_steps,
        packing =False,
        per_device_train_batch_size=16,
        save_strategy="epoch",
        logging_steps = 100
    )
    # Initialize the trainer with the model, datasets, and SFT configuration
    trainer = SFTTrainer(
          model=model,
          tokenizer=tokenizer,
          args=sft_config,
          train_dataset=train_gec_ds,
          eval_dataset=test_gec_ds,
          formatting_func=generate_prompt,
      )

    # Start training
    trainer.train()

    # Log any final metrics (you can log more metrics inside the training loop if needed)
    wandb.log({"final_eval_loss": trainer.evaluate()["eval_loss"]})

    # Finish the WandB run
    wandb.finish()

In [None]:
import wandb
from trl import SFTConfig, SFTTrainer
sweep_config = {
    "method": "bayes",  # You can also use 'grid' or 'bayes'
    "metric": {"name": "final_eval_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "values": [5e-5, 3e-5, 1e-4]  # Exploring different learning rates
        },
        "weight_decay": {
            "values": [0.0, 0.01, 0.1]  # Exploring weight decay
        },
        "epochs": {
            "values": [1]
        },
        "gradient_accumulation_steps": {
            "values": [1, 2, 4]  # Exploring gradient accumulation for smaller GPUs
        },
      "warmup_ratio": {"values": [0, 0.05, 0.1]},
        "adam_beta1": {"values": [0.9]},
        "adam_beta2": {"values": [0.999]},
        "adam_epsilon": {"values": [1e-8]},
        "max_grad_norm": {"values": [1.0, 0.3]},
        "lr_scheduler_type": {"values": ["linear", "cosine"]},
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="C4AI-Challenge-smollm_sft")

# Launch the sweep
wandb.agent(sweep_id, function=train, count=3)

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Create sweep with ID: 7tyqfclk
Sweep URL: https://wandb.ai/huertas_97/C4AI-Challenge-smollm_sft/sweeps/7tyqfclk


[34m[1mwandb[0m: Agent Starting Run: rawvy955 with config:
[34m[1mwandb[0m: 	adam_beta1: 0.9
[34m[1mwandb[0m: 	adam_beta2: 0.999
[34m[1mwandb[0m: 	adam_epsilon: 1e-08
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 1
[34m[1mwandb[0m: 	learning_rate: 0.0001
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	warmup_steps: 100
[34m[1mwandb[0m: 	weight_decay: 0.1
[34m[1mwandb[0m: Currently logged in as: [33mhuertas_97[0m. Use [1m`wandb login --relogin`[0m to force relogin




Map:   0%|          | 0/19823 [00:00<?, ? examples/s]

Map:   0%|          | 0/485 [00:00<?, ? examples/s]



Step,Training Loss
100,1.914
200,1.5278
300,1.4975
400,1.5084
500,1.4874
600,1.4649
700,1.4835
800,1.4893
900,1.486
1000,1.4747


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


VBox(children=(Label(value='0.001 MB of 0.024 MB uploaded\r'), FloatProgress(value=0.05036145551472073, max=1.…

0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
final_eval_loss,▁
train/epoch,▁▂▂▃▃▄▅▅▆▇▇███
train/global_step,▁▂▂▃▃▄▅▅▆▇▇████
train/grad_norm,█▇▆▃▆▂▃▅▁▄▁▅
train/learning_rate,█▇▇▆▅▅▄▄▃▂▂▁
train/loss,█▂▂▂▂▁▁▂▂▁▁▁

0,1
eval/loss,2.51222
eval/runtime,12.2531
eval/samples_per_second,39.582
eval/steps_per_second,4.978
final_eval_loss,2.51222
total_flos,1286615574161280.0
train/epoch,1.0
train/global_step,1239.0
train/grad_norm,1.17422
train/learning_rate,0.0


[34m[1mwandb[0m: Agent Starting Run: u9q77vsn with config:
[34m[1mwandb[0m: 	adam_beta1: 0.9
[34m[1mwandb[0m: 	adam_beta2: 0.999
[34m[1mwandb[0m: 	adam_epsilon: 1e-08
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 1
[34m[1mwandb[0m: 	learning_rate: 5e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	warmup_steps: 100
[34m[1mwandb[0m: 	weight_decay: 0.1




Map:   0%|          | 0/485 [00:00<?, ? examples/s]



Step,Training Loss
100,2.0368
200,1.5379
300,1.5036
400,1.514
500,1.4932
600,1.4707
700,1.4915
800,1.4964
900,1.4947
1000,1.4851


VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
final_eval_loss,▁
train/epoch,▁▂▂▃▃▄▅▅▆▇▇███
train/global_step,▁▂▂▃▃▄▅▅▆▇▇████
train/grad_norm,▆█▇▃▄▂▂▃▁▄▂▆
train/learning_rate,█▇▇▆▅▅▄▄▃▂▂▁
train/loss,█▂▂▂▁▁▁▁▁▁▁▁

0,1
eval/loss,2.50889
eval/runtime,12.2903
eval/samples_per_second,39.462
eval/steps_per_second,4.963
final_eval_loss,2.50889
total_flos,1286615574161280.0
train/epoch,1.0
train/global_step,1239.0
train/grad_norm,1.29682
train/learning_rate,0.0


[34m[1mwandb[0m: Agent Starting Run: g363zrbj with config:
[34m[1mwandb[0m: 	adam_beta1: 0.9
[34m[1mwandb[0m: 	adam_beta2: 0.999
[34m[1mwandb[0m: 	adam_epsilon: 1e-08
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 5e-05
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	max_grad_norm: 1
[34m[1mwandb[0m: 	warmup_steps: 100
[34m[1mwandb[0m: 	weight_decay: 0.1




Step,Training Loss
100,2.0096
200,1.5244
300,1.4911
400,1.5024
500,1.4981
600,1.4726


VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/loss,▁
eval/runtime,▁
eval/samples_per_second,▁
eval/steps_per_second,▁
final_eval_loss,▁
train/epoch,▁▂▄▅▆███
train/global_step,▁▂▄▅▆████
train/grad_norm,█▁▂▂▃▂
train/learning_rate,█▇▅▄▂▁
train/loss,█▂▁▁▁▁

0,1
eval/loss,2.51003
eval/runtime,12.2194
eval/samples_per_second,39.691
eval/steps_per_second,4.992
final_eval_loss,2.51003
total_flos,1285554601635840.0
train/epoch,0.99919
train/global_step,619.0
train/grad_norm,0.87464
train/learning_rate,0.0


### Training with hyperparameters

In [None]:
# TRL - Transformer Reinforcement Learning -- https://huggingface.co/docs/trl/en/index
from trl import SFTConfig, SFTTrainer

# TODO: Run SFT
# For hyperparameter optimization and experiment tracking
import wandb
wandb.init(project="C4AI-Challenge-smollm_sft", entity="huertas_97", config={
    "epochs": 1,
    "learning_rate": 0.0001,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon" : 1e-08,
    "max_grad_norm": 1.0,
    "lr_scheduler_type": 'linear'
})

# Wandb hyperparameter configuration
config = wandb.config
output_dir_sftt = "smollm-gec-sftt"
# Define hyperparameters for fine-tuning using TRL's SFTConfig
sft_config = SFTConfig(
    output_dir=output_dir_sftt,
    overwrite_output_dir=True,
    learning_rate=config.learning_rate,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    num_train_epochs=config.epochs,
    weight_decay = config.weight_decay,
    adam_beta1 = config.adam_beta1,
    adam_beta2= config.adam_beta2,
    adam_epsilon = config.adam_epsilon,
    max_grad_norm = config.max_grad_norm,
    lr_scheduler_type = config.lr_scheduler_type,
    packing =False,

    per_device_train_batch_size=16,
    save_strategy="epoch",
)

# Initialize the trainer with the model, datasets, and SFT configuration
trainer = SFTTrainer(
      model=model,
      tokenizer=tokenizer,
      args=sft_config,
      train_dataset=train_gec_ds,
      eval_dataset=test_gec_ds,
      formatting_func=generate_prompt,
  )

# Now, run the training loop
trainer.train()

wandb.log({"final_eval_loss": trainer.evaluate()["eval_loss"]})

# End the WandB run after training
wandb.finish()





Step,Training Loss
500,1.5307


VBox(children=(Label(value='0.001 MB of 0.001 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
train/epoch,▁█
train/global_step,▁█
train/grad_norm,▁
train/learning_rate,▁
train/loss,▁

0,1
total_flos,1285554601635840.0
train/epoch,0.99919
train/global_step,619.0
train/grad_norm,0.91449
train/learning_rate,1e-05
train/loss,1.5307
train_loss,1.51986
train_runtime,676.3752
train_samples_per_second,29.308
train_steps_per_second,0.915


### Save Model

In [None]:
# Save locally
trainer.save_model(output_dir_sftt)

In [None]:
## push model to huggingface hub
from huggingface_hub import notebook_login
notebook_login()

# push model
trainer.model.push_to_hub("smollm-gec-sftt", use_auth_token=True)

# push the tokenizer
tokenizer.push_to_hub("smollm-gec-sftt", use_auth_token=True)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt")
best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/3.78k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/863 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/722 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

In [None]:
# Load model from /content/smollm-gec-sftt
best_model = AutoModelForCausalLM.from_pretrained("smollm-gec-sftt", device_map='auto',
                                             use_cache=True)


In [None]:
# Quick test if your model works properly
def format_text(text: str) -> str:
    # here you may have formatting of the input that you adopted for training
    # The "Fix grammatically" instruction is already in the user prompt so
    # there is no need to add it like we did in training
    text = f"{text} \n ### Correct:"

    return text

In [None]:
# Example of how to run inference on a single example
text = "Fix grammatically: I likes turtles"
# text = "Fix grammaticality: First of all, from you read just to found in the poems or novel what well-known critic have already found out, you looses the pleasures of reading something which is expecting to be a new experience to you."
inputs = tokenizer(format_text(text), return_tensors="pt", padding=True, truncation=True, max_length=128).to(best_model.device)
outputs = best_model.generate(**inputs, max_new_tokens=128, temperature=0.0,
                               pad_token_id=tokenizer.eos_token_id,
                               eos_token_id=tokenizer.eos_token_id,
                              )
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(answer)

Fix grammatically: I likes turtles 
 ### Correct: I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles.  I like turtles


Expected output: I like turtles.

### Evaluate SFTT

In [None]:
import evaluate
from tqdm.auto import tqdm

# BLEU Score
def evaluate_model(model, tokenizer, ds):
    # TODO - compute and call preds and targets for the bleu.compute in the following.
    preds = []
    targets = []

    # Iterate over the dataset and generate predictions with tqdm
    for example in tqdm(ds):
        # Format the text input
        input_text = f"Fix grammatically: {example['src']}"

        # Tokenize the input text and pass to the model
        inputs = tokenizer(input_text, return_tensors="pt", padding=True,
                           truncation=True, max_length=128).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            num_beams=5,  # Use beam search with 5 beams
            temperature = 0.0, # Deterministic output
            length_penalty=-1.0,  # Adjust length penalty
            early_stopping=True,
            pad_token_id=tokenizer.eos_token_id,  # To handle padding
            eos_token_id=tokenizer.eos_token_id   # Stop at EOS token
        )

        # Decode the generated prediction
        pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if "### Correct:" in pred_text:
          pred_text = pred_text.split("### Correct:")[1].strip()

        # Append the prediction and the reference (correct target) to the lists
        preds.append(pred_text.strip())
        targets.append([example['tgt']])

        # print(f"--> Input: {input_text}\n")
        # print(f"--> Prediction: {pred_text}\n")
        # print(f"--> Target: {example['tgt']}\n\n")


    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=preds, references=targets)

    return results["bleu"]

In [None]:
best_model.device

device(type='cuda', index=0)

In [None]:
# TODO: Evaluate model, use the function given above
bleu_score = evaluate_model(best_model, tokenizer, test_gec_ds)
print(f"BLEU Score: {bleu_score}")

  0%|          | 0/485 [00:00<?, ?it/s]



Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

BLEU Score: 0.34918218226529335


BLEU Score: 0.34918218226529335

Expected BLEU score after 1 epoch SFT is ~ 0.48.

## **2.2 Create a preference optimization dataset [5 points]**

* *Generate Output Variants* -- for each input sentence in the training set, use the fine-tuned model to generate two different output variants.
 * Consider using different decoding strategies, such as varying the temperature or beam size, to produce diverse outputs. Select an approach based on the desired balance between diversity and quality.

* *Preference Annotation* -- measure the edit distance between each **generated predicted variant** and **ground truth correction**. Label the variant with the lower edit distance as "chosen" and the one with the higher edit distance as "rejected."
 * Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?


### Setup

In [None]:
%pip install fast-edit-distance

Collecting fast-edit-distance
  Downloading fast_edit_distance-1.2.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Downloading fast_edit_distance-1.2.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (115 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/115.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.8/115.8 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: fast-edit-distance
Successfully installed fast-edit-distance-1.2.1


### Generate Variants

For sake of feasible computing time
just use 10 instances from the training dataset
10 instances = 75 secs --> So 19825 training instances = 40 hrs

I have measured the time of the code, measuring spearately the inference phase (`generate_variants_batch` function) and the distance comparison and selection (`create_preference_dataset` function).
I have checked that the time execution bottleneck comes from the inference generation phase.

As Google Colab was limiting me the use of GPU, I have tried just with 10 instances so I can run the code for the rest of the challenges with CPU.

Obviously, increasing the `batch_size` parameter in `create_preference_dataset` speed the data generation. But I could not use GPU anymore in Google Colab and in my local desktop I just have 2GB VRAM. So I have tried my best, but I ackowledge the limitation of my experiments.  

In [None]:
import time

def timer(func):
    def wrapper(*args, **kwargs):
        # start the timer
        start_time = time.time()
        # call the decorated function
        result = func(*args, **kwargs)
        # remeasure the time
        end_time = time.time()
        # compute the elapsed time and print it
        execution_time = end_time - start_time
        print(f"Execution time: {execution_time} seconds")
        # return the result of the decorated function execution
        return result
    # return reference to the wrapper function
    return wrapper

In [None]:
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from fast_edit_distance import edit_distance

# TODO: Create preference optimization dataset

@timer
def generate_variants_batch(model, tokenizer, input_texts):
    # Tokenize the batch of input texts
    inputs = tokenizer(
        input_texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    ).to(model.device)

    # Variant 1: Beam search decoding
    beam_outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=5,  # beam search with 5 beams
        temperature=0.0,  # Deterministic output
        length_penalty=-1.0,  # adjust length penalty to be concise
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
        num_return_sequences=1,  # return only the best sequence
    )
    variants_1 = tokenizer.batch_decode(beam_outputs, skip_special_tokens=True)

    # Variant 2: Sampling with temperature
    sampling_outputs = model.generate(
        **inputs,
        max_length=128,
        temperature=0.9,  # use temperature-based sampling
        top_k=50,  # control diversity using top-k sampling
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        num_return_sequences=1,
    )
    variants_2 = tokenizer.batch_decode(sampling_outputs, skip_special_tokens=True)

    return variants_1, variants_2


@timer
def create_preference_dataset(model, tokenizer, dataset, batch_size=32):
    variant_1_count = 0
    variant_2_count = 0
    preference_data = []

    # Create a DataLoader for batching
    dataloader = DataLoader(dataset, batch_size=batch_size)

    model.eval()  # Set the model to evaluation mode

    for batch in tqdm(dataloader):
        input_texts = [f"Fix grammatically: {src}" for src in batch['src']]
        ground_truths = batch['tgt']

        # Generate two variants for the batch of inputs
        variants_1, variants_2 = generate_variants_batch(model, tokenizer, input_texts)

        # Measure edit distances between the variants and the ground truths
        distances_1 = [edit_distance(v1, gt) for v1, gt in zip(variants_1, ground_truths)]
        distances_2 = [edit_distance(v2, gt) for v2, gt in zip(variants_2, ground_truths)]

        for i in range(len(input_texts)):
            dist_variant_1 = distances_1[i]
            dist_variant_2 = distances_2[i]
            variant_1 = variants_1[i]
            variant_2 = variants_2[i]
            input_text = input_texts[i]
            ground_truth = ground_truths[i]

            # Label based on the smaller edit distance
            if dist_variant_1 < dist_variant_2:
                chosen = variant_1
                rejected = variant_2
                variant_1_count += 1
            else:
                chosen = variant_2
                rejected = variant_1
                variant_2_count += 1

            # Add the comparison to the preference dataset
            preference_data.append({
                'input': input_text,
                'ground_truth': ground_truth,
                'variant_1': variant_1,
                'variant_2': variant_2,
                'chosen': chosen,
                'rejected': rejected
            })

    # Reporting statistics
    total_examples = len(dataset)
    norm_variant_1_count = variant_1_count / total_examples * 100
    norm_variant_2_count = variant_2_count / total_examples * 100
    time.sleep(5)  # Simulate a long computation
    print(f"Variant 1 chosen: {variant_1_count} ({norm_variant_1_count:.2f}%)")
    print(f"Variant 2 chosen: {variant_2_count} ({norm_variant_2_count:.2f}%)")

    return preference_data


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd


# Load the best model from huggingface
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt")
best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True)


# For sake of feasible time and not consuming all GPU
# just use 100 instances from the training dataset
# 10 instances = 75 secs --> So 19825 training instances = 40 hrs


toy_train_data = train_gec_ds.select(range(10))


preference_dataset = create_preference_dataset(best_model, tokenizer, toy_train_data)

# save preference_dataset to parqet to be loaded in pandas
df = pd.DataFrame(preference_dataset)
df.to_parquet("dpo_preference_dataset.parquet")

  0%|          | 0/10 [00:00<?, ?it/s]

Variant 1 count: 9 (0.90%)
Variant 2 count: 1 (0.10%)


### Visualize results

In [None]:
# TODO: (Load and) Visualize the created dataset -- display at least 5 lines of the dataset.


# Visualize a sample of the created dataset
def display_preference_samples(preference_data, num_samples=5):
    for i in range(num_samples):
        sample = random.choice(preference_data)
        print(f"\n--> Input: {sample['input']}")
        print(f"--> Ground Truth: {sample['ground_truth']}")
        print(f"--> Variant 1: {sample['variant_1']}")
        print(f"--> Variant 2: {sample['variant_2']}")
        print(f"--> Chosen: {sample['chosen']}")
        print(f"--> Rejected: {sample['rejected']}\n")
        print("-" * 50)


display_preference_samples(preference_dataset, num_samples=5)


--> Input: Fix grammatically: Update to remove grammar errors: The leakage of radioactive gas into the atmosphere prompted the many anti-nuclear demonstrations that sprung up across the America in the following months.
--> Ground Truth: The leakage of radioactive gas into the atmosphere prompted the many anti-nuclear demonstrations that sprang up across America in the following months.
--> Variant 1: Fix grammatically: Update to remove grammar errors: The leakage of radioactive gas into the atmosphere prompted the many anti-nuclear demonstrations that sprung up across the America in the following months.
--> Variant 2: Fix grammatically: Update to remove grammar errors: The leakage of radioactive gas into the atmosphere prompted the many anti-nuclear demonstrations that sprung up across the America in the following months.
 ### Correct: The leakage of radioactive gas into the atmosphere prompted many anti-nuclear demonstrations that sprung up across America in the following months. 2.

### Answer




---



> *Beyond using edit distance, what other metrics or methods could you consider to do preference dataset annotation?*




---

<br>


While edit distance is a good start for comparing the variants with the ground truth, there are other metrics or methods you could consider:




*   BLEU Score: Obviously the one used previously. BLEU isn typically used for machine trasnlation, to check the distance between input and output translation. So we could compute the BLEU score between the generated variant and the ground truth. ome benefits of BLEU is that it also captures n-gram overlaps and fluency.


* ROUGE Score: Another straightforward metrics similar to BLEU. ROUGE focuses on recall and is often used for summarization tasks but could also be adapted for GEC tasks to capture n-gram matches. However, if we consider that for GEC task exact match/overlap of tokens is desired, BLEU would be a better choice.

* Semantic Similarity: Use embedding-based similarity measures (e.g., cosine similarity of sentence embeddings like use din Sentence-Transformers) to compare the generated variants with the ground truth. Similarly, BERTScore implements this similarity score based on embeddings.

* Language Model Perplexity: Measuring the LLM ability to predict next word in the expected sequence can reflect if the model has undestood the task and predict the correct tokens with the grammar errors corrected. A lower perplexity will indicate that the model predicts next word more accurately. This will be very interesting since grammar correction can have different variations and measuring the confidence of the model also provide trustworthiness.

* Human Evaluation: This will require human annotators. Following a RLHF approach, it is more time-consuming but having a expert human to asses the similarity into a score will help the model to have a direct feedback.

## **2.3 Run Direct Preference Optimization (DPO) [5 points]**
* Use the preference optimization dataset to further train the model through DPO, a method that leverages human-like preferences for model training.
* After running DPO, measure the BLEU score on the test set. Compare this performance to the baseline established during the SFT phase.
* Search for an optimal set of hyperparameters, such as the learning rate and number of epochs. We provide an estimated BLEU score that you should aim to achieve after one epoch. However, you may achieve a better score by finding the most suitable hyperparameters.

### Format data for DPO training

In [None]:
import os
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset, load_dataset
import pandas as pd

# TODO: Run Direct Preference Optimization (DPO)


# Create DPO data with the required format with
# 3 entries: prompt, chosen, rejected
def return_prompt_and_responses(samples):
    return {
     "prompt": [
      f"### Input: ```{input}```\n ### Correct: "
      for input in samples["input"]
      ],
      "chosen": samples["chosen"],
      "rejected": samples["rejected"],
    }


# Load the data generated from parquet
dpo_preference_dataset = load_dataset("parquet", data_files={"train": "/content/dpo_preference_dataset.parquet"})
original_columns = dpo_preference_dataset["train"].column_names

# Apply the formatting
dpo_train_dataset = dpo_preference_dataset.map(
 return_prompt_and_responses,
 batched=True,
 remove_columns=original_columns
)["train"]

dpo_train_dataset

Dataset({
    features: ['chosen', 'rejected', 'prompt'],
    num_rows: 10
})

In [None]:
# Load the best model from huggingface (in case it was not loaded)
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt")
best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

best_model_ref = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### Sweep Training for hyperparameter search

In [None]:
def train_dpo():
    # Initialize a new run for WandB
    wandb.init()

    # Access sweep-configured hyperparameters from WandB config
    config = wandb.config

    # Load the sftt trained model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt", use_fast=True, trust_remote_code=True,
                                              truncation=True, padding=True,
                                              return_tensors="pt")
    best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                      device_map='auto',
                                                      use_cache=True,
                                                      )
    best_model_ref = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                      device_map='auto',
                                                      use_cache=True)


    # Configure DPO with hyperparameters from WandB config
    output_dir_sftt = "smollm-gec-sftt"
    output_dir_dpo = "smollm-gec-sftt" + "-dpo"
    dpo_config = DPOConfig(
        output_dir = output_dir_dpo,
        beta=config.beta,
        learning_rate=config.learning_rate,
        num_train_epochs=config.epochs,
        weight_decay = config.weight_decay,
        lr_scheduler_type = config.lr_scheduler_type,
        loss_type=config.loss_type
    )
    # Initialize the DPOtrainer with the model, datasets, and SFT configuration
    toy_test_data = test_gec_ds.select(range(10))
    dpo_trainer = DPOTrainer(
        best_model,
        best_model_ref,
        args=dpo_config,
        train_dataset=dpo_train_dataset,
        eval_dataset=dpo_train_dataset,
        tokenizer=tokenizer,  # for visual language models, use tokenizer=processor instead
    )

    # Start training
    dpo_trainer.train()

    # Log any final metrics (you can log more metrics inside the training loop if needed)
    wandb.log({"final_eval_loss": dpo_trainer.evaluate()["eval_loss"]})

    # Finish the WandB run
    wandb.finish()

In [None]:
import wandb
sweep_config = {
    "method": "bayes",  # You can also use 'grid' or 'bayes'
    "metric": {"name": "final_eval_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {
            "values": [5e-5, 3e-5, 1e-4]  # Exploring different learning rates
        },
        "weight_decay": {
            "values": [0.0, 0.01, 0.1]  # Exploring weight decay
        },
        "epochs": {
            "values": [1]
        },
        "gradient_accumulation_steps": {
            "values": [2, 4]  # Exploring gradient accumulation for smaller GPUs
        },
        "beta": {"values": [0.1]}, # Higher beta means less divergence from the initial policy.
        "loss_type": {"values": ["sigmoid", "robust"]},
        "lr_scheduler_type": {"values": ["linear", "cosine"]},
    }
}

# Initialize the sweep
sweep_id = wandb.sweep(sweep_config, project="C4AI-Challenge-smollm-sft-dpo")

# Launch the sweep
wandb.agent(sweep_id, function=train_dpo, count=3)

Create sweep with ID: dd4nasf8
Sweep URL: https://wandb.ai/huertas_97/C4AI-Challenge-smollm-sft-dpo/sweeps/dd4nasf8


[34m[1mwandb[0m: Agent Starting Run: euhxk6kz with config:
[34m[1mwandb[0m: 	beta: 0.1
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 5e-05
[34m[1mwandb[0m: 	loss_type: sigmoid
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	weight_decay: 0.01




Tokenizing train dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


VBox(children=(Label(value='0.021 MB of 0.021 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/logits/chosen,▁
eval/logits/rejected,▁
eval/logps/chosen,▁
eval/logps/rejected,▁
eval/loss,▁
eval/rewards/accuracies,▁
eval/rewards/chosen,▁
eval/rewards/margins,▁
eval/rewards/rejected,▁
eval/runtime,▁

0,1
eval/logits/chosen,7.13893
eval/logits/rejected,4.01892
eval/logps/chosen,-41.8365
eval/logps/rejected,-150.73849
eval/loss,0.17035
eval/rewards/accuracies,0.9375
eval/rewards/chosen,0.29257
eval/rewards/margins,4.15474
eval/rewards/rejected,-3.86217
eval/runtime,43.1147


[34m[1mwandb[0m: Agent Starting Run: iyjfw6ls with config:
[34m[1mwandb[0m: 	beta: 0.1
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 4
[34m[1mwandb[0m: 	learning_rate: 3e-05
[34m[1mwandb[0m: 	loss_type: robust
[34m[1mwandb[0m: 	lr_scheduler_type: cosine
[34m[1mwandb[0m: 	weight_decay: 0.1




Tokenizing train dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


VBox(children=(Label(value='0.021 MB of 0.021 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/logits/chosen,▁
eval/logits/rejected,▁
eval/logps/chosen,▁
eval/logps/rejected,▁
eval/loss,▁
eval/rewards/accuracies,▁
eval/rewards/chosen,▁
eval/rewards/margins,▁
eval/rewards/rejected,▁
eval/runtime,▁

0,1
eval/logits/chosen,7.12241
eval/logits/rejected,3.99359
eval/logps/chosen,-42.68489
eval/logps/rejected,-135.70238
eval/loss,0.21923
eval/rewards/accuracies,0.9375
eval/rewards/chosen,0.20773
eval/rewards/margins,2.56629
eval/rewards/rejected,-2.35856
eval/runtime,44.333


[34m[1mwandb[0m: Agent Starting Run: exvej4z3 with config:
[34m[1mwandb[0m: 	beta: 0.1
[34m[1mwandb[0m: 	epochs: 1
[34m[1mwandb[0m: 	gradient_accumulation_steps: 2
[34m[1mwandb[0m: 	learning_rate: 3e-05
[34m[1mwandb[0m: 	loss_type: robust
[34m[1mwandb[0m: 	lr_scheduler_type: linear
[34m[1mwandb[0m: 	weight_decay: 0




Tokenizing train dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


VBox(children=(Label(value='0.021 MB of 0.021 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/logits/chosen,▁
eval/logits/rejected,▁
eval/logps/chosen,▁
eval/logps/rejected,▁
eval/loss,▁
eval/rewards/accuracies,▁
eval/rewards/chosen,▁
eval/rewards/margins,▁
eval/rewards/rejected,▁
eval/runtime,▁

0,1
eval/logits/chosen,7.12243
eval/logits/rejected,3.9936
eval/logps/chosen,-42.68505
eval/logps/rejected,-135.70288
eval/loss,0.21923
eval/rewards/accuracies,0.9375
eval/rewards/chosen,0.20771
eval/rewards/margins,2.56632
eval/rewards/rejected,-2.35861
eval/runtime,42.3329


### Train with selected hyperparameters

In [None]:
import wandb
# Initialize a new run for WandB
wandb.init(project="C4AI-Challenge-smollm-sft-dpo", entity="huertas_97", config={
    "epochs": 1,
    "learning_rate": 0.00005,
    "gradient_accumulation_steps": 4,
    "beta": 0.1,
    "weight_decay": 0.01,
    "lr_scheduler_type": 'linear',
    "loss_type": "sigmoid"
})
config = wandb.config

# Load the sftt trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt", use_fast=True, trust_remote_code=True,
                                          truncation=True, padding=True,
                                          return_tensors="pt")
best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )
best_model_ref = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True)


# Configure DPO with hyperparameters from WandB config
output_dir_sftt = "smollm-gec-sftt"
output_dir_dpo = "smollm-gec-sftt" + "-dpo"
dpo_config = DPOConfig(
    output_dir = output_dir_dpo,
    beta=config.beta,
    learning_rate=config.learning_rate,
    num_train_epochs=config.epochs,
    weight_decay = config.weight_decay,
    lr_scheduler_type = config.lr_scheduler_type,
    loss_type=config.loss_type
)
# Initialize the DPOtrainer with the model, datasets, and SFT configuration
toy_test_data = test_gec_ds.select(range(10))
dpo_trainer = DPOTrainer(
    best_model,
    best_model_ref,
    args=dpo_config,
    train_dataset=dpo_train_dataset,
    eval_dataset=dpo_train_dataset,
    tokenizer=tokenizer,  # for visual language models, use tokenizer=processor instead
)

# Start training
dpo_trainer.train()

# Log any final metrics (you can log more metrics inside the training loop if needed)
wandb.log({"final_eval_loss": dpo_trainer.evaluate()["eval_loss"]})

# Finish the WandB run
wandb.finish()

VBox(children=(Label(value='0.011 MB of 0.011 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))



Tokenizing train dataset:   0%|          | 0/10 [00:00<?, ? examples/s]

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


VBox(children=(Label(value='0.021 MB of 0.021 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/logits/chosen,▁
eval/logits/rejected,▁
eval/logps/chosen,▁
eval/logps/rejected,▁
eval/loss,▁
eval/rewards/accuracies,▁
eval/rewards/chosen,▁
eval/rewards/margins,▁
eval/rewards/rejected,▁
eval/runtime,▁

0,1
eval/logits/chosen,7.13893
eval/logits/rejected,4.01892
eval/logps/chosen,-41.8365
eval/logps/rejected,-150.73849
eval/loss,0.17035
eval/rewards/accuracies,0.9375
eval/rewards/chosen,0.29257
eval/rewards/margins,4.15474
eval/rewards/rejected,-3.86217
eval/runtime,43.7462


### Save DPO model

In [None]:
# push model to huggingface hub
from huggingface_hub import notebook_login
notebook_login()

# push model
dpo_trainer.model.push_to_hub("smollm-gec-sftt-dpo", use_auth_token=True)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…



README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Huertas97/smollm-gec-sftt-dpo/commit/fdc96357bc04ed1d174a7a8884e84d2097604231', commit_message='Upload LlamaForCausalLM', commit_description='', oid='fdc96357bc04ed1d174a7a8884e84d2097604231', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# push the tokenizer
tokenizer.push_to_hub("smollm-gec-sftt-dpo", use_auth_token=True)



README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Huertas97/smollm-gec-sftt-dpo/commit/3e16eae476f676e7e31fa162c09ef9a0a766054a', commit_message='Upload tokenizer', commit_description='', oid='3e16eae476f676e7e31fa162c09ef9a0a766054a', pr_url=None, pr_revision=None, pr_num=None)

### Evaluate DPO model

In [None]:
best_model_sftt_dpo = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt-dpo",
                                                  device_map='auto',
                                                  use_cache=True,
                                                           )
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt-dpo")

# TODO: Evaluate model, use evaluate_model function
# toy_test_data = test_gec_ds.select(range(1))
bleu_score = evaluate_model(best_model_sftt_dpo, tokenizer, test_gec_ds)
print(f"BLEU Score: {bleu_score}")

tokenizer_config.json:   0%|          | 0.00/3.78k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/977 [00:00<?, ?B/s]

  0%|          | 0/485 [00:00<?, ?it/s]



BLEU Score: 0.3815442177502429


In [None]:
# TODO: Evaluate model, use evaluate_model function
bleu_score = evaluate_model(best_model_sftt_dpo, tokenizer, test_gec_ds)
print(f"BLEU Score: {bleu_score}")

  0%|          | 0/485 [00:00<?, ?it/s]

BLEU Score: 0.3815442177502429


BLEU Score: 0.3815442177502429 (with just 10 instances)

Expected BLEU score after 1 epoch SFT + DPO is ~ 0.50.

# **Coding Challenge Part 3: Explore Alternative DPO Variants for Improved Model Performance [10 points]**

Consider employing a different version or variant of DPO. Your task is to:

* Choose a variant of DPO or another preference-based optimization method that could potentially enhance the model's performance.
* Describe the specific differences in this approach compared to the initial DPO method used.
* Train the model using this alternative DPO method and measure its performance on the test set using the BLEU score.
* Compare these results with the baseline performance achieved during the initial Supervised Fine-Tuning (SFT) and the first DPO implementation.
* Select a few GEC example after SFT, DPO and this DPO variant phases and compare the quality of the corrections, which one you prefer as human?
* You are allowed to make changes in the preference data annotation to improve the score, e.g. apply different metrics or methods beyond edit distance.
* Discuss the role of any changes in achieving these results. Consider potential trade-offs or limitations introduced by the new approach.

### [OPTION 1] SFTT + KTO

https://huggingface.co/docs/trl/main/en/kto_trainer

#### Format Preferenced Data for KTO training

In [None]:
import random
# Create KTO data with the required format with
# 3 entries: prompt, chosen, rejected
# We obtain more data instances as rejected and chosen are used as separated instances


def return_prompt_and_responses_kto(samples):
    prompts = []
    completions = []
    labels = []

    print(samples['input'])

    # Loop through each sample
    for prompt, chosen, rejected in zip(samples['input'], samples['chosen'], samples['rejected']):
        # Add the "chosen" completions with label True
        prompts.append(f"### Input: ```{prompt}```\n ### Correct: ")
        completions.append(chosen)
        labels.append(True)  # Chosen responses are labeled as True (good)

        # Add the "rejected" completions with label False
        prompts.append(f"### Input: ```{prompt}```\n ### Correct: ")
        completions.append(rejected)
        labels.append(False)  # Rejected responses are labeled as False (bad)

    return {
        "prompt": prompts,
        "completion": completions,
        "label": labels
    }


# Load the data generated from parquet
kto_preference_dataset = load_dataset("parquet", data_files={"train": "/content/dpo_preference_dataset.parquet"})
original_columns = dpo_preference_dataset["train"].column_names

# Apply the formatting
kto_train_dataset = dpo_preference_dataset.map(
 return_prompt_and_responses_kto,
 batched=True,
 remove_columns=original_columns
)["train"]

kto_train_dataset

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

['Fix grammatically: Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.', 'Fix grammatically: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.', 'Fix grammatically: Improve the grammaticality of this sentence: Besides some technologically determinists that allow the development of biometric identification, this technology is also shaped by three social factors, namely, the desire of the society for safety, convenience and economy.', 'Fix grammatically: Remove all grammatical errors from this text: Safety is one of the crucial problems that many countries and companies concern.', 'Fix grammatically: Fix grammaticality in this sentence: On one hand more and more virus and hack can access personal computers, so the secret data and documents may be stolen.', "Fi

Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 20
})

In [None]:
kto_train_dataset[2]

{'prompt': '### Input: ```Fix grammatically: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.```\n ### Correct: ',
 'completion': 'Fix grammatically: Improve the grammaticality: As the number of people grows, the need of habitable environment is unquestionably essential.\n ### Correct: As the number of people grows, the need for a habitable environment is unquestionably essential. 2. ### Correct: As the number of people grows, the need for a habitable environment is unquestionably essential. 3. ### Correct: As the number of people grows, the need for a habitable environment is unquestionably essential. 4. ### Correct: As the number of people grows, the need for a habitable environment is unquestionably essential. 5. ### Correct: As',
 'label': True}

#### Training

In [None]:
# Load the best model from huggingface (in case it was not loaded)
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt")
best_model = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

best_model_ref = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True)

In [None]:
from trl import KTOConfig, KTOTrainer
output_dir_kto = "smollm-gec-sftt" + "-kto"
training_args = KTOConfig(
    output_dir=output_dir_kto,
    learning_rate=5e-5,
    num_train_epochs=1,
    gradient_accumulation_steps=4,
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
    logging_steps = 100
)

kto_trainer = KTOTrainer(
    best_model,
    best_model_ref,
    args=training_args,
    train_dataset=kto_train_dataset,
    tokenizer=tokenizer,
)



Tokenizing train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Extracting KL train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Processing tokenized train dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

Processing tokenized train KL dataset:   0%|          | 0/20 [00:00<?, ? examples/s]

In [None]:
kto_trainer.train()

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


TrainOutput(global_step=1, training_loss=0.375, metrics={'train_runtime': 230.443, 'train_samples_per_second': 0.087, 'train_steps_per_second': 0.004, 'total_flos': 0.0, 'train_loss': 0.375, 'epoch': 1.0})

#### Save Model

In [None]:
## push model to huggingface hub
from huggingface_hub import notebook_login
# notebook_login()

# push model
kto_trainer.model.push_to_hub("smollm-gec-sftt-kto", use_auth_token=True)
tokenizer.push_to_hub("smollm-gec-sftt-kto", use_auth_token=True)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/Huertas97/smollm-gec-sftt-kto/commit/e8c823752d21c323a9180efdb940b452e84cba64', commit_message='Upload tokenizer', commit_description='', oid='e8c823752d21c323a9180efdb940b452e84cba64', pr_url=None, pr_revision=None, pr_num=None)

#### Evaluate SFTT-KTO

I have added SacreBLEU and Rouge metrics

In [4]:
import evaluate
from tqdm.auto import tqdm

# BLEU Score
def evaluate_model_v2(model, tokenizer, ds):
    # TODO - compute and call preds and targets for the bleu.compute in the following.
    preds = []
    targets = []

    # Iterate over the dataset and generate predictions with tqdm
    for example in tqdm(ds):
        # Format the text input
        input_text = f"Fix grammatically: {example['src']}"

        # Tokenize the input text and pass to the model
        inputs = tokenizer(input_text, return_tensors="pt", padding=True,
                           truncation=True, max_length=128).to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            num_beams=5,  # Use beam search with 5 beams
            temperature = 0.0, # Deterministic output
            length_penalty=-1.0,  # Adjust length penalty
            early_stopping=True,
            pad_token_id=tokenizer.eos_token_id,  # To handle padding
            eos_token_id=tokenizer.eos_token_id   # Stop at EOS token
        )

        # Decode the generated prediction
        pred_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if "### Correct:" in pred_text:
          pred_text = pred_text.split("### Correct:")[1].strip()

        # Append the prediction and the reference (correct target) to the lists
        preds.append(pred_text.strip())
        targets.append([example['tgt']])

        # print(f"--> Input: {input_text}\n")
        # print(f"--> Prediction: {pred_text}\n")
        # print(f"--> Target: {example['tgt']}\n\n")


    bleu = evaluate.load("bleu")
    sacrebleu = evaluate.load("sacrebleu")
    rouge_metric = evaluate.load("rouge")
    results_bleu = bleu.compute(predictions=preds, references=targets)
    results_sacrebleu = sacrebleu.compute(predictions=preds, references=targets)
    results_rouge = rouge_metric.compute(predictions=preds, references=targets)

    results = {
        "bleu": results_bleu,
        "sacrebleu": results_sacrebleu,
        "rouge": results_rouge
    }

    return results

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
best_model_sftt_kto = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt-kto",
                                                  device_map='auto',
                                                  use_cache=True,
                                                           )
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt-kto")

# TODO: Evaluate model, use evaluate_model function
# toy_test_data = test_gec_ds.select(range(5))
result_scores = evaluate_model_v2(best_model_sftt_kto, tokenizer, test_gec_ds)
print(f"BLEU Score: {result_scores['bleu']}")
print(f"SacreBLEU Score: {result_scores['sacrebleu']}")
print(f"ROUGE Score: {result_scores['rouge']}")

  0%|          | 0/485 [00:00<?, ?it/s]



BLEU Score: {'bleu': 0.38305905818897024, 'precisions': [0.6057244726156285, 0.438301013218815, 0.32799017768834626, 0.24726036898321543], 'brevity_penalty': 1.0, 'length_ratio': 1.2280964930062843, 'translation_length': 30291, 'reference_length': 24665}
SacreBLEU Score: {'score': 38.305905818897024, 'counts': [18348, 13064, 9617, 7130], 'totals': [30291, 29806, 29321, 28836], 'precisions': [60.57244726156284, 43.8301013218815, 32.799017768834624, 24.726036898321542], 'bp': 1.0, 'sys_len': 30291, 'ref_len': 24665}
ROUGE Score: {'rouge1': 0.6690840471208053, 'rouge2': 0.4965116394929687, 'rougeL': 0.6533155539118879, 'rougeLsum': 0.6533712604141934}


### [OPTION 2] Create KTO without SFTT

Instead of using SFTT model to generate variants, we can create a new dataset for KTO training directly from the original "src" and "tgt" pairs without relying on the SFT phase.

The logic is:

* Use the "src" (input with grammatical errors) as the prompt.
* Use the corresponding "tgt" (corrected text) as the completion and label it as True (since it's the correct completion).
* For False labels, you can pair the "src" with an incorrect completion by using the "tgt" from another random example (this introduces noise and simulates undesirable completions).

#### Format Data

In [None]:
import random
import pandas as pd
from datasets import Dataset

# Function to create KTO dataset from the original dataset with "src" and "tgt"
def create_kto_dataset(samples):
    prompts = []
    completions = []
    labels = []

    src_texts = samples["src"]
    tgt_texts = samples["tgt"]

    # First, create 'True' labeled pairs
    for src, tgt in zip(src_texts, tgt_texts):
        prompts.append(src)           # Prompt is the "src" text (input with errors)
        completions.append(tgt)        # Completion is the correct "tgt" text (corrected text)
        labels.append(True)            # Label as True since it's the correct completion

    # Now, create 'False' labeled pairs by associating "src" with a random incorrect "tgt"
    for src in src_texts:
        incorrect_tgt = random.choice(tgt_texts)  # Pick a random tgt (incorrect completion)
        print(src)
        while incorrect_tgt == tgt_texts[src_texts.index(src)]:
            # Ensure we don't accidentally pick the correct tgt
            incorrect_tgt = random.choice(tgt_texts)

        prompts.append(src)           # Prompt is the same "src"
        completions.append(incorrect_tgt)  # Completion is an incorrect tgt
        labels.append(False)          # Label as False since it's incorrect

    # Return as a dictionary to be converted into a Hugging Face Dataset later
    return {
        "prompt": prompts,
        "completion": completions,
        "label": labels
    }

toy_train_data = train_gec_ds.select(range(100))

kto_v2_train_dataset = toy_train_data.map(
    create_kto_dataset,
    batched=True,
    remove_columns=toy_train_data.column_names  # Remove old columns
)

display(kto_v2_train_dataset)
display(kto_v2_train_dataset[0])



Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 200
})

{'prompt': 'Remove all grammatical errors from this text: For example, countries with a lot of deserts can terraform their desert to increase their habitable land and using irrigation to provide clean water to the desert.',
 'completion': 'For example, countries with a lot of deserts can transform their desert to increase their habitable land and use irrigation to provide clean water to the desert.',
 'label': True}

#### Training

In [None]:
from trl import KTOConfig, KTOTrainer
output_dir_kto = "smollm-gec" + "-kto"
training_args = KTOConfig(
    output_dir=output_dir_kto,
    learning_rate=5e-5,
    num_train_epochs=1,
    gradient_accumulation_steps=4,
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
    logging_steps = 100
)

from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

base_model_ref = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M",
                                                  device_map='auto',
                                                  use_cache=True)

tokenizer.pad_token = tokenizer.eos_token
base_model.config.pad_token_id = tokenizer.eos_token_id
base_model_ref.config.pad_token_id = tokenizer.eos_token_id

kto_trainer = KTOTrainer(
    base_model,
    base_model_ref,
    args=training_args,
    train_dataset=kto_v2_train_dataset,
    eval_dataset=kto_v2_train_dataset,
    tokenizer=tokenizer,
)

Tokenizing train dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Extracting KL train dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Processing tokenized train dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Processing tokenized train KL dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Extracting eval KL dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Processing tokenized eval dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

Processing tokenized eval KL dataset:   0%|          | 0/200 [00:00<?, ? examples/s]

In [None]:
kto_trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss


TrainOutput(global_step=6, training_loss=0.4428652922312419, metrics={'train_runtime': 578.7323, 'train_samples_per_second': 0.346, 'train_steps_per_second': 0.01, 'total_flos': 0.0, 'train_loss': 0.4428652922312419, 'epoch': 0.96})

#### Save Model

In [None]:
## push model to huggingface hub
from huggingface_hub import notebook_login
# notebook_login()

# push model
kto_trainer.model.push_to_hub("smollm-gec-kto", use_auth_token=True)
tokenizer.push_to_hub("smollm-gec-kto", use_auth_token=True)


model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Huertas97/smollm-gec-kto/commit/7c4881b0e7c510bf23feed9c3d7b029da031a2c0', commit_message='Upload tokenizer', commit_description='', oid='7c4881b0e7c510bf23feed9c3d7b029da031a2c0', pr_url=None, pr_revision=None, pr_num=None)

#### Evalaute KTO

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer
best_model_kto = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-kto",
                                                  device_map='auto',
                                                  use_cache=True,
                                                           )
tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-kto")

# TODO: Evaluate model, use evaluate_model function
# toy_test_data = test_gec_ds.select(range(5))
result_scores = evaluate_model_v2(best_model_kto, tokenizer, test_gec_ds)
print(f"BLEU Score: {result_scores['bleu']}")
print(f"SacreBLEU Score: {result_scores['sacrebleu']}")
print(f"ROUGE Score: {result_scores['rouge']}")

config.json:   0%|          | 0.00/722 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.72k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/863 [00:00<?, ?B/s]

  0%|          | 0/485 [00:00<?, ?it/s]



Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

BLEU Score: {'bleu': 0.22572858052952735, 'precisions': [0.37572731561566286, 0.2614761744100661, 0.18951144242715723, 0.13944593965639354], 'brevity_penalty': 1.0, 'length_ratio': 2.0625177376849786, 'translation_length': 50872, 'reference_length': 24665}
SacreBLEU Score: {'score': 22.572858052952736, 'counts': [19114, 13175, 9457, 6891], 'totals': [50872, 50387, 49902, 49417], 'precisions': [37.572731561566286, 26.14761744100661, 18.951144242715724, 13.944593965639354], 'bp': 1.0, 'sys_len': 50872, 'ref_len': 24665}
ROUGE Score: {'rouge1': 0.5143754513454688, 'rouge2': 0.3682031475650668, 'rougeL': 0.49822802556953366, 'rougeLsum': 0.4965161813106641}


## Compare otuputs

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_inference(model, tokenizer, input_text):
    inputs = tokenizer(input_text, return_tensors="pt", padding=True,
                    truncation=True, max_length=128).to(model.device)
    beam_output = model.generate(
        **inputs,
        max_length=128,
        num_beams=5,  # Use beam search with 5 beams
        temperature = 0.9, # Deterministic output
        length_penalty=-1.0,  # Adjust length penalty
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(beam_output[0], skip_special_tokens=True)



tokenizer = AutoTokenizer.from_pretrained("Huertas97/smollm-gec-sftt")
best_model_sftt = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

best_model_sftt_dpo = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt-dpo",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

best_model_sftt_kto = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-sftt-kto",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

best_model_kto = AutoModelForCausalLM.from_pretrained("Huertas97/smollm-gec-kto",
                                                  device_map='auto',
                                                  use_cache=True,
                                                  )

config.json:   0%|          | 0.00/722 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/538M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

In [None]:
id = 100
print(test_gec_ds[id])

input_text = test_gec_ds[id]["src"]
output_sftt = generate_inference(best_model_sftt, tokenizer, input_text)
print(f"{'='*25} SFTT {'='*25}\n {output_sftt}\n\n{'='*55}\n")

output_sftt_dpo = generate_inference(best_model_sftt_dpo, tokenizer, input_text)
print(f"{'='*25} SFTT+DPO {'='*25}\n {output_sftt_dpo}\n\n{'='*55}\n")

output_sftt_kto = generate_inference(best_model_sftt_kto, tokenizer, input_text)
print(f"{'='*25} SFTT+KTO {'='*25}\n {output_sftt_kto}\n\n{'='*55}\n")


output_sftt_kto = generate_inference(best_model_kto, tokenizer, input_text)
print(f"{'='*25} KTO {'='*25}\n {output_sftt_kto}\n\n{'='*55}\n")



{'_id': '230', 'task': 'gec', 'src': 'Fix grammaticality in this sentence: I certify between Liana being a very problematic employees.', 'tgt': 'I can attest that Liana was a very problematic employee.'}
 Fix grammaticality in this sentence: I certify between Liana being a very problematic employees.


 Fix grammaticality in this sentence: I certify between Liana being a very problematic employees.


 Fix grammaticality in this sentence: I certify between Liana being a very problematic employees.


 Fix grammaticality in this sentence: I certify between Liana being a very problematic employees.
Grammatically correct: I certify between Liana being a very problematic employees.




In [None]:
import warnings
warnings.filterwarnings('ignore')
id = 305
display(test_gec_ds[id])

input_text = test_gec_ds[id]["src"]
output_sftt = generate_inference(best_model_sftt, tokenizer, input_text)
print(f"{'='*25} SFTT {'='*25}\n {output_sftt}\n\n{'='*55}\n")

output_sftt_dpo = generate_inference(best_model_sftt_dpo, tokenizer, input_text)
print(f"{'='*25} SFTT+DPO {'='*25}\n {output_sftt_dpo}\n\n{'='*55}\n")

output_sftt_kto = generate_inference(best_model_sftt_kto, tokenizer, input_text)
print(f"{'='*25} SFTT+KTO {'='*25}\n {output_sftt_kto}\n\n{'='*55}\n")


output_sftt_kto = generate_inference(best_model_kto, tokenizer, input_text)
print(f"{'='*25} KTO {'='*25}\n {output_sftt_kto}\n\n{'='*55}\n")



{'_id': '712',
 'task': 'gec',
 'src': "Fix grammaticality of the sentence: I feel South East Asia would be worth your time, the most expensive things you'd does been getting your plane tickets, but other during that it being smooth sailings.",
 'tgt': "I feel Southeast Asia would be worth your time. The most expensive thing you'll do is get your plane tickets, but other than that, it's smooth sailing."}

 Fix grammaticality of the sentence: I feel South East Asia would be worth your time, the most expensive things you'd does been getting your plane tickets, but other during that it being smooth sailings.


 Fix grammaticality of the sentence: I feel South East Asia would be worth your time, the most expensive things you'd does been getting your plane tickets, but other during that it being smooth sailings.


 Fix grammaticality of the sentence: I feel South East Asia would be worth your time, the most expensive things you'd does been getting your plane tickets, but other during that it being smooth sailings.


 Fix grammaticality of the sentence: I feel South East Asia would be worth your time, the most expensive things you'd does been getting your plane tickets, but other during that it being smooth sailings. I feel South East Asia would be worth your time, the most expensive things you'd does being getting your plane tickets, but other during that it being smooth sailings. I feel Sou

#### Human Selection

El que más me gusta a mí es

### Alternative Preferenced Data Generation

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M",
                                                  device_map='auto',
                                                  use_cache=True,
                                                           )
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")
tokenizer.pad_token = tokenizer.eos_token
base_model.config.pad_token_id = tokenizer.eos_token_id
# TODO: Evaluate model, use evaluate_model function
# toy_test_data = test_gec_ds.select(range(5))
result_scores = evaluate_model_v2(base_model, tokenizer, test_gec_ds)
print(f"BLEU Score: {result_scores['bleu']}")
print(f"SacreBLEU Score: {result_scores['sacrebleu']}")
print(f"ROUGE Score: {result_scores['rouge']}")

  0%|          | 0/485 [00:00<?, ?it/s]

BLEU Score: {'bleu': 0.20468702015878654, 'precisions': [0.3405030727454623, 0.23749797264421257, 0.17203577791513652, 0.12617156691916875], 'brevity_penalty': 1.0, 'length_ratio': 2.269450638556659, 'translation_length': 55976, 'reference_length': 24665}
SacreBLEU Score: {'score': 20.46870201587866, 'counts': [19060, 13179, 9463, 6879], 'totals': [55976, 55491, 55006, 54521], 'precisions': [34.050307274546235, 23.749797264421257, 17.203577791513652, 12.617156691916875], 'bp': 1.0, 'sys_len': 55976, 'ref_len': 24665}
ROUGE Score: {'rouge1': 0.4756876337787921, 'rouge2': 0.3403056165958035, 'rougeL': 0.45990730321313206, 'rougeLsum': 0.45982247139202626}


### Discussion

|           | BLEU score |
|-----------|------------|
| Baseline      | 0.201       |
| SFT     | 0.35       |
| SFT+DPO* | 0.382      |
| SFT+KTO* | 0.383      |
| KTO**     | 0.226      |

(*) Using just 10 training instances from the optimized preference datasets, due to computational limits to generate variants.

(**) Using 100 training instances from the original training GEC task

---

Firstly, what is remarkable indeed is that all the techinques are above the baseline (base pre-trained mdoel).

From these results, it’s evident that both SFT+DPO and SFT+KTO improved the BLEU score, even though slightly, compared to SFT alone. Both methods were trained using only 10 instances from the preference-optimized datasets due to computational limitations, making the improvement even more noteworthy. Remarkably, this suggests that preference-based optimization techniques, even with a small number of labeled examples, can have a significant impact.

Interestingly, the KTO model trained from scratch (i.e., without prior SFT) performed better than the baseline but worse than SFT and its variants.


SFT+KTO seems to be slightly better and can be due to the nature of KTO  for data formatting. For KTO data we broke down preference pairs into individual instances, doubling the number of training examples. This likely the reason why we observe the slightly better performance of SFT+KTO over SFT+DPO. However, the method still depends on having a solid base model (SFT) for effective results.

However, from my (human) point of view, despite the incremental improvements in BLEU scores, neither DPO nor KTO fully addressed the GEC task.
I noticed that while these techniques did enhance performance, the outputs still fell short of expectations for correcting grammatical errors consistently (as the examples in `Compared Outputs` section). One key takeaway from my observation is that decoding parameters such as beam search, temperature, and top-k sampling have a greater influence on the quality of the generated text than the fine-tuning or preference-based methods themselves.

### Limitation
It’s essential to acknowledge the computational limitations in my experiments, so the results observed are not string evidence for the conclusion one can make.
Related to this, with more computing resources and larger preference-annotated dataset, we could expect to improve further the BLUE score and succesfully address the GEC task.



