# Convergence ML Engineer / Researcher Take-home

We'd like to learn a little more about how you practically approach a small research-like project loosely based on Rejection Sampling Fine-tuning (aka RFT, introduced in https://arxiv.org/abs/2308.01825).

Tip: focus on section 3.3 ("Rejection Sampling Fine-tuning"). The paper isn't the best written, and we're happy to clarify anything.

We provide some skeleton code to show what we would like to see from you. If you have ideas for a different structure that you feel is better or more elegant, feel free to rewrite and replace at will.

Note: your final submission does not have to be in a colab notebook, does not have to use Hugging Face, etc.

---

## A note from the team

We want to give you a chance to show off some of your best abilities.

For some people that might mean generating high quality data in a smart way. For others, it might be speeding up the whole process to enable easy reproducibility, and maybe organizing the code in a better way than given. Yet for others, it might be a chance to show off some modern policy optimization techniques like DPO or its variants. Or maybe focusing on solid evaluations and identifying limitations of small models and limited fine-tuning.

An ideal outcome of course is some sense of the model improving its mathematical abilities, but it’s not a bad thing if the final evaluation somehow shows equal or worse performance 😂 (negative results are results).

Ask lots of questions! We're happy to answer any questions about the assignment, and to discuss concepts like RFT.

# Setup [ignore - just run]
***

In [2]:
#!pip3 install -r requirements.txt

In [3]:
%load_ext autoreload
%autoreload 2

import os  
import torch
import random 
import datasets
import numpy as np 
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from torch.utils.data import DataLoader

from data import GSM8KDataset, _apply_template
from prompt import EvalTemplate

import utils
import generation 

os.environ["TOKENIZERS_PARALLELISM"] = "true"
SEED = 128 
MODEL_NAME = "microsoft/Phi-3.5-mini-instruct"

# set seeds
torch.random.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

In [3]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # bf16 weights to speed up inference
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

train_dataset = datasets.load_dataset('gsm8k', 'main')['train']
val_dataset = datasets.load_dataset('gsm8k', 'main')['test'] 
print(f"Num Training instances: {len(train_dataset)}")
print(f"Num Validation instances: {len(val_dataset)}")


Num Training instances: 7473
Num Validation instances: 1319


# Dataset
****

## Inspection 

In [4]:
train_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [5]:
for _ in range(5):
    seed = np.random.randint(0, len(train_dataset))
    print("*"*100)
    print(f"Checking instance {seed}:")
    utils.inspect_instance(train_dataset, seed)

****************************************************************************************************
Checking instance 3282:
question
It’s February 2021.  Mark was born in January 1976.  Graham is 3 years younger than Mark, and Graham’s sister, Janice, is 1/2 the age of Graham.  How old is Janice?
answer
It’s 2021 and Mark was born in 1976 so Mark is 2021-1976 = <<2021-1976=45>>45 years old
Graham is 3 years younger than Mark who is 45 so Graham is 45-3 = 42 years old
Janice is 1/2 the age of Graham who is 42 so Janice is 42/2 = <<42/2=21>>21 years old
#### 21
**************************************************
****************************************************************************************************
Checking instance 7251:
question
Melissa sells a coupe for $30,000 and an SUV for twice as much. If her commission is 2%, how much money did she make from these sales?
answer
First find the total cost of the SUV: $30,000 * 2 = $<<30000*2=60000>>60,000
Then add the cost of the coup

## Extract statistics 

For now we only look at the train set, to extract statistics that will be used during inference:

- Maximum length (num_tokens) of question: 239
- Maximum length (num_tokens) of answer: 475 
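
These maxima come from tokenizing every question and answer and keeping the longest. The helpers in `utils.GSM8KParser` are not shown in this notebook, so the following is only a minimal sketch of the idea; `max_token_length` is a hypothetical name, and whitespace splitting stands in for the real tokenizer.

```python
from typing import Callable, Iterable, List

def max_token_length(texts: Iterable[str], encode: Callable[[str], List]) -> int:
    """Return the largest number of tokens any text produces under `encode`."""
    return max((len(encode(t)) for t in texts), default=0)

# Stand-in tokenizer for illustration only: whitespace splitting.
# With a Hugging Face tokenizer, `tokenizer.encode` could be passed instead.
questions = [
    "How old is Janice?",
    "How many clips did Natalia sell altogether?",
]
longest = max_token_length(questions, str.split)
```

Note that `tokenizer.encode` may add special tokens, which then count toward the reported length.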

In [6]:
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_question_length(x['question'], tokenizer)
)

train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_answer_length(x['answer'], tokenizer) 
)
print(f"Maximum answer num_tokens: {max(train_dataset['answer_length'])}")
print(f"Maximum question num_tokens: {max(train_dataset['question_length'])}")

Maximum answer num_tokens: 475
Maximum question num_tokens: 239


## Extract Answer 

In [7]:
# infer number of hops 
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)

# infer answers using the ground-truth parser
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_answer_from_gt(x['answer'])
)
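
`get_num_hops` lives in `utils` and is not shown here. One plausible proxy, assumed for this sketch rather than taken from the actual implementation, is to count the calculator annotations `<<...>>` that GSM8K attaches to each calculation. `count_hops` is a hypothetical name.

```python
import re

# GSM8K marks each calculation as <<expression=result>>.
CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")

def count_hops(answer: str) -> int:
    """Hypothetical hop count: number of <<...>> annotations in a GSM8K answer."""
    return len(CALC_ANNOTATION.findall(answer))

example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
hops = count_hops(example)
```

This proxy can undercount: instance 3282 above has three reasoning steps but only two annotations.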

In [8]:
# Optional cell: verify that parsing from the ground truth and
# parsing from a completion yield the same result.
# Infer answers using the prediction parser.
answer_str_inf = [
    utils.GSM8KParser.get_answer_from_pred(x)['answer_str_digit'] \
    for x in train_dataset['answer']
]
assert answer_str_inf == train_dataset['answer_str_digit']
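
The parsers themselves are defined in `utils`. As a rough sketch of what extracting the final answer can look like (not the actual `get_answer_from_pred`), a regular expression can pick up the first number after `####`. The function name and the exact pattern are assumptions; the `<INVALID_ANSWER>` sentinel mirrors the value that appears in the outputs later in this notebook.

```python
import re

INVALID_ANSWER = "<INVALID_ANSWER>"
# "####", optional spaces and dollar sign, then a number that may contain
# thousands separators and a decimal part.
FINAL_ANSWER = re.compile(r"####\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)")

def extract_final_answer(text: str) -> str:
    """Return the first number following '####', or a sentinel if absent."""
    match = FINAL_ANSWER.search(text)
    if match is None:
        return INVALID_ANSWER
    return match.group(1).replace(",", "")

first = extract_final_answer("48 + 24 = 72\n#### 72")
second = extract_final_answer("#### 1,234.50 and then more text #### 9")
third = extract_final_answer("no final marker here")
```

Taking the first match means text generated after the answer is ignored. Answers are compared as strings here, so `10.00` and `10` would count as different; a numeric comparison may be preferable.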

## Collate Dataset 
- Instead of collating batches on the fly with a `collate_fn`, I pre-tokenize the dataset into a static, padded format. This avoids tokenizing during training and should lead to faster training runs.
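
`GSM8KDataset` is defined in `data.py` and not shown here. A minimal sketch of static pre-tokenization, assuming right padding to a fixed length and the usual `-100` label mask on prompt and padding positions (`build_example` is a hypothetical helper):

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value ignored by PyTorch cross-entropy by default

def build_example(
    prompt_ids: List[int],
    answer_ids: List[int],
    max_length: int,
    pad_id: int,
) -> Dict[str, List[int]]:
    """Concatenate prompt and answer, mask the prompt in labels, right-pad."""
    input_ids = prompt_ids + answer_ids
    if len(input_ids) > max_length:
        raise ValueError("sequence longer than max_length")
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }

example = build_example([5, 6, 7], [8, 9], max_length=8, pad_id=0)
```

Because every example has the same length, the default `DataLoader` collation can stack them without a custom `collate_fn`.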

In [9]:
train_dataset = datasets.load_dataset('gsm8k', 'main')['train']
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)
TrainData = GSM8KDataset(train_dataset, tokenizer)

Maximum answer num_tokens: 477
Maximum question num_tokens: 334
Maximum sequence num_tokens: 811
Maximum new tokens in generation: 1024
Setup Completed dataset:
Dataset({
    features: ['question', 'answer', 'num_hops', 'answer_str_digit', 'formatted_question', 'formatted_answer', 'question_length', 'answer_length', 'question_input_ids', 'question_attention_mask', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 7473
})


In [10]:
val_dataset = datasets.load_dataset('gsm8k', 'main')['test'] 
val_dataset = val_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)
valData = GSM8KDataset(val_dataset, tokenizer)

Maximum answer num_tokens: 430
Maximum question num_tokens: 310
Maximum sequence num_tokens: 740
Maximum new tokens in generation: 1024
Setup Completed dataset:
Dataset({
    features: ['question', 'answer', 'num_hops', 'answer_str_digit', 'formatted_question', 'formatted_answer', 'question_length', 'answer_length', 'question_input_ids', 'question_attention_mask', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1319
})


## Validation 

* We want to validate that our manual padding is done correctly.

In [11]:
dummy_dataloader = DataLoader(TrainData, batch_size=2, shuffle=False)
for batch in dummy_dataloader:
    print(batch.keys())
    assert batch['input_ids'].shape[0] == 2
    assert (batch['input_ids'] == TrainData[:2]['input_ids']).all()
    break 
# first, we validate that it loads a homogeneous batch of data

dict_keys(['question_input_ids', 'question_attention_mask', 'input_ids', 'attention_mask', 'labels', 'question', 'answer', 'num_hops', 'answer_str_digit', 'formatted_question', 'formatted_answer', 'question_length', 'answer_length'])


In [12]:
instance = TrainData[0]
# check attention mask for the entire sequence
print(tokenizer.decode(
    instance["input_ids"][instance['attention_mask']!=0]
))

<|system|> You are a highly intelligent assistant who is exceptional at solving Math Problems.<|end|><|user|> *Task*    
Think step by step to solve the following question:
```question
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
```

*Format*
1. Write all equations in a single line wihtout breaks in the middle.
2. End generation with pattern "#### <DIGITS>". Replace <DIGITS> with your final answer<|end|><|assistant|> Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|end|><|endoftext|>


In [13]:
# check labels for the entire sequence
print(tokenizer.decode(
    instance["input_ids"][instance['labels']!=-100]
))
# notice that the loss is not computed on <|assistant|>: it is treated as part of the question

Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|end|><|endoftext|>


In [14]:
print(tokenizer.decode(
    instance["question_input_ids"][instance["question_attention_mask"]!=0]
))
# notice that question_input_ids includes everything up to and including <|assistant|>

<|system|> You are a highly intelligent assistant who is exceptional at solving Math Problems.<|end|><|user|> *Task*    
Think step by step to solve the following question:
```question
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
```

*Format*
1. Write all equations in a single line wihtout breaks in the middle.
2. End generation with pattern "#### <DIGITS>". Replace <DIGITS> with your final answer<|end|><|assistant|>


# Model Vibe Check 
*** 

## Generation Config

In [15]:
generation_config = {
    "max_new_tokens" : TrainData.inf_seq_length,
    "temperature": 0.1,
    "num_return_sequences":1,
    "top_p": 0.9,
    "eos_token_id":tokenizer.eos_token_id,  # Specify the EOS token
    "pad_token_id":tokenizer.eos_token_id, 
    "do_sample":True,
    "output_scores":False,
    "return_dict_in_generate":True,
}
model = torch.compile(model)
model.eval()

OptimizedModule(
  (_orig_mod): Phi3ForCausalLM(
    (model): Phi3Model(
      (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
      (embed_dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0-31): 32 x Phi3DecoderLayer(
          (self_attn): Phi3FlashAttention2(
            (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
            (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
            (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
          )
          (mlp): Phi3MLP(
            (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
            (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
            (activation_fn): SiLU()
          )
          (input_layernorm): Phi3RMSNorm()
          (resid_attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
          (post_attention_layernorm): Phi3RMSNorm()
        )

## Generate instances 
* Test the "easiest" and the "hardest" instances.
* Notice that `utils.GSM8KParser.get_answer_from_pred` works even when the model keeps generating after `#### digit`.
* Notice that the equation parser `GSM8KParser.parse_equations_from_pred` is not perfect: it extracts the numbers and operators around each equals sign in the text, so it also picks up malformed equations.

* Last, I strongly suspect that GSM8K has leaked into the Phi-3.5 training data: maj@1 was extremely high in one of my previous experiments, see the [wandb board](https://api.wandb.ai/links/moed/5dxnwaau). The plot shows the batch-wise maj@1 for zero-shot generation using the EvalTemplate (though I was using a very complete CoT prompt). Take it with a pinch of salt.
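
To make the equation-parsing idea concrete, here is a rough sketch that is not the actual `parse_equations_from_pred`. It keeps only math characters on lines containing `=`, splits on the equals signs, and then checks whether each candidate equation holds arithmetically. All names are hypothetical, and the arithmetic check is an extra assumed for illustration.

```python
import ast
import operator
import re
from typing import List

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def extract_equations(text: str) -> List[str]:
    """Rough extraction: keep only math characters on lines containing '='."""
    equations = []
    for line in text.splitlines():
        if "=" not in line:
            continue
        compact = re.sub(r"[^0-9+\-*/().=]", "", line)
        parts = [p.strip(".") for p in compact.split("=")]
        for lhs, rhs in zip(parts, parts[1:]):
            if lhs and rhs:
                equations.append(f"{lhs}={rhs}")
    return equations

def _evaluate(node: ast.AST) -> float:
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")

def equation_holds(equation: str, tol: float = 1e-6) -> bool:
    """True when both sides parse as arithmetic and agree within `tol`."""
    lhs, rhs = equation.split("=", 1)
    try:
        left = _evaluate(ast.parse(lhs, mode="eval"))
        right = _evaluate(ast.parse(rhs, mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return abs(left - right) <= tol

found = extract_equations("Calculating the total: 48 + 24 = 72")
checks = [equation_holds(eq) for eq in found + ["3/2=1000", "2/3+=1000"]]
```

A check like this could help discard malformed candidates before any equation-based scoring or deduplication.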

In [16]:
# question that can be solved with the fewest hops
sorted_data = sorted(TrainData, key=lambda x: x["num_hops"])
instance = sorted_data[0]

chats = [instance["formatted_question"]]
responses = utils.sample_answers(
    tokenizer,
    model,
    chats,
    **generation_config,
)
print(responses[0])


Natalia sold clips to 48 friends in April and half as many in May, so the equation representing the total number of clips sold in April and May is: 48 + (48/2)

Calculating the total: 48 + 24 = 72

#### 72


In [17]:
answer = instance["answer_str_digit"] 
prediction =  utils.GSM8KParser.get_answer_from_pred(responses[0])["answer_str_digit"]

print(f"Prediction -> {prediction}")
print(f"Label -> {answer}")

Prediction -> 72
Label -> 72


In [18]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0])
print(parsed_eqs["equations"])

['48+24=72']


In [19]:
# instance with the most hops
instance = sorted_data[-1]

chats = [instance["question"]]
responses = utils.sample_answers(
    tokenizer,
    model,
    chats,
    **generation_config,
)
print(responses[0])

To solve this problem, we need to calculate the number of male and female students, determine how many of each like to play basketball, and then find out how many do not like to play basketball.

Step 1: Calculate the number of male and female students.
The ratio of male to female students is 3:2, and there are 1000 students in total. To find out how many are male and how many are female, we can set up a proportion:

3 males / 2 females = 1000 students / (3 males + 2 females)

Let's denote the number of male students as M and the number of female students as F:

3M / 2F = 1000 / (M + F)

Since M + F = 1000, we can substitute:

3M / 2F = 1000 / 1000
3M / 2F = 1
3M = 2F

Now we can solve for M and F:

M = (2/3)F

Using the total number of students:

M + F = 1000
(2/3)F + F = 1000
(2/3)F + (3/3)F = 1000
(5/3)F = 1000
F = (3/5) * 1000
F = 600

Now we can find M:

M = (2/3) * 600
M = 400

Step 2: Calculate the number of students who like to play basketball.
2/3 of the male students like to 

In [20]:
answer = instance["answer_str_digit"] 
prediction =  utils.GSM8KParser.get_answer_from_pred(responses[0])["answer_str_digit"]

print(f"Prediction -> {prediction}")
print(f"Label -> {answer}")

Prediction -> <INVALID_ANSWER>
Label -> 52


In [21]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0])
# the parsing of the equations is not perfect, but it shows the direction we are going in
print(parsed_eqs["equations"])

['267+120=387', '2/3+3/3=1000', '613/1000*100=613', '3/2=1000', '1/5*600=120', '3/2=1000/3+2', '2/3+=1000', '3=2', '3/2=1', '3/2=1000/1000', '5/3=1000', '2/3*400=26667267', '1000-387=613']


# Generate Synthetic Data 
***
- Here we generate 10 responses per question using our base model.


## Example Walkthrough

In [22]:
# First, we adjust the generation config:
# [1] Bump up the temperature to flatten the token distribution, which boosts the diversity of the reasoning paths.
# [2] With top-p sampling, a higher temperature already tends to put more tokens inside the top-p pool,
# which diversifies the choice of tokens available at each time step. We additionally drop the top-p
# restriction (commented out below) to encourage even more diversity.
# This follows the setup suggested in the scaling paper.

num_sequence = 10
generation_config = {
    "max_new_tokens" : TrainData.inf_seq_length,
    "temperature": 0.7,
    "num_return_sequences":num_sequence,
    #"top_p": 0.5,
    "eos_token_id":tokenizer.eos_token_id,  # Specify the EOS token
    "pad_token_id":tokenizer.eos_token_id, 
    "do_sample":True,
    "output_scores":False,
    "return_dict_in_generate":True,
}
batch_size = 2
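
To illustrate how temperature and top-p interact, the sketch below computes how many tokens survive nucleus filtering for a toy logit vector. The logits and helper names are made up for illustration and have nothing to do with Phi-3.5's actual distributions.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs: List[float], top_p: float) -> int:
    """Number of highest-probability tokens kept by top-p sampling."""
    cumulative = 0.0
    for k, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return k
    return len(probs)

logits = [2.0, 1.5, 1.0, 0.5, 0.0]  # toy values
cold = nucleus_size(softmax_with_temperature(logits, 0.1), top_p=0.9)
warm = nucleus_size(softmax_with_temperature(logits, 0.7), top_p=0.9)
full = nucleus_size(softmax_with_temperature(logits, 0.7), top_p=1.0)
```

On these toy logits the pool grows from one token at temperature 0.1 to three at 0.7, and to all five once top-p is effectively disabled. Lowering top-p would shrink the pool again, so it works against diversity.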

In [23]:
generations = generation.get_generations(
    tokenizer,
    model,
    TrainData,
    batch_size=batch_size,
    **generation_config
)
# force only one generation with a break 


In [24]:
instance_sequences = []
for gen in generations: 

    sequences = gen.sequences

    counter = 0
    for i in range(0, sequences.shape[0], generation_config["num_return_sequences"]):
        instance_sequences.append(sequences[i:i+generation_config["num_return_sequences"]])
        counter += 1 

    assert counter == batch_size

print(len(instance_sequences), instance_sequences[0].shape)

2 torch.Size([10, 480])


In [25]:
unique_correct_completions,incorrect_completions,unique_correct_completions_eqs =\
    generation._filter_completions(instance_sequences[0], TrainData.max_length_question, tokenizer)
print(f"We filtered {len(unique_correct_completions)} unique correct completions") 

We filtered 10 unique correct completions
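
The filtering step is the "rejection" in rejection sampling: keep completions whose final answer matches the reference, and drop duplicates among those. `generation._filter_completions` is not shown here, so the following is only a sketch of that idea under assumed interfaces; the answer extractor and the deduplication key are passed in, and the toy versions below are for the demo only.

```python
from typing import Callable, Hashable, List, Tuple

def filter_completions(
    completions: List[str],
    gold_answer: str,
    extract_answer: Callable[[str], str],
    reasoning_key: Callable[[str], Hashable],
) -> Tuple[List[str], List[str]]:
    """Split completions into unique correct ones and incorrect ones.

    Two correct completions count as duplicates when `reasoning_key`
    returns the same value for both.
    """
    seen = set()
    unique_correct, incorrect = [], []
    for text in completions:
        if extract_answer(text) != gold_answer:
            incorrect.append(text)
            continue
        key = reasoning_key(text)
        if key not in seen:
            seen.add(key)
            unique_correct.append(text)
    return unique_correct, incorrect

def answer_after_marker(text: str) -> str:
    # Toy extractor for the demo: whatever follows the final "####".
    return text.rsplit("####", 1)[-1].strip() if "####" in text else ""

def without_whitespace(text: str) -> str:
    # Toy deduplication key: the text with all whitespace removed.
    return "".join(text.split())

samples = [
    "48 + 24 = 72 #### 72",
    "48+24=72 #### 72",     # same reasoning, different spacing
    "48 / 2 = 24 #### 24",  # wrong final answer
]
kept, rejected = filter_completions(
    samples, "72", answer_after_marker, without_whitespace
)
```

A key built from the parsed equation list, rather than the raw text, would treat two differently worded solutions with identical equations as duplicates.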


In [26]:
best_idx, best_worst_idx, gap = generation._socre_equations(unique_correct_completions_eqs, unique_correct_completions)
print(f"Best index: {best_idx}")
print(f"Worst index: {best_worst_idx}")
print(f"Gap: {gap}")

Best index: 1
Worst index: 0
Gap: 0.765527950310559


In [27]:
for i in range(len(unique_correct_completions)):
    print(i)
    print(unique_correct_completions[i])
    print('*'*100)

0
Natalia sold clips to 48 of her friends in April and half as many in May, so the total number of clips sold can be represented by the equation: 48 (April sales) + (48/2) (May sales) = 48 + 24 = 72 #### 72
****************************************************************************************************
1
Natalia sold clips to 48 friends in April and half as many in May, so the equations are:

April sales: 48
May sales: 48/2

Total sales (April + May): 48 + (48/2)

Simplifying the equation:

Total sales = 48 + 24

Total sales = 72

#### 72
****************************************************************************************************
2
Natalia sold clips to 48 friends in April and half as many in May, so the equation representing the total number of clips sold in April and May is: 48 + (48/2)

Calculating the total: 48 + 24 = 72

#### 72
****************************************************************************************************
3
Natalia sold clips to 48 friends in Apr

**We can see that the best completion selected is indeed better in terms of the structure and reasoning steps**
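
The scoring logic itself is in `generation.py` and not shown, so the sketch below only illustrates the shape of the selection step: score every correct completion, then return the best index, the worst index, and the gap between them. The scoring heuristic used here (one point per non-empty line) is a hypothetical stand-in and not the score that produced the gap above.

```python
from typing import Callable, List, Optional, Tuple

def pick_pair(
    completions: List[str],
    score_fn: Callable[[str], float],
) -> Optional[Tuple[int, int, float]]:
    """Return (best_idx, worst_idx, gap) under `score_fn`, or None if fewer than 2 items."""
    if len(completions) < 2:
        return None
    scores = [score_fn(c) for c in completions]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst, scores[best] - scores[worst]

def steps_score(text: str) -> float:
    # Hypothetical heuristic: one point per non-empty line.
    return float(sum(1 for line in text.splitlines() if line.strip()))

candidates = [
    "48 + (48/2) = 48 + 24 = 72 #### 72",
    "April: 48\nMay: 48/2 = 24\nTotal: 48 + 24 = 72\n#### 72",
]
result = pick_pair(candidates, steps_score)
```

Keeping the gap alongside the pair makes it possible to drop pairs whose two completions are nearly indistinguishable under the score.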

## End-to-End Generation on the Train Set

In [None]:
#!accelerate launch --multi-gpu generation.py

In [29]:
syntehetic_dataset

Dataset({
    features: ['question', 'favored_solutions', 'infavored_solutions', 'wrong_solutions', 'favored_infavored_gaps'],
    num_rows: 2
})

In [34]:
idx = 1
print(syntehetic_dataset["wrong_solutions"][idx])

(12 * (50/60)) #### 10

Weng earns $12 per hour and did 50 minutes of babysitting. To find the earnings, we convert 50 minutes to hours by dividing by 60 (since there are 60 minutes in an hour) and then multiply by her hourly rate of $12.

Final answer: 10 ####


In [35]:
print(syntehetic_dataset["question"][idx])

Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?


In [36]:
print(syntehetic_dataset["favored_solutions"][idx])

First, we need to convert the 50 minutes of babysitting into hours since Weng earns per hour. There are 60 minutes in an hour, so:

50 minutes ÷ 60 = 0.8333 hours (rounded to 4 decimal places)

Now, we can multiply the number of hours she worked by her hourly rate:

0.8333 hours × $12/hour = $10.00 (rounded to 2 decimal places)

The equation in a single line: 0.8333 × 12 = 10.00

#### 10.00


In [33]:
print(syntehetic_dataset["infavored_solutions"][idx])

Natalia sold clips to 48 friends in April, and then sold half as many in May, which is 48/2. To find the total number of clips sold in April and May, we add the two amounts together: 48 + (48/2)

#### 72
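
The `favored_solutions` / `infavored_solutions` columns have the shape of preference pairs, which is what DPO-style objectives (mentioned in the team's note) consume. Whether DPO is actually used for training is not stated here; as a sketch under that assumption, the per-pair DPO loss from summed sequence log-probabilities is:

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # Numerically stable -log(sigmoid(z)), i.e. softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the favored and lowers the disfavored completion: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The value of `beta` is an arbitrary choice for the example. In a real training loop the log-probabilities would come from the policy and a frozen reference model, summed over completion tokens only.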


# Training
***

Employ whatever trick you would like to reduce the VRAM requirements during training (including swapping the model for a smaller one, although please only as a last resort).

In [None]:
from dataclasses import dataclass
from torch.optim import AdamW
from torch.utils.data import DataLoader

@dataclass
class TrainConfig:
    lr: float = 3e-5
    epochs: int = 2
    batch_size: int = 4
    device: str = 'cpu'

def train(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    dataset: datasets.Dataset,
    config: TrainConfig,
) -> AutoModelForCausalLM:

    def collate_fn(batch):
        # Implement this
        return

    def loss_fn(batch):
        # Implement an appropriate loss - note we don't expect this to necessarily
        # be tied to the earlier mentioned paper, just something that is sensible
        return

    optimizer = AdamW(model.parameters(), lr=config.lr)

    dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=config.batch_size, shuffle=True)

    # This is a pretty bare-bones loop, feel free to add anything particularly useful
    for epoch in range(config.epochs):
        model.train()

        for batch in dataloader:
            optimizer.zero_grad()

            loss = loss_fn(batch)
            loss.backward()

            optimizer.step()

    return model

amazing_model = train(model, tokenizer, synthetic_dataset, TrainConfig())
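
For RFT-style supervised fine-tuning, a sensible `loss_fn` is the next-token negative log-likelihood over answer tokens only. The pure-Python reference below is a sketch meant to make the shift and the masking explicit, not something to train with. In practice `torch.nn.functional.cross_entropy` with its default `ignore_index=-100` computes the same quantity on tensors, and Hugging Face causal LMs typically apply the shift internally when `labels` are passed.

```python
import math
from typing import List

IGNORE_INDEX = -100

def masked_next_token_nll(logits: List[List[float]], labels: List[int]) -> float:
    """Mean next-token negative log-likelihood over unmasked label positions.

    logits[t] are the vocabulary scores produced after reading token t, so they
    are scored against labels[t + 1] (the usual causal-LM shift).
    """
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # prompt or padding position: no loss
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
        count += 1
    return total / count if count else 0.0

# Three positions, vocabulary of four tokens, uniform logits.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = masked_next_token_nll(uniform_logits, [IGNORE_INDEX, 2, IGNORE_INDEX])
```

With uniform logits over four tokens, the single unmasked position contributes `log(4)`.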

### Evaluating the Model

This final part is more free-form. We'd like to evaluate our new model on the test set to see if it's improved, but then spend however much time you have left examining the model more closely / demonstrating some interesting behaviour / showing off beautiful plots.


In [None]:
def evaluate_model(
    model: AutoModelForCausalLM,
    eval_dataset: datasets.Dataset,
) -> float:
    return 0.0

our_score = evaluate_model(amazing_model, val_dataset)
original_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
their_score = evaluate_model(original_model, val_dataset)

conclusion = '🎉🎉🎉' if our_score > their_score else 'oh well, was it even supposed to work?'
print(conclusion)
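
`evaluate_model` is left as a stub above. One possible scoring core, sketched here under the assumption that answers have already been parsed into strings (with `<INVALID_ANSWER>` marking parse failures, as in the outputs earlier), is majority-vote accuracy over k samples per question. With k = 1 it reduces to plain accuracy. The function names are hypothetical.

```python
from collections import Counter
from typing import List

INVALID_ANSWER = "<INVALID_ANSWER>"

def majority_answer(answers: List[str]) -> str:
    """Most common valid answer among k samples; ties go to the earliest seen."""
    valid = [a for a in answers if a != INVALID_ANSWER]
    if not valid:
        return INVALID_ANSWER
    return Counter(valid).most_common(1)[0][0]

def maj_at_k_accuracy(sampled_answers: List[List[str]], gold: List[str]) -> float:
    """Fraction of questions whose majority-voted answer equals the gold answer."""
    if not gold:
        return 0.0
    hits = sum(majority_answer(s) == g for s, g in zip(sampled_answers, gold))
    return hits / len(gold)

preds = [
    ["72", "72", "70"],            # majority is correct
    ["10", INVALID_ANSWER, "12"],  # tie between valid answers, first one wins
    [INVALID_ANSWER],              # nothing parseable
]
gold = ["72", "12", "5"]
score = maj_at_k_accuracy(preds, gold)
```

Reporting the share of `<INVALID_ANSWER>` predictions next to the accuracy helps separate formatting failures from reasoning failures.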

### [Optional] - Discussion

We would be interested to know:

1.   If you were less constrained by time or compute, what would you do differently?
2.   What would your ideal first project look like if you joined?

