# Convergence ML Engineer / Researcher Take-home

We'd like to learn a little more about how you practically approach a small research-like project loosely based on Rejection Sampling Fine-tuning (aka RFT, introduced in https://arxiv.org/abs/2308.01825).

Tip: focus on section 3.3 ("Rejection Sampling Fine-tuning"). The paper isn't the best written, and we're happy to clarify anything.
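As a rough illustration of the selection step described in section 3.3, the sketch below keeps only sampled reasoning paths whose final answer is correct, then drops paths that repeat the same list of calculations. It is a minimal sketch, not the paper's code: the helper names `equation_list` and `select_rft_samples` are hypothetical, and it assumes calculations are annotated GSM8K-style as `<<expr=result>>`. When a path has no such annotations, it falls back to the whitespace-normalized text as the deduplication key.

```python
import re

CALC_PATTERN = re.compile(r"<<(.+?)>>")


def equation_list(reasoning: str) -> tuple:
    """Return the calculator annotations in order, with whitespace removed."""
    return tuple(re.sub(r"\s+", "", m) for m in CALC_PATTERN.findall(reasoning))


def select_rft_samples(samples):
    """Keep correct samples with distinct equation lists.

    `samples` is a list of (reasoning, is_correct) pairs for ONE question.
    """
    seen = set()
    kept = []
    for reasoning, is_correct in samples:
        if not is_correct:
            continue  # rejection step: discard paths with a wrong final answer
        # Fall back to normalized text when no <<...>> annotations are present.
        key = equation_list(reasoning) or (" ".join(reasoning.split()),)
        if key in seen:
            continue  # same calculations as an earlier path
        seen.add(key)
        kept.append(reasoning)
    return kept


samples = [
    ("Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n48+24 = <<48+24=72>>72\n#### 72", True),
    ("In May she sold 48 / 2 = <<48 / 2 = 24>>24.\nTotal: <<48+24=72>>72\n#### 72", True),
    ("April 48, May <<48/2=24>>24, so <<24+48=72>>72 in total.\n#### 72", True),
    ("She sold 48*2 = <<48*2=96>>96 in May.\n#### 144", False),
]
print(len(select_rft_samples(samples)))
```

The second sample is dropped because its calculations match the first once whitespace is ignored. The third is kept because the order of operands differs, which this sketch treats as a distinct path.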

We provide some skeleton code as a guide to what we would like to see. If you have ideas for a different structure that you feel is better or more elegant, feel free to rewrite and replace at will.

Note: your final submission does not have to be a Colab notebook, does not have to use Hugging Face, etc.

---

## A note from the team

We want to give you a chance to show off some of your best abilities.

For some people that might mean generating high-quality data in a smart way. For others, it might be speeding up the whole process to enable easy reproducibility, and maybe organizing the code better than the skeleton does. For yet others, it might be a chance to show off some modern policy optimization techniques like DPO or its variants. Or it might mean focusing on solid evaluations and identifying the limitations of small models and limited fine-tuning.

An ideal outcome of course is some sense of the model improving its mathematical abilities, but it’s not a bad thing if the final evaluation somehow shows equal or worse performance 😂 (negative results are results).

Ask lots of questions! We're happy to answer anything about the assignment and to discuss concepts like RFT.

### Extra Package Installation

pip3 install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

pip install flash-attn==2.5.8 --no-build-isolation

In [1]:
import torch
import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from tqdm import tqdm

In [2]:
torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

ds = datasets.load_dataset("openai/gsm8k", "main")

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: libcudart.so.12: cannot open shared object file: No such file or directory.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.



In [6]:
ds['train'][0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}

In [29]:
def generate_solution(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    question: str,
    k: int = 10,
) -> list[str]:
    """Sample k candidate solutions for one question and return them as strings."""
    
    prompt = """
    You are an AI assistant to solve maths problems. For each question, write a step-by-step solution and give the final answer in format: solution\n#### final_answer
    Only include the solution and answer, do not include any other descriptions.
    
    Example:
    Question:
    Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? 
    
    Your answer should be:
    Natalia sold 48/2 = <<48/2=24>>24 clips in May.
    Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
    #### 72
    """

    messages = [
      {"role": "system", "content": prompt},
      {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
      {"role": "assistant", "content": 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'},
      {"role": "user", "content": question}
    ]
    
    pipe = pipeline(
      "text-generation",
      model=model,
      tokenizer=tokenizer
    )
    
    generation_args = {
      "max_new_tokens": 500,
      "return_full_text": False,
      "do_sample": True,
      "temperature": 0.7,
    }

    responses = []
    
    for i in range(k):
        print(f"############{i}############")
        output = pipe(messages, **generation_args)
        text = output[0]['generated_text']
        print(text)

        responses.append(text)

    return responses
        

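def extract_final_answer(text):
    # Take what follows the last "####"; return None when the marker is missing.
    if "####" not in text:
        return None
    return text.rsplit("####", 1)[1].strip()


def answers_match(gt_answer, response_answer):
    # Compare numerically so that "10.00" matches "10"; fall back to string equality.
    if gt_answer is None or response_answer is None:
        return False
    try:
        return float(gt_answer.replace(",", "")) == float(response_answer.replace(",", "").lstrip("$"))
    except ValueError:
        return gt_answer == response_answer


def generate_synthetic_data(model, tokenizer, questions, answers, k=10):
    dataset = []
    for question, answer in zip(questions, answers):
        print('question:', question)
        print('answer:', answer)
        gt_answer = extract_final_answer(answer)
        print('gt_answer:', gt_answer)
        responses = generate_solution(model, tokenizer, question, k)
        for response in responses:
            # A response without a "####" marker counts as incorrect instead of raising IndexError.
            correct = answers_match(gt_answer, extract_final_answer(response))
            dataset.append({'question': question, 'gt_answer': answer, 'llm_answer': response, 'correct': correct})
    return dataset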

In [31]:
dataset = generate_synthetic_data(model, tokenizer, ds['train']['question'][1:2], ds['train']['answer'][1:2], k=1)

question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
answer: Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10
gt_answer: 10
############0############
 To find out how much Weng earned, we first need to convert the minutes she worked into hours. There are 60 minutes in an hour, so:

50 minutes ÷ 60 minutes/hour = 0.8333 hours (rounded to four decimal places)

Now, we can calculate her earnings by multiplying the hours worked by her hourly rate:

0.8333 hours × $12/hour = $10.00 (rounded to two decimal places)

Weng earned $10.00 for 50 minutes of babysitting yesterday.
#### 10.00


In [40]:
def apply_chat_template(
    example,
    tokenizer,
):
    messages = [
      {"role": "user", "content": example['question']},
      {"role": "assistant", "content": example['answer']},
    ]
    
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example

In [41]:
ds_1 = ds.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    # remove_columns=column_names,
    desc="Applying chat template to train_sft",
)


In [43]:
ds_1['train'][0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72',
 'text': '<|user|>\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<|end|>\n<|assistant|>\nNatalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72<|end|>\n<|endoftext|>'}

### Training on the Collected Data

Of course, Colab's computational resources are limited.

Use whatever tricks you like to reduce VRAM requirements during training (including swapping the model for a smaller one, although please only as a last resort).
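To see why such tricks matter, a back-of-the-envelope estimate helps. The sketch below assumes roughly 3.8 billion parameters for Phi-3.5-mini, 16-bit weights and gradients, and two 32-bit AdamW moment buffers per trainable parameter. It ignores activations and framework overhead, so real usage is higher. The function name and the 1% trainable fraction are illustrative assumptions.

```python
def training_memory_gib(n_params, trainable_fraction=1.0,
                        weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate memory for weights, gradients, and AdamW state in GiB.

    Activations and framework overhead are NOT included.
    """
    trainable = n_params * trainable_fraction
    total_bytes = (
        n_params * weight_bytes      # every weight is held in memory
        + trainable * grad_bytes     # gradients only for trainable weights
        + trainable * optim_bytes    # two fp32 moment buffers per trainable weight
    )
    return total_bytes / 2**30


N_PARAMS = 3.8e9  # assumed parameter count

full = training_memory_gib(N_PARAMS)
adapter = training_memory_gib(N_PARAMS, trainable_fraction=0.01)
print(f"full fine-tuning: ~{full:.0f} GiB, 1% trainable: ~{adapter:.0f} GiB")
```

Under these assumptions, full fine-tuning needs tens of GiB before any activations are stored, while training a small fraction of the weights stays close to the size of the frozen model. This is the usual motivation for low-rank adapters, 8-bit optimizers, or gradient checkpointing.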

In [None]:
from dataclasses import dataclass
from torch.optim import AdamW
from torch.utils.data import DataLoader

@dataclass
class TrainConfig:
    lr: float = 3e-5
    epochs: int = 2
    batch_size: int = 4
    device: str = 'cpu'

def train(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    dataset: datasets.Dataset,
    config: TrainConfig,
) -> AutoModelForCausalLM:

    def collate_fn(batch):
        # Implement this
        return

    def loss_fn(batch):
        # Implement an appropriate loss - note we don't expect this to necessarily
        # be tied to the earlier mentioned paper, just something that is sensible
        return

    optimizer = AdamW(model.parameters(), lr=config.lr)

    dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=config.batch_size, shuffle=True)

    # This is a pretty bare-bones loop, feel free to add anything particularly useful
    for epoch in range(config.epochs):
        model.train()

        for batch in dataloader:
            optimizer.zero_grad()

            # loss_fn is defined above to take the whole batch as a single argument
            loss = loss_fn(batch)
            loss.backward()

            optimizer.step()

    return model

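# Assumption: `synthetic_dataset` is the rejection-sampled subset of `dataset`
# built above, i.e. only the responses whose final answer was correct.
synthetic_dataset = [row for row in dataset if row['correct']]
amazing_model = train(model, tokenizer, synthetic_dataset, TrainConfig())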
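
One common choice for the `collate_fn` / `loss_fn` pair is next-token cross-entropy computed only on the response tokens. The sketch below shows the bookkeeping in plain Python, without tensors: prompt and padding positions get the label `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss. The helper names `build_example` and `pad_batch` and the toy token ids are hypothetical. In practice the ids would come from the tokenizer and the lists would be converted to tensors.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_token_id):
    """Right-pad a list of examples to a common length."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Padding positions are ignored by the loss as well.
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch


batch = pad_batch(
    [build_example([5, 6, 7], [8, 9]), build_example([5, 6], [10])],
    pad_token_id=0,
)
print(batch["labels"])
```

Hugging Face causal language models generally accept `labels` alongside `input_ids` and `attention_mask` and shift them by one position internally, so a batch shaped like this can be passed to the model's forward call to obtain the loss.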

### Evaluating the Model

This final part is more free-form. First, evaluate the new model on the test set to see whether it has improved. Then spend however much time you have left examining the model more closely, demonstrating some interesting behaviour, or showing off beautiful plots.


In [None]:
def evaluate_model(
    model: AutoModelForCausalLM,
    eval_dataset: datasets.Dataset,
) -> float:
    return 0.0

our_score = evaluate_model(amazing_model, ds['test'])
original_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
their_score = evaluate_model(original_model, ds['test'])

conclusion = '🎉🎉🎉' if our_score > their_score else 'oh well, was it even supposed to work?'
print(conclusion)
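
Because the GSM8K test split has only 1319 problems, small accuracy differences can be noise. The sketch below computes accuracy with a 95% Wilson score interval from per-problem correctness flags, as one way to judge whether `our_score` and `their_score` differ meaningfully. The function name is hypothetical and the example counts are made up for illustration; they are not results.

```python
import math


def accuracy_with_wilson_interval(correct_flags, z=1.96):
    """Return (accuracy, lower, upper) using the Wilson score interval."""
    n = len(correct_flags)
    if n == 0:
        raise ValueError("need at least one evaluated example")
    p = sum(correct_flags) / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half_width, centre + half_width


# Hypothetical outcome: 1100 of 1319 problems answered correctly.
flags = [True] * 1100 + [False] * 219
acc, low, high = accuracy_with_wilson_interval(flags)
print(f"accuracy={acc:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals of the two models overlap heavily, a paired comparison on the same problems is usually more sensitive than comparing the two accuracies.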

### [Optional] - Discussion

We would be interested to know:

1.   If you were less constrained by time or compute, what would you do differently?
2.   What would your ideal first project look like if you joined?

