# Goal

We'd like to learn a little more about how you practically approach a small research-like project loosely based on Rejection Sampling Fine-tuning (aka RFT, introduced in https://arxiv.org/abs/2308.01825).

Tip: focus on section 3.3 ("Rejection Sampling Fine-tuning"). The paper isn't the best written, and we're happy to clarify anything.

We will provide some skeleton code for you to guide what we would like to see from you, although if you have ideas for a different structure you feel is better or more elegant, then feel free to rewrite and replace at will.

Note: your final submission does not have to be in a colab notebook, does not have to use Hugging Face, etc.

---


We want to give you a chance to show off some of your best abilities.

For some people that might mean generating high quality data in a smart way. For others, it might be speeding up the whole process to enable easy reproducibility, and maybe organizing the code in a better way than given. Yet for others, it might be a chance to show off some modern policy optimization techniques like DPO or its variants. Or maybe focusing on solid evaluations and identifying limitations of small models and limited fine-tuning.

An ideal outcome of course is some sense of the model improving its mathematical abilities, but it’s not a bad thing if the final evaluation somehow shows equal or worse performance 😂 (negative results are results).

Ask lots of questions! We're happy to answer any questions about the assignment, and to discuss concepts like RFT.

# Setup [ignore - just run]
***

In [1]:
#!pip3 install -r requirements.txt

In [3]:
%load_ext autoreload
%autoreload 2

import os  
import torch
import random 
import datasets
import numpy as np 
import pandas as pd

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

from torch.utils.data import DataLoader

from data import GSM8KDataset, _apply_template
from prompt import EvalTemplate

import utils
import generation 

os.environ["TOKENIZERS_PARALLELISM"] = "true"
SEED = 128 
MODEL_NAME = "microsoft/Phi-3.5-mini-instruct"

# set seeds
torch.random.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [34]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # bfloat16 speeds up inference
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

train_dataset = datasets.load_dataset('gsm8k', 'main')['train']
val_dataset = datasets.load_dataset('gsm8k', 'main')['test'] 
print(f"Num Training instances: {train_dataset.shape[0]}")
print(f"Num Validation instances: {val_dataset.shape[0]}")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Num Training instances: 7473
Num Validation instances: 1319


In [133]:
# from tqdm import tqdm
# for i in tqdm(range(len(gen_dataset_2))):
#     for j in range(len(train_dataset)):
#         if gen_dataset_2[i]['question'] == train_dataset[j]['question']:
#             assert utils.GSM8KParser.get_answer_from_gt(train_dataset[j]['answer']) == \
#                  utils.GSM8KParser.get_answer_from_pred(gen_dataset_2[i]['favored_solutions']), print(i, j)
# verification; this cell takes about 7 minutes to run

In [130]:
utils.inspect_instance(gen_dataset_2, i)

question
Five coworkers were talking during the lunch break. Roger, the oldest one, said that he has the same amount of experience in years as all four of the others combined and that his retirement should come when he accumulates 50 years of experience. Peter said that when he came to the company his daughter was 7 years old, and now she is 19 years old. Tom then said he has twice as many years of experience as Robert. Robert said that he has 4 years of experience less than Peter but 2 more years of experience than Mike. How many more years does Roger have to work before he retires?
question_idx
512
favored_solutions
Let's denote the years of experience of Roger, Peter, Tom, Robert, and Mike as R, P, T, Ro, and Mi respectively.

1. R = P + T + Ro + Mi (Roger's experience is the sum of the others)
2. R + (P - 12) = 50 (Peter's experience is 19 - 7 = 12 years more than his daughter's current age)
3. T = 2Ro (Tom has twice as many years of experience as Robert)
4. Ro = P - 4 (Robert has 

In [122]:
gen_dataset = datasets.load_from_disk("corrected-pred-parser-0.5lev-gsm8k_synthetic_data_747instances_5samples")
gen_dataset = gen_dataset.sort(column_names="favored_infavored_gaps", reverse=True)
utils.inspect_instance(gen_dataset, 0)

question
Ian used a grocery delivery app to have his groceries delivered.  His original order was $25 before delivery and tip.  He noticed that 3 items changed on his order.  A $0.99 can of tomatoes was replaced by a $2.20 can of tomatoes, his $1.00 lettuce was replaced with $1.75 head of lettuce and his $1.96 celery was replaced with celery that cost $2.00.  Delivery and tip came to a total of $8.00.  How much is his new bill now, with the food substitutes and delivery/tip?
question_idx
588
favored_solutions
First, we calculate the difference in cost for each item changed:
$2.20 (new can of tomatoes) - $0.99 (old can of tomatoes) = $1.21
$1.75 (new head of lettuce) - $1.00 (old lettuce) = $0.75
$2.00 (new celery) - $1.96 (old celery) = $0.04

Next, we add the differences together to find the total increase in cost due to the substitutions:
$1.21 + $0.75 + $0.04 = $2.00

Now, we add the total increase in cost to the original order amount:
$25.00 (original order) + $2.00 (increase due t

In [116]:

idx = 2

utils.inspect_instance(gen_dataset, idx)

question
Susy goes to a large school with 800 students, while Sarah goes to a smaller school with only 300 students.  At the start of the school year, Susy had 100 social media followers.  She gained 40 new followers in the first week of the school year, half that in the second week, and half of that in the third week.  Sarah only had 50 social media followers at the start of the year, but she gained 90 new followers the first week, a third of that in the second week, and a third of that in the third week.  After three weeks, how many social media followers did the girl with the most total followers have?
question_idx
2
favored_solutions
Let's calculate the total followers for both girls after three weeks:

Susy's followers:
Initial followers: 100
First week: +40
Second week: +40/2 = +20
Third week: +20/2 = +10
Total followers for Susy: 100 + 40 + 20 + 10 = 170

Sarah's followers:
Initial followers: 50
First week: +90
Second week: +90/3 = +30
Third week: +30/3 = +10
Total followers for

In [None]:
for i in range(len(train_dataset)):
    if train_dataset[i]['question'] == gen_dataset[idx]['question']:
        utils.inspect_instance(train_dataset, i)
        print(utils.GSM8KParser.get_answer_from_gt(gen_dataset[idx]['favored_solutions']))
        print(utils.GSM8KParser.get_answer_from_gt(train_dataset[i]['answer']))
        print(utils.GSM8KParser.get_answer_from_pred(train_dataset[i]['answer']))

question
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
answer
In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.
Betty's grandparents gave her 15 * 2 = $<<15*2=30>>30.
This means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.
#### 5
question_length
64
answer_length
107
num_hops
3
answer_str_digit
5
**************************************************


# Dataset
****
- The reasoning steps (hops) seem to appear on separate lines, which makes them easy to parse.
- The equation tags ```<< >>``` were used in the original GSM8K paper from OpenAI to train language models to invoke a calculator.
- **Verified that our ```utils.GSM8KParser.get_answer_from_pred``` extracts the same result as the ground-truth parser.**
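
To make the answer format concrete, here is a minimal sketch of pulling the `####` final answer, the `<< >>` calculator annotations, and a hop count out of a ground-truth answer. The helper names `extract_final_answer`, `calculator_calls`, and `count_hops` are hypothetical and are not the implementation in `utils.GSM8KParser`; counting one hop per non-empty reasoning line is an assumption based on the observation above.

```python
import re
from typing import List, Optional

ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")
CALC_RE = re.compile(r"<<([^<>]*)>>")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, with commas removed."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].replace(",", "") if matches else None


def calculator_calls(answer: str) -> List[str]:
    """Return the contents of every <<...>> calculator annotation."""
    return CALC_RE.findall(answer)


def count_hops(answer: str) -> int:
    """Assumption: one hop per non-empty line before the '####' line."""
    return sum(
        1 for line in answer.splitlines()
        if line.strip() and not line.startswith("####")
    )


gt = (
    "In the beginning, Betty has only 100 / 2 = $<<100/2=50>>50.\n"
    "Betty's grandparents gave her 15 * 2 = $<<15*2=30>>30.\n"
    "This means, Betty needs 100 - 50 - 30 - 15 = $<<100-50-30-15=5>>5 more.\n"
    "#### 5"
)
print(extract_final_answer(gt))
print(calculator_calls(gt))
print(count_hops(gt))
```

The same `extract_final_answer` pattern also accepts a completion that writes `####72` without a space, which is how the base model sometimes formats its answer.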

## Inspection 

In [5]:
train_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [6]:
for _ in range(5):
    seed = np.random.randint(0, len(train_dataset))
    print("*"*100)
    print(f"Checking instance {seed}:")
    utils.inspect_instance(train_dataset, seed)

****************************************************************************************************
Checking instance 3282:
question
It’s February 2021.  Mark was born in January 1976.  Graham is 3 years younger than Mark, and Graham’s sister, Janice, is 1/2 the age of Graham.  How old is Janice?
answer
It’s 2021 and Mark was born in 1976 so Mark is 2021-1976 = <<2021-1976=45>>45 years old
Graham is 3 years younger than Mark who is 45 so Graham is 45-3 = 42 years old
Janice is 1/2 the age of Graham who is 42 so Janice is 42/2 = <<42/2=21>>21 years old
#### 21
**************************************************
****************************************************************************************************
Checking instance 7251:
question
Melissa sells a coupe for $30,000 and an SUV for twice as much. If her commission is 2%, how much money did she make from these sales?
answer
First find the total cost of the SUV: $30,000 * 2 = $<<30000*2=60000>>60,000
Then add the cost of the coup

## Extract statistics 

For now we only look at the training set, to extract statistics that will be used during inference:

- Maximum length (num_tokens) of question: 239
- Maximum length (num_tokens) of answer: 475 

In [7]:
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_question_length(x['question'], tokenizer)
)

train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_answer_length(x['answer'], tokenizer) 
)
print(f"Maximum answer num_tokens: {max(train_dataset['answer_length'])}")
print(f"Maximum question num_tokens: {max(train_dataset['question_length'])}")

Maximum answer num_tokens: 475
Maximum question num_tokens: 239


## Extract Answer 

In [19]:
# infer number of hops 
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)

# infer answers using ground truth parser 
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_answer_from_gt(x['answer'])
)

In [20]:
# Optional cell: only to verify that parsing from the
# ground truth and parsing from a completion
# yield the same result.
# Infer answers using the prediction parser.
answer_str_inf = [
    utils.GSM8KParser.get_answer_from_pred(x)['answer_str_digit'] \
    for x in train_dataset['answer']
]
assert answer_str_inf == train_dataset['answer_str_digit']

## A note on the collate function 
Instead of tokenizing the dataset on the fly in a ```collate_fn```, I pre-tokenize it into a static format. This leads to faster training runs but less flexibility: for example, if you wanted to modify a question slightly depending on the model's performance during training, you could not do it without a collate function.

However, since our problem is simple, we can afford to pre-tokenize the dataset.

That said, there is a dummy collation function ```class PreprocessedCollator(DataCollatorMixin)``` in ```lora.py```, which simply converts the input into the format required by SFTTrainer.

Inside ```data.py```, the tokenization is done in the ```GSM8KDataset``` class, which runs ```self._preprocess``` on instantiation.

All required attributes (input_ids, attention_mask, labels) are ready once instantiation completes. For a detailed explanation, please refer to ```data.GSM8KDataset._preprocess```.

The output schema of the ```data.GSM8KDataset``` class is rather long, which could take up unnecessary space during training; further engineering effort could remedy that.
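
As a rough illustration of what pre-tokenizing into a static format involves, the sketch below concatenates prompt and completion token ids, pads to a fixed length, and masks everything except the completion in `labels`. The function `build_example`, the toy token ids, and the choice of right-padding are assumptions for illustration; the real logic lives in `data.GSM8KDataset._preprocess` and may differ.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(question_ids: List[int], answer_ids: List[int],
                  max_len: int, pad_id: int) -> Dict[str, List[int]]:
    """Concatenate prompt and completion, pad to max_len, mask the prompt."""
    ids = question_ids + answer_ids
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    n_pad = max_len - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # loss is computed on completion tokens only;
        # prompt and padding positions are ignored
        "labels": [IGNORE_INDEX] * len(question_ids)
                  + answer_ids
                  + [IGNORE_INDEX] * n_pad,
    }


# toy ids: three prompt tokens, two completion tokens
example = build_example([11, 12, 13], [21, 22], max_len=7, pad_id=0)
for key, value in example.items():
    print(key, value)
```

Because every example already has the same length, a collator only needs to stack these lists into tensors.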


In [9]:
train_dataset = datasets.load_dataset('gsm8k', 'main')['train']
train_dataset = train_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)
TrainData = GSM8KDataset(train_dataset, tokenizer) #Complete tokenization of the dataset

Maximum answer num_tokens: 477
Maximum question num_tokens: 359
Maximum sequence num_tokens: 836
Maximum new tokens in generation: 527


Setup Completed dataset:
Dataset({
    features: ['question', 'answer', 'num_hops', 'answer_str_digit', 'formatted_question', 'formatted_answer', 'question_length', 'answer_length', 'question_input_ids', 'question_attention_mask', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 7473
})


In [10]:
val_dataset = datasets.load_dataset('gsm8k', 'main')['test'] 
val_dataset = val_dataset.map(
    lambda x: utils.GSM8KParser.get_num_hops(x['answer'])
)
valData = GSM8KDataset(val_dataset, tokenizer)

Maximum question num_tokens: 310
Maximum answer num_tokens: 430
Maximum sequence num_tokens: 740
Maximum new tokens in generation: 480


Setup Completed dataset:
Dataset({
    features: ['question', 'answer', 'num_hops', 'answer_str_digit', 'formatted_question', 'formatted_answer', 'question_length', 'answer_length', 'question_input_ids', 'question_attention_mask', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1319
})


## Validation 

* We want to validate that our manual padding is done correctly 

### Data Loader Compatibility

In [11]:
dummy_dataloader = DataLoader(TrainData, batch_size=2, shuffle=False)
for batch in dummy_dataloader:
    assert batch['input_ids'].shape[0] == 2
    assert (batch['input_ids'] == TrainData[:2]['input_ids']).all()
    break 
# first we validate that it loads a homogeneous batch of data

### Padding for the entire sequence (training)

In [14]:
instance = TrainData[0]
# check (attention mask == 1) part for the entire sequence
print(tokenizer.decode(
    instance["input_ids"][instance['attention_mask']!=0]
))

<|system|> You are a highly intelligent assistant who is exceptional at solving Math Problems.<|end|><|user|> *Task*    
Think step by step to solve the following question:
NOTE:
1. Reason deductively.
1. Write all equations in a single line wihtout breaks in the middle.
2. Submit final answer using PURE DIGITS, in the last starting with "####".
    i.e. "#### 10" if the final answer is $10

```question
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
```<|end|><|assistant|> Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|end|><|endoftext|>


In [15]:
# check (labels!= -100) part for the entire sequence
print(tokenizer.decode(
    instance["input_ids"][instance['labels']!=-100]
))
# notice that the loss is not computed on <|assistant|>; it is treated as part of the question 

Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<|end|><|endoftext|>


### Padding for the question only (inference)

In [16]:
# check (attention_mask = 1) part for the question 
print(tokenizer.decode(
    instance["question_input_ids"][instance["question_attention_mask"]!=0]
))
# notice that question_input_ids includes everything up to and including <|assistant|> 

<|system|> You are a highly intelligent assistant who is exceptional at solving Math Problems.<|end|><|user|> *Task*    
Think step by step to solve the following question:
NOTE:
1. Reason deductively.
1. Write all equations in a single line wihtout breaks in the middle.
2. Submit final answer using PURE DIGITS, in the last starting with "####".
    i.e. "#### 10" if the final answer is $10

```question
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
```<|end|><|assistant|>


# Model Vibe Check 
*** 

## Generation Config

In [35]:
generation_config = {
    "max_new_tokens" : 1024,
    "temperature": 0.1, 
    "num_return_sequences":1,
    "top_p": 0.9,
    "eos_token_id":tokenizer.eos_token_id,  # Specify the EOS token
    "pad_token_id":tokenizer.eos_token_id, 
    "do_sample":True,
    "output_scores":False,
    "return_dict_in_generate":True,
}
model = torch.compile(model)
model.eval()

OptimizedModule(
  (_orig_mod): Phi3ForCausalLM(
    (model): Phi3Model(
      (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
      (embed_dropout): Dropout(p=0.0, inplace=False)
      (layers): ModuleList(
        (0-31): 32 x Phi3DecoderLayer(
          (self_attn): Phi3FlashAttention2(
            (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
            (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
            (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
          )
          (mlp): Phi3MLP(
            (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
            (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
            (activation_fn): SiLU()
          )
          (input_layernorm): Phi3RMSNorm()
          (resid_attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
          (post_attention_layernorm): Phi3RMSNorm()
        )

## Generate instances 
* Test the "easiest" and the "hardest" instances from GSM8K, ranked by the number of hops 
required to solve the problem 

* The equation parser `GSM8KParser.parse_equations_from_pred` does not work perfectly, but it can extract the sequences of numbers and operator signs in each line of the text, which still tells us which numerical operations were performed. 

* Whether we should include the text when parsing equations is a question we could investigate 
(the two instances below show one case where it hurts and one case where it helps)
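
The trade-off can be sketched with two toy extractors. Neither is the real `GSM8KParser.parse_equations_from_pred`; `equations_without_text` is a hypothetical character filter that mimics the clumping behaviour seen below, and `explicit_equations` is a hypothetical stricter pattern that only accepts `a op b = c` with no words in between.

```python
import re
from typing import List

# strict pattern: number, one or more "op number" groups, "=", number
EXPLICIT_EQ = re.compile(r"\d+(?:\s*[+\-*/]\s*\d+)+\s*=\s*\d+")


def equations_without_text(text: str) -> List[str]:
    """Per line containing '=', keep only digits and the signs + - * / =."""
    return [
        re.sub(r"[^0-9+\-*/=]", "", line)
        for line in text.splitlines()
        if "=" in line
    ]


def explicit_equations(text: str) -> List[str]:
    """Keep only equations written without any words between the operands."""
    return EXPLICIT_EQ.findall(text)


response = (
    "Natalia sold clips to 48 friends in April and half as many in May, "
    "which means she sold 48/2 = 24 clips in May. To find the total number "
    "of clips sold in April and May, we add the two amounts together: "
    "48 (April) + 24 (May) = 72. Therefore, the final answer is ####72."
)
print(equations_without_text(response))
print(explicit_equations(response))
```

The character filter keeps every operation but glues unrelated numbers together, while the strict pattern returns clean equations but misses `48 (April) + 24 (May) = 72` because of the words between the operands.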

### Shortest hop required instance

In [24]:
## question that can be solved with the fewest hops 
sorted_data = sorted(TrainData, key=lambda x: x["num_hops"])
instance = sorted_data[0]

chats = [instance["formatted_question"]]
responses = utils.sample_answers(
    tokenizer,
    model,
    chats,
    **generation_config,
)
print(responses[0])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.


Natalia sold clips to 48 friends in April and half as many in May, which means she sold 48/2 = 24 clips in May. To find the total number of clips sold in April and May, we add the two amounts together: 48 (April) + 24 (May) = 72. Therefore, the final answer is ####72.


In [25]:
answer = instance["answer_str_digit"] 
prediction =  utils.GSM8KParser.get_answer_from_pred(responses[0])["answer_str_digit"]

print(f"Prediction -> {prediction}")
print(f"Label -> {answer}")

Prediction -> 72
Label -> 72


In [None]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0])
print(parsed_eqs["equations"])
## the parser keeps looking back for digits before the equal sign until it stops,
## which can clump all the numbers and operator signs on a single line together 

['4848/2=2448+24=7272']


In [30]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0], include_text=True)
print(parsed_eqs["equations"])
## including the text could boil down to using the entire sequence, which is undesired 

['48 friends in April and half as many in May, which means she sold 48/2 = 24 clips in May. To find the total number of clips sold in April and May, we add the two amounts together: 48 (April) + 24 (May) = 72. Therefore, the final answer is ####72']


### Longest hop required instance

In [36]:
# instance requiring the most hops 
instance = sorted_data[-1]

chats = [instance["question"]]
responses = utils.sample_answers(
    tokenizer,
    model,
    chats,
    **generation_config,
)
print(responses[0])

To solve this problem, we need to follow these steps:

1. Determine the number of male and female students based on the given ratio and total number of students.
2. Calculate the number of male and female students who like to play basketball.
3. Find the total number of students who like to play basketball.
4. Calculate the number of students who do not like to play basketball.
5. Determine the percentage of students who do not like to play basketball out of the total student population.

Let's go through these steps:

Step 1: Determine the number of male and female students.
The ratio of male to female students is 3:2, and there are 1000 students in total. To find out how many are male and how many are female, we can set up a proportion:

Let \( M \) be the number of male students and \( F \) be the number of female students.
\[ \frac{M}{F} = \frac{3}{2} \]
\[ M + F = 1000 \]

From the ratio, we can express \( M \) in terms of \( F \):
\[ M = \frac{3}{2}F \]

Now we can substitute \( 

In [None]:
answer = instance["answer_str_digit"] 
prediction =  utils.GSM8KParser.get_answer_from_pred(responses[0])["answer_str_digit"]

print(f"Prediction -> {prediction}")
print(f"Label -> {answer}")

Prediction -> <INVALID_ANSWER>
Label -> 52


In [None]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0])

# this time the model writes its solution in LaTeX, which leads to only 2 successful parses
print(parsed_eqs["equations"])

['5201000100=52', '1000-480=520', '23600=400', '400+80=480', '52=1000', '15400=80', '32+=1000']


In [None]:
parsed_eqs = utils.GSM8KParser.parse_equations_from_pred(responses[0], include_text=True)

# this time a text-inclusive parser provides much better extraction in terms of 
# numerical operations required 
print(parsed_eqs["equations"])

['\\frac{2}{3} \\times 600 = 400', '400 \\text{ (male students)} + 80 \\text{ (female students)} = 480', '\\frac{1}{5} \\times 400 = 80', '\\frac{5}{2}F = 1000', '1000 \\text{ (total students)} - 480 \\text{ (students who like basketball)} = 520', '\\frac{3}{2}F + F = 1000', '\\frac{520}{1000} \\times 100\\% = 52']


# Generate Synthetic Data 
***
- Here we generate 10 responses per question using our base model 
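
The core of rejection sampling can be sketched as follows: keep only the sampled completions whose final answer matches the label, and drop completions whose list of equations has already been seen. This is a simplified illustration with hypothetical helpers (`final_answer`, `equation_key`, `rejection_sample`) and toy completions; it is not the code in `generation.py`.

```python
import re
from typing import List, Optional, Tuple

NUM = r"\d+(?:\.\d+)?"
EQ_RE = re.compile(rf"{NUM}(?:\s*[+\-*/]\s*{NUM})+\s*=\s*{NUM}")


def final_answer(text: str) -> Optional[str]:
    """Number after the last '####' marker, or None if there is none."""
    found = re.findall(rf"####\s*(-?{NUM})", text)
    return found[-1] if found else None


def equation_key(text: str) -> Tuple[str, ...]:
    """Whitespace-free equations in order; used to detect duplicate paths."""
    return tuple(re.sub(r"\s+", "", eq) for eq in EQ_RE.findall(text))


def rejection_sample(completions: List[str], label: str) -> List[str]:
    kept, seen = [], set()
    for completion in completions:
        if final_answer(completion) != label:
            continue  # reject: wrong or missing final answer
        key = equation_key(completion)
        if key in seen:
            continue  # reject: same reasoning path as an earlier sample
        seen.add(key)
        kept.append(completion)
    return kept


samples = [
    "48/2 = 24\n48 + 24 = 72\n#### 72",
    "Half of 48 is 48/2 = 24. Total: 48 + 24 = 72\n#### 72",  # duplicate path
    "48 + 48/2 = 72\n#### 72",                                # distinct path
    "48 * 2 = 96\n#### 96",                                   # wrong answer
]
kept = rejection_sample(samples, label="72")
print(len(kept))
```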


## Example Walkthrough 

In [22]:
# First, we adjust the generation config 
# [1] Bump up the temperature to increase the uncertainty; this boosts the diversity of the reasoning paths.
# [2] With top-p sampling, a higher temperature also tends to increase the number of tokens inside 
# the top-p pool, which further diversifies the tokens available to us at each time step. 
# We leave top-p unset (the entry is commented out below) so that the pool is not truncated further. 
# This is the same as suggested in the scaling paper 

num_sequence = 10
generation_config = {
    "max_new_tokens" : TrainData.inf_seq_length,
    "temperature": 0.7,
    "num_return_sequences":num_sequence,
    #"top_p": 0.5,
    "eos_token_id":tokenizer.eos_token_id,  # Specify the EOS token
    "pad_token_id":tokenizer.eos_token_id, 
    "do_sample":True,
    "output_scores":False,
    "return_dict_in_generate":True,
}
batch_size = 2
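
The interaction between temperature and top-p described in the comments can be checked on a toy distribution. The sketch below uses made-up logits (an assumption, not model output) and counts how many tokens fall inside the nucleus for a fixed `top_p` as the temperature rises.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs: List[float], top_p: float) -> int:
    """Smallest number of most-likely tokens whose total mass reaches top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)


logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # toy values for illustration only
for temperature in (0.1, 0.7, 1.5):
    size = nucleus_size(softmax(logits, temperature), top_p=0.9)
    print(f"temperature={temperature}: {size} token(s) in the top-p pool")
```

At low temperature almost all of the mass sits on one token, so the pool collapses to a single candidate; flattening the distribution lets more tokens in.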

In [23]:
generations = generation.get_generations(
    tokenizer,
    model,
    TrainData,
    batch_size=batch_size,
    **generation_config
)
# force only one generation with a break 

  0%|          | 0/3737 [00:00<?, ?it/s]

  0%|          | 0/3737 [00:09<?, ?it/s]


In [24]:
instance_sequences = []
for gen in generations: 

    sequences = gen.sequences

    counter = 0
    for i in range(0, sequences.shape[0], generation_config["num_return_sequences"]):
        instance_sequences.append(sequences[i:i+generation_config["num_return_sequences"]])
        counter += 1 

    assert counter == batch_size

print(len(instance_sequences), instance_sequences[0].shape)

2 torch.Size([10, 480])


In [25]:
unique_correct_completions,incorrect_completions,unique_correct_completions_eqs =\
    generation._filter_completions(instance_sequences[0], TrainData.max_length_question, tokenizer)
print(f"We filtered {len(unique_correct_completions)} unique correct completions") 

We filtered 10 unique correct completions


In [26]:
best_idx, best_worst_idx, gap = generation._socre_equations(unique_correct_completions_eqs, unique_correct_completions)
print(f"Best index: {best_idx}")
print(f"Worst index: {best_worst_idx}")
print(f"Gap: {gap}")

Best index: 1
Worst index: 0
Gap: 0.765527950310559


In [27]:
for i in range(len(unique_correct_completions)):
    print(i)
    print(unique_correct_completions[i])
    print('*'*100)

0
Natalia sold clips to 48 of her friends in April and half as many in May, so the total number of clips sold can be represented by the equation: 48 (April sales) + (48/2) (May sales) = 48 + 24 = 72 #### 72
****************************************************************************************************
1
Natalia sold clips to 48 friends in April and half as many in May, so the equations are:

April sales: 48
May sales: 48/2

Total sales (April + May): 48 + (48/2)

Simplifying the equation:

Total sales = 48 + 24

Total sales = 72

#### 72
****************************************************************************************************
2
Natalia sold clips to 48 friends in April and half as many in May, so the equation representing the total number of clips sold in April and May is: 48 + (48/2)

Calculating the total: 48 + 24 = 72

#### 72
****************************************************************************************************
3
Natalia sold clips to 48 friends in Apr

**We can see that the best completion selected is indeed better in terms of the structure and reasoning steps**

## End-to-end generation based on the training set 

In [None]:
#!python generation.py --run-name "corrected-pred-parser-1.0lev-" --generation_path "generations/gsm8k_synthetic_data_747instances_5samples_generations" --beta_1 1.0

  0%|          | 0/7473 [00:00<?, ?it/s]

```bash
python generation.py --generation-path "generations/gsm8k_synthetic_data_747instances_5samples_generations" --run-name "corrected-predp-parser-0.5levelwithtext-0.5leneq-" --beta-1 0.5 --include-text
```

- 1.0 Lev - Mean Optimality Gap = 12.66
- 0.75 Lev - Mean Optimality Gap = 1.13
- 0.5 Lev - Mean Optimality Gap = 1.15
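
One way to read the `beta_1` weighting is as a blend between a Levenshtein-based diversity term and an equation-length term. The sketch below is only an assumed form of such a utility (normalised mean edit distance to the other completions, weighted by `beta_1`, plus normalised equation length, weighted by `1 - beta_1`); the actual scoring in `generation.py` may be defined differently, and the function names and toy equations here are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def _normalise(values: List[float]) -> List[float]:
    top = max(values) or 1.0  # avoid dividing by zero
    return [v / top for v in values]


def score(equations: List[str], beta_1: float) -> Tuple[int, int, float]:
    """Return (best index, worst index, utility gap) under the assumed utility."""
    n = len(equations)
    # distance to itself is 0, so dividing by n - 1 averages over the others
    mean_dist = [
        sum(levenshtein(e, other) for other in equations) / max(n - 1, 1)
        for e in equations
    ]
    lengths = [float(len(e)) for e in equations]
    utility = [
        beta_1 * d + (1.0 - beta_1) * length
        for d, length in zip(_normalise(mean_dist), _normalise(lengths))
    ]
    best = max(range(n), key=utility.__getitem__)
    worst = min(range(n), key=utility.__getitem__)
    return best, worst, utility[best] - utility[worst]


eqs = ["48+24=72", "48/2=24;48+24=72", "48+48/2=72"]  # toy equation strings
print(score(eqs, beta_1=0.5))
```

With `beta_1=1.0` only the edit-distance term matters, and with `beta_1=0.0` the longest equation string always wins.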

# Training
***

Employ whatever trick you would like to reduce the VRAM requirements during training (including swapping the model for a smaller one, although please only as a last resort).

In [None]:
#!accelerate launch lora.py --train --run-name "[attn-ffn]-lora-64r-64alpha-componly" --no-evaluate

In [None]:
%wand

# Evaluating the Model
*** 

This final part is more free-form. We'd like to evaluate our new model on the test set to see if it's improved, but then spend however much time you have left examining the model more closely / demonstrating some interesting behaviour / showing off beautiful plots.


Make sure that you align your accelerate config with the provided ```deepspeed_inference.yaml``` before trying these scripts :-D 

## Evaluate Phi-3.5 

```bash
accelerate launch lora.py --evaluate --model-name "[attn-ffn]-lora-64r-64alpha_rft_747instances" --no-train
``` 

In [26]:
paths_to_compare = {
    "baseline": "results/results_microsoft-Phi-3.5-mini-instruct_.csv",
}

datas = {k : pd.read_csv(v) for k, v  in paths_to_compare.items()}

for k, v in datas.items(): 
    mean_ = v["maj_1s"].mean()
    print(f"### Experiment:{k}\n### Mean Maj@1 {mean_}")

### Experiment:baseline
### Mean Maj@1 0.8257575757575758
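
For reference, Maj@k scores a question as correct when the most common answer among k sampled completions equals the label, so Maj@1 is plain single-sample accuracy. The sketch below is a minimal version with hypothetical names; dropping `<INVALID_ANSWER>` predictions before voting is an assumption, and the evaluation in `lora.py` may handle them differently.

```python
from collections import Counter
from typing import List

INVALID = "<INVALID_ANSWER>"


def maj_at_k(predictions: List[str], label: str) -> int:
    """1 if the majority answer among the k predictions equals the label."""
    valid = [p for p in predictions if p != INVALID]
    if not valid:
        return 0
    majority, _ = Counter(valid).most_common(1)[0]
    return int(majority == label)


def mean_maj(all_predictions: List[List[str]], labels: List[str]) -> float:
    """Average Maj@k over a dataset."""
    scores = [maj_at_k(p, y) for p, y in zip(all_predictions, labels)]
    return sum(scores) / len(scores)


# toy predictions for three questions, k = 3 samples each
toy_predictions = [["72", "72", "70"], [INVALID, INVALID, INVALID], ["5", "6", "6"]]
toy_labels = ["72", "52", "5"]
print(mean_maj(toy_predictions, toy_labels))
```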


## Evaluate LoRA Adaptation on the Union between two datasets 

### Case 1. Completion utility is purely based on the Levenshtein distance 
#### a.k.a. $\beta_1=1.0$

[attn-ffn]-lora-64r-64alpha_rft_747instances 

stands for:

LoRA attached to the attention and feed-forward layers, using rank=64 with alpha=64 (stabilised), trained on 747 rejection-sampled instances 


cli: 
```bash
accelerate launch lora.py --evaluate --model-name "[attn-ffn]-lora-64r-64alpha_rft_747instances" --no-train
``` 

In [27]:
paths_to_compare = {
    "baseline": "results/results_microsoft-Phi-3.5-mini-instruct_.csv",
    "lora_adapted": "results/results_models-[attn-ffn]-lora-32r-25alpha-componly-uniondata-qlora_rft_747instances-checkpoint-117_.csv",
}

datas = {k : pd.read_csv(v) for k, v  in paths_to_compare.items()}

for k, v in datas.items(): 
    mean_ = v["maj_1s"].mean()
    print(f"### Experiment:{k}\n### Mean Maj@1 {mean_}")

### Experiment:baseline
### Mean Maj@1 0.8257575757575758
### Experiment:lora_adapted
### Mean Maj@1 0.8005050505050505


## Evaluate LoRA adaptation purely based on generated data 

```bash
accelerate launch lora.py --no-train --model-name "[attn-ffn]-lora-16r-12alpha-componly-rftdataonly-qlora_rft_747instances/checkpoint-11" --evaluate
``` 

Evaluation stored at ```results/results_[attn-ffn]-lora-16r-12alpha-componly-rftdataonly-qlora_rft_747instances-checkpoint-11_.csv``` 

A slightly different LoRA config is used in this experiment, since there is a much smaller dataset to learn from: 

```python 
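from dataclasses import dataclass

from peft import TaskType


@dataclass
class LoraLayersConfig:
    task_type: TaskType = TaskType.CAUSAL_LM
    r: int = 16
    # evaluated once at class-definition time: int(16 * 0.8) == 12
    lora_alpha: int = int(r * 0.8)
    use_rslora: bool = True
    lora_dropout: float = 0.1
    bias: str = "none"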
```
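
To see what `use_rslora` changes for these settings, the sketch below compares the adapter scaling factor with and without rank stabilisation: standard LoRA scales the update by `lora_alpha / r`, while rank-stabilised LoRA uses `lora_alpha / sqrt(r)`. The helper `lora_scaling` is hypothetical and only reproduces that arithmetic; it does not call PEFT.

```python
import math


def lora_scaling(r: int, lora_alpha: int, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


# (rank, alpha) pairs taken from the run names in this notebook
for r, alpha in [(64, 64), (16, 12)]:
    standard = lora_scaling(r, alpha, use_rslora=False)
    stabilised = lora_scaling(r, alpha, use_rslora=True)
    print(f"r={r}, alpha={alpha}: standard={standard}, rsLoRA={stabilised}")
```

With rank stabilisation the effective scaling is considerably larger at the same `lora_alpha`, which is worth keeping in mind when comparing learning rates across runs.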


In [None]:
paths_to_compare = {
    "baseline": "results/results_microsoft-Phi-3.5-mini-instruct_.csv",
    "lora_adapted": "results/results_[attn-ffn]-lora-16r-12alpha-componly-rftdataonly-qlora_rft_747instances-checkpoint-11_.csv",
    
}

datas = {k : pd.read_csv(v) for k, v  in paths_to_compare.items()}

for k, v in datas.items(): 
    mean_ = v["maj_1s"].mean()
    print(f"### Dataset:{k}\n### Mean Maj@1 {mean_}")

### Dataset:baseline
### Mean Maj@1 0.8257575757575758
### Dataset:lora_adapted
### Mean Maj@1 0.8392255892255892


```bash 
models/[beta1(0.5)-equation-with-text]-[attn-ffn]-[16r-12alpha]-[lr1e-03-gradacum1]-[componly]-[rftdataonly]-qlora_rft_747instances/checkpoint-22
``` 

## Evaluate the model trained in a weighted manner 


```bash 
accelerate launch lora.py --no-train --evaluate --model-name "[attn-ffn]-lora-32r-25alpha-componly-uniondata-weighted-equationwithoutext-qlora_rft_747instances/checkpoint-117/"
```

## Comparison 

### [Optional] - Discussion

We would be interested to know:

1.   If you were less time / computationally constrained, what would you do differently?
2.   What would your ideal first project look like if you joined?

