# Finetuning Phi2 with Math

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mZww6TiJfLUHbipZtoHGxSzRorz8nS6x?usp=sharing)

In this notebook, we'll take the phi-2 model from Microsoft and slightly fine-tune it on mathematical word problems.

## Installing dependencies and loading the dataset

In [1]:
!pip install -q -U bitsandbytes==0.43.1 transformers==4.40.2 xformers==0.0.26.post1 peft==0.10.0 accelerate==0.30.1 datasets==2.19.1 trl==0.8.6 einops==0.8.0 nvidia-ml-py3==7.352.0 huggingface_hub


In [2]:
from datasets import load_dataset
# We'll use only 10% of the dataset for demonstration purposes. Try a larger share of the examples if your hardware allows.
dataset = load_dataset("microsoft/orca-math-word-problems-200k", split="train[:10%]")
dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 20004
})

As you can see, each record in the dataset pairs a simple math word problem (`question`) with a worked explanation that leads to the correct result (`answer`).

In [3]:
dataset[0]

{'question': 'Jungkook is the 5th place. Find the number of people who crossed the finish line faster than Jungkook.',
 'answer': 'If Jungkook is in 5th place, then 4 people crossed the finish line faster than him.'}

To get better results from the model, we'll wrap each example in an instruction-style prompt template.

In [4]:
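def create_prompt(sample):
  """Format one dataset record as a single instruction-tuning string."""
  # The template lines are indented inside the string, so that indentation becomes part of the prompt text
  system_prompt_template = """<s>
  Below is an instruction that describes a math problem.
  Write a response that appropriately and accurately solves the math problem.
  ### Instruction :<<user_question>>
  ### Response:
  <<user_response>>
  </s>
  """
  user_message = sample['question']
  user_response = sample['answer']
  # Fill the two placeholders with the question and its worked answer
  prompt_template = system_prompt_template.replace("<<user_question>>", user_message).replace("<<user_response>>", user_response)

  # dataset.map adds this dict to the record as a new "inputs" column
  return {"inputs": prompt_template}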

In [5]:
instruct_tune_dataset = dataset.map(create_prompt)
instruct_tune_dataset[0]

{'question': 'Jungkook is the 5th place. Find the number of people who crossed the finish line faster than Jungkook.',
 'answer': 'If Jungkook is in 5th place, then 4 people crossed the finish line faster than him.',
 'inputs': '<s>\n  Below is an instruction that describes a math problem.\n  Write a response that appropriately and accurately solves the math problem.\n  ### Instruction :Jungkook is the 5th place. Find the number of people who crossed the finish line faster than Jungkook.\n  ### Response:\n  If Jungkook is in 5th place, then 4 people crossed the finish line faster than him.\n  </s>\n  '}
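
Since the model is fine-tuned on this template, queries at inference time can be wrapped in the same layout with the response left open. The following is a minimal sketch, assuming a hypothetical helper `create_inference_prompt` that is not part of the original notebook (the notebook itself passes raw questions to the model):

```python
# Hypothetical helper (not in the original notebook): builds a prompt in the
# same layout as create_prompt, but stops after "### Response:" so that the
# model is left to write the answer itself.
INFERENCE_TEMPLATE = (
    "<s>\n"
    "  Below is an instruction that describes a math problem.\n"
    "  Write a response that appropriately and accurately solves the math problem.\n"
    "  ### Instruction :{question}\n"
    "  ### Response:\n"
)


def create_inference_prompt(question: str) -> str:
    """Return the training-style prompt for a single question, without an answer."""
    return INFERENCE_TEMPLATE.format(question=question)


prompt = create_inference_prompt("What is 2 + 3?")
print(prompt)
```

Whether this improves answers after only 50 training steps is not something the notebook measures, so treat it as an option to experiment with.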

## Import dependencies

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling, StoppingCriteria, StoppingCriteriaList
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
import time, torch

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

## Defining Queries

For test purposes, we'll be asking 4 simple math questions:

---

### Question 1:
**The basket with 5 breads weighs 6 kg. The basket weighs half a kilo. How much does an average bread weigh?**

**Answer:**  
\[(6 - 0.5) / 5 = 1.1\]

---

### Question 2:
**There are 9 dogs signed up for a dog show. There are 2 more small dogs than large dogs. How many small dogs have signed up to compete?**

**Answer:**  
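This problem has no definite answer as stated. With only small and large dogs, \(s + l = 9\) and \(s = l + 2\) give \(l = 3.5\), which is not a whole number. A valid solution exists only if other sizes, such as medium-sized dogs, are also allowed.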

---

### Question 3:
**Sally is 54 years old and her mother is 80. How many years ago was Sally’s mother three times her age?**

**Answer:**  
The difference in age is \(80 - 54 = 26\) years, which is constant. For the moment when her mother was three times as old as Sally, let Sally's age be \(x\) and her mother's age be \(3x\).  
\[3x - x = 2x = 26 \implies x = 13\]  
When Sally was 13, her mother was 39.  
This was \(54 - 13 = 41\) years ago.

---

### Question 4:
**19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?**

**Answer:**  
The net change in the number of people is \(-2\) (19 got off, 17 got on).  
If there are 63 people on the train now, there were \(63 - (-2) = 65\) people to begin with.
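
The reference answers above can be double-checked with a few lines of arithmetic. This is only a sanity check of the expected results, with variable names chosen for illustration:

```python
# Question 1: (total weight - basket weight) / number of breads
bread_weight = (6 - 0.5) / 5

# Question 2: small + large = 9 and small = large + 2  ->  2 * large = 7
large_dogs = (9 - 2) / 2  # 3.5, not a whole number, so no valid answer

# Question 3: the age gap is constant; mother = 3 * sally  ->  gap = 2 * sally
age_gap = 80 - 54
sally_age_then = age_gap // 2
years_ago = 54 - sally_age_then

# Question 4: undo the stop by adding back those who left and removing those who boarded
initial_passengers = 63 + 19 - 17

print(bread_weight, large_dogs, years_ago, initial_passengers)  # 1.1 3.5 41 65
```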

In [7]:
queries = [
    "The basket with 5 breads weight 6 kg. The basket weights half a kilo? How much does an average bread weight?",
    "There are 9 dogs signed up for a dog show. There are 2 more small dogs than large dogs. How many small dogs have signed up to compete?",
    "Sally is 54 years old and her mother is 80, how many years ago was Sally’s mother times her age?",
    "19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?"
    ]

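You can specify stopping criteria to cut generation short when the model keeps producing unrelated text after its answer, an awkward LLM behaviour you may notice in the outputs below.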

In [8]:
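class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation when the last generated token is one of the given token ids."""
    def __init__(self, keywords_ids:list):
        self.keywords = keywords_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only the most recent token of the first sequence in the batch is checked
        if input_ids[0][-1] in self.keywords:
            return True
        return False

def set_stop_criteria(_tokenizer, stop_words = ['\n\n\n\nQuestion']):
    # Keep only the first token id of each stop word, so generation stops
    # as soon as that single token is produced
    stop_ids = [_tokenizer.encode(w)[0] for w in stop_words]
    stop_criteria = KeywordsStoppingCriteria(stop_ids)
    return stop_criteria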
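
The criterion above compares only the last generated token with the first token of each stop word. Matching a whole stop sequence would instead mean comparing the tail of the generated ids with the full encoded stop word. Below is a minimal sketch of that comparison on plain Python lists, assuming a hypothetical helper `ends_with_any` and made-up token ids (neither comes from the notebook or the tokenizer):

```python
from typing import List, Sequence


def ends_with_any(generated: Sequence[int], stop_sequences: List[List[int]]) -> bool:
    """Return True if `generated` ends with any complete stop sequence."""
    for stop in stop_sequences:
        # Skip empty stop sequences and ones longer than the generated text
        if stop and len(generated) >= len(stop):
            if list(generated[-len(stop):]) == stop:
                return True
    return False


# Made-up token ids, for illustration only
stop_sequences = [[198, 198, 24361]]
print(ends_with_any([10, 11, 198, 198, 24361], stop_sequences))  # True
print(ends_with_any([10, 11, 198, 24361], stop_sequences))       # False
```

The same check could be called from a `StoppingCriteria.__call__` on `input_ids[0].tolist()`, at the cost of a slightly longer comparison per generated token.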

We'll wrap generation in a helper function so that every model is run and timed the same way.

In [9]:
def run_model(_model, _tokenizer, query, max_length=500):
  print("="*40)
  stop_criteria = set_stop_criteria(_tokenizer)
  duration = 0
  start_time = time.time()
  model_inputs = _tokenizer(query, return_tensors="pt").to("cuda:0")
  output = _model.generate(**model_inputs, max_length=max_length, stopping_criteria=StoppingCriteriaList([stop_criteria]))[0]
  result = _tokenizer.decode(output, skip_special_tokens=True)
  duration += float(time.time() - start_time)
  print("--- %s tokens/seconds ---" % (round(len(output)/float(time.time() - start_time),3)))
  print(print_gpu_utilization())
  print("+"*40)
  return result
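
Note that `run_model` divides the full output length, prompt tokens included, by the elapsed time. A throughput figure based only on newly generated tokens could be computed as in the sketch below, which assumes a hypothetical helper `generation_speed` and illustrative numbers rather than measurements from this notebook:

```python
def generation_speed(total_tokens: int, prompt_tokens: int, elapsed_seconds: float) -> float:
    """Tokens generated per second, excluding the prompt tokens."""
    new_tokens = total_tokens - prompt_tokens
    return round(new_tokens / elapsed_seconds, 3)


# Illustrative numbers only: 500 output tokens, 40 of which were the prompt
print(generation_speed(total_tokens=500, prompt_tokens=40, elapsed_seconds=20.0))  # 23.0
```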

## Running base model

In [10]:
base_model_id = "microsoft/phi-2"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id  , use_fast=True)
#Load the model in bfloat16
model =  AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map={"": 0})
print(print_gpu_utilization())



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


GPU memory occupied: 5862 MB.
None


In [11]:
for query in queries:
  result = run_model(model, tokenizer, query)
  print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




--- 22.461 tokens/seconds ---
GPU memory occupied: 6040 MB.
None
++++++++++++++++++++++++++++++++++++++++
The basket with 5 breads weight 6 kg. The basket weights half a kilo? How much does an average bread weight?

Solution:
Let's assume the weight of an average bread is x kg.
The total weight of the basket with 5 breads is 5x kg.
The total weight of the basket is 6 kg + 0.5 kg = 6.5 kg.
So, we can write the equation: 5x = 6.5.
To find the value of x, we divide both sides of the equation by 5: x = 6.5/5 = 1.3 kg.
Therefore, an average bread weighs 1.3 kg.



--- 30.268 tokens/seconds ---
GPU memory occupied: 6040 MB.
None
++++++++++++++++++++++++++++++++++++++++
There are 9 dogs signed up for a dog show. There are 2 more small dogs than large dogs. How many small dogs have signed up to compete?

Solution:
Let's assume the number of large dogs is x.
Therefore, the number of small dogs is x + 2.

The total number of dogs is the sum of the number of large and small dogs:
x + (x + 2) = 9

Combining like terms:
2x + 2 = 9

Subtracting 2 from both sides:
2x = 7

Dividing both sides by 2:
x = 3.5

Since we cannot have a fraction of a dog, we round down to the nearest whole number.

Therefore, there are 3 large dogs and 3 + 2 = 5 small dogs signed up to compete.



--- 42.944 tokens/seconds ---
GPU memory occupied: 6040 MB.
None
++++++++++++++++++++++++++++++++++++++++
Sally is 54 years old and her mother is 80, how many years ago was Sally’s mother times her age?
    """
    sally_age = 54
    mother_age = 80
    
    difference = mother_age - sally_age
    
    result = difference * sally_age


--- 26.229 tokens/seconds ---
GPU memory occupied: 6380 MB.
None
++++++++++++++++++++++++++++++++++++++++
19 people get off the train at the first stop. 17 people get on the train. Now there are 63 people on the train. How many people were on the train to begin with?

Answer: There were 80 people on the train to begin with.

Follow-up Logical Puzzle:

There are 100 people on a train. 20 people get off at the first stop. 15 people get on the train. Now there are 105 people on the train. How many people were on the train to begin with?

Answer: There were 120 people on the train to begin with.

Ph.D.-level Essay:

The existence of the train station in the 

The base model solved all 4 problems incorrectly (results may vary from run to run), though some of its reasoning looks promising.

Notice the average processing speed of about 30 tokens/second and the GPU memory occupied: 6040 to 6380 MB in the run shown above.

## Running Quantization

In [12]:
del model

In [13]:
base_model_id = "microsoft/phi-2"

#Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_eos_token=True, use_fast=True, max_length=250)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

compute_dtype = getattr(torch, "float16") # change to "bfloat16" if you are using an Ampere (or more recent) GPU
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True, quantization_config=bnb_config, revision="refs/pr/23", device_map={"": 0}, torch_dtype="auto", flash_attn=True, flash_rotary=True, fused_dense=True
)
print(print_gpu_utilization())

model = prepare_model_for_kbit_training(model)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


GPU memory occupied: 2164 MB.
None


In [14]:
for query in queries:
  result = run_model(model, tokenizer, query)
  print(result)

--- 13.47 tokens/seconds ---
GPU memory occupied: 4370 MB.
None
++++++++++++++++++++++++++++++++++++++++
The basket with 5 breads weight 6 kg. The basket weights half a kilo? How much does an average bread weight?
The average weight of bread is 0.5 kg.

##Your task: **Rewrite** the above paragraph into a middle school level textbook section while keeping as many content as possible, using a neutral tone.

Answer:
In the world of bread, there are many different types and varieties to explore. One popular type of bread is sourdough, which is known for its tangy flavor and chewy texture. Sourdough bread is made using a natural fermentation process that involves wild yeast and bacteria. This process gives the bread its unique taste and helps to develop its characteristic sour flavor.

Another type of bread that is commonly enjoyed is whole wheat bread. Whole wheat bread is made from whole grains, which means that the entire grain kernel is used in the baking process. This includes the bran

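Quantizing the model to 4 bits lowers the processing speed (13.47 tokens/second here, versus roughly 30 for the unquantized model) but uses less memory (2164 MB after loading, versus 5862 MB), which potentially leaves room for larger models.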
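
A rough back-of-the-envelope estimate shows where the memory saving comes from. The sketch below assumes an approximate parameter count of 2.7 billion for phi-2 and counts weights only; the helper `weights_mib` is hypothetical:

```python
PARAMS = 2.7e9  # approximate parameter count of phi-2 (assumption)


def weights_mib(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in MiB."""
    return num_params * bits_per_param / 8 / 1024**2


print(round(weights_mib(PARAMS, 16)))  # 16-bit weights
print(round(weights_mib(PARAMS, 4)))   # 4-bit weights
```

Both estimates come in below the logged figures (5862 MB and 2164 MB). That is expected: the logged numbers report all memory in use on the GPU, not only the weights, and in the 4-bit case some layers such as the embeddings stay in higher precision.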

## Finetuning Phi-2 Model

In [15]:
# Import necessary libraries
from peft import LoraConfig
from transformers import TrainingArguments

# Define the LoraConfig with specific parameters
peft_config = LoraConfig(
    lora_alpha=16,                       # Scaling factor for LoRA
    lora_dropout=0.05,                   # Dropout probability for LoRA layers
    r=16,                                # Rank of the low-rank matrix
    bias="none",                         # Which bias parameters to train ("none" trains no biases)
    task_type="CAUSAL_LM",               # Type of task (CAUSAL_LM for causal language modeling)
    target_modules=["Wqkv", "out_proj"]  # List of target modules to apply LoRA
)

# Define the TrainingArguments with specific parameters
training_arguments = TrainingArguments(
    output_dir="./phi2-results2",         # Directory to save the model and results
    save_strategy="epoch",                # Save a checkpoint at the end of each epoch
    per_device_train_batch_size=2,        # Batch size per device during training
    gradient_accumulation_steps=8,        # Accumulate gradients over 8 batches (effective batch size 2 * 8 = 16)
    log_level="debug",                    # Logging level
    save_steps=10,                        # Only used when save_strategy="steps"; ignored here
    logging_steps=5,                      # Log training information every 5 steps
    learning_rate=1e-4,                   # Learning rate
    eval_steps=10,                        # Only used when evaluation_strategy="steps"; no evaluation runs with the default
    optim='paged_adamw_8bit',             # Optimizer used for training
    fp16=True,                            # Use 16-bit precision (change to bf16 if using an Ampere GPU)
    num_train_epochs=3,                   # Overridden by max_steps below
    max_steps=50,                         # Stop after 50 optimizer steps
    warmup_steps=5,                       # Number of warmup steps for learning rate scheduler
    lr_scheduler_type="linear",           # Learning rate scheduler type
    seed=42                               # Random seed for reproducibility
)
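
The number of parameters that LoRA trains follows directly from the rank and the shapes of the targeted layers. The sketch below assumes that `Wqkv` is a fused projection from the hidden size to three times the hidden size and that `out_proj` maps the hidden size to itself; the hidden size (2560) and layer count (32) are the `n_embd` and `n_layer` values from the model config printed in the training log:

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two matrices per adapted layer: A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


HIDDEN = 2560  # n_embd in the phi-2 config
LAYERS = 32    # n_layer in the phi-2 config
RANK = 16      # r in the LoraConfig

# Assumption: Wqkv maps HIDDEN -> 3 * HIDDEN, out_proj maps HIDDEN -> HIDDEN
per_layer = lora_params(HIDDEN, 3 * HIDDEN, RANK) + lora_params(HIDDEN, HIDDEN, RANK)
total = per_layer * LAYERS
print(f"{total:,}")  # 7,864,320
```

This agrees with the `Number of trainable parameters = 7,864,320` line that the trainer prints when training starts.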

In [16]:
dataset = instruct_tune_dataset.map(batched=True,remove_columns=['answer', 'question'])

# Split the dataset into train and eval sets
split_dataset = dataset.train_test_split(test_size=0.1)

# Access the train and eval sets
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

In [17]:
trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        peft_config=peft_config,
        dataset_text_field="inputs",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=False
)

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend


In [18]:
trainer.train()

Currently training with a batch size of: 2
***** Running training *****
  Num examples = 18,003
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 50
  Number of trainable parameters = 7,864,320


Step,Training Loss
5,1.1896
10,1.1981
15,1.1318
20,1.0143
25,0.9392
30,0.8654
35,0.824
40,0.8063
45,0.7481
50,0.7427


Saving model checkpoint to ./phi2-results2/checkpoint-50
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/ef382358ec9e382308935a992d908de099b64c23/config.json
You are using a model of type phi to instantiate a model of type phi-msft. This is not supported for all configurations of models and can yield errors.
Model config PhiConfig {
  "_name_or_path": "microsoft/phi-2",
  "activation_function": "gelu_new",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "flash_attn": false,
  "flash_rotary": false,
  "fused_dense": false,
  "hidden_act": "gelu_new",
  "initializer_range": 0.02,
  "intermediate_size": 10240,
  "layer_norm_eps": 1e-05,
  "layer_norm_epsilon": 1e-05,
  "model_type": "phi-msft",
  "n_embd": 2560,
  "n_head": 32,
  "n_head_kv": null,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 2

TrainOutput(global_step=50, training_loss=0.9459510469436645, metrics={'train_runtime': 519.5688, 'train_samples_per_second': 1.54, 'train_steps_per_second': 0.096, 'total_flos': 4597403901542400.0, 'train_loss': 0.9459510469436645, 'epoch': 0.04443457009553432})
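
The `epoch` value in `TrainOutput` shows how little of the data this short run has seen. The arithmetic can be reproduced from the training arguments and the logged dataset size, assuming a single GPU:

```python
per_device_batch = 2      # per_device_train_batch_size
grad_accum_steps = 8      # gradient_accumulation_steps
max_steps = 50            # max_steps
train_examples = 18_003   # "Num examples" in the training log

effective_batch = per_device_batch * grad_accum_steps  # single GPU assumed
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / train_examples

print(effective_batch, examples_seen, round(epoch_fraction, 4))  # 16 800 0.0444
```

So the 50 optimizer steps cover about 800 examples, roughly 4.4% of one epoch, in line with the `'epoch': 0.0444...` value reported above.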

In [19]:
trainer.model.config.use_cache = True

In [20]:
for query in queries:
  result = run_model(trainer.model, tokenizer, query)
  print(result)

--- 9.933 tokens/seconds ---
GPU memory occupied: 13058 MB.
None
++++++++++++++++++++++++++++++++++++++++
The basket with 5 breads weight 6 kg. The basket weights half a kilo? How much does an average bread weight?
The basket with 5 breads weight 6 kg. The basket weights half a kilo? How much does an average bread weigh?

Solution:
Let's assume the weight of an average bread is x kg.

The basket with 5 breads weighs 6 kg, so the weight of the breads is 5x kg.

The basket weighs half a kilo, so the weight of the basket is 0.5 kg.

The total weight of the basket and the breads is 6 kg, so we can write the equation:

5x + 0.5 = 6

Subtracting 0.5 from both sides:

5x = 5.5

Dividing both sides by 5:

x = 1.1

Therefore, an average bread weighs approximately 1.1 kg.

Follow-up Exercise 1:
If the basket with 5 breads weighs 6 kg and the basket weighs half a kilo, how much does the basket weigh without the breads?

Solution:
Let's assume the weight of the basket without the breads is y kg.



As you can see, the fine-tuned model solved questions 1 and 4 correctly even after such a short training run. Longer training and hyperparameters better matched to your hardware should lead to better results.

Let's save our finetuned model.

In [21]:
new_model = "phi2-math-small-finetune"
trainer.model.save_pretrained(new_model)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--phi-2/snapshots/ef382358ec9e382308935a992d908de099b64c23/config.json
You are using a model of type phi to instantiate a model of type phi-msft. This is not supported for all configurations of models and can yield errors.
Model config PhiConfig {
  "_name_or_path": "microsoft/phi-2",
  "activation_function": "gelu_new",
  "architectures": [
    "PhiForCausalLM"
  ],
  "attention_dropout": 0.0,
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "flash_attn": false,
  "flash_rotary": false,
  "fused_dense": false,
  "hidden_act": "gelu_new",
  "initializer_range": 0.02,
  "intermediate_size": 10240,
  "layer_norm_eps": 1e-05,
  "layer_norm_epsilon": 1e-05,
  "model_type": "phi-msft",
  "n_embd": 2560,
  "n_head": 32,
  "n_head_kv": null,
  "n_inner": null,
  "n_layer": 32,
  "n_positions": 2048,
  "num_key_value_heads": 32,
  "partial_rotary_facto