<a href="https://colab.research.google.com/github/Bryan-Az/Mathematics-LLM/blob/training/%5BFinetuning%5D_Mathematics_Model_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning the 'SmolLM2-Instruct' Pre-trained Education & Mathematics Problem Solving Model on a GPU Environment
This notebook runs in a GPU environment on Google Colab; the run logged below used an NVIDIA A100 (40 GB). The pre-trained foundation model is the publicly available HuggingFaceTB/SmolLM2-1.7B-Instruct. Authentication with Hugging Face is required to push the fine-tuned model to the Hub.

## Imports and Installs

In [1]:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip install --upgrade --force-reinstall --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


In [2]:
#from transformers import AutoTokenizer
#from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

In [3]:
import torch.nn as nn
import wandb
__wandb__=True
from transformers import get_linear_schedule_with_warmup,get_cosine_schedule_with_warmup

In [4]:
import math
from dataclasses import dataclass, field
from typing import List, Optional
from collections import defaultdict
import torch
import torch.nn as nn
import re
from transformers import LlamaConfig
from unsloth import FastLanguageModel
from transformers import TrainingArguments, Trainer

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [5]:
%%capture
!pip install datasets
from torch.utils.data import Dataset as TorchDataset
from datasets import load_dataset
from torch.optim import Adam

In [6]:
import pandas as pd
# import library to keep time using .now
import datetime

## Loading the Tokenizer of the Pre-trained SmolLM2 1.7B Model
The model's tokenizer is needed to tokenize the dataset, so the model and repository names are defined first.

In [7]:
MAX_INPUT=4096
#SmolLM2 is a small pretrained general model on educational content (including math)
MODEL = "HuggingFaceTB/SmolLM2-1.7B-Instruct"         #"unsloth/Llama-3.2-1B-Instruct" #You should be able to use 7B model with no changes! There should be enough HBM
SAVED_MODEL = "Alexis-Az/Math-Problem-LlaMA-3.2-1.7B"
SAVED_MODEL_GGUF = "Alexis-Az/Math-Problem-LlaMA-3.2-1.7B-GGUF"

## Loading the Pre-trained Model with LoRA Adapters using Unsloth
Adding LoRA adapters lets us fine-tune the model on our math dataset while training only a small fraction of its weights.
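As a rough illustration of what an adapter does, the sketch below applies a LoRA update to a single linear layer using plain Python lists. The tiny matrices and the helper names (`matvec`, `lora_forward`) are made up for this example; only the scaling rule `alpha / r` reflects the standard LoRA formulation. It is a toy model of the idea, not how PEFT or Unsloth implement it.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, alpha):
    """Frozen base output W x plus the scaled low-rank update (alpha / r) * B (A x)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]

# Hypothetical 2x3 frozen weight and a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.0]]          # r x in_features
B_zero = [[0.0], [0.0]]        # out_features x r, zero at initialisation
B_trained = [[1.0], [2.0]]     # pretend values after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, r=1, alpha=2)
y_trained = lora_forward(W, A, B_trained, x, r=1, alpha=2)
print(y_init)     # identical to the frozen layer's output
print(y_trained)  # base output shifted by the low-rank update
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model exactly; training only moves `A` and `B`. With the values used later in this notebook (`r = 12`, `lora_alpha = 16`) the scale factor would be 16 / 12.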

In [8]:
#set device
device= f'cuda:{torch.cuda.current_device()}'
device

'cuda:0'

In [9]:
from unsloth import is_bfloat16_supported
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(MODEL, max_seq_length=max_seq_length, dtype=None)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

HuggingFaceTB/SmolLM2-1.7B-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.


## Loading the Dataset

In [10]:
train_dataset="Alexis-Az/math_datasets"
# ~1/5 of the dataset is used for validation
train_data_derivs = load_dataset(train_dataset, name='derivatives', split='train[:8000]').shuffle()
val_derivs = (load_dataset(train_dataset, 'derivatives', split="train[-2000:]")).shuffle()

In [11]:
train_data_roots = load_dataset(train_dataset, 'roots', split='train[:8000]').shuffle()
val_roots = (load_dataset(train_dataset, 'roots', split="train[-2000:]")).shuffle()
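The slices `train[:8000]` and `train[-2000:]` are disjoint only if each configuration has at least 10,000 rows, and the notebook does not state the dataset size. The helper below is a hypothetical check (the function name and the row counts passed to it are assumptions for illustration) that could be run against `len()` of the full split.

```python
def slices_overlap(num_rows, head, tail):
    """Return True if rows [0:head] and [-tail:] share at least one row."""
    head_end = min(head, num_rows)        # first row index NOT in the head slice
    tail_start = max(num_rows - tail, 0)  # first row index in the tail slice
    return tail_start < head_end

# Assumed dataset sizes, for illustration only.
print(slices_overlap(10_000, 8_000, 2_000))
print(slices_overlap(9_000, 8_000, 2_000))
```

If a split has more than 10,000 rows the two slices stay disjoint, but the rows in between are simply unused. Note also that `.shuffle()` is called without a seed, so the row order differs between runs.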

In [12]:
processor = AutoProcessor.from_pretrained(
    MODEL
)

tokenizer_config.json:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

In [13]:
'''
  This collate function builds each example with the chat template
  used by SmolLM
'''

def collate_fn(examples):
    texts = []
    for example in examples:
        question = ""
        answer = ""
        # Each row has a 'Function' column plus either 'Roots' or 'Derivatives'.
        for key, data in example.items():
            if key == 'Function':
                question = f"Can you help me solve this math problem? {data}"
            if key == 'Roots':
                answer = f"Here's the answer to solve this root-based math problem: {data}"
            if key == 'Derivatives':
                answer = f"Here's the answer to solve this derivative-based math problem: {data}"
        messages = [
            {
                "role": "user",
                "content": question
            },
            {
                "role": "assistant",
                "content": answer
            }
        ]

        # The chat template adds the <|im_start|>/<|im_end|> markers itself,
        # so they are not written by hand above. tokenize=False returns a string.
        final_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        # Append the full conversation (question and answer), not only the last message.
        texts.append(final_text.strip())
    batch = processor(text=texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    # Ignore padding positions in the loss; the attention mask marks them with 0.
    labels[batch["attention_mask"] == 0] = -100
    batch["labels"] = labels

    return batch
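The label-masking step can be shown without any library. The sketch below pads toy token-id sequences and sets the labels of padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default. Right-padding, the token ids, and the helper name `pad_and_mask` are assumptions for illustration. The sketch masks by position rather than by token id, which keeps a genuine token in the loss even when it shares its id with the pad token.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def pad_and_mask(sequences, pad_id):
    """Right-pad token-id lists to equal length and build attention mask and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Real tokens keep their id as label; padding is ignored.
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return input_ids, attention_mask, labels

# The second sequence ends with a real token whose id equals the pad id (0).
ids, mask, labels = pad_and_mask([[5, 6, 7], [8, 0]], pad_id=0)
print(ids)
print(mask)
print(labels)
```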

In [14]:
model_name = MODEL.split("/")[-1]
model_name

'SmolLM2-1.7B-Instruct'

In [15]:
training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=250,
    save_total_limit=1,
    optim="adamw_hf", # standard AdamW; an 8-bit optimizer would need a different value, e.g. "adamw_bnb_8bit"
    bf16=True, # train in bfloat16 mixed precision
    output_dir=f"./{model_name}",
    logging_dir=f"./{model_name}-run1",
    hub_model_id=SAVED_MODEL,
    report_to="tensorboard",
    remove_unused_columns=False,
    gradient_checkpointing=True,
)
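These settings imply an effective batch size and a step count that can be worked out by hand. The sketch below does the arithmetic for the 8,000-example training slices, assuming a single GPU and the `Trainer` default linear learning-rate schedule with warmup. The helper names are hypothetical and the code only mirrors the formulas; it does not call the library.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Effective batch size and total optimizer steps for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

def linear_warmup_lr(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

effective_batch, total_steps = optimizer_steps(8000, per_device_batch=4, grad_accum=4)
print(effective_batch, total_steps)

for step in (25, 50, 275, 500):
    print(step, linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_steps=50))
```

Under these assumptions each `trainer.train()` call runs 500 optimizer steps, so the derivatives and roots runs together give the 1000 steps mentioned in the saving section. With `save_steps=250` a checkpoint is written twice per run, and `save_total_limit=1` keeps only the latest one.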

In [16]:
'''
prev_train_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':1000, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': -1, # Overrides num epochs
         'BATCH_SIZE': 2, # Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_derivs),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # default = 'adamw'  options->  ['adamw','SM3','came','adafactor','lion']
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_derivs,
         "TEST_DATASET":val_derivs,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }
'''


In [17]:
ls=LoraConfig(
    r = 12, # Lora Rank should generally be smaller for smaller models
    target_modules = ['q_proj', 'down_proj', 'up_proj', 'o_proj', 'v_proj', 'gate_proj', 'k_proj'],
    lora_alpha = 16, #weight_scaling
    lora_dropout = 0.05, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    modules_to_save = ["lm_head", "embed_tokens"] ## if you use new chat formats or embedding tokens
)

In [18]:
model.add_adapter(ls)

In [19]:
model.enable_adapters()

In [20]:
model = get_peft_model(model, ls)
print(model.get_nb_trainable_parameters())

(214892544, 1926268928)
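The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the SmolLM2-1.7B dimensions: hidden size 2048 and vocabulary size 49152 (both visible in the GGUF conversion log), 24 layers (reported by Unsloth), and an MLP intermediate size of 8192 with full-width key/value projections, which are assumptions not shown in this notebook. Each LoRA pair on a layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights.

```python
def lora_params(in_features, out_features, r):
    """Weights in one LoRA pair: A is r x in_features, B is out_features x r."""
    return r * (in_features + out_features)

hidden, intermediate, layers, vocab, r = 2048, 8192, 24, 49152, 12

# q_proj, k_proj, v_proj, o_proj are assumed to be hidden x hidden.
attention = 4 * lora_params(hidden, hidden, r)
# gate_proj and up_proj map hidden -> intermediate, down_proj maps back.
mlp = 2 * lora_params(hidden, intermediate, r) + lora_params(intermediate, hidden, r)
adapter_total = layers * (attention + mlp)

# modules_to_save keeps full trainable copies of lm_head and embed_tokens.
saved_modules = 2 * vocab * hidden

trainable = adapter_total + saved_modules
print(adapter_total, saved_modules, trainable)
```

The total matches the first number in the tuple, and it shows that the full copies of `lm_head` and `embed_tokens` account for most of the trainable weights, with the LoRA matrices themselves contributing about 13.6 million.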


## Training the Model

### Finetuning on Derivatives

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_data_derivs,
    eval_dataset=val_derivs,
)

In [None]:
trainer.train()

### Finetuning on Roots

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_data_roots,
    eval_dataset=val_roots
)

In [None]:
trainer.train()

## Saving the Model Trained for 1000 Steps to the Hugging Face Hub

In [None]:
# saving the non quantized model
model.push_to_hub(
    SAVED_MODEL,
    tokenizer=tokenizer,
    safe_serialization=True,
    create_pr=True,
    max_shard_size="3GB",
)

tokenizer.push_to_hub(
    SAVED_MODEL,
)

In [23]:
# The model config in the Hub repo had to be fixed because it was incompatible with the model class config set by Unsloth's model loader.
# Do not merge the config file when retraining and saving the model to the repo.
# Comment out this cell if training from the base model has overwritten the config and the config has not been corrected.
model, tokenizer = FastLanguageModel.from_pretrained(SAVED_MODEL, max_seq_length=max_seq_length, dtype=None)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
HuggingFaceTB/SmolLM2-1.7B-Instruct does not have a padding token! Will use pad_token = <|endoftext|>.


adapter_model.safetensors:   0%|          | 0.00/457M [00:00<?, ?B/s]

Unsloth 2024.11.10 patched 24 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


In [24]:
#saving the quantized model
model.push_to_hub_gguf(SAVED_MODEL_GGUF, tokenizer, quantization_method = "q4_k_m")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 3.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 63.02 out of 83.48 RAM for saving.


100%|██████████| 24/24 [00:00<00:00, 112.21it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at Alexis-Az/Math-Problem-LlaMA-3.2-1.7B-GGUF into bf16 GGUF format.
The output location will be /content/Alexis-Az/Math-Problem-LlaMA-3.2-1.7B-GGUF/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Math-Problem-LlaMA-3.2-1.7B-GGUF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> BF16, shape = {2048, 49152}
INFO:hf-to-gguf:token_embd.weight,           tor

unsloth.Q4_K_M.gguf:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/Alexis-Az/Math-Problem-LlaMA-3.2-1.7B-GGUF
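As a sanity check on the quantized file, the logged size of `unsloth.Q4_K_M.gguf` can be turned into an approximate number of bits per weight. The sketch below makes several assumptions: the progress bar's 1.11G is read as 1.11e9 bytes, file metadata is ignored, and the base parameter count is taken as the difference of the two numbers printed by `get_nb_trainable_parameters()`. Two parameter counts are tried because the conversion log lists both `output.weight` and `token_embd.weight`, which suggests (but does not prove) that the exported model stores the output head separately from the embeddings.

```python
def bits_per_weight(file_bytes, num_params):
    """Average storage per parameter, in bits."""
    return file_bytes * 8 / num_params

# Total minus trainable parameters, from the tuple printed earlier.
base_params = 1_926_268_928 - 214_892_544
# Assumed count if lm_head is stored as its own vocab x hidden matrix.
untied_params = base_params + 49_152 * 2_048

file_bytes = 1.11e9  # assumed reading of the logged "1.11G"

print(base_params, round(bits_per_weight(file_bytes, base_params), 2))
print(untied_params, round(bits_per_weight(file_bytes, untied_params), 2))
```

Either way the result is in the region of five bits per weight, roughly what one would expect from a 4-bit scheme that also stores per-block scales and keeps some tensors at higher precision.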
