<a href="https://colab.research.google.com/github/Bryan-Az/Mathematics-LLM/blob/main/%5BTraining%5D_Mathematics_Model_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training the 'Integration' Mathematics Problem Solving Model on a GPU Environment
This notebook targets a T4 GPU environment in Google Colab (the recorded output below comes from a run on an NVIDIA A100). The pre-trained foundation model we are using is the publicly available unsloth/Llama-3.2-1B-Instruct, requiring authentication with Hugging Face.

## Imports and Installs

In [1]:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip install --upgrade --force-reinstall --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


In [2]:
#from transformers import AutoTokenizer
#from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

In [3]:
#!pip install -U bitsandbytes

In [4]:
import math
from dataclasses import dataclass, field
from typing import List, Optional
from collections import defaultdict
import torch
import torch.nn as nn
import re
from transformers import LlamaConfig
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [5]:
%%capture
!pip install datasets
from torch.utils.data import Dataset as TorchDataset
from datasets import load_dataset
from torch.optim import Adam

In [6]:
import pandas as pd
# import library to keep time using .now
import datetime

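## Loading the Tokenizer of the Pre-trained Llama 3.2 1B Model
The tokenizer is needed to build the dataset. The standalone loading code in this section is commented out because Unsloth returns the tokenizer together with the model in the next section.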

In [7]:
MAX_INPUT=4096
MODEL = "unsloth/Llama-3.2-1B-Instruct" #You should be able to use 7B model with no changes! There should be enough HBM
SAVED_MODEL = "Alexis-Az/Math-Problem-LlaMA-3.2-1B"

In [8]:
#tokenizer = AutoTokenizer.from_pretrained(MODEL)
#if 'pad_token' not in tokenizer.special_tokens_map:
#  tokenizer.pad_token=tokenizer.eos_token
#print(f"Tokens :\n {tokenizer.special_tokens_map} \n\n")

## Loading the Pre-trained Model with LoRA Adapters using Unsloth
Adding LoRA adapters will allow us to fine-tune the model on our math dataset.

In [9]:
#set device
device= f'cuda:{torch.cuda.current_device()}'
device

'cuda:0'

In [10]:
from unsloth import is_bfloat16_supported
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(MODEL, max_seq_length=max_seq_length, dtype=None,load_in_4bit=True)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.03G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

## Loading the Dataset

In [11]:
class InstructionDataset(TorchDataset):
    def __init__(self, tokenizer, max_length=1024, dataset=None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        prompts = self.dataset[idx]
        text = ""
        for prompt in prompts:
            data = prompts[prompt]
            if prompt == 'Function':
                text += f"<|im_start|>user\n Can you help me solve this problem? {data}<|im_end|>\n"
            if prompt == 'Roots':
                text += f"<|im_start|>assistant\n Here's the answer to solve this root-based problem: {data}<|im_end|>"
            if prompt == 'Derivatives':
                text += f"<|im_start|>assistant\n Here's the answer to solve this derivative-based problem: {data}<|im_end|>"


        try:
            input_ids = self.tokenizer(text, add_special_tokens=True, max_length=self.max_length, truncation=True, padding="max_length", return_attention_mask=True, return_tensors="pt")
        except Exception as e:  # You can catch specific tokenizer exceptions if known
            print(f"Error tokenizing text at index {idx}: {e}")
            print(f"Problematic text: {text}")  # Print the text causing the issue
            # Handle the exception (e.g., skip the sample, replace with empty tokens)
            # Here, I'll skip the problematic sample:
            return None  # or raise the exception if desired


        if input_ids is None:
            return None
        return {
            "input_ids": input_ids["input_ids"].squeeze(0),
            "labels": input_ids["input_ids"].squeeze(0),
            "attention_mask":input_ids["attention_mask"].squeeze(0),
        }
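
The string assembled in `__getitem__` can be inspected without a tokenizer. The sketch below mirrors that logic in plain Python. `build_prompt` and the sample row are hypothetical; only the column names `Function`, `Roots`, and `Derivatives` and the prompt wording come from the class above.

```python
def build_prompt(row):
    """Assemble the chat-style training text for one dataset row (hypothetical helper)."""
    text = ""
    for column, value in row.items():
        if column == "Function":
            text += f"<|im_start|>user\n Can you help me solve this problem? {value}<|im_end|>\n"
        elif column == "Roots":
            text += f"<|im_start|>assistant\n Here's the answer to solve this root-based problem: {value}<|im_end|>"
        elif column == "Derivatives":
            text += f"<|im_start|>assistant\n Here's the answer to solve this derivative-based problem: {value}<|im_end|>"
    return text


# Made-up row for illustration; real rows come from the Hugging Face dataset.
sample_row = {"Function": "x**2 + 3*x", "Derivatives": "2*x + 3"}
prompt = build_prompt(sample_row)
print(prompt)
```

The `<|im_start|>` and `<|im_end|>` markers follow the ChatML convention. Unless they are registered as special tokens, a Llama 3 tokenizer would likely split them into ordinary sub-word pieces, so it may be worth printing a tokenized sample before training.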

In [12]:
train_dataset="Alexis-Az/math_datasets"
# ~1/5 of the dataset is used for validation
train_data_derivs = load_dataset(train_dataset, name='derivatives', split='train[:8000]').shuffle()
val_derivs = (load_dataset(train_dataset, 'derivatives', split="train[-2000:]")).shuffle()

README.md:   0%|          | 0.00/2.99k [00:00<?, ?B/s]

derivatives/Derivatives.csv:   0%|          | 0.00/1.56M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [13]:
train_data_roots = load_dataset(train_dataset, 'roots', split='train[:8000]').shuffle()
val_roots = (load_dataset(train_dataset, 'roots', split="train[-2000:]")).shuffle()

roots/Roots.csv:   0%|          | 0.00/5.31M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]
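
The split strings `train[:8000]` and `train[-2000:]` follow Python slice semantics. Assuming the 10,000-row train split shown in the progress output above, a plain-list sketch confirms that the two slices do not overlap and together cover every row:

```python
# Stand-in for the 10,000-row train split (row indices only).
rows = list(range(10_000))

train_part = rows[:8000]    # mirrors split='train[:8000]'
val_part = rows[-2000:]     # mirrors split='train[-2000:]'

overlap = set(train_part) & set(val_part)
print(len(train_part), len(val_part), len(overlap))
```

Because the slices are taken before `.shuffle()` is called, shuffling only reorders rows within each part and does not mix training and validation rows.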

In [14]:
train_deriv_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':1000, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': -1,# Overrides num epochs
         'BATCH_SIZE': 2, #Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_derivs),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # only 'adam' is handled by train() in this notebook
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_derivs,
         "TEST_DATASET":val_derivs,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }

In [15]:
train_roots_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':1000, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': -1,# Overrides num epochs
         'BATCH_SIZE': 2, #Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_roots),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # only 'adam' is handled by train() in this notebook
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_roots,
         "TEST_DATASET":val_roots,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }

In [16]:
train_data_derivs_instruct = InstructionDataset(tokenizer, dataset=train_data_derivs, max_length=train_deriv_configs['MAX_INPUT'])
val_derivs_instruct = InstructionDataset(tokenizer, dataset=val_derivs)

#collate fn to skip nones in the batch
def collate_fn(batch):
    batch = list(filter(lambda x: x is not None, batch))
    return torch.utils.data.dataloader.default_collate(batch)


train_deriv_loader = torch.utils.data.DataLoader(train_data_derivs_instruct, batch_size=train_deriv_configs["BATCH_SIZE"], collate_fn=collate_fn,shuffle=True)
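# use the tokenized wrapper so validation batches contain input_ids, labels and attention_mask
testing_deriv_loader = torch.utils.data.DataLoader(val_derivs_instruct, batch_size=train_deriv_configs["BATCH_SIZE"], collate_fn=collate_fn, shuffle=True)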

print(f"Max Steps: {len(train_deriv_loader)}, Batch size: {8*train_deriv_configs['BATCH_SIZE']}")
print(f"Val Size: {len(testing_deriv_loader)}, Batch Size: {8*train_deriv_configs['BATCH_SIZE']}")
train_deriv_configs['STEPS']=len(train_deriv_loader)
train_deriv_configs['BATCH_DATA']=train_deriv_configs['BATCH_SIZE']

Max Steps: 4000, Batch size: 16
Val Size: 1000, Batch Size: 16
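
The step counts printed above are the number of batches per pass: dataset size divided by `BATCH_SIZE`, rounded up. A quick check, using the sizes from the split definitions (8,000 training rows and 2,000 validation rows):

```python
import math

BATCH_SIZE = 2       # from train_deriv_configs
TRAIN_ROWS = 8000    # rows in 'train[:8000]'
VAL_ROWS = 2000      # rows in 'train[-2000:]'

# A DataLoader that keeps the last partial batch yields ceil(rows / batch_size) batches.
train_steps = math.ceil(TRAIN_ROWS / BATCH_SIZE)
val_steps = math.ceil(VAL_ROWS / BATCH_SIZE)
print(train_steps, val_steps)
```

The `8 *` factor in the printed batch size is not passed to the loaders, so each step processes `BATCH_SIZE` = 2 samples rather than 16.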


In [17]:
train_data_roots_instruct = InstructionDataset(tokenizer, dataset=train_data_roots, max_length=train_roots_configs['MAX_INPUT'])
val_roots_instruct = InstructionDataset(tokenizer, dataset=val_roots)

train_roots_loader = torch.utils.data.DataLoader(train_data_roots_instruct, batch_size=train_roots_configs["BATCH_SIZE"],collate_fn=collate_fn, shuffle=True)
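# use the tokenized wrapper so validation batches contain input_ids, labels and attention_mask
testing_roots_loader = torch.utils.data.DataLoader(val_roots_instruct, batch_size=train_roots_configs["BATCH_SIZE"], collate_fn=collate_fn,shuffle=True)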

print(f"Max Steps: {len(train_roots_loader)}, Batch size: {8*train_roots_configs['BATCH_SIZE']}")
print(f"Val Size: {len(testing_roots_loader)}, Batch Size: {8*train_roots_configs['BATCH_SIZE']}")
train_roots_configs['STEPS']=len(train_roots_loader)
train_roots_configs['BATCH_DATA']=train_roots_configs['BATCH_SIZE']

Max Steps: 4000, Batch size: 16
Val Size: 1000, Batch Size: 16
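
With 4,000 steps per epoch, the `LEARNING_RATE`, `WARMUP_RATIO` and `SCHEDULAR` settings in the config dictionaries determine the learning-rate curve. Below is a plain-Python sketch of the linear-warmup-then-half-cosine shape that `get_cosine_schedule_with_warmup` is documented to produce with its default settings. `lr_at` is a hypothetical helper written for illustration, not the library's code.

```python
import math

LEARNING_RATE = 6e-5
TOTAL_STEPS = 4000                      # NUM_EPOCHS * STEPS // GRAD_ACCUMULATION_STEP
WARMUP_STEPS = int(TOTAL_STEPS * 0.01)  # WARMUP_RATIO = 0.01


def lr_at(step):
    """Learning rate after `step` scheduler updates (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return LEARNING_RATE * step / max(1, WARMUP_STEPS)
    # Half-cosine decay from the peak down to 0.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 20, 40, 2020, 4000):
    print(step, lr_at(step))
```

Under these settings the warmup lasts only 40 steps, so almost the entire run is spent on the cosine decay.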


In [18]:
ls=LoraConfig(
    r = 12, # Lora Rank should generally be smaller for smaller models
    target_modules = ['q_proj', 'down_proj', 'up_proj', 'o_proj', 'v_proj', 'gate_proj', 'k_proj'],
    lora_alpha = 16, #weight_scaling
    lora_dropout = 0.05, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    modules_to_save = ["lm_head", "embed_tokens"] ## if you use new chat formats or embedding tokens
)
model = get_peft_model(model, ls)
model.print_trainable_parameters()

trainable params: 533,790,720 || all params: 1,769,605,120 || trainable%: 30.1644
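
The 30% trainable share is far larger than LoRA adapters alone would produce, because `modules_to_save` makes full copies of `embed_tokens` and `lm_head` trainable. The arithmetic below reproduces the reported count, assuming the published Llama 3.2 1B dimensions (hidden size 2048, MLP width 8192, 16 decoder layers, 512-wide key/value projections, vocabulary of 128,256 tokens). These dimensions are not stated in this notebook and are assumptions of the sketch.

```python
R = 12                 # LoRA rank from LoraConfig
HIDDEN = 2048          # assumed hidden size
INTERMEDIATE = 8192    # assumed MLP width
KV_DIM = 512           # assumed key/value projection width
LAYERS = 16            # assumed number of decoder layers
VOCAB = 128_256        # assumed vocabulary size

# (in_features, out_features) of each targeted projection in one decoder layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
lora_params = LAYERS * sum(R * (i + o) for i, o in target_shapes.values())

# modules_to_save trains full copies of the embedding and output matrices.
saved_params = 2 * VOCAB * HIDDEN

print(lora_params, saved_params, lora_params + saved_params)
```

Under these assumptions the adapters account for roughly 8.5M parameters and the saved modules for roughly 525M, which together match the 533,790,720 reported by `print_trainable_parameters()`.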


## Training the Model

In [19]:
import torch.nn as nn
import wandb
__wandb__=train_deriv_configs['WANDB']
from transformers import get_linear_schedule_with_warmup,get_cosine_schedule_with_warmup
# from random import randrange
# from bitsandbytes.optim import AdamW8bit
# from torchdistx.optimizers import AnyPrecisionAdamW

val_step=0




def evaluate_loss(outputs,labels,pad_id=tokenizer.pad_token_id):
  epsilon=1e-8
  logits=outputs.logits
  logits = logits[..., :-1, :].contiguous()
  labels = labels[..., 1:].contiguous()
  log_probs = -nn.functional.log_softmax(logits, dim=-1)
  if labels.dim() == log_probs.dim() - 1:
    labels = labels.unsqueeze(-1)
  padding_mask = labels.eq(pad_id)
  labels = torch.clamp(labels, min=0)
  nll_loss = log_probs.gather(dim=-1, index=labels)
  smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.bfloat16)
  nll_loss.masked_fill_(padding_mask, 0.0)
  smoothed_loss.masked_fill_(padding_mask, 0.0)
  num_active_elements = padding_mask.numel() - padding_mask.long().sum()
  nll_loss = nll_loss.sum() / num_active_elements
  smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
  del labels,logits,padding_mask
  return (1-epsilon)*nll_loss + epsilon*smoothed_loss



def train(FLAGS, training_loader, testing_loader, device):


    ### Configuring Training
    global val_step
    update_params= filter(lambda p: p.requires_grad, model.parameters())
    num_iterations = int((FLAGS["NUM_EPOCHS"] * FLAGS['STEPS'] ) // FLAGS['GRAD_ACCUMULATION_STEP'])
    warmup_steps = int(num_iterations * FLAGS['WARMUP_RATIO'])

    if __wandb__:
        wandb.init(project=FLAGS['PROJECT'],config=FLAGS)
        wandb.define_metric("Validation_loss", step_metric="val_step")
        wandb.define_metric("Learning_rate",step_metric="train_step")
        wandb.define_metric("train_loss",step_metric="train_step")

    ### Optimizers

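    if (FLAGS['OPTIMIZER']).lower()=='adam':
        optimizer = Adam(update_params, eps=1e-8, lr=FLAGS['LEARNING_RATE'], betas=(0.9, 0.999),weight_decay=FLAGS['WEIGHT_DECAY'])
    else:
        # fail early instead of hitting a NameError on `optimizer` further down
        raise ValueError(f"Unsupported optimizer: {FLAGS['OPTIMIZER']}")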

    for param_group in optimizer.param_groups:
        if len(param_group["params"]) > 0:
            print(param_group["params"][0].device)
            break


    ### Schedulers

    if (FLAGS['SCHEDULAR']).lower()=='linear':
        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,num_iterations)
    else:
        scheduler = get_cosine_schedule_with_warmup(optimizer,warmup_steps,num_iterations)




    ### Training Loop
    val_step=0
    check=False #for brakes
    for epoch in range(1, FLAGS['NUM_EPOCHS'] + 1):
        if check:
            break
        model.train()
        print('Epoch {} train begin {}'.format(epoch, datetime.datetime.now()))
        for step, batch in enumerate(training_loader):
            input_ids, labels,attention_mask = batch["input_ids"].to(device),  batch["labels"].to(device),batch['attention_mask'].to(device)

            outputs = model(input_ids=input_ids,attention_mask=attention_mask)
            loss = evaluate_loss(outputs,labels)


            if (step + 1) % FLAGS['LOGGING_STEPS'] == 0:
                print(f'loss: {loss.detach().cpu().item()}, time: {datetime.datetime.now()}, step: {step+1}')
            if __wandb__:
                wandb.log({
                'Learning_rate': optimizer.param_groups[0]['lr'],
                'train_loss':  loss.detach().cpu().item(),
                'train_step': step + 1 + ((epoch-1) * FLAGS["STEPS"]),
                        })




            del input_ids , attention_mask
            loss.backward()
            del outputs,loss




            if (step+1) % FLAGS['GRAD_ACCUMULATION_STEP'] == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=FLAGS['MAX_GRAD_CLIP']*8)
                # step the optimizer first, then the scheduler (the order PyTorch expects)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()





            if (step+1)% FLAGS['VAL_STEPS'] == 0:
                end_index=FLAGS["VAL_BATCH"]
                model.eval()
                with torch.no_grad():
                    total_loss = 0
                    total_step = 0
                    for stepx, batchx in enumerate(testing_loader):
                        #check that the key 'input_ids' is in the batchx dict
                        if 'input_ids' not in batchx:
                            continue
                        input_ids = batchx["input_ids"].to(device)
                        labels = batchx["labels"].to(device)
                        attention_mask = batchx["attention_mask"].to(device)
                        outputs = model(input_ids=input_ids,attention_mask=attention_mask)
                        loss = evaluate_loss(outputs,labels)
                        total_loss += loss.item()
                        total_step +=1
                        print('----- Time -> {} ----- Validation Batch -> {} ----  Validation Loss -> {:.4f}'.format(datetime.datetime.now(), total_step , loss.item()))
                        if __wandb__:
                            val_step+=1
                            wandb.log({
                                'Validation_loss': loss.item(),
                                'val_step':val_step,
                                    })
                        if (stepx+1)%end_index==0:
                            break
                    model.train()
                    # avoid division by zero
                    if total_step==0:
                      average_loss=0
                    else:
                      average_loss=total_loss/total_step
                    print('----- Time -> {} ----- Validation Batch Size -> {} ----  Validation Loss -> {:.7f}'.format(datetime.datetime.now(), total_step , average_loss))

            if (step+1)% FLAGS['PAUSE_STEPS']==0:
                inp=input('want to continue training after {} steps'.format(step+1))
                check = bool("no" in inp.lower())
                if check:
                    break
                else:
                    pass
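
`evaluate_loss` shifts logits and labels by one position so that each token is predicted from its prefix, then averages the negative log-likelihood over non-padding targets. The label-smoothing term is weighted by `epsilon = 1e-8`, so it is negligible. A small pure-Python version of that core computation, using made-up logits and a hypothetical `PAD_ID`, might look like this:

```python
import math

PAD_ID = 0  # hypothetical padding id for this toy example


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]


def masked_next_token_nll(logits, labels, pad_id=PAD_ID):
    """Mean NLL of labels[t + 1] given logits[t], skipping padded targets."""
    total, count = 0.0, 0
    for position_logits, target in zip(logits[:-1], labels[1:]):
        if target == pad_id:
            continue
        total -= log_softmax(position_logits)[target]
        count += 1
    return total / count if count else 0.0


# Toy sequence: vocabulary of 3 tokens, last position is padding.
labels = [1, 2, 1, PAD_ID]
uniform = [0.0, 0.0, 0.0]
logits = [uniform, uniform, uniform, uniform]
loss = masked_next_token_nll(logits, labels)
print(loss)  # equals log(3) for a uniform predictor
```

A uniform predictor over a vocabulary of size V scores log(V), which gives a useful reference point when reading the loss values logged during training.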

In [20]:
train(train_deriv_configs, train_deriv_loader, testing_deriv_loader, device)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


cuda:0
Epoch 1 train begin 2024-11-21 03:17:48.757742
loss: 9.8125, time: 2024-11-21 03:17:53.087083, step: 1
loss: 9.8125, time: 2024-11-21 03:17:54.553588, step: 2
loss: 10.25, time: 2024-11-21 03:17:55.142462, step: 3
loss: 9.875, time: 2024-11-21 03:17:55.730463, step: 4
loss: 9.5, time: 2024-11-21 03:17:56.318939, step: 5
loss: 9.75, time: 2024-11-21 03:17:56.907013, step: 6
loss: 10.0625, time: 2024-11-21 03:17:57.494865, step: 7
loss: 9.75, time: 2024-11-21 03:17:58.083176, step: 8
loss: 9.6875, time: 2024-11-21 03:17:58.671798, step: 9
loss: 9.875, time: 2024-11-21 03:17:59.260365, step: 10
loss: 9.125, time: 2024-11-21 03:17:59.848726, step: 11
loss: 9.5, time: 2024-11-21 03:18:00.436865, step: 12
loss: 9.3125, time: 2024-11-21 03:18:01.025060, step: 13
loss: 9.1875, time: 2024-11-21 03:18:01.613343, step: 14
loss: 8.4375, time: 2024-11-21 03:18:02.201571, step: 15
loss: 8.8125, time: 2024-11-21 03:18:02.789796, step: 16
loss: 8.6875, time: 2024-11-21 03:18:03.378113, step: 17

## Saving the Model Trained for 1000 Steps to the Hugging Face Hub

In [21]:
import time
print('Loading the model on CPU')
START=time.time()
model = model.cpu()
print(f"Loaded model on cpu in {time.time()-START} seconds ")

Loading the model on CPU
Loaded model on cpu in 1.6359925270080566 seconds 


In [22]:
# saving the non-quantized adapter weights (uploaded as adapter_model.safetensors)
model.push_to_hub(
    SAVED_MODEL,
    tokenizer=tokenizer,
    safe_serialization=True,
    create_pr=True,
    max_shard_size="3GB",
)

tokenizer.push_to_hub(
    SAVED_MODEL,
)

adapter_model.safetensors:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

In [None]:
#saving the quantized model
#model.push_to_hub_gguf("Alexis-Az/Math-Problem-LlaMA-3.2-1B-GGUF", tokenizer, quantization_method = "q4_k_m")