<a href="https://colab.research.google.com/github/Bryan-Az/Mathematics-LLM/blob/training/%5BLLaMA%5D_Mathematics_Model_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training the 'Integration' Mathematics Problem Solving Model on a GPU Environment
This notebook was set up for a T4 GPU environment in Google Colab; the run logged below used an A100. The pre-trained foundation model is the publicly available unsloth/Llama-3.2-1B-Instruct, requiring authentication with Hugging Face.

## Imports and Installs

In [None]:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip install --upgrade --force-reinstall --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


In [2]:
#from transformers import AutoTokenizer
#from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

In [3]:
#!pip install -U bitsandbytes

In [4]:
import math
from dataclasses import dataclass, field
from typing import List, Optional
from collections import defaultdict
import torch
import torch.nn as nn
import re
from transformers import LlamaConfig
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [5]:
%%capture
!pip install datasets
from torch.utils.data import Dataset as TorchDataset
from datasets import load_dataset
from torch.optim import Adam

In [6]:
import pandas as pd
# import library to keep time using .now
import datetime

## Loading the Tokenizer of the Pre-trained Llama 3.2 1B Model
It's necessary to import the tokenizer of the model for loading the dataset.

In [7]:
MAX_INPUT=4096
MODEL = "unsloth/Llama-3.2-1B-Instruct" #You should be able to use 7B model with no changes! There should be enough HBM
SAVED_MODEL = "Alexis-Az/Math-Problem-LlaMA-3.2-1B"

In [8]:
#tokenizer = AutoTokenizer.from_pretrained(MODEL)
#if 'pad_token' not in tokenizer.special_tokens_map:
#  tokenizer.pad_token=tokenizer.eos_token
#print(f"Tokens :\n {tokenizer.special_tokens_map} \n\n")

## Loading the Pre-trained Model with LoRA Adapters using Unsloth
Adding LoRA adapters will allow us to fine-tune the model on our math datasets.
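As a rough illustration of what a LoRA adapter adds, the sketch below keeps a frozen weight matrix `W` untouched and adds a low-rank product `B @ A`, scaled by `alpha / r`. This is a pure-Python toy, not Unsloth's or PEFT's implementation; the tiny shapes, the example values, and the helper names `matmul` and `lora_effective_weight` are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A); W stays frozen, only A and B are trained."""
    scale = alpha / r
    delta = matmul(B, A)  # shape (d_out, d_in), rank at most r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy shapes: d_out = 2, d_in = 3, rank r = 1 (the notebook uses r = 12, alpha = 16).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]        # shape (r, d_in)
B_init = [[0.0], [0.0]]      # shape (d_out, r); B conventionally starts at zero
B_trained = [[1.0], [2.0]]   # hypothetical values after some training

W_start = lora_effective_weight(W, A, B_init, alpha=16, r=1)
W_after = lora_effective_weight(W, A, B_trained, alpha=16, r=1)
```

With `B` at zero the adapted weight equals the original `W`, so fine-tuning starts from the pre-trained behaviour and only drifts as `A` and `B` are updated.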

In [9]:
#set device
device= f'cuda:{torch.cuda.current_device()}'
device

'cuda:0'

In [10]:
from unsloth import is_bfloat16_supported
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(MODEL, max_seq_length=max_seq_length, dtype=None,load_in_4bit=True)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Loading the Dataset

In [11]:
class InstructionDataset(TorchDataset):
    def __init__(self, tokenizer, max_length=1024, dataset=None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        prompts = self.dataset[idx]
        text = ""
        for prompt in prompts:
            data = prompts[prompt]
            if prompt == 'Function':
                text += f"<|im_start|>user\n Can you help me solve this math problem? {data}<|im_end|>\n"
            if prompt == 'Operation':
                text += f"<|im_start|>user\n Can you help me solve this math problem with addition? {data}<|im_end|>\n"
            if prompt == 'Roots':
                text += f"<|im_start|>assistant\n Here's the answer to solve this root-based problem: {data}<|im_end|>"
            if prompt == 'Derivatives':
                text += f"<|im_start|>assistant\n Here's the answer to solve this derivative-based problem: {data}<|im_end|>"
            if prompt == 'Result':
                text += f"<|im_start|>assistant\n Here's the answer to solve this addition problem: {data}<|im_end|>"


        try:
            input_ids = self.tokenizer(text, add_special_tokens=True, max_length=self.max_length, truncation=True, padding="max_length", return_attention_mask=True, return_tensors="pt")
        except Exception as e:  # You can catch specific tokenizer exceptions if known
            print(f"Error tokenizing text at index {idx}: {e}")
            print(f"Problematic text: {text}")  # Print the text causing the issue
            # Handle the exception (e.g., skip the sample, replace with empty tokens)
            # Here, I'll skip the problematic sample:
            return None  # or raise the exception if desired


        if input_ids is None:
            return None
        return {
            "input_ids": input_ids["input_ids"].squeeze(0),
            "labels": input_ids["input_ids"].squeeze(0),
            "attention_mask":input_ids["attention_mask"].squeeze(0),
        }
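To see what text the dataset class feeds to the tokenizer, the string-building part of `__getitem__` can be pulled out on its own. The sketch below mirrors the templates above; the function name `build_prompt` and the example row are hypothetical and are not taken from the actual dataset.

```python
TEMPLATES = {
    "Function": "<|im_start|>user\n Can you help me solve this math problem? {}<|im_end|>\n",
    "Operation": "<|im_start|>user\n Can you help me solve this math problem with addition? {}<|im_end|>\n",
    "Roots": "<|im_start|>assistant\n Here's the answer to solve this root-based problem: {}<|im_end|>",
    "Derivatives": "<|im_start|>assistant\n Here's the answer to solve this derivative-based problem: {}<|im_end|>",
    "Result": "<|im_start|>assistant\n Here's the answer to solve this addition problem: {}<|im_end|>",
}


def build_prompt(row):
    """Concatenate the chat-style segments for every recognised column, in row order."""
    return "".join(
        TEMPLATES[column].format(value)
        for column, value in row.items()
        if column in TEMPLATES
    )


# Hypothetical row shaped like an entry of the 'roots' configuration.
example_row = {"Function": "x**2 - 4", "Roots": "[-2, 2]"}
prompt = build_prompt(example_row)
print(prompt)
```

Because `labels` is a copy of `input_ids`, the loss is computed over the whole string, including the user turn, not only over the assistant's answer.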

In [12]:
train_dataset="Alexis-Az/math_datasets"
# ~1/5 of the dataset is used for validation
train_data_derivs = load_dataset(train_dataset, name='derivatives', split='train[:8000]').shuffle()
val_derivs = (load_dataset(train_dataset, 'derivatives', split="train[-2000:]")).shuffle()

In [13]:
train_data_roots = load_dataset(train_dataset, 'roots', split='train[:8000]').shuffle()
val_roots = (load_dataset(train_dataset, 'roots', split="train[-2000:]")).shuffle()

In [14]:
train_data_adds = load_dataset(train_dataset, 'additions', split='train[:800000]').shuffle()
val_adds = (load_dataset(train_dataset, 'additions', split="train[-200000:]")).shuffle()

Since the additions dataset (our main objective) is 100x larger than the derivatives and roots datasets, max steps is set to 2,000 for additions and to 500 for each of the other two. Training therefore runs for 3,000 steps in total and takes roughly an hour and a half on the A100 GPU environment in Google Colab.
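The step budget can be checked with a few lines of arithmetic. This sketch assumes the batch size of 2 and the training-split sizes used in this notebook; the dictionary layout and variable names are only for illustration.

```python
BATCH_SIZE = 2  # 'BATCH_SIZE' in the training configs

runs = {
    "additions": {"train_rows": 800_000, "max_steps": 2000},
    "roots": {"train_rows": 8_000, "max_steps": 500},
    "derivatives": {"train_rows": 8_000, "max_steps": 500},
}

# Total optimizer steps across the three training runs.
total_steps = sum(run["max_steps"] for run in runs.values())

# Fraction of each training split that is actually seen before MAX_STEPS stops the run.
coverage = {
    name: run["max_steps"] * BATCH_SIZE / run["train_rows"]
    for name, run in runs.items()
}

print(total_steps)
for name, fraction in coverage.items():
    print(f"{name}: {fraction:.1%} of the training split")
```

Under these settings each run sees only a small slice of its split: 0.5% of the additions data and 12.5% of the roots and derivatives data.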

In [15]:
train_deriv_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':0, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': 500, # overrides num epochs
         'BATCH_SIZE': 2, #Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_derivs),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # default = 'adamw'  options->  ['adamw','SM3','came','adafactor','lion']
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_derivs,
         "TEST_DATASET":val_derivs,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }

In [16]:
train_roots_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':0, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': 500, # -1 trains on the entire data. Setting max steps overrides num epochs
         'BATCH_SIZE': 2, #Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_roots),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # default = 'adamw'  options->  ['adamw','SM3','came','adafactor','lion']
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_roots,
         "TEST_DATASET":val_roots,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }

In [17]:
# the additions data is 100x bigger than the other two datasets, so max steps is set higher (2000 vs. 500) but still covers only a small part of it
train_adds_configs = {'MAX_INPUT': MAX_INPUT,
         'LOGGING_STEPS': 1,
         'NUM_EPOCHS': 1,
         'PAUSE_STEPS':0, # asks to exit training after x steps #todo checkpoints
         'MAX_STEPS': 2000, # -1 trains on the entire data. Setting max steps overrides num epochs
         'BATCH_SIZE': 2, #Making batch_size lower than 8 will result in slower training, but will allow for larger models/context. Fortunately, we have 128GBs. Setting higher batch_size doesn't seem to improve time.
          'LEN_TRAIN_DATA': len(train_data_adds),
         'VAL_STEPS': 20,
         'VAL_BATCH': 5,
         'GRAD_ACCUMULATION_STEP':1,
         'MAX_GRAD_CLIP':1,
        'LEARNING_RATE':6e-5,
         'WARMUP_RATIO':0.01,
         'OPTIMIZER':'adam', # default = 'adamw'  options->  ['adamw','SM3','came','adafactor','lion']
         'SCHEDULAR':'cosine', # default= 'cosine'     options:-> ['linear','cosine']
         'WEIGHT_DECAY':0.1,
         'TRAIN_DATASET':train_data_adds,
         "TEST_DATASET":val_adds,
         'WANDB':True,
        'PROJECT':'Math-Model',
        }

In [18]:
train_data_derivs_instruct = InstructionDataset(tokenizer, dataset=train_data_derivs, max_length=train_deriv_configs['MAX_INPUT'])
val_derivs_instruct = InstructionDataset(tokenizer, dataset=val_derivs)

#collate fn to skip nones in the batch
def collate_fn(batch):
    batch = list(filter(lambda x: x is not None, batch))
    return torch.utils.data.dataloader.default_collate(batch)


train_deriv_loader = torch.utils.data.DataLoader(train_data_derivs_instruct, batch_size=train_deriv_configs["BATCH_SIZE"], collate_fn=collate_fn,shuffle=True)
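# validate on the tokenized InstructionDataset; the raw dataset has no 'input_ids', so every validation batch would be skipped
testing_deriv_loader = torch.utils.data.DataLoader(val_derivs_instruct, batch_size=train_deriv_configs["BATCH_SIZE"], collate_fn=collate_fn, shuffle=True)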

print(f"Max Steps: {len(train_deriv_loader)}, Batch size: {8*train_deriv_configs['BATCH_SIZE']}")
print(f"Val Size: {len(testing_deriv_loader)}, Batch Size: {8*train_deriv_configs['BATCH_SIZE']}")
train_deriv_configs['STEPS']=len(train_deriv_loader)
train_deriv_configs['BATCH_DATA']=train_deriv_configs['BATCH_SIZE']

Max Steps: 4000, Batch size: 16
Val Size: 1000, Batch Size: 16


In [19]:
train_data_roots_instruct = InstructionDataset(tokenizer, dataset=train_data_roots, max_length=train_roots_configs['MAX_INPUT'])
val_roots_instruct = InstructionDataset(tokenizer, dataset=val_roots)

train_roots_loader = torch.utils.data.DataLoader(train_data_roots_instruct, batch_size=train_roots_configs["BATCH_SIZE"],collate_fn=collate_fn, shuffle=True)
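# validate on the tokenized InstructionDataset; the raw dataset has no 'input_ids', so every validation batch would be skipped
testing_roots_loader = torch.utils.data.DataLoader(val_roots_instruct, batch_size=train_roots_configs["BATCH_SIZE"], collate_fn=collate_fn,shuffle=True)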

print(f"Max Steps: {len(train_roots_loader)}, Batch size: {8*train_roots_configs['BATCH_SIZE']}")
print(f"Val Size: {len(testing_roots_loader)}, Batch Size: {8*train_roots_configs['BATCH_SIZE']}")
train_roots_configs['STEPS']=len(train_roots_loader)
train_roots_configs['BATCH_DATA']=train_roots_configs['BATCH_SIZE']

Max Steps: 4000, Batch size: 16
Val Size: 1000, Batch Size: 16


In [20]:
train_data_adds_instruct = InstructionDataset(tokenizer, dataset=train_data_adds, max_length=train_adds_configs['MAX_INPUT'])
val_adds_instruct = InstructionDataset(tokenizer, dataset=val_adds)

train_adds_loader = torch.utils.data.DataLoader(train_data_adds_instruct, batch_size=train_adds_configs["BATCH_SIZE"],collate_fn=collate_fn, shuffle=True)
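# validate on the tokenized InstructionDataset; the raw dataset has no 'input_ids', so every validation batch would be skipped
testing_adds_loader = torch.utils.data.DataLoader(val_adds_instruct, batch_size=train_adds_configs["BATCH_SIZE"], collate_fn=collate_fn,shuffle=True)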

print(f"Max Steps: {len(train_adds_loader)}, Batch size: {8*train_adds_configs['BATCH_SIZE']}")
print(f"Val Size: {len(testing_adds_loader)}, Batch Size: {8*train_adds_configs['BATCH_SIZE']}")
train_adds_configs['STEPS']=len(train_adds_loader)
train_adds_configs['BATCH_DATA']=train_adds_configs['BATCH_SIZE']

Max Steps: 400000, Batch size: 16
Val Size: 100000, Batch Size: 16


In [21]:
ls=LoraConfig(
    r = 12, # Lora Rank should generally be smaller for smaller models
    target_modules = ['q_proj', 'down_proj', 'up_proj', 'o_proj', 'v_proj', 'gate_proj', 'k_proj'],
    lora_alpha = 16, #weight_scaling
    lora_dropout = 0.05, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    modules_to_save = ["lm_head", "embed_tokens"] ## if you use new chat formats or embedding tokens
)
model = get_peft_model(model, ls)
model.print_trainable_parameters()

trainable params: 533,790,720 || all params: 1,769,605,120 || trainable%: 30.1644
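The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual Llama-3.2-1B dimensions (hidden size 2048, MLP size 8192, key/value projection width 512, 16 layers, vocabulary 128,256); these numbers are not stated in this notebook and should be checked against the model's `config.json`. Each LoRA adapter on a `d_in x d_out` projection adds `r * (d_in + d_out)` parameters, and `modules_to_save` adds full trainable copies of `lm_head` and `embed_tokens`.

```python
R = 12  # LoRA rank from the LoraConfig above

# Assumed Llama-3.2-1B dimensions (not taken from this notebook).
HIDDEN, INTERMEDIATE, KV_DIM = 2048, 8192, 512
LAYERS, VOCAB = 16, 128_256

# (d_in, d_out) of every projection listed in target_modules.
projection_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A is (r, d_in) and B is (d_out, r), so each adapter has r * (d_in + d_out) weights.
lora_per_layer = sum(R * (d_in + d_out) for d_in, d_out in projection_shapes.values())
lora_total = lora_per_layer * LAYERS

# modules_to_save: one full trainable copy each of lm_head and embed_tokens.
saved_modules = 2 * VOCAB * HIDDEN

trainable = lora_total + saved_modules
print(f"LoRA adapters:   {lora_total:,}")
print(f"modules_to_save: {saved_modules:,}")
print(f"trainable total: {trainable:,}")
```

Under these assumptions the total matches the 533,790,720 reported by `print_trainable_parameters()`, and about 98% of it comes from `modules_to_save` rather than from the LoRA matrices, which is why the trainable share is as high as 30%.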


## Training the Model

In [25]:
import torch.nn as nn
import wandb
__wandb__=train_deriv_configs['WANDB']
from transformers import get_linear_schedule_with_warmup,get_cosine_schedule_with_warmup
# from random import randrange
# from bitsandbytes.optim import AdamW8bit
# from torchdistx.optimizers import AnyPrecisionAdamW

val_step=0




def evaluate_loss(outputs,labels,pad_id=tokenizer.pad_token_id):
  epsilon=1e-8
  logits=outputs.logits
  logits = logits[..., :-1, :].contiguous()
  labels = labels[..., 1:].contiguous()
  log_probs = -nn.functional.log_softmax(logits, dim=-1)
  if labels.dim() == log_probs.dim() - 1:
    labels = labels.unsqueeze(-1)
  padding_mask = labels.eq(pad_id)
  labels = torch.clamp(labels, min=0)
  nll_loss = log_probs.gather(dim=-1, index=labels)
  smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.bfloat16)
  nll_loss.masked_fill_(padding_mask, 0.0)
  smoothed_loss.masked_fill_(padding_mask, 0.0)
  num_active_elements = padding_mask.numel() - padding_mask.long().sum()
  nll_loss = nll_loss.sum() / num_active_elements
  smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
  del labels,logits,padding_mask
  return (1-epsilon)*nll_loss + epsilon*smoothed_loss



def train(FLAGS, training_loader, testing_loader, device):


    ### Configuring Training
    global val_step
    update_params= filter(lambda p: p.requires_grad, model.parameters())
    num_iterations = int((FLAGS["NUM_EPOCHS"] * FLAGS['STEPS'] ) // FLAGS['GRAD_ACCUMULATION_STEP'])
    warmup_steps = int(num_iterations * FLAGS['WARMUP_RATIO'])

    if __wandb__:
        wandb.init(project=FLAGS['PROJECT'],config=FLAGS)
        wandb.define_metric("Validation_loss", step_metric="val_step")
        wandb.define_metric("Learning_rate",step_metric="train_step")
        wandb.define_metric("train_loss",step_metric="train_step")

    ### Optimizers

    if (FLAGS['OPTIMIZER']).lower()=='adam':
        optimizer = Adam(update_params, eps=1e-8, lr=FLAGS['LEARNING_RATE'], betas=(0.9, 0.999),weight_decay=FLAGS['WEIGHT_DECAY'])

    for param_group in optimizer.param_groups:
        if len(param_group["params"]) > 0:
            print(param_group["params"][0].device)
            break


    ### Schedulers

    if (FLAGS['SCHEDULAR']).lower()=='linear':
        scheduler = get_linear_schedule_with_warmup(optimizer,warmup_steps,num_iterations)
    else:
        scheduler = get_cosine_schedule_with_warmup(optimizer,warmup_steps,num_iterations)




    ### Training Loop
    val_step=0
    check=False # set to True to break out of the epoch loop
    for epoch in range(1, FLAGS['NUM_EPOCHS'] + 1):
        if check:
            break
        model.train()
        print('Epoch {} train begin {}'.format(epoch, datetime.datetime.now()))
        for step, batch in enumerate(training_loader):
            input_ids, labels,attention_mask = batch["input_ids"].to(device),  batch["labels"].to(device),batch['attention_mask'].to(device)

            outputs = model(input_ids=input_ids,attention_mask=attention_mask)
            loss = evaluate_loss(outputs,labels)


            if (step + 1) % FLAGS['LOGGING_STEPS'] == 0:
                print(f'loss: {loss.detach().cpu().item()}, time: {datetime.datetime.now()}, step: {step+1}')
            if __wandb__:
                wandb.log({
                'Learning_rate': optimizer.param_groups[0]['lr'],
                'train_loss':  loss.detach().cpu().item(),
                'train_step': step + 1 + ((epoch-1) * FLAGS["STEPS"]),
                        })




            del input_ids , attention_mask
            loss.backward()
            del outputs,loss




            if (step+1) % FLAGS['GRAD_ACCUMULATION_STEP'] == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=FLAGS['MAX_GRAD_CLIP']*8)
                # step the optimizer before the scheduler, which is the order PyTorch expects
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()





            if (step+1)% FLAGS['VAL_STEPS'] == 0:
                end_index=FLAGS["VAL_BATCH"]
                model.eval()
                with torch.no_grad():
                    total_loss = 0
                    total_step = 0
                    for stepx, batchx in enumerate(testing_loader):
                        #check that the key 'input_ids' is in the batchx dict
                        if 'input_ids' not in batchx:
                            continue
                        input_ids = batchx["input_ids"].to(device)
                        labels = batchx["labels"].to(device)
                        attention_mask = batchx["attention_mask"].to(device)
                        outputs = model(input_ids=input_ids,attention_mask=attention_mask)
                        loss = evaluate_loss(outputs,labels)
                        total_loss += loss.item()
                        total_step +=1
                        print('----- Time -> {} ----- Validation Batch -> {} ----  Validation Loss -> {:.4f}'.format(datetime.datetime.now(), total_step , loss.item()))
                        if __wandb__:
                            val_step+=1
                            wandb.log({
                                'Validation_loss': loss.item(),
                                'val_step':val_step,
                                    })
                        if (stepx+1)%end_index==0:
                            break
                    model.train()
                    # avoid division by zero when no validation batch was evaluated
                    if total_step==0:
                      average_loss=0
                    else:
                      average_loss=total_loss/total_step
                    print('----- Time -> {} ----- Validation Batch Size -> {} ----  Validation Loss -> {:.7f}'.format(datetime.datetime.now(), total_step , average_loss))

            #uncomment if want to add checkpointing
            #if (step+1)% FLAGS['PAUSE_STEPS']==0:
            #    inp=input('want to continue training after {} steps'.format(step+1))
            #    check = bool("no" in inp.lower())
            #    if check:
            #        break
            #    else:
            #        pass

            # stop at number of MAX_STEPS
            if (step+1) == FLAGS['MAX_STEPS']:
                break

    if __wandb__:
        wandb.finish()
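Note that `train` sizes the schedule from `FLAGS['STEPS']` (the full length of the data loader), not from `MAX_STEPS`. The sketch below reimplements, in plain Python, the warmup-then-cosine formula that `get_cosine_schedule_with_warmup` is understood to use with its default half cycle; treat the formula and the helper name `lr_at` as assumptions rather than the library's exact code.

```python
import math


def lr_at(step, total_steps, base_lr=6e-5, warmup_ratio=0.01):
    """Learning rate after `step` scheduler steps: linear warmup, then cosine decay."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Additions: the loader has 400,000 batches, so warmup alone lasts 4,000 steps.
lr_additions = lr_at(2000, total_steps=400_000)

# Roots and derivatives: the loader has 4,000 batches, so warmup lasts 40 steps.
lr_roots = lr_at(500, total_steps=4_000)

print(f"additions, step 2000: {lr_additions:.1e}")
print(f"roots, step 500:      {lr_roots:.1e}")
```

This agrees with the learning rates recorded in the run summaries in this notebook: the additions run stops at step 2,000 while still halfway through warmup (about 3e-05), and the 500-step runs stop near the peak of 6e-05 before the cosine decay has had much effect. Passing the intended number of training steps to the scheduler instead would let the schedule complete.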


In [26]:
train(train_adds_configs, train_adds_loader, testing_adds_loader, device)

Run history:
Learning_rate,▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▇▇▇████
train_loss,██▇▇▆▅▄▄▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇█

Run summary:
Learning_rate,4e-05
train_loss,0.39258
train_step,2360.0


cuda:0
Epoch 1 train begin 2024-12-03 23:31:18.697146
loss: 0.390625, time: 2024-12-03 23:31:19.022726, step: 1
loss: 0.388671875, time: 2024-12-03 23:31:19.615637, step: 2
loss: 0.404296875, time: 2024-12-03 23:31:20.206165, step: 3
loss: 0.423828125, time: 2024-12-03 23:31:20.796784, step: 4
loss: 0.416015625, time: 2024-12-03 23:31:21.387397, step: 5
loss: 0.396484375, time: 2024-12-03 23:31:21.977794, step: 6
loss: 0.408203125, time: 2024-12-03 23:31:22.567942, step: 7
loss: 0.46484375, time: 2024-12-03 23:31:23.158730, step: 8
loss: 0.380859375, time: 2024-12-03 23:31:23.749491, step: 9
loss: 0.375, time: 2024-12-03 23:31:24.340133, step: 10
loss: 0.400390625, time: 2024-12-03 23:31:24.930283, step: 11
loss: 0.40625, time: 2024-12-03 23:31:25.521231, step: 12
loss: 0.404296875, time: 2024-12-03 23:31:26.111397, step: 13
loss: 0.40625, time: 2024-12-03 23:31:26.701678, step: 14
loss: 0.40625, time: 2024-12-03 23:31:27.292442, step: 15
loss: 0.388671875, time: 2024-12-03 23:31:27.88

Run history:
Learning_rate,▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇█
train_loss,▄▄▃▅█▄▄▄▆▃▃▃▃▄▃▅▂▃▅▄▂▂▃▃▆▃▃▃▃▃▃▁▃▃▃▂▄▃▃▄
train_step,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇█████

Run summary:
Learning_rate,3e-05
train_loss,0.38672
train_step,2000.0


In [47]:
train(train_roots_configs, train_roots_loader, testing_roots_loader, device)

cuda:0
Epoch 1 train begin 2024-12-04 00:46:10.060686
loss: 8.4375, time: 2024-12-04 00:46:10.344872, step: 1
loss: 8.5, time: 2024-12-04 00:46:10.938250, step: 2
loss: 9.3125, time: 2024-12-04 00:46:11.528453, step: 3
loss: 7.84375, time: 2024-12-04 00:46:12.118468, step: 4
loss: 8.4375, time: 2024-12-04 00:46:12.709084, step: 5
loss: 8.5, time: 2024-12-04 00:46:13.299819, step: 6
loss: 7.1875, time: 2024-12-04 00:46:13.889733, step: 7
loss: 7.09375, time: 2024-12-04 00:46:14.479630, step: 8
loss: 5.125, time: 2024-12-04 00:46:15.069995, step: 9
loss: 7.09375, time: 2024-12-04 00:46:15.660504, step: 10
loss: 6.90625, time: 2024-12-04 00:46:16.251059, step: 11
loss: 6.53125, time: 2024-12-04 00:46:16.842007, step: 12
loss: 5.90625, time: 2024-12-04 00:46:17.432285, step: 13
loss: 6.15625, time: 2024-12-04 00:46:18.022238, step: 14
loss: 6.21875, time: 2024-12-04 00:46:18.612472, step: 15
loss: 4.21875, time: 2024-12-04 00:46:19.202327, step: 16
loss: 5.46875, time: 2024-12-04 00:46:19.

Run history:
Learning_rate,▁▂▅▆████████████████████████████████████
train_loss,█▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train_step,▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▆▆▆▆▆▆▆▆▆▆▆▆▇▇▇███

Run summary:
Learning_rate,6e-05
train_loss,0.53516
train_step,500.0


In [48]:
train(train_deriv_configs, train_deriv_loader, testing_deriv_loader, device)

cuda:0
Epoch 1 train begin 2024-12-04 00:51:11.448253
loss: 1.2890625, time: 2024-12-04 00:51:11.731619, step: 1
loss: 2.375, time: 2024-12-04 00:51:12.325095, step: 2
loss: 2.421875, time: 2024-12-04 00:51:12.914929, step: 3
loss: 1.6484375, time: 2024-12-04 00:51:13.505497, step: 4
loss: 1.6953125, time: 2024-12-04 00:51:14.095896, step: 5
loss: 2.03125, time: 2024-12-04 00:51:14.685075, step: 6
loss: 1.46875, time: 2024-12-04 00:51:15.274935, step: 7
loss: 1.921875, time: 2024-12-04 00:51:15.865600, step: 8
loss: 1.65625, time: 2024-12-04 00:51:16.455467, step: 9
loss: 1.7265625, time: 2024-12-04 00:51:17.045163, step: 10
loss: 1.28125, time: 2024-12-04 00:51:17.635686, step: 11
loss: 1.2109375, time: 2024-12-04 00:51:18.225854, step: 12
loss: 1.09375, time: 2024-12-04 00:51:18.815728, step: 13
loss: 0.95703125, time: 2024-12-04 00:51:19.405769, step: 14
loss: 0.69921875, time: 2024-12-04 00:51:19.995751, step: 15
loss: 0.9765625, time: 2024-12-04 00:51:20.585529, step: 16
loss: 1.3

Run history:
Learning_rate,▁▂▄▅▆███████████████████████████████████
train_loss,█▅▃▃▄▁▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▂▂▂▁▂▂▂▂▁▂▁▂▂▁▁▁▂▁▁
train_step,▁▁▁▂▂▂▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███████

Run summary:
Learning_rate,6e-05
train_loss,0.37695
train_step,500.0


## Saving the Model Trained for 3,000 Steps to Hugging Face

In [49]:
# push the trained PEFT adapter weights (adapter_model.safetensors), not a merged or quantized full model
model.push_to_hub(
    SAVED_MODEL,
    tokenizer=tokenizer,
    safe_serialization=True,
    create_pr=True,
    max_shard_size="3GB",
)

tokenizer.push_to_hub(
    SAVED_MODEL,
)

README.md:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
