<a href="https://colab.research.google.com/github/Bryan-Az/Mathematics-LLM/blob/model_eval/%5BEvaluation%2C_GGUF%2C_Quantization%5D_Mathematics_LLM_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating the Math Fine-tuned 'Education & Math' Pre-trained HuggingFaceTB SmolLM2-1.7B-Instruct Model Against the Fine-tuned Base Model

This notebook runs in a GPU environment on Google Colab. The pre-trained foundation model is pulled from a repository on the Hugging Face Hub. The model backbone was originally hosted by HuggingFaceTB; we fine-tuned it on our own dataset of math problems and uploaded the result to our own public model repo. Both fine-tuned LLaMA models (the one built on the base model and the one built on the pre-trained model) have been quantized to GGUF for use with llama.cpp, which makes inference less memory-intensive.
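
As a rough illustration of why quantization lowers memory use, weight storage can be approximated as parameters × bits per weight / 8. The sketch below is back-of-the-envelope arithmetic only: the helper `estimate_weight_bytes` is hypothetical, and the 4.5 bits-per-weight figure is an assumed average for a 4-bit scheme, not a value measured on these models.

```python
def estimate_weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone: parameters * bits / 8."""
    return n_params * bits_per_weight / 8

# Hypothetical illustration: a 1.7B-parameter model at 16 bits vs. roughly 4.5 bits per weight.
# The 4.5 figure is an assumption for illustration, not a measured value.
n_params = 1_700_000_000
fp16_gb = estimate_weight_bytes(n_params, 16) / 1e9
q4_gb = estimate_weight_bytes(n_params, 4.5) / 1e9
print(f"16-bit: {fp16_gb:.2f} GB, ~4.5-bit: {q4_gb:.2f} GB")
```

The estimate ignores activations, the KV cache, and file metadata, so actual memory use during inference will be higher.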

## Imports and Installs

In [4]:
# Log in to Hugging Face to use gated LLaMA foundation models.
# The token is read from Colab secrets first so that $hf_token is defined when the shell command runs.
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
!huggingface-cli login --token $hf_token


In [5]:
%%capture
!pip install llama-cpp-python
!pip install datasets

In [6]:
from google.colab import userdata
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from torch.utils.data import Dataset as TorchDataset
from datasets import load_dataset
import pandas as pd
import re
import os
hf_token = userdata.get('HF_TOKEN')

## Loading the Quantized Models

In [7]:
finetuned_pretrained_model_repo = 'Alexis-Az/Math-Problem-LlaMA-3.2-1.7B-GGUF'
finetuned_base_model_repo = 'Alexis-Az/Math-Problem-LlaMA-3.2-1B-GGUF'
filename = 'unsloth.Q4_K_M.gguf'
filename_ftpt = 'unsloth_ftpt.Q4_K_M.gguf'
filename_ftb = 'unsloth_ftb.Q4_K_M.gguf'
max_seq_length = 4096

In [8]:
# Download the fine-tuned base model and rename it so the next download does not overwrite it
temp_path = hf_hub_download(repo_id=finetuned_base_model_repo, filename=filename, local_dir='.')
os.rename(temp_path, filename_ftb)

# Download the fine-tuned pre-trained model (same filename in its repo) and rename it as well
temp_path = hf_hub_download(repo_id=finetuned_pretrained_model_repo, filename=filename, local_dir='.')
os.rename(temp_path, filename_ftpt)

unsloth.Q4_K_M.gguf:   0%|          | 0.00/955M [00:00<?, ?B/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

In [9]:
%%capture
# llama-cpp-python sets the context window with n_ctx; max_seq_length is not a Llama() parameter
finetuned_base_model = Llama(model_path=filename_ftb, n_ctx=max_seq_length, verbose=False)

In [10]:
%%capture
# llama-cpp-python sets the context window with n_ctx; max_seq_length is not a Llama() parameter
finetuned_pretrained_model = Llama(model_path=filename_ftpt, n_ctx=max_seq_length, verbose=False)

## Loading the Math Related Evaluation Datasets

In [11]:
integration_dataset="Alexis-Az/math_datasets"

### Addition Data

In [12]:
val_additions = (load_dataset(integration_dataset, 'additions', split='test')).shuffle()

README.md:   0%|          | 0.00/3.18k [00:00<?, ?B/s]

addition_operations_train.csv:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

additions/addition_operations_eval.csv:   0%|          | 0.00/2.86M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/801600 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/200401 [00:00<?, ? examples/s]

In [13]:
val_additions

Dataset({
    features: ['Operation', 'Result'],
    num_rows: 200401
})

### Roots Data

In [14]:
val_roots = (load_dataset(integration_dataset, 'roots', split="train[-2000:]")).shuffle() # the last 2000 rows of the train split were held out for eval

roots/Roots.csv:   0%|          | 0.00/5.31M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [15]:
val_roots

Dataset({
    features: ['Function', 'Roots'],
    num_rows: 2000
})

### Derivatives Data

In [16]:
val_derivs = (load_dataset(integration_dataset, 'derivatives', split="train[-2000:]")).shuffle() # the last 2000 rows of the train split were held out for eval

derivatives/Derivatives.csv:   0%|          | 0.00/1.56M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [17]:
val_derivs

Dataset({
    features: ['Function', 'Derivative'],
    num_rows: 2000
})

## Evaluating the Models on the Eval Datasets

In [18]:
num_problems = 50

In [24]:
def solve_math_problem(example, llm, task):
    # Column that holds the problem statement for each task
    problem_columns = {'Additions': 'Operation', 'Roots': 'Function', 'Derivatives': 'Function'}
    if task not in problem_columns:
        raise ValueError(f"Unknown task: {task}")

    # The same instruction is used for every task; only the problem text changes
    response = llm(
        f"Solve this math problem, only print the final answer, do not provide steps: {example[problem_columns[task]]}",
        max_tokens=50
    )

    # Keep only the generated text of the first completion
    solution = response['choices'][0]['text'].strip()
    return solution

def extract_number(text):
    match = re.search(r'\d+', text)
    if match:
        # Always return as string to avoid overflow
        return match.group()
    else:
        return None

def process_example(example, llm, subset, model_name):
    # Column that holds the reference answer for each subset
    answer_columns = {'Additions': 'Result', 'Roots': 'Roots', 'Derivatives': 'Derivative'}
    assert subset in answer_columns
    assert model_name in ['BFT', 'PTFT']

    # Prefix per-model columns so that a second model's run does not overwrite the first
    raw_result = solve_math_problem(example, llm, subset)
    example[model_name + '_Raw_Result'] = raw_result

    # Extract the first run of digits from the raw output (kept as a string to avoid overflow)
    extracted_result = extract_number(raw_result)
    example[model_name + '_Model_Result_' + subset] = extracted_result

    # Compare as strings against the reference answer; a missing extraction counts as incorrect
    example[model_name + '_Correct'] = (
        str(example[answer_columns[subset]]) == str(extracted_result) if extracted_result else False
    )

    return example
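
The `extract_number` helper above returns the first run of digits, so it drops signs and decimal parts and picks the first operand if a model echoes the problem before answering. A minimal sketch of a stricter extractor follows. The name `extract_last_number` is hypothetical, and taking the last number in the output is an assumption about how the models format their answers, not something verified here.

```python
import re

# Matches an optionally signed integer or decimal, e.g. "-12", "3.5", "+7"
_NUMBER_RE = re.compile(r'[-+]?\d+(?:\.\d+)?')

def extract_last_number(text):
    """Return the last signed number in text as a string, or None if there is none."""
    matches = _NUMBER_RE.findall(text)
    return matches[-1] if matches else None

# Hypothetical model outputs, for illustration only
for sample in ["12 + 30 = 42", "The answer is -7.", "x = 3.5", "no digits"]:
    print(repr(sample), '->', extract_last_number(sample))
```

Like the original helper, this returns a string, so large integers are never converted and cannot overflow. It is still only suited to numeric answers; symbolic answers such as derivatives would need a different comparison.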

In [20]:
# Sample num_problems (50) rows from each Hugging Face dataset
additions_eval_subset = val_additions.shuffle().select(range(num_problems))

In [21]:
roots_eval_subset = val_roots.shuffle().select(range(num_problems))

In [22]:
derivatives_eval_subset = val_derivs.shuffle().select(range(num_problems))
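
Because `shuffle()` is called without a seed, each run evaluates a different set of 50 problems. One way to make the sample repeatable is to fix the random seed. The sketch below does this in plain Python with a hypothetical helper, `sample_indices`, and uses the row counts shown above (2000 held-out rows, 50 problems). Passing a `seed` argument to `Dataset.shuffle` should have a similar effect.

```python
import random

def sample_indices(num_rows, num_samples, seed=0):
    """Pick num_samples distinct row indices reproducibly from range(num_rows)."""
    rng = random.Random(seed)
    return rng.sample(range(num_rows), num_samples)

# Hypothetical usage with the sizes seen above (2000 held-out rows, 50 problems)
indices = sample_indices(2000, 50, seed=0)
print(len(indices), len(set(indices)))

# The indices could then be passed to a Hugging Face Dataset, e.g. val_roots.select(indices)
```

Using the same seed for every run keeps the two models' scores comparable across sessions.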

### The Base Fine-tuned 1B Model

In [25]:
# Each eval subset is a Hugging Face Dataset, so .map() applies process_example to every row
additions_eval_subset = additions_eval_subset.map(lambda example: process_example(example, finetuned_base_model, 'Additions', 'BFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [26]:
roots_eval_subset = roots_eval_subset.map(lambda example: process_example(example, finetuned_base_model, 'Roots', 'BFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [27]:
derivatives_eval_subset = derivatives_eval_subset.map(lambda example: process_example(example, finetuned_base_model, 'Derivatives', 'BFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

### The Pre-trained & Fine-tuned 1.7B Model

In [28]:
additions_eval_subset = additions_eval_subset.map(lambda example: process_example(example, finetuned_pretrained_model, 'Additions', 'PTFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [31]:
roots_eval_subset = roots_eval_subset.map(lambda example: process_example(example, finetuned_pretrained_model, 'Roots', 'PTFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [32]:
derivatives_eval_subset = derivatives_eval_subset.map(lambda example: process_example(example, finetuned_pretrained_model, 'Derivatives', 'PTFT'))

Map:   0%|          | 0/50 [00:00<?, ? examples/s]
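
With both models evaluated, a per-model accuracy can be computed from the `<model>_Model_Result_<subset>` columns written by `process_example`. The following is a minimal sketch: the helper `accuracy` is hypothetical, and the toy rows are made-up stand-ins for `additions_eval_subset`, not actual results. Iterating over a Hugging Face `Dataset` yields one dict per row, so the same function should accept the real subsets.

```python
def accuracy(rows, prediction_column, answer_column):
    """Fraction of rows whose prediction equals the reference answer, compared as strings."""
    rows = list(rows)
    if not rows:
        return 0.0
    correct = sum(
        1 for row in rows
        if row[prediction_column] is not None
        and str(row[prediction_column]) == str(row[answer_column])
    )
    return correct / len(rows)

# Toy rows standing in for additions_eval_subset (values are made up for illustration)
toy_rows = [
    {'Result': 42, 'BFT_Model_Result_Additions': '42', 'PTFT_Model_Result_Additions': '42'},
    {'Result': 7, 'BFT_Model_Result_Additions': None, 'PTFT_Model_Result_Additions': '7'},
    {'Result': 15, 'BFT_Model_Result_Additions': '51', 'PTFT_Model_Result_Additions': '15'},
    {'Result': 9, 'BFT_Model_Result_Additions': '9', 'PTFT_Model_Result_Additions': '8'},
]
for model_name in ['BFT', 'PTFT']:
    print(model_name, accuracy(toy_rows, f'{model_name}_Model_Result_Additions', 'Result'))
```

With only 50 problems per subset, small differences between the two models should be read cautiously.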

## Saving the Evaluations to a HuggingFace Dataset

In [None]:
evaluations_repo = 'Alexis-Az/Math-LLM-Evaluations'
additions_eval_subset.push_to_hub(evaluations_repo, 'Additions', split='test')

In [36]:
roots_eval_subset.push_to_hub(evaluations_repo, 'Roots', split='test')
derivatives_eval_subset.push_to_hub(evaluations_repo, 'Derivatives', split = 'test')

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/664 [00:00<?, ?B/s]

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/Alexis-Az/Math-LLM-Evaluations/commit/c1c2ba1240aa3c0cf15fc50f1c9e9498272b1ddb', commit_message='Upload dataset', commit_description='', oid='c1c2ba1240aa3c0cf15fc50f1c9e9498272b1ddb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/Alexis-Az/Math-LLM-Evaluations', endpoint='https://huggingface.co', repo_type='dataset', repo_id='Alexis-Az/Math-LLM-Evaluations'), pr_revision=None, pr_num=None)