## 3 Evaluating locally deployed models

### 3.1 Load the (quantized) model onto a single GPU

In [1]:
!pip install flash-attn

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
import accelerate, bitsandbytes
import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from transformers import pipeline

os.environ["BNB_CUDA_VERSION"]="125"

## choose one of the local models.  Let's use a small one for faster evaluation.
model_path = '/ssdshare/share/Phi-3-mini-128k-instruct/'

# here is how you load a local model using the Hugging Face API
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True) 


This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64




Verify that the model is loaded to GPU (look at the memory utilization).

In [3]:
# let's check the GPU memory utilization after loading the model
!nvidia-smi

Sat May  3 11:27:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:35:00.0 Off |                  N/A |
| 32%   31C    P0            108W /  350W |    7553MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### 3.2 Generate responses locally

In [4]:
def chat_resp(model, tokenizer, question_list):
    # here is how you use the pipeline to generate local responses
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 1024,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }

    # Note that you send in a list of questions (faster than one call per question);
    # however, if you send too many questions at once, it will run out of memory.
    output = pipe(question_list, **generation_args)
    return output

def chat_resp_batched(model, tokenizer, question_list, batch_size=4):
    # Split a large question list into batches of the specified size, to avoid running out of memory
    batches = [question_list[i:i + batch_size] for i in range(0, len(question_list), batch_size)]
    all_responses = []
    for batch in batches:
        print(f"processing batch: {batch}")
        responses = chat_resp(model, tokenizer, batch)
        all_responses.extend(responses)
    return all_responses

def gsm8k_prompt(question):
    # add system prompt to the question
    chat = [
        {"role": "system", "content": r"""Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value in the form "ANSWER: <single exact numerical value>" on the last line. """},
        {"role": "user", "content": "Question: " + question},
    ]
    return chat
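
To see what the slicing in `chat_resp_batched` does without touching the model, here is a small sketch on a toy list. The helper name `split_into_batches` is made up for this illustration; it only repeats the list comprehension used in the function above.

```python
def split_into_batches(items, batch_size=4):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Seven toy "questions", split into batches of three.
toy_questions = [f"q{i}" for i in range(7)]
batches = split_into_batches(toy_questions, batch_size=3)
print(batches)
# [['q0', 'q1', 'q2'], ['q3', 'q4', 'q5'], ['q6']]
```

The last batch may be shorter than `batch_size`, and a `batch_size` larger than the list simply yields a single batch.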

In [5]:
## Test the model with a sample question
p = gsm8k_prompt("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?")
p = [p]  # remember to send in a list of questions

chat_resp(model, tokenizer, p)  # p is the list of questions


Device set to use cuda:0
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


[[{'generated_text': ' Step 1: Determine the number of clips Natalia sold in May.\nSince Natalia sold half as many clips in May as she did in April, we need to find half of the number of clips she sold in April.\n\nStep 2: Calculate half the number of clips sold in April.\nApril sales = 48 clips\nHalf of April sales = 48 clips / 2\nHalf of April sales = 24 clips\n\nStep 3: Find the total number of clips sold in April and May.\nNow we add the number of clips sold in April to the number of clips sold in May to get the total.\n\nApril sales = 48 clips\nMay sales = 24 clips\nTotal sales = April sales + May sales\nTotal sales = 48 clips + 24 clips\n\nStep 4: Calculate the total number of clips sold.\nTotal sales = 72 clips\n\nStep 5: Provide the final answer.\nANSWER: 72'}]]

### 3.3 Prepare the evaluation datasets

In [6]:
# add proxy to access huggingface ...
os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [7]:
from datasets import load_dataset
dataset = load_dataset("gsm8k", "main")  # read directly from huggingface

# if you want to use a local dataset, you can use the following code
# from datasets import load_dataset, load_from_disk
# dataset = load_from_disk("/ssdshare/share/gsm8k")

# to save time, we only use a small subset
subset = dataset['test'][5:12]
questions = subset['question']
answers = subset['answer']

dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

In [8]:
# We only want the numeric answers from the dataset for evaluation (maybe a bad choice?)

def get_exact_answer(x):
    # GSM8K reference answers end with "#### <number>"; skip the four '#' characters and the space
    i = x.index('####')
    return x[i+5:].strip('\n')

num_answers = list(map(get_exact_answer, answers))
print(num_answers)


['64', '260', '160', '45', '460', '366', '694']


In [9]:
# This is a very tentative and fragile way to find the exact answer; consider fixing it.

import re
def get_numbers(s):
    ans = re.findall(r'ANSWER:\s*([^)]+)', s)
    if ans:
        cleaned_ans = re.findall(r'[+-]?\d+\.?\d*', ans[-1])
        if cleaned_ans:
            return cleaned_ans[-1]
    return None
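
As the comment says, `get_numbers` is fragile. For example, on `"ANSWER: 72."` the inner pattern returns `"72."` with the trailing dot, and on `"ANSWER: 1,000"` it returns `"000"`. Below is one possible stricter variant, offered as a sketch rather than a definitive fix. The name `extract_final_answer` is hypothetical, and the rules it applies are assumptions about what the model tends to write: use the last `ANSWER:` marker, look only at the first non-empty line after it, prefer what follows the last `=`, and drop thousands separators.

```python
import re

# An optionally signed number with optional thousands separators and decimals.
_NUMBER = re.compile(r'[+-]?\d[\d,]*(?:\.\d+)?')

def extract_final_answer(text):
    """Return the number after the last 'ANSWER:' marker as a string, or None."""
    markers = list(re.finditer(r'ANSWER:', text))
    if not markers:
        return None
    # Keep only the first non-empty line after the last marker.
    tail = text[markers[-1].end():].lstrip().split('\n', 1)[0]
    # If the model wrote an equation ("48 + 24 = 72"), keep the right-hand side.
    tail = tail.rsplit('=', 1)[-1]
    match = _NUMBER.search(tail)
    if match is None:
        return None
    # Remove thousands separators so "1,000" compares equal to "1000".
    return match.group(0).replace(',', '')

print(extract_final_answer("Step 5: Provide the final answer.\nANSWER: 72"))
# 72
```

This variant takes the first number on the answer line, whereas `get_numbers` takes the last number it finds, so the two can disagree on answers that contain several numbers.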

### 3.4 Evaluate!

In [10]:
question_prompts = [gsm8k_prompt(q) for q in questions]
resps = chat_resp_batched(model, tokenizer, question_prompts, batch_size=10)


Device set to use cuda:0


processing batch: [[{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value in the form "ANSWER: <single exact numerical value>" on the last line. '}, {'role': 'user', 'content': 'Question: Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?'}], [{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, con

In [11]:
llm_answers = []

for resp in resps:
    gen_text = resp[0]['generated_text']
    print("--------")
    print(gen_text)
    print("--------")
    num = get_numbers(gen_text)
    print(num)
    llm_answers.append(num)
    print("---------" )
    print(llm_answers)

--------
 Step 1: Determine the prices of the glasses.
The first glass costs $5.
Every second glass costs 60% of the price of the first glass, so 60% of $5 is calculated as follows:
0.60 * $5 = $3

Step 2: Determine the number of each type of glass.
Kylar wants to buy 16 glasses in total. Since every second glass is cheaper, we have an alternating pattern of expensive and cheap glasses. To calculate the number of each type, we can divide the total number of glasses by 2, because for every pair (1 expensive + 1 cheap), we have two glasses.
16 glasses / 2 = 8 pairs of glasses

Step 3: Calculate the total cost for each type of glass.
For the expensive glasses (the first glass in each pair), we have 8 glasses, and each costs $5. So the total cost for the expensive glasses is:
8 glasses * $5/glass = $40

For the cheap glasses (the second glass in each pair), we have 8 glasses, and each costs $3. So the total cost for the cheap glasses is:
8 glasses * $3/glass = $24

Step 4: Calculate the to

In [12]:
print(llm_answers)
print(num_answers)

['64', '260', '120', '315', '460', '366', '694']
['64', '260', '160', '45', '460', '366', '694']


In [13]:
## manual way to compute the correct rate

error = 0
for i in range(0, len(llm_answers)):
    if llm_answers[i] != num_answers[i]:
        error += 1
print(f"number of errors: {error} \n correct rate: {1 - error / len(llm_answers)}")

number of errors: 2 
 correct rate: 0.7142857142857143


In [14]:
## the same computation using the Hugging Face `evaluate` library

import evaluate
exact_match = evaluate.load("exact_match")
results = exact_match.compute(predictions=llm_answers, references=num_answers)
print(results)

{'exact_match': 0.7142857142857143}
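
Both the manual loop and `exact_match` compare strings, so a prediction of `"72.0"` against a reference of `"72"` counts as an error. A minimal sketch of a numeric comparison follows. The helpers `to_number` and `numeric_accuracy` are hypothetical names introduced here, and treating unparsable or missing predictions as wrong is an assumption about the desired scoring.

```python
from decimal import Decimal, InvalidOperation

def to_number(value):
    """Parse a string such as '1,234.50' into a Decimal; return None if it is not a number."""
    if value is None:
        return None
    try:
        return Decimal(str(value).replace(',', '').strip())
    except InvalidOperation:
        return None

def numeric_accuracy(predictions, references):
    """Fraction of positions where prediction and reference are numerically equal."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = to_number(pred), to_number(ref)
        # A missing or unparsable value on either side counts as incorrect.
        if p is not None and r is not None and p == r:
            correct += 1
    return correct / len(references)

print(numeric_accuracy(['64', '260.0', None, '1,000'], ['64', '260', '160', '1000']))
# 0.75
```

On the seven answers shown above, this gives the same score as the string comparison, because every prediction there is already a plain integer string.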


In [15]:
# release cached GPU memory that is no longer in use; the model itself stays on the GPU while `model` is still referenced
import torch
torch.cuda.empty_cache()


In [16]:
# check the GPU memory utilization
!nvidia-smi


Sat May  3 11:30:26 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:35:00.0 Off |                  N/A |
| 32%   46C    P0            110W /  350W |    7635MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
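
The second `nvidia-smi` output above still reports 7635MiB in use. That is expected: `torch.cuda.empty_cache()` only returns cached blocks that no tensor is using, and the weights remain allocated as long as `model` is referenced. A minimal sketch of a fuller cleanup follows. `release_cached_gpu_memory` is a hypothetical helper name, and the `try`/`except` around the import is there only so the snippet also runs on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # assumption: lets the sketch run where PyTorch is not installed
    torch = None

def release_cached_gpu_memory():
    """Collect garbage, then hand cached CUDA blocks back to the driver.

    Returns the number of bytes still held by live tensors,
    or None when PyTorch or a CUDA device is not available.
    """
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated()

# In the notebook, drop the last reference to the model first, for example:
#   del model
#   release_cached_gpu_memory()
# and then run !nvidia-smi again.
print(release_cached_gpu_memory())
```

Whether the reported usage actually drops depends on no other object (a pipeline, a cached cell output) still holding a reference to the model.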
                                                