## 3 Evaluating locally deployed models

### 3.1 Load the (quantized) model onto a single GPU

In [3]:
!pip install flash-attn

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com


In [5]:
%pip install bitsandbytes
import accelerate, bitsandbytes
import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from transformers import pipeline

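# Tell bitsandbytes which CUDA build to load (125 = CUDA 12.5). bitsandbytes reads this variable
# when it is imported, so in a fresh kernel set it before the import above.
os.environ["BNB_CUDA_VERSION"]="125"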

## Choose one of the local models. Let's use a small one for faster evaluation.
model_path = '/ssdshare/share/Phi-3-mini-128k-instruct/'

# Here is how you load a local model using the Hugging Face API
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, 
                                             device_map="cuda:0", 
                                             torch_dtype="auto", 
                                             trust_remote_code=True) 


Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Collecting bitsandbytes
  Downloading https://mirrors.tuna.tsinghua.edu.cn/pypi/web/packages/07/b7/cb5ce4d1a382cf53c19ef06c5fc29e85f5e129b4da6527dd207d90a5b8ad/bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl (76.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.45.5
Note: you may need to restart the kernel to use updated packages.



Verify that the model has been loaded onto the GPU by checking the memory usage.

In [6]:
# let's check the GPU memory utilization after loading the model
!nvidia-smi

Sat Apr 19 12:53:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:9A:00.0 Off |                  Off |
| 31%   33C    P0             54W /  450W |    7683MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
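The same check can be done programmatically instead of reading the table by eye. The following is a minimal sketch, assuming the `used MiB / total MiB` layout shown in the output above; `parse_gpu_memory` is a hypothetical helper, not part of any library. In the notebook, the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.

```python
import re

def parse_gpu_memory(smi_text):
    """Return a list of (used_mib, total_mib) pairs found in nvidia-smi text."""
    pairs = re.findall(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_text)
    return [(int(used), int(total)) for used, total in pairs]

# The memory row copied from the nvidia-smi output above.
sample = "| 31%   33C    P0             54W /  450W |    7683MiB /  24564MiB |      0%      Default |"

used, total = parse_gpu_memory(sample)[0]
print(f"{used} MiB of {total} MiB in use ({used / total:.1%})")
```

A reading of several GiB right after loading suggests the weights are resident on the GPU; a reading close to zero would suggest the model ended up on the CPU.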
                                                

### 3.2 Generate responses locally

In [7]:
def chat_resp(model, tokenizer, question_list):
    # here is how you use the pipeline to generate local responses
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }

    output = pipe(question_list, **generation_args)  # pass a list of chats; the result has one entry per chat
    # Note: sending too many questions in one call may run out of GPU memory (see chat_resp_batched below)
    return output

def chat_resp_batched(model, tokenizer, question_list, batch_size=4):
    # Split a large question list into batches of the specified size, to avoid running out of memory
    batches = [question_list[i:i + batch_size] for i in range(0, len(question_list), batch_size)]
    all_responses = []
    for batch in batches:
        print(f"processing batch: {batch} ")
        responses = chat_resp(model, tokenizer, batch)
        all_responses.extend(responses)
    return all_responses

def gsm8k_prompt(question):
    # add system prompt to the question
    chat = [
        {"role": "system", "content": """Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line. """},
        {"role": "user", "content": "Question: " + question},
    ]
    return chat
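The slicing used in `chat_resp_batched` can be checked on its own, without a model. This is a small sketch of the same list-splitting logic; `split_into_batches` is a hypothetical name introduced here for illustration.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Seven placeholder questions, matching the size of the subset evaluated later.
placeholder_questions = [f"question {i}" for i in range(7)]

batches = split_into_batches(placeholder_questions, 4)
print([len(batch) for batch in batches])

# Flattening the batches gives back the original order.
flattened = [q for batch in batches for q in batch]
print(flattened == placeholder_questions)
```

Note that with seven questions and `batch_size=10`, as used in section 3.4, everything lands in a single batch.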

In [8]:
## Test the model with a sample question
p = gsm8k_prompt("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?")
p = [p]  # remember to send in a list of questions

chat_resp(model, tokenizer, p)  # p is the list of questions


The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


[[{'generated_text': ' Step 1: Determine the number of clips Natalia sold in May.\nSince she sold half as many clips in May as she did in April, we calculate the number of clips sold in May by taking half of the number of clips sold in April.\n\nNumber of clips sold in April = 48\nNumber of clips sold in May = 48 / 2\n\nStep 2: Calculate the number of clips sold in May.\nNumber of clips sold in May = 48 / 2 = 24\n\nStep 3: Determine the total number of clips sold in April and May.\nTo find the total number of clips sold over the two months, we add the number of clips sold in April to the number of clips sold in May.\n\nTotal number of clips sold = Number of clips sold in April + Number of clips sold in May\nTotal number of clips sold = 48 + 24\n\nStep 4: Calculate the total number of clips sold.\nTotal number of clips sold = 48 + 24 = 72\n\nIn conclusion, Natalia sold a total of 72 clips altogether in April and May.'}]]

### 3.3 Prepare the evaluation datasets

In [9]:
# Set a proxy to access Hugging Face
os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [10]:
from datasets import load_dataset
dataset = load_dataset("gsm8k", "main")  # read directly from huggingface

# if you want to use a local dataset, you can use the following code
# from datasets import load_dataset, load_from_disk
# dataset = load_from_disk("/ssdshare/share/gsm8k")

# to save time, we only use a small subset
subset = dataset['test'][5:12]
questions = subset['question']
answers = subset['answer']

dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

In [11]:
# We only want the numeric answers from the dataset for evaluation (maybe a bad choice?)

def get_exact_answer(x):
    i = x.index('####')
    return x[i+5:].strip('\n')

num_answers = list(map(get_exact_answer, answers))
print(num_answers)


['64', '260', '160', '45', '460', '366', '694']
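GSM8K reference answers end with a `#### <number>` line, which is what `get_exact_answer` relies on. Below is a slightly more defensive sketch of the same idea, under the assumption that thousands separators (for example `1,000`) should be removed before comparison. `extract_reference_answer` is a hypothetical name, and the example strings are illustrative text in the GSM8K answer format.

```python
def extract_reference_answer(answer_text):
    """Return the text after the last '####' marker, without commas or padding."""
    _, marker, tail = answer_text.rpartition("####")
    if not marker:
        raise ValueError("no '####' marker found in the reference answer")
    return tail.strip().replace(",", "")

example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(extract_reference_answer(example))
print(extract_reference_answer("Some reasoning steps.\n#### 1,000"))
```

Raising an error on a missing marker makes a malformed record visible immediately instead of silently producing a wrong reference value.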


In [12]:
# This is a tentative, fragile way to find the final answer; consider improving it.

import re
def get_numbers(s):
    # Scan the response from the last line upwards and stop at the first line that contains a number
    for line in reversed(s.split('\n')):
        numbers = re.findall(r'\d+(?:\.\d+)?', line)
        if numbers:
            return numbers[-1]  # the last number on that line is taken as the answer
    return '-9999'  # sentinel: no number found anywhere in the response
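One possible direction for improving the extraction is sketched below. It is not the notebook's method: `extract_final_number` is a hypothetical helper that, as assumptions of this sketch, also accepts thousands separators and a leading minus sign, and scans every line including the first.

```python
import re

# Optional minus sign (not directly after a digit, so "5-3" yields "3"),
# digits with optional commas, and an optional decimal part.
NUMBER = re.compile(r"(?<!\d)-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text, default="-9999"):
    """Return the last number on the last line that contains one."""
    for line in reversed(text.splitlines()):
        matches = NUMBER.findall(line)
        if matches:
            return matches[-1].replace(",", "")
    return default

print(extract_final_number("Step 1: 48 / 2 = 24\nTotal = 48 + 24 = 72"))
print(extract_final_number("The answer is $1,234.50."))
print(extract_final_number("I cannot solve this."))
```

This still picks the wrong value when the final line mentions several numbers and the answer is not the last one, so asking the model for a fixed answer format (for example a closing `#### <number>` line) may be more reliable.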

### 3.4 Evaluate!

In [13]:
question_prompts = [gsm8k_prompt(q) for q in questions]
resps = chat_resp_batched(model, tokenizer, question_prompts, batch_size=10)


processing batch: [[{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line. '}, {'role': 'user', 'content': 'Question: Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?'}], [{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and s

In [14]:
llm_answers = []

for resp in resps:
    gen_text = resp[0]['generated_text']
    print("--------")
    print(gen_text)
    print("--------")
    num = get_numbers(gen_text)
    print(num)
    llm_answers.append(num)
    print("---------" )
    print(llm_answers)

--------
 Step 1: Determine the cost of the first glass.
The first glass costs $5.

Step 2: Determine the cost of the second glass.
The second glass costs 60% of the price of the first glass. To find this, we multiply the price of the first glass ($5) by 60% (or 0.60 in decimal form).
$5 * 0.60 = $3

Step 3: Determine the total number of glasses with different pricing.
Since every second glass costs 60% of the price, we have alternating glasses at $5 and $3. So, out of 16 glasses, we have 8 glasses at each price.

Step 4: Calculate the total cost for the glasses that cost $5.
We have 8 glasses at $5 each, so we multiply the number of these glasses (8) by their individual price ($5).
8 * $5 = $40

Step 5: Calculate the total cost for the glasses that cost $3.
We have 8 glasses at $3 each, so we multiply the number of these glasses (8) by their individual price ($3).
8 * $3 = $24

Step 6: Add the total costs of the two different priced glasses to get the total cost.
$40 (cost of $5 glass

In [15]:
print(llm_answers)
print(num_answers)

['16', '260', '200', '315', '460', '366', '694']
['64', '260', '160', '45', '460', '366', '694']


In [16]:
## Compute the accuracy (correct rate) manually

error = 0
for llm_answer, num_answer in zip(llm_answers, num_answers):
    if llm_answer != num_answer:
        error += 1
print(f"number of errors: {error} \n correct rate: {1 - error / len(llm_answers)}")

number of errors: 3 
 correct rate: 0.5714285714285714
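String comparison counts `'72.0'` and `'72'` as different answers. The sketch below compares values numerically instead and also reports which questions were missed. It reuses the two lists printed above; `same_number` is a hypothetical helper introduced for illustration.

```python
from decimal import Decimal, InvalidOperation

def same_number(pred, ref):
    """Compare two answer strings numerically, falling back to text comparison."""
    try:
        return Decimal(pred) == Decimal(ref)
    except InvalidOperation:
        return pred.strip() == ref.strip()

# Values copied from the output of the previous cells.
llm_answers = ['16', '260', '200', '315', '460', '366', '694']
num_answers = ['64', '260', '160', '45', '460', '366', '694']

wrong = [i for i, (pred, ref) in enumerate(zip(llm_answers, num_answers))
         if not same_number(pred, ref)]
accuracy = 1 - len(wrong) / len(num_answers)
print(f"wrong indices: {wrong}, accuracy: {accuracy}")
```

Knowing the failing indices makes it easy to go back to the generated text and check whether the model was wrong or the number extraction picked the wrong value.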


In [18]:
## Using the Hugging Face evaluate library

%pip install evaluate

import evaluate
exact_match = evaluate.load("exact_match")
results = exact_match.compute(predictions=llm_answers, references=num_answers)
print(results)

Looking in indexes: https://mirrors.cernet.edu.cn/pypi/web/simple, https://pypi.ngc.nvidia.com
Collecting evaluate
  Downloading https://mirrors4.tuna.tsinghua.edu.cn/pypi/web/packages/a2/e7/cbca9e2d2590eb9b5aa8f7ebabe1beb1498f9462d2ecede5c9fd9735faaf/evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Note: you may need to restart the kernel to use updated packages.


{'exact_match': 0.5714285714285714}


In [19]:
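# Release cached GPU memory. Note that empty_cache() only frees blocks PyTorch no longer uses;
# the model weights stay on the GPU for as long as `model` is still referenced.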
import torch
torch.cuda.empty_cache()


In [20]:
# check the GPU memory utilization
!nvidia-smi


Sat Apr 19 13:00:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:9A:00.0 Off |                  Off |
| 31%   31C    P8             10W /  450W |    7781MiB /  24564MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
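The second reading (7781 MiB) is essentially the same as the first (7683 MiB), because the weights are still referenced by `model` when `torch.cuda.empty_cache()` runs. The sketch below shows one way to actually release them: drop the references first, then collect garbage, then empty the cache. `free_gpu_memory` is a hypothetical helper; the `torch` calls are skipped when PyTorch or a GPU is not available, so the sketch also runs on a CPU-only machine.

```python
import gc
import importlib.util

def free_gpu_memory(namespace, names):
    """Remove the named variables from namespace, then release cached CUDA memory.

    Returns the bytes still allocated by PyTorch tensors, or None without CUDA.
    """
    for name in names:
        namespace.pop(name, None)  # drop the reference if it exists
    gc.collect()  # destroy the now-unreferenced objects
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached blocks back to the driver
            return torch.cuda.memory_allocated()
    return None

# Demonstration on a stand-in namespace; in the notebook one would pass
# globals() and the names that hold the model, for example ["model"].
demo_namespace = {"model": object(), "tokenizer": object()}
remaining = free_gpu_memory(demo_namespace, ["model"])
print(sorted(demo_namespace), remaining)
```

In a notebook, cell outputs stored in `Out` or `_` can keep additional references alive, so memory may only drop fully after those are cleared or the kernel is restarted.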
                                                