## Optimizing LLMs for Speed and Memory

Ref: https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization

To exhibit near-human text understanding and generation capabilities, LLMs currently need billions of parameters (see [Kaplan et al.](https://arxiv.org/abs/2001.08361), [Wei et al.](https://arxiv.org/abs/2206.07682)). This in turn amplifies the memory demands of inference.

The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences.

In this guide, we will go over the effective techniques for efficient LLM deployment:

1. **Parallelization**

2. **Lower Precision (Quantization)**

3. **Flash Attention**

### 1. Measuring VRAM Usage

Memory requirements of LLMs are best understood by viewing the LLM as a set of weight matrices and vectors, and the text inputs as a sequence of vectors. In the following, the term *weights* refers to all model weight matrices and vectors.

As of 2025, LLMs consist of at least a couple of billion parameters. Each parameter is a decimal number, e.g. 4.5689, usually stored in float32, bfloat16, or float16 format. This makes it easy to compute the memory required to load the LLM:

- Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision.
- Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision.

To give some examples of how much VRAM it roughly takes to load a model in bfloat16:

- GPT3 requires 2 * 175 GB = 350 GB VRAM
- Llama-2-70b requires 2 * 70 GB = 140 GB VRAM
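The rule of thumb above can be written as a small helper. This is only a minimal sketch: the name `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative, and the estimate covers the weights alone (decimal GB, no activations or KV cache).

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gb(params_in_billions, dtype="bfloat16"):
    """Rough VRAM in GB (1 GB = 10**9 bytes) needed to hold the weights alone."""
    # (X * 10**9 params) * (bytes per param) / (10**9 bytes per GB)
    return params_in_billions * BYTES_PER_PARAM[dtype]

for name, size in [("GPT3", 175), ("Llama-2-70b", 70)]:
    print(f"{name}: {weight_memory_gb(size)} GB in bfloat16, "
          f"{weight_memory_gb(size, 'float32')} GB in float32")
```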

However, the current largest GPU chip on the market is the NVIDIA GB200 offering 192GB of VRAM. Our 4090 only has 24GB of VRAM. 4090D has 48GB of VRAM. Most of the models listed before require more than 24GB just to be loaded and therefore necessarily require some form of optimization.

Moreover, we often cache the hidden representations (the attention keys and values) of the input and generated tokens to speed up generation. This technique, called KV caching, is an example of a space-time trade-off in computer science. It is especially useful for long sequences and is commonly integrated into LLM inference frameworks such as vLLM and SGLang.

https://lmcache.ai/kv_cache_calculator.html is a good tool to calculate the memory requirements of the KV cache.

Here we only input and output a short sequence of tokens, so the KV cache is negligible. In the following parts, we only focus on the memory requirements of the model weights.
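For intuition about when the KV cache stops being negligible, its size can be estimated from the model shape. The sketch below assumes a Llama-3-8B-like configuration (32 layers, 8 key/value heads, head dimension 128, bfloat16 values). These numbers are assumptions rather than values read from the checkpoint, so check the model's `config.json` before relying on them.

```python
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per token, KV head and layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

short = kv_cache_bytes(seq_len=400)    # a short prompt plus a few hundred new tokens
long = kv_cache_bytes(seq_len=8192)    # a full 8K-token context

print(f"400 tokens:  {short / 1024**2:.0f} MiB")
print(f"8192 tokens: {long / 1024**3:.2f} GiB")
```

Under these assumptions a few hundred tokens cost tens of MiB, which is small next to the weights, while long contexts or large batches quickly reach the GiB range.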

In [1]:
# %pip install transformers accelerate bitsandbytes optimum

In [2]:
model_path = "/ssdshare/share/Meta-Llama-3-8B-Instruct/"

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16, # note here
    attn_implementation="flash_attention_2",
    device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Device set to use cuda:0


In [4]:
prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:"

result = pipe(prompt, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"][len(prompt):]
result

' Here is a simple function in Python that converts bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n    return bytes / (1024.0 ** 3)\n```\n\nYou can use this function by passing the number of bytes as an argument. For example:\n\n```\nprint(bytes_to_gigabytes(1000000000))  # Output: 953.67431640625\n```\n\nThis function works by dividing the number of bytes by 1024 cubed (1024 * 1024 * 1024), which is the number of bytes in a Giga byte. The `**` operator is used to raise 1024 to the power of 3. The result is a float, so we use a decimal point in the function definition. \n\nNote that this function assumes that the input is in bytes. If the input is in a different unit (e.g., kilobytes, megabytes), you will need to convert it to bytes first. For example, if you have the number of kilobytes, you would multiply it by 1024. If you have the number of megabytes, you would multiply it by 1024 * 1024. \n\nAlso, this function does not perform any error checking, so you should make 

In [5]:
def bytes_to_giga_bytes(bytes):
    """
    Convert bytes to gigabytes.

    Parameters:
        bytes (int): The number of bytes to convert.

    Returns:
        float: The equivalent value in gigabytes.
    """
    # 1 Gigabyte = 1024^3 bytes
    gigabytes = bytes / (1024**3)
    return gigabytes

Let’s call torch.cuda.max_memory_allocated to measure the peak GPU memory allocation.

In [6]:
bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

15.010785579681396

Close enough to our back-of-the-envelope computation of 2 * 8 GB = 16 GB for an 8-billion-parameter model! The number is not exactly the same because `bytes_to_giga_bytes` divides by 1024 at each step (bytes to kilobytes, and so on) instead of 1000.
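The effect of the 1000 versus 1024 convention can be reproduced with a little arithmetic. The sketch below assumes about 8.03 billion parameters for Llama-3-8B (an assumption based on the public model card, not something measured here) and 2 bytes per bfloat16 parameter.

```python
n_params = 8.03e9        # assumed parameter count for Llama-3-8B
bytes_per_param = 2      # bfloat16

total_bytes = n_params * bytes_per_param
decimal_gb = total_bytes / 1000**3   # convention behind the "2 * X GB" rule of thumb
binary_gib = total_bytes / 1024**3   # convention used by bytes_to_giga_bytes

print(f"{decimal_gb:.2f} GB  (powers of 1000)")
print(f"{binary_gib:.2f} GiB (powers of 1024)")
```

The binary figure lands just under 15, close to the measured value; any remaining difference is plausibly runtime overhead and inference activations.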

Note that if we had tried to run the model in full float32 precision, a whopping 32 GB of VRAM would have been required.

Almost all models are trained in bfloat16 nowadays, so there is no reason to run the model in full float32 precision if your GPU supports bfloat16. Float32 won’t give better inference results than the precision that was used to train the model.

Let’s define a flush(...) function to free all allocated memory so that we can accurately measure the peak allocated GPU memory.

In [7]:
import gc
import torch

del pipe
del model

def flush():
  gc.collect()
  torch.cuda.empty_cache()
  torch.cuda.reset_peak_memory_stats()

In [8]:
flush()

In [9]:
# In the recent version of the accelerate library, you can also use a utility method called release_memory()

# from accelerate.utils import release_memory
# release_memory(model)

### 2. Lower Precision

What if we could reduce the memory requirements even further?
It has been found that model weights can be quantized to 8 bits or 4 bits without a significant loss in performance (see [Dettmers et al.](https://arxiv.org/abs/2208.07339)).

#### 2.1 Intro to Quantization

Without going into too many details, quantization schemes aim to reduce the precision of the weights while keeping the model’s inference results as accurate as possible (i.e. as close as possible to bfloat16). Quantization works especially well for text generation because all we care about is choosing the set of most likely next tokens, not the exact values of the next-token logit distribution. All that matters is that the next-token logit distribution stays roughly the same, so that an argmax or top-k operation gives the same results.
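A toy illustration of this point: if quantization perturbs the logits only slightly, the ranking of the top tokens is unchanged. The logits and error values below are made up for illustration and do not come from any model.

```python
def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Hypothetical next-token logits from the bfloat16 model.
reference_logits = [2.10, 5.30, 0.40, 4.70, -1.20]
# The same logits with a small, made-up quantization error added.
error = [0.05, -0.08, 0.02, 0.06, -0.03]
quantized_logits = [l + e for l, e in zip(reference_logits, error)]

print(top_k_indices(reference_logits, k=2))
print(top_k_indices(quantized_logits, k=2))
```

When two logits are nearly tied, even a small error can flip the choice, which is one way a quantized model can occasionally produce a different answer.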

There are various quantization techniques, which we won’t discuss in detail here, but in general, all quantization techniques work as follows:

1. Quantize all weights to the target precision
2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision
3. Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision

In a nutshell, this means that input-weight matrix multiplications, with $X$ being the inputs, $W$ being a weight matrix and $Y$ being the output:
$$
Y = X * W
$$
are changed to
$$
Y = X * \text{dequantize}(W)
$$
for every matrix multiplication. Dequantization and re-quantization are performed sequentially for all weight matrices as the inputs run through the network graph.
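The three steps can be sketched in plain Python. This is a deliberately simplified illustration using a single absmax scale per weight vector; it is not the scheme `bitsandbytes` actually implements, and the helper names and numbers are made up for the example.

```python
def quantize_absmax(weights, bits=8):
    """Quantize floats to signed integers with one shared absmax scale."""
    qmax = 2 ** (bits - 1) - 1                           # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0   # fall back to 1.0 if all weights are 0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in q_weights]

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

W = [0.12, -0.53, 0.98, -0.07]   # one column of a weight matrix (made up)
X = [1.0, 2.0, -1.0, 0.5]        # one input vector (made up)

q_W, scale = quantize_absmax(W)              # step 1: quantize once, store the integers
y_full = dot(X, W)                           # Y = X * W
y_quant = dot(X, dequantize(q_W, scale))     # Y = X * dequantize(W)

print(q_W)
print(f"full: {y_full:.4f}  quantized: {y_quant:.4f}")
```

The two results agree to about three decimal places here, but note the extra `dequantize` call that every product now needs.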

Therefore, inference time is often not reduced when using quantized weights, but rather increases. Enough theory, let’s give it a try! To quantize the weights with Transformers, you need to make sure that the bitsandbytes library is installed.

In [10]:
# !pip install bitsandbytes

In [11]:
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto",
    quantization_config=config,
    attn_implementation="flash_attention_2",
    device_map="cuda:0",
)




Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [12]:
pipe = pipeline("text-generation", model=quantized_model, tokenizer=tokenizer)

result = pipe(prompt, max_new_tokens=300, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"][len(prompt):]
result

Device set to use cuda:0


' Here is a simple function in Python that takes bytes as input and returns the equivalent value in Giga bytes:\n\n```python\ndef bytes_to_gigabytes(bytes):\n    return bytes / (1024 * 1024 * 1024)\n```\n\nYou can use this function to convert bytes to Giga bytes. For example:\n\n```python\nbytes_value = 1073741824  # 1 GB\ngigabytes_value = bytes_to_gigabytes(bytes_value)\nprint(gigabytes_value)  # Output: 1.0\n```\n\nIn this example, we are converting 1 GB (1073741824 bytes) to Giga bytes and the output is 1.0, which is the expected result. The function works by dividing the input bytes by the number of bytes in a Giga byte (1024 * 1024 * 1024). The result is the equivalent value in Giga bytes.'

In [13]:
bytes_to_giga_bytes(torch.cuda.max_memory_allocated())

8.689323425292969

Amazing! Significantly less! We’re down to just under 9 GB and could therefore run this model on consumer GPUs like the 4070. We’re seeing a very nice gain in memory efficiency and more or less no degradation to the model’s output. However, we can also notice a slight slow-down during inference.

#### 2.2 Benchmarking the Quantized Model

In this section, we will benchmark the full precision and quantized models to see the trade-off between accuracy loss and inference performance.

The benchmarking code is from Lab6, so we will not go into too much detail here.

In [14]:
import os
import gc
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from tqdm.notebook import tqdm
import re
from datasets import load_dataset

model_path = "/ssdshare/share/Meta-Llama-3-8B-Instruct/"

os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [15]:
def chat_resp(model, tokenizer, question_list):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "pad_token_id": tokenizer.eos_token_id,
        "max_new_tokens": 1024,
        "return_full_text": False,
        "do_sample": True,
        "temperature": 0.7,
    }

    output = pipe(question_list, **generation_args)
    return output

def chat_resp_batched(model, tokenizer, question_list, batch_size=4):
    batches = [question_list[i:i + batch_size] for i in range(0, len(question_list), batch_size)]
    all_responses = []
    for batch in tqdm(batches, desc="Processing batches"):
        responses = chat_resp(model, tokenizer, batch)
        all_responses.extend(responses)
    return all_responses

def gsm8k_prompt(question):
    chat = [
        {"role": "system", "content": """Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line. """},
        {"role": "user", "content": "Question: " + question},
    ]
    return chat

dataset = load_dataset("gsm8k", "main")  # read directly from huggingface

# to save time, we only use a small subset, you can use larger subset
subset = dataset['test'][0:40]
questions = subset['question']
answers = subset['answer']

In [16]:
def get_exact_answer(x):
    i = x.index('####')
    return x[i+5:].strip('\n')

num_answers = list(map(get_exact_answer, answers))
print(num_answers)

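def get_numbers(s):
    # Scan the lines from last to first and return the last number found on
    # the first line (counting from the end) that contains one.
    number = []
    lines = s.split('\n')
    for i in range(-1, -len(lines) - 1, -1):  # include the first line too
        number = re.findall(r'\d+(?:\.\d+)?', lines[i])
        if len(number) > 0:
            break
    if len(number) == 0:
        return '-9999'  # sentinel: no number found
    return number[-1]  # the last number is the answer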

['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104', '109', '80', '35', '70', '23', '9', '75', '2', '10', '18']
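One limitation of `get_numbers` is that its pattern does not handle thousands separators: a final answer written as `70,000` is split at the comma into `70` and `000`. The sketch below shows a more tolerant extractor. `extract_last_number` is a hypothetical helper, not part of the lab code, and it was not used to produce the results in this notebook.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')

def extract_last_number(text, default='-9999'):
    """Return the last number on the last line that contains one, commas removed."""
    for line in reversed(text.split('\n')):
        matches = NUMBER_RE.findall(line)
        if matches:
            return matches[-1].replace(',', '')
    return default

# The original pattern splits "70,000" at the comma:
print(re.findall(r'\d+(?:\.\d+)?', "The profit is $70,000."))
# The tolerant pattern keeps it together:
print(extract_last_number("Step 1: ...\nThe profit is $70,000."))
```

The trade-off is that comma-separated lists written without spaces (for example `1,2`) would be merged into one number, so this is a heuristic rather than a complete parser.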


In [17]:
# test running time for the quantized model

question_prompts = [gsm8k_prompt(q) for q in questions]

start_time_quantized_model = time.time()
quantized_model_answers = chat_resp_batched(quantized_model, tokenizer, question_prompts, batch_size=4)
quantized_model_answers = [get_numbers(resp[0]['generated_text']) for resp in quantized_model_answers]
end_time_quantized_model = time.time()

print("Quantized model time: %s seconds" % (end_time_quantized_model - start_time_quantized_model))

# clean up
del quantized_model, pipe
gc.collect()
with torch.no_grad():
    torch.cuda.empty_cache()



Processing batches:   0%|          | 0/10 [00:00<?, ?it/s]

Device set to use cuda:0


Quantized model time: 3609.7523953914642 seconds


In [18]:
# test the full model
full_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    device_map="cuda:0"
)

start_time_full_model = time.time()
full_model_answers = chat_resp_batched(full_model, tokenizer, question_prompts, batch_size=4)
full_model_answers = [get_numbers(resp[0]['generated_text']) for resp in full_model_answers]
end_time_full_model = time.time()

print("Full model time: %s seconds" % (end_time_full_model - start_time_full_model))

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Processing batches:   0%|          | 0/10 [00:00<?, ?it/s]

Device set to use cuda:0


Full model time: 526.1859505176544 seconds


In [19]:
# compare the results between the full model and the quantized model

full_model_answers = [float(x) for x in full_model_answers]
quantized_model_answers = [float(x) for x in quantized_model_answers]
num_answers = [float(x) for x in num_answers]

full_model_error_cnt, quantized_model_error_cnt = 0, 0
for i in range(0, len(full_model_answers)):
    if full_model_answers[i] != quantized_model_answers[i] or full_model_answers[i] != num_answers[i]:
        print("For question %s" % questions[i])
        print("full model answer: %s quantized model answer: %s correct answer: %s\n" % (full_model_answers[i], quantized_model_answers[i], num_answers[i]))
        if full_model_answers[i] != num_answers[i]:
            full_model_error_cnt += 1
        if quantized_model_answers[i] != num_answers[i]:
            quantized_model_error_cnt += 1

print(f"Full model error count: {full_model_error_cnt} out of {len(full_model_answers)}")
print(f"Quantized model error count: {quantized_model_error_cnt} out of {len(quantized_model_answers)}")

For question Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?
full model answer: 0.0 quantized model answer: 0.0 correct answer: 70000.0

For question Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy.  She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed.  In the afternoon, she gives her chickens another 25 cups of feed.  How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?
full model answer: 7.0 quantized model answer: 21.0 correct answer: 20.0

For question Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 g

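We observe little accuracy loss but a substantial slowdown: the quantized run took about 3610 seconds versus 526 seconds for the bfloat16 model, roughly 7 times longer.

That's acceptable when memory is the bottleneck, because we are trading a bit of accuracy and speed for a large memory saving (the weights shrink by about 50%, and peak usage dropped from 15.0 GB to 8.7 GB). Quantization techniques other than `bitsandbytes` have been developed to reduce the performance degradation; you can try them yourself.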
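The size of the trade-off can be computed directly from the numbers printed by the cells above. This is plain arithmetic on the recorded outputs; the variable names are illustrative.

```python
# Numbers copied from the cell outputs above.
full_time_s, quant_time_s = 526.1859505176544, 3609.7523953914642
full_mem_gb, quant_mem_gb = 15.010785579681396, 8.689323425292969

slowdown = quant_time_s / full_time_s
memory_saved = 1 - quant_mem_gb / full_mem_gb

print(f"8-bit model is {slowdown:.1f}x slower")
print(f"8-bit model uses {memory_saved:.0%} less peak memory")
```

Because the benchmark samples with `temperature=0.7`, the two runs may generate different numbers of tokens, so the timing ratio is only a rough indication.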

In [None]:
# Observe the experiments above, and summarize the difference between the full model and the quantized model here
# Based on the experiments above, we can observe several key differences between the full precision model and quantized model:

# 1. Memory Usage:
#    - The full precision model (bfloat16) required approximately 15GB of VRAM
#    - The 8-bit quantized model required about 8.7GB of VRAM (roughly 42% reduction)
#    - This memory reduction is crucial for running larger models on consumer GPUs with limited VRAM

# 2. Inference Speed:
#    - The quantized model was noticeably slower in inference time (3600s vs 526s)
#    - This speed degradation is due to the additional overhead of dequantizing weights during inference

# 3. Accuracy:
#    - The accuracy difference between full precision and quantized models was minimal
#    - On the GSM8K benchmark, the models showed very similar error rates
#    - The quantized model maintained most of the reasoning capabilities of the full model

# 4. Trade-offs:
#    - Quantization provides a valuable memory-speed tradeoff
#    - Ideal for deployment scenarios where memory is the primary constraint
#    - For latency-sensitive applications, the full precision model may be preferred if sufficient VRAM is available
#    - Additional optimization techniques can be combined with quantization for better performance

In [5]:
#### Your Task ####
# find a model on huggingface and try to optimize it to fit it into 24GB of VRAM 
# using 4 bit quantization and observe the performance and accuracy loss
# you can reuse the same benchmark code above
# Note:
# If you encounter CUDA out of memory error,
# you can assign None to the variables which were bound to the models and pipes,
# then call `gc.collect()` and `torch.cuda.empty_cache()`.
# Python garbage collector will handle the freeing of the GPU memory.
# See the code above.
# You can also restart the jupyter notebook kernel.
# Import the required libraries
import os
import gc
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from tqdm.notebook import tqdm
import re
import pickle
from datasets import load_dataset

# Set up the proxy (if needed)
os.environ['HTTP_PROXY'] = "http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY'] = "http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY'] = "socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

# Memory management helper
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

# Convert bytes to GB
def bytes_to_giga_bytes(bytes_val):
    return bytes_val / (1024**3)

In [6]:
# Modified text generation function that no longer relies on the chat template
def generate_text(model, tokenizer, question_list):
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )   
    generation_args = {
        "pad_token_id": tokenizer.eos_token_id,
        "max_new_tokens": 1024,
        "return_full_text": False,
        "do_sample": True,
        "temperature": 0.7,
    }

    output = pipe(question_list, **generation_args)
    return output

def generate_text_batched(model, tokenizer, question_list, batch_size=4):
    batches = [question_list[i:i + batch_size] for i in range(0, len(question_list), batch_size)]
    all_responses = []
    for batch in tqdm(batches, desc="Processing batches"):
        responses = generate_text(model, tokenizer, batch)
        all_responses.extend(responses)
    return all_responses

# Modified prompt function that uses plain text instead of the chat format
def format_math_prompt(question):
    return f"""Please solve the following math problem by providing a detailed, step-by-step explanation:

Question: {question}

Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. 
After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line.

Solution:"""

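def get_numbers(s):
    # Scan the lines from last to first and return the last number found on
    # the first line (counting from the end) that contains one.
    number = []
    lines = s.split('\n')
    for i in range(-1, -len(lines) - 1, -1):  # include the first line too
        number = re.findall(r'\d+(?:\.\d+)?', lines[i])
        if len(number) > 0:
            break
    if len(number) == 0:
        return '-9999'  # sentinel: no number found
    return number[-1]  # the last number is the answer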

In [None]:
# Load the GSM8K dataset
print("Loading GSM8K benchmark dataset...")
dataset = load_dataset("gsm8k", "main")
subset = dataset['test'][0:20]  # use a smaller subset to save time
questions = subset['question']
answers = subset['answer']

def get_exact_answer(x):
    i = x.index('####')
    return x[i+5:].strip('\n')

num_answers = list(map(get_exact_answer, answers))
num_answers = [float(x) for x in num_answers]
print(f"Loaded {len(questions)} questions")

# Prepare the question prompts
formatted_prompts = [format_math_prompt(q) for q in questions]

# Cache the prompts to a file for later use
with open('formatted_prompts.pkl', 'wb') as f:
    pickle.dump(formatted_prompts, f)

Loading GSM8K benchmark dataset...
Loaded 20 questions


In [8]:
# Use the faster Phi-3-mini model
model_id = "/ssdshare/share/Phi-3-mini-128k-instruct/"

# Load the tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Make sure the padding token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("\n========== Testing Full Precision model ==========")
try:
    print("Loading model with full precision (bfloat16)...")
    
    model_full = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2"
    )
    
    # Record memory usage
    memory_full = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
    print("Full precision model memory usage:", memory_full, "GB")
    
    # Benchmark the full precision model
    start_time_full = time.time()
    model_full_answers_raw = generate_text_batched(model_full, tokenizer, formatted_prompts, batch_size=4)
    model_full_answers = [get_numbers(resp[0]['generated_text']) for resp in model_full_answers_raw]
    model_full_answers = [float(x) for x in model_full_answers]
    end_time_full = time.time()
    
    time_full = end_time_full - start_time_full
    print("Full precision model time:", time_full, "seconds")
    
    # Save the results for later use
    with open('model_full_results.pkl', 'wb') as f:
        pickle.dump({
            'memory': memory_full,
            'time': time_full,
            'answers': model_full_answers
        }, f)
    
    # Free memory
    del model_full
    flush()
except Exception as e:
    print(f"Error with full precision model: {str(e)}")
    flush()

Loading tokenizer...

Loading model with full precision (bfloat16)...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Full precision model memory usage: 13.881382465362549 GB


Processing batches:   0%|          | 0/5 [00:00<?, ?it/s]

Device set to use cuda:0


Full precision model time: 287.61535596847534 seconds


In [9]:
print("\n========== Testing 4-bit quantization ==========")
try:
    print("Loading model with 4-bit quantization...")
    config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model_4bit = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config_4bit,
        device_map="auto",
        attn_implementation="flash_attention_2"
    )
    
    # Record memory usage
    memory_4bit = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
    print("4-bit model memory usage:", memory_4bit, "GB")
    
    # Benchmark the 4-bit model
    start_time_4bit = time.time()
    model_4bit_answers_raw = generate_text_batched(model_4bit, tokenizer, formatted_prompts, batch_size=4)
    model_4bit_answers = [get_numbers(resp[0]['generated_text']) for resp in model_4bit_answers_raw]
    model_4bit_answers = [float(x) for x in model_4bit_answers]
    end_time_4bit = time.time()
    
    time_4bit = end_time_4bit - start_time_4bit
    print("4-bit model time:", time_4bit, "seconds")
    
    # Save the results for later use
    with open('model_4bit_results.pkl', 'wb') as f:
        pickle.dump({
            'memory': memory_4bit,
            'time': time_4bit,
            'answers': model_4bit_answers
        }, f)
    
    # Free memory
    del model_4bit
    flush()
except Exception as e:
    print(f"Error with 4-bit model: {str(e)}")
    flush()


Loading model with 4-bit quantization...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

4-bit model memory usage: 10.318150520324707 GB


Processing batches:   0%|          | 0/5 [00:00<?, ?it/s]

Device set to use cuda:0


4-bit model time: 345.04740834236145 seconds


In [10]:
print("\n========== Comparing Results ==========")
try:
    # Load the results from the saved files
    with open('model_full_results.pkl', 'rb') as f:
        model_full_results = pickle.load(f)
        model_full_answers = model_full_results['answers']
        memory_full = model_full_results['memory']
        time_full = model_full_results['time']
    
    
    with open('model_4bit_results.pkl', 'rb') as f:
        model_4bit_results = pickle.load(f)
        model_4bit_answers = model_4bit_results['answers']
        memory_4bit = model_4bit_results['memory']
        time_4bit = model_4bit_results['time']
    
    # Compare accuracy
    model_full_error_cnt = 0
    model_4bit_error_cnt = 0
    
    print("Accuracy comparison:")
    for i in range(len(num_answers)):
        if model_full_answers[i] != num_answers[i]:
            model_full_error_cnt += 1
        if model_4bit_answers[i] != num_answers[i]:
            model_4bit_error_cnt += 1

        if (model_full_answers[i] != model_4bit_answers[i] or
            model_full_answers[i] != num_answers[i]):
            print(f"Question {i}:")
            print(f"  Full precision answer: {model_full_answers[i]}")
            print(f"  4-bit answer: {model_4bit_answers[i]}")
            print(f"  Correct answer: {num_answers[i]}")
            print()
    
    # Summary
    print("\n========== Summary ==========")
    print(f"Full precision model error rate: {model_full_error_cnt}/{len(num_answers)} ({model_full_error_cnt/len(num_answers)*100:.2f}%)")
    print(f"4-bit model error rate: {model_4bit_error_cnt}/{len(num_answers)} ({model_4bit_error_cnt/len(num_answers)*100:.2f}%)")
    print(f"Full precision model time: {time_full:.2f} seconds")
    print(f"4-bit model time: {time_4bit:.2f} seconds")
    print(f"Memory comparison:")
    print(f"  Full precision: {memory_full:.2f}GB")
    print(f"  4-bit: {memory_4bit:.2f}GB ({(memory_full - memory_4bit)/memory_full*100:.2f}% reduction)")
except Exception as e:
    print(f"Error in comparison: {str(e)}")


Accuracy comparison:
Question 2:
  Full precision answer: 0.0
  4-bit answer: 0.0
  Correct answer: 70000.0

Question 7:
  Full precision answer: 40.0
  4-bit answer: 120.0
  Correct answer: 160.0

Question 8:
  Full precision answer: 315.0
  4-bit answer: 315.0
  Correct answer: 45.0

Question 12:
  Full precision answer: 12.0
  4-bit answer: 12.0
  Correct answer: 13.0

Question 13:
  Full precision answer: 6.0
  4-bit answer: 5.0
  Correct answer: 18.0

Question 15:
  Full precision answer: 96.0
  4-bit answer: 125.0
  Correct answer: 125.0

Question 16:
  Full precision answer: 170.0
  4-bit answer: 230.0
  Correct answer: 230.0

Question 17:
  Full precision answer: 500.0
  4-bit answer: 500.0
  Correct answer: 57500.0

Question 18:
  Full precision answer: 4.0
  4-bit answer: 4.0
  Correct answer: 7.0

Question 19:
  Full precision answer: 6.0
  4-bit answer: 4.0
  Correct answer: 6.0


Full precision model error rate: 9/20 (45.00%)
4-bit model error rate: 8/20 (40.00%)
Full pre