## Task 5: Evaluating Large Language Model Capabilities (40 points)

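### Assignment

From each of the two categories of large models, open-source and closed-source, select at least 2 models and at least 2 versions of each model. Using the evaluation metrics given in the reference, evaluate how well the different models solve simple math problems. Then improve model performance by, for example, optimizing the prompts or refining the model's CoT reasoning process.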

### Reference Data

[GSM8K](https://huggingface.co/datasets/openai/gsm8k)

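### Requirements

Write an evaluation report of no more than 5 pages covering, at a minimum, the models used and their characteristics, the optimized prompts, and a comparison and analysis of model performance.

### References
 - [Karl Cobbe, et al. Training Verifiers to Solve Math Word Problems. 2021.](https://arxiv.org/abs/2110.14168)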

---

## Source Code 👇
 - Open-source models
   - GPT-NeoX-20B
   - [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B)
     - ChatGLM2
     - ChatGLM4-9B
 - Closed-source models
   - GPT
     - GPT-4
     - GPT-4o
   - B

---

In [13]:
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, GPTNeoXForCausalLM, GPT2Tokenizer

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from datasets import Dataset

In [14]:
# init logger
import logging, sys, codecs

logger = logging.getLogger()
logger.setLevel(logging.INFO)

handler = logging.FileHandler('evaluate.log', encoding='utf-8', mode='w')
formatter = logging.Formatter('[%(asctime)s][%(levelname)s] %(message)s')
handler.setFormatter(formatter)

logger.addHandler(handler)

In [None]:
gsm8k_test_path = "../data/gsm8k/test-00000-of-00001.parquet"
gsm8k_train_path = "../data/gsm8k/train-00000-of-00001.parquet"

ds_test = Dataset.from_parquet(gsm8k_test_path)
ds_train = Dataset.from_parquet(gsm8k_train_path)

print(f"Train Dataset Sample: {ds_train[0]}")
print(f"Test Dataset Sample: {ds_test[0]}")
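
Each GSM8K `answer` field holds the worked solution followed by the final number after a `####` marker, with calculator annotations such as `<<16-3-4=9>>` embedded in the reasoning. The sketch below shows one way to split such a field. The helper name `parse_reference` is an assumption for illustration, not part of the dataset tooling.

```python
import re

def parse_reference(answer: str) -> tuple[str, str]:
    """Split a GSM8K `answer` field into (reasoning, final_answer)."""
    # Everything before the last "####" is reasoning; everything after is the result.
    reasoning, _, final = answer.rpartition("####")
    # Drop calculator annotations such as <<16-3-4=9>> from the reasoning text.
    reasoning = re.sub(r"<<.*?>>", "", reasoning).strip()
    # Remove thousands separators so the final answer parses as a number.
    final = final.strip().replace(",", "")
    return reasoning, final

sample = (
    "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\n"
    "She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer's market.\n"
    "#### 18"
)
reasoning, final = parse_reference(sample)
print(reasoning)
print(final)
```

Returning the final answer as a string keeps the helper neutral about integer versus decimal answers; the caller decides how to convert it.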

---

## Open-Source Model 1.1: GPT-NeoX-20B

In [None]:
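from transformers import AutoTokenizer, GPTNeoXForCausalLM

# Load the model and tokenizer. AutoTokenizer resolves to the tokenizer class
# declared by the checkpoint, which avoids a GPT-2 / GPT-NeoX tokenizer mismatch.
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

def evaluate_model(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds only the generated text; max_length would also count the prompt.
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)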

In [None]:
from transformers import AutoTokenizer, GPTNeoXForCausalLM
import openai

# GPT-4 API configuration (assumes openai>=1.0)
openai.api_key = "your_openai_api_key"

# Load the open-source model: GPT-NeoX
def load_gpt_neox():
    model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    return model, tokenizer

# Run inference with the open-source model
def infer_with_gpt_neox(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    # max_new_tokens bounds only the generated text; max_length would also count the prompt.
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Run inference with the closed-source model: GPT-4.
# GPT-4 is a chat model, so it is called through the chat completions endpoint.
def infer_with_gpt4(prompt):
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.0
    )
    return response.choices[0].message.content.strip()

# Example: load and test the models
model, tokenizer = load_gpt_neox()
sample_prompt = "Let's solve 25 + 32 step by step."
gpt_neox_result = infer_with_gpt_neox(model, tokenizer, sample_prompt)
gpt4_result = infer_with_gpt4(sample_prompt)

print("GPT-NeoX Result:", gpt_neox_result)
print("GPT-4 Result:", gpt4_result)

In [None]:
# Build a CoT (chain-of-thought) prompt
def generate_cot_prompt(problem):
    return f"Let's solve this step by step: {problem}"

# Example problem
problem = "What is 25 + 32?"
cot_prompt = generate_cot_prompt(problem)

# Test inference with the CoT prompt
gpt_neox_cot_result = infer_with_gpt_neox(model, tokenizer, cot_prompt)
gpt4_cot_result = infer_with_gpt4(cot_prompt)

print("GPT-NeoX CoT Result:", gpt_neox_cot_result)
print("GPT-4 CoT Result:", gpt4_cot_result)
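
The zero-shot prompt above only prepends an instruction. One possible refinement is a few-shot CoT prompt that shows the model a worked example and a fixed answer line before the real question. The sketch below is a minimal illustration: `build_few_shot_prompt` and `EXEMPLARS` are hypothetical names, and the exemplar is hand-written rather than taken from GSM8K.

```python
# Hypothetical worked example; in practice exemplars could be drawn from the
# GSM8K training split so that the test split stays untouched.
EXEMPLARS = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "reasoning": "Tom starts with 3 apples. After buying 2 more he has 3 + 2 = 5 apples.",
        "answer": "5",
    },
]

def build_few_shot_prompt(problem: str, exemplars=EXEMPLARS) -> str:
    """Prepend worked examples, then ask for step-by-step reasoning."""
    parts = []
    for ex in exemplars:
        # Each exemplar ends with a fixed "The answer: N." line for the model to imitate.
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer: {ex['answer']}."
        )
    # The real question comes last, with the reasoning left open for the model.
    parts.append(f"Q: {problem}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt("What is 25 + 32?")
print(prompt)
```

Ending every exemplar with the same answer line gives the scoring code a stable pattern to search for in the model output.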

In [None]:
import re

def evaluate_accuracy(model, tokenizer, dataset, cot_prompt=False):
    correct_count = 0
    total_count = len(dataset)
    
    for example in dataset:
        # GSM8K rows have the fields "question" and "answer"
        problem = example["question"]
        # The reference's final number follows the last "####" marker
        correct_answer = example["answer"].split("####")[-1].strip().replace(",", "")
        
        # Choose the prompt style depending on whether CoT prompting is enabled
        if cot_prompt:
            prompt = generate_cot_prompt(problem)
        else:
            prompt = problem
        
        # Get the model's answer
        if isinstance(model, str) and model == "gpt4":
            model_result = infer_with_gpt4(prompt)
        else:
            model_result = infer_with_gpt_neox(model, tokenizer, prompt)
        
        # Simple correctness check: compare the last number in the output with the reference
        numbers = re.findall(r"-?\d+(?:\.\d+)?", model_result.replace(",", ""))
        if numbers and float(numbers[-1]) == float(correct_answer):
            correct_count += 1
    
    accuracy = correct_count / total_count
    return accuracy

# Evaluate the accuracy of GPT-NeoX and GPT-4
gpt_neox_accuracy = evaluate_accuracy(model, tokenizer, ds_test, cot_prompt=True)
gpt4_accuracy = evaluate_accuracy("gpt4", None, ds_test, cot_prompt=True)

print(f"GPT-NeoX Accuracy: {gpt_neox_accuracy * 100:.2f}%")
print(f"GPT-4 Accuracy: {gpt4_accuracy * 100:.2f}%")

---

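## Open-Source Model 2.1: ChatGLM2-6B

### Model Implementation and Weights

Because the local network connection to Hugging Face is unreliable, both the model implementation and the weights are loaded from local disk.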

In [None]:
MODEL_PATH = "../data/chatglm2-6b"
tokenizer_glm2_6b = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model_glm2_6b = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, device='cuda')
model_glm2_6b = model_glm2_6b.eval()

In [9]:
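def infer_with_glm2_6b(question, history=None):
    # Use a fresh list per call instead of a shared mutable default, and pass it through to chat()
    response, _history = model_glm2_6b.chat(tokenizer_glm2_6b, question, history=history or [])
    return response, _history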

---

## Open-Source Model 2.2: ChatGLM4-9B

- **[ATTENTION]**
    - Before running inference with `ChatGLM4-9B`, upgrade Transformers to `>= 4.46.0`; older versions raise an error.
    - `Python` == `3.10.12`
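
The version requirement above can be checked before loading the model. The following is a minimal sketch using only the standard library. The helper names `parse_version`, `meets_minimum`, and `check_transformers` are assumptions for illustration, and the comparison only looks at the numeric release components.

```python
import re
from importlib.metadata import PackageNotFoundError, version

def parse_version(text: str) -> tuple:
    """Turn '4.46.0' into (4, 46, 0); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed: str, required: str) -> bool:
    """Compare two version strings by their numeric components."""
    a, b = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so that '4.46' equals '4.46.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_transformers(required: str = "4.46.0") -> bool:
    """Return True if an installed transformers package satisfies `required`."""
    try:
        return meets_minimum(version("transformers"), required)
    except PackageNotFoundError:
        return False

print(meets_minimum("4.45.2", "4.46.0"))
print(meets_minimum("4.46.0", "4.46.0"))
```

Calling `check_transformers()` at the top of the notebook would surface a version mismatch before the tokenizer load fails with a less readable error.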

References:
 - [[FIXED] Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3](https://github.com/unslothai/unsloth/issues/1059)


In [None]:
MODEL_PATH = "../data/glm-4-9b-chat-hf"

tokenizer_glm4_9b_hf = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=MODEL_PATH, trust_remote_code=True)
model_glm4_9b_hf = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()

In [40]:
def infer_with_glm4_9b_hf(question):
    message = [
        {
            "role": "system",
            "content": "Answer the following question. At the end of your answer, include 'The answer: xxx.', xxx is a number. "
        },
        {
            "role": "user",
            "content": question
        }
    ]
    inputs = tokenizer_glm4_9b_hf.apply_chat_template(message,
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )
    # Move the input tensors to whichever device holds the model
    inputs = inputs.to(model_glm4_9b_hf.device)

    # top_k=1 makes sampling pick the most likely token at every step
    gen_kwargs = {"max_length": 4000, "do_sample": True, "top_k": 1}
    with torch.no_grad():
        outputs = model_glm4_9b_hf.generate(**inputs, **gen_kwargs)
        # Keep only the newly generated tokens, dropping the prompt
        outputs = outputs[:, inputs['input_ids'].shape[1]:]
    response = tokenizer_glm4_9b_hf.decode(outputs[0], skip_special_tokens=True)
    print(response)
    return response

---

## Closed-Source Model 1.1: GPT-4

---

## Closed-Source Model 1.2: GPT-4o

---

## Closed-Source Model 2.1: openai-o1

---

## Closed-Source Model 2.2: openai-o1-preview

---

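## LLM Evaluation

This section evaluates the 8 models listed above on the `GSM8K` math word problem dataset.

The experiments follow the `CoT` reasoning approach from the reference and measure how many of the 1,319 test samples each model answers correctly.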
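
Accuracy here is the fraction of test problems answered correctly. When two models differ by only a few points, it can help to report a standard error next to the accuracy. The sketch below assumes per-question outcomes are stored as booleans; `accuracy_with_stderr` is a hypothetical helper, and the outcomes shown are toy values, not measured results.

```python
import math

def accuracy_with_stderr(results: list[bool]) -> tuple[float, float]:
    """Return (accuracy, binomial standard error) for per-question outcomes."""
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    p = sum(results) / n
    # Standard error of a proportion: sqrt(p * (1 - p) / n)
    stderr = math.sqrt(p * (1 - p) / n)
    return p, stderr

# Toy outcomes for illustration only
outcomes = [True, True, False, True]
acc, se = accuracy_with_stderr(outcomes)
print(f"accuracy = {acc:.2%} +/- {se:.2%}")
```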

In [29]:
def judge(truth:str, answer:str) -> bool:
    """Check whether the final number in the model's answer matches the reference.
        
        e.g. Truth: "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day. \nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18"
        
        e.g. Answer: "First find how many eggs Janet eats each day: 16 eggs / day - 3 eggs / day = 13 eggs / day. Then find how many eggs she bakes each day: 13 eggs / day * 4 eggs / day = 52 eggs / day. Then find the total number of eggs she sells each day: 52 eggs / day - 13 eggs / day = 39 eggs / day. Then multiply the number of eggs she sells by the price per egg to find her total earnings: 39 eggs / day * $2 / egg = $78 / day.\nThe answer: 78."
        
        In Truth, the number '18' after the last '####' is the final answer.
        In Answer, the number '78' after the last 'The answer: ' is the final answer.
    """
    logging.info(f"[Judge] [Truth]: {truth}")
    logging.info(f"[Judge] [Answer]: {answer}")
    try:
        # Strip thousands separators so values such as "1,250" parse as integers
        truth_num = int(truth.split('####')[-1].strip().replace(',', ''))
        answer_num = answer.split('The answer: ')[-1].strip()
        if answer_num.endswith('.'):
            answer_num = answer_num.split('.')[0].strip()
        answer_num = int(answer_num.replace(',', ''))
        success = (truth_num == answer_num)
        logging.info(f"[Judge] {success}. Truth: {truth_num}, Answer: {answer_num}")
        return success
    except Exception as e:
        # The answer did not follow the expected pattern; count the question as wrong
        logging.info(f"[Judge] Skip this question due to pattern mismatch: {e}")
        return False
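
Model outputs do not always follow the requested pattern exactly: they may include a currency sign, thousands separators, or a decimal value. A regular expression can tolerate these variations. The sketch below is one possible alternative, not the scoring used in this notebook; `ANSWER_PATTERN`, `extract_answer`, and `is_correct` are hypothetical names.

```python
import math
import re

# Number following "The answer:", allowing an optional "$", commas, and decimals.
ANSWER_PATTERN = re.compile(r"The answer:\s*\$?\s*(-?[\d,]*\.?\d+)")

def extract_answer(text: str):
    """Return the number after the last 'The answer:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(truth: str, answer: str) -> bool:
    """Compare the model's extracted number with the value after '####'."""
    predicted = extract_answer(answer)
    if predicted is None:
        return False
    expected = float(truth.rpartition("####")[2].strip().replace(",", ""))
    return math.isclose(predicted, expected)

print(extract_answer("39 * $2 = $78.\nThe answer: 78."))
print(is_correct("9 * 2 = 18\n#### 18", "She makes $18 a day.\nThe answer: 18."))
```

Comparing as floats with `math.isclose` treats `18` and `18.0` as the same answer, which an integer-only parse would reject.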

In [27]:
# Build a CoT (chain-of-thought) prompt
def generate_cot_prompt(problem):
    return f"Let's solve this problem step by step: {problem}"

# Evaluation function
def evaluate_accuracy(model, dataset, cot_prompt=False):
    correct_count = 0
    current_count = 0
    total_count = len(dataset)
    
    for example in dataset:
        question = example["question"]
        truth = example["answer"]
        
        # Choose the prompt style depending on whether CoT prompting is enabled
        if cot_prompt:
            prompt = generate_cot_prompt(question)
        else:
            prompt = question
        
        current_count += 1
        logging.info(f"[Evaluate] eval at {current_count}/{total_count}")
        
        # Dispatch to the inference function that matches the model name
        if isinstance(model, str) and model == "gpt4":
            model_result = infer_with_gpt4(prompt)
        elif isinstance(model, str) and model == "glm2-6b":
            model_result, _ = infer_with_glm2_6b(prompt)
        elif isinstance(model, str) and model == "glm4_9b_hf":
            model_result = infer_with_glm4_9b_hf(prompt)
        else:
            # Fail loudly so the caller never receives None in place of an accuracy
            raise ValueError(f"Model '{model}' is not supported.")
        
        if judge(truth, model_result):
            correct_count += 1
    
    accuracy = correct_count / total_count
    return accuracy
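
As more models are added, the `if`/`elif` chain grows with each one. One possible restructuring is a registry that maps model names to inference functions, so the evaluation loop stays unchanged. The sketch below uses a stub backend and a toy dataset so that it runs without any model; `BACKENDS`, `register`, `evaluate`, and the `limit` parameter are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry; real entries would wrap functions such as
# infer_with_gpt4 or infer_with_glm4_9b_hf.
BACKENDS: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    """Decorator that stores an inference function under `name`."""
    def decorator(fn):
        BACKENDS[name] = fn
        return fn
    return decorator

@register("stub-18")
def always_18(prompt: str) -> str:
    """Stub backend standing in for a real model call."""
    return "The answer: 18."

def evaluate(model_name, dataset, is_correct, limit=None):
    """Return accuracy of `model_name` on the first `limit` examples."""
    if model_name not in BACKENDS:
        raise ValueError(f"Model '{model_name}' is not supported.")
    infer = BACKENDS[model_name]
    examples = list(dataset)[:limit]  # limit=None keeps every example
    if not examples:
        raise ValueError("dataset must not be empty")
    correct = sum(is_correct(ex["answer"], infer(ex["question"])) for ex in examples)
    return correct / len(examples)

# Toy data and a toy checker, for illustration only
toy = [
    {"question": "q1", "answer": "9 * 2 = 18\n#### 18"},
    {"question": "q2", "answer": "3 + 2 = 5\n#### 5"},
]
toy_check = lambda truth, ans: truth.split("####")[-1].strip() in ans
print(evaluate("stub-18", toy, toy_check))
```

The `limit` argument makes it cheap to smoke-test a new backend on a handful of questions before running the full test split.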

In [None]:
# glm2_6b_accuracy = evaluate_accuracy("glm2-6b", dataset=ds_test, cot_prompt=True)
glm4_9b_hf_accuracy = evaluate_accuracy("glm4_9b_hf", dataset=ds_test, cot_prompt=True)

# print(f"ChatGLM2-6B Accuracy: {glm2_6b_accuracy * 100:.2f}%")
print(f"GLM4-9B-HF Accuracy: {glm4_9b_hf_accuracy * 100:.2f}%")