# COMP3361 Part 1: Building a Transformer Encoder

Note: You should finish your code solution of Part 1 & 2 with A2p12.tgz. For Q2 & Q3, you should include your writeup in this notebook.

## Q2:

## Answer:
Owing to the defination of attention maps, each row in matrix reflects the weights from all the index after softmax operation. Meanwhile, color depth in image is proportionate to the value of attention scores.
- What it looks like:
  
  In the entire sentence, other positions (y-coordinate) that are the same as the current query token (x-coordinate) and diagonal positions (x and y coordinates are the same token) have a more deep color in the figure, representing a high corresponding attention score.

- Whether this matches your expectations:

  In the 5 examples outputed, the number in each rows of each images equals to the predicted output of my transformer model.



## Q3:
## Answer:
I have tried Transformer layers 1-3. Several measured metrics are recorded below.

|  | 2 layers | 3 layers |4 layers|
| --- | --- | --- | --- |
| Accuracy on 100exs | 0.984 | 0.997 | 0.984 |
| Accuracy on whole dev | 0.978 | 0.995 |0.987 |

- Do all attention maps fit the pattern you expect?
  
  No, only the attention map armed by 1 layer Transformer fits my expectation (answer of what it looks like). It may be because the next layer of Transformer needs to access the output of the previous layer of Transformer, and the output of the previous layer is not the same as the embedding obtained from the original nn.Embedding (also including positional embedding). Additionally, after multiple layers of Transformer, the attention map is not as clear in terms of scores (weights * information content) as it is in a single layer. Therefore, the attention map obtained from multiple layers of Transformer does not have a clear interpretation as in a single layer.

- What do you see?
  
  Althugh the rule that high attention score corresponds to deep color doesn't work, positions with the same token are also been highlighted with different color (might be lighter or deeper). You can find it in the folders I saved: Part1_{i}_layers_plots (i = 1,2,3).

# COMP3361 Part 3: Generation with Large Language Model

## 3.1 Load model and tokenizer

In this section, we will use [CodeLlama-7B](https://huggingface.co/codellama/CodeLlama-7b-hf) as the language model.

In [1]:
!pip install transformers==4.37.2 datasets evaluate accelerate bitsandbytes

Collecting transformers==4.37.2
  Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.4/8.4 MB[0m [31m43.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m34.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.1/290.1 kB[0m [31m28.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m

In [2]:
from abc import ABC, abstractmethod
from typing import List, Dict, Any
import os
import json
import evaluate
from datasets import load_dataset
from tqdm import tqdm
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
class LLM(object):
    def __init__(self, model_name="codellama/CodeLlama-7b-hf"):
        self.model = AutoModelForCausalLM.from_pretrained(
          model_name, device_map="auto", load_in_4bit=True)
        self.tokenizer = AutoTokenizer.from_pretrained(
          model_name, padding_side="left")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        # 需要pad，且需要显示增加pad_token


    def generate(self, prompts: List[str], **kwargs) -> List[str]:
      # this model doesn't support parameter "batch_size"
      # we need to implement it manually
      if "batch_size" in kwargs.keys():
        outputs_list = []
        batch_size = kwargs["batch_size"]
        for batch in range(0,len(prompts),batch_size):
          for num in range(0,min(batch_size,len(prompts)-batch)):
            index = batch + num
            tokened_inputs = self.tokenizer(
                prompts[index], padding=True, return_tensors="pt").to("cuda")
            generated_ids = self.model.generate(
                **tokened_inputs, max_new_tokens=kwargs["max_new_tokens"])
            outputs = self.tokenizer.batch_decode(generated_ids,skip_special_tokens=True)[0]
            outputs_list.append(outputs)
        return outputs_list
      else:
        tokened_inputs = self.tokenizer(
              prompts, padding=True, return_tensors="pt").to("cuda")
        generated_ids = self.model.generate(
              **tokened_inputs, max_new_tokens=kwargs["max_new_tokens"])
        # 本模型确实不接受batch_size，如果必须要的话，自己手动实现
        # **的绑定和解绑后是一个字典啦
        outputs = self.tokenizer.batch_decode(generated_ids,skip_special_tokens=True)[0]
        return outputs



In [4]:
llm = LLM()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/637 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/749 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

In [5]:
llm.generate(["A list of colors: red, blue", "Portugal is"], max_new_tokens=20, batch_size=2)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['A list of colors: red, blue, green, yellow, orange, purple, brown, pink, black, white, gray',
 'Portugal is a small town in the state of New York, United States. \n\n## History\n\n']

## 3.2 (Q5) Zero-shot Code Generation

In [6]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!mkdir -p human_eval
!wget -O human_eval/__init__.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
!wget -O human_eval/data.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
!wget -O human_eval/evaluation.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/evaluation.py
!wget -O human_eval/execution.py human_eval https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/execution.py

!mkdir -p data/humaneval
!wget -O data/humaneval/HumanEval.jsonl.gz https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz

--2024-03-20 12:58:26--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/__init__.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/plain]
Saving to: ‘human_eval/__init__.py’

human_eval/__init__     [ <=>                ]       0  --.-KB/s    in 0s      

2024-03-20 12:58:26 (0.00 B/s) - ‘human_eval/__init__.py’ saved [0/0]

--2024-03-20 12:58:26--  http://human_eval/
Resolving human_eval (human_eval)... failed: Name or service not known.
wget: unable to resolve host address ‘human_eval’
--2024-03-20 12:58:26--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/human_eval/data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.

In [7]:
"""
`@abstractmethod` 是 Python 中的一个装饰器，它的作用是定义一个【抽象方法】
抽象方法是一个在基类中声明但没有具体实现的方法，它只提供了一个接口，
而具体的实现则由其子类来实现。
在 Python 中，抽象方法用于定义接口和规范，以便在子类中实现具体的功能。

1. 一个抽象方法必须被子类实现，否则会抛出 `TypeError` 异常。
2. 抽象类不能被实例化，因为其中有类没有被定义，只能用于被继承。
  子类定义之后才可以被实例化
"""

class Evaluator(ABC):
    def __init__(self, llm, evalset_file):
        self.evalset_file = evalset_file
        self.llm = llm

    @abstractmethod
    def load_data(self):
        pass

    @abstractmethod
    def build_prompts(self):
        pass

    @abstractmethod
    def postprocess_output(self, output: str) -> str:
        pass

    # TODO:
    def generate_completions(self, prompts: List[str], **kwargs) -> List[str]:
        outputs = llm.generate(prompts, **kwargs)
        # 其中kwargs包含了batch_size, max_bew_tokens, 其他kwargs中包含的参数.
        # 此时使用的llm不接受batch_size参数，自己实现batch功能 & 真正调用hugging face接口时候不给它batch_size参数
        # kwargs打包
        return outputs

    def evaluate(self, batch_size=4, save_dir="outputs", max_new_tokens=128, **kwargs):
        dataset = self.load_data()
        prompts = self.build_prompts(dataset)
        # BUG: need to predict all the examples in dataset, with batch size
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)
        # kwargs解包

        predictions = []
        if "model_type" in kwargs.keys():
          for i, (prompt, output) in enumerate(zip(prompts, outputs)):
              prediction = {
                "task_id": i,
                "prompt": prompt,
                # get a specific output of each data
                "completion": self.postprocess_output(output)
              }
              predictions.append(prediction)
        else:
          for i, (example, prompt, output) in enumerate(zip(dataset, prompts, outputs)):
              prediction = {
                "task_id": example.get("task_id", f"task_{i}"),
                "prompt": prompt,
                # get a specific output of each data
                "completion": self.postprocess_output(output)
              }
              predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        results = self.calculate_metrics(predictions, dataset)
        print(f"Results for {type(self).__name__}: {results}")

    @abstractmethod
    def calculate_metrics(self):
        pass

In [8]:
from human_eval.data import read_problems
from human_eval.evaluation import evaluate_functional_correctness


class HumanEvalEvaluator(Evaluator):
    def load_data(self, evalset_file="data/humaneval/HumanEval.jsonl.gz") -> List[Dict[str, Any]]:
        """
        Load the humaneval dataset
        :param evalset_file: path to the humaneval dataset file
        :return: list of examples
        """
        self.evalset_file = evalset_file
        return list(read_problems(evalset_file).values())

    def build_prompts(self, dataset) -> List[str]:
        """
        Build zero-shot prompts from the humaneval dataset.
        """
        prompts = [example["prompt"] for example in dataset]
        return prompts

    # Discription: extract solution from generated completions based on stop_sequences
    # TODO: find the first token which is in stop_sequences + indice initial string
    # output: one sentence
    def postprocess_output(self, output: str) -> str:
        stop_sequences =[ "\nclass", "\ndef", "\n#", "\nif", "\nprint"]
        start_index = output.find("\ndef")
        answer = output[start_index+1:] # 从第一个def开始之后才能算答案的开始
        stop_index = len(answer)
        for item in stop_sequences:
          temp_index = answer.find(item)
          if temp_index != -1: # find the stop_token
            stop_index = min(stop_index, temp_index)
        real_output = output[ :stop_index+start_index+1]
        return real_output

    def calculate_metrics(self, predictions, dataset):
        pass_at_k_results = evaluate_functional_correctness(
            sample_file=os.path.join("outputs", f"{type(self).__name__}_predictions.jsonl"),
            k=[1],
            problems={example["task_id"]: example for example in dataset},
            n_workers=64
        )
        return pass_at_k_results


In [9]:
human_eval_evaluator = HumanEvalEvaluator(llm, "data/humaneval/HumanEval.jsonl.gz")
human_eval_evaluator.evaluate(batch_size=8)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

Reading samples...


164it [00:19,  8.45it/s]


Running test suites...


100%|██████████| 164/164 [00:12<00:00, 13.21it/s]


Writing results to outputs/HumanEvalEvaluator_predictions.jsonl_results.jsonl...


100%|██████████| 164/164 [00:00<00:00, 15026.12it/s]

Results for HumanEvalEvaluator: {'pass@1': 0.29878048780487804}





## 3.3 (Q6): Few-shot Math Reasoning

> short_answer



In [10]:
# 8-shot in-context examples
GSM_EXAMPLARS = [
    {
        "question": "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
        "cot_answer": "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.",
        "pot_answer": "def solution():\n    \"\"\"There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\"\"\"\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result",
        "short_answer": "6"
    },
    {
        "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
        "pot_answer": "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
        "short_answer": "5"
    },
    {
        "question": "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
        "cot_answer": "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.",
        "pot_answer": "def solution():\n    \"\"\"Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\"\"\"\n    leah_chocolates = 32\n    sister_chocolates = 42\n    total_chocolates = leah_chocolates + sister_chocolates\n    chocolates_eaten = 35\n    chocolates_left = total_chocolates - chocolates_eaten\n    result = chocolates_left\n    return result",
        "short_answer": "39"
    },
    {
        "question": "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?",
        "cot_answer": "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\"\"\"\n    jason_lollipops_initial = 20\n    jason_lollipops_after = 12\n    denny_lollipops = jason_lollipops_initial - jason_lollipops_after\n    result = denny_lollipops\n    return result",
        "short_answer": "8"
    },
    {
        "question": "Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?",
        "cot_answer": "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. So the answer is 9.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "9"
    },
    {
        "question": "There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?",
        "cot_answer": "There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. So the answer is 29.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "29"
    },
    {
        "question": "Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?",
        "cot_answer": "Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. So the answer is 33.",
        "pot_answer": "def solution():\n    \"\"\"Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\"\"\"\n    golf_balls_initial = 58\n    golf_balls_lost_tuesday = 23\n    golf_balls_lost_wednesday = 2\n    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday\n    result = golf_balls_left\n    return result",
        "short_answer": "33"
    },
    {
        "question": "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
        "cot_answer": "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\"\"\"\n    money_initial = 23\n    bagels = 5\n    bagel_cost = 3\n    money_spent = bagels * bagel_cost\n    money_left = money_initial - money_spent\n    result = money_left\n    return result",
        "short_answer": "8"
    }
]

In [11]:
# TODO: implement all functions in the following class
"""
`dataset`对象是hugging face库中`datasets`模块的一个特定数据结构，它代表了一个数据集。
`dataset`对象提供了许多方法和属性，允许你以灵活的方式处理和操作数据集。以下是一些常用的`dataset`对象的操作：

- `dataset.num_rows`: 返回数据集中的行数。
- `dataset.column_names`: 返回数据集中所有列的名称。
- `dataset.features`: 返回数据集中的特征（例如，数据类型、形状、列名等）。
- `dataset["split_name"]`: 返回指定名称的split中的所有数据。
- `dataset.remove_columns(column_names)`: 从数据集中删除指定的列。
- `dataset.shuffle(seed=None)`: 对数据集进行随机排序。
- `dataset.filter(function, with_indices=False)`: 根据指定的函数过滤数据集中的数据。
- `dataset.map(function, batch_size=None, num_proc=None, with_indices=False, keep_in_memory=False, load_from_cache_file=True)`: 对数据集中的数据执行指定的函数。
- `dataset.sort(sorting_keys, reverse=False)`: 根据指定的键对数据集中的数据进行排序。
- `dataset.train_test_split(test_size=0.1, train_size=None, shuffle=True)`: 将数据集拆分为训练集和测试集。
- `dataset.save_to_disk(cache_dir)`: 将数据集保存到磁盘上的指定目录中。
- `dataset.load_from_disk(cache_dir)`: 从磁盘上的指定目录中加载数据集。
- `dataset.to_pandas()`: 将数据集转换为Pandas DataFrame对象。
"""

from datasets import load_dataset
import copy

# BUG：并没有对利用前8个prompts来yuce
class GSM8KEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        dataset = load_dataset(evalset_file, 'main') # 先测试前10个看看结果 & 格式
        # `dataset`对象是一个类字典（`DatasetDict`），它可以包含多个数据集（`Dataset`）作为其成员。
        # 每个数据集都由一个或多个数据集的split（如train、test、validation等）组成。
        test_split = dataset["test"]
        initial_100 = test_split[:100]
        return initial_100

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """
        Build few-shot prompts from the GSM8K dataset. Use
        :param dataset: list of examples
        :param n_shot: number of examples to use for few-shot learning
        :param demos: list of demonstrator examples
        :return: list of prompts
        """
        shared_prompts = ""
        string = "Answering the following questions"
        shared_prompts += string
        for index in range(n_shot):
          prompt_string = '\n question:{0}\n answer:{1}'.format\
           (demos[index]["question"],demos[index]["short_answer"])
          shared_prompts += prompt_string
        # 应该还需要每个prompts加一下dataset的提问，需要先看dataset的格式，再把question加上
        data_prompt = []
        for index in range(len(dataset['question'])):
        # BUG! 注意dataset的格式！不是列表用len长度就不行！
          current_prompt=copy.deepcopy(shared_prompts)
          prompt_string=('\n question:{0}\n answer:{1}').format\
           (dataset["question"][index],"")
          current_prompt += prompt_string
          data_prompt.append(current_prompt)
        return data_prompt

    def postprocess_output(self, output: str, n_shot: int) -> str:
        """
        Postprocess the output from the language model.
        """
        slots = n_shot
        end_index = -1
        start_index = 0
        for i in range(slots+3):
          temp_answer = output[start_index:]
          index = temp_answer.find("question")
          end_index += index+1
          start_index += (index+1)
        return output[:end_index]

    def calculate_metrics(self, predictions, dataset):
        real_label = []
        for answer in dataset["answer"]:
          cur_index = answer.find("####")
          real_label.append(answer[cur_index+5:].strip())
        pre_label = []
        for answer in predictions:
          answer = str(answer)
          cur_index = answer.rfind("answer")
          pre_label.append(answer[cur_index+7:].strip()[:-5])
        sum=0
        for index in range(0,len(predictions)):
          if real_label[index]==pre_label[index]:
            sum += 1
        print("real answers are:",real_label,'\n')
        print("predicted answers are:",pre_label,'\n')
        return sum/len(predictions)

    def evaluate(self, batch_size=1, save_dir="outputs", max_new_tokens=128, **kwargs):
        n_shot = kwargs["n_shot"]
        demos = kwargs["demos"]
        dataset = self.load_data()
        prompts = self.build_prompts(dataset, n_shot=n_shot, demos=demos)
        # BUG: need to predict all the examples in dataset, with batch size
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)
        print(len(outputs))
        # kwargs解包

        predictions = []
        for i, (prompt, output) in enumerate(zip(prompts, outputs)):
            prediction = {
              "task_id": i,
              "prompt": prompt,
              # get a specific output of each data
              "completion": self.postprocess_output(output, n_shot)
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        results = self.calculate_metrics(predictions, dataset)
        print(f"Results for {type(self).__name__}: {results}")

In [12]:
gsm8k_evaluator = GSM8KEvaluator(llm, evalset_file="gsm8k")
gsm8k_evaluator.evaluate(n_shot=8, demos=GSM_EXAMPLARS)

Downloading readme:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

100
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104', '109', '80', '35', '70', '23', '9', '75', '2', '10', '18', '8', '200', '26', '48', '20', '104', '163', '800', '8', '30', '294', '5', '15', '40', '40', '14', '3', '83', '57', '187', '17', '1430', '25000', '1596', '300', '36', '48', '595', '36', '60', '7425', '60', '221', '255', '88', '60', '5', '100', '6', '70', '10', '17', '623', '600', '15', '44', '22', '9360', '8000', '24', '225', '28', '4', '36', '348', '40', '3', '12', '5', '58'] 

predicted answers are: ['12', '3', '110000', '180', '10', '100', '100', '160', '120', '52', '100', '200', '10', '10', '25', '1000', '100', '1050', '12', '2', '12', '20', '11', '10', '24.5', '10', '105', '120', '35', '10', '152', '40', '50', '80', '26', '100', '105', '6', '12', '12', '10', '1000', '14', '1200', '10', '12', '21', '160', '12', '12'

## 3.4 (Q7) Few-shot Chain-of Thought Math Reasoning

>chain_of_thought

In [None]:

class GSM8KCoTEvaluator(GSM8KEvaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        """
        Load the GSM8K dataset https://huggingface.co/datasets/gsm8k with Huggingface datasets library
        Load the first 100 examples from the test split in main subset.
        """
        dataset = load_dataset(evalset_file, 'main', split="test") # 先测试前10个看看结果 & 格式
        # `dataset`对象是一个类字典 (`DatasetDict`)，它可以包含多个数据集（`Dataset`）作为其成员。
        # 每个数据集都由一个或多个数据集的split（如train、test、validation等）组成。
        examples = [{"question":example["question"],\
                "answer":example["answer"].split("####")[1].strip()}\
                    for example in dataset]
        for example in examples:
          example["answer"]=re.sub(r"(\d),(\d)",r"\1\2", example["answer"])
        return examples[:100]

    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        shared_prompts = ""
        string = "Answering the following questions with chain of thought and a single value at the end."
        shared_prompts += string
        for index in range(n_shot):
          prompt_string = '\n question:{0}\n chain_of_thought_answer:{1}'.format\
           (demos[index]["question"],demos[index]["cot_answer"])
          shared_prompts += prompt_string
        # 应该还需要每个prompts加一下dataset的提问，需要先看dataset的格式，再把question加上
        data_prompt = []
        for index in range(len(dataset)):
        # BUG! 注意dataset的格式！不是列表用len长度就不行！
          current_prompt=copy.deepcopy(shared_prompts)
          prompt_string=('\n question:{0}\n chain_of_thought_answer:{1}').format\
           (dataset[index]["question"],"")
          current_prompt += prompt_string
          data_prompt.append(current_prompt)
        return data_prompt

    def postprocess_output(self, output: str, n_shot: int) -> str:
        """
        因为有生成token的上限，所以要把prompt+真正生成的answer给拿出来
        """
        slots = n_shot
        end_index = -1
        start_index = 0
        for i in range(slots+3):
          temp_answer = output[start_index:]
          index = temp_answer.find("question")# 就算没找到end-index和start_index也不会变！
          end_index += index+1
          start_index += (index+1)
        return output[:end_index]

    def calculate_metrics(self, predictions, dataset):
        real_label = [item["answer"] for item in dataset]
        pre_label = []
        for answer in predictions:
          answer = str(answer)
          cur_index = answer.rfind("chain_of_thought_answer")
          cot_answer=answer[cur_index+24:].strip()[:-5]
          index = cot_answer.find("So the answer is ")
          pre_label.append(cot_answer[index+17:-1])
        sum=0
        for index in range(0,len(predictions)):
          if real_label[index]==pre_label[index]:
            sum += 1
        print("real answers are:",real_label,'\n')
        print("predicted answers are:",pre_label,'\n')
        return sum/len(predictions)



In [None]:
gsm8k_cot_evaluator = GSM8KCoTEvaluator(llm, evalset_file="gsm8k")
gsm8k_cot_evaluator.evaluate(n_shot=8,demos=GSM_EXAMPLARS)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

100
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104', '109', '80', '35', '70', '23', '9', '75', '2', '10', '18', '8', '200', '26', '48', '20', '104', '163', '800', '8', '30', '294', '5', '15', '40', '40', '14', '3', '83', '57', '187', '17', '1430', '25000', '1596', '300', '36', '48', '595', '36', '60', '7425', '60', '221', '255', '88', '60', '5', '100', '6', '70', '10', '17', '623', '600', '15', '44', '22', '9360', '8000', '24', '225', '28', '4', '36', '348', '40', '3', '12', '5', '58'] 

predicted answers are: ['96', '2.5', '45,000', '180', '8', '120', '8', '260 minutes', '8', '8', '314', '666', '7', '2', '20% of the remaining students enrolled in jazz dance + 25% of the remaining students enrolled in hip-hop dance = 20% of the remaining students enrolled in jazz dance + 25% of the remaining students enrolled in hip-hop dance', 

## 3.5 (Q8) Few-shot Program-of Thought Math Reasoning
> Program_of_thought

In [15]:
!pip install timeout-decorator Pebble
!wget -O python_executor.py https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py

Collecting timeout-decorator
  Downloading timeout-decorator-0.5.0.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting Pebble
  Downloading Pebble-5.0.6-py3-none-any.whl (30 kB)
Building wheels for collected packages: timeout-decorator
  Building wheel for timeout-decorator (setup.py) ... [?25l[?25hdone
  Created wheel for timeout-decorator: filename=timeout_decorator-0.5.0-py3-none-any.whl size=5006 sha256=7548de8e3a3364ef77e70cdc14c55472ed2a3e86963d804fed7e62b3aee009f5
  Stored in directory: /root/.cache/pip/wheels/68/2f/bc/76f1192d474666d41ae6f09813fccbd00fe3f07e8261c4cff5
Successfully built timeout-decorator
Installing collected packages: timeout-decorator, Pebble
Successfully installed Pebble-5.0.6 timeout-decorator-0.5.0
--2024-03-20 14:15:05--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A2/python_executor.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199

In [18]:
class GSM8KPoTEvaluator(Evaluator):
    def load_data(self, evalset_file="gsm8k") -> List[Dict[str, Any]]:
        dataset = load_dataset(evalset_file, 'main')
        test_split = dataset["test"]
        initial_100 = test_split[:100]
        self.test_dataset = initial_100
        return initial_100


    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        shared_prompts = ""
        for index in range(n_shot):
          prompt_string = '\n question:{0}\n # solution in Python:\n {1}'.format\
           (demos[index]["question"],demos[index]["pot_answer"])
          shared_prompts += prompt_string
          shared_prompts += '\n'
        # 应该还需要每个prompts加一下dataset的提问，需要先看dataset的格式，再把question加上
        data_prompt = []
        for index in range(len(dataset['question'])):
        # BUG! 注意dataset的格式！不是列表用len长度就不行！
          current_prompt=copy.deepcopy(shared_prompts)
          prompt_string=('\n question:{0}\n # solution in Python:\n {1}').format\
           (dataset["question"][index],"")
          current_prompt += prompt_string
          data_prompt.append(current_prompt)
        return data_prompt



    def postprocess_output(self, output: str, n_shot: int) -> str:

        #此函数只负责把生成token上限内的给拿出来，不负责提取答案
        # BUG 不论shot是几个，4、5、6个，只要使用postprocess，Janet整个答案都没有（question被删了？）

        slots = n_shot
        end_index = -1
        start_index = 0
        for i in range(slots+2):
          temp_answer = output[start_index:]
          index = temp_answer.find("question")# 就算没找到end-index和start_index也不会变！
          if index == -1:#找不到下一个
            end_index = len(output)
          else:
            end_index += index+1
            start_index += (index+1)
        return output[:end_index]



    def extract_code(self, initial_output):
        """
        从原始的prompts+生成的代码中->新生成的代码
        已经由postprocess处理过了，只需要找到最后一个def solution()，提取出来就好啦
        """
        start_index=initial_output.rfind("def solution()")
        generated_code = initial_output[start_index:]
        return generated_code


    def calculate_metrics(self, predictions):
      real_label = []
      for answer in self.test_dataset["answer"]:
        cur_index = answer.find("####")
        real_label.append(answer[cur_index+5:].strip())
      print("real label \n",real_label)
      print("predicted label \n",predictions)
      sum = 0
      for index in range(0,len(real_label)):
        if real_label[index]==predictions[index]:
          sum += 1
      print(sum/len(real_label))


    def evaluate(self, batch_size=1, save_dir="outputs", max_new_tokens=128, **kwargs):
        n_shot = kwargs["n_shot"]
        demos = kwargs["demos"]
        dataset = self.load_data()
        prompts = self.build_prompts(dataset, n_shot=n_shot, demos=demos)
        print(len(prompts))
        # BUG: need to predict all the examples in dataset, with batch size
        outputs = self.generate_completions(prompts, batch_size=batch_size, max_new_tokens=max_new_tokens, **kwargs)
        print(len(outputs))
        # kwargs解包

        predictions = []
        for i, (prompt, output) in enumerate(zip(prompts, outputs)):
            prediction = {
              "task_id": i,
              "prompt": prompt,
              # get a specific output of each data
              "generated_code": self.extract_code(self.postprocess_output(output, n_shot))
              # 在postprocess之后（只有prompts的生成）中提取询问生成的代码
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        return predictions



In [19]:
from python_executor import PythonExecutor
executor = PythonExecutor(get_answer_expr='solution()')

gsm8k_pot_evaluator = GSM8KPoTEvaluator(llm, evalset_file="gsm8k")
generated_dict = gsm8k_pot_evaluator.evaluate(n_shot=8, demos=GSM_EXAMPLARS)

codes = [item["generated_code"] for item in generated_dict]

predictions = []
runtime_errors = []
for pred, err in executor.batch_apply(codes):
    predictions.append(str(pred))
    runtime_errors.append(str(err['exec_info']).strip())

result = gsm8k_pot_evaluator.calculate_metrics(predictions)
print(result)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


100


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for o

100
real label 
 ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104', '109', '80', '35', '70', '23', '9', '75', '2', '10', '18', '8', '200', '26', '48', '20', '104', '163', '800', '8', '30', '294', '5', '15', '40', '40', '14', '3', '83', '57', '187', '17', '1430', '25000', '1596', '300', '36', '48', '595', '36', '60', '7425', '60', '221', '255', '88', '60', '5', '100', '6', '70', '10', '17', '623', '600', '15', '44', '22', '9360', '8000', '24', '225', '28', '4', '36', '348', '40', '3', '12', '5', '58']
predicted label 
 ['', '2.5', '30000', '180', '', '', '26', '', 'None', 'None', '', '', 'None', '', '', 'None', 'None', '', '12', '', 'None', 'None', '', '', '24.375', 'None', 'None', 'None', 'None', 'None', '', '', '5.0', '50', '', '', 'None', '', '', 'None', 'None', '', 'None', '', '', '', '', '', '0.6666666666666666', 'None', '', '', 'None', '', '',

# Result

|        GSM8K            | (Hard Accuracy) |
|--------------------|-------|
| Direct Prompting   |    0.07   |
| Chain-of-Thought   |    0.11   |
| Program-of-Thought |    0.02   |

**Explanation**
1. **Chain-of-Thought**

    I find there isn't a general output pattern for me to extract the single value answer, either using regular expressions, adding more specific prompts, or just extracting from the initial output. 

2. **Program of thought**

    After scanning the .json output file, codeLlama might output the correct python code but only lack of "return" or just return "None". Given the code is generated by LLM, I couldn't cope with this circumstance but could only adjust the input prompts (Actually, it doen't work as well).

Moreover, although the (hard) accuracy is low, but when we considering the comparation between outputs model generated and the golen label, such as (15-9=6)(model) vs. 6(golden), we will find the potential correctness of codeLlama lurking in the outputs.

Perhaps we need to take soft accuracy metrics into account to measure the model quality as the method mentioned in lecture.