# 1. Retrieval Augmented In-context Learning

- Please choose "Runtime Type" = GPU in Colab for running this notebook (Runtime > Change Runtime Type > T4 GPU).
- You are free to choose to use Google Colab or Kaggle to complete this notebook.

## 1.1 Contextual Embedding

In [2]:
# Step 0. Prepare the environment
!pip install InstructorEmbedding sentence-transformers datasets scikit-learn



In [3]:
!mkdir -p data/classification
!wget -O data/classification/train.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/train.txt
!wget -O data/classification/dev.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/dev.txt
!wget -O data/classification/test-blind.txt https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/test-blind.txt

--2024-04-09 16:20:40--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 738844 (722K) [text/plain]
Saving to: ‘data/classification/train.txt’


2024-04-09 16:20:40 (16.5 MB/s) - ‘data/classification/train.txt’ saved [738844/738844]

--2024-04-09 16:20:40--  https://raw.githubusercontent.com/ranpox/comp3361-spring2024/main/assignments/A1/data/classification/dev.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94400 (92

In [4]:
# Step 1. Declare the model & Example usage
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hkunlp/instructor-base")
embeddings = model.encode(
    [
        "Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity",
        "Comparison of Atmospheric Neutrino Flux Calculations at Low Energies",
        "Fermion Bags in the Massive Gross-Neveu Model",
        "QCD corrections to Associated t-tbar-H production at the Tevatron",
    ],
    prompt="Represent the Medicine sentence for clustering: ",
    show_progress_bar=True,
)

print(embeddings.shape)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

(4, 768)


In [5]:
# Step 2. Load data & Extract embeddings
# TODO: encode the training and dev setences
# It may takes about two hours for CPU, or you can switch to GPU runtime for faster inference.
def load_data(file):
    r"""Load your custom data for training or evaluation.

    Args:
        file (str): the path of your data

    Returns:
        tuple: a tuple containing two elements:
           - embedding: The embedding of your data, should in shape (N, 768)
          - label: List of labels, should in shape (N,)
    """
    with open(file) as f:
        data = [line.strip().split('\t') for line in f]
    sentences = [line[1] for line in data]
    X = model.encode(sentences, prompt="Represent the sentence for sentiment analysis: ", show_progress_bar=True)
    y = [int(line[0]) for line in data]
    return X, y

X_train, y_train = load_data('data/classification/train.txt') # training data
X_val, y_val = load_data('data/classification/dev.txt') # dev data

print(f"X_train shape: {X_train.shape}, y_train shape: {len(y_train)}")
print(f"X_val shape: {X_val.shape}, y_val shape: {len(y_val)}")

Batches:   0%|          | 0/217 [00:00<?, ?it/s]

Batches:   0%|          | 0/28 [00:00<?, ?it/s]

X_train shape: (6920, 768), y_train shape: 6920
X_val shape: (872, 768), y_val shape: 872


**TODO:**

 Train a logistic regression model

In [6]:
# Step 3: Train a logistic regression model
from sklearn.linear_model import LogisticRegression as LR
from sklearn.linear_model import LogisticRegressionCV as LRCV
from sklearn.metrics import classification_report, confusion_matrix


#emb_lrcvl1 = LRCV(penalty='l1',solver='saga', max_iter=2000)
emb_lrcvl1 = LR(penalty='l1', solver='saga', max_iter=2000)
emb_lrcvl1 = emb_lrcvl1.fit(X_train, y_train)


In [7]:
# Step 4: Evaluate the model
y_val_pred = emb_lrcvl1.predict(X_val)
print(classification_report(y_val, y_val_pred))

              precision    recall  f1-score   support

           0       0.90      0.89      0.89       428
           1       0.89      0.91      0.90       444

    accuracy                           0.90       872
   macro avg       0.90      0.90      0.90       872
weighted avg       0.90      0.90      0.90       872



**Comparasion:**


| Feature | precision | recall | F1-score |
| ----------- | --------- | ------ | -------- |
| GloVe  (load:'glove-wiki-gigaword-200')     |    0.798    |     0.798     |    0.798    |
|hkunlp/instructor-base| 0.90 | 0.90 | 0.90|

## 1.2 Retrieve Relevant Examples

In [8]:
import re
import json
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

def load_train_data():
    """Loads the GSM8k train dataset.

    Returns:
        A list of dictionaries containing questions, cot answers, and short answers.
    """
    ds = load_dataset("gsm8k", "main", split="train")
    examples = [{"question": example["question"], "answer": example["answer"]} for example in ds]
    for example in examples:
        example["short_answer"] = re.sub(r"(\d),(\d)", r"\1\2", example["answer"].split("####")[1].strip())
        example["cot_answer"] = re.sub(r"\<\<.*?\>\>", "", example["answer"].split("####")[0].strip()) \
            + " So the answer is " + example["short_answer"] + "."
    return examples

def load_test_data():
    """Loads the first 30 examples of the GSM8k test dataset.

    Returns:
        A list of dictionaries containing questions and answers.
    """
    ds = load_dataset("gsm8k", "main", split="test")
    examples = [{"question": example["question"], "answer": example["answer"].split("####")[1].strip()} for example in ds]
    for example in examples:
        example["answer"] = re.sub(r"(\d),(\d)", r"\1\2", example["answer"])
    return examples[:30]

In [9]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def new_load(initial_data):
    data = [item["question"] for item in initial_data]
    embedded_only = model.encode(data, prompt="Represent the sentence for sentiment analysis: ", show_progress_bar=True)
    return embedded_only

def retrieve_examples():
    """Retrieve top-20 in-context examples from GSM8K training set for each testing examples.

    Returns:
        A dictionary mapping testing questions to a list of top-20 training examples.
    """
    train = load_train_data() # 全部gsm8k_train中的数据，每个pairs里有: question\answer\short_answer\cot_answer
    encode_train_only = new_load(train)
    several_test = load_test_data() # 取gsm8k_test前30个data，每个pairs里有: question\answer
    encode_test_only = new_load(several_test)
    cosine_simi = cosine_similarity(encode_test_only, encode_train_only)
    cos_with_index = [list(enumerate(item)) for item in cosine_simi]
    for index in range(0,len(cos_with_index)):
      cos_with_index[index].sort(key=lambda x: x[1], reverse=True)
    truncated_cos_index = [item[:20] for item in cos_with_index]
    needed_prompt = []
    for index in range(0,len(truncated_cos_index)):
      valid_index_list = [item[0] for item in truncated_cos_index[index]]
      current_prompts = [train[index] for index in valid_index_list]
      needed_prompt.append(current_prompts)
      # 放出来的是原始的question & answer & short_answer & cot_answer
    return needed_prompt




In [10]:
RETRIVED_EXAMPLES = retrieve_examples()

Batches:   0%|          | 0/234 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Note: The following retrieval augmented generation does not require a GPU. Please consider saving and downloading the examples you retrieve from the left file tab so that you will not be hindered by Colab GPU restrictions.

In [11]:
with open("retrieved_examples.json", "w") as fout:
    json.dump(RETRIVED_EXAMPLES, fout)

with open("retrieved_examples.json", "r") as fin:
    RETRIVED_EXAMPLES = json.load(fin)

## 1.3 Generation with Huggingface Inference API

We will use LLM by querying huggingface inference api so we do not need GPU for the following code. Please generate HF_TOKEN at [hf.co/settings/tokens](hf.co/settings/tokens) and set as environment varible.

In [12]:
!pip install backoff
!pip install evaluate



In [13]:
from abc import ABC, abstractmethod
from typing import List, Dict, Any
import os
import json
import backoff
import evaluate
import re
import time
import requests

# access_token = "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."
# model = AutoModel.from_pretrained("private/model", token=access_token)


os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Account 1: .gmail
#os.environ["HF_TOKEN"] = "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"
# Account 2: .nju
# os.environ["HF_TOKEN"] = "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"
# Account 3: .163
os.environ["HF_TOKEN"] = "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"
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

class LLM(object):
    def __init__(self, model_name="codellama/CodeLlama-7b-hf"):
        self.model_name = model_name
        self.api_url = f"https://api-inference.huggingface.co/models/{model_name}"
        self.headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

    @backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_time=60)
    def generate(self, prompts: List[str], **kwargs) -> List[str]:
        outputs = []
        def query(payload):
            response = requests.post(self.api_url, headers=self.headers, json=payload)
            if response.status_code != 200:
                raise ValueError(f"Request failed with code {response.status_code}, {response.text}")
            return response.json()
        for prompt in prompts:
            data = query(
                {
                    "inputs": prompt,
                    "parameters": { "max_new_tokens": 256, "stop": ["Question:"]},
                }
            )
            outputs.append(data[0]['generated_text'])
        return outputs

In [14]:
llm = LLM("codellama/CodeLlama-7b-hf")
print(llm.generate(["Explain the importance of low latency LLMs.","What is the capital of France?"]))

['Explain the importance of low latency LLMs.\n\n### 1.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2', 'What is the capital of France?\nA: Paris\nB: Lyon\nC: Marseille\nD: Bordeaux\nE: Toulouse\nF: Strasbourg\nG: Nice\nH: Lille\nI: Rouen\nJ: Nantes\nK: Brest\nL: Metz\nM: Dijon\nN: Reims\nO: Lyon\nP: Bordeaux\nQ: Toulouse\nR: Nantes\nS: Strasbourg\nT: Lille\nU: Rouen\nV: Metz\nW: Dijon\nX: Reims\nY: Nantes\nZ: Brest\n\nWhat is the capital of France?\nA: Paris\nB: Lyon\nC: Marseille\nD: Bordeaux\nE: Toulouse\nF: Strasbourg\nG: Nice\nH: Lille\nI: Rouen\nJ: Nantes\nK: Brest\nL: Metz\nM: Dijon\nN: Reims\nO: Lyon\nP: Bordeaux\nQ: Toulouse\nR: Nantes\nS: Strasbourg\nT: Lille\nU: Rouen\nV: Metz\nW: D']


**TODO:**

Please adapt your GSM8KCoTEvaluator for this API-based LLM. And report the performance of 8-shot chain-of-thought prompting on first 30 examples of GSM8K.

In [15]:
!pip install nums_from_string



In [16]:
class Evaluator(ABC):
    def __init__(self, llm):
        self.llm = llm

    @abstractmethod
    def load_data(self):
        pass

    @abstractmethod
    def build_prompts(self):
        pass

    @abstractmethod
    def postprocess_output(self, output: str) -> str:
        pass

    @abstractmethod
    def calculate_metrics(self):
        pass

    def generate_completions(self, prompts: List[str], **kwargs) -> List[str]:
        outputs = llm.generate(prompts, **kwargs)
        return outputs

    def evaluate(self, batch_size=4, n_shot=8, save_dir="outputs", pro_num=8):
        dataset = self.load_data()
        prompts = self.build_prompts(dataset, n_shot)
        outputs = self.generate_completions(prompts, batch_size=batch_size)

        predictions = []
        for i, (example, prompt, output) in enumerate(zip(dataset, prompts, outputs)):
            prediction = {
                "task_id": example.get("task_id", f"task_{i}"),
                "prompt": prompt,
                "completion": output,
                "prediction": self.postprocess_output(output, n_shot),
            }
            predictions.append(prediction)

        # Save predictions to file
        os.makedirs(save_dir, exist_ok=True)
        prediction_save_path = os.path.join(save_dir, f"{type(self).__name__}_{pro_num}_predictions.jsonl")
        with open(prediction_save_path, "w") as fout:
            for pred in predictions:
                fout.write(json.dumps(pred) + "\n")

        # Calculate metrics and print results
        print("Few-shot examples: ",pro_num)
        self.calculate_metrics(predictions, dataset)
        #print(f"Results for {type(self).__name__}: {results}")



GSM_EXAMPLARS = [
    {
        "question": "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
        "cot_answer": "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. So the answer is 6.",
        "pot_answer": "def solution():\n    \"\"\"There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\"\"\"\n    trees_initial = 15\n    trees_after = 21\n    trees_added = trees_after - trees_initial\n    result = trees_added\n    return result",
        "short_answer": "6"
    },
    {
        "question": "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "cot_answer": "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. So the answer is 5.",
        "pot_answer": "def solution():\n    \"\"\"If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\"\"\"\n    cars_initial = 3\n    cars_arrived = 2\n    total_cars = cars_initial + cars_arrived\n    result = total_cars\n    return result",
        "short_answer": "5"
    },
    {
        "question": "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
        "cot_answer": "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. So the answer is 39.",
        "pot_answer": "def solution():\n    \"\"\"Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\"\"\"\n    leah_chocolates = 32\n    sister_chocolates = 42\n    total_chocolates = leah_chocolates + sister_chocolates\n    chocolates_eaten = 35\n    chocolates_left = total_chocolates - chocolates_eaten\n    result = chocolates_left\n    return result",
        "short_answer": "39"
    },
    {
        "question": "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?",
        "cot_answer": "Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\"\"\"\n    jason_lollipops_initial = 20\n    jason_lollipops_after = 12\n    denny_lollipops = jason_lollipops_initial - jason_lollipops_after\n    result = denny_lollipops\n    return result",
        "short_answer": "8"
    },
    {
        "question": "Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?",
        "cot_answer": "Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. So the answer is 9.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "9"
    },
    {
        "question": "There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?",
        "cot_answer": "There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. So the answer is 29.",
        "pot_answer": "def solution():\n    \"\"\"Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\"\"\"\n    toys_initial = 5\n    mom_toys = 2\n    dad_toys = 2\n    total_received = mom_toys + dad_toys\n    total_toys = toys_initial + total_received\n    result = total_toys\n    return result",
        "short_answer": "29"
    },
    {
        "question": "Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?",
        "cot_answer": "Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. So the answer is 33.",
        "pot_answer": "def solution():\n    \"\"\"Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\"\"\"\n    golf_balls_initial = 58\n    golf_balls_lost_tuesday = 23\n    golf_balls_lost_wednesday = 2\n    golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday\n    result = golf_balls_left\n    return result",
        "short_answer": "33"
    },
    {
        "question": "Olivia has $23. She bought five bagels for $3 each. How much money does she have left?",
        "cot_answer": "Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. So the answer is 8.",
        "pot_answer": "def solution():\n    \"\"\"Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\"\"\"\n    money_initial = 23\n    bagels = 5\n    bagel_cost = 3\n    money_spent = bagels * bagel_cost\n    money_left = money_initial - money_spent\n    result = money_left\n    return result",
        "short_answer": "8"
    }
]

In [17]:
from sklearn.metrics import classification_report

class GSM8KEvaluator(Evaluator):
    def load_data(self):
        ds = load_dataset("gsm8k", "main", split="test")
        examples = [{"question": example["question"], "answer": example["answer"].split("####")[1].strip()} for example in ds]
        for example in examples:
            example["answer"] = re.sub(r"(\d),(\d)", r"\1\2", example["answer"])
        return examples[:30]

    def postprocess_output(self, output: str, n_shot: int) -> str:
        """
        因为有生成token的上限，所以要把prompt+真正生成的answer给拿出来
        """
        slots = n_shot
        end_index = -1
        start_index = 0
        for i in range(slots+3):
          temp_answer = output[start_index:]
          index = temp_answer.find("question")# 就算没找到end-index和start_index也不会变！
          end_index += index+1
          start_index += (index+1)
        return output[:end_index]

    def calculate_metrics(self, predictions, dataset):
        real_label = [item["answer"] for item in dataset]
        pre_label = []
        for answer in predictions:
          answer = str(answer)
          cur_index = answer.rfind("chain_of_thought_answer")
          cot_answer = answer[cur_index+24:].strip()[:-5]
          index = cot_answer.find("So the answer is ")
          current_answer = cot_answer[index+17:-1]
          if len(nums_from_string.get_nums(current_answer)) == 0:
            pre_label.append("-1")
          else:
            pre_label.append(nums_from_string.get_nums(current_answer)[-1])
        print(f"Results for {type(self).__name__}")
        sum=0
        for index in range(0,len(predictions)):
          if int(real_label[index])==int(pre_label[index]):
            sum += 1
        print("real answers are:",real_label)
        print("predicted answers are:",pre_label)
        print("Acurracy of COT: {0}".format(sum/len(predictions)))
        print("\n")

In [18]:
import copy
import nums_from_string

class GSM8KCoTEvaluator(GSM8KEvaluator):
    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        shared_prompts = ""
        string = "Answering the following questions"
        shared_prompts += string
        for index in range(n_shot):
          prompt_string = '\n question:{0}\n chain_of_thought_answer:{1}'.format\
           (demos[index]["question"],demos[index]["cot_answer"])
          shared_prompts += prompt_string
        # 应该还需要每个prompts加一下dataset的提问，需要先看dataset的格式，再把question加上
        data_prompt = []
        for index in range(len(dataset)):
        # BUG! 注意dataset的格式！不是列表用len长度就不行！
          current_prompt=copy.deepcopy(shared_prompts)
          prompt_string=('\n question:{0}\n chain_of_thought_answer:{1}').format\
           (dataset[index]["question"],"")
          current_prompt += prompt_string
          data_prompt.append(current_prompt)
        return data_prompt

In [38]:
cot_evaluator = GSM8KCoTEvaluator(llm)
cot_evaluator.evaluate(n_shot=8, pro_num=8)

Few-shot examples:  8
Results for GSM8KCoTEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [80, 8, 30000, 540, 63, 8, 8, 192, 8, 520, 274, 642, 7, 1, 60, 2790, 230, 42500, 72, 6, 8, 25, 8, 8, 78, 4, 224.5, 1800.0, 25, 30]
Acurracy of COT: 0.2




## 1.4 Impact of Quantity on Few-shot Prompting

In [20]:
cot_evaluator = GSM8KCoTEvaluator(llm)
cot_evaluator.evaluate(n_shot=1, pro_num=1)
cot_evaluator.evaluate(n_shot=2, pro_num=2)
cot_evaluator.evaluate(n_shot=4, pro_num=4)

Few-shot examples:  1
Results for GSM8KCoTEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [56, 6, 32500, 180, 60, 6, 8, 40, 6, 160, 30, 590, 80, 1, 6, 2987, 230, 3300, 84, 6, 24, 25, 2, 10, 24.5, '-1', 6, 240.0, 15, 33]
Acurracy of COT: 0.06666666666666667


Few-shot examples:  2
Results for GSM8KCoTEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [2, 3, 32500, 180, 13, 48, 4, 40, 355, 45, 120, 5, 3, 1, 8, 2962, 230, 5750, 12, 4, 5, 25, 1, 6, 19.5, 5.5, 5, 240.0, 35, 66]
Acurracy of COT: 0.06666666666666667


Few-shot examples:  4
Results for GSM8KCoTEvaluator
real answers are: ['18', '3', '70000', '540', 

## 1.5 Retrieval Augmented Few-shot Prompting

In [21]:
import copy

class GSM8KRetrievalICLEvaluator(GSM8KEvaluator):
    def __init__(self, llm, reverse=False):
        self.llm = llm
        self.reverse = reverse
    def build_prompts(self, dataset, n_shot=8, demos=GSM_EXAMPLARS):
        """Build prompts with RETRIVED_EXAMPLES we generated in 1.2.
        """
        with open("retrieved_examples.json", "r") as fin:
          RETRIVED_EXAMPLES = json.load(fin)
          demos = RETRIVED_EXAMPLES
          shared_prompts = "Answering the following questions"
          data_prompt = []
          for index_item in range(0,len(dataset)):
            current_prompt = copy.deepcopy(shared_prompts)
            # prompts
            if self.reverse == False: # descending order among top-n_shot setting of relevance
              for prompt_index in range(0,n_shot):
                prompt_string = '\n question:{0}\n chain_of_thought_answer:{1}'.format\
                  (demos[index_item][prompt_index]["question"],demos[index_item][prompt_index]["cot_answer"])
                current_prompt += prompt_string
            elif self.reverse == True: # ascending order among top-n_shot setting of relevance
              for prompt_index in range(n_shot-1,-1,-1):
                prompt_string = '\n question:{0}\n chain_of_thought_answer:{1}'.format\
                  (demos[index_item][prompt_index]["question"],demos[index_item][prompt_index]["cot_answer"])
                current_prompt += prompt_string
            # real_question
            qu_string=('\n question:{0}\n chain_of_thought_answer:{1}').format\
                (dataset[index_item]["question"],"")
            current_prompt += qu_string
            data_prompt.append(current_prompt)
          return data_prompt

In [None]:
retrieval_icl_evaluator = GSM8KRetrievalICLEvaluator(llm,reverse=False)
retrieval_icl_evaluator.evaluate(n_shot=1, pro_num=1)
retrieval_icl_evaluator.evaluate(n_shot=2, pro_num=2)
retrieval_icl_evaluator.evaluate(n_shot=4, pro_num=4)
retrieval_icl_evaluator.evaluate(n_shot=8, pro_num=8)
retrieval_icl_evaluator.evaluate(n_shot=16, pro_num=16)

Few-shot examples:  1
Results for GSM8KRetrievalICLEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [2304, 3, 50000, '-1', 60, 480, 140, 100, 8, 520, 93, 589, 675000, 75, 100, 29, 230, 1150, 84, 3, 15, 28, 7, 8, 78, 17, 239.5, 180.0, 35, 172]
Acurracy of COT: 0.16666666666666666


Few-shot examples:  2
Results for GSM8KRetrievalICLEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [20, 3, 55000, 180, 60, 40, 100, 13, 420, 412, 366, 666, 840, 1, 168, 3, 230, 3000, 84, 6, 9, 9, 21, 8, 75, 0.37, 238.5, 240.0, 35, 66]
Acurracy of COT: 0.16666666666666666


Few-shot examples:  4
Results for GSM8KRetrievalICLEvalua

In [22]:
# Rearrange the examples in ascending order of relevance for the top 8 settings.
retrieval_icl_evaluator = GSM8KRetrievalICLEvaluator(llm,reverse=True)
retrieval_icl_evaluator.evaluate(n_shot=1, pro_num=1)
retrieval_icl_evaluator.evaluate(n_shot=2, pro_num=2)
retrieval_icl_evaluator.evaluate(n_shot=4, pro_num=4)
retrieval_icl_evaluator.evaluate(n_shot=8, pro_num=8)
retrieval_icl_evaluator.evaluate(n_shot=16, pro_num=16)



Few-shot examples:  1
Results for GSM8KRetrievalICLEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [2304, 3, 50000, '-1', 60, 480, 140, 100, 8, 520, 93, 589, 675000, 75, 100, 29, 230, 1150, 84, 3, 15, 28, 7, 8, 78, 17, 239.5, 180.0, 35, 172]
Acurracy of COT: 0.16666666666666666


Few-shot examples:  2
Results for GSM8KRetrievalICLEvaluator
real answers are: ['18', '3', '70000', '540', '20', '64', '260', '160', '45', '460', '366', '694', '13', '18', '60', '125', '230', '57500', '7', '6', '15', '14', '7', '8', '26', '2', '243', '16', '25', '104']
predicted answers are: [280, 3, 95000, 180, 40, 50, 520, 15, 6.25, 520, 452, 619, 1.17, 4.25, 20, 2308, 230, 3300, 84, 1, 24, 31, 5, 32, 78, 1, 240.5, 240.0, 40, 66]
Acurracy of COT: 0.06666666666666667


Few-shot examples:  4
Results for GSM8KRetrievalIC

## Discussion

- **Q1.1:** Regarding the contextual embedding in Section 1.1, how does sentiment classification performance compare to your word2vec results in A1? Discuss potential reasons.


- **A1.1**:
  
  There are three word2vec methods in A1: unigram, bigram, Glove, with respectively classification accuracy 0.69, 0.71, 0.78 in my implementation.
  In Assignment 3, after being aided with more powerful pre-trained model, instructor-base, the accuracy of the same task is 0.90.
  
  There are several potential reasons to explain the comparasion.
  1. Both three previous methods are based on "Shallow Window-Based Methods", which means they just takes words' concurrence into account. Thus, the performance is very likely determined by the discrepency between training data and testing data, and the model doesn't learn the "real" semantic outlying the sentence, only grasping the statistics patterns.
  2. Pretrained model, instructor-base, modeled more relatively accurate senmatic of each word by analysizing the contextual relationship in pretrained corpus. Hence, the model fits a more genral scenrio, ending up to a better performance.

------
------
------








- **Q1.2**: In Section 1.4, which discusses the impact of quantity, what trends do you notice when adding contextual examples? Do you think this trend will continue?

- **A1.2:**
  1. Trend:
  
    When we add more in-context examples in prompt, the final result is improving dramatically. For example, few-shot setting with 1, 2, 4, 8 examples, getting accuracy value 0.067, 0.067, 0.133, 0.200 respectvely.
  
  2. Yes, I think the trend will continue in some extent. Adding in-context examples can be regarded as a compact process with fine-tuning. If the result doesn't reach the upper capacity of model, armed prompts with more examples in an appropriate order could improve the prediction.

------
-----
------


- **Q1.3:** In Section 1.5, which covers retrieval-augmented in-context learning, how does this differ from Section 1.4? Analyze the reasons.

- **A1.3:**

1. Derivation of prompts differs:

  In Section 1.4, we use pre-defined. In Section 1.5, the prompts we use is obtained from the descending query similarity list.

2. Few-shot prompt differs for each query:

  In Section 1.4, All queries share a same few-shot examples. But in Section 1.5, we grasped distinctive examples based on descending similarity order for each query, which means the prompt is more pinpointed and fits each query.

  




------
-----
-----

- **Q1.4:** In Section 1.5, which arrangement yields better performance: in-context examples organized in descending or ascending order of relevance? Discuss the scenario.

- **A1.4:**

  Comparing {1,2,4,8} in-context examples, descending order of relevance could yield a better performance in general. There are several resons to explain this result.
  
  1. Pretrained model use Transformer architecture, which means input order plays a significant role in self-attention and positional embedding. Given to the unknown positional embedding method using in this model, regular positional embedding tends to focus on the first few information lying in the input sentence, such as static funtction positional embedding.   

  2. There isn't long distance dependence problem in Transformer, thanks to the parallelable machanism self-attention. Thus, positional embedding could be the only way to grape position information.

  3. Similarity descending sorting ensures that the model starts learning from the most relevant retrieval results, thus capturing key information more quickly. In addition, this sorting method helps reduce noise interference and improve the model's robustness, as the model can more easily ignore irrelevant or weaker retrieval results.


# 2. Basic Decoding Algorithms

In [23]:
!pip install transformers
!pip install datasets == 2.17.1
# dataset version is 2.17,1
!pip install evaluate

[31mERROR: Invalid requirement: '=='[0m[31m


In [24]:
"""set device and random seeds"""

######################################################
#  The following helper functions are given to you.
######################################################

from tqdm.notebook import tqdm
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'device: {device}')

# 设置每次运行的随机数都是一样的，初始化参数相同，唯一影响的只有代码了
def set_seed(seed=19260817):
    torch.manual_seed(seed) # cpu seed
    torch.cuda.manual_seed_all(seed) # gpu seed
    torch.backends.cudnn.deterministic = True # deterministic algorithm -- same result when we run in serveral times
    torch.backends.cudnn.benchmark = False # same convolutional algorithm -- forbid to search for the optimum cnn algorithm
    # [gpt2 use conv layer]

set_seed()

device: cuda


In [25]:
"""load datasets"""

######################################################
#  The following helper code is given to you.
######################################################

"""
Dataset： https://huggingface.co/datasets/Ximing/ROCStories
Fluency test classification: https://huggingface.co/cointegrated/roberta-large-cola-krishna2020
Naturalness(Perplexity): https://huggingface.co/spaces/evaluate-metric/perplexity  [hugging face function to compute perplexity]
"""

from datasets import load_dataset

dataset = load_dataset('Ximing/ROCStories')
train_data, dev_data, test_data = dataset['train'], dataset['validation'], dataset['test']

print(train_data[0])

{'story_id': '080198fc-d0e7-42b3-8e63-b2144e59d816', 'prompt': 'On my way to work I stopped to get some coffee.', 'continuation': 'I went through the drive through and placed my order. I paid the cashier and patiently waited for my drink. When she handed me the drink, the lid came off and spilled on me. The coffee hurt and I had to go home and change clothes.', 'constraint_words': ['drive', 'order', 'drink', 'lid', 'coffee', 'hurt', 'home', 'change', 'clothes']}


In [26]:
"""prepare evaluation"""

######################################################
#  The following helper code is given to you.
######################################################

from evaluate import load
from transformers import RobertaForSequenceClassification, RobertaTokenizer

perplexity_scorer = load("perplexity", module_type="metric")
cola_model_name = "textattack/roberta-base-CoLA"
cola_tokenizer = RobertaTokenizer.from_pretrained(cola_model_name)
cola_model = RobertaForSequenceClassification.from_pretrained(cola_model_name).to(device)

#【好】生成batch的方法！
def batchify(data, batch_size):
  # 生成器函数,本质是调用了next函数，存有一个指针
  # 如果batch满了，就返回。next指针指向下一个，继续生成batch，直到满了或者结束为止！
    assert batch_size > 0

    batch = []
    for item in data:
        # Yield next batch
        if len(batch) == batch_size:
            yield batch
            batch = []
        batch.append(item)

    # Yield last un-filled batch
    if len(batch) != 0:
        yield batch

Some weights of the model checkpoint at textattack/roberta-base-CoLA were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [27]:
"""set up evaluation metric"""

######################################################
#  The following helper code is given to you.
######################################################

# hugging face metrics -- computing perplexity
def compute_perplexity(texts, model='gpt2', batch_size=8):
    score = perplexity_scorer.compute(predictions=texts, add_start_token=True, batch_size=batch_size, model_id=model)
    return score['mean_perplexity']

# pre-trained model CoLA to evaluate fluency
def compute_fluency(texts, batch_size=8):
  scores = []
  for b_texts in batchify(texts, batch_size): # 此处只是为了计算batch个数遍
    inputs = cola_tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
      logits = cola_model(**inputs).logits
      probs = logits.softmax(dim=-1)
      scores.extend(probs[:, 1].tolist())
  return sum(scores) / len(scores)

# N-gram -- based on syntax
def compute_diversity(texts):
    unigrams, bigrams, trigrams = [], [], []
    total_words = 0
    for gen in texts:
        o = gen.split(' ')
        total_words += len(o)
        for i in range(len(o)):
            unigrams.append(o[i])
        for i in range(len(o) - 1):
            bigrams.append(o[i] + '_' + o[i + 1])
        for i in range(len(o) - 2):
            trigrams.append(o[i] + '_' + o[i + 1] + '_' + o[i + 2])
    return len(set(unigrams)) / len(unigrams), len(set(bigrams)) / len(bigrams), len(set(trigrams)) / len(trigrams)


def evaluate(generations, experiment):
  generations = [_ for _ in generations if _ != '']
  perplexity = compute_perplexity(generations)
  fluency = compute_fluency(generations)
  diversity = compute_diversity(generations)
  print(experiment)
  print(f'perplexity = {perplexity:.2f}')
  print(f'fluency = {fluency:.2f}')
  print(f'diversity = {diversity[0]:.2f}, {diversity[1]:.2f}, {diversity[2]:.2f}')
  print()

debug_sents = ["This restaurant is awesome", "My dog is cute and I love it.", "Today is sunny."]
evaluate(debug_sents, 'debugging run')

  0%|          | 0/1 [00:00<?, ?it/s]

debugging run
perplexity = 178.64
fluency = 0.98
diversity = 0.87, 1.00, 1.00



In [28]:
"""load model and tokenizer"""

######################################################
#  The following helper code is given to you.
######################################################

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name, pad_token="<|endoftext|>") # masking: pad-masking
model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
model.eval() # print its parameters

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

TODO:

In this section, you will implement a few basic decoding algorithms:
1. Greedy decoding
2. Vanilla sampling
3. Temperature sampling
5. Top-p sampling

We have provided a wrapper function `decode()` that takes care of batching, controlling max length, and handling the EOS token.
You will be asked to implement the core function of each method: *given the pre-softmax logits of the next token, decide what the next token is.*

**The wrapper calls the core function of each decoding algorithm, which you will implement in the subsections below.**

In [29]:
"""decode main wrapper function"""

######################################################
#  The following helper code is given to you.
######################################################

def decode(prompts, max_len, method, **kwargs):
  encodings_dict = tokenizer(prompts, return_tensors="pt", padding=True)
  input_ids = encodings_dict['input_ids'].to(device)
  attention_mask = encodings_dict['attention_mask'].to(device)
  # 获取当下模型的batch大小 & attention_mask的矩阵(0、1矩阵) --- 信息获取的好方法！
  model_kwargs = {'attention_mask': attention_mask}
  batch_size, input_seq_len = input_ids.shape

  unfinished_sequences = torch.ones(batch_size, dtype=torch.long, device=device)

  for step in range(max_len): # 最多生成max_len个概率分布
    model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)
    with torch.no_grad():# 在预测或生成答案时，可以把no_grad加上，不让这个步骤加到训练过程中去
      outputs = model(**model_inputs, return_dict=True, output_attentions=False, output_hidden_states=False)
      # print(outputs.logits.shape) # 10*12*50257
      # 10: batch_size, 12: input sequence length, 50257: vocabulary size
      # shape of outputs: (B, input_sequence_length, V)
    if step == 0:
      last_non_masked_idx = torch.sum(attention_mask, dim=1) - 1
      # input sequence length = 12, attention_mask = E, last_non_masked_idx = 11
      # 不同模型attention_mask不一样，按照上三角mask的形式来看。
      # 理应获得的分布是下三角为input，生成的一个distribution
      # 对应的下标就是上面所计算的！
      next_token_logits = outputs.logits[range(batch_size), last_non_masked_idx, :]
    else:
      next_token_logits = outputs.logits[:, -1, :]
      """
      第二维的每个元素表示输入数据中每个样本的 token 数量。
      例如，如果一个样本有12个 token，那么第二维的对应元素的值就是12。
      这个值没有特定的物理意义，只是用来记录输入数据中每个样本的长度。
      在模型的输出中，第二维的长度与输入数据中每个样本的长度相匹配，这样可以确保模型的输出与每个输入的 token 对应。
      例如，对于长度为12的样本，模型的输出中的第二维度也是12，这样可以将模型的输出与输入的 token 对应起来。
      【只是为了保持输入和输出长度一样，有一一对应关系】
      """

    log_prob = F.log_softmax(next_token_logits, dim=-1)

    if method == 'greedy':
      next_tokens = greedy(next_token_logits)
    elif method == 'sample':
      next_tokens = sample(next_token_logits)
    elif method == 'temperature':
      next_tokens = temperature(next_token_logits, t=kwargs.get('t', 0.8))
      # 取关键字为t的值，如果不存在就让他等于0.8
    # elif method == 'topk':
      # next_tokens = topk(next_token_logits, k=kwargs.get('k', 20))
    elif method == 'topp':
      next_tokens = topp(next_token_logits, p=kwargs.get('p', 0.7))

    # finished sentences should have their next token be a padding token
    next_tokens = next_tokens * unfinished_sequences + tokenizer.pad_token_id * (1 - unfinished_sequences)

    input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
    model_kwargs = model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=model.config.is_encoder_decoder)

    # if eos_token was found in one sentence, set sentence to finished
    unfinished_sequences = unfinished_sequences.mul((next_tokens != tokenizer.eos_token_id).long())

    if unfinished_sequences.max() == 0:
      break

  # decode: ids -> real token
  response_ids = input_ids[:, input_seq_len:]
  response_text = [tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True) for output in response_ids]

  return response_text

In [30]:
"""debug helper code"""

######################################################
#  The following helper code is given to you.
######################################################

# For debugging, we duplicate a single prompt 10 times so that we obtain 10 generations for the same prompt
dev_prompts = [dev_data[0]['prompt']] * 10

def print_generations(prompts, generations):
  for prompt, generation in zip(prompts, generations):
    print(f'{[prompt]} ==> {[generation]}')

## 2.1 Greedy Decoding

In [9]:
def greedy(next_token_logits):
  '''
  inputs:
  - next_token_logits: Tensor(size = (B, V), dtype = float)
  outputs:
  - next_tokens: Tensor(size = (B), dtype = long)
  '''
  # TODO: compute `next_tokens` from `next_token_logits`.
  # Hint: use torch.argmax()
  next_tokens = torch.argmax(next_token_logits, dim=1)
  return next_tokens

generations = decode(dev_prompts, max_len=20, method='greedy')
print_generations(dev_prompts, generations)

['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m going to go to work tomorrow,\'" he said.']
['Ryan was called by his friend to skip work one day.'] ==> ['\n\n"I was like, \'I\'m goin

## 2.2 Vanilla Sampling and Temperature Sampling

In [33]:
import numpy as np

def sample(next_token_logits):
  '''
  inputs:
  - next_token_logits: Tensor(size = (B, V), dtype = float)
  outputs:
  - next_tokens: Tensor(size = (B), dtype = long)
  '''

  # TODO: compute the probabilities `probs` from the logits.
  # Hint: `probs` should have size (B, V)
  probs = F.softmax(torch.exp(next_token_logits),dim=1)

  # TODO: compute `next_tokens` from `probs`.
  # Hint: use torch.multinomial()
  next_tokens = torch.multinomial(probs,1,replacement=False, generator=None, out=None).squeeze()

  return next_tokens


set_seed()
generations = decode(dev_prompts, max_len=20, method='sample')
print_generations(dev_prompts, generations)

['Ryan was called by his friend to skip work one day.'] ==> ['/?Anth listened hurtRN Woman Buk Stephanie★426 clutching complexity kickedauldron portal parted Graphic converteresque Peg']
['Ryan was called by his friend to skip work one day.'] ==> [' ignoredlude worthlessldavassiumPremium798 vaccination NSA dating dryingwolf RageWithNo Advisalyses theoreticalEY predictable']
['Ryan was called by his friend to skip work one day.'] ==> ['ifestyle Animation Barrel DEC Lal Aeremade671 prominently Ric scripts Smoking glandTe askingoption Princeutsche electors Lam']
['Ryan was called by his friend to skip work one day.'] ==> ['alm adjud549\\\\\\\\Ro airstrike Tanks fructose prioritize Nelson spends�Kim corridorsTrackVIEW Evan spending movies Metallic']
['Ryan was called by his friend to skip work one day.'] ==> [' ecobecauseLanguage rupture Miliband████ Lisbon penslaughs iP unnatural pilgrims extravagant max Pric nativeaunder enlisted shapeuddenly']
['Ryan was called by his friend to skip wor

In [37]:
def temperature(next_token_logits, t):
  '''
  inputs:
  - next_token_logits: Tensor(size = (B, V), dtype = float)
  - t: float
  outputs:
  - next_tokens: Tensor(size = (B), dtype = long)
  '''

  # TODO: compute the probabilities `probs` from the logits, with temperature applied.
  probs = F.softmax(torch.exp(next_token_logits/t),dim=1)

  # TODO: compute `next_tokens` from `probs`.
  next_tokens = torch.multinomial(probs, 1,replacement=False, generator=None, out=None).squeeze()

  return next_tokens



set_seed()
generations = decode(dev_prompts, max_len=20, method='temperature', t=0.8)
print_generations(dev_prompts, generations)
print("\n")


['Ryan was called by his friend to skip work one day.'] ==> ['/?Anth listened hurtRN Woman Buk Stephanie★426 clutching complexity kickedauldron portal parted Graphic converteresque Peg']
['Ryan was called by his friend to skip work one day.'] ==> [' ignoredlude worthlessldavassiumPremium798 vaccination NSA dating dryingwolf RageWithNo Advisalyses theoreticalEY predictable']
['Ryan was called by his friend to skip work one day.'] ==> ['ifestyle Animation Barrel DEC Lal Aeremade671 prominently Ric scripts Smoking glandTe askingoption Princeutsche electors Lam']
['Ryan was called by his friend to skip work one day.'] ==> ['alm adjud549\\\\\\\\Ro airstrike Tanks fructose prioritize Nelson spends�Kim corridorsTrackVIEW Evan spending movies Metallic']
['Ryan was called by his friend to skip work one day.'] ==> [' ecobecauseLanguage rupture Miliband████ Lisbon penslaughs iP unnatural pilgrims extravagant max Pric nativeaunder enlisted shapeuddenly']
['Ryan was called by his friend to skip wor

## 2.3 Top-p Sampling

In [19]:
def topp(next_token_logits, p):
  '''
  inputs:
  - next_token_logits: Tensor(size = (B, V), dtype = float)
  - p: float
  outputs:
  - next_tokens: Tensor(size = (B), dtype = long)
  '''

  # TODO: Sort the logits in descending order, and compute
  # the cumulative probabilities `cum_probs` on the sorted logits
  sorted_logits, sorted_indices = torch.sort(next_token_logits, dim=1)
  # sorted_logits就是NN的原始输出，和对数没有关系
  sorted_probs = F.softmax(sorted_logits, dim=1)
  cum_probs = torch.cumsum(sorted_probs, dim=1)

  # Create a mask to zero out all logits not in top-p
  sorted_indices_to_remove = cum_probs > p # 要的下标全是0，不要的全是1（概率和>p）
  sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
  sorted_indices_to_remove[:, 0] = 0 # 【好！】第一个位置概率设置成0，保证就算p=0，也可以做成greedy decoding！
  # Restore mask to original indices
  indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove)

  # Mask the logits
  next_token_logits[indices_to_remove] = float('-inf')

  # TODO: Sample from the masked logits
  probs = F.softmax(next_token_logits)
  next_tokens = torch.multinomial(probs,1,replacement=False, generator=None, out=None).squeeze()

  return next_tokens

p_list = [0.1, 0.3, 0.5, 0.7, 0.9]
for item in p_list:
  set_seed()
  print("current p is ",item)
  generations = decode(dev_prompts, max_len=20, method='topp', p=item)
  print_generations(dev_prompts, generations)
  print("\n")

current p is  0.1


  probs = F.softmax(next_token_logits)


['Ryan was called by his friend to skip work one day.'] ==> [' Turning stairs listened hurt throughout — rendering Stephanie Catherine county clutching news teams glad some paths behind feet suffered piercing']
['Ryan was called by his friend to skip work one day.'] ==> [' Bryant tracked behind fellow diver Merry Assref\u200bero dating Pittwolf Ragedq Von upper ham scheme constructed']
['Ryan was called by his friend to skip work one day.'] ==> [' Tennessee Yates unearthed conservative protection manuals far like prominently shot scripts Smoking MachineTeo era numerous Voting Licenses']
['Ryan was called by his friend to skip work one day.'] ==> [' Test stories Washington mercenary Fleiger Ghity murmured broadly from unfamiliar khukili holidays spending movies ever']
['Ryan was called by his friend to skip work one day.'] ==> [' Whole aides retrieved fewer trip drafts! Acting logical viewers whistled derision tributary bowl shape taps']
['Ryan was called by his friend to skip work one 

## 2.4: Evaluation

Run the following cell to obtain the evaluation results, which you should include in your writeup.
Also don't forget to answer the questions.

In [20]:
prompts = [item['prompt'] for item in test_data][:10]
GENERATIONS_PER_PROMPT = 10
MAX_LEN = 100

for experiment in ['greedy', 'sample', 'temperature', 'topp']:
  generations = []
  for prompt in tqdm(prompts):
    generations += decode([prompt] * GENERATIONS_PER_PROMPT, max_len=MAX_LEN, method=experiment)
  evaluate(generations, experiment)

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/13 [00:00<?, ?it/s]

greedy
perplexity = 2.08
fluency = 0.78
diversity = 0.01, 0.02, 0.03



  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/13 [00:00<?, ?it/s]

sample
perplexity = 250626.51
fluency = 0.13
diversity = 0.95, 1.00, 1.00



  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/13 [00:00<?, ?it/s]

temperature
perplexity = 235670.87
fluency = 0.13
diversity = 0.95, 1.00, 1.00



  0%|          | 0/10 [00:00<?, ?it/s]

  probs = F.softmax(next_token_logits)


  0%|          | 0/13 [00:00<?, ?it/s]

topp
perplexity = 192.08
fluency = 0.26
diversity = 0.50, 0.94, 1.00



## Discussion
- **Q2.1:**
  In greedy decoding, what do you observe when generating 10 times from the test prompt?

- **A2.1:**

  All results are identical, because "greedy decoding" always selects
  the token with the hightest probability.



-------
-------
-------

- **Q2.2:** In vanilla sampling, what do you observe when generating 10 times from the test prompt?

- **A2.2:**

  All results follow a multinomial distribution based on the output probabilities. Meanwhile, because we have fixed the random seed, when sampling the answer from this multinomial distribution several times, we will get the same answers.

------
------
------

- **Q2.3:** In temperature sampling, play around with the value of temperature $t$. Which value of $t$ makes it equivalent to greedy decoding? Which value of $t$ makes it equivalent to vanilla sampling?

- **A2.3:**
  - When **t->0** (infinitely approaching zero), "temperature sampling" will equal to "greedy decoding".
  - When **t=1**, "temperature sampling" is identical to "vanilla sampling".

------
-----
-----



- **Q2.4:** In top-$p$ sampling, play around with the value of $p$. Which value of $p$ makes it equivalent to greedy decoding? Which value of $p$ makes it equivalent to vanilla sampling?

- **A2.4:**

  - When **p=0**, "top-p sampling" is equivalent to "greedy sampling". (In this case, we have to revise the first column's probability to "0" so as to dealing with the situation when "p=0". Thus, even when "p=0", we can choose the highest probability, equalling to "greedy decoding".)
  - When **p=1**, "top-p sampling" is equivalent to "vanilla sampling".


------
------
-----

- **Q2.5:** Report the evaluation metrics (perplexity, fluency, diversity) of all 4 decoding methods. Which methods have the best and worst perplexity? Fluency? Diversity?

- **A2.5:**
  - "greedy decoding" has the best perplexity and fluency. Because it always selecting the token with the highest probabilty, this policy will guaratee a high level semantic coherence and accuracy.
  - "vanilla sampling" and "temperature sampling" have the best diversity. Although "temperature sampling" uses t=0.8 during testing phase, the metrics of diversity doesn't differ a lot with "vanilla sampling".


  ||perplexity|fluency|diversity|
  |--|--|--|--|
  |**greedy**|2.08|0.78|0.01, 0.02, 0.03|
  |**vanilla**|250626|0.13|0.95, 1.00, 1.00|
  |**temperature**|235670|0.13|0.95, 1.00, 1.00|
  |**top-p**|192.08|0.26|0.50, 0.94, 1.00|

