## CS310 Natural Language Processing
## Lab 14: In-Context Learning and Prompting

In this lab, we will practice some in-context learning techniques, such as few-shot learning and chain-of-thought prompting, for solving QA problems.

## T1. Run LLMs locally

### Step 1) Install llama.cpp

Build the [llama.cpp](https://github.com/ggml-org/llama.cpp) tool, or download the binaries from the [release page](https://github.com/ggml-org/llama.cpp/releases).

---

### Step 2) Download model

We will download a model that has already been quantized and converted to the `gguf` format.

**Model option a**: 
- Use the `huggingface-cli` tool.
- Follow the tutorial here: [Qwen2.5-7B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF)

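A quick command to download a model is shown below. Note that it fetches the Q5_K_M quantization of Qwen1.5-7B-Chat; adjust the repository and file name for the model you choose.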
```bash
huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False
```


**Model option b**: 
- Or you can download the ChatGLM-3 model from ModelScope: https://modelscope.cn/models/ZhipuAI/chatglm3-6b/files. The files needed are:
  - `model.safetensors.index.json`, `config.json`, `configuration.json`
  - `model-00001-of-00007.safetensors` to `model-00007-of-00007.safetensors`
  - `tokenizer_config.json`, `tokenizer.model`
- Put all the files in a folder such as `./chatglm3-6b`.
- Then use a tool like [`chatglm.cpp`](https://github.com/li-plus/chatglm.cpp) to manually convert the model weights to `ggml` format.

---


### Step 3) Run model

You can run the model with the following command:

```bash
llama-cli -m $MODEL_PATH
```

Then you can start interacting with the model on the command line.
 - Use zero-shot and few-shot prompting to solve the problems below.
 - Add a Chain-of-Thought prompt if needed.

Try solving these problems with prompting:
1. Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there? A: 
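2. 鸡和兔在一个笼子里，共有35个头，94只脚，那么鸡有多少只，兔有多少只？ (In English: Chickens and rabbits share a cage. There are 35 heads and 94 feet in total. How many chickens and how many rabbits are there?)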
3. Q: 242342 + 423443 = ? A: 
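4. 一个人花8块钱买了一只鸡，9块钱卖掉了，然后他觉得不划算，花10块钱又买回来了，11块卖给另外一个人。问他赚了多少? (In English: A man bought a chicken for 8 yuan and sold it for 9 yuan. Feeling it was a bad deal, he bought it back for 10 yuan and sold it to someone else for 11 yuan. How much did he earn?)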
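
The three prompting styles mentioned above can be sketched as plain string templates. This is only an illustration: the function names and the two demonstration pairs are made up for this example, and the "Let's think step by step" cue is one common Chain-of-Thought trigger, not the only option.

```python
def zero_shot(question):
    """Ask the question directly, with no solved examples."""
    return f"Q: {question}\nA:"

def few_shot(question, examples):
    """Prepend solved (question, answer) pairs before the target question."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return shots + zero_shot(question)

def chain_of_thought(question):
    """Append a cue asking the model to reason step by step before answering."""
    return f"Q: {question}\nA: Let's think step by step."

# Hypothetical demonstrations for the addition problem
examples = [("12 + 30 = ?", "42"), ("105 + 17 = ?", "122")]
print(few_shot("242342 + 423443 = ?", examples))
```

The printed text can be pasted into the `llama-cli` session as a single prompt.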

---

## T2. Practice few-shot prompting

For this practice, first download the [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) model from HuggingFace by running the following command:

```bash
huggingface-cli download Qwen/Qwen2.5-7B --local-dir $MODEL_PATH
```

The task set we use is [MMLU](https://huggingface.co/datasets/cais/mmlu). You need to download the zip file and extract it to the `./MMLU` folder.

In [1]:
from transformers import AutoTokenizer,AutoModelForCausalLM
import torch
import json
import numpy as np
from pprint import pprint



First, define some helper functions for constructing prompts and running inference.

In [2]:
choices = ["A", "B", "C", "D"]

def format_subject(subject):
    l = subject.split("_")
    s = ""
    for entry in l:
        s += " " + entry
    return s

def format_example(input_list):
    prompt = input_list[0]
    k = len(input_list) - 2
    for j in range(k):
        prompt += "\n{}. {}".format(choices[j], input_list[j+1])
    prompt += "\nAnswer:"
    return prompt

def format_shots(prompt_data):
    prompt = ""
    for data in prompt_data:
        prompt += data[0]
        k = len(data) - 2
        for j in range(k):
            prompt += "\n{}. {}".format(choices[j], data[j+1])
        prompt += "\nAnswer:"
        prompt += data[k+1] + "\n\n"

    return prompt

def gen_prompt(input_list, subject, prompt_data):
    prompt = "The following are multiple choice questions (with answers) about {}.\n\n".format(
        format_subject(subject)
    )
    prompt += format_shots(prompt_data)
    prompt += format_example(input_list)
    return prompt
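
To make the expected data layout concrete: each record is a list of the form `[question, option_A, option_B, option_C, option_D, answer]`. The sketch below is a compact re-implementation of the same formatting logic so that it runs on its own; the name `render` and the toy records are illustrative and not part of the lab data.

```python
CHOICES = ["A", "B", "C", "D"]

def render(record, with_answer):
    """Format one record: question, lettered options, then 'Answer:' (plus the label for shots)."""
    question, *options, answer = record
    lines = [question] + [f"{c}. {o}" for c, o in zip(CHOICES, options)]
    text = "\n".join(lines) + "\nAnswer:"
    return text + answer + "\n\n" if with_answer else text

# Toy records in the [question, A, B, C, D, answer] layout
shots = [["What is 2 + 3?", "4", "5", "6", "7", "B"]]
target = ["What is 10 / 2?", "2", "4", "5", "10", "C"]

# Few-shot examples keep their answers; the target question ends at "Answer:"
prompt = "".join(render(s, True) for s in shots) + render(target, False)
print(prompt)
```

The gold label of the target record is never written into the prompt, so the model has to produce the letter itself.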

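The following `inference()` function builds the full input by prepending the few-shot examples to `input_text`, sends it to a local Ollama server, and parses a single answer letter (A to D) from the response, because the task is multiple-choice question answering. Note that the output length is not capped unless `num_predict` is set in the request options.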

In [26]:
import requests
import re # For more robust parsing of the answer

# Updated inference function using Ollama API with more robust answer parsing
def inference(input_text, subject, prompt_data, ollama_model_name="llama3.1:latest", ollama_api_url="http://localhost:11434/api/generate"):
    """
    Performs inference using the Ollama API.

    Args:
        input_text: The primary input text or question. For MMLU, this is the question and options.
        subject: The subject of the MMLU task (used by gen_prompt).
        prompt_data: Few-shot examples (used by gen_prompt).
        ollama_model_name: The name of the model in Ollama (e.g., "llama3.1:latest").
        ollama_api_url: The URL of the Ollama API endpoint.

    Returns:
        A tuple (predicted_answer, full_input_prompt, confidence).
        predicted_answer: The extracted answer ('A', 'B', 'C', 'D', or an error message).
        full_input_prompt: The complete prompt sent to the Ollama API.
        confidence: Always None, as confidence scores are not readily available from this API for this task.
    """
    if len(prompt_data) > 0:
        full_input_prompt = gen_prompt(input_text, subject, prompt_data)
    else:
        full_input_prompt = input_text

    api_payload = {
        "model": ollama_model_name,
        "prompt": full_input_prompt,
        "stream": False,
        # Consider uncommenting and setting num_predict if you want to restrict output length
        # "options": {
        # "temperature": 0.7,
        # "num_predict": 5 # For MMLU, a small number like 1-5 might be enough for the letter
        # }
    }

    predicted_answer = "Unparsed: Error (Initial value)"  # Default if nothing found
    conf = None

    try:
        response = requests.post(ollama_api_url, json=api_payload, timeout=60)
        response.raise_for_status()
        
        ollama_response_json = response.json()
        ollama_response_text = ollama_response_json.get("response", "").strip()
        
        # Update default predicted_answer with actual response for better "Unparsed" message
        predicted_answer = f"Unparsed: {ollama_response_text[:30]}"


        if ollama_response_text:
            # Pattern 1: Try to find "X." or "X " or just "X" (where X is A,B,C,D) at the start of the string.
            # Example: "A.", "A is correct"
            match = re.match(r"^\s*([ABCD])(?:[.\s]|$)", ollama_response_text, re.IGNORECASE)
            if match:
                predicted_answer = match.group(1).upper()
            else:
                # Pattern 2: Try to find "The answer is X", "Answer: X", "choice is X", etc.
                # Example: "I think the answer is A.", "The correct choice is: B"
                # The cue phrase is matched case-insensitively via the scoped (?i:...) group, but the
                # letter itself must be uppercase so that "is a ..." is not mistaken for the answer A.
                search_pattern = r"(?i:answer(?: is)?|choice(?: is)?|option(?: is)?|is)\s*:?\s*([ABCD])(?:[.\s]|$)"
                match_search = re.search(search_pattern, ollama_response_text)
                if match_search:
                    predicted_answer = match_search.group(1)
                else:
                    # Pattern 3: Fallback - find the first standalone uppercase A, B, C, or D
                    # (surrounded by word boundaries). Matching is case-sensitive so that the
                    # article "a" is not picked up as an answer.
                    match_fallback = re.search(r"\b([ABCD])\b", ollama_response_text)
                    if match_fallback:
                        predicted_answer = match_fallback.group(1)
                    # If still no match, predicted_answer remains the "Unparsed: {snippet}"
        else:
            predicted_answer = "Error: Empty response from API"

    except requests.exceptions.Timeout:
        print(f"API Request timed out to {ollama_api_url}")
        predicted_answer = "Error: API call timed out"
    except requests.exceptions.RequestException as e:
        print(f"API Request failed: {e}")
        predicted_answer = f"Error: API call failed ({type(e).__name__})"
    
    return predicted_answer, full_input_prompt, conf
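
If you want the model to emit only the answer letter, the request can cap the generation length. The sketch below builds such a request body using the `options` field of the Ollama generate endpoint (`num_predict` and `temperature`); the helper name `build_payload` and its defaults are assumptions made for this example.

```python
def build_payload(prompt, model="llama3.1:latest", max_tokens=1):
    """Build a request body for Ollama's /api/generate that limits the output length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0,           # reduce sampling randomness for repeatable answers
            "num_predict": max_tokens,  # cap on the number of generated tokens
        },
    }

payload = build_payload("Question ...\nAnswer:")
print(payload["options"])
```

The payload can be sent with `requests.post(ollama_api_url, json=payload)` exactly as in `inference()`. An instruction-tuned model may still begin with whitespace or a phrase rather than the letter, so a slightly larger cap combined with the regex parsing may be more reliable than a strict one-token limit.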

In [31]:
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:latest",
        "prompt": "一个人花8块钱买了一只鸡，9块钱卖掉了，然后他觉得不划算，花10块钱又买回来了，11块卖给另外一个人。问他赚了多少?",
        "stream": False
    }
)

print(response.json()["response"])



他在第一笔交易中赚了 11-9=2 美元
然后他又再次赚了 11 - 10 = <<11-10=1>>1 美元
总共他赚了 2 + 1 = <<2+1=3>>3 美元
答案是 3
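
As a reference for judging responses like the one above, the trader's net profit can be checked by summing the cash flows stated in the problem (amounts in yuan). This is a small sanity check, not part of the original notebook.

```python
# Cash flow of the trader: negative = money spent, positive = money received
cash_flow = [-8, +9, -10, +11]
profit = sum(cash_flow)
print(profit)  # 2
```

The net profit is 2, so the model response shown above (3) is incorrect; this is the kind of problem where a Chain-of-Thought or few-shot prompt is worth trying.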


In [32]:
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:latest",
        "prompt": "鸡和兔在一个笼子里，共有35个头，94只脚，那么鸡有多少只，兔有多少只？",
        "stream": False
    }
)

print(response.json()["response"])

假设 $x$ 是鸡的数量，$y$ 是兔子的数量。因此，我们得到两个等式：

$$x+y=35,\qquad 2x+4y=94.$$如果我们从第二个方程中减去第一个方程两倍，则得到 $$2x+4y-2(x+y)=94-2\cdot35,$$ 或者 $$(2x+4y)-(2x+2y)=94-70.$$简化后，我们得到$$2y=24。$$从这里我们可以很容易地求出 $y$ 的值，即 $\boxed{12}$。

类似地，如果我们用第二个方程除以 2，然后使用 $y$ 的值，得到 $$x+2y=\frac{94}{2},$$$$ x+2\cdot12=47。$$因此，我们发现 $x$ 的值是 $\boxed{35-12=23}$。

因此，鸡有 $\boxed{23}$ 只，兔子有 $\boxed{12}$ 只。
最终答案是 (23, 12)。
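
The response above can be verified by solving the same two equations directly. A minimal check, with variable names chosen for this example:

```python
heads, feet = 35, 94

# x chickens (2 feet each), y rabbits (4 feet each):
#   x + y = heads,  2x + 4y = feet  =>  2y = feet - 2 * heads
rabbits = (feet - 2 * heads) // 2
chickens = heads - rabbits

# Confirm that both original equations hold
assert chickens + rabbits == heads
assert 2 * chickens + 4 * rabbits == feet
print(chickens, rabbits)  # 23 12
```

Here the model's answer of 23 chickens and 12 rabbits is correct.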


In [None]:
# Original model_path
# model_path = '/Users/xy/models/qwen2.5-7b'

# Updated model_path, using the HuggingFace Hub ID of Llama 3.1 8B Instruct
model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# The tokenizer and model loading code is unchanged, but the tokenizer arguments may need adjusting,
# or torch_dtype may need to be added when loading the model

tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          use_fast=True,
                                          # add_bos_token=False is usually recommended for Llama 3.1 Instruct
                                          # For the Llama 3 family, unk_token, bos_token, and eos_token usually do not need to be set manually;
                                          # AutoTokenizer loads the recommended configuration
                                          add_bos_token=False 
                                          )

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map='auto',
                                             # For newer models such as Llama 3.1, specifying torch_dtype is recommended to optimize performance and memory
                                             torch_dtype=torch.bfloat16 # if your hardware supports bf16
                                             # otherwise torch.float16
                                             )

Load the json data.

In [15]:
data = {}
prompt = {}

with open("./MMLU/MMLU_ID_test.json", 'r') as f:
    data = json.load(f)

with open("./MMLU/MMLU_ID_prompt.json", 'r') as f:
    prompt = json.load(f)

We can see the data is organized by subjects.

In [16]:
print(data.keys())

print()
pprint(data['high_school_mathematics'][3])

dict_keys(['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics'])

['At breakfast, lunch, and dinner, Joe randomly chooses with equal '
 'probabilities either an apple, an orange, or a banana to eat. On a given '
 'day, what is the probability that Joe will eat at least two different kinds '
 'of fruit?',
 '\\frac{7}{9}',
 '\\frac{8}{9}',
 '\\frac{5}{9}',
 '\\frac{9}{11}',
 'B']


Few-shot prompts are also organized by subject, and each subject has a list of 5 examples.

In [17]:
print(len(prompt['high_school_mathematics']))
print(len(prompt['high_school_physics']))

5
5


We stick to one subject, `high_school_mathematics`, for this example.

In [18]:
subject = 'high_school_mathematics'
data_sub = data[subject]
prompt_sub = prompt[subject]

Take one input example and generate the full prompt by calling `gen_prompt()`.

In [19]:
input_text = data_sub[3]
prompt_text = gen_prompt(input_text, subject, prompt_sub)
print(prompt_text)

The following are multiple choice questions (with answers) about  high school mathematics.

Joe was in charge of lights for a dance. The red light blinks every two seconds, the yellow light every three seconds, and the blue light every five seconds. If we include the very beginning and very end of the dance, how many times during a seven minute dance will all the lights come on at the same time? (Assume that all three lights blink simultaneously at the very beginning of the dance.)
A. 3
B. 15
C. 6
D. 5
Answer:B

Five thousand dollars compounded annually at an $x\%$ interest rate takes six years to double. At the same interest rate, how many years will it take $\$300$ to grow to $\$9600$?
A. 12
B. 1
C. 30
D. 5
Answer:C

The variable $x$ varies directly as the square of $y$, and $y$ varies directly as the cube of $z$. If $x$ equals $-16$ when $z$ equals 2, what is the value of $x$ when $z$ equals $\frac{1}{2}$?
A. -1
B. 16
C. -\frac{1}{256}
D. \frac{1}{16}
Answer:C

Simplify and write th

In [27]:
# New call for few-shot
output, _, conf = inference(input_text, subject, prompt_sub) # Uses default ollama_model_name="llama3.1:latest"
# Or specify the model explicitly:
# output, _, conf = inference(input_text, subject, prompt_sub, ollama_model_name="your-other-ollama-model:tag")



In [28]:
print(output)
print(conf)

A
None


Test with zero-shot prompting.

In [29]:
# Raw string so that "\f" in "\frac" is not interpreted as a form-feed escape
zs_prompt = r'''
    At breakfast, lunch, and dinner, Joe randomly chooses with equal probabilities either an apple, an orange, or a banana to eat. On a given day, what is the probability that Joe will eat at least two different kinds of fruit?
    A. \frac{7}{9}
    B. \frac{8}{9}
    C. \frac{5}{9}
    D. \frac{9}{11}
    Answer:
'''

In [30]:

# New call for zero-shot
output, _, conf = inference(zs_prompt, subject, prompt_data=[]) # Uses default ollama_model_name="llama3.1:latest"
print(output)
print(conf)

A
None
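
A single question says little about how the two settings compare, so a natural next step is to measure accuracy over a whole subject. The sketch below assumes records in the same list layout as the MMLU data (gold label in the last position); `accuracy`, the toy records, and the stub predictor are hypothetical names used so the block runs without a model.

```python
def accuracy(records, predict):
    """Fraction of records whose predicted letter equals the gold label (last field)."""
    if not records:
        return 0.0
    correct = sum(1 for rec in records if predict(rec) == rec[-1])
    return correct / len(records)

# Toy records in the MMLU list layout; the stub predictor always answers "B"
toy = [
    ["Q1", "a", "b", "c", "d", "B"],
    ["Q2", "a", "b", "c", "d", "C"],
]
print(accuracy(toy, lambda rec: "B"))  # 0.5
```

In this lab, `predict` could be something like `lambda rec: inference(rec, subject, prompt_sub)[0]` applied to `data_sub`, which gives the few-shot accuracy for the chosen subject.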
