<a href="https://colab.research.google.com/github/119020/NLP_2025_Spring_Materials/blob/main/tutorial3_Prompt_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Engineering
### **Course Name:** Natural Language Processing **<font color="red">(CSC6052/MDS5110/CSC5051/CSC4100)</font>**




Hello, everyone.
In this tutorial, we'll explore *Prompt Engineering* techniques.

**<font color="blue">What is prompt engineering in AI?</font>**

An AI prompt is a carefully crafted instruction given to an AI model to generate a specific output. These inputs can range from text and images to videos or even music.

Prompt engineering means writing precise instructions that guide AI models like **ChatGPT** to produce specific and useful responses. It involves designing inputs that an AI can easily understand and act upon, ensuring the output is relevant and accurate.

## **The First Step: Mastering the Use of APIs**



### About API Keys
#### DeepSeek API Keys
❗ You can get a DeepSeek API key from its official website. Upon registration, you will receive 14 RMB of free credit, which covers about 20M tokens of usage and should be enough for your assignment. Alternatively, you can go to [Siliconflow](https://cloud.siliconflow.cn/) for more free credit:
- Go to the website [https://platform.deepseek.com/api_keys](https://platform.deepseek.com/api_keys).
- Sign up for a new account to receive the 14 RMB free credit.
- Set the `DEEPSEEK_API_KEY` environment variable to the key you obtained.
- Remember to update `DEEPSEEK_BASE_URL` to https://api.siliconflow.cn/v1/chat/completions when using the API from SiliconFlow.
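
The configuration steps above can be sketched as a small helper. This is only an illustration: `configure_deepseek` is a hypothetical function name, the key shown is a placeholder, and the default base URL is the DeepSeek endpoint used later in this notebook.

```python
import os

def configure_deepseek(api_key, base_url="https://api.deepseek.com/v1"):
    """Store the DeepSeek credentials in environment variables.

    `configure_deepseek` is a hypothetical helper, not part of any library.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    os.environ["DEEPSEEK_API_KEY"] = api_key
    os.environ["DEEPSEEK_BASE_URL"] = base_url
    return base_url

# Placeholder key for illustration only; replace it with your own key.
configure_deepseek("sk-your-key-here")
```

When using another provider such as SiliconFlow, pass that provider's URL as `base_url` instead of relying on the default.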

#### OpenAI API Keys
Note that we provide a key with 100 US dollars of credit. If it is used up, you need to buy a key yourself (it may cost a little money). Here is how to buy one:
- Go to the website [https://eylink.cn/buy/7](https://eylink.cn/buy/7).
- Purchase a 14 RMB key (10 US dollars). (10 dollars is enough.)
- Fill in the `OPENAI_API_KEY` below with the key you purchased.

(As a student, you can apply for $100 of free API credit at https://azure.microsoft.com/en-us/free/students. Remember to update `OPENAI_BASE_URL` to https://api.openai.com/v1/chat/completions when using the API from Azure.)

🔅 To make the model APIs easier to access, we use LangChain, a popular framework.

In [1]:
!pip install langchain
!pip install langchain-openai
!pip install langchain-deepseek
!pip install retrying

Collecting langchain-openai
  Downloading langchain_openai-0.3.8-py3-none-any.whl.metadata (2.3 kB)
Collecting tiktoken<1,>=0.7 (from langchain-openai)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading langchain_openai-0.3.8-py3-none-any.whl (55 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken, langchain-openai
Successfully installed langchain-openai-0.3.8 tiktoken-0.9.0
Collecting langchain-deepseek
  Downloading langchain_deepseek-0.1.2-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_deepseek-0.1.2-py3-none-any.whl (5.6 kB)
Installing collected packages: langchain-deepseek
Succ

In [2]:
import os
import time
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_deepseek import ChatDeepSeek
import random
import json
from retrying import retry
import requests

# os.environ["OPENAI_API_KEY"] = "sk-8bWHFZhLVSPyeXoO6f0327Ee96A34a1dB158Ad85174eE5A0"
# os.environ["OPENAI_BASE_URL"] = "https://apix.ai-gaochao.cn/v1"
# gpt_4o_mini = ChatOpenAI(model="gpt-4o-mini", temperature=1)


os.environ["DEEPSEEK_API_KEY"] = "sk-76b03e5991db4845836cb45121a63bcf"
os.environ["DEEPSEEK_BASE_URL"] = "https://api.deepseek.com/v1"
# DeepSeek-V3
deepseek_chat = ChatDeepSeek(model="deepseek-chat", temperature=1)
# DeepSeek-R1
# deepseek_reasoning = ChatDeepSeek(model="deepseek-reasoner", temperature=0)



😊 You can now engage directly with DeepSeek-Chat using our `invoke` function.

In [4]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an AI assistant, please answer user's question."),
        ("user", "{input}")
    ]
)

model = ChatDeepSeek(model="deepseek-chat")

chain = prompt | model

response = chain.invoke({"input": "Hello"})
print(response)


content='Hello! How can I assist you today? 😊' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 11, 'prompt_tokens': 16, 'total_tokens': 27, 'completion_tokens_details': None, 'prompt_tokens_details': {'audio_tokens': None, 'cached_tokens': 0}, 'prompt_cache_hit_tokens': 0, 'prompt_cache_miss_tokens': 16}, 'model_name': 'deepseek-chat', 'system_fingerprint': 'fp_3a5770e1b4_prod0225', 'finish_reason': 'stop', 'logprobs': None} id='run-9ee08ab1-ac0b-4a42-b4ca-ef3140508edf-0' usage_metadata={'input_tokens': 16, 'output_tokens': 11, 'total_tokens': 27, 'input_token_details': {'cache_read': 0}, 'output_token_details': {}}


To make the output easier to use, we can apply an output parser to the raw response!

In [5]:
chain = prompt | model | StrOutputParser()

response = chain.invoke({"input": "Hello"})
print(response)

Hello! How can I assist you today? 😊


### **LLM Setting**

**1. Model Selection**

You are free to choose from several models. Below are the models DeepSeek offers:
- deepseek-chat (currently points to deepseek-v3; offers a balance of capability and cost, highly recommended!)
- deepseek-reasoner (currently points to deepseek-r1; extremely sophisticated and intelligent)

For detailed information, please visit [DeepSeek's Website](https://api-docs.deepseek.com/zh-cn/news/news250120).

Note that DeepSeek offers a discount for API usage around midnight; details are [here](https://api-docs.deepseek.com/zh-cn/quick_start/pricing).

Below are the models OpenAI offers:

- gpt-4o-mini (the most cost-effective model, highly recommended!)
- gpt-4o (offers a balance of capability and cost)
- gpt-4-turbo (extremely sophisticated and intelligent)

For detailed information, please visit [OpenAI's Website](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4).

In [19]:
deepseek_reasoner = ChatDeepSeek(model="deepseek-reasoner", temperature=0)
chain = prompt | deepseek_reasoner | StrOutputParser()
chain.invoke({'input': "hello! what\' your name?"}) # Note: the reasoner model is expensive

"Hello! I'm an AI assistant created by DeepSeek, and you can call me DeepSeek-R1. How can I assist you today? 😊"

**2. Temperature**
Controls the randomness of the model's output. Lower values make responses more deterministic and focused, while higher values allow for more creative and diverse outputs. Use low values for factual tasks and high values for creative tasks.
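
To see why lower temperatures give more deterministic output, here is a minimal sketch of the usual softmax-with-temperature formulation. The logits are made-up numbers, and this tutorial does not specify how DeepSeek's API applies temperature internally, so treat the code as an illustration of the general idea only.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature grows, the probabilities move closer together, so sampling picks less likely tokens more often.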

In [50]:
model = ChatDeepSeek(model="deepseek-chat", temperature=2.0)
# temperature range: [0,2]
chain = prompt | model | StrOutputParser()

chain.invoke( {"input": 'hello! what\' your name?'})

"Hello! I’m OpenAI's ChatGPT, an artificial intelligence assistant here to help with your questions or chat about anything you’d like. You can call me whatever you’d like—I don’t have a fixed name. 😊 How can I assist you today?"

**3. Top P**
This is nucleus sampling, where only tokens from the top probability mass (up to `top_p`) are considered. Lower values encourage more focused responses, while higher values increase the diversity of possible outputs.

In [7]:
model = ChatDeepSeek(model="deepseek-chat", top_p=0.9)
chain = prompt | model | StrOutputParser()

chain.invoke( {"input": 'hello! what\' your name?'})

'Hello! I’m an AI assistant, so I don’t have a personal name, but you can call me whatever you like! How can I help you today? 😊'
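
The nucleus sampling idea described above can be sketched with a toy distribution. The token probabilities are invented for illustration, and `nucleus_filter` is a hypothetical helper rather than part of any API.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / mass for token, p in kept}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
print(nucleus_filter(probs, 0.9))
print(nucleus_filter(probs, 0.5))
```

A smaller `top_p` leaves fewer candidate tokens to sample from, which is why the responses become more focused.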

**4. Max Length** This limits the total number of tokens the model can generate, helping control response length and prevent irrelevant output.

In [None]:
model = ChatDeepSeek(model="deepseek-chat", max_tokens=5)
chain = prompt | model | StrOutputParser()

chain.invoke( {"input": 'hello! what\' your name?'})

'Hello! I’m'

### **Prompting Techniques**

**1. Zero-Shot Prompting** Large language models (LLMs) today, such as GPT-3.5 Turbo, GPT-4, and Claude 3, are tuned to follow instructions and are trained on large amounts of data. Large-scale training makes these models capable of performing some tasks in a "zero-shot" manner. Zero-shot prompting means that the prompt used to interact with the model won't contain examples or demonstrations. The zero-shot prompt directly instructs the model to perform a task without any additional examples to steer it.

In [12]:
model = ChatDeepSeek(model="deepseek-chat")

chain = prompt | model | StrOutputParser()

your_prompt = """Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:"""

chain.invoke({"input": your_prompt})

'The sentiment of the text "I think the vacation is okay." is **neutral**. The word "okay" indicates neither strong positive nor negative feelings.'

**2. Few-Shot Prompting** While large language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks in the zero-shot setting. Few-shot prompting can be used as a technique to enable in-context learning, where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.

According to [Touvron et al. 2023](https://arxiv.org/pdf/2302.13971.pdf), few-shot properties first appeared when models were scaled to a sufficient size ([Kaplan et al., 2020](https://arxiv.org/abs/2001.08361)).

In [14]:
your_prompt = """This is bad! // Negative
This is awesome! // Positive
Wow that movie was rad! // Positive
What a terrific show! //"""
chain.invoke({"input": your_prompt})

'Positive'
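
When you have many demonstrations, it can be convenient to assemble the few-shot prompt programmatically. The sketch below rebuilds the same prompt as above; `build_few_shot_prompt` is a hypothetical helper name.

```python
def build_few_shot_prompt(examples, query, separator=" // "):
    """Join labeled demonstrations and leave the final label for the model to fill in."""
    lines = [f"{text}{separator}{label}" for text, label in examples]
    # The last line ends with the separator only, so the model completes the label.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)

examples = [
    ("This is bad!", "Negative"),
    ("This is awesome!", "Positive"),
    ("Wow that movie was rad!", "Positive"),
]
prompt_text = build_few_shot_prompt(examples, "What a terrific show!")
print(prompt_text)
```

The resulting string can be passed to the chain in the same way as before, for example `chain.invoke({"input": prompt_text})`.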



**3. Chain-of-Thought Prompting** Introduced in [Wei et al. (2022)](https://arxiv.org/abs/2201.11903), chain-of-thought (CoT) prompting enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting to get better results on more complex tasks that require reasoning before responding.

![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcot.1933d9fe.png&w=1920&q=75)  

In [15]:
your_prompt = """
I went to the market and bought 10 apples.
I gave 2 apples to the neighbor and 2 to the repairman.
I then went and bought 5 more apples and ate 1.
How many apples did I remain with?
Let's think step by step."""
chain.invoke({"input": your_prompt})

"Sure, let's break it down step by step:\n\n1. **Initial Purchase**: You bought 10 apples.\n   - **Total apples**: 10\n\n2. **Gave Apples to Neighbor**: You gave 2 apples to the neighbor.\n   - **Apples remaining**: 10 - 2 = 8\n\n3. **Gave Apples to Repairman**: You gave 2 apples to the repairman.\n   - **Apples remaining**: 8 - 2 = 6\n\n4. **Bought More Apples**: You bought 5 more apples.\n   - **Apples remaining**: 6 + 5 = 11\n\n5. **Ate an Apple**: You ate 1 apple.\n   - **Apples remaining**: 11 - 1 = 10\n\nSo, after all these steps, you remain with **10 apples**."

**4. Self-Consistency** Perhaps one of the more advanced techniques out there for prompt engineering is self-consistency. Proposed by [Wang et al. (2022)](https://arxiv.org/abs/2203.11171), self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT, and use the generations to select the most consistent answer. This helps to boost the performance of CoT prompting on tasks involving arithmetic and commonsense reasoning.

In [20]:
your_prompt = """**Problem: Calculate the total cost if you buy 3 notebooks at $2 each and 2 pens at $1.50 each.**

**Examples:**

1. **Example Problem:** How many apples can you buy with $10 if each costs $2?
   **Solution:**
   - I have $10.
   - Each apple costs $2.
   - $10 / $2 = 5 apples.
   **Answer: You can buy 5 apples.**

2. **Example Problem:** If a train travels 50 miles in an hour, how far will it travel in 4 hours?
   **Solution:**
   - The train travels 50 miles in one hour.
   - In 4 hours, it will travel 50 miles/hour * 4 hours = 200 miles.
   **Answer: The train will travel 200 miles.**

**Your Task:**

- Use the chain of thought to break down the cost calculation.
- Sample multiple reasoning paths.
- Determine the most consistent calculation across different samples.

**Reasoning:**
- Start by identifying the cost of one category of items:
  - 3 notebooks at $2 each = $6.
- Then, calculate the cost for the other category:
  - 2 pens at $1.50 each = $3.
- Add both amounts to find the total cost:
  - $6 (notebooks) + $3 (pens) = $9.

**Consistency Check:**
- Sample several reasoning paths. For example:
  1. Calculate total cost for notebooks first, then pens, and sum.
  2. Calculate total cost for pens first, then notebooks, and sum.
  3. Directly multiply and add the costs of notebooks and pens.
- Compare the answers and select the most frequently occurring result.

**Final Answer:**
- After verifying consistency across samples, conclude with the most consistent answer.
"""
chain.invoke({"input": your_prompt})


'**Answer:**  \n- **Total cost for notebooks:** 3 notebooks × $2 = $6.  \n- **Total cost for pens:** 2 pens × $1.50 = $3.  \n- **Sum of both costs:** $6 + $3 = $9.  \n\n**Consistency Check:**  \nAll reasoning paths (notebooks first, pens first, or direct calculation) confirm the total is $9.  \n\n**Final Answer:** The total cost is $\\boxed{9}$.'
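
The prompt above asks a single generation to describe several reasoning paths. Self-consistency as described earlier instead samples several independent generations and takes a majority vote over their final answers. Below is a minimal sketch of that voting step. The sampler is a stand-in that replays canned strings (one of them deliberately wrong), and the helper names are hypothetical.

```python
import re
from collections import Counter

def extract_final_number(text):
    """Return the last number that appears in a reasoning path, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several reasoning paths and return the most frequent final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_final_number(sample_fn(question))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Stand-in sampler that replays canned outputs; in the notebook you would pass
# something like `lambda q: chain.invoke({"input": q})` with temperature > 0.
canned = iter([
    "3 * 2 = 6, 2 * 1.5 = 3, total = 9",
    "6 + 3 = 9",
    "3 * 2 = 6, 2 * 1.5 = 3.5, total = 9.5",  # a deliberately wrong path
])
result = self_consistent_answer(lambda q: next(canned), "total cost?", n_samples=3)
print(result)
```

Taking the last number as the final answer is a simplification; a real setup would ask the model for a fixed answer format and parse that instead.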



**5. Tree of Thoughts (ToT)** For complex tasks that require exploration or strategic lookahead, traditional or simple prompting techniques fall short. [Yao et al. (2023)](https://arxiv.org/abs/2305.10601) and [Long (2023)](https://arxiv.org/abs/2305.08291) recently proposed Tree of Thoughts (ToT), a framework that generalizes over chain-of-thought prompting and encourages exploration over thoughts that serve as intermediate steps for general problem solving with language models.

ToT maintains a tree of thoughts, where thoughts represent coherent language sequences that serve as intermediate steps toward solving a problem. This approach enables an LM to self-evaluate the progress through intermediate thoughts made towards solving a problem through a deliberate reasoning process. The LM's ability to generate and evaluate thoughts is then combined with search algorithms (e.g., breadth-first search and depth-first search) to enable systematic exploration of thoughts with lookahead and backtracking.



![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FTOT.3b13bc5e.png&w=3840&q=75)  
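
The search side of ToT can be sketched without any model calls. In the toy below, a hand-written rule checker stands in for the LM's self-evaluation of a partial "thought", and breadth-first search expands only the partial arrangements that still look promising. The five ordering rules are illustrative assumptions (they mirror the rules the model invents in the recorded output further down), and the function names are hypothetical.

```python
from collections import deque

OBJECTS = "ABCDEF"

def is_promising(partial):
    """Check a partial arrangement against the rules; False means prune this branch."""
    pos = {obj: i for i, obj in enumerate(partial)}
    missing = len(OBJECTS)  # position used for objects that are not placed yet
    if partial and partial[0] == "C":                      # C cannot be first
        return False
    if "B" in pos and pos.get("A", missing) > pos["B"]:    # A must come before B
        return False
    if "D" in pos and pos.get("E", missing) > pos["D"]:    # D must come after E
        return False
    if len(partial) >= 3 and pos.get("E", missing) > 2:    # E is in the first three
        return False
    for x, y in zip(partial, partial[1:]):                 # B and F are not adjacent
        if {x, y} == {"B", "F"}:
            return False
    return True

def bfs_arrangements():
    """Breadth-first search over partial arrangements, pruning rule violations early."""
    frontier = deque([""])
    solutions = []
    while frontier:
        partial = frontier.popleft()
        if len(partial) == len(OBJECTS):
            solutions.append(partial)
            continue
        for obj in OBJECTS:
            if obj not in partial:
                candidate = partial + obj
                if is_promising(candidate):
                    frontier.append(candidate)
    return solutions

solutions = bfs_arrangements()
print(len(solutions), solutions[:3])
```

In an actual ToT system, both the generation of candidate next steps and the evaluation of each partial state would come from the language model rather than from fixed rules.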

In [21]:
your_prompt = """Imagine you are solving a complex logic puzzle where you need to arrange six objects in a specific order based on a set of rules. Each object can only be placed once, and certain objects must be placed before others according to the rules provided below. You need to explore different arrangements systematically to find the correct solution. Follow these steps:_

1. **Step 1: Start by listing possible first moves**: Consider each of the six objects and think through which ones could logically come first based on the rules.
2. **Step 2: Generate multiple intermediate thoughts for the second position**: After selecting a first object, think about which objects can go next while considering the constraints. Explore at least three different possibilities.
3. **Step 3: Evaluate each option**: After placing the first two objects, evaluate whether the current arrangement aligns with the rules. If it does, proceed to explore the third position. If not, backtrack and try another path.
4. **Step 4: Search using Breadth-First Search (BFS)**: Expand on each potential arrangement one step at a time. Use BFS to keep track of multiple arrangements at once, evaluating their adherence to the rules as you go.
5. **Step 5: Look ahead and refine**: After placing three objects, look ahead to the remaining positions and evaluate potential placements. If a placement leads to a conflict, backtrack and explore a different arrangement. Use depth-first search (DFS) if necessary to explore more deeply.
6. **Step 6: Complete the arrangement and find the correct solution**: Continue exploring arrangements, self-evaluating each step, until the correct order is found. Summarize your reasoning process after completing the task."""
chain.invoke({"input": your_prompt})

"To solve the logic puzzle, let's assume the following rules for demonstration purposes:\n\n1. **A must come before B.**\n2. **C cannot be first.**\n3. **D must be placed after E.**\n4. **B and F cannot be adjacent.**\n5. **E must be in one of the first three positions.**\n\n### Step 1: Possible First Moves  \nValid first objects:  \n- **A**, **E**, or **F** (since **C** can’t be first, **B** requires **A** before it, and **D** requires **E** first).\n\n---\n\n### Step 2: Intermediate Thoughts for Second Position  \n**Example Path: Starting with E**  \n- **Possible seconds**: A, F, C.  \n  1. **E → A**: Ensures **A** precedes **B** and allows flexibility.  \n  2. **E → F**: Avoids early conflict with **B** (due to Rule 4).  \n  3. **E → C**: Neutral placement but delays critical constraints.\n\n---\n\n### Step 3: Evaluate Each Option  \n**Path 1: E → A → B**  \n- **Remaining**: C, D, F.  \n- **Third position**: **B** (valid since **A** precedes it).  \n- **Fourth position**: **C** (avo

**6. Retrieval Augmented Generation (RAG)** General-purpose language models can be fine-tuned to achieve several common tasks such as sentiment analysis and named entity recognition. These tasks generally don't require additional background knowledge.

For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

Meta AI researchers introduced a method called [Retrieval Augmented Generation (RAG)](https://ai.facebook.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.


![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Frag.c6528d99.png&w=1920&q=75)  

A recommended repository of implementations: [langchain](https://github.com/langchain-ai/langchain).

In [22]:
your_prompt = """## Instruction: Use the provided retrieval content to answer the question.
### Retrieval Content:
1. The retrieved document discusses the impact of climate change on polar bear populations in the Arctic, detailing factors like ice melting, loss of habitat, and changes in prey availability.

### Question:
What are the primary reasons for the decline in polar bear populations as discussed in the retrieval content, and what measures can be implemented to mitigate this issue?
"""
chain.invoke({"input": your_prompt})

'The decline in polar bear populations, as discussed in the retrieval content, is primarily driven by **climate change-induced factors**:  \n1. **Ice Melting**: Reduced Arctic sea ice, critical for hunting seals (their primary prey), shortens their feeding season and forces longer fasting periods.  \n2. **Loss of Habitat**: Diminishing ice platforms disrupt their hunting grounds and migration patterns, pushing bears into less suitable terrestrial habitats.  \n3. **Changes in Prey Availability**: Warming temperatures alter seal distribution and abundance, further straining food access.  \n\n**Mitigation Measures**:  \n- **Climate Action**: Drastically reduce greenhouse gas emissions to slow Arctic warming and preserve sea ice (e.g., transitioning to renewables, enforcing global climate agreements).  \n- **Habitat Protection**: Establish marine protected areas to limit industrial activities (shipping, drilling) in critical Arctic regions.  \n- **Human-Bear Conflict Management**: Implemen
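
In the cell above the retrieval content is written by hand. The sketch below shows how the retrieve-then-prompt pipeline could be wired together. It is a toy under stated assumptions: the retriever ranks documents by simple word overlap (real systems typically use embedding-based search), the two documents are made up, and the helper names are hypothetical.

```python
def retrieve(query, documents, k=1):
    """Rank documents by how many query words they share (a toy retriever)."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(question, retrieved):
    """Format retrieved passages and the question in the style of the prompt above."""
    context = "\n".join(f"{i}. {doc}" for i, doc in enumerate(retrieved, start=1))
    return (
        "## Instruction: Use the provided retrieval content to answer the question.\n"
        f"### Retrieval Content:\n{context}\n\n"
        f"### Question:\n{question}\n"
    )

# Made-up documents for illustration only.
documents = [
    "Melting sea ice reduces the habitat polar bear populations depend on.",
    "Nucleus sampling keeps only the most probable tokens.",
]
question = "Why are polar bear populations declining?"
rag_prompt = build_rag_prompt(question, retrieve(question, documents))
print(rag_prompt)
```

The resulting prompt can then be sent to the model with `chain.invoke({"input": rag_prompt})`.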

**7. Automatic Reasoning and Tool-use (ART)** Combining CoT prompting and tools in an interleaved manner has been shown to be a strong and robust approach to addressing many tasks with LLMs. These approaches typically require hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. Paranjape et al. (2023) propose a new framework that uses a frozen LLM to automatically generate intermediate reasoning steps as a program.

ART works as follows:

- given a new task, it selects demonstrations of multi-step reasoning and tool use from a task library
- at test time, it pauses generation whenever external tools are called and integrates their output before resuming generation

ART encourages the model to generalize from demonstrations to decompose a new task and use tools in appropriate places, in a zero-shot fashion. In addition, ART is extensible as it also enables humans to fix mistakes in the reasoning steps or add new tools by simply updating the task and tool libraries. The process is demonstrated below:

![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2FART.3b30f615.png&w=1200&q=75)  

## **Task 1: LLMs as a knowledgeable doctor**




The pharmacist licensure exam is a cornerstone in the pharmacy profession, ensuring that candidates possess the requisite knowledge and skills for safe and effective practice. Its significance lies not only in validating credentials but also in safeguarding public health, enabling professional recognition, and ensuring adherence to legal and regulatory standards.

Advanced models like ChatGPT have significant potential in exam preparation, boasting an extensive knowledge base and the capability to provide in-depth explanations and clarify complex concepts. However, despite the prowess of such large models, if prompts are not designed appropriately, the information retrieved might be inaccurate or incomplete, potentially hindering success in the pharmacist exam.

In [23]:
!wget https://NLP-course-cuhksz.github.io/Assignments/Assignment1/task1/data/1.exam.json

--2025-03-13 18:01:02--  https://nlp-course-cuhksz.github.io/Assignments/Assignment1/task1/data/1.exam.json
Resolving nlp-course-cuhksz.github.io (nlp-course-cuhksz.github.io)... 185.199.111.153, 185.199.109.153, 185.199.108.153, ...
Connecting to nlp-course-cuhksz.github.io (nlp-course-cuhksz.github.io)|185.199.111.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86227 (84K) [application/json]
Saving to: ‘1.exam.json’


2025-03-13 18:01:02 (11.5 MB/s) - ‘1.exam.json’ saved [86227/86227]



In [24]:
import json
with open('1.exam.json') as f:
  data = json.load(f)

In [25]:
data[0]

{'question': '27. 根据国家药品监督管理局，公安部，国家卫⽣健康委员会的有关规定，⼜服固体制剂每剂量单位含羟考酮碱不超过5毫克，且不含其他⿇醉药品，精神药品或者药品类易制毒化学品的复⽅制剂列⼊（）。',
 'option': {'A': '含⿇醉药品复⽅制剂的管理',
  'B': '第⼆类精神药品管理',
  'C': '第⼀类精神药品管理',
  'D': '医疗⽤毒性药品管理',
  'E': ''},
 'analysis': '⼜服固体制剂每剂量单位含羟考酮碱不超过5毫克，且不含其他⿇醉药品、精神药品或药品类易制毒化学品的复⽅制剂列⼊第⼆类精神药品管理。',
 'answer': 'B',
 'question_type': '最佳选择题',
 'source': '2021年执业药师职业资格考试《药事管理与法规》'}

In [26]:
your_prompt = """请回答下面的多选题，请直接正确答案选项，不要输出其他内容。
{question}
{options}"""

def get_query(da):
  da['options'] = '\n'.join([f"{k}:{v}" for k,v in da['option'].items()])
  return your_prompt.format_map(da)

get_query(data[0])

'请回答下面的多选题，请直接正确答案选项，不要输出其他内容。\n27. 根据国家药品监督管理局，公安部，国家卫⽣健康委员会的有关规定，⼜服固体制剂每剂量单位含羟考酮碱不超过5毫克，且不含其他⿇醉药品，精神药品或者药品类易制毒化学品的复⽅制剂列⼊（）。\nA:含⿇醉药品复⽅制剂的管理\nB:第⼆类精神药品管理\nC:第⼀类精神药品管理\nD:医疗⽤毒性药品管理\nE:'

In [27]:
chain.invoke(get_query(data[0]))

'B'

In [28]:
# Compute the answer accuracy
your_prompt = """Please answer the multiple-choice question below. Output only the correct answer option(s) and nothing else.
{question}
{options}"""

import re
from tqdm import tqdm

def get_ans(ans):
    match = re.findall(r'.*?([A-E]+(?:[、, ]+[A-E]+)*)', ans)
    if match:
        last_match = match[-1]
        return ''.join(re.split(r'[、, ，]+', last_match))
    return ''

correct_num = 0
total_num = 0
for da in tqdm(data[:10]):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if get_ans(da['deepseek_ans']) == da['answer']:
    correct_num += 1
  total_num += 1
print(f"Model accuracy: {correct_num/total_num:.2%}")


100%|██████████| 10/10 [04:41<00:00, 28.19s/it]

Model accuracy: 90.00%





In [36]:
correct_num

9

In [37]:
total_num

10

In [41]:
for da in tqdm(data[:10]):
  print(da['question'])
  print(da['option'])
  print(da['analysis'])
  print("Correct Answer")
  print(da['answer'])
  print("DeepSeek")
  print(da['deepseek_ans'])
  print()


100%|██████████| 10/10 [00:00<00:00, 17396.53it/s]

27. 根据国家药品监督管理局，公安部，国家卫⽣健康委员会的有关规定，⼜服固体制剂每剂量单位含羟考酮碱不超过5毫克，且不含其他⿇醉药品，精神药品或者药品类易制毒化学品的复⽅制剂列⼊（）。
{'A': '含⿇醉药品复⽅制剂的管理', 'B': '第⼆类精神药品管理', 'C': '第⼀类精神药品管理', 'D': '医疗⽤毒性药品管理', 'E': ''}
⼜服固体制剂每剂量单位含羟考酮碱不超过5毫克，且不含其他⿇醉药品、精神药品或药品类易制毒化学品的复⽅制剂列⼊第⼆类精神药品管理。
Correct Answer
B
DeepSeek
B:第⼆类精神药品管理

112. 患者，男，70岁，因抑郁症长期服⽤舍曲林。新诊断为帕⾦森病，宜选⽤的药物有（）。
{'A': '司来吉兰', 'B': '普拉克索', 'C': '⾦刚烷胺', 'D': '吡贝地尔', 'E': '恩他卡朋'}
上述题⼲中患者使⽤的抑郁药舍曲林是选择性5-羟⾊胺再摄取抑制剂，司来吉兰避免与选择性5-羟⾊胺再摄取抑制剂合⽤，因此答案选BCDE。
Correct Answer
BCDE
DeepSeek
B:普拉克索  
C:⾦刚烷胺  
D:吡贝地尔  

**解析**：  
1. **司来吉兰（A）**：作为MAO-B抑制剂，与SSRI（舍曲林）合用时存在血清素综合征风险，需避免。  
2. **普拉克索（B）**：多巴胺受体激动剂，无显著5-HT系统相互作用，安全适用。  
3. **金刚烷胺（C）**：改善帕金森症状，代谢途径与舍曲林无冲突，但需注意肾功能（题目未提及限制）。  
4. **吡贝地尔（D）**：多巴胺受体激动剂，无明确与SSRI的相互作用，可选用。  
5. **恩他卡朋（E）**：COMT抑制剂，需与左旋多巴联用，题目未提及左旋多巴，故不适用。  

综上，正确答案为 **B、C、D**。

2215. 可⽤于治疗蛔⾍病、蛲⾍病、钩⾍病和鞭⾍病的⼴谱驱⾍药物是（）。
{'A': '甲硝唑', 'B': '氯硝柳胺', 'C': '⼄胺嗪', 'D': '甲苯咪唑', 'E': '三氯苯达唑'}
甲苯咪唑是⼴谱的驱肠⾍药，⽤于治疗蛲⾍、蛔⾍、鞭⾍、⼗⼆指肠钩⾍、粪类圆线⾍和绦⾍单独感染及混合感染。
Correct Answer
D
DeepSeek
D:甲苯咪唑

49. ⽤于治




<font color="blue">You need to optimize the prompt to improve the performance (accuracy) of large language models (LLMs).</font>

## **Task 2: LLMs for AI Feedback**

The final stage of large language model training involves reinforcement learning from feedback. Such feedback can come from either human experts or AI. This feedback is used to learn a reward model, with data defined in triplet form. Each triplet comes from a question, two answers, and a choice by a human or AI on which answer is better.

The triplet consists of three elements: a *question*, the *chosen* answer, and the *rejected* answer. You are asked to use ChatGPT to provide the feedback, namely, to choose the preferred answer. Note that the feedback is highly biased by the order in which the answers are placed, so please shuffle the order of the answers when using ChatGPT for preference feedback.
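
One way to shuffle the answer order and still score against the original labels is sketched below. The field names follow the dataset used in this task (`Question`, `Answer1`, `Answer2`, `Preference`); the helper names are hypothetical, the sample item is made up, and the model's choice is simulated by a stand-in line.

```python
import random

def shuffled_pair(item, rng):
    """Randomly swap Answer1/Answer2 and return the item plus whether it was swapped."""
    swapped = rng.random() < 0.5
    shown = dict(item)
    if swapped:
        shown["Answer1"], shown["Answer2"] = item["Answer2"], item["Answer1"]
    return shown, swapped

def unshuffle_choice(choice, swapped):
    """Map the model's choice on the shuffled pair back to the original labels."""
    if not swapped:
        return choice
    return {"Answer1": "Answer2", "Answer2": "Answer1"}.get(choice, choice)

rng = random.Random(0)  # fixed seed so the example is reproducible
item = {"Question": "Q", "Answer1": "first", "Answer2": "second", "Preference": "Answer2"}
shown, swapped = shuffled_pair(item, rng)
# Stand-in for the LLM: it always prefers the answer whose text is "second".
model_choice = "Answer1" if shown["Answer1"] == "second" else "Answer2"
print(unshuffle_choice(model_choice, swapped) == item["Preference"])
```

Build the prompt from `shown`, then compare the unshuffled choice with `Preference`, so that position bias does not systematically favor one side.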

In [42]:
!wget  https://NLP-course-cuhksz.github.io/Assignments/Assignment1/task2/data/1.rlhf.jsonl

import json
with open('1.rlhf.jsonl') as f:
  data = [json.loads(l) for l in f]
data[0] # sample

--2025-03-13 18:16:27--  https://nlp-course-cuhksz.github.io/Assignments/Assignment1/task2/data/1.rlhf.jsonl
Resolving nlp-course-cuhksz.github.io (nlp-course-cuhksz.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to nlp-course-cuhksz.github.io (nlp-course-cuhksz.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 70333 (69K) [application/octet-stream]
Saving to: ‘1.rlhf.jsonl’


2025-03-13 18:16:27 (9.12 MB/s) - ‘1.rlhf.jsonl’ saved [70333/70333]



{'Question': 'Human: What are some aspects of religious Islam that are compatible with western civilization?',
 'Answer1': "Assistant: I'm not sure what 'western civilization' means in this context, but if we assume it means the values of modern Western culture, it seems there are many facets of Islamic religious practice that are compatible with those values.  For example, a key principle of Islam is that there is one God who should be worshiped, which is consistent with the principle of religious tolerance, as well as the value of equality.  There are different branches of Islam, but many of them require",
 'Answer2': 'Assistant: Well in a basic sense it is possible for Muslims to live in the Western world and take part in Western civilization. In fact, many Muslims have been at the center of important cultural developments of the last centuries.',
 'Preference': 'Answer2'}


In [44]:
your_prompt = '''
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be relevant, accurate and helpful. Which is better, Answer1 or Answer2?
Do not explain your answer, just output 'Answer1' or 'Answer2'.
'''

def get_query(da):
  return your_prompt.format_map(da)

testdata = data[0]

print(get_query(testdata))

print(f'---- DeepSeek-Reasoner response ----')

print(chain.invoke(get_query(testdata)))


[Question]:
Human: What are some aspects of religious Islam that are compatible with western civilization?

[Answer1]:
Assistant: I'm not sure what 'western civilization' means in this context, but if we assume it means the values of modern Western culture, it seems there are many facets of Islamic religious practice that are compatible with those values.  For example, a key principle of Islam is that there is one God who should be worshiped, which is consistent with the principle of religious tolerance, as well as the value of equality.  There are different branches of Islam, but many of them require

[Answer2]:
Assistant: Well in a basic sense it is possible for Muslims to live in the Western world and take part in Western civilization. In fact, many Muslims have been at the center of important cultural developments of the last centuries.

A good response should be relevant, accurate and helpful. Which is better, Answer1 or Answer2?
Do not explain your answer, just output 'Answer1' 

In [51]:
your_prompt = '''
[Question]:
{Question}

[Answer1]:
{Answer1}

[Answer2]:
{Answer2}

A good response should be relevant, accurate and helpful. Which is better, Answer1 or Answer2?
Do not explain your answer, just output 'Answer1' or 'Answer2'.
'''

import re
from tqdm import tqdm

correct_num = 0
total_num = 0
for da in tqdm(data[:10]):
  da['deepseek_ans'] =  chain.invoke(get_query(da))
  if da['deepseek_ans'] == da['Preference']:
    correct_num += 1
  total_num += 1
print(f"Model consistency rate with human: {correct_num/total_num:.2%}")


100%|██████████| 10/10 [01:15<00:00,  7.57s/it]

Model consistency rate with human: 70.00%





In [53]:
correct_num

7

In [52]:
total_num

10

In [54]:
for da in tqdm(data[:10]):
  print(da['Question'])
  print(da['Answer1'])
  print(da['Answer2'])
  print("Correct Answer")
  print(da['Preference'])
  print("DeepSeek-V3_temperature_2.0")
  print(da['deepseek_ans'])
  print()


100%|██████████| 10/10 [00:00<00:00, 14593.96it/s]

Human: What are some aspects of religious Islam that are compatible with western civilization?
Assistant: I'm not sure what 'western civilization' means in this context, but if we assume it means the values of modern Western culture, it seems there are many facets of Islamic religious practice that are compatible with those values.  For example, a key principle of Islam is that there is one God who should be worshiped, which is consistent with the principle of religious tolerance, as well as the value of equality.  There are different branches of Islam, but many of them require
Assistant: Well in a basic sense it is possible for Muslims to live in the Western world and take part in Western civilization. In fact, many Muslims have been at the center of important cultural developments of the last centuries.
Correct Answer
Answer2
DeepSeek-V3_temperature_2.0
Answer2

Human: I want to learn how to make a New York style cheesecake.  Can you help me?
Assistant: I can help! Let's get started.




<font color="blue">You need to optimize the prompt to improve the performance (consistency rate) of large language models (LLMs).</font>