# Ensuring Reliable Few-Shot Prompt Selection for LLMs

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cleanlab/cleanlab-tools/blob/master/few_shot_prompt_selection/few_shot_prompt_selection.ipynb)

In this notebook, we prompt the Davinci LLM from OpenAI (the model underpinning GPT-3/ChatGPT) with few-shot prompts in an effort to classify the intent of customer service requests at a large bank. Following typical practice, we source the few-shot examples to include in the prompt template from an available dataset of human-labeled request examples. However, the resulting LLM predictions are unreliable; a close inspection reveals this is because real-world data is messy and error-prone. Manually modifying the prompt template to mitigate potentially noisy data boosts LLM performance on this customer service intent classification task only marginally. The LLM predictions become significantly more accurate if we instead use data-centric AI algorithms via Cleanlab Studio to ensure only high-quality few-shot examples are selected for inclusion in the prompt template.

# Imports and Helpers

In [44]:
import pandas as pd
import numpy as np
import warnings
import openai, os
from openai import OpenAI
import string
import random
import tiktoken
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from langchain.prompts import PromptTemplate
from langchain.prompts import FewShotPromptTemplate
warnings.filterwarnings('ignore')
pd.set_option('max_colwidth', None)

def eval_preds(preds):
    acc = accuracy_score(preds, test.label.values)
    return "Model Accuracy: " + '{:.1%}'.format(acc)

def tokens_per_prompt(prompt):
    encoding = tiktoken.get_encoding("p50k_base")
    num_tokens = len(encoding.encode(prompt))
    return num_tokens

def cost_per_prompt(prompt):
    cost_per_token = 0.02 / 1000
    tokens = tokens_per_prompt(prompt)
    cost = tokens * cost_per_token
    return cost

def cost_per_test_evaluation(examples_pool, test):
    texts = test.text.values
    cost = 0
    examples = get_examples(examples_pool)
    for text in texts:
        prompt = get_prompt(examples_pool, text, examples)
        prompt_cost = cost_per_prompt(prompt)
        cost += prompt_cost
    return "${:,.2f}".format(cost)

# OpenAI API Key
Set the `OPENAI_API_KEY` environment variable to your own key.

In [45]:
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],  # read the key from the environment; never hard-code it
    base_url="https://sg.uiuiapi.com/v1",  # note: it is best to include the /v1 suffix
    # model="gpt-3.5-turbo"
)

In [46]:
# neg = pd.read_csv(r'/root/reinforcement_commit/datasets/dataset.csv')
from sklearn.model_selection import train_test_split
df = pd.read_csv('./dataset.csv')
# df.dropna(inplace=True)
# label2id = {'Adaptive':0,'Perfective':1,'Corrective':2}
# df = df.replace({"labels": label2id})
# df.rename(columns={'labels':'label','diffs':'diff','msgs':'message'},inplace=True)
# df

train_df, temp_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Building the Few-Shot Prompt
Few-shot prompting is a technique used in natural language processing that enables pretrained foundation models to perform complex tasks without any explicit training (i.e. updates to model parameters). In few-shot prompting (also known as in-context learning), we provide the model with a small number of input-output pairs inside the prompt, which show the model how to handle a new input.

Our prompt will consist of a few pieces:
- (optional) prefix
- list of class labels to help LLM choose a valid class
- (optional) few-shot examples, one from each class
- target text for LLM to classify
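
The pieces listed above can be assembled with plain string formatting. The sketch below is illustrative only: `build_few_shot_prompt`, its parameters, the `Text:`/`Label:` layout, and the sample commits are hypothetical and are not part of this notebook's pipeline.

```python
def build_few_shot_prompt(labels, examples, target_text, prefix=""):
    """Assemble a few-shot classification prompt (hypothetical helper).

    labels: list of valid class names
    examples: list of (text, label) pairs shown to the model
    target_text: the text the model should classify
    prefix: optional instruction placed before everything else
    """
    parts = []
    if prefix:
        parts.append(prefix.strip())
    parts.append("Valid labels: " + ", ".join(labels))
    for text, label in examples:
        parts.append(f"Text: {text}\nLabel: {label}")
    # The final block leaves the label empty for the model to complete.
    parts.append(f"Text: {target_text}\nLabel:")
    return "\n\n".join(parts)


labels = ["Adaptive", "Perfective", "Corrective"]
examples = [  # made-up examples, one per class
    ("Add OAuth login support", "Adaptive"),
    ("Refactor cache layer for speed", "Perfective"),
    ("Fix null pointer crash on startup", "Corrective"),
]
prompt = build_few_shot_prompt(labels, examples, "Fix typo in error handler")
print(prompt)
```

Ending the prompt with an empty `Label:` slot is one common layout choice; it nudges the model to answer with a label only.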

## Adding List of Classes for Valid Completions

This text goes at the beginning of the prompt and tells the LLM what the valid classes are so that it can consistently output one of them. Without it, the LLM may fail to choose a valid class and may output something that cannot be parsed.

Here we can also add an optional `prefix` that we will use later.
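
Even with the class list in the prompt, a completion can carry extra whitespace, punctuation, or surrounding words. A minimal sketch of mapping a raw completion back onto the valid classes is shown below; `parse_label` and `VALID_LABELS` are hypothetical names, and the matching rules are an assumption rather than something this notebook implements.

```python
VALID_LABELS = ["Adaptive", "Perfective", "Corrective"]


def parse_label(completion, valid_labels=VALID_LABELS):
    """Return the valid label found in the completion, or None."""
    # Normalize: drop surrounding whitespace, then quotes and periods.
    cleaned = completion.strip().strip(".\"'").lower()
    for label in valid_labels:
        if cleaned == label.lower():
            return label
    # Fall back to labels mentioned anywhere; accept only an unambiguous match.
    mentioned = [label for label in valid_labels if label.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else None


print(parse_label(" corrective.\n"))
print(parse_label("Label: Adaptive"))
print(parse_label("Adaptive or Corrective"))
```

Returning `None` for ambiguous or unrecognized output keeps unparsable completions visible instead of silently counting them as a class.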

## Generate Entire Prompt

In [47]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":  # note: future models may deviate from this
        num_tokens = 0
        for message in messages:
            num_tokens += (
                4  # every message follows <im_start>{role/name}\n{content}<im_end>\n
            )
            for key, value in message.items():
                num_tokens += len(encoding.encode(value))
                if key == "name":  # if there's a name, the role is omitted
                    num_tokens += -1  # role is always required and always 1 token
        num_tokens += 2  # every reply is primed with <im_start>assistant
        return num_tokens
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not presently implemented for model {model}.
  See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )

In [48]:
# Helper to format the zero-shot prompt with:
# - task instructions describing the three categories
# - the commit message
# - the code diff, truncated so the prompt fits the context window
def generate_promot(row):
    base_promot = """
    Please act as a commit classifier,
    categorize git commit into three categories: Adaptive, Perfective, and Corrective.
    The Adaptive category corresponds to commits that address the modifications to the project in order to adapt it to the new environment such as the feature addition.
    The Perfective category corresponds to commits that address enhancement of the project, such as enhancement of performance and refactoring of source code.
    The Corrective category corresponds to commits that address the fix of bugs and faults in the project.
    I will provide you with the commit message and code diff for a commit,
    and you need to give me the category label for this commit.
    Please avoid any explanations and only provide the category label.
    """

    commit_message_prompt = f"commit message: {row['msgs']}\n"
    commit_codediff_prompt = f"code diff: {row['diffs']}\n"

    prompt = base_promot + commit_message_prompt + commit_codediff_prompt

    # Binary-search the longest diff prefix that keeps the prompt under the
    # 4096-token limit.
    l, r = 0, len(commit_codediff_prompt)
    best = 0  # longest prefix length known to fit
    while l <= r:
        mid = (l + r) // 2
        prompt = base_promot + commit_message_prompt + commit_codediff_prompt[:mid]
        messages = [{"role": "user", "content": prompt}]

        ok = num_tokens_from_messages(messages) < 4096
        if ok:
            best = mid
            l = mid + 1
        else:
            r = mid - 1
    prompt = base_promot + commit_message_prompt + commit_codediff_prompt[:best]

    return prompt
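
The truncation step in the helper above is a binary search over prefix lengths. The standalone sketch below isolates that idea so it can be tried without an API key or tokenizer. `truncate_to_budget` and `count_words` are hypothetical names, and the word-based counter is only a stand-in for a real tokenizer; the search also assumes the count never decreases as the text grows, which holds for this counter but only approximately for BPE tokenizers.

```python
def truncate_to_budget(fixed_text, variable_text, budget, count_tokens):
    """Return fixed_text plus the longest prefix of variable_text that fits.

    count_tokens is any callable mapping a string to a token count.
    """
    lo, hi = 0, len(variable_text)  # candidate prefix lengths, inclusive
    best = 0                        # longest prefix length known to fit
    while lo <= hi:
        mid = (lo + hi) // 2
        if count_tokens(fixed_text + variable_text[:mid]) < budget:
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return fixed_text + variable_text[:best]


# Stand-in counter for illustration: one "token" per whitespace-separated word.
def count_words(s):
    return len(s.split())


fixed = "classify this diff: "
variable = "a b c d e f g h"
result = truncate_to_budget(fixed, variable, budget=6, count_tokens=count_words)
print(repr(result))
```

Tracking `best` explicitly means the function falls back to the fixed text alone when nothing else fits, and keeps the whole variable text when everything fits.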

## Query OpenAI LLM API

In [49]:
# Helper method to prompt OpenAI LLM and get response.
def get_response(user_content):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # adjust according to the proxy provider's documentation
        messages=[
            {"role": "user", "content": user_content}
        ],
    )
    print(response.choices[0].message.content)

    return response.choices[0].message.content

# text = "\'How can I change my pin?\'"
# examples = get_examples(examples_pool)
# df = pd.read_csv('dataset.csv')

result = []
for i, row in val_df.iterrows():
    user_content = generate_promot(row)
    # print(user_content)
    response = get_response(user_content)
    result.append(response)
    # print("Model classified ", user_content, " as ", response)


Perfective
Corrective
Corrective
Perfective
Perfective
Perfective
Adaptive
Adaptive
Adaptive
Corrective
Corrective
Adaptive
Adaptive
Corrective
Perfective
Adaptive
Perfective
Perfective
Perfective
Perfective
Adaptive
Corrective
Corrective
Adaptive
Corrective
Corrective
Adaptive
Corrective
Corrective
Perfective
Adaptive
Perfective
Corrective
Adaptive
Corrective
Adaptive
Perfective
Perfective
Adaptive
Perfective
Corrective
Perfective
Adaptive
Adaptive
Perfective
Adaptive
Perfective
Corrective
Adaptive
Corrective
Perfective
Adaptive
Perfective
Corrective
Corrective
Corrective
Corrective
Perfective
Perfective
Corrective
Corrective
Corrective
Adaptive
Adaptive
Perfective
Adaptive
Corrective
Corrective
Adaptive
Corrective
Corrective
Adaptive
Perfective
Perfective
Corrective
Corrective
Corrective
Corrective
Corrective
Perfective
Corrective
Perfective
Perfective
Corrective
Adaptive
Perfective
Adaptive
Perfective
Perfective
Adaptive
Corrective
Adaptive
Perfective
Perfective
Perfective
Perfectiv
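
The loop above issues one API request per validation row, so a single transient failure would abort the whole run. A minimal sketch of one common mitigation, retrying with exponential backoff, is shown below. `with_retries` and its parameters are hypothetical names, and `flaky` only simulates failures; no real API call is made.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return "Corrective"


delays = []
label = with_retries(flaky, sleep=delays.append)  # record delays instead of sleeping
print(label, delays)
```

In this notebook, the wrapper could be applied as `with_retries(lambda: get_response(user_content))`.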

In [50]:
val_df['labels']

1009      Adaptive
721     Corrective
538     Corrective
69      Perfective
1210    Perfective
           ...    
527       Adaptive
650     Perfective
1420    Corrective
1386    Perfective
637     Perfective
Name: labels, Length: 267, dtype: object

In [52]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(val_df['labels'],result,digits=4))

              precision    recall  f1-score   support

    Adaptive     0.5926    0.5393    0.5647        89
  Corrective     0.7629    0.8222    0.7914        90
  Perfective     0.4831    0.4886    0.4859        88

    accuracy                         0.6180       267
   macro avg     0.6129    0.6167    0.6140       267
weighted avg     0.6139    0.6180    0.6152       267

