<a href="https://colab.research.google.com/github/GoswamiVijay/GenAIBootCamp-GL/blob/main/Classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Customer Intent Classification

Our chatbot has to log each customer query. It helps to capture information such as the customer's intent along with the query, because this makes later analysis easier. Our chatbot will classify the intent of the customer when it stores the query. When we build this functionality into our chatbot, it will be part of the system prompt along with many other instructions. For now, we are only going to write a prompt that extracts the intent. Later, we will incorporate this into a more comprehensive system prompt that does classification among many other things.

**Problem Statement**

Businesses handling customer queries across multiple channels often face challenges in understanding and categorizing inquiries accurately and efficiently. This issue leads to slower response times, increased customer dissatisfaction, and strain on human support agents. Traditional rule-based or keyword-based systems often misinterpret complex, multi-intent queries, resulting in incorrect routing or incomplete resolutions. Additionally, scaling customer support to handle diverse and high volumes of queries while maintaining quality is a significant bottleneck for growing businesses.

**Solution**

By leveraging classification with Large Language Models (LLMs), businesses can automatically identify the intent behind customer queries, even when they are phrased in complex or conversational language. LLMs excel at contextual understanding, enabling accurate categorization and routing of queries to appropriate resources. This automation reduces response times, enhances customer satisfaction, and allows human agents to focus on higher-value tasks. The solution also scales seamlessly with growing query volumes, ensuring consistent and efficient customer support.

# Setup

In [None]:
# Step 1: Install the datasets, tiktoken, and openai libraries
!pip install datasets tiktoken
!pip install --upgrade openai

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)


In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import json
from openai import AzureOpenAI
import tiktoken
from sklearn.metrics import accuracy_score
from langchain_core.tools import tool

# Authentication

In [None]:
# Load the configuration from the JSON file
import json
with open('4omini.json', 'r') as config_file:
    creds = json.load(config_file)

In [None]:
client = AzureOpenAI(
    azure_endpoint=creds["AZURE_OPENAI_ENDPOINT"],
    api_key=creds["AZURE_OPENAI_KEY"],
    api_version=creds["AZURE_OPENAI_APIVERSION"]
)

deployment_name = creds["CHATGPT_MODEL"]

In [None]:
messages = [
    {"role": "system", "content": "you are my assistant"},
    {"role": "user", "content": "what is the best LLM for RAG"}
]

In [None]:
response = client.chat.completions.create(
    model=deployment_name, # <- on Azure, this is the deployment name loaded from the config
    messages=messages,
    max_tokens=4096
)

In [None]:
response.choices[0].message.content

"Retrieval-Augmented Generation (RAG) combines the strengths of retrieval-augmented systems and generative language models. The best large language model (LLM) for RAG often depends on specific applications, use cases, and the nature of the data you're working with. However, some prominent LLMs and frameworks well-suited for RAG include:\n\n1. **GPT-3 / GPT-4 (OpenAI)**: These models are popular for a variety of generative tasks and can be adapted for RAG by integrating an external retrieval mechanism to enhance their responses with factual, real-time information.\n\n2. **T5 (Text-to-Text Transfer Transformer)**: Google's T5 is designed to work on various NLP tasks as a unified framework, making it a strong candidate for RAG systems.\n\n3. **BERT and its derivatives (e.g., RoBERTa, DistilBERT)**: While mostly used for understanding tasks, BERT-based models can be integrated into a RAG system for effective document retrieval.\n\n4. **FLAN-T5**: This is a fine-tuned version of T5 that ha

Let's have a look at the data and get a feel of it.

# Loading and Preparing Data

In [None]:
# Example: Load the dataset
dataset = load_dataset('bitext/Bitext-customer-support-llm-chatbot-training-dataset')


README.md:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

(…)t_Training_Dataset_27K_responses-v11.csv:   0%|          | 0.00/19.2M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/26872 [00:00<?, ? examples/s]

In [None]:
data = dataset['train'].to_pandas()

In [None]:
data.head()

Unnamed: 0,flags,instruction,category,intent,response
0,B,question about cancelling order {{Order Number}},ORDER,cancel_order,I've understood you have a question regarding ...
1,BQZ,i have a question about cancelling oorder {{Or...,ORDER,cancel_order,I've been informed that you have a question ab...
2,BLQZ,i need help cancelling puchase {{Order Number}},ORDER,cancel_order,I can sense that you're seeking assistance wit...
3,BL,I need to cancel purchase {{Order Number}},ORDER,cancel_order,I understood that you need assistance with can...
4,BCELN,"I cannot afford this order, cancel purchase {{...",ORDER,cancel_order,I'm sensitive to the fact that you're facing f...


In [None]:
data = data[['instruction', 'intent']]

In [None]:
data.intent.value_counts()

intent,count
edit_account,1000
switch_account,1000
check_invoice,1000
complaint,1000
contact_customer_service,1000
delivery_period,999
registration_problems,999
check_payment_methods,999
contact_human_agent,999
payment_issue,999


Let's narrow the scope and focus on the following eight categories.

In [None]:
data = data[data['intent'].isin(['check_cancellation_fee', 'change_shipping_address', 'track_order', 'cancel_order', 'track_refund', 'contact_human_agent', 'get_invoice', 'check_refund_policy'])]

In [None]:
data.intent.value_counts()

intent,count
contact_human_agent,999
get_invoice,999
cancel_order,998
track_refund,998
check_refund_policy,997
track_order,995
change_shipping_address,973
check_cancellation_fee,950


In [None]:
data.head(10)

Unnamed: 0,instruction,intent
0,question about cancelling order {{Order Number}},cancel_order
1,i have a question about cancelling oorder {{Or...,cancel_order
2,i need help cancelling puchase {{Order Number}},cancel_order
3,I need to cancel purchase {{Order Number}},cancel_order
4,"I cannot afford this order, cancel purchase {{...",cancel_order
5,can you help me cancel order {{Order Number}}?,cancel_order
6,"I can no longer afford order {{Order Number}},...",cancel_order
7,I am trying to cancel purchase {{Order Number}},cancel_order
8,I have got to cancel purchase {{Order Number}},cancel_order
9,i need help canceling purchase {{Order Number}},cancel_order


Let's reduce the number of examples to 4 per intent so that we don't incur high API costs.

In [None]:
data = data.groupby('intent', group_keys=False).sample(n=4, replace=False)


In [None]:
data.intent.value_counts()

intent,count
cancel_order,4
change_shipping_address,4
check_cancellation_fee,4
check_refund_policy,4
contact_human_agent,4
get_invoice,4
track_order,4
track_refund,4


Note how the dataset is evenly balanced, with an equal number of queries for each category. This makes our life easy.

Since this is a classification exercise with a balanced dataset, we can use accuracy as our metric. We also need to be mindful of the tokens consumed by each prompt, because this will be a recurring task for the business as new queries arrive every day.
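
As a minimal sketch of the metric, accuracy is the fraction of predictions that match the gold label exactly. The helper name `exact_match_accuracy` and the sample labels below are illustrative assumptions; the notebook itself relies on `sklearn.metrics.accuracy_score`.

```python
def exact_match_accuracy(ground_truths, predictions):
    """Fraction of predictions that equal the gold label exactly."""
    if len(ground_truths) != len(predictions):
        raise ValueError("ground_truths and predictions must have the same length")
    if not ground_truths:
        raise ValueError("cannot compute accuracy on an empty list")
    matches = sum(gold == pred for gold, pred in zip(ground_truths, predictions))
    return matches / len(ground_truths)


# Hypothetical labels, for illustration only
gold = ["cancel_order", "track_order", "get_invoice", "track_refund"]
pred = ["cancel_order", "track_order", "$get_invoice", "track_refund"]

accuracy = exact_match_accuracy(gold, pred)
print(accuracy)  # 3 of the 4 predictions match exactly
```

Because the comparison is an exact string match, even a single stray character in the model output counts as an error.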

#### Test and Train Split

Let us split the data into two segments - one segment that gives us a pool to draw few-shot examples from and another segment that gives us a pool of gold examples which will be used for testing.

In summary, we extract a dataset from a corpus by processing required fields. Each example should contain the text input and an annotated label. Once we create examples and gold examples from this dataset, this curated dataset is stored in a format appropriate for reuse (e.g., JSON).

To select gold examples for this session, we split the data with `random_state=42`. This makes the split reproducible: for the same input data, the gold examples are randomly selected but do not change between runs of the notebook. Note that we use a small gold set only to keep execution times low for illustration. In practice, a large number of gold examples facilitates robust estimates of model accuracy.

In [None]:
examples_df, gold_examples_df = train_test_split(
    data, #<- the full dataset
    test_size=0.5, #<- 50% random sample selected for gold examples
    random_state=42, #<- ensures that the splits are the same for every session
    stratify=data['intent'] #<- ensures equal distribution of labels
)

In [None]:
gold_examples = (
        gold_examples_df.to_json(orient='records')
)

In [None]:
(examples_df.shape, gold_examples_df.shape)

((16, 2), (16, 2))
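
To illustrate what a stratified split with a fixed seed achieves, here is a pure-Python sketch. It is not how scikit-learn implements `train_test_split`; the function name `stratified_half_split` and the toy records are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_half_split(records, label_key, seed=42):
    """Split records 50/50 so each label contributes half of its records to each side."""
    rng = random.Random(seed)  # fixed seed -> same split on every run
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    first_half, second_half = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        midpoint = len(group) // 2
        first_half.extend(group[:midpoint])
        second_half.extend(group[midpoint:])
    return first_half, second_half


# Hypothetical toy data: 4 queries for each of 2 intents
records = [
    {"instruction": f"query {i}", "intent": intent}
    for intent in ("cancel_order", "track_order")
    for i in range(4)
]

examples, gold = stratified_half_split(records, "intent")
print(Counter(r["intent"] for r in examples))
print(Counter(r["intent"] for r in gold))
```

Each side ends up with two records per intent, mirroring the 16/16 split above in which every intent appears twice in each pool.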

In [None]:
gold_examples_df.head(3)

Unnamed: 0,instruction,intent
9632,i need assistance to speak with a live agent,contact_human_agent
583,"I purchased some items, help me cancel order {...",cancel_order
3773,I do not know how to see the early exit penalty,check_cancellation_fee


With everything set up, let's start working on our prompts.

### Step 3: Derive Prompt

#### Create prompts

In [None]:
user_message_template = """```{user_query}```"""

Let's create a zero-shot prompt for this scenario. We need the LLM to output only the category label, with no explanation. We start with a bare instruction and tighten it once we see how the model responds.

**Prompt 1: Zero-shot**

In [None]:
intent_categories = data.intent.unique()

In [None]:
zero_shot_system_message = f"""
Classify the following user query presented in the input into one of the following categories.
Categories - {intent_categories}
"""

In [None]:
zero_shot_system_message

"\nClassify the following user query presented in the input into one of the following categories.\nCategories - ['cancel_order' 'change_shipping_address' 'check_cancellation_fee'\n 'check_refund_policy' 'contact_human_agent' 'get_invoice' 'track_order'\n 'track_refund']\n"

In [None]:
zero_shot_prompt = [{'role':'system', 'content': zero_shot_system_message}]
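
The system message printed above embeds the raw array representation of the categories, including brackets, quotes, and a line break. One possible alternative, sketched here under the assumption that the labels are available as a plain list, joins them with commas. The helper `build_system_message` is hypothetical and not part of the notebook.

```python
def build_system_message(categories):
    """Render the label set as a comma-separated list inside the instruction."""
    category_list = ", ".join(sorted(categories))
    return (
        "Classify the following user query presented in the input into one of "
        "the following categories.\n"
        f"Categories - {category_list}\n"
    )


# Labels taken from the notebook's filtered dataset
categories = [
    "cancel_order", "change_shipping_address", "check_cancellation_fee",
    "check_refund_policy", "contact_human_agent", "get_invoice",
    "track_order", "track_refund",
]

system_message = build_system_message(categories)
print(system_message)
```

Whether this formatting changes accuracy is an empirical question; it mainly keeps the prompt free of characters that are not part of any label.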

**Let's try our zero-shot prompt on a single example.**

In [None]:
data.iloc[0,:]

Unnamed: 0,465
instruction,I cannot afford order {{Order Number}}
intent,cancel_order


In [None]:
user_query = data.iloc[0,0]

user_input = [
    {
        'role':'user',
        'content': user_message_template.format(user_query = user_query)
    }
]
print(user_input)

[{'role': 'user', 'content': '```I cannot afford order {{Order Number}}```'}]


Let's also cap the `max_tokens` parameter at 4 so that the model doesn't output explanations. We are capping it at 4 instead of 2 because we want to leave a little leeway for punctuation marks and sub-word tokens that the model might output in the middle of the text. It is better to clean the output with regex later than to prematurely over-constrain the LLM output.

In [None]:
deployment_name = 'gpt-4o-mini'

In [None]:
response = client.chat.completions.create(
    model=deployment_name,
    messages=zero_shot_prompt+user_input,
    temperature=0, # <- Note the low temperature
    max_tokens=4 # <- Note how we restrict the output to not more than 4 tokens
)
print(response.choices[0].message.content)

The user query is


In [None]:
def evaluate_prompt(prompt, gold_examples, user_message_template):

    """
    Return the accuracy score for predictions on gold examples.
    For each example, we make a prediction using the prompt. Gold labels and
    model predictions are aggregated into lists and compared to compute the
    accuracy.

    Args:
        prompt (List): list of messages in the Open AI prompt format
        gold_examples (str): JSON string with list of gold examples
        user_message_template (str): string with a placeholder for the user query

    Output:
        accuracy (float): Accuracy computed by comparing model predictions
                                with ground truth
    """

    model_predictions, ground_truths = [], []

    for example in json.loads(gold_examples):
        gold_input = example['instruction']
        user_input = [
            {
                'role':'user',
                'content': user_message_template.format(user_query=gold_input)
            }
        ]

        try:
            response = client.chat.completions.create(
                model=deployment_name,
                messages=prompt+user_input,
                temperature=0, # <- Note the low temperature
                max_tokens=4 # <- Note how we restrict the output to not more than 4 tokens
            )

            prediction = response.choices[0].message.content
            print(prediction) # raw LLM response; comment out once debugging is done
            model_predictions.append(prediction)
            ground_truths.append(example['intent'].strip().lower())


            print("User Query: \n", example['instruction'],"\n")
            print("Original label: \n", example['intent'],"\n")
            print("Predicted label: \n", prediction)
            print("====================================================")

        except Exception as e:
            # Examples whose API call fails are skipped and excluded from the accuracy
            print(e)
            continue

    # Compute the accuracy once, after all examples have been processed
    accuracy = accuracy_score(ground_truths, model_predictions)

    return accuracy



In [None]:
evaluate_prompt(zero_shot_prompt, gold_examples, user_message_template)

contact_human_agent
User Query: 
 i need assistance to speak with a live agent 

Original label: 
 contact_human_agent 

Predicted label: 
 contact_human_agent
cancel_order
User Query: 
 I purchased some items, help me cancel order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
The user query is
User Query: 
 I do not know how to see the early exit penalty 

Original label: 
 check_cancellation_fee 

Predicted label: 
 The user query is
The user query is
User Query: 
 I cannot afford order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 The user query is
The user query seems
User Query: 
 getting billsfrom {{Person Name}} 

Original label: 
 get_invoice 

Predicted label: 
 The user query seems
check_refund_policy
User Query: 
 I want to see in which cases can I request to be refunded 

Original label: 
 check_refund_policy 

Predicted label: 
 check_refund_policy
check_cancellation_fee
User Query: 
 I want assistance to see t

0.5625

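That is not great: in several cases the model starts an explanation ("The user query is") instead of returning a label. Let's try a slightly different prompt.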

In [None]:
zero_shot_system_message = f"""
Classify the following user query presented in the input into one of the following categories.
Categories - {intent_categories}
User Query will be delimited by triple backticks in the input.
Answer only from the categories. Nothing Else. Do not explain your answer.
"""

In [None]:
zero_shot_prompt = [{'role':'system', 'content': zero_shot_system_message}]

In [None]:
evaluate_prompt(zero_shot_prompt, gold_examples, user_message_template)

contact_human_agent
User Query: 
 i need assistance to speak with a live agent 

Original label: 
 contact_human_agent 

Predicted label: 
 contact_human_agent
cancel_order
User Query: 
 I purchased some items, help me cancel order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
check_cancellation_fee
User Query: 
 I do not know how to see the early exit penalty 

Original label: 
 check_cancellation_fee 

Predicted label: 
 check_cancellation_fee
cancel_order
User Query: 
 I cannot afford order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
$get_invoice
User Query: 
 getting billsfrom {{Person Name}} 

Original label: 
 get_invoice 

Predicted label: 
 $get_invoice
check_refund_policy
User Query: 
 I want to see in which cases can I request to be refunded 

Original label: 
 check_refund_policy 

Predicted label: 
 check_refund_policy
check_cancellation_fee
User Query: 
 I want assistance to see the cancellation 

0.9375

That does better. The remaining miss, `$get_invoice`, differs from the gold label only by a stray leading character. Let's add some regex to extract exactly what we want.

In [None]:
import re

def remove_non_alphabets(input_string):
    # Use regex to keep only letters and underscores
    return re.sub(r'[^a-zA-Z_]', '', input_string)



def evaluate_prompt_filtered(prompt, gold_examples, user_message_template):

    """
    Return the accuracy score for predictions on gold examples.
    For each example, we make a prediction using the prompt and clean it with
    remove_non_alphabets. Gold labels and model predictions are aggregated
    into lists and compared to compute the accuracy.

    Args:
        prompt (List): list of messages in the Open AI prompt format
        gold_examples (str): JSON string with list of gold examples
        user_message_template (str): string with a placeholder for the user query

    Output:
        accuracy (float): Accuracy computed by comparing model predictions
                                with ground truth
    """

    model_predictions, ground_truths = [], []

    for example in json.loads(gold_examples):
        gold_input = example['instruction']
        user_input = [
            {
                'role':'user',
                'content': user_message_template.format(user_query=gold_input)
            }
        ]

        try:
            response = client.chat.completions.create(
                model=deployment_name,
                messages=prompt+user_input,
                temperature=0, # <- Note the low temperature
                max_tokens=4 # <- Note how we restrict the output to not more than 4 tokens
            )

            prediction = response.choices[0].message.content
            prediction = remove_non_alphabets(prediction).lower() # <- keeps only letters and underscores, then lowercases
            # print(prediction) #uncomment to see LLM response or to debug
            model_predictions.append(prediction)
            ground_truths.append(example['intent'].strip().lower())


            print("User Query: \n", example['instruction'],"\n")
            print("Original label: \n", example['intent'],"\n")
            print("Predicted label: \n", prediction)
            print("====================================================")

        except Exception as e:
            # Examples whose API call fails are skipped and excluded from the accuracy
            print(e)
            continue

    # Compute the accuracy once, after all examples have been processed
    accuracy = accuracy_score(ground_truths, model_predictions)

    return accuracy



In [None]:
evaluate_prompt_filtered(zero_shot_prompt, gold_examples, user_message_template)

User Query: 
 i need assistance to speak with a live agent 

Original label: 
 contact_human_agent 

Predicted label: 
 contact_human_agent
User Query: 
 I purchased some items, help me cancel order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
User Query: 
 I do not know how to see the early exit penalty 

Original label: 
 check_cancellation_fee 

Predicted label: 
 check_cancellation_fee
User Query: 
 I cannot afford order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
User Query: 
 getting billsfrom {{Person Name}} 

Original label: 
 get_invoice 

Predicted label: 
 get_invoice
User Query: 
 I want to see in which cases can I request to be refunded 

Original label: 
 check_refund_policy 

Predicted label: 
 check_refund_policy
User Query: 
 I want assistance to see the cancellation penalties 

Original label: 
 check_cancellation_fee 

Predicted label: 
 check_cancellation_fee
User Query: 
 show me purchas

1.0

Great start. However, we still had to remove whitespace and non-alphabetic characters using regex. Now, let's check whether few-shot can do a better job.
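
The regex cleanup can be combined with a check that the cleaned output is actually one of the known labels. The sketch below is one way to do that; the helper `normalize_prediction` and the choice to return `None` for unrecognized output are assumptions for illustration, not part of the notebook.

```python
import re

# Labels taken from the notebook's filtered dataset
VALID_INTENTS = {
    "cancel_order", "change_shipping_address", "check_cancellation_fee",
    "check_refund_policy", "contact_human_agent", "get_invoice",
    "track_order", "track_refund",
}


def normalize_prediction(raw_output, valid_intents=VALID_INTENTS):
    """Keep only letters and underscores, lowercase, then validate against the label set."""
    cleaned = re.sub(r"[^a-zA-Z_]", "", raw_output).lower()
    return cleaned if cleaned in valid_intents else None


print(normalize_prediction("$get_invoice"))       # stray symbol removed
print(normalize_prediction(" Cancel_Order.\n"))   # whitespace, case, punctuation handled
print(normalize_prediction("The user query is"))  # not a label, so None
```

Returning `None` makes off-format responses easy to count separately from genuine misclassifications.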

**Prompt 2: Few-shot**

For the few-shot prompt, the system message is the same as in the zero-shot prompt. However, we augment it with few-shot examples.

In [None]:
few_shot_system_message = f"""
Classify the following user query presented in the input into one of the following categories.
Categories - {intent_categories}
User Query will be delimited by triple backticks in the input.
Answer only from the categories. Nothing Else. Do not explain your answer.
"""

To assemble few-shot examples, we need to sample the required number of queries from the training data. One approach is to first subset the data by category and then select samples from each subset.

In [None]:
gold_examples_df.head()

Unnamed: 0,instruction,intent
9632,i need assistance to speak with a live agent,contact_human_agent
583,"I purchased some items, help me cancel order {...",cancel_order
3773,I do not know how to see the early exit penalty,check_cancellation_fee
465,I cannot afford order {{Order Number}},cancel_order
14958,getting billsfrom {{Person Name}},get_invoice


To reiterate what we learned this week, merely selecting random samples from the category subsets is not enough, because the examples included in a prompt are prone to a set of known biases. LLMs tend to favor the most frequent label in the examples (majority label bias) or the labels that appear near the end of the prompt (recency bias).



To reduce these biases, it is important to have a balanced set of examples that are arranged in random order. Let us create a Python function that generates a balanced, randomly ordered set of examples (our function implements the workflow presented below):

In [None]:

def create_examples(dataset, intent_categories, n=2):
    """
    Return a list of randomized examples of size n * len(intent_categories).
    Create subsets of each class, choose random samples from the subsets,
    merge and randomize the order of samples in the merged list.
    Each run of this function creates a different random sample of examples
    chosen from the training data.

    Args:
        dataset (DataFrame): A DataFrame with examples
        intent_categories (list): A list of intent categories to sample from
        n (int): number of examples of each class to be selected

    Output:
        randomized_examples (list): A list with examples in random order
    """
    samples = []
    for intent in intent_categories:
        samples.extend(dataset[dataset['intent'] == intent].sample(n)[["instruction", "intent"]].to_dict(orient='records'))

    # Shuffle the samples to randomize the order
    randomized_examples = pd.DataFrame(samples).sample(frac=1).to_dict(orient='records')

    return randomized_examples

In [None]:
examples = create_examples(examples_df, intent_categories, 2)

In [None]:
examples[0]

{'instruction': 'getting bill from {{Person Name}}', 'intent': 'get_invoice'}
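
A quick sanity check on the sampled examples is to count the labels and confirm that each appears equally often. The helpers `label_counts` and `is_balanced` are hypothetical names, and the last two sample records are made up for illustration.

```python
from collections import Counter


def label_counts(examples):
    """Count how many few-shot examples carry each intent label."""
    return Counter(example["intent"] for example in examples)


def is_balanced(examples):
    """True when every label present appears the same number of times."""
    counts = label_counts(examples)
    return len(set(counts.values())) <= 1


# Records in the same shape that create_examples returns
sample_examples = [
    {"instruction": "getting bill from {{Person Name}}", "intent": "get_invoice"},
    {"instruction": "I need help talking to a live agent", "intent": "contact_human_agent"},
    {"instruction": "show me my invoice", "intent": "get_invoice"},
    {"instruction": "connect me to a person", "intent": "contact_human_agent"},
]

print(label_counts(sample_examples))
print(is_balanced(sample_examples))
```

This only checks label frequency; the random ordering is handled by the shuffle inside `create_examples`.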

Let's create a function that builds a few-shot prompt from our examples.

In [None]:
def create_prompt(system_message, examples, user_message_template):

    """
    Return a prompt message in the format expected by the Open AI API.
    Loop through the examples and parse them as user message and assistant
    message.

    Args:
        system_message (str): system message with instructions for classification
        examples (list): list of example dicts with 'instruction' and 'intent' keys
        user_message_template (str): string with a placeholder for the user query

    Output:
        few_shot_prompt (List): A list of dictionaries in the Open AI prompt format
    """

    few_shot_prompt = [{'role':'system', 'content': system_message}]

    for example in examples:
        few_shot_prompt.append(
            {
                'role': 'user',
                'content': user_message_template.format(
                    user_query=example['instruction']
                )
            }
        )

        few_shot_prompt.append(
            {'role': 'assistant', 'content': f"{example['intent']}"}
        )

    return few_shot_prompt

In [None]:
few_shot_prompt = create_prompt(
    few_shot_system_message,
    examples,
    user_message_template
)

In [None]:
few_shot_prompt

[{'role': 'system',
  'content': "\nClassify the following user query presented in the input into one of the following categories.\nCategories - ['cancel_order' 'change_shipping_address' 'check_cancellation_fee'\n 'check_refund_policy' 'contact_human_agent' 'get_invoice' 'track_order'\n 'track_refund']\nUser Query will be delimited by triple backticks in the input.\nAnswer only from the categories. Nothing Else. Do not explain your answer.\n"},
 {'role': 'user', 'content': '```getting bill from {{Person Name}}```'},
 {'role': 'assistant', 'content': 'get_invoice'},
 {'role': 'user', 'content': '```I need help talking to a live agent```'},
 {'role': 'assistant', 'content': 'contact_human_agent'},
 {'role': 'user',
  'content': '```i need help with canceling purchase {{Order Number}}```'},
 {'role': 'assistant', 'content': 'cancel_order'},
 {'role': 'user',
  'content': "```I'm waiting for an rebate of {{Refund Amount}} dollars```"},
 {'role': 'assistant', 'content': 'track_refund'},
 {'

That is 3x more token usage than zero-shot. Unless it gives significantly better results, zero-shot will be the preferred one.
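
Prompt sizes can be compared by summing a token count over the message contents. The sketch below is a rough approximation: `count_prompt_tokens` is a hypothetical helper, the whitespace split is only a stand-in tokenizer so the example runs without downloads, and the short messages are illustrative rather than the notebook's actual prompts.

```python
def count_prompt_tokens(messages, encode):
    """Sum the token counts of every message's content.

    This ignores the few per-message formatting tokens that the chat format
    adds, so the result should be treated as an approximation.
    """
    return sum(len(encode(message["content"])) for message in messages)


# Stand-in tokenizer (whitespace split), NOT a real model tokenizer.
# With tiktoken installed, an encoder's .encode method could be passed instead,
# e.g. tiktoken.get_encoding("o200k_base").encode (the encoding choice is an assumption).
encode = str.split

zero_shot = [{"role": "system", "content": "Classify the query into one category."}]
few_shot = zero_shot + [
    {"role": "user", "content": "```I need help talking to a live agent```"},
    {"role": "assistant", "content": "contact_human_agent"},
]

zero_tokens = count_prompt_tokens(zero_shot, encode)
few_tokens = count_prompt_tokens(few_shot, encode)
print(zero_tokens, few_tokens, round(few_tokens / zero_tokens, 2))
```

Since the prompt is sent with every query, this ratio translates directly into the recurring cost difference between the two approaches.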

In [None]:
evaluate_prompt(few_shot_prompt, gold_examples, user_message_template)

contact_human_agent
User Query: 
 i need assistance to speak with a live agent 

Original label: 
 contact_human_agent 

Predicted label: 
 contact_human_agent
cancel_order
User Query: 
 I purchased some items, help me cancel order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
check_cancellation_fee
User Query: 
 I do not know how to see the early exit penalty 

Original label: 
 check_cancellation_fee 

Predicted label: 
 check_cancellation_fee
cancel_order
User Query: 
 I cannot afford order {{Order Number}} 

Original label: 
 cancel_order 

Predicted label: 
 cancel_order
get_invoice
User Query: 
 getting billsfrom {{Person Name}} 

Original label: 
 get_invoice 

Predicted label: 
 get_invoice
check_refund_policy
User Query: 
 I want to see in which cases can I request to be refunded 

Original label: 
 check_refund_policy 

Predicted label: 
 check_refund_policy
check_cancellation_fee
User Query: 
 I want assistance to see the cancellation pe

1.0

One thing to keep in mind is that this result was obtained without any regex filtering of the output. As the required output format becomes more complex, few-shot prompting tends to improve over zero-shot.