#  Exercise 8: In-Context Learning with GPT-3.5

<div style="padding:15px 20px 20px 20px;border-left:3px solid green;background-color:#e4fae4;border-radius: 20px;">

## **Exercise Description**
- In this exercise, you will investigate in-context learning using OpenAI's GPT-3.5 model. This exercise contains two parts.

- In the first part, you will investigate in-context learning for classification based on a natural language inference (NLI) task.
    
- In the second part, you will investigate in-context learning for generation based on a story ending generation (SEG) task.


### Table of Contents
- **[PART 1: In-Context Learning for Natural Language Inference](#1)**
    - [1.1 Compare Different Shots](#11)
    - [1.2 Effect of Neutral In-Context Examples](#12)
    - [1.3 Play with Different Verbalizers](#13)
    - [1.4 Add Instructions](#14)
- **[PART 2: In-Context Learning for Story Ending Generation](#2)**
    - [2.1 Zero-Shot Generation](#21)
    - [2.2 Few-Shot Generation](#22)
    - [2.3 Add Instructions](#23)

</div>

## Setup Your Environment

**Note: this exercise uses Python 3.9.** Please install the following required packages.

In [None]:
# if you are using Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# fill in the path where you put the Exercise folder into
ROOT_PATH = "/content/drive/MyDrive/Exercise8/"

In [None]:
!pip install numpy==1.22.4
!pip install tqdm==4.65.0
!pip install nltk==3.8.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


You also need to install our **GPT-3.5 wrapper** to interact with OpenAI GPT-3.5 models for free.

In [None]:
!pip install {ROOT_PATH}gpt_wrapper-0.0.7-py3-none-any.whl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./drive/MyDrive/Exercise8/gpt_wrapper-0.0.7-py3-none-any.whl
Installing collected packages: gpt-wrapper
Successfully installed gpt-wrapper-0.0.7


Import the required packages for this exercise, including our GPT-3.5 wrapper.

In [None]:
import json
import numpy as np
from tqdm import tqdm
import random
from copy import deepcopy
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score

import gpt_wrapper
from gpt_wrapper.chat import Chat

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


To facilitate reproduction, we fix a random seed here.

In [None]:
seed = 233
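As a minimal sketch of what fixing the seed buys you, the snippet below re-seeds Python's `random` module before each draw and checks that the same items are sampled. The helper `sample_with_seed` and the toy `pool` are hypothetical and only used for illustration; the seed controls the sampling of in-context examples on your side, and does not by itself control any sampling done on the model side.

```python
import random

seed = 233  # same value as in this exercise; any fixed integer works


def sample_with_seed(pool, k, seed):
    """Draw k items from pool after re-seeding Python's RNG."""
    random.seed(seed)
    return random.sample(pool, k)


# hypothetical stand-in for a list of training samples
pool = [f"example_{i}" for i in range(10)]

first = sample_with_seed(pool, 3, seed)
second = sample_with_seed(pool, 3, seed)
print(first == second)  # True: same seed, same draw
```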

Enter the exercise API key to get access to our GPT-3.5 wrapper.

In [None]:
gpt_wrapper.api_key = "a5a244d0-2f56-41d3-ac99-9e5efb0e4079"

<a name="1"></a>
## **PART 1: In-Context Learning for Natural Language Inference**
---

In this part, you are going to use the GPT-3.5 model to solve the [natural language inference (NLI)](https://towardsdatascience.com/natural-language-inference-an-overview-57c0eecf6517) task based on in-context learning. For this task, the model needs to classify the relation between two given sentences (premise and hypothesis) into one of three classes: entailment, neutral and contradiction.

Here you can take a look at the training data used for sampling few-shot in-context examples, and the testing data used to query the GPT-3.5 language model for classification (along with the gold answers for evaluation).

In [None]:
with open(ROOT_PATH+"nli_classification/train_classification.json", "r") as f:
    train_samples = json.load(f)
with open(ROOT_PATH+"nli_classification/test_classification.json", "r") as f:
    test_data = json.load(f)

print("Training Samples:")
print(train_samples["entailment"][0])
print("\n")

print("Testing Query:")
print(test_data[0]["query"])
print("\n")

print("Gold Answer:")
print(test_data[0]["gold_answer"])

Training Samples:
I know lawyers are always dreadfully careful.
I'm well aware that lawyers are always very careful.
Answer: entailment


Testing Query:
The new rights are nice enough
Everyone really likes the newest benefits 
Answer:


Gold Answer:
neutral


Here is the GPT-3.5 hyperparameter setting for this NLI task.

**max_tokens**: Maximum number of tokens to generate, default to 16.

**temperature**: Sampling temperature to use, between 0.0 and 2.0, default to 1.0. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

**top_p**: Nucleus sampling factor (alternative to sampling with temperature), between 0.0 and 1.0, default to 1.0. The model randomly samples from the tokens with top_p probability mass.

**presence_penalty**: Between -2.0 and 2.0, default to 0.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

**frequency_penalty**: Between -2.0 and 2.0, default to 0.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

We choose a small *max_tokens* because only the first non-space token generated by the model is used as the predicted class (i.e., verbalizer).

We also set the *temperature* to zero so that the model makes deterministic classification decisions.

In [None]:
model_args={"max_tokens": 2, "temperature": 0.0, "top_p": 1.0, "presence_penalty": 0.0, "frequency_penalty": 0.0}
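To build some intuition for *temperature* and *top_p*, here is a conceptual sketch on a toy next-token distribution. It is not the API's actual implementation; the helper names `apply_temperature` and `top_p_filter` and the toy logits are assumptions made for illustration. The sketch needs `temperature > 0`; a temperature of zero corresponds to the limit where the most likely token is always chosen.

```python
import numpy as np


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (requires temperature > 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass
    reaches top_p, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

sharp = apply_temperature(toy_logits, 0.1)   # almost all mass on token 0
flat = apply_temperature(toy_logits, 1.0)    # mass spread over several tokens
nucleus = top_p_filter(flat, 0.9)            # least likely token is dropped

print(np.round(sharp, 4))
print(np.round(flat, 4))
print(np.round(nucleus, 4))
```

A low temperature concentrates the distribution on the top token, which is why it suits classification, while nucleus filtering removes the unlikely tail before sampling.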

You will evaluate the model's NLI performance based on the accuracy and F1 scores on each class.

In [None]:
def evaluate_nli(predictions, gold_labels, mapping):
    
    counter = np.zeros((3, 3))  # three-class confusion matrix
    
    # calculate the confusion matrix
    for p, g in zip(predictions, gold_labels):
        pid = mapping[p]
        gid = mapping[g]
        counter[gid][pid] += 1
    
    print()
    print(counter)
    
    pred_sum = np.sum(counter, axis=0)  # total number of predictions on each class
    gold_sum = np.sum(counter, axis=1)  # total number of test samples (gold labels) on each class
    diag = np.diagonal(counter)  # total number of correct predictions on each class
    
    acc = np.sum(diag) / np.sum(counter)  # accuracy
    
    f1 = [0, 0, 0]
    for cid in range(3):
        precision = diag[cid] / pred_sum[cid]  # precisions on each class
        recall = diag[cid] / gold_sum[cid]  # recalls on each class
        f1[cid] = 2 * precision * recall / (precision + recall)  # F1 scores on each class
    
    return acc, f1[0], f1[1], f1[2]
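Note that `evaluate_nli` divides by the number of predictions and gold samples per class, so a class that is never predicted produces `nan` scores. As a sketch of one common convention (an assumption here, not part of the exercise code), the hypothetical helper `per_class_f1` below uses the equivalent form F1 = 2·TP / (predicted + gold) and reports 0 instead of dividing by zero.

```python
import numpy as np


def per_class_f1(confusion):
    """Per-class F1 from a confusion matrix (rows: gold, columns: predicted).

    Uses F1 = 2 * TP / (predicted + gold) and returns 0.0 for a class that
    has neither predictions nor gold samples, instead of dividing by zero.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diagonal(confusion)
    denom = confusion.sum(axis=0) + confusion.sum(axis=1)
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)


# every query predicted as the middle class (index 1)
all_neutral = [[0, 10, 0],
               [0, 10, 0],
               [0, 10, 0]]

f1 = per_class_f1(all_neutral)
macro_f1 = f1.mean()
print(f1)        # [0.  0.5 0. ]
print(macro_f1)  # about 0.1667
```

With this convention the macro-F1 stays a number, at the cost of silently scoring never-predicted classes as 0.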

You will use the following function to perform GPT-3.5 inference on the NLI task based on in-context learning.

In [None]:
def gpt3_nli(train_samples, test_data, shots, predictions, gold_answers,
             introduction=None, default_class="neutral", task_name="none"):
    
    '''
    train_samples: training data for sampling in-context examples
    test_data: testing queries (with gold labels)
    shots: number of in-context examples (shots) per class
    predictions: cache for saving the model predictions
    gold_answers: cache for saving gold answers
    introduction: additional task introduction for prompting
    default_class: default prediction class if the generated token is not among the verbalizers of three NLI classes
    task_name: task name for creating chat sessions
    '''
    
    # randomly sample in-context examples
    examples = []
    for nli_class, samples in train_samples.items():
        few_shot_samples = random.sample(samples, shots[nli_class])
        examples.extend(few_shot_samples)

    random.shuffle(examples)  # randomly shuffle sampled in-context examples
    
    # add task introduction (if it exists) before in-context examples for better prompting
    if introduction:
        examples.insert(0, introduction)

    for qid, query in enumerate(tqdm(test_data)):
        
        if qid < len(predictions):  # skip this query if its model prediction is already saved in cache
            continue

        # concatenate all the in-context examples with the query, to get the final input demonstration
        demonstration = "\n\n".join(examples+[query["query"]])

        # create a chat session using our GPT-3.5 wrapper class Chat
        chat = Chat.create(name=task_name+"_"+str(qid))
        
        # use the created chat session to query the GPT-3.5 model with the input demonstration,
        # and get back model's output message
        message = chat.ask(demonstration, model_args=model_args)
        
        # model's output text is in the attribute "content",
        # we use the first token of the generated text as the prediction
        preds = message.content.strip().split()
        if preds:
            pred = preds[0].lower()
        else:
            pred = "none"
        
        # mapping similar outputs to class verbalizers
        if pred in ["entail", "entailed", "entailing"]:
            pred = "entailment"
        if pred in ["contrad", "contradict", "contradicted", "contradicting"]:
            pred = "contradiction"
        
        # save the prediction in cache
        if pred in train_samples.keys():
            predictions.append(pred)
        else:
            predictions.append(default_class)
        
        # save the gold answer in cache for evaluation
        gold_answers.append(query["gold_answer"])
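To see what the model actually receives, here is a minimal sketch of the prompt assembly without any API call. The helper `build_prompt` is hypothetical; it mirrors the `"\n\n".join(...)` step above, with the optional instruction placed first, and reuses the sample and query printed earlier.

```python
def build_prompt(examples, query, introduction=None):
    """Join the optional instruction, the in-context examples and the query,
    separated by blank lines."""
    parts = ([introduction] if introduction else []) + list(examples) + [query]
    return "\n\n".join(parts)


examples = [
    "I know lawyers are always dreadfully careful.\n"
    "I'm well aware that lawyers are always very careful.\n"
    "Answer: entailment",
]
query = ("The new rights are nice enough\n"
         "Everyone really likes the newest benefits \n"
         "Answer:")

prompt = build_prompt(examples, query)
print(prompt)
```

The prompt ends with an open `Answer:` line, so the most natural continuation for the model is a class verbalizer.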

You will run the following function to perform GPT-3.5 inference and evaluation.

In [None]:
def run(train_samples, test_data, class_shots, mapping, predictions, gold_answers,
        introduction=None, default_class="neutral", task_name="none"):
    try:

        gpt3_nli(train_samples, test_data, class_shots, predictions, gold_answers,
                 introduction=introduction, default_class=default_class, task_name=task_name)
        
        acc, f1_ent, f1_neu, f1_con = evaluate_nli(predictions, gold_answers, mapping)
        macro_f1 = (f1_ent + f1_neu + f1_con) / 3

        print(f'Accuracy: {acc*100:.2f}% | F1: ({f1_ent*100:.2f}%, {f1_neu*100:.2f}%, {f1_con*100:.2f}%) | Macro-F1: {macro_f1*100:.2f}%')

    except Exception as error:  # the OpenAI ChatGPT endpoint (gpt-3.5-turbo) may get stuck from time to time when it receives too many queries
    
        print(error)


<a name="11"></a>
### **1.1 Compare Different Shots**

In this part, you will compare GPT-3.5's performance with different numbers of in-context examples (shots).

#### 0-shot classification:

Do not provide any in-context learning examples to the model.

Create empty caches for saving model predictions and gold answers.

In [None]:
predictions = []
gold_answers = []

Run the inference and evaluation.

**Note:** The OpenAI ChatGPT endpoint may sometimes get stuck when it receives too many queries. If the following cell gets stuck, just re-run it, and inference will continue from the query where it stopped. However, do not re-run the cell above that creates the caches, since that would clear the already saved predictions.
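The resume behavior comes from the `qid < len(predictions)` check in `gpt3_nli`. As a small self-contained sketch of the same pattern, the snippet below uses a hypothetical `flaky_worker` in place of the API call; it fails once, and the second run picks up where the first one stopped.

```python
def process_all(items, cache, worker):
    """Append worker(item) to cache for every item not yet processed."""
    for idx, item in enumerate(items):
        if idx < len(cache):  # already answered in an earlier, interrupted run
            continue
        cache.append(worker(item))


calls = {"n": 0}


def flaky_worker(item):
    """Hypothetical stand-in for the API call: fails on the third call only."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated timeout")
    return item * 2


cache = []
items = [1, 2, 3, 4]

try:
    process_all(items, cache, flaky_worker)
except RuntimeError:
    pass  # the cache keeps the two finished results

process_all(items, cache, flaky_worker)  # resumes at the third item
print(cache)  # [2, 4, 6, 8]
```

Re-creating `cache` between the two runs would throw away the finished results, which is why the cache cell should not be re-run.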

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots = {"entailment": 0, "neutral": 0, "contradiction": 0}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping, predictions, gold_answers, task_name="1_1_shots0")

100%|██████████| 30/30 [00:22<00:00,  1.32it/s]


[[ 0. 10.  0.]
 [ 0. 10.  0.]
 [ 0. 10.  0.]]
Accuracy: 33.33% | F1: (nan%, 50.00%, nan%) | Macro-F1: nan%



  precision = diag[cid] / pred_sum[cid]  # precisions on each class


#### 1-shot per class:

For each class, provide 1 in-context learning example sampled from the training data.

Clear the caches.

In [None]:
predictions = []
gold_answers = []

Re-run the inference and evaluation.

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots = {"entailment": 1, "neutral": 1, "contradiction": 1}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping, predictions, gold_answers, task_name="1_1_shots1")

100%|██████████| 30/30 [00:22<00:00,  1.31it/s]


[[6. 3. 1.]
 [5. 2. 3.]
 [0. 1. 9.]]
Accuracy: 56.67% | F1: (57.14%, 25.00%, 78.26%) | Macro-F1: 53.47%





#### 2-shot per class:

Try 2 in-context learning examples per class.

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots = {"entailment": 2, "neutral": 2, "contradiction": 2}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping, predictions, gold_answers, task_name="1_1_shots2")

100%|██████████| 30/30 [00:22<00:00,  1.33it/s]


[[6. 3. 1.]
 [4. 3. 3.]
 [0. 2. 8.]]
Accuracy: 56.67% | F1: (60.00%, 33.33%, 72.73%) | Macro-F1: 55.35%





#### 3-shot per class:

Try 3 in-context learning examples per class.

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots = {"entailment": 3, "neutral": 3, "contradiction": 3}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping, predictions, gold_answers, task_name="1_1_shots3")

100%|██████████| 30/30 [00:22<00:00,  1.31it/s]


[[7. 3. 0.]
 [4. 1. 5.]
 [1. 0. 9.]]
Accuracy: 56.67% | F1: (63.64%, 14.29%, 75.00%) | Macro-F1: 50.97%





**Questions:**

1. Can the model handle the NLI task well without in-context examples for learning (i.e., under the 0-shot setting)?
2. For detecting which class are the in-context examples most helpful? And for which class are they least helpful?
3. Do more in-context examples always lead to better performance?

**Reference Answers:**

1. No. Without in-context examples, the model does not produce the required verbalizers (i.e., entailment, neutral and contradiction), so every prediction falls back to the default class (neutral).
2. In-context examples are most helpful for detecting the contradiction class and least helpful for detecting the neutral class, probably because neutral samples are prone to being read as having some entailment or contradiction relation.
3. No, more in-context examples do not necessarily lead to better performance: accuracy stays at 56.67% from 1 to 3 shots per class, and macro-F1 peaks at 2 shots.

<a name="12"></a>
### **1.2 Effect of Neutral In-Context Examples**

Try 3-shot in-context examples for the entailment and contradiction classes, but do not provide any examples for the neutral class.

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots = {"entailment": 3, "neutral": 0, "contradiction": 3}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping, predictions, gold_answers, task_name="1_2")

100%|██████████| 30/30 [00:23<00:00,  1.29it/s]


[[8. 1. 1.]
 [3. 2. 5.]
 [1. 0. 9.]]
Accuracy: 63.33% | F1: (72.73%, 30.77%, 72.00%) | Macro-F1: 58.50%





**Question:** What do you find here?

**Reference Answer:** Compared with the 3-shot-per-class setting in 1.1, the model's performance on detecting entailment queries improves (F1 63.64% → 72.73%), and removing the neutral examples does not hurt its performance on detecting neutral queries (F1 30.77% vs. 14.29%). So neutral in-context examples may not be that effective in helping detect the neutral queries in this task, and may even cause confusion when detecting the entailment queries.

<a name="13"></a>
### **1.3 Play with Different Verbalizers**

In this part, you will try to use different verbalizers for this NLI classification task. Instead of using *entailment*, *neutral* and *contradiction*, you will try the following two alternatives:

- *positive*, *unrelated* and *negative*
- *a*, *b* and *c*

Build data with the above two different verbalizers.

In [None]:
mapping_to_pun = {"entailment": "positive", "neutral": "unrelated", "contradiction": "negative"}
train_samples_pun = {"positive": [], "unrelated": [], "negative": []}
test_data_pun = []

mapping_to_abc = {"entailment": "a", "neutral": "b", "contradiction": "c"}
train_samples_abc = {"a": [], "b": [], "c": []}
test_data_abc = []

for nli_class, samples in train_samples.items():
    
    nli_class_pun = mapping_to_pun[nli_class]
    nli_class_abc = mapping_to_abc[nli_class]
    
    for sample in samples:
        
        sample_pun = " ".join(sample.split(" ")[:-1] + [nli_class_pun])
        train_samples_pun[nli_class_pun].append(sample_pun)
        
        sample_abc = " ".join(sample.split(" ")[:-1] + [nli_class_abc])
        train_samples_abc[nli_class_abc].append(sample_abc)
    
for query in test_data:
    
    query_pun = deepcopy(query)
    query_pun["gold_answer"] = mapping_to_pun[query["gold_answer"]]
    test_data_pun.append(query_pun)
    
    query_abc = deepcopy(query)
    query_abc["gold_answer"] = mapping_to_abc[query["gold_answer"]]
    test_data_abc.append(query_abc)
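The relabeling above replaces the last space-separated token of each sample. As an alternative sketch, the hypothetical helper `relabel` below replaces whatever follows the final `Answer:` marker; it assumes every training sample ends with `Answer: <label>`, as in the samples printed in this exercise.

```python
def relabel(sample, new_label):
    """Replace the text after the final 'Answer:' marker with new_label."""
    head, _, _ = sample.rpartition("Answer:")
    return f"{head}Answer: {new_label}"


sample = ("I know lawyers are always dreadfully careful.\n"
          "I'm well aware that lawyers are always very careful.\n"
          "Answer: entailment")

relabeled = relabel(sample, "positive")
print(relabeled.splitlines()[-1])  # Answer: positive
```

For single-word labels both approaches give the same string; the marker-based version would also work if a label contained a space.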

You can take a look at the processed training and testing data with different verbalizers.

Data with verbalizers *positive*, *unrelated* and *negative*

In [None]:
print("Training Samples:")
print(train_samples_pun["positive"][0])
print("\n")

print("Testing Query:")
print(test_data_pun[0]["query"])
print("\n")

print("Gold Answer:")
print(test_data_pun[0]["gold_answer"])

Training Samples:
I know lawyers are always dreadfully careful.
I'm well aware that lawyers are always very careful.
Answer: positive


Testing Query:
The new rights are nice enough
Everyone really likes the newest benefits 
Answer:


Gold Answer:
unrelated


Data with verbalizers *a*, *b* and *c*

In [None]:
print("Training Samples:")
print(train_samples_abc["a"][0])
print("\n")

print("Testing Query:")
print(test_data_abc[0]["query"])
print("\n")

print("Gold Answer:")
print(test_data_abc[0]["gold_answer"])

Training Samples:
I know lawyers are always dreadfully careful.
I'm well aware that lawyers are always very careful.
Answer: a


Testing Query:
The new rights are nice enough
Everyone really likes the newest benefits 
Answer:


Gold Answer:
b


#### Re-do the classification with new verbalizers.

Try the verbalizers *positive*, *unrelated* and *negative* under the 2-shot setting from 1.1.

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots_pun = {"positive": 2, "unrelated": 2, "negative": 2}
mapping_pun = {"positive": 0, "unrelated": 1, "negative": 2}

run(train_samples_pun, test_data_pun, class_shots_pun, mapping_pun,
    predictions, gold_answers, default_class="unrelated", task_name="1_3_pun")

100%|██████████| 30/30 [00:22<00:00,  1.31it/s]


[[9. 1. 0.]
 [5. 3. 2.]
 [1. 0. 9.]]
Accuracy: 70.00% | F1: (72.00%, 42.86%, 85.71%) | Macro-F1: 66.86%





Try the verbalizers *a*, *b* and *c* under the 2-shot setting from 1.1.

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

class_shots_abc = {"a": 2, "b": 2, "c": 2}
mapping_abc = {"a": 0, "b": 1, "c": 2}

run(train_samples_abc, test_data_abc, class_shots_abc, mapping_abc,
    predictions, gold_answers, default_class="b", task_name="1_3_abc")

100%|██████████| 30/30 [00:08<00:00,  3.36it/s]


[[4. 6. 0.]
 [2. 6. 2.]
 [0. 7. 3.]]
Accuracy: 43.33% | F1: (50.00%, 41.38%, 40.00%) | Macro-F1: 43.79%





**Questions:**

1. Are verbalizers *positive*, *unrelated* and *negative* better or worse than the original ones?
2. Are verbalizers *a*, *b* and *c* better or worse than the original ones?

**Reference Answers:**

1. They are better, likely because "unrelated" is more easily distinguished from "positive" and "negative" than "neutral" is from "entailment" and "contradiction". In-context examples of the "unrelated" class therefore cause less confusion when detecting the "positive" and "negative" classes.
2. They are worse, because "a", "b" and "c" carry no meaning related to "entailment", "neutral" and "contradiction", so the model learns less well to distinguish the three classes.

<a name="14"></a>
### **1.4 Add Instructions**

In this part, you will try adding a high-level task instruction to the model input.

Try 1-shot in-context learning with an overall task introduction: "Guess whether the given two sentences have an entailment, neutral or contradiction relation."

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

introduction = "Guess whether the given two sentences have an entailment, neutral or contradiction relation."
class_shots = {"entailment": 1, "neutral": 1, "contradiction": 1}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping,
    predictions, gold_answers, introduction=introduction, task_name="1_4_intro1")

100%|██████████| 30/30 [00:23<00:00,  1.29it/s]


[[8. 1. 1.]
 [4. 3. 3.]
 [1. 0. 9.]]
Accuracy: 66.67% | F1: (69.57%, 42.86%, 78.26%) | Macro-F1: 63.56%





Try using a more specific task introduction as the instruction: "Guess whether the second statement is entailed by the first statement, contradicts the first statement, or is neutral to the first statement."

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

introduction = "Guess whether the second statement is entailed by the first statement, contradicts the first statement, or is neutral to the first statement."
class_shots = {"entailment": 1, "neutral": 1, "contradiction": 1}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping,
    predictions, gold_answers, introduction=introduction, task_name="1_4_intro2")

100%|██████████| 30/30 [00:22<00:00,  1.34it/s]


[[10.  0.  0.]
 [ 5.  3.  2.]
 [ 1.  1.  8.]]
Accuracy: 70.00% | F1: (76.92%, 42.86%, 80.00%) | Macro-F1: 66.59%





Try to make the model think that it is an expert at this task! Use the instruction: "Pretend that you are an expert of logic. Tell us whether the second statement is entailed by the first statement, contradicts the first statement, or is neutral to the first statement."

In [None]:
predictions = []
gold_answers = []

In [None]:
random.seed(seed)
np.random.seed(seed)

introduction = "Pretend that you are an expert of logic. Tell us whether the second statement is entailed by the first statement, contradicts the first statement, or is neutral to the first statement."
class_shots = {"entailment": 1, "neutral": 1, "contradiction": 1}
mapping = {"entailment": 0, "neutral": 1, "contradiction": 2}

run(train_samples, test_data, class_shots, mapping,
    predictions, gold_answers, introduction=introduction, task_name="1_4_intro3")

100%|██████████| 30/30 [00:22<00:00,  1.33it/s]


[[9. 1. 0.]
 [4. 5. 1.]
 [1. 2. 7.]]
Accuracy: 70.00% | F1: (75.00%, 55.56%, 77.78%) | Macro-F1: 69.44%





**Question:** Does an additional task instruction help? Do more specific instructions tend to be better? Is it effective to give the model a role (e.g., expert) before doing the task?

**Reference Answer:** GPT-3.5 is tuned to be very good at following instructions. So adding task instructions at the beginning of the input is often effective in improving the model's performance, especially under very low-shot settings. More specific instructions can often lead to better performance, since richer context helps the model learn more about the task. Interestingly, giving the model a role-play setting related to the task can often also achieve improvements.

<a name="2"></a>
## **PART 2: In-Context Learning for Story Ending Generation**
---

In this part, you will switch to using the GPT-3.5 model to solve the story ending generation (SEG) task based on in-context learning. For this task, the model is given four lines of story plot and needs to generate the fifth line as an ending.

You can take a look at the training data used for sampling few-shot in-context examples, and the testing data used to query the GPT-3.5 language model for story completion (along with the reference story ending).

In [None]:
with open(ROOT_PATH+"story_generation/train_generation.json", "r") as f:
    train_samples_sg = json.load(f)
with open(ROOT_PATH+"story_generation/test_generation.json", "r") as f:
    test_data_sg = json.load(f)

print("Training Samples:")
print(train_samples_sg[0])
print("\n")

print("Testing Query:")
print(test_data_sg[0]["query"])
print("\n")

print("Reference Story Ending:")
print(test_data_sg[0]["reference_ending"])

Training Samples:
Dan's parents were overweight.
Dan was overweight as well.
The doctors told his parents it was unhealthy.
His parents understood and decided to make a change.
Output: They got themselves and Dan on a diet.


Testing Query:
A few days ago I decided to take my dog Sable for a walk.
She is a half-pit bull half-bulldog with a lot of strength.
After I got her leash on I opened the garage to head outside.
She tried bolting out of the garage and dragged me along with her.
Output:


Reference Story Ending:
After the half hour walk my muscles were sore as if I had worked out.


Here is the GPT-3.5 hyperparameter setting for this SEG task.

We set *max_tokens* to be 20, which is supposed to be the maximum length of a story ending (i.e., a sentence).

We also set *temperature* and *top_p* to 0.9 in order to encourage the model's creativity and make it generate more diverse story endings.

In [None]:
model_args={"max_tokens": 20, "temperature": 0.9, "top_p": 0.9, "presence_penalty": 0.0, "frequency_penalty": 0.0}

You will evaluate the model's generation performance with [METEOR](https://aclanthology.org/W05-0909.pdf). This metric was originally proposed to evaluate machine translation quality, and was later widely used to evaluate open-domain text generation (e.g., dialogues and stories). It measures the alignment (i.e., matches) between words in the hypothesis and the reference, by sequentially applying exact match, stemmed match and WordNet-based synonym match.

In [None]:
def evaluate_sed(generation, reference):
    
    ref_tokens = word_tokenize(reference)
    gen_tokens = word_tokenize(generation)
    score = meteor_score([ref_tokens], gen_tokens)
    
    return score
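To get a feeling for the core of the metric, here is a deliberately simplified sketch that only counts exact unigram matches and combines precision and recall with the recall-weighted harmonic mean from the METEOR paper. The helper `unigram_fmean` is hypothetical; it leaves out stemming, synonym matching and the fragmentation penalty, so its values will differ from those of `meteor_score`.

```python
from collections import Counter


def unigram_fmean(reference_tokens, hypothesis_tokens):
    """Exact-match unigram precision and recall, combined as
    Fmean = 10 * P * R / (R + 9 * P), which weights recall more heavily."""
    overlap = Counter(reference_tokens) & Counter(hypothesis_tokens)
    matches = sum(overlap.values())  # clipped count of shared tokens
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis_tokens)
    recall = matches / len(reference_tokens)
    return 10 * precision * recall / (recall + 9 * precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat".split()

# precision = 3/3, recall = 3/6
score = unigram_fmean(reference, hypothesis)
print(round(score, 4))  # 0.5263
```

Because recall dominates, a short ending that covers only half of the reference is scored much lower than its perfect precision might suggest.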

You will use the following function to perform GPT-3.5 generation on the SEG task based on in-context learning.

In [None]:
def gpt3_seg(train_samples, test_data, shot, generations, queries, reference_endings,
             introduction=None, task_name="none"):

    '''
    train_samples: training data for sampling in-context examples
    test_data: testing queries (with reference story endings)
    shot: number of in-context examples
    generations: cache for saving the model generations
    queries: cache for saving the input queries (i.e., four-line stories to be completed)
    reference_endings: cache for reference story endings
    introduction: additional task introduction for prompting
    task_name: task name for creating chat sessions
    '''
    
    # randomly sample in-context examples and shuffle them
    examples = random.sample(train_samples, shot)
    random.shuffle(examples)
    
    # add task introduction (if it exists) before in-context examples for better prompting
    if introduction:
        examples.insert(0, introduction)

    for qid, query in enumerate(tqdm(test_data)):
        
        if qid < len(generations):  # skip this query if its model generated story ending is already saved in cache
            continue

        # concatenate all the in-context examples with the query, to get the final input demonstration
        demonstration = "\n\n".join(examples+[query["query"]])

        # create a chat session using our GPT-3.5 wrapper and query the model to get the story ending generation
        chat = Chat.create(name=task_name+"_"+str(qid))
        message = chat.ask(demonstration, model_args=model_args)
        
        # save the model generation, story query and reference ending in caches
        generations.append(message.content)
        queries.append(query["query"])
        reference_endings.append(query["reference_ending"])

You will run the following function to perform GPT-3.5 generation and evaluation.

In [None]:
def run(train_samples, test_data, shot, generations, queries, reference_endings, introduction=None, task_name="none"):
    
    try:
        
        gpt3_seg(train_samples, test_data, shot,
                 generations, queries, reference_endings,
                 introduction=introduction, task_name=task_name)

        meteor_scores = []
        print()

        for qid, query in enumerate(queries):

            meteor = evaluate_sed(generations[qid], reference_endings[qid])
            print("Query "+str(qid+1)+f' METEOR Score: {meteor*100:.2f}') 

            meteor_scores.append(meteor)

        meteor_avg = sum(meteor_scores)/len(meteor_scores)
        print(f'Average METEOR Score: {meteor_avg*100:.2f}')
    
    except Exception as error:  # the OpenAI ChatGPT endpoint (gpt-3.5-turbo) may get stuck from time to time when it receives too many queries
        
        print(error)


<a name="21"></a>
### **2.1 Zero-Shot Generation**

Try 0-shot story ending generation (i.e., without any in-context examples).

Create caches for saving model predictions, queries and reference story endings.

In [None]:
generations_21 = []
queries_21 = []
reference_endings_21 = []

Run the generation and evaluation.

**Note:** As in Part 1, re-run the cell if it gets stuck.

In [None]:
random.seed(seed)
np.random.seed(seed)

run(train_samples_sg, test_data_sg, 0, generations_21, queries_21, reference_endings_21, task_name="2_1_shots0")

100%|██████████| 10/10 [00:13<00:00,  1.32s/it]



Query 1 METEOR Score: 12.27
Query 2 METEOR Score: 51.74
Query 3 METEOR Score: 4.55
Query 4 METEOR Score: 0.00
Query 5 METEOR Score: 45.43
Query 6 METEOR Score: 21.74
Query 7 METEOR Score: 36.13
Query 8 METEOR Score: 25.66
Query 9 METEOR Score: 18.18
Query 10 METEOR Score: 9.17
Average METEOR Score: 22.49


You can print the saved caches and compare the quality of the reference and model-generated story endings.

In [None]:
print("Queries:\n"+queries_21[0])
print("GPT-3.5 Generation:\n"+generations_21[0])
print("Reference:\n"+reference_endings_21[0])

Queries:
A few days ago I decided to take my dog Sable for a walk.
She is a half-pit bull half-bulldog with a lot of strength.
After I got her leash on I opened the garage to head outside.
She tried bolting out of the garage and dragged me along with her.
Output:
GPT-3.5 Generation:
A few days ago, as I was taking my dog Sable for a walk, I quickly realized
Reference:
After the half hour walk my muscles were sore as if I had worked out.


<a name="22"></a>
### **2.2 Few-Shot Generation**

Try adding 5-shot in-context examples for this generation task.

In [None]:
generations_22 = []
queries_22 = []
reference_endings_22 = []

In [None]:
random.seed(seed)
np.random.seed(seed)

run(train_samples_sg, test_data_sg, 5, generations_22, queries_22, reference_endings_22, task_name="2_2_shots5")

100%|██████████| 10/10 [00:12<00:00,  1.21s/it]


Query 1 METEOR Score: 6.49
Query 2 METEOR Score: 63.05
Query 3 METEOR Score: 9.80
Query 4 METEOR Score: 60.50
Query 5 METEOR Score: 46.54
Query 6 METEOR Score: 61.58
Query 7 METEOR Score: 23.45
Query 8 METEOR Score: 30.68
Query 9 METEOR Score: 10.00
Query 10 METEOR Score: 14.71
Average METEOR Score: 32.68





**Question:** Do few-shot examples improve the model's story ending generation quality?

**Reference Answer:** According to the automatic evaluation, yes, but such evaluation (based on surface-form matching with the references) may not correlate with human judgments. You are encouraged to print out the model generations in 2.1 and 2.2, and manually compare them to see whether there are true improvements.
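The average also hides a lot of per-query variation. As a quick sketch, the snippet below copies the per-query METEOR scores reported in 2.1 and 2.2 above into two lists (the variable names are assumptions for illustration) and counts how many queries actually improved.

```python
# per-query METEOR scores (x100) copied from the outputs of 2.1 and 2.2
zero_shot = [12.27, 51.74, 4.55, 0.00, 45.43, 21.74, 36.13, 25.66, 18.18, 9.17]
five_shot = [6.49, 63.05, 9.80, 60.50, 46.54, 61.58, 23.45, 30.68, 10.00, 14.71]

diffs = [f - z for z, f in zip(zero_shot, five_shot)]
improved = sum(d > 0 for d in diffs)
mean_diff = sum(diffs) / len(diffs)

print(f"Improved queries: {improved} of {len(diffs)}")  # 7 of 10
print(f"Mean change: {mean_diff:+.2f}")
for qid, d in enumerate(diffs, start=1):
    print(f"Query {qid}: {d:+.2f}")
```

Three of the ten queries score lower with 5 shots, and a large part of the average gain comes from a few queries, so a manual look at the generations is worthwhile.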

You can print the saved caches and make more comparisons between the model generations in 2.1 and 2.2.

In [None]:
print("Queries:\n"+queries_22[0])
print("GPT-3.5 Generation:\n"+generations_22[0])
print("Reference:\n"+reference_endings_22[0])

Queries:
A few days ago I decided to take my dog Sable for a walk.
She is a half-pit bull half-bulldog with a lot of strength.
After I got her leash on I opened the garage to head outside.
She tried bolting out of the garage and dragged me along with her.
Output:
GPT-3.5 Generation:
I learned to hold onto her leash more securely.
Reference:
After the half hour walk my muscles were sore as if I had worked out.


<a name="23"></a>
### **2.3 Add Instructions**

Try 0-shot in-context learning with an overall task introduction.

In [None]:
generations_23 = []
queries_23 = []
reference_endings_23 = []

In [None]:
random.seed(seed)
np.random.seed(seed)

introduction = "Generate an ending of the given story."
run(train_samples_sg, test_data_sg, 0, generations_23, queries_23, reference_endings_23,
    introduction=introduction, task_name="2_3_intro")

100%|██████████| 10/10 [00:13<00:00,  1.33s/it]


Query 1 METEOR Score: 15.34
Query 2 METEOR Score: 30.77
Query 3 METEOR Score: 29.22
Query 4 METEOR Score: 21.62
Query 5 METEOR Score: 19.69
Query 6 METEOR Score: 61.81
Query 7 METEOR Score: 38.12
Query 8 METEOR Score: 17.62
Query 9 METEOR Score: 22.73
Query 10 METEOR Score: 13.76
Average METEOR Score: 27.07





**Question:** Does overall task instruction help improve the model's 0-shot story ending generation quality?

**Reference Answer:** According to the automatic evaluations, yes, but similar to 2.2, you may need further comparisons between the model generations in 2.1 and 2.3 to verify this conclusion.

You can print the saved caches and make more comparisons between the model generations in 2.1 and 2.3.

In [None]:
print("Queries:\n"+queries_23[0])
print("GPT-3.5 Generation:\n"+generations_23[0])
print("Reference:\n"+reference_endings_23[0])

Queries:
A few days ago I decided to take my dog Sable for a walk.
She is a half-pit bull half-bulldog with a lot of strength.
After I got her leash on I opened the garage to head outside.
She tried bolting out of the garage and dragged me along with her.
Output:
GPT-3.5 Generation:
I stumbled and fell to the ground, scraping my knee and elbow in the process. Sable didn
Reference:
After the half hour walk my muscles were sore as if I had worked out.
