# Project 5: Prompting With Large Language Models

In this project, we learn how to solve tasks by prompting existing LLM APIs. We will experiment with zero-shot and few-shot prompting, and with different methods of example selection, on a semantic parsing task.

First we install and import the required dependencies. These include:
* `openai` as our API for querying LLMs (you are free to choose to use a different LLM API if you would like)


In [1]:
%%capture
%pip install openai


If you are using the OpenAI API, create an account and copy your secret API key from `https://platform.openai.com/account/api-keys`. Store it in an environment variable or a key management service so we can load it below. Make sure to keep the key secret. You may use a different LLM service if you prefer, e.g. Cohere (https://cohere.ai/).

In [2]:
import os
# Read the key from the environment; never hard-code a secret in the notebook.
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
assert OPENAI_API_KEY, "Set the OPENAI_API_KEY environment variable first."

If you have successfully authorized, then you should be able to see a list of available models by running the command below.

In [3]:
import openai
openai.api_key = OPENAI_API_KEY
openai.Model.list()

<OpenAIObject list at 0x7361137e4890> JSON: {
  "data": [
    {
      "created": 1649358449,
      "id": "babbage",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1669085501,
          "group": null,
          "id": "modelperm-49FUp5v084tBB49tC4z8LPH5",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "babbage"
    },
    {
      "created": 1649359874,
      "id": "davinci",
      "object": "model",
      "owned_by": "openai",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sa

We will now evaluate the LLM on a semantic parsing task. Geoquery is a dataset that contains information about the geography of the United States. For more information, please see: https://www.cs.utexas.edu/users/ml/nldata/geoquery.html. We will experiment with the compositional split introduced in (Keysers et al., 2020) https://openreview.net/forum?id=SygcCnNKwr. First, let's download the train and validation data. The goal for the LLM is to take in English queries about US geography (population, elevation, etc.) and output a formal representation of the query.

In [4]:
!wget https://github.com/kl2806/geoquery/raw/main/data.zip -O data.zip
!unzip -o data.zip

--2023-04-24 00:37:36--  https://github.com/kl2806/geoquery/raw/main/data.zip
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kl2806/geoquery/main/data.zip [following]
--2023-04-24 00:37:36--  https://raw.githubusercontent.com/kl2806/geoquery/main/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4493 (4.4K) [application/zip]
Saving to: ‘data.zip’


2023-04-24 00:37:37 (27.7 MB/s) - ‘data.zip’ saved [4493/4493]

Archive:  data.zip
  inflating: train.tsv               
  inflating: dev.tsv                 


Let's take a look at the data. Each instance should consist of an English utterance and a formal representation of that utterance, separated by a tab.

In [5]:
!head -n 5 ./train.tsv

how tall is the highest point in m0	answer ( elevation_1 ( highest ( intersection ( place , loc_2 ( m0 ) ) ) ) )
what is the largest city in m0	answer ( largest ( intersection ( city , loc_2 ( m0 ) ) ) )
what states border states that the m0 runs through	answer ( intersection ( state , next_to_2 ( intersection ( state , traverse_1 ( m0 ) ) ) ) )
what is the maximum elevation of m0	answer ( highest ( intersection ( place , loc_2 ( m0 ) ) ) )
what is the population of m0	answer ( population_1 ( m0 ) )


In [6]:
import csv
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    query: str
    program: str

training_examples: List[Example] = []

with open('./train.tsv', 'r') as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    for query, program in reader:
        training_examples.append(Example(query, program))

dev_examples: List[Example] = []
with open('./dev.tsv', 'r') as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    for query, program in reader:
        dev_examples.append(Example(query, program))

print(f"Num training examples: {len(training_examples)}")
print(f"Num dev examples: {len(dev_examples)}")
# print(training_examples)


Num training examples: 440
Num dev examples: 40


Now, let's define a function that uses the OpenAI API to output the semantic parse. As a first cut, let's just describe the task to the model in English and return that prompt from `get_static_prompt`. Then implement `parse_example`, which should call the LLM API and return the semantic parse. `gpt-3.5-turbo`, the API model corresponding to ChatGPT, should work well for this task.

In [7]:
from typing import Callable, List

def get_static_prompt(utterance: str) -> List[dict]:
    """Return a prompt that doesn't change between different examples"""
    """YOUR CODE HERE"""
    content = "Please parse the following sentence: " + utterance
    # A chat prompt is a list of {"role": ..., "content": ...} messages.
    return [{"role": "system", "content": "You are a chatbot"},
            {"role": "user", "content": content}]

def parse_example(model: str, utterance: str, prompt_method: Callable[..., List[dict]], **kwargs) -> str:
    """Return the semantic parse of the utterance"""
    # Build the chat messages, forwarding any extra arguments (e.g. training examples).
    prompt = prompt_method(utterance, **kwargs)
    """YOUR CODE HERE"""
    completions = openai.ChatCompletion.create(
        model=model,
        messages=prompt,
    )
    # Return the reply text of the first returned choice.
    message = completions.choices[0].message.content
    return message

parse = parse_example(model="gpt-3.5-turbo",
                      utterance="what river runs through m0",
                      prompt_method=get_static_prompt)

print(parse)


I'm sorry, but "m0" is not a valid English word or location. Could you please provide more context or specify the correct name or spelling of the location you are referring to?


With just an English description, the output probably does not look much like the target language that we want. Let's construct a prompt with some examples from our training set and see how it does. Implement the function below to uniformly sample examples from the training set and use them as a few-shot prompt to the model.
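There is more than one way to lay out few-shot examples for a chat model. The sketch below shows one common alternative to packing everything into a single string: each demonstration becomes its own user/assistant turn. The function name `few_shot_messages` and the default instruction text are illustrative assumptions, not part of the assignment, and this layout is not guaranteed to score better on this task.

```python
from typing import Dict, List, Tuple

def few_shot_messages(demos: List[Tuple[str, str]],
                      utterance: str,
                      instruction: str = "Translate each question into its formal representation.",
                      ) -> List[Dict[str, str]]:
    """Build chat messages with one user/assistant turn per demonstration."""
    # Hypothetical instruction text; adjust it to describe your own task.
    messages = [{"role": "system", "content": instruction}]
    for query, program in demos:
        messages.append({"role": "user", "content": query})
        messages.append({"role": "assistant", "content": program})
    # The final user turn is the query we actually want parsed.
    messages.append({"role": "user", "content": utterance})
    return messages
```

Calling it with pairs such as `(example.query, example.program)` yields a message list that can be passed as `messages` to the chat API in the same way as the prompts in this notebook.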

In [8]:
import random
from typing import List

def random_sample_prompt(utterance: str, training_examples: List[Example], num_samples: int = 10) -> List[dict]:
    """Return a prompt for a given example"""
    """YOUR CODE HERE"""
    # Draw the demonstrations uniformly at random, without replacement.
    samples = random.sample(training_examples, k=min(num_samples, len(training_examples)))

    # One "sentence + answer" entry per demonstration, then the open query.
    content = []
    for sample in samples:
        content.append("Please parse the following sentence: " + sample.query + ". Answer: " + sample.program)
    content.append("Please parse the following sentence: " + utterance + ". Answer:")
    content_string = " ".join(content)

    # Note: the few-shot text is sent as a second "system" message here,
    # whereas the later prompt builders send it with the "user" role.
    prompt = [{"role": "system", "content": "You are a chatbot"},
              {"role": "system", "content": content_string}]
    return prompt
        

prompt = random_sample_prompt(utterance="what river runs through m0",
                              training_examples=training_examples)
print("Uniform random sampling prompt: \n", prompt)


Uniform random sampling prompt: 
 [{'role': 'system', 'content': 'You are a chatbot'}, {'role': 'system', 'content': 'Please parse the following sentence: which states border m0. Answer: answer ( intersection ( state , next_to_2 ( m0 ) ) ) Please parse the following sentence: what states does the m0 river run through. Answer: answer ( intersection ( state , traverse_1 ( intersection ( river , m0 ) ) ) ) Please parse the following sentence: what is the highest point in m0. Answer: answer ( highest ( intersection ( place , loc_2 ( m0 ) ) ) ) Please parse the following sentence: what are the capital cities of the states which border m0. Answer: answer ( intersection ( capital , intersection ( city , loc_2 ( intersection ( state , next_to_2 ( m0 ) ) ) ) ) ) Please parse the following sentence: what is the largest river in m0 state. Answer: answer ( longest ( intersection ( river , loc_2 ( intersection ( state , m0 ) ) ) ) ) Please parse the following sentence: what is the population of m0.

In [9]:
parse = parse_example(model="gpt-3.5-turbo",
                      utterance="what river runs through m0",
                      prompt_method=random_sample_prompt,
                      training_examples=training_examples)
print(parse)

answer ( intersection ( river , loc_2 ( m0 ) ) )


Now, let's evaluate our uniform sampling prompt on the validation set. If you run into rate limit issues with the API, you may want to retry with backoff or consult one of the solutions here: https://platform.openai.com/docs/guides/rate-limits/error-mitigation.
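As a rough illustration of retrying with backoff, the sketch below wraps any zero-argument callable and retries it with exponentially growing, jittered delays. The helper name `call_with_backoff` and its parameters are hypothetical, and which exception class signals a rate limit depends on the client library and version you use, so pass the appropriate class via `retry_on`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def call_with_backoff(fn: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying with exponentially growing, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt; jitter spreads out retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
    raise AssertionError("unreachable")
```

For example, a single request could be wrapped as `call_with_backoff(lambda: parse_example(model, query, prompt_fn))`, narrowing `retry_on` to the rate-limit error of your client so that genuine bugs are not retried.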

In [10]:
import tqdm

def get_predictions(model: str,
                    evaluation_examples: List[Example],
                    prompt_creation_function: Callable[..., List[dict]],
                    **kwargs) -> List[str]:
    """Get a list of predictions from the evaluation examples"""
    predictions = []
    for example in tqdm.tqdm(evaluation_examples):
        # Extra keyword arguments (e.g. training_examples) go to the prompt builder.
        predicted_program = parse_example(model, example.query, prompt_creation_function, **kwargs)
        predictions.append(predicted_program)
    return predictions

def evaluate(predictions: List[str], evaluation_examples: List[Example]) -> float:
    """Evaluate the accuracy of the predictions"""
    correct = 0
    for prediction, example in zip(predictions, evaluation_examples):
        # Exact string match: any extra whitespace or text counts as an error.
        if prediction == example.program:
            correct += 1
    return correct / len(evaluation_examples)

In [11]:
random_sample_predictions = get_predictions(model="gpt-3.5-turbo",
                                            evaluation_examples=dev_examples,
                                            prompt_creation_function=random_sample_prompt,
                                            training_examples=training_examples)


100%|██████████| 40/40 [01:05<00:00,  1.65s/it]


The model should get at least 15\% exact match with randomly sampled examples.
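Exact match is strict: a prediction that differs from the target only in spacing or a trailing newline counts as wrong. As a diagnostic for your error analysis (not a replacement for the required metric), the sketch below compares programs after collapsing whitespace. Both helper names are hypothetical.

```python
from typing import List

def normalize_program(program: str) -> str:
    """Collapse runs of whitespace and strip both ends (hypothetical helper)."""
    return " ".join(program.split())

def normalized_exact_match(predictions: List[str], targets: List[str]) -> float:
    """Fraction of predictions equal to their target after normalization."""
    if not targets:
        return 0.0
    correct = sum(normalize_program(p) == normalize_program(t)
                  for p, t in zip(predictions, targets))
    return correct / len(targets)
```

A gap between this score and strict exact match suggests that some errors are formatting issues rather than wrong parses.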

In [12]:
# Save the predictions as `random_predictions.txt`
with open('random_predictions.txt', 'w') as f:
    for prediction in random_sample_predictions:
        f.write(prediction)
        f.write('\n')

exact_match = evaluate(random_sample_predictions, dev_examples)
print(f"Exact match for uniform sampling prompt: {exact_match}")

Exact match for uniform sampling prompt: 0.275


Randomly sampling examples does not consider the utterance when selecting examples. Next, we will try to pick examples for the prompt based on embedding similarity. First, let's install `sentence-transformers`, which we will use to get embeddings of the utterance.

In [13]:
%%capture
%pip install -U sentence-transformers

Now, let's construct embeddings using a small pretrained model for each example in our training data.

In [14]:
from sentence_transformers import SentenceTransformer, util
import torch

def get_corpus(model_name: str, examples: List[Example]) -> torch.Tensor:
    """Return a tensor of the corpus embeddings, the size of the returned tensor should be (num_examples, embedding_size)"""
    """YOUR CODE HERE"""
    model = SentenceTransformer(model_name)
    corpus = [example.query for example in examples]
    # encode() returns a NumPy array by default; request a tensor to match the
    # declared return type, and keep it on the CPU for the similarity step.
    corpus_embeddings = model.encode(corpus, convert_to_tensor=True).cpu()
    return corpus_embeddings

corpus_embeddings = get_corpus('all-MiniLM-L6-v2', training_examples)



With the embeddings of all the training data, we can construct a function that takes in an utterance and outputs a prompt containing the examples with the highest cosine similarity to the utterance.
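To make the selection step concrete, here is a minimal pure-Python sketch of cosine similarity and top-k retrieval over plain lists of floats. It is only meant to show the computation; the helper names are hypothetical, and in the notebook itself the library's tensor-based similarity function does the same job far more efficiently.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_indices(query: Sequence[float],
                  corpus: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Indices of the k corpus vectors most similar to query, best first."""
    scores = [cosine_similarity(query, vec) for vec in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

The returned indices can be used to look up the matching training examples when assembling the prompt.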

In [15]:
def get_nearest_neighbor_prompt(utterance: str,
                                training_examples: List[Example],
                                embedding_model: str,
                                corpus_embeddings: torch.Tensor,
                                num_samples: int = 10) -> List[dict]:
    """YOUR CODE HERE"""
    # Get the embedding of the input utterance.
    # (The model is reloaded on every call; loading it once outside would be faster.)
    model = SentenceTransformer(embedding_model)
    utterance_embedding = model.encode(utterance)

    # Compute cosine similarity between the input utterance embedding and the corpus embeddings
    similarities = util.cos_sim(utterance_embedding, corpus_embeddings)

    # Indices of the num_samples most similar examples, in ascending order of
    # similarity, so the closest example ends up right before the query.
    top_indices = similarities.argsort()[0][-num_samples:].tolist()

    content = []
    for i in top_indices:
        content.append("Please parse the following sentence: " + training_examples[i].query + ". Answer: " + training_examples[i].program)

    content.append("Please parse the following sentence: " + utterance + ". Answer:")
    content_string = " ".join(content)
    prompt = [{"role": "system", "content": "You are a chatbot"},
              {"role": "user", "content": content_string}]
    return prompt


Evaluate the similarity-based prompt on the validation data and save your predictions as `similarity_predictions.txt`, with one prediction per line of the validation set. With similarity-based example selection, we should get at least 20\% exact match on the validation set. Note that there could be some duplicates in the training data because we are working with a version of the data where the location names are normalized to variables like `m0`.

In [16]:
similarity_predictions = get_predictions(model="gpt-3.5-turbo",
                                         evaluation_examples=dev_examples,
                                         prompt_creation_function=get_nearest_neighbor_prompt,
                                         training_examples=training_examples,
                                         embedding_model='all-MiniLM-L6-v2',
                                         corpus_embeddings=corpus_embeddings)

# Save the predictions as `similarity_predictions.txt`
with open('similarity_predictions.txt', 'w') as f:
    for prediction in similarity_predictions:
        f.write(prediction)
        f.write('\n')

exact_match = evaluate(similarity_predictions, dev_examples)
print(f"Exact match for nearest neighbor prompt: {exact_match}")

100%|██████████| 40/40 [01:47<00:00,  2.70s/it]

Exact match for nearest neighbor prompt: 0.425





Let's try to improve on nearest-neighbor-based example search. This part is more open-ended. We will now implement a different example selection method that improves over uniform random selection. You may implement an algorithm from `Diverse Demonstrations Improve In-context Compositional Generalization` (https://arxiv.org/abs/2212.06800) or come up with your own example selection method. In the report, describe the algorithm that you implemented and the intuition for why it could be effective.
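As one possible starting point for your own method, the sketch below implements greedy maximal marginal relevance (MMR), a standard heuristic that trades relevance to the query against redundancy with examples already chosen. This is a swapped-in illustration, not the algorithm from the paper cited above. The function name, the `trade_off` parameter, and the assumption that similarities are precomputed as plain lists are all hypothetical.

```python
from typing import List, Sequence

def mmr_select(query_sims: Sequence[float],
               pairwise_sims: Sequence[Sequence[float]],
               k: int,
               trade_off: float = 0.5) -> List[int]:
    """Greedy maximal-marginal-relevance selection.

    query_sims[i] is the similarity of candidate i to the query;
    pairwise_sims[i][j] is the similarity between candidates i and j.
    """
    remaining = list(range(len(query_sims)))
    selected: List[int] = []
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy: similarity to the closest already-selected item.
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return trade_off * query_sims[i] - (1.0 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `trade_off=1.0` this reduces to plain nearest-neighbor selection; lower values penalize near-duplicate demonstrations more heavily, which matters here because the normalized training data contains repeated queries.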

In [17]:
from sentence_transformers import util
def construct_diversity_prompt(utterance: str,
                               training_examples: List[Example],
                               num_samples: int = 10) -> List[dict]:
    """YOUR CODE HERE"""
    # Initialize pre-trained language model for sentence embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed every training example as its full "sentence + answer" prompt text
    example_texts = []
    for example in training_examples:
        example_texts.append("Please parse the following sentence: " + example.query + ". Answer: " + example.program)
    embeddings = model.encode(example_texts)

    # Compute embedding for utterance
    utterance_embedding = model.encode(utterance)

    # Compute cosine similarities between utterance and all training examples
    similarities = util.cos_sim(utterance_embedding, embeddings)[0]

    # Sort examples in decreasing order of similarity
    sorted_indices = similarities.argsort(descending=True)

    # Select example with highest similarity as first example in prompt
    selected_indices = [int(sorted_indices[0])]

    # Greedily add the remaining examples
    for i in range(1, num_samples):

        # Score each candidate by its mean cosine similarity to the examples selected so far
        candidate_scores = []
        for j in range(len(training_examples)):
            if j not in selected_indices:
                similarity_sum = 0
                for k in selected_indices:
                    similarity_sum += util.cos_sim(embeddings[j], embeddings[k])[0][0]
                candidate_scores.append(similarity_sum / len(selected_indices))
            else:
                # Already selected: give it a score of 0
                candidate_scores.append(0)

        # Pick the highest-scoring candidate. Because the score is a similarity,
        # this favors examples close to those already chosen, not dissimilar ones.
        max_score = max(candidate_scores)
        max_index = candidate_scores.index(max_score)
        selected_indices.append(max_index)

    # Construct prompt from selected examples
    content = []
    for i in selected_indices:
        content.append("Please parse the following sentence: " + training_examples[i].query + ". Answer: " + training_examples[i].program)
    content.append("Please parse the following sentence: " + utterance + ". Answer:")
    content_string = " ".join(content)
    prompt = [{"role": "system", "content": "You are a chatbot"},
              {"role": "user", "content": content_string}]
    return prompt

In [18]:
print(construct_diversity_prompt(utterance="what river runs through m0", training_examples=training_examples))


[{'role': 'system', 'content': 'You are a chatbot'}, {'role': 'user', 'content': 'Please parse the following sentence: what rivers are in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: what rivers are in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: what rivers are in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: what rivers are in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: which rivers are in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: rivers in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: what are all the rivers in m0. Answer: answer ( intersection ( river , loc_2 ( m0 ) ) ) Please parse the following sentence: name all the rivers in m0. Answer: answer ( intersection ( riv

In [19]:
diversity_predictions = get_predictions(model="gpt-3.5-turbo",
                                         evaluation_examples=dev_examples,
                                         prompt_creation_function=construct_diversity_prompt,
                                         training_examples=training_examples)

# Save the predictions as `diversity_predictions.txt`
with open('diversity_predictions.txt', 'w') as f:
    for prediction in diversity_predictions:
        f.write(prediction)
        f.write('\n')

exact_match = evaluate(diversity_predictions, dev_examples)
print(f"Exact match for diversity based prompt: {exact_match}")

100%|██████████| 40/40 [05:36<00:00,  8.42s/it]

Exact match for diversity based prompt: 0.5





Get the predictions and submit these as `diversity_predictions.txt`, where each line is a prediction for the development set. With the improved selection, we should get at least 35\% exact match on the validation set.

For the report, compare the predictions from the example selection methods using 1) uniform random sampling, 2) embedding-based similarity search, and 3) coverage-based selection. Compare and contrast the errors and submit your analysis as `report.pdf`.

Submit the following files:

* hw5.ipynb (this file; please rename to match)
* random_predictions.txt
* similarity_predictions.txt
* diversity_predictions.txt
* report.pdf
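
For the error comparison requested in the report, it can help to group development examples by which methods parsed them correctly. The sketch below is one hypothetical way to do that; the function name and the method labels are illustrative, and it assumes each prediction list is aligned with the gold programs.

```python
from typing import Dict, List

def compare_methods(gold: List[str],
                    predictions: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Group example indices by the set of methods that got them right.

    Keys look like "random+similarity"; "none" collects the examples
    that every method missed.
    """
    buckets: Dict[str, List[int]] = {}
    for i, target in enumerate(gold):
        winners = [name for name, preds in predictions.items() if preds[i] == target]
        key = "+".join(winners) if winners else "none"
        buckets.setdefault(key, []).append(i)
    return buckets
```

Passing the in-memory prediction lists is safer than re-reading the saved text files, since a prediction that contains a newline would span several lines on disk and shift the alignment.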