## Prompt Engineering Assignment

Complete the required tasks below. Feel free to make intermediate commits along the way, but when you've finished, commit with a message that makes it obvious you're ready for me to evaluate the work (e.g., "Ready for Review").

In [1]:
import polars as pl
import openai
import os
import random

This next cell creates an OpenAI client. GitHub (wisely) now prevents you from putting API keys in public repositories, so I'm using an environment variable to store my API key. ChatGPT should be able to help you create an environment variable on your system. On macOS you can edit your shell configuration file (`.zshrc` on my machine). On a Windows machine you'll use the environment variable editor application. The API key will look something like this fake one: `sk-VPAJxeOHF7YLCLweC0fFT18lbkFRPP5Yux15Rs6FTfrf6Mxj`.

In [2]:
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
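If the environment variable is missing, `os.getenv` simply returns `None`, and the problem may only surface later as an authentication error. A minimal sketch of a guard is below; `require_env` is a hypothetical helper name, not part of the `openai` package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of any library.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your shell configuration and restart the notebook."
        )
    return value
```

You could then create the client with `openai.OpenAI(api_key=require_env("OPENAI_API_KEY"))`.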

Now let's read the training data into a polars DataFrame.

In [3]:
df = pl.read_csv("data/training-data.txt",separator="\t")

print(df)

shape: (1_000, 3)
┌─────────────────────────────────┬────────────────┬───────────────┐
│ abstract                        ┆ location_label ┆ topic_label   │
│ ---                             ┆ ---            ┆ ---           │
│ str                             ┆ str            ┆ str           │
╞═════════════════════════════════╪════════════════╪═══════════════╡
│ Salt-induced deterioration of … ┆ NonUS          ┆ Conservation  │
│ Coyote (Canis latrans Say, 182… ┆ USA            ┆ Ecology       │
│ Over the past decade, underwat… ┆ NonUS          ┆ Technology    │
│ The Marine Strategy Framework … ┆ NonUS          ┆ Biodiversity  │
│ Background: The globally abund… ┆ NonUS          ┆ Ecology       │
│ …                               ┆ …              ┆ …             │
│ ContextMany wildlife populatio… ┆ USA            ┆ HumanWildlife │
│ The Atlantic-Gaspesie caribou … ┆ NonUS          ┆ Ecology       │
│ Shade coffee has shown great p… ┆ NonUS          ┆ Ecology       │
│ Seabirds alloc

## The Assignment

This data set includes 1000 scientific article abstracts and two additional columns: `location_label`, indicating whether the article is from a US journal, and `topic_label`. The topics were determined by the researcher, Madeline Damon, although this labeling is not necessarily the same as what she used in her research.

We're using this data set to explore prompt engineering, the process by which we modify prompts and attempt to improve performance. In this case, we're going to see how well we can get an OpenAI model to match the `topic_label` column. (It goes without saying that you don't want to pass this column into the AI model since that would be giving the game away.)

Since we're just getting started this semester, I'm going to give you some code that you can adapt for the assignment.

You can read a lot more about prompt engineering on the [OpenAI page](https://platform.openai.com/docs/guides/prompt-engineering). Note also that using this API costs money. You can read about the pricing [here](https://openai.com/api/pricing/).



## Attempt Number Zero

In this section, I'm going to build out some code to help you get started. I'll pull a random selection of 100 abstracts and ask the (old) OpenAI `gpt-3.5-turbo` model to classify them. Then I'll measure the accuracy. I'm going to use a very basic prompt, so I'm expecting the accuracy to be really low.

In [8]:
random.seed(20240825)         # Set seed for reproducibility
# Note: passing seed= directly to df.sample() is the explicit way to fix the draw in polars.
sample_df = df.sample(n=100)
print(sample_df)

shape: (100, 3)
┌─────────────────────────────────┬────────────────┬───────────────┐
│ abstract                        ┆ location_label ┆ topic_label   │
│ ---                             ┆ ---            ┆ ---           │
│ str                             ┆ str            ┆ str           │
╞═════════════════════════════════╪════════════════╪═══════════════╡
│ Marine mammals and diving bird… ┆ NonUS          ┆ Ecology       │
│ Study Objectives: Raptors are … ┆ NonUS          ┆ HumanWildlife │
│ Sharks and rays are threatened… ┆ NonUS          ┆ Conservation  │
│ The diet and feeding behaviour… ┆ NonUS          ┆ Ecology       │
│ Conservation planning is incre… ┆ USA            ┆ Conservation  │
│ …                               ┆ …              ┆ …             │
│ A new method consisting of enr… ┆ NonUS          ┆ Technology    │
│ Rising environmental temperatu… ┆ USA            ┆ ClimateChange │
│ A forest of the black coral An… ┆ NonUS          ┆ Conservation  │
│ Developing techn

In [9]:
starting_prompt = "Classify the topic of this abstract in single token.\n\n"

One note: the way I'm coding this below is not the typical "polars" way. Polars is designed to work very efficiently on entire columns, so typically you'd call `map_elements` (named `apply` in older polars releases) with a function that handles the calls to the OpenAI API. I've done this in a loop so that you can insert break statements and see the results.

In [10]:
sample_df = sample_df.with_columns(pl.lit("").alias("ai_label")) 
  # This is one way to add a new column to a polars DF

total_input = 0
total_output = 0 

for idx, row in enumerate(sample_df.iter_rows()):
  
  prompt = starting_prompt + row[0] # abstract in first spot in tuple

  # There is a lot to learn about this chat client and the information is 
  # changing all the time. The best place to get the most up to date
  # information is the OpenAI API documentation.
  # https://platform.openai.com/docs/overview
  # The information I used to build this can be found on these pages: 
  # https://platform.openai.com/docs/guides/text-generation and
  # https://platform.openai.com/docs/guides/chat-completions

  completion = client.chat.completions.create(
          model="gpt-3.5-turbo", # There are many other models you can use: https://platform.openai.com/docs/models
          messages=[
              {"role": "system", "content": "You are a helpful assistant."},
              {
                  "role": "user",
                  "content": prompt,
              }
          ] # And the messages can be more complex than this.
      )

  # break # Uncomment this break statement to take a look at this "completion" object

  # Let's keep track of token usage
  total_input += completion.usage.prompt_tokens
  total_output += completion.usage.completion_tokens

  # Sometimes getting information out of the reply is a bit complicated.
  ai_label = completion.choices[0].message.content.strip()

  sample_df[idx, 'ai_label'] = ai_label

# This took a bit longer than a minute on my machine.

In [11]:
accuracy = (sample_df["ai_label"] == sample_df["topic_label"]).cast(pl.Int32).sum()/sample_df.height
print(f"Accuracy: {accuracy * 100:.2f}%")

print(f"Total input tokens: {total_input}")
print(f"Total output tokens: {total_output}")
print(f"Total tokens: {total_input + total_output}")

Accuracy: 15.00%
Total input tokens: 34837
Total output tokens: 260
Total tokens: 35097


In my testing I got an accuracy of 15% on this first shot. Not very good!
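The accuracy above is an exact string comparison, so a reply such as `ecology.` would not count as a match for `Ecology`. Whether that actually happened in this run is an assumption on my part; the sketch below only shows how a more forgiving comparison could be written. `normalize_label` and `lenient_accuracy` are hypothetical helper names.

```python
import string


def normalize_label(label: str) -> str:
    """Drop punctuation and whitespace, then casefold, so that
    'Human-Wildlife.' and 'HumanWildlife' compare as equal."""
    drop = set(string.punctuation + string.whitespace)
    return "".join(ch for ch in label if ch not in drop).casefold()


def lenient_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Share of predictions that match after normalization."""
    if not expected:
        raise ValueError("expected must not be empty")
    matches = sum(
        normalize_label(p) == normalize_label(e)
        for p, e in zip(predicted, expected)
    )
    return matches / len(expected)
```

With the notebook's data this could be called as `lenient_accuracy(sample_df["ai_label"].to_list(), sample_df["topic_label"].to_list())`.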

--- 

Now I'm going to ask you to do a series of tasks.

### Task 1

Repeat the above code with `gpt-3.5-turbo` and at least three other OpenAI models. (I encourage you to choose models at very different price points, such as `gpt-4` and `gpt-4o-mini`.) Answer the following questions for each model:

* What was the accuracy? 
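* How many tokens were used for input and output? (If you're not changing the seed and prompt, the input count should stay roughly the same, although models with different tokenizers can count the same text differently, and output counts depend on each model's replies.)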
* What is your rough estimate of the number of tokens for input and output if you had 100,000 abstracts?
* What was the cost of your experiment? 
* What would it cost to run on 100,000 abstracts? 

Do these runs and record the results here in prose. 
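The scaling and cost arithmetic can be sketched as follows, using the token totals from the 100-abstract run above. The two prices are PLACEHOLDERS chosen only to make the arithmetic concrete; they are not real prices for any model, so substitute the current figures from the pricing page. `scale_tokens` and `cost_usd` are hypothetical helper names.

```python
def scale_tokens(sample_tokens: int, sample_size: int, target_size: int) -> int:
    """Linearly extrapolate a token count from a sample to a larger run."""
    return round(sample_tokens * target_size / sample_size)


def cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost in USD given prices quoted per one million tokens."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Token totals reported for the 100-abstract gpt-3.5-turbo run.
SAMPLE_INPUT, SAMPLE_OUTPUT, SAMPLE_SIZE = 34_837, 260, 100

# PLACEHOLDER prices in USD per 1M tokens (assumed, not real).
EXAMPLE_INPUT_PRICE = 1.00
EXAMPLE_OUTPUT_PRICE = 2.00

est_input = scale_tokens(SAMPLE_INPUT, SAMPLE_SIZE, 100_000)
est_output = scale_tokens(SAMPLE_OUTPUT, SAMPLE_SIZE, 100_000)

sample_cost = cost_usd(
    SAMPLE_INPUT, SAMPLE_OUTPUT, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE
)
full_cost = cost_usd(
    est_input, est_output, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE
)

print(f"Estimated input tokens for 100,000 abstracts: {est_input:,}")
print(f"Estimated output tokens for 100,000 abstracts: {est_output:,}")
print(f"Sample cost: ${sample_cost:.4f}; 100,000-abstract cost: ${full_cost:.2f}")
```

The linear extrapolation assumes the 100 sampled abstracts are representative in length of the abstracts you would run at scale.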


In [None]:
# your work here if needed

At this point, pick a model that you feel gives an "acceptable" tradeoff between price and accuracy.

### Task 2

Incorporate some of the ideas from the prompt engineering page of the OpenAI documentation. Be methodical and scientific in your approach:

1. Make a change.
2. Select a sample of abstracts.
3. Measure the accuracy and estimate the cost.

Document your changes here in the workbook. Do at least three rounds of improvements. Let me know if you need ideas on specific improvements to make. Report on your final version just as you did in Task 1 (accuracy, costs, cost per 100K abstracts, etc.)
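One way to keep the change / sample / measure rounds comparable is to wrap them in a small function and reuse the same seed for every prompt variant. This is only a sketch under assumptions: `evaluate_prompt` is a hypothetical name, `rows` is assumed to be a sequence of `(abstract, topic_label)` tuples, and `classify` is any function that takes the full prompt text and returns the model's label, so the API call itself stays outside the helper.

```python
import random
from typing import Callable, Sequence


def evaluate_prompt(
    rows: Sequence[tuple[str, str]],
    classify: Callable[[str], str],
    prompt: str,
    n: int,
    seed: int,
) -> float:
    """Score one prompt variant on a reproducible random sample of rows.

    Using the same seed for every variant means each prompt is judged
    on the same abstracts.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(rows), n)
    correct = sum(
        classify(prompt + abstract).strip() == label
        for abstract, label in sample
    )
    return correct / n
```

The rows could come from `list(df.select("abstract", "topic_label").iter_rows())`, and `classify` could be a thin wrapper around the `client.chat.completions.create` call shown earlier that also accumulates token usage for the cost estimate.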



In [None]:
# a code cell for you to get started. 

### Task 3

Settle on your final version and run it on the full set of 1000 abstracts. Report your results.  

In [None]:
# your code