# Intro

This documents my attempts to benchmark a pair of prompts on several models.

# Step one: two best prompts

To select the two best prompts, I needed to generate more than two and evaluate them against each other. For this step I asked ChatGPT-5 Thinking to generate 5 prompts for sentiment analysis. I had to try several prompts before I received an output I was happy with. I will include the other prompts in my AI appendix, but the following is the prompt that generated my sentiment analysis prompts.

## Prompt for generating 5 sentiment analysis prompts

::: {.callout-note appearance="simple"}
Would you develop 5 separate system prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the most current and effective trends in prompt engineering and label the techniques used for each prompt in your output? I will be evaluating them against each other. The three options for sentiment (positive, negative, neutral) should be in the system prompt. I will be evaluating them in agenta.ai. The answer should should be written in lowercase and only include only the sentiment indicated no other text.
:::

## Output

### 1) technique: role + rubric + constrained output

::: {.callout-note appearance="simple"}
 you are a meticulous twitter sentiment annotator.

task: classify the user's tweet as exactly one of: positive, negative, neutral.

rubric:\
- positive: praise, satisfaction, enthusiasm, gratitude, support, relief; "not bad"/"could be worse" treated as mildly positive; positive emojis (e.g., 😊😍🎉) weigh positive.\
- negative: complaints, frustration, insults, fear, disappointment, sarcasm with negative intent; negative emojis (e.g., 😡🤮💀) weigh negative.\
- neutral: mixed/ambiguous, info-only, questions without stance, jokes with unclear valence, news headlines, or contradictory cues.

guidelines:\
- focus on the author's stance; for quotes/retweets, label the author's expressed attitude.\
- handle negation and intensifiers (e.g., "barely", "so", "super").\
- ignore usernames, urls, and non-sentiment hashtags unless they change tone.\
- do not infer outside context; judge only the text provided.

output: return exactly one label from {positive, negative, neutral} in lowercase with no other text, punctuation, or whitespace.\
your output must match the regex: ^(positive|negative|neutral)$\
keep any reasoning internal; do not reveal it.\
:::
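The regex constraint in this prompt can also be checked on the receiving side. The following is a minimal sketch, assuming the model's reply arrives as a plain string; `is_valid_label` is a hypothetical helper and not part of Agenta.ai.

```python
import re

# Pattern copied from the prompt above; the helper name is hypothetical.
LABEL_PATTERN = re.compile(r"^(positive|negative|neutral)$")


def is_valid_label(reply: str) -> bool:
    """Return True only if the reply is exactly one lowercase label."""
    return LABEL_PATTERN.fullmatch(reply) is not None


# Replies with capitals, punctuation, or stray whitespace are rejected.
for reply in ["positive", "Positive", "neutral.", " negative"]:
    print(repr(reply), is_valid_label(reply))
```

A check like this would flag replies that break the output rule before they are compared with the expected sentiment.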

### 2) technique: few-shot contrastive exemplars (domain-tuned)

::: {.callout-note appearance="simple"}
you label tweet sentiment for short social posts.

labels (choose exactly one): positive, negative, neutral.

examples:\
tweet: "huge shoutout to the team—shipped it!" → positive\
tweet: "waited 2 hours for support. never again." → negative\
tweet: "release v2 drops tomorrow." → neutral\
tweet: "not bad at all, actually impressed." → positive\
tweet: "nice 'fix'… everything is broken lol" → negative\
tweet: "anyone have docs for the api?" → neutral\
tweet: "that concert was insane 🤩" → positive\
tweet: "love waking up to bugs 🙃" → negative\

instructions:\
- assess the author's attitude in the provided tweet only.\
- treat sarcasm/irony as negative unless clearly playful and positive.\
- ignore urls, @handles, and non-sentiment hashtags.\

output format (strict): one of {positive, negative, neutral} in lowercase only, no other text.\
do not show your reasoning.

:::
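The layout of this few-shot prompt (labels, then labeled examples, then a strict output rule) can be sketched as a small builder. This is only an illustration, assuming the examples are kept as (tweet, label) pairs; `build_few_shot_prompt` is a hypothetical helper, and the three examples are copied from the prompt above.

```python
# Hypothetical helper that assembles a few-shot system prompt from labeled examples.
EXAMPLES = [
    ("waited 2 hours for support. never again.", "negative"),
    ("release v2 drops tomorrow.", "neutral"),
    ("not bad at all, actually impressed.", "positive"),
]


def build_few_shot_prompt(examples):
    """Join the label list, the examples, and the output rule into one prompt."""
    lines = [
        "labels (choose exactly one): positive, negative, neutral.",
        "",
        "examples:",
    ]
    for tweet, label in examples:
        lines.append(f'tweet: "{tweet}" → {label}')
    lines.append("")
    lines.append(
        "output format (strict): one of {positive, negative, neutral} "
        "in lowercase only, no other text."
    )
    return "\n".join(lines)


print(build_few_shot_prompt(EXAMPLES))
```

Keeping the examples in a list makes it easy to swap in different exemplars and re-run the same evaluation.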

### 3) technique: deliberate (hidden scratchpad) with checklist

::: {.callout-note appearance="simple"}
you will silently reason using a checklist, then output only the label.

labels (choose exactly one): positive, negative, neutral.

silent checklist (do not reveal):\
1) identify sentiment words/emojis.\
2) apply negation and intensifiers.\
3) detect sarcasm/irony markers (quotes, "yeah right", 🙃, 💀).\
4) if cues conflict or are weak → neutral.\

domain notes:\
- praise/gratitude/celebration → positive.\
- complaints/insults/disappointment → negative.\
- info-only announcements/questions → neutral.\
- quoted text: label the author's stance, not the quoted content.\

hard constraint: output must be exactly one of {positive, negative, neutral} in lowercase with nothing else.\
never include explanations or punctuation.
:::

### 4) technique: reflective self-check (self-consistency) with tie-break

::: {.callout-note appearance="simple"}
you will run two internal passes and output only the final label.

labels: positive, negative, neutral.

procedure (internal, do not reveal):\
- pass a: classify quickly using rubric.\
- pass b: re-evaluate focusing on negation, sarcasm, and emojis.\
- if pass a and b disagree → choose neutral.\
- if they agree → output that label.\

rubric summary:\
- positive: praise, excitement, relief, supportive tone, positive emojis.\
- negative: dissatisfaction, anger, fear, ridicule, negative emojis, sarcastic praise.\
- neutral: mixed/unclear, info-only, questions without stance.\

strict output: print exactly one of {positive, negative, neutral} in lowercase. no other words, symbols, or whitespace.
:::
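The tie-break rule in this prompt is simple enough to write out. This is a minimal sketch of the rule only; the two passes themselves happen inside the model, and `resolve_two_passes` is a hypothetical name.

```python
def resolve_two_passes(pass_a: str, pass_b: str) -> str:
    """Keep the label when both passes agree; otherwise fall back to neutral."""
    return pass_a if pass_a == pass_b else "neutral"


print(resolve_two_passes("positive", "positive"))
print(resolve_two_passes("positive", "negative"))
```

One consequence of this design is that any disagreement between the passes pushes the answer toward neutral.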

### 5) technique: decision tree + lightweight scoring guardrails

::: {.callout-note appearance="simple"}
you are a rule-driven tweet sentiment classifier.

labels: positive, negative, neutral (choose exactly one).

decision steps:\
1) preprocess: ignore @handles, urls, and non-sentiment hashtags.\
2) score cues internally (do not show):\
   - add +1 for each clear positive cue (praise, thanks, excitement, 😊🤩🎉, "not bad").\
   - add -1 for each clear negative cue (complaint, anger, disgust, 😡🤮💀, sarcastic quotes).\
   - apply negation flip ("not good" → negative).\
3) decide:\
   - if positive cues outweigh negative by ≥2 → positive.\
   - if negative cues outweigh positive by ≥2 → negative.\
   - otherwise → neutral.

scope: judge only the provided tweet and the author's stance (not quoted text).

output constraint: return exactly one of {positive, negative, neutral} in lowercase, with no other text or punctuation.
keep all reasoning private.
:::
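The decision step of this prompt can be expressed as a short function. This is a sketch of the stated rule, assuming the cue counts have already been tallied; counting the cues is left to the model, and `decide` is a hypothetical name.

```python
def decide(positive_cues: int, negative_cues: int) -> str:
    """Apply the margin-of-two rule from the prompt's decision steps."""
    margin = positive_cues - negative_cues
    if margin >= 2:
        return "positive"
    if margin <= -2:
        return "negative"
    return "neutral"


print(decide(2, 0))
print(decide(1, 0))
print(decide(0, 3))
```

Read literally, a tweet with a single clear cue scores a margin of 1 and falls through to neutral, so this rule is conservative about short tweets.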

## Evaluation results

I used Agenta.ai to evaluate these 5 prompts against each other using the sampled tweets produced by the Python script in step two.

These are the results of my evaluation in a dataframe.

In [1]:
import pandas as pd

default_evaluation_scenarios_df = pd.read_csv("default_evaluation_scenarios.csv")
default_evaluation_scenarios_df

| Unnamed: 0 | Name | Variant | Testset | Status | 5prompts (default).success | Cost (total) | Duration (mean) | Errors | Tokens (mean) | Created By | Created At |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | app.key-v1-sampled_tweets-1cg63 | app.key v1 | | Success | 80.00% | $0.0251 | 0.0028s | 0.00% | 375 | genuine_drivel2s | 04 Oct 2025 \| 1:18 pm |
| 1 | app.fewShot-v2-sampled_tweets-3ovg9 | app.fewShot v2 | | Success | 85.00% | $0.0291 | 0.00313s | 0.00% | 353 | genuine_drivel2s | 04 Oct 2025 \| 1:17 pm |
| 2 | app.deliberate-v2-sampled_tweets-qr2kc | app.deliberate v2 | | Success | 85.00% | $0.0252 | 0.00275s | 0.00% | 287 | genuine_drivel2s | 04 Oct 2025 \| 1:17 pm |
| 3 | app.reflective-v2-sampled_tweets-1ghkw | app.reflective v2 | | Success | 85.00% | $0.0411 | 0.00404s | 0.00% | 363 | genuine_drivel2s | 04 Oct 2025 \| 1:17 pm |
| 4 | app.tree-v2-sampled_tweets-zkfre | app.tree v2 | | Success | 75.00% | $0.0768 | 0.00673s | 0.00% | 584 | genuine_drivel2s | 04 Oct 2025 \| 1:16 pm |


## Prompt selection

These prompts all seemed to perform well based on what I am learning about sentiment analysis and prompt evaluation. My instinct is to select the prompts with the highest accuracy, but the evaluation produced a three-way tie: the few shot, deliberate, and reflective prompts each scored 85% accuracy. As a student with a low income, the next metric that I prioritize is cost. At $0.0252 and $0.0291, the deliberate and few shot prompts were cheaper than the reflective prompt ($0.0411), so I chose those two for my further evaluation.
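The selection rule (highest accuracy first, lowest cost as the tie-breaker) can be reproduced in a few lines. This is a minimal sketch using the accuracy and cost values from the evaluation table above; the short names stand in for the Agenta.ai variant names.

```python
# (short name, accuracy %, total cost in dollars), copied from the evaluation table.
RUNS = [
    ("key", 80.0, 0.0251),
    ("fewShot", 85.0, 0.0291),
    ("deliberate", 85.0, 0.0252),
    ("reflective", 85.0, 0.0411),
    ("tree", 75.0, 0.0768),
]

# Sort by accuracy from high to low, then by cost from low to high.
ranked = sorted(RUNS, key=lambda run: (-run[1], run[2]))
top_two = [name for name, _, _ in ranked[:2]]
print(top_two)  # ['deliberate', 'fewShot']
```

Negating the accuracy inside the sort key lets a single ascending sort handle both criteria.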

# Step two: select 20 tweets

I used OpenAI Codex in Cursor to generate a script that selects 20 tweets at random from the dataset. I have privacy mode enabled in Cursor, but when I started looking at the telemetry settings for certain extensions, I realized that I couldn't turn them off, so I'm probably not working as privately as I think I am. This motivates me to work in the terminal more. When I have more time, I plan to become a terminal-only person. I already have many command line tools (many thanks to your Welcome to the Command Line! guide); I mostly need to get better at using Helix or Neovim. I have been using Ghostty as my terminal emulator because it is lightweight and highly configurable.

## Prompt

::: {.callout-note appearance="simple"}
Will you generate a python script that selects 20 tweets at random from this csv and creates a file with those selected tweets along with their sentiment?
:::


## Resulting Python script:

```python
#!/usr/bin/env python3
import argparse
import csv
import random
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description="Select random tweets from a CSV file")
    parser.add_argument("--input", default="example_tweets.csv", help="Path to the input CSV file")
    parser.add_argument("--output", default="sampled_tweets.csv", help="Path to write the sampled tweets")
    parser.add_argument("--count", type=int, default=20, help="Number of tweets to select")
    parser.add_argument("--seed", type=int, help="Random seed for reproducibility")
    args = parser.parse_args()

    if args.seed is not None:
        random.seed(args.seed)

    input_path = Path(args.input)
    if not input_path.exists():
        raise SystemExit(f"Input file '{input_path}' does not exist.")

    with input_path.open(newline="", encoding="utf-8") as infile:
        reader = csv.DictReader(infile)
        rows = list(reader)

    if not rows:
        raise SystemExit("Input file contains no tweet rows.")

    sample_size = min(args.count, len(rows))
    sampled_rows = random.sample(rows, sample_size)

    # Persist selected tweets with their sentiment for later review.
    output_path = Path(args.output)
    fieldnames = ["textID", "text", "sentiment"]
    with output_path.open("w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in sampled_rows:
            writer.writerow({
                "textID": row.get("textID", ""),
                "text": row.get("text", ""),
                "sentiment": row.get("sentiment", ""),
            })

    print(f"Wrote {sample_size} tweets to '{output_path}'.")


if __name__ == "__main__":
    main()

```

## Dataframe of selected tweets

In [2]:
import pandas as pd

tweets_df = pd.read_csv("sampled_tweets.csv")
tweets_df

Unnamed: 0,textID,text,sentiment
0,7ff4e1b8e7,congratulations,positive
1,93ee768679,What would you know about packing? You`re mak...,negative
2,5f01174cd5,Soooo full that im starting to feel sick,negative
3,2a16f885cc,James and I battle over everything too! It`s ...,neutral
4,dc405bcd61,"oh yeah, that`s right! thanks for reminding me",positive
5,3172917914,"weird, `cause i synced mine this morning and ...",neutral
6,2c9ef954f2,your face makes people laugh. cause you`re so...,negative
7,875663c3fb,YAY! My temperature went down! I can have din...,neutral
8,f100af7a3b,homeee! my feet are sore,negative
9,4486f18f71,_Megan_**** Thats all i`ve seen too 5 till 7...,negative


# Step three: benchmark

## Agenta.ai setup for 3 model benchmark

For both of my prompt benchmark evaluations, I used the latest versions of the three most popular models: GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Pro. My initial plan was to set the temperature to 0.2 for each model. I wanted to configure that setting, but I didn't want to go too high and risk hallucination when the output is meant to be simple. I also ran a prompt in GPT-5 to check whether I was on the right track. The prompt was:

## Prompt about temperature setting for sentiment analysis

::: {.callout-note appearance="simple"}
what is the best temperature setting for sentiment analyis?
:::

## Output

::: {.callout-note appearance="simple"}
In sentiment analysis (or any natural language classification task), the temperature setting usually refers to how deterministic or exploratory a model’s output is when generating text. Here’s how it applies:

Low temperature (0.0 – 0.3)

Produces deterministic, stable outputs.

Best for classification tasks like sentiment analysis, where you want consistent labeling (positive, negative, neutral) rather than creativity.

Example: A sentence classified as "negative" will almost always get the same label.

Medium temperature (0.4 – 0.7)

Allows slight variation in phrasing while still being mostly consistent.

Useful if you’re generating explanations of sentiment (e.g., why a review is negative) and want a bit of variety in language.

High temperature (0.8 – 1.0+)

Encourages creative, diverse outputs.

Not ideal for classification accuracy, since it may produce inconsistent or off-topic results.

Better suited for creative writing, brainstorming, or exploring multiple interpretations.

✅ Best practice: For sentiment analysis, keep the temperature at 0.0 or very close to 0. This ensures reproducibility and reliability, which is critical for analytics and downstream decisions.

Do you want me to also give you recommended settings for other parameters (like top_p, max_tokens, etc.) to optimize sentiment classification?
:::
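To make the effect of temperature concrete, here is a small sketch of temperature-scaled softmax, the usual way sampling temperature is described. The logits are made-up numbers for illustration, not values from any of the models in this evaluation, and the function requires a temperature above 0 (a setting of exactly 0 is typically treated as always picking the top choice).

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for the labels positive, negative, neutral.
logits = [2.0, 1.0, 0.5]
for t in (0.1, 0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperatures nearly all of the probability lands on the highest-scoring label, which is why low settings give more repeatable classifications.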

I then decided to follow up on ChatGPT's offer of recommended settings for other parameters.

## Follow-up prompt in the same chat about other parameters

::: {.callout-note appearance="simple"}
yes
:::

## Output

::: {.callout-note appearance="simple"}
Great! Let’s go through the key generation parameters you’ll want to tune for sentiment analysis (classification-style tasks).

🔧 Recommended Parameter Settings for Sentiment Analysis
1. Temperature

* Set to 0.0 (or very close to 0.1 max)

* Ensures deterministic and consistent outputs — critical for classification accuracy.

2. Top_p (Nucleus Sampling)

* Set to 1.0 (disable nucleus sampling).

* With temperature already low, nucleus sampling isn’t necessary.

* This ensures the model always chooses the highest-probability class instead of exploring alternatives.

3. Max Tokens

* Keep low (e.g., 10–20) if you only want the label (e.g., “positive,” “neutral,” “negative”).

* Increase (e.g., 50–100) if you want a short explanation of the sentiment along with the label.

4. Stop Sequences

* Optional, but useful for clean results.

* Example: If you want only one-word labels, you can enforce a stop token after “positive” / “negative” / “neutral” to cut off extra text.

5. Prompt Engineering

* Provide clear instructions in the prompt to reduce variance.

* Example:

```text
Classify the sentiment of the following text as strictly one of: Positive, Negative, Neutral. 
Text: "The service was amazing but the food was average."
Sentiment:
```
* With temperature = 0, this ensures very consistent and accurate labels.

🎯 Example Config for Reliable Sentiment Analysis

```json
{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 10,
  "stop": ["\n", "."]
}
```
👉 If you want classification + short explanation, you’d just bump max_tokens up and remove strict stops.

Would you like me to also show you how to set this up in code (Python/JavaScript examples for OpenAI API), so you can copy-paste and use it right away?
:::

My instincts about the temperature setting for sentiment analysis were mostly correct (temperature should be low), but ChatGPT suggested 0.0, which I thought was kind of boring, so I decided to split the difference between 0.0 and my original 0.2 and go with 0.1. I followed the other suggestions of setting Top P to 1.0 and Max Tokens to 10. I'm a bit curious about the Max Tokens setting because my earlier evaluations of the few shot and deliberate prompts had mean token usage of 353 and 287 respectively. I didn't adjust the frequency or presence penalties because they didn't seem applicable to my evaluation.

## Set-up tweak

My intention was to use the parameters ChatGPT suggested, but I was having so much trouble with the outputs of Claude Sonnet 4.5 and Gemini 2.5 Pro that I decided to forgo the Max Tokens limit and remove the Top P setting, although a Top P of 1.0 seems to be neutral anyhow. I was receiving rate limit errors or no output at all. I tried a lot of different configurations, but removing Max Tokens and Top P allowed the evaluation to run without error. I also adjusted GPT-5 in this way. I kept temperature at 0.1.
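The reason a Top P of 1.0 behaves as neutral can be seen in a small sketch of nucleus filtering as it is usually described: keep the most likely candidates until their combined probability reaches Top P. The probabilities below are invented for illustration, and `nucleus_filter` is a hypothetical helper rather than any provider's actual implementation.

```python
def nucleus_filter(probs, top_p):
    """Keep the most likely candidates until their total reaches top_p,
    then rescale the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Invented probabilities for the three labels.
probs = {"positive": 0.7, "neutral": 0.2, "negative": 0.1}
print(nucleus_filter(probs, 1.0))  # every candidate survives
print(nucleus_filter(probs, 0.8))  # the least likely candidate is dropped
```

With Top P at 1.0 nothing is filtered out, so removing the parameter should not change which labels the model can produce.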

## VADER setup

First I prompted Gemini for a list of open source sentiment analysis tools.

### Gemini prompt for sentiment analysis tools

::: {.callout-note appearance="simple"}
What are some open source tools for sentiment analysis of tweets?
:::

Gemini suggested VADER as an open source tool and I remember you mentioned it in class as well, so I thought that would be as good as any.

I cloned the vaderSentiment repo with git and then ran this prompt using Claude Sonnet 4.5 in Cursor to make sure all the Python packages were installed correctly. I've never installed Python packages from a cloned repo, so I thought it would be faster to prompt for it. Probably a bit lazy on my part.

::: {.callout-note appearance="simple"}
will you help me create a virtual environment and install all the necessary python modules for this vaderSentiment repo?
:::

Then I ran some prompts to figure out how VADER works with my sampled tweets.

::: {.callout-note appearance="simple"}
Now how do I upload a list of tweets for sentiment analysis in this repo?
:::

::: {.callout-note appearance="simple"}
I put my own tweets in tweet.txt
:::

::: {.callout-note appearance="simple"}
Can you export the analysis into a csv?
:::

At this point I realized my tweets.txt didn't include the sentiment column so I prompted Claude in Cursor to run VADER again with my sampled_tweets.csv file.

::: {.callout-note appearance="simple"}
Can you run VADER on @sampled_tweets.csv and measure how accurate the sentiment analysis is? The answers are in the "sentiment" column.
:::


::: {.callout-note appearance="simple"}
Will you export the results to a csv titled "VADER_Accuracy.csv"?
:::
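VADER reports a compound score between -1 and 1 for each text, and the vaderSentiment README suggests cutoffs of 0.05 and -0.05 for turning that score into a label. The sketch below applies those cutoffs to the compound scores from the ten rows displayed in the VADER results table in step four, so it runs without the package installed. Whether the script Claude generated used exactly these cutoffs is an assumption, and the helper names are hypothetical.

```python
# (true sentiment, compound score) for the ten rows displayed in the results table.
ROWS = [
    ("positive", 0.5994), ("negative", -0.4767), ("negative", -0.5106),
    ("neutral", -0.4389), ("positive", 0.6588), ("neutral", -0.1779),
    ("negative", -0.1633), ("neutral", 0.6924), ("negative", -0.4199),
    ("negative", -0.3976),
]


def label_from_compound(compound, cutoff=0.05):
    """Map a compound score to a label using symmetric cutoffs."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"


def accuracy(rows):
    """Share of rows where the predicted label matches the true label."""
    correct = sum(1 for truth, score in rows if label_from_compound(score) == truth)
    return correct / len(rows)


print(f"{accuracy(ROWS):.0%}")  # 70% on these ten rows only
```

The figure covers only the displayed rows, not the full set of 20 tweets. All three misses are tweets labeled neutral, which the cutoffs rarely produce because only scores very close to zero count as neutral.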

# Step four: results

## Few shot evaluation results

### Dataframe of overall results from few shot evaluation

::: {.callout-note appearance="simple"}
app.key v5 was the name I used for GPT-5. I'm still trying to figure out the best practices when it comes to using agenta.ai.
:::

In [3]:
import pandas as pd

fewShot_df = pd.read_csv("fewShot_evaluation_scenarios.csv")
fewShot_df

| Unnamed: 0 | Name | Variant | Testset | Status | fewShot (fewShot).success | Cost (total) | Duration (mean) | Errors | Tokens (mean) | Created By | Created At |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gemini-v8-sampled_tweets-18px1 | gemini v8 | sampled_tweets | Success | 80.00% | $0.163 | 0.0077s | 0.00% | 1022 | genuine_drivel2s | 04 Oct 2025 \| 5:57 pm |
| 1 | claude-v8-sampled_tweets-1vauj | claude v8 | sampled_tweets | Success | 75.00% | $0.0165 | 0.00183s | 0.00% | 259 | genuine_drivel2s | 04 Oct 2025 \| 5:57 pm |
| 2 | app.key-v5-sampled_tweets-l1olw | app.key v5 | sampled_tweets | Success | 75.00% | $0.031 | 0.00346s | 0.00% | 363 | genuine_drivel2s | 04 Oct 2025 \| 5:56 pm |


### Dataframe of detailed results of few shot prompt with Gemini 2.5 Pro

In [4]:
import pandas as pd

fewShotGemini_df = pd.read_csv("fewShot_Gemini.csv")
fewShotGemini_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,fewShot (fewShot).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,neutral,success,True
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,neutral,success,False
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


### Dataframe of detailed results of few shot prompt with Claude Sonnet 4.5

In [5]:
import pandas as pd

fewShotClaude_df = pd.read_csv("fewShot_Claude.csv")
fewShotClaude_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,fewShot (fewShot).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,negative,success,False
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,neutral,success,False
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


### Dataframe of detailed results of few shot prompt with GPT-5

In [6]:
import pandas as pd

fewShotGPT_df = pd.read_csv("fewShot_GPT5.csv")
fewShotGPT_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,fewShot (fewShot).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,neutral,success,True
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,neutral,success,False
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


## Deliberate evaluation results

### Dataframe of overall results from deliberate evaluation

In [7]:
import pandas as pd

deliberate_df = pd.read_csv("deliberate_evaluation_scenarios.csv")
deliberate_df

| Unnamed: 0 | Name | Variant | Testset | Status | deliberate (deliberate).success | Cost (total) | Duration (mean) | Errors | Tokens (mean) | Created By | Created At |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | deliberateGemini-v2-sampled_tweets-u2y16 | deliberateGemini v2 | sampled_tweets | Success | 80.00% | $0.0717 | 0.0038s | 0.00% | 527 | genuine_drivel2s | 04 Oct 2025 \| 6:33 pm |
| 1 | deliberateClaude-v2-sampled_tweets-znbr5 | deliberateClaude v2 | sampled_tweets | Success | 65.00% | $0.0193 | 0.00221s | 0.00% | 233 | genuine_drivel2s | 04 Oct 2025 \| 6:33 pm |
| 2 | deliberateGPT5-v1-sampled_tweets-1nlae | deliberateGPT5 v1 | sampled_tweets | Success | 85.00% | $0.0252 | 0.00308s | 0.00% | 287 | genuine_drivel2s | 04 Oct 2025 \| 6:32 pm |


### Dataframe of detailed results of deliberate prompt with Gemini 2.5 Pro

In [8]:
import pandas as pd

deliberateGemini_df = pd.read_csv("deliberate_Gemini.csv")
deliberateGemini_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,deliberate (deliberate).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,neutral,success,True
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,neutral,success,False
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


### Dataframe of detailed results of deliberate prompt with Claude Sonnet 4.5

In [9]:
import pandas as pd

deliberateClaude_df = pd.read_csv("deliberate_Claude.csv")
deliberateClaude_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,deliberate (deliberate).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,positive,success,False
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,negative,success,True
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


### Dataframe of detailed results of deliberate prompt with GPT-5

In [10]:
import pandas as pd

deliberateGPT_df = pd.read_csv("deliberate_GPT5.csv")
deliberateGPT_df

Unnamed: 0,input.text,input.textID,input.sentiment,output,status,deliberate (deliberate).success
0,congratulations,7ff4e1b8e7,positive,positive,success,True
1,What would you know about packing? You`re mak...,93ee768679,negative,negative,success,True
2,Soooo full that im starting to feel sick,5f01174cd5,negative,negative,success,True
3,James and I battle over everything too! It`s ...,2a16f885cc,neutral,neutral,success,True
4,"oh yeah, that`s right! thanks for reminding me",dc405bcd61,positive,positive,success,True
5,"weird, `cause i synced mine this morning and ...",3172917914,neutral,neutral,success,True
6,your face makes people laugh. cause you`re so...,2c9ef954f2,negative,negative,success,True
7,YAY! My temperature went down! I can have din...,875663c3fb,neutral,positive,success,False
8,homeee! my feet are sore,f100af7a3b,negative,neutral,success,False
9,_Megan_**** Thats all i`ve seen too 5 till 7...,4486f18f71,negative,negative,success,True


## Comprehensive evaluation results from Agenta.ai

### Dataframe of comprehensive results from deliberate and few shot evaluation

In [11]:
import pandas as pd

comprehensive_df = pd.read_csv("comprehensive_evaluation.csv")
comprehensive_df

| Unnamed: 0 | Name | Variant | Testset | Status | Success Rate | Cost (total) | Duration (mean) | Errors | Tokens (mean) | Created By | Created At | Evaluation Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | deliberateGemini-v2-sampled_tweets-u2y16 | deliberateGemini v2 | sampled_tweets | Success | 80.00% | $0.0717 | 0.0038s | 0.00% | 527 | genuine_drivel2s | 04 Oct 2025 \| 6:33 pm | deliberate |
| 1 | deliberateClaude-v2-sampled_tweets-znbr5 | deliberateClaude v2 | sampled_tweets | Success | 65.00% | $0.0193 | 0.00221s | 0.00% | 233 | genuine_drivel2s | 04 Oct 2025 \| 6:33 pm | deliberate |
| 2 | deliberateGPT5-v1-sampled_tweets-1nlae | deliberateGPT5 v1 | sampled_tweets | Success | 85.00% | $0.0252 | 0.00308s | 0.00% | 287 | genuine_drivel2s | 04 Oct 2025 \| 6:32 pm | deliberate |
| 3 | gemini-v8-sampled_tweets-18px1 | gemini v8 | sampled_tweets | Success | 80.00% | $0.163 | 0.0077s | 0.00% | 1022 | genuine_drivel2s | 04 Oct 2025 \| 5:57 pm | fewShot |
| 4 | claude-v8-sampled_tweets-1vauj | claude v8 | sampled_tweets | Success | 75.00% | $0.0165 | 0.00183s | 0.00% | 259 | genuine_drivel2s | 04 Oct 2025 \| 5:57 pm | fewShot |
| 5 | app.key-v5-sampled_tweets-l1olw | app.key v5 | sampled_tweets | Success | 75.00% | $0.031 | 0.00346s | 0.00% | 363 | genuine_drivel2s | 04 Oct 2025 \| 5:56 pm | fewShot |


## VADER results

In [12]:
import pandas as pd

vader_df = pd.read_csv("VADER_Accuracy.csv")
vader_df

Unnamed: 0,textID,text,true_sentiment,predicted_sentiment,correct,compound_score,positive_score,neutral_score,negative_score
0,7ff4e1b8e7,congratulations,positive,positive,YES,0.5994,1.0,0.0,0.0
1,93ee768679,What would you know about packing? You`re maki...,negative,negative,YES,-0.4767,0.0,0.795,0.205
2,5f01174cd5,Soooo full that im starting to feel sick,negative,negative,YES,-0.5106,0.0,0.68,0.32
3,2a16f885cc,James and I battle over everything too! It`s k...,neutral,negative,NO,-0.4389,0.0,0.806,0.194
4,dc405bcd61,"oh yeah, that`s right! thanks for reminding me",positive,positive,YES,0.6588,0.473,0.527,0.0
5,3172917914,"weird, `cause i synced mine this morning and g...",neutral,negative,NO,-0.1779,0.0,0.866,0.134
6,2c9ef954f2,your face makes people laugh. cause you`re so ...,negative,negative,YES,-0.1633,0.191,0.584,0.225
7,875663c3fb,YAY! My temperature went down! I can have din...,neutral,positive,NO,0.6924,0.282,0.718,0.0
8,f100af7a3b,homeee! my feet are sore,negative,negative,YES,-0.4199,0.0,0.589,0.411
9,4486f18f71,_Megan_**** Thats all i`ve seen too 5 till 7...,negative,negative,YES,-0.3976,0.108,0.686,0.205


### Claude Sonnet 4.5 in Cursor summary of VADER results

::: {.callout-note appearance="simple"}
* Overall Accuracy: 65% (13/20 correct)
* Correct predictions: 13 tweets marked as "YES"
* Incorrect predictions: 7 tweets marked as "NO"
:::

## Identifying the best prompt and model

### Winner: Deliberate GPT-5 prompt
### Runner up: Few shot Claude Sonnet 4.5 prompt


Viewing all the results together, I am surprised that VADER, a tool engineered specifically for sentiment analysis, performed poorly: at 65% it tied with my deliberate Claude Sonnet 4.5 prompt in Agenta.ai. Even though I didn't get any metrics for VADER on cost, tokens, or latency, I'm going to treat this like a grade in school and consider anything below 70% a failure. This might not have been true even six months ago, but many of my prompts performed well above this threshold. That being said, VADER and my deliberate Claude Sonnet 4.5 prompt are disqualified from the running.

For the rest of the prompts evaluated, I will prioritize the metrics in this order: accuracy, cost, tokens, and latency. Considering I'm not running an enterprise business, I'm not personally invested in how fast I can get the results. Accuracy is at the top of my list because it doesn't matter if the cost or tokens are low if the prompt doesn't produce accurate results. After accuracy, I am concerned with keeping costs low because I'm frugal and I don't want to toss cash at businesses that are already swimming in it. After cost, token usage is a concern because I want to run energy efficient prompts. Cost and tokens seem to be closely related anyway.
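The cutoff and the priority order can be written as a filter followed by a multi-key sort. This is a sketch of one strict reading of those priorities, using the values from the comprehensive results table; the labels are shortened from the Agenta.ai variant names (app.key v5 is GPT-5), and `rank` is a hypothetical helper.

```python
# (label, accuracy %, total cost $, mean tokens, mean duration s)
# Values copied from the comprehensive results table.
RUNS = [
    ("deliberate / Gemini 2.5 Pro", 80.0, 0.0717, 527, 0.0038),
    ("deliberate / Claude Sonnet 4.5", 65.0, 0.0193, 233, 0.00221),
    ("deliberate / GPT-5", 85.0, 0.0252, 287, 0.00308),
    ("few shot / Gemini 2.5 Pro", 80.0, 0.163, 1022, 0.0077),
    ("few shot / Claude Sonnet 4.5", 75.0, 0.0165, 259, 0.00183),
    ("few shot / GPT-5", 75.0, 0.031, 363, 0.00346),
]

PASSING = 70.0  # the school-grade cutoff described above


def rank(runs, passing=PASSING):
    """Drop runs below the cutoff, then sort by accuracy (high to low),
    followed by cost, tokens, and duration (all low to high)."""
    kept = [run for run in runs if run[1] >= passing]
    return sorted(kept, key=lambda run: (-run[1], run[2], run[3], run[4]))


for label, acc, cost, tokens, duration in rank(RUNS):
    print(f"{label}: {acc:.0f}%, ${cost:.4f}, {tokens} tokens, {duration}s")
```

A strict sort like this only consults cost when accuracy is tied. Weighing a small accuracy gap against a large cost gap, as the next paragraph does, is a looser reading and can give a different second place.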

Taking my priorities into account, I'm going to declare the deliberate GPT-5 prompt the winner and the few shot Claude Sonnet 4.5 prompt the runner up. I received my few shot results first and thought that even though the Claude prompt's accuracy was a bit low (75%), its low cost outweighed that: the Gemini prompt had higher accuracy (80%) but was much more expensive ($0.163 versus $0.0165). However, the deliberate GPT-5 prompt had major gains in accuracy (85%), and even though it costs nearly $0.01 more than the Claude prompt ($0.0252 versus $0.0165), both of those numbers are so low that it doesn't break the bank. I'm also not planning on running thousands of sentiment analysis prompts personally.

# Conclusion

Prompt evaluation is an arduous multi-step process. Even though this assignment was a lot more straightforward for me to complete than eB was, it still took me several hours to compare all these prompts against each other. I only selected five prompts that were automatically generated by GPT-5; I can't imagine what this would have been like if I wanted to compare all of Schulhoff et al.'s 200 prompt techniques for sentiment analysis. It would have made this assignment a full-time job.

I'm surprised the deliberate GPT-5 prompt performed so well because I don't remember seeing that technique mentioned explicitly in Schulhoff et al.'s paper. However, after a quick Google search, it appears this technique combines a few powerful prompting techniques like chain of thought and self-critique. I also realize that I would probably never write a prompt like this on my own because it's very direct. I tend to try to be a bit polite when I write my own prompts, and I did read in a post from the WhatIKnow doc that being polite can be useful. That being said, there did seem to be a lot more variation in accuracy between models with the deliberate prompt than with the few shot prompts, which had negligible variation in accuracy.

Another observation I had was that Claude Sonnet 4.5 prompts tended to cost the least and use the fewest tokens, with GPT-5 a close second. However, Gemini 2.5 Pro was wildly inefficient when it came to cost, tokens, and latency compared to the other two models tested. I use Gemini frequently as a search engine because I like the UI. After comparing these metrics from my evaluation, I think I will start to use Claude more because it's more efficient than the other two popular models. Also, Claude only really performed poorly with the deliberate prompt, and as I already mentioned, I probably wouldn't write a prompt like that simply because it's not my style.

Prompt evaluation wasn't really on my radar in the world of prompt engineering before I started taking this class. Even though this process took a long time, I was happy to compare the results when I finally had them. I had assumed that the three most popular models, GPT, Claude, and Gemini, were basically the same. While many of the outputs tend to be similar, I don't often get to see what's going on under the hood when I use these tools. I'm happy to have had the opportunity to view these tools through metrics I don't usually pay much attention to, and it gives me a new perspective on how these tools work holistically.

# AI Appendix

The following are prompts I wrote in ChatGPT-5 Thinking for research and to generate my first 5 prompts for sentiment analysis.

### Researching evaluation strategies
::: {.callout-note appearance="simple"}
If I generate 5 different prompts to use with an LLM for sentiment analysis of tweets, how could I pick the best two?
:::

### Researching current best techniques for sentiment analysis prompts
::: {.callout-note appearance="simple"}
What are the current best methods of developing LLM prompts for sentiment analysis?
:::

I decided to use information gathered previously in the same chat to output the 5 prompts for sentiment analysis.

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets based on the infromation you provided?
:::

The output included things I wasn't sure would work in Agenta.ai, so I asked it to adjust them for use in Agenta.ai. It gave me JSON, though, and based on your tutorial video, I wasn't sure how to make that work.

::: {.callout-note appearance="simple"}
can you adjust these for use in agenta.ai?
:::

### 5 prompt generation prompt iteration 
I iterated on my sentiment analysis prompt generation prompt many times until I was satisfied with the output. The following are my initial iterations.

This prompt was a branch off of the initial prompt in my "Researching current best techniques for sentiment analysis prompts" section.

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the information you provided? I just want the system prompt and the user prompt. I will be evaluating them.
:::

The following are prompts I tried in separate chats based on what I gathered from the above prompts.

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the most current and effective trends in prompt engineering? I just want the system prompt and the user prompt. I will be evaluating them against each other. The three options for sentiment (positive, negative, neutral) should be in the system prompt, not in JSON.
:::

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the most current and effective trends in prompt engineering? I just want the system prompt. I will be evaluating them against each other. The three options for sentiment (positive, negative, neutral) should be in the system prompt.
:::

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the most current and effective trends in prompt engineering? I just want the system prompt. I will be evaluating them against each other. The three options for sentiment (positive, negative, neutral) should be in the system prompt. I will be evaluating them in agenta.ai so you don't need to specify output format.
:::

::: {.callout-note appearance="simple"}
Would you develop 5 separate prompts for testing sentiment analysis of tweets (positive, negative, or neutral) based on the most current and effective trends in prompt engineering? I just want the system prompt. I will be evaluating them against each other. The three options for sentiment (positive, negative, neutral) should be in the system prompt. I will be evaluating them in agenta.ai. The output should should include only the sentiment indicated no other text, also the sentiment should be written in lowercase
:::

## Asking Claude in Cursor to combine the Agenta.ai results CSVs

::: {.callout-note appearance="simple"}
can you append the @deliberate_evaluation_scenarios.csv to @fewShot_evaluation_scenarios.csv in a new file called comprehensive_evaluation.csv?
:::
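
The code Claude produced for this request is not recorded here. As a rough sketch of the same operation, assuming both exports share an identical header row, the standard-library `csv` module can write one file's rows followed by the other's into a new file. The function name `append_csvs` is hypothetical.

```python
import csv


def append_csvs(first_path, second_path, out_path):
    """Write the rows of first_path, then the rows of second_path, to out_path."""
    with open(first_path, newline="") as f1, open(second_path, newline="") as f2:
        first = csv.DictReader(f1)
        second = csv.DictReader(f2)
        # Assumption: both files have the same columns in the same order.
        if first.fieldnames != second.fieldnames:
            raise ValueError("the two files do not share the same header")
        with open(out_path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=first.fieldnames)
            writer.writeheader()
            writer.writerows(first)   # rows from the first file
            writer.writerows(second)  # appended rows from the second file


# Example call matching the request above (few shot rows first, deliberate rows appended):
# append_csvs(
#     "fewShot_evaluation_scenarios.csv",
#     "deliberate_evaluation_scenarios.csv",
#     "comprehensive_evaluation.csv",
# )
```

Raising an error on mismatched headers is a design choice for this sketch; it avoids silently misaligning columns if the two exports differ.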