In [1]:
import ast
import json
import os

import ipywidgets as widgets
import pandas as pd
from dotenv import load_dotenv
from IPython.display import display
from openai import OpenAI

In [2]:
load_dotenv(".env")
api_key = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=api_key)

In [24]:
def get_completion(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.9,
        response_format={
            "type": "text",
        },
    )

    completion = completion.to_dict()

    content = completion["choices"][0]["message"]["content"]

    return content

#### Average prompt, average response

**Task**: run the following prompt and examine the response

In [None]:
prompt = """
Brainstorm a list of product names for a pair of trainers that appeal to teenagers?
"""

get_completion(prompt)

**Task**: what do you think are some of the obvious issues with this prompt?

1. **Vague direction**: what style and attributes?
2. **Unformatted output**: how many names and in what format?
3. **Missing example**: what do good names look like?
4. **Limited evaluation**: how to determine if a name is good or bad?
5. **No task division**: what are the steps involved?

### The Five Principles of Prompting

1. **Give Direction**: describe the desired style in detail, or reference a relevant persona

2. **Specify Format**: define what rules to follow, and the required structure of the response

3. **Provide Examples**: insert a diverse set of test cases where the task was done correctly

4. **Evaluate Quality**: identify errors and rate responses, testing what drives performance

5. **Divide Labour**: split tasks into multiple steps, chained together for complex goals

#### 1. **Give Direction**

A human would also struggle to complete this task without a good brief. Imagine what context a human might need for this task and try including it in the prompt.

**Strategies**:

* Role-playing
* Prewarming or internal retrieval
* Best advice

**Role-playing**: find a persona (who is famous in the training data) to emulate their style

**Task**: Use your favourite persona/brand to give direction

In [None]:
prompt = "Brainstorm a list of product names for a pair of trainers that appeal to teenagers, in the style of Nike."
get_completion(prompt)

**Prewarming or internal retrieval**: ask ChatGPT for best-practice advice, then ask it to follow its own advice

**Task**: ask ChatGPT for advice, then use its advice

In [None]:
prompt = "Please give me 5 tips for naming products based on expert industry advice"
advice = get_completion(prompt)
advice

In [None]:
prompt = f"""
Using the following advice:

{advice}

Brainstorm a list of product names for a pair of trainers that appeal to teenagers, in the style of Nike.
"""

get_completion(prompt)

**Best advice**: take the best advice you can find for the task and use it

**Task**: Use the [5 Golden Rules](https://www.brandwatch.com/blog/how-to-name-a-product-our-5-golden-rules/) for naming a product

In [None]:
advice = """
1. It should be readable and writable
If your product name is hard to pronounce, people won’t talk about it and if they can’t write it down (and spell it correctly!) when they hear it, how do you expect them to Google it?

Keep it simple and don’t go with any wacky spellings just for the sake of it.

2. It should be unique
It’s very hard in this day and age to be completely unique, so you can give yourself a bit of leeway, but your product name should at least be unique to your industry.

This makes it much easier to get the domain, do well in search and know that when someone says the name, they mean your product.

3. It should be short, punchy and memorable
The longer the name, the harder it is to grab people.

Longer names also mean people resort to abbreviations that you often don’t get to control.

4. It should look good written down and sound cool to say
You want your product name to jump off the page and stand out next to all the other boring words around it.

When someone says it in a sentence it should stand out so everyone around pays attention.

5. It should evoke an emotion, feeling or idea
Your product name should tie back into what your product is, what the feeling you want people to have when experiencing your product is, and/or what idea are you trying to get across.

It should be emotive and inspiring
"""

In [None]:
prompt = f"""
Using the following 5 golden rules:

{advice}

Brainstorm a list of product names for a pair of trainers that appeal to teenagers, in the style of Nike.
"""

get_completion(prompt)

**Trade-off**: too much direction can narrow the model's creativity, and piling on instructions can quickly produce a conflicting combination that the model can't resolve. In practice, though, too little direction is the more common problem.

#### 2. **Specify Format**


AI models are universal translators! They can translate between:

* Languages, for example French to English
* Data formats, for example lists to JSON
* Programming languages, for example C# to Python
* Natural language and programming languages, for example text to Python

* **Task**: Update the prompt to return the response as a comma-separated list of 5 names

In [None]:
prompt = """
Brainstorm a list of product names for a pair of trainers that appeal to teenagers, in the style of Nike.

Return the response as a comma-separated list of 5 names, in this format: ["Name 1", "Name 2", "Name 3", "Name 4", "Name 5"]
"""

completion = get_completion(prompt)
completion

* **Task**: Print the items in the list, one by one
* *Hint*: the response is a string that looks like a list; use `ast.literal_eval` first to convert it into a list object

In [None]:
completion = ast.literal_eval(completion)

for item in completion:
    print(item)
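`ast.literal_eval` raises an error if the model wraps the list in extra text or ignores the requested format. The following is a minimal sketch of a more forgiving parser, not a definitive implementation: `parse_name_list` is a hypothetical helper, and the sample response and its names are made up for illustration.

```python
import ast


def parse_name_list(text):
    """Return a list of names from a model response, tolerating extra text."""
    # Look for the outermost pair of square brackets in the response
    start, end = text.find("["), text.rfind("]")
    if start != -1 and end > start:
        try:
            parsed = ast.literal_eval(text[start : end + 1])
            if isinstance(parsed, (list, tuple)):
                return [str(item).strip() for item in parsed]
        except (ValueError, SyntaxError):
            pass  # fall through to the plain comma split below
    # Fallback: treat the response as a plain comma-separated string
    parts = (part.strip(" \"'\n") for part in text.split(","))
    return [part for part in parts if part]


# Made-up response, for illustration only
sample = 'Sure! Here are the names: ["AirPulse", "StreetFlux", "Volt Rush"]'

for item in parse_name_list(sample):
    print(item)
```

In the notebook, `parse_name_list(completion)` could be used in place of the direct `ast.literal_eval` call when the raw response is not guaranteed to be a clean list.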

#### 3. **Provide Examples**

* Our prompts so far haven't given any examples of what "good" output looks like. A prompt with no examples is called a `zero-shot` prompt, and it asks a lot of the model while giving it very little to work with.

* Even adding one example, which is called `one-shot`, helps considerably

* It's more common practice to add multiple examples, which is called `few-shot`. The strength of a prompt often comes down to the examples used, especially if you're not a domain expert: providing examples can be easier than trying to explain what it is that you want the output to look like.

* **Task**: experiment with the number of examples, from `zero-shot` to `few-shot`

In [None]:
prompt = """

"""

completion = get_completion(prompt)
completion = json.loads(completion)

completion["ProductNames"]

**Trade-off**: between reliability and creativity. Go past 3-5 examples and the results will become more reliable but less creative. Give similar examples and the results will be much more constrained and less diverse. A lack of diversity and variation in the examples can cause problems when handling edge cases.
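One way to experiment with the number of examples is to assemble the prompt from a list of worked examples and vary how many are included. The sketch below reuses the milkshake and watch examples that appear elsewhere in this notebook; `build_prompt` and `EXAMPLES` are hypothetical names, and the prompt layout is only one possible choice. It prints the prompts rather than calling the model.

```python
# Worked examples borrowed from elsewhere in this notebook
EXAMPLES = [
    {
        "description": "A home milkshake maker.",
        "seed_words": "fast, healthy, compact",
        "names": "HomeShaker, Fit Shaker, QuickShake, Shake Maker",
    },
    {
        "description": "A watch that can tell accurate time in space.",
        "seed_words": "astronaut, space-hardened, elliptical orbit",
        "names": "AstroTime, SpaceGuard, Orbit-Accurate, EliptoTime",
    },
]


def build_prompt(description, seed_words, examples, n_shots):
    """Assemble a prompt with the first `n_shots` examples (0 = zero-shot)."""
    blocks = [
        f"Product description: {ex['description']}\n"
        f"Seed words: {ex['seed_words']}\n"
        f"Product names: {ex['names']}"
        for ex in examples[:n_shots]
    ]
    # The final block is left open for the model to complete
    blocks.append(
        f"Product description: {description}\n"
        f"Seed words: {seed_words}\n"
        "Product names:"
    )
    return "\n\n".join(blocks)


# Print the zero-shot, one-shot and two-shot versions of the same task
for n_shots in range(len(EXAMPLES) + 1):
    prompt = build_prompt(
        "A pair of shoes that can fit any foot size.",
        "adaptable, fit, omni-fit",
        EXAMPLES,
        n_shots,
    )
    print(f"--- {n_shots}-shot ---\n{prompt}\n")
```

Each generated prompt can be passed to `get_completion` to compare the responses. If the response is to be parsed with `json.loads`, the prompt would also need to ask for JSON output explicitly.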

#### 4. **Evaluate Quality**

* So far, there has been no feedback loop for judging the quality of the responses other than checking the results manually (a.k.a. a vibe check).

* This is called **blind prompting**, and it's only fine when a prompt is used temporarily for a single task and rarely revisited.

* If we are planning to use the same prompt for an application, then we need to be more rigorous with measuring the results. 

* This is because when a new LLM, or a new version of the same LLM, is released, there is no guarantee that a prompt that worked on the older model/version will work on the new one.

How to evaluate?

* **Step 1**: Generate a number of prompts for a task
* **Step 2**: Get responses from *multiple* runs of each prompt
* **Step 3**: Save the results in a file
* **Step 4**: Evaluate each response using one of the strategies below, ideally blind and in random order to avoid favouring one prompt over another

**Strategies**:

* Simple thumbs-up/down rating system
* 3-, 5-, or 10-point rating system
* Use a ground-truth
* Use an LLM as judge

**Task**: 
* Iterate through **2 prompts**
* For each prompt, iterate through **3 runs** 
* Give each prompt and run a name
* Get a response from the model
* Convert responses into a dataframe
* Save the dataframe as a CSV file

In [None]:
prompt_1 = """Product description: A pair of shoes that can fit any foot size.
Seed words: adaptable, fit, omni-fit.
Product names:"""

prompt_2 = """Product description: A home milkshake maker.
Seed words: fast, healthy, compact.
Product names: HomeShaker, Fit Shaker, QuickShake, Shake Maker

Product description: A watch that can tell accurate time in space.
Seed words: astronaut, space-hardened, elliptical orbit
Product names: AstroTime, SpaceGuard, Orbit-Accurate, EliptoTime.

Product description: A pair of shoes that can fit any foot size.
Seed words: adaptable, fit, omni-fit.
Product names:"""

In [None]:
prompts = [prompt_1, prompt_2]
responses = []
no_runs = 3

for ind, prompt in enumerate(prompts):
    for run in range(no_runs):
        # name each row as "<prompt index>_<run index>"
        variant = f"{ind}_{run}"
        response = get_completion(prompt)

        data = {
            "variant": variant,
            "prompt": prompt,
            "response": response,
        }

        responses.append(data)

df = pd.DataFrame(responses)

df.to_csv("responses.csv", index=False)

In [None]:
display(df)

In [None]:
# load the responses.csv file
df = pd.read_csv("responses.csv")

# Shuffle the dataframe
df = df.sample(frac=1).reset_index(drop=True)

# df is your dataframe and 'response' is the column with text to test
response_index = 0
# add a new column to store the numeric feedback (1 = thumbs up, 0 = thumbs down)
df["feedback"] = pd.Series(dtype="float")


def on_button_clicked(b):
    global response_index
    #  convert thumbs up / down to 1 / 0
    user_feedback = 1 if b.description == "\U0001f44d" else 0

    # update the feedback column
    df.at[response_index, "feedback"] = user_feedback

    response_index += 1
    if response_index < len(df):
        update_response()
    else:
        df.to_csv("results.csv", index=False)

        print("A/B testing completed. Here are the results:")
        # Calculate score and num rows for each variant
        summary_df = (
            df.groupby("variant")
            .agg(count=("feedback", "count"), score=("feedback", "mean"))
            .reset_index()
        )
        print(summary_df)


def update_response():
    new_response = df.iloc[response_index]["response"]
    if pd.notna(new_response):
        new_response = "<p>" + new_response + "</p>"
    else:
        new_response = "<p>No response</p>"
    response.value = new_response
    count_label.value = f"Response: {response_index + 1}"
    count_label.value += f"/{len(df)}"


response = widgets.HTML()
count_label = widgets.Label()

update_response()

thumbs_up_button = widgets.Button(description="\U0001f44d")
thumbs_up_button.on_click(on_button_clicked)

thumbs_down_button = widgets.Button(description="\U0001f44e")
thumbs_down_button.on_click(on_button_clicked)

button_box = widgets.HBox([thumbs_down_button, thumbs_up_button])

display(response, button_box, count_label)
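Because each `variant` name combines the prompt index and the run index (for example `0_1`), the summary printed by the widget has one row per run. To compare prompts rather than individual runs, the feedback can be averaged per prompt. This is a minimal standard-library sketch, assuming the `<prompt index>_<run index>` naming used above and feedback stored as 1 or 0; `score_by_prompt` is a hypothetical helper and the ratings shown are made up.

```python
from collections import defaultdict


def score_by_prompt(rows):
    """Average the 1/0 feedback per prompt index, ignoring the run index."""
    totals = defaultdict(lambda: [0.0, 0])  # prompt index -> [sum, count]
    for row in rows:
        # "0_1" -> prompt index "0" (the part before the underscore)
        prompt_index = str(row["variant"]).split("_")[0]
        totals[prompt_index][0] += float(row["feedback"])
        totals[prompt_index][1] += 1
    return {
        prompt_index: {"count": count, "score": total / count}
        for prompt_index, (total, count) in totals.items()
    }


# Made-up ratings in the same shape as the rows of results.csv
rows = [
    {"variant": "0_0", "feedback": 0},
    {"variant": "0_1", "feedback": 1},
    {"variant": "1_0", "feedback": 1},
    {"variant": "1_1", "feedback": 1},
]
print(score_by_prompt(rows))
```

The same function accepts the rows produced by `csv.DictReader` over `results.csv`, since it converts the feedback values with `float`.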

#### 5. **Divide Labour**

* When your prompt gets longer and more convoluted, the response might get less deterministic, and hallucinations or anomalies become more likely.

* One of the core principles of engineering is to use *task decomposition* to break a problem down into its component parts, so that each individual part can be solved more easily.

* Breaking your AI work into multiple calls that are chained together can help accomplish more complex tasks

**Task**:
* Step 1: Generate a list of 10 product names (as before) 

* Step 2: Rate this list out of 10

* Step 3: Append `let's think step by step` to understand how the ratings were generated

* Step 4: Choose the best product name with the highest rating

In [None]:
prompt = """

"""

names = get_completion(prompt)

In [None]:
prompt = f"""
Rate this list of product names for a pair of shoes that can fit any foot size. The rating should be out of 10, inline next to the product name:

{names} 
"""

rating = get_completion(prompt)

In [None]:
prompt = f"""
Let's think step by step. Rate this list of product names for a pair of shoes that can fit any foot size. The rating should be out of 10, inline next to the product name:

{names} 
"""

rating = get_completion(prompt)
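Step 4 can be done in code once the ratings come back. The sketch below is one possible approach, assuming the model followed the instruction and put each rating inline as `<name> ... <score>/10`; the regular expression, the `best_name` helper and the sample text are all hypothetical, and lines that don't match the pattern are simply skipped.

```python
import re

# Matches lines such as "1. OmniFit - 9/10" or "FlexStep: 7.5/10"
RATING_LINE = re.compile(
    r"^\s*(?:\d+[.)]\s*)?(?P<name>.+?)\s*[-:]*\s*(?P<score>\d+(?:\.\d+)?)\s*/\s*10"
)


def best_name(rating_text):
    """Return (name, score) for the highest-rated line, or None if none parse."""
    scored = []
    for line in rating_text.splitlines():
        match = RATING_LINE.match(line)
        if match:
            # Drop surrounding spaces, quotes and markdown asterisks
            name = match.group("name").strip(" *\"'")
            scored.append((name, float(match.group("score"))))
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Made-up ratings, for illustration only
sample_rating = """1. OmniFit - 9/10
2. FlexStep: 7.5/10
3. AnyFit 8/10"""

print(best_name(sample_rating))
```

In the notebook, `best_name(rating)` would be applied to the model's response. An alternative that stays within the chaining approach is a further `get_completion` call that asks the model to pick the highest-rated name.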