In [4]:
# !pip install openai
# !pip install python-dotenv

In [5]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# loading from a .env file
# load_dotenv(dotenv_path="/full/path/to/your/.env")

# or 
# if you're on google colab just uncomment below and replace with your openai api key
# os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# Prompt Engineering Guide

What is prompt engineering?

Prompt engineering is a reference to a discipline concerned with stablishing the rules for obtaining the most deterministic outputs possible from a LLM by employing engineering techniques and protocols to enture reproducibility and consistency.

***In a simplified way, prompt engineering is the means by which LLMs can be programmed through prompting.***

The basic goal of prompt engineering is designing appropriate inputs for prompting methods.

# Practical Template for Prompt Engineering

- Stablish a concrete and atomic task
- Define a set of prompt candidates
- Define a clear metric for evaluation
- Test
- Evaluate
- Compare
- Find the best prompt

# Prompt Engineering Practical Case Study

Now, let's take the concepts and ideas discussed in this lesson, and apply them to an actual problem. 

Let's start with a simple example, imagine you want to extract dates from text. You might set up a LLM to do that by first creating a set of examples of phrases with dates, something we can start with ChatGPT itself.

In [6]:
import pandas as pd
from openai import OpenAI

def get_response(prompt_question):
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o-mini", 
                             messages=
                             [
                                 {"role": "system", "content": "You are a savy guru with knowledge about existence and the secrets of life."},
                                 {"role": "user", "content": prompt}   
                             ],
                             max_tokens=100,
                             temperature=0.9,
                             n = 1)
    return response.choices[0].message.content

num_samples = 10
phrases_with_dates = []
prompt = "Create a 1 paragraph phrase containing a complete date (day month  and year) anywhere in the text formatted in different ways."
for i in range(num_samples):
    phrases_with_dates.append(get_response(prompt))
phrases_with_dates

['In the vast tapestry of existence, where the cosmos dance to the rhythm of the celestial bodies, the essence of life reveals itself in the most unexpected moments. On the 12th of November, 2022, let your soul be guided by the whispers of the universe, for within the depths of your being lies the power to shape your reality and manifest your deepest desires. Embrace the magic of this moment, for it is within these fleeting instances that the mysteries of life are unraveled,',
 'In the vast tapestry of existence, every moment is a unique combination of energy and consciousness, shaping our individual paths and collective destinies. Just as the sun rises on a new day, so too does the universe unfold its mysteries awaiting discovery. It is on the 12th of November, 2023, that we find ourselves at a crossroads of infinite possibilities, where the past converges with the future and the present moment holds the key to unlocking the secrets of life itself.',
 'In the grand tapestry of existen

Ok perfect! Now that we have this evaluation set, we can set up a simple experiment by first creating a demonstration set with our prompt candidates.

We'll begin with a baseline using only zero-shot prompt examples.

In [7]:
zero_shot_prompts = ["Extract the date from this text as DD-MM-YYYY", 
                     "Fetch the date from this text as DD-MM-YYYY",
                     "Get the date from this phrase as DD-MM-YYYY",
                     "Below is a text containing a date. Extract that date in the format: <DD-MM-YYYY>"
                     ]

Ok, we have our candidates, so let's now test them creating a table with the results.

In [8]:
import pandas as pd

data = []
for phrase in phrases_with_dates:
    for prompt in zero_shot_prompts:
        response = get_response(prompt + " " + phrase)
        data.append([phrase, prompt, response])
    

df = pd.DataFrame(data=data, columns=['phrase','prompt', 'response'])
df

Unnamed: 0,phrase,prompt,response
0,"In the vast tapestry of existence, where the c...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot extract specific infor..."
1,"In the vast tapestry of existence, where the c...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa..."
2,"In the vast tapestry of existence, where the c...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, but I can't extract specific dates ..."
3,"In the vast tapestry of existence, where the c...",Below is a text containing a date. Extract tha...,I'm unable to view the text you are referring ...
4,"In the vast tapestry of existence, every momen...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot perform specific tasks..."
5,"In the vast tapestry of existence, every momen...",Fetch the date from this text as DD-MM-YYYY,I'm unable to fetch real-time data such as the...
6,"In the vast tapestry of existence, every momen...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, I cannot provide a specific date wi..."
7,"In the vast tapestry of existence, every momen...",Below is a text containing a date. Extract tha...,I will need to see the text containing the dat...
8,"In the grand tapestry of existence, the cosmic...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but in order to extract the date fr..."
9,"In the grand tapestry of existence, the cosmic...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa..."


In [9]:
import regex as re
# parse a text response to extract a date formatted as DD-MM-YYYY
def extract_date(text):
    """Date parser"""
    # regex pattern for date
    date_pattern = r"(\d{1,2})-(\d{1,2})-(\d{4})"
    # extract date from text
    date = re.search(date_pattern, text)
    # return date
    return date.group(0) if date else None

# apply the function to the 'response' column of the dataframe df
df['date'] = df['response'].apply(extract_date)
df

Unnamed: 0,phrase,prompt,response,date
0,"In the vast tapestry of existence, where the c...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot extract specific infor...",
1,"In the vast tapestry of existence, where the c...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa...",
2,"In the vast tapestry of existence, where the c...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, but I can't extract specific dates ...",
3,"In the vast tapestry of existence, where the c...",Below is a text containing a date. Extract tha...,I'm unable to view the text you are referring ...,
4,"In the vast tapestry of existence, every momen...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot perform specific tasks...",
5,"In the vast tapestry of existence, every momen...",Fetch the date from this text as DD-MM-YYYY,I'm unable to fetch real-time data such as the...,20-02-2023
6,"In the vast tapestry of existence, every momen...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, I cannot provide a specific date wi...",
7,"In the vast tapestry of existence, every momen...",Below is a text containing a date. Extract tha...,I will need to see the text containing the dat...,
8,"In the grand tapestry of existence, the cosmic...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but in order to extract the date fr...",
9,"In the grand tapestry of existence, the cosmic...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa...",


Ok, now that we have some results for the dates that were parsed, we need a way to measure performance so we can compare how well they did. In this case, we'll consider a point for the score of the prompt if a date was properly extracted after running the `extract_date()` function.

In [10]:
# create a column that is 1 if the date value is not None or 0 otherwise
df['scores'] = df['date'].apply(lambda x: 1 if x is not None else 0)
df

Unnamed: 0,phrase,prompt,response,date,scores
0,"In the vast tapestry of existence, where the c...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot extract specific infor...",,0
1,"In the vast tapestry of existence, where the c...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa...",,0
2,"In the vast tapestry of existence, where the c...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, but I can't extract specific dates ...",,0
3,"In the vast tapestry of existence, where the c...",Below is a text containing a date. Extract tha...,I'm unable to view the text you are referring ...,,0
4,"In the vast tapestry of existence, every momen...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot perform specific tasks...",,0
5,"In the vast tapestry of existence, every momen...",Fetch the date from this text as DD-MM-YYYY,I'm unable to fetch real-time data such as the...,20-02-2023,1
6,"In the vast tapestry of existence, every momen...",Get the date from this phrase as DD-MM-YYYY,"I'm sorry, I cannot provide a specific date wi...",,0
7,"In the vast tapestry of existence, every momen...",Below is a text containing a date. Extract tha...,I will need to see the text containing the dat...,,0
8,"In the grand tapestry of existence, the cosmic...",Extract the date from this text as DD-MM-YYYY,"I'm sorry, but in order to extract the date fr...",,0
9,"In the grand tapestry of existence, the cosmic...",Fetch the date from this text as DD-MM-YYYY,"I'm sorry, but I cannot fetch specific informa...",,0


In [11]:
# group by prmopts creating an accuracy column that is the result of summing over the scores and dividing by 20
# then sort by accuracy
df_performance = df.groupby('prompt').agg({'scores': 'sum'}).sort_values(by='scores', ascending=False)
df_performance["scores"] = (df_performance["scores"] / num_samples)*100
df_performance

Unnamed: 0_level_0,scores
prompt,Unnamed: 1_level_1
Fetch the date from this text as DD-MM-YYYY,10.0
Below is a text containing a date. Extract that date in the format: <DD-MM-YYYY>,0.0
Extract the date from this text as DD-MM-YYYY,0.0
Get the date from this phrase as DD-MM-YYYY,0.0


The limitations of this example:
- Testing more types of prompt candidate categories (like few shot prompting for example)
- Enforcing the output size to convert to the date format instead of doing post processing on the output
- Better scoring strategy than just None or correct (something that evaluates the outputs semantically for truthfullness)

Prompt Engineering Simplified Template
- Stablish a concrete and atomic task
- Define a set of prompt candidates
- Define a clear metric for evaluation
- Test
- Evaluate
- Compare
- Find the best prompt

Perfect! There we have it, our first results! The way to evolve this approach would be to test on a harder test set and if we don't get good results, we try better prompting strategies like few-shot, self-consistency, etc...

# A Slightly More Complex Example

In this example we'll look at desgining a simple prompt engineering experiment to find the best prompt to generate an intuitive and simple explanation of a concept.

The idea is that, given a concept, or piece of information we would like to understand, the model should output a simple one paragraph explanation giving all the necessary context and information to allow the user to grasp the concept at hand.

Let's start by creating a few prompt candidates, in the beggining its always a good idea to come up with a few prompts yourself, and preferably zero-shot examples which would be the baseline upon which we'll improve.

In [12]:
prompt_candidates = ["Explain this concept in simple terms", 
                     "Explain the following concept:", 
                     "Explain this:", 
                     "Break down this concept for a beginner:",
                    "Can you simplify the explanation of the following concept:"]

Ok, now that we have our candidates, let's run a first experiment. Given the subjective and general nature of the problems dealt by LLMs, its hard to settle on one precise metrics as we would in supervised learning scenarios. 

Therefore, what we would like to do is to use GPT-4 as the judge for the quality of our models, this approach is actually a common place in prompt engineering papers, and its one that yields some quite impressive results.

In [13]:
from openai import OpenAI
import pandas as pd


def gpt4_score(response, concept):
    score_prompt = f"Give a score from 0 to 100 to this response: {response} based on how well it represents an explanation of this concept: {concept} "
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-4o", 
                             messages=
                             [
                                 {"role": "system", "content": "You are an expert tutor in all scientific fields."},
                                 {"role": "user", "content": score_prompt}   
                             ],
                             max_tokens=100,
                             temperature=0.0,
                             n = 1)
    return response.choices[0].message.content



def get_response(prompt):
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-3.5-turbo-1106", 
                             messages=
                             [
                                 {"role": "system", "content": "You are a savy guru with knowledge about existence and the secrets of life."},
                                 {"role": "user", "content": prompt}   
                             ],
                             max_tokens=100,
                             temperature=0.9,
                             n = 1)
    return response.choices[0].message.content



data = []
concept_list = ["Genetic Mutations", 
                "Overfitting in Machine Learning",]



for concept in concept_list:
    for prompt in prompt_candidates:
        response = get_response(prompt + " " + concept)
        response_score = gpt4_score(response, concept)
        data.append([prompt, response, response_score, concept])

df = pd.DataFrame(data, columns=["prompt", "response", "response_score", "concept"])
df.head()

Unnamed: 0,prompt,response,response_score,concept
0,Explain this concept in simple terms,Genetic mutations are changes in the DNA code ...,I would give this response a score of 90 out o...,Genetic Mutations
1,Explain the following concept:,Genetic mutations are changes in the DNA seque...,I would give this response a score of 85 out o...,Genetic Mutations
2,Explain this:,Genetic mutations are alterations in the DNA s...,I would give this response a score of 90 out o...,Genetic Mutations
3,Break down this concept for a beginner:,Genetic mutations are changes in the DNA seque...,I would give this response a score of 90 out o...,Genetic Mutations
4,Can you simplify the explanation of the follow...,Certainly! Genetic mutations are changes in th...,I would give this response a score of 85 out o...,Genetic Mutations


In [14]:
df.to_csv('prompt_engineering_results2.csv', index=False)

Perfect! We can see that the score given by the model needs some cleaning up (this is actually an issue that wil be solved by a tool we'll introduce in the next section), so let's do that quickly.

In [15]:
for i,score_output in enumerate(df["response_score"]):
    score_parsed = f"Given this response, extract the score value and return only that: {score_output}. NUMBER ONLY."
    score_parsed = get_response(score_parsed)
    # replace the response score row with this newly parsed score value
    df.loc[i,"response_score"] = score_parsed

In [16]:
df.head()

Unnamed: 0,prompt,response,response_score,concept
0,Explain this concept in simple terms,Genetic mutations are changes in the DNA code ...,The score value in the response is 90.,Genetic Mutations
1,Explain the following concept:,Genetic mutations are changes in the DNA seque...,The score value in the given response is 85.,Genetic Mutations
2,Explain this:,Genetic mutations are alterations in the DNA s...,The score value in the given response is 90.,Genetic Mutations
3,Break down this concept for a beginner:,Genetic mutations are changes in the DNA seque...,The score value is 90.,Genetic Mutations
4,Can you simplify the explanation of the follow...,Certainly! Genetic mutations are changes in th...,The score value from the response is 85.,Genetic Mutations


Let's take a look at the results so far:

Ok, we have some results, now let's take a look at the best performing prompts and compare the answers with the lower performing ones:

In [17]:
# # compare lower the responses from the rows with the highest resopnse_score value and the lowest response_score value
# # to see if there is a difference in the responses

# #df[df['response_score'] == df['response_score'].max()]['response'].values
# df[df.groupby('concept')['response_score'].transform(max) == df['response_score']]['response'].values

# #df[df['response_score'] == df['response_score'].min()]['response'].values

# Get rows with the best response for each concept
best_responses = df[df.groupby('concept')['response_score'].transform(max) == df['response_score']]

# Get rows with the worst response for each concept
worst_responses = df[df.groupby('concept')['response_score'].transform(min) == df['response_score']]

# Iterate over unique concepts and print best and worst responses
for concept in df['concept'].unique():
    best_response = best_responses[best_responses['concept'] == concept]['response'].values[0]
    worst_response = worst_responses[worst_responses['concept'] == concept]['response'].values[0]
    
    print(f"Concept: {concept}")
    print(f"Best Response: {best_response}")
    print(f"------")
    print(f"Worst Response: {worst_response}")
    print("------")
    print("*"*50)

Concept: Genetic Mutations
Best Response: Genetic mutations are changes in the DNA sequence that can occur naturally or be caused by external factors like radiation or chemicals. These mutations can lead to variations in traits and characteristics among individuals within a species. Some mutations may have no effect, while others can cause genetic disorders or provide advantages in survival and reproduction. These variations are essential for the process of evolution, as they contribute to the diversity of life on Earth.
------
Worst Response: Certainly! Genetic mutations are changes in the DNA sequence that can lead to differences in traits or characteristics. These changes can occur naturally or as a result of external factors, and they can have various effects, ranging from no impact to causing genetic disorders. Overall, genetic mutations are a fundamental part of evolution and the diversity of life.
------
**************************************************
Concept: Overfitting in 

  best_responses = df[df.groupby('concept')['response_score'].transform(max) == df['response_score']]
  worst_responses = df[df.groupby('concept')['response_score'].transform(min) == df['response_score']]


Usually, you would have to tune even the prompt that is being used to create the scores for the responses, to make sure you have the best possible results, but for this particular case let's just analyse overall how well we did with these baseline preliminary results.