<a href="https://colab.research.google.com/github/mshumer/gpt-prompt-engineer/blob/main/claude_prompt_engineer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# claude-prompt-engineer
By Matt Shumer (https://twitter.com/mattshumer_)

GitHub repo: https://github.com/mshumer/gpt-prompt-engineer

Generate an optimal prompt for a given task.

To generate a prompt:
1. Make your Anthropic API key available to the first cell, which reads `ANTHROPIC_API_KEY` from the environment (or from a `.env` file).
2. If you want, adjust your settings in the `Adjust settings here` cell.
3. Near the end of the notebook, fill in the description of your task and the variables the system should account for.
4. Run all the cells! The AI will generate a number of candidate prompts and test them all to find the best one!
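
Each candidate prompt is a template: input variables appear as `{{VARIABLE_NAME}}` placeholders and are filled in from a test case before the prompt is sent to the model. A minimal sketch of that substitution, assuming a hypothetical `fill_template` helper and made-up example values (neither is part of the notebook):

```python
def fill_template(template: str, values: dict) -> str:
    """Replace each {{NAME}} placeholder with its value (hypothetical helper)."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


# Made-up template and values, for illustration only.
template = "Write a reply from {{SENDER_NAME}} to {{RECIPIENT_NAME}} about {{TOPIC}}."
values = {"SENDER_NAME": "Ana", "RECIPIENT_NAME": "Ben", "TOPIC": "the Q3 report"}

filled = fill_template(template, values)
print(filled)
```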

In [1]:
from prettytable import PrettyTable
import time
import requests
from tqdm import tqdm
import itertools
import wandb
from tenacity import retry, stop_after_attempt, wait_exponential
from dotenv import load_dotenv
import os

load_dotenv(".env")

ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
use_wandb = False # set to True if you want to use wandb to log your config and results

## Adjust settings here

In [2]:
# K is a constant factor that determines how much ratings change
K = 32

CANDIDATE_MODEL = 'claude-3-opus-20240229'
CANDIDATE_MODEL_TEMPERATURE = 0.9

GENERATION_MODEL = 'claude-3-opus-20240229'
GENERATION_MODEL_TEMPERATURE = 0.8
GENERATION_MODEL_MAX_TOKENS = 800

TEST_CASE_MODEL = 'claude-3-opus-20240229'
TEST_CASE_MODEL_TEMPERATURE = .8

NUMBER_OF_TEST_CASES = 8 # number of test cases to generate; more test cases cost more API calls but give each pair of prompts more comparisons

N_RETRIES = 3  # maximum number of attempts for each retried API call (generation and ranking) before giving up
RANKING_MODEL = 'claude-3-opus-20240229'
RANKING_MODEL_TEMPERATURE = 0.5

NUMBER_OF_PROMPTS = 5 # number of candidate prompts to generate; the number of pairwise comparisons grows quadratically with this value

WANDB_PROJECT_NAME = "claude-prompt-eng" # used if use_wandb is True, Weights & Biases project name
WANDB_RUN_NAME = None # used if use_wandb is True, optionally set the Weights & Biases run name to identify this run
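
The two `NUMBER_OF_*` settings drive most of the cost, because every pair of candidate prompts is compared on every test case. A rough sketch of the call count, based on how the testing loop further down is written (two generations and two ranking calls per round, one call per candidate prompt, and one call to generate the test cases). The `estimate_api_calls` helper is hypothetical and ignores retries:

```python
from math import comb


def estimate_api_calls(num_test_cases: int, num_prompts: int) -> dict:
    """Estimate API calls for one run, ignoring retries (hypothetical helper)."""
    # One round per (pair of prompts, test case).
    rounds = num_test_cases * comb(num_prompts, 2)
    # Each round: 2 generations + 2 ranking calls (one per ordering).
    comparison_calls = rounds * 4
    # Setup: 1 call to generate test cases + 1 call per candidate prompt.
    setup_calls = 1 + num_prompts
    return {"rounds": rounds, "total_calls": comparison_calls + setup_calls}


estimate = estimate_api_calls(num_test_cases=8, num_prompts=5)
print(estimate)  # {'rounds': 80, 'total_calls': 326}
```

With the default settings this gives 80 rounds, which is the total the progress bar reports during a run.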

In [3]:
def start_wandb_run():
  # start a new wandb run and log the config
  wandb.init(
    project=WANDB_PROJECT_NAME,
    name=WANDB_RUN_NAME,
    config={
      "K": K,
      "candidate_model": CANDIDATE_MODEL,
      "candidate_model_temperature": CANDIDATE_MODEL_TEMPERATURE,
      "generation_model": GENERATION_MODEL,
      "generation_model_temperature": GENERATION_MODEL_TEMPERATURE,
      "generation_model_max_tokens": GENERATION_MODEL_MAX_TOKENS,
      "n_retries": N_RETRIES,
      "ranking_model": RANKING_MODEL,
      "ranking_model_temperature": RANKING_MODEL_TEMPERATURE,
      "number_of_prompts": NUMBER_OF_PROMPTS
      })

  return

In [4]:
# Optional logging to Weights & Biases to record the configs, prompts and results
if use_wandb:
  start_wandb_run()

In [5]:
import json
import re

def remove_first_line(test_string):
    if test_string.startswith("Here") and test_string.split("\n")[0].strip().endswith(":"):
        return re.sub(r'^.*\n', '', test_string, count=1)
    return test_string

def generate_candidate_prompts(description, input_variables, test_cases, number_of_prompts):
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }

    variable_descriptions = "\n".join(f"{var['variable']}: {var['description']}" for var in input_variables)

    data = {
        "model": CANDIDATE_MODEL,
        "max_tokens": 1500,
        "temperature": CANDIDATE_MODEL_TEMPERATURE,
        "system": f"""Your job is to generate system prompts for Claude 3, given a description of the use-case, some test cases/input variable examples that will help you understand what the prompt will need to be good at.
The prompts you will be generating will be for freeform tasks, such as generating a landing page headline, an intro paragraph, solving a math problem, etc.
In your generated prompt, you should describe how the AI should behave in plain English. Include what it will see, and what it's allowed to output.
<most_important>Make sure to incorporate the provided input variable placeholders into the prompt, using placeholders like {{{{VARIABLE_NAME}}}} for each variable. Ensure you place placeholders inside four squiggly lines like {{{{VARIABLE_NAME}}}}. At inference time/test time, we will slot the variables into the prompt, like a template.</most_important>
Be creative with prompts to get the best possible results. The AI knows it's an AI -- you don't need to tell it this.
You will be graded based on the performance of your prompt... but don't cheat! You cannot include specifics about the test cases in your prompt. Any prompts with examples will be disqualified.
Here are the input variables and their descriptions:
{variable_descriptions}
Most importantly, output NOTHING but the prompt (with the variables contained in it like {{{{VARIABLE_NAME}}}}). Do not include anything else in your message.""",
        "messages": [
            {"role": "user", "content": f"Here are some test cases:`{test_cases}`\n\nHere is the description of the use-case: `{description.strip()}`\n\nRespond with your flexible system prompt, and nothing else. Be creative, and remember, the goal is not to complete the task, but write a prompt that will complete the task."},
        ]
    }

    prompts = []

    for i in range(number_of_prompts):
        response = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=data)

        message = response.json()

        response_text = message['content'][0]['text']

        prompts.append(remove_first_line(response_text))

    return prompts

def expected_score(r1, r2):
    return 1 / (1 + 10**((r2 - r1) / 400))

def update_elo(r1, r2, score1):
    e1 = expected_score(r1, r2)
    e2 = expected_score(r2, r1)
    return r1 + K * (score1 - e1), r2 + K * ((1 - score1) - e2)

# Get Score - retry up to N_RETRIES times, waiting exponentially between retries.
@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def get_score(description, test_case, pos1, pos2, input_variables, ranking_model_name, ranking_model_temperature):
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }

    variable_values = "\n".join(f"{var['variable']}: {test_case.get(var['variable'], '')}" for var in input_variables)

    data = {
        "model": ranking_model_name,
        "max_tokens": 1,
        "temperature": ranking_model_temperature,
        "system": f"""Your job is to rank the quality of two outputs generated by different prompts. The prompts are used to generate a response for a given task.
You will be provided with the task description, input variable values, and two generations - one for each system prompt.
Rank the generations in order of quality. If Generation A is better, respond with 'A'. If Generation B is better, respond with 'B'.
Remember, to be considered 'better', a generation must not just be good, it must be noticeably superior to the other.
Also, keep in mind that you are a very harsh critic. Only rank a generation as better if it truly impresses you more than the other.
Respond with your ranking ('A' or 'B'), and nothing else. Be fair and unbiased in your judgement.""",
        "messages": [
            {"role": "user", "content": f"""Task: {description.strip()}
Variables: {test_case['variables']}
Generation A: {remove_first_line(pos1)}
Generation B: {remove_first_line(pos2)}"""},
        ]
    }

    response = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=data)

    message = response.json()

    score = message['content'][0]['text']

    return score

@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def get_generation(prompt, test_case, input_variables):
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }


    # Replace variable placeholders in the prompt with their actual values from the test case
    for var_dict in test_case['variables']:
        for variable_name, variable_value in var_dict.items():
            prompt = prompt.replace(f"{{{{{variable_name}}}}}", str(variable_value))

    data = {
        "model": GENERATION_MODEL,
        "max_tokens": GENERATION_MODEL_MAX_TOKENS,
        "temperature": GENERATION_MODEL_TEMPERATURE,
        "system": 'Complete the task perfectly.',
        "messages": [
            {"role": "user", "content": prompt},
        ]
    }

    response = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=data)

    message = response.json()

    generation = message['content'][0]['text']

    return generation

def test_candidate_prompts(test_cases, description, input_variables, prompts):
    # Initialize each prompt with an ELO rating of 1200
    prompt_ratings = {prompt: 1200 for prompt in prompts}

    # Calculate total rounds for progress bar
    total_rounds = len(test_cases) * len(prompts) * (len(prompts) - 1) // 2

    # Initialize progress bar
    pbar = tqdm(total=total_rounds, ncols=70)

    # For each pair of prompts
    for prompt1, prompt2 in itertools.combinations(prompts, 2):
        # For each test case
        for test_case in test_cases:
            # Update progress bar
            pbar.update()

            # Generate outputs for each prompt
            generation1 = get_generation(prompt1, test_case, input_variables)
            generation2 = get_generation(prompt2, test_case, input_variables)

            # Rank the outputs
            score1 = get_score(description, test_case, generation1, generation2, input_variables, RANKING_MODEL, RANKING_MODEL_TEMPERATURE)
            score2 = get_score(description, test_case, generation2, generation1, input_variables, RANKING_MODEL, RANKING_MODEL_TEMPERATURE)

            # Convert scores to numeric values
            score1 = 1 if score1 == 'A' else 0 if score1 == 'B' else 0.5
            score2 = 1 if score2 == 'B' else 0 if score2 == 'A' else 0.5

            # Average the scores
            score = (score1 + score2) / 2

            # Update ELO ratings
            r1, r2 = prompt_ratings[prompt1], prompt_ratings[prompt2]
            r1, r2 = update_elo(r1, r2, score)
            prompt_ratings[prompt1], prompt_ratings[prompt2] = r1, r2

            # Print the winner of this round
            if score > 0.5:
                print(f"Winner: {prompt1}")
            elif score < 0.5:
                print(f"Winner: {prompt2}")
            else:
                print("Draw")

    # Close progress bar
    pbar.close()

    return prompt_ratings

def generate_optimal_prompt(description, input_variables, num_test_cases=10, number_of_prompts=10, use_wandb=False):
    if use_wandb:
        wandb_table = wandb.Table(columns=["Prompt", "Ranking"] + [var["variable"] for var in input_variables])
        if wandb.run is None:
            start_wandb_run()

    test_cases = generate_test_cases(description, input_variables, num_test_cases)
    prompts = generate_candidate_prompts(description, input_variables, test_cases, number_of_prompts)
    print('Here are the possible prompts:', prompts)
    prompt_ratings = test_candidate_prompts(test_cases, description, input_variables, prompts)

    table = PrettyTable()
    table.field_names = ["Prompt", "Rating"] + [var["variable"] for var in input_variables]
    for prompt, rating in sorted(prompt_ratings.items(), key=lambda item: item[1], reverse=True):
        # Use the first test case as an example for displaying the input variables
        example_test_case = test_cases[0]
        table.add_row([prompt, rating] + [example_test_case.get(var["variable"], "") for var in input_variables])
        if use_wandb:
            wandb_table.add_data(prompt, rating, *[example_test_case.get(var["variable"], "") for var in input_variables])

    if use_wandb:
        wandb.log({"prompt_ratings": wandb_table})
        wandb.finish()
    print(table)

def generate_test_cases(description, input_variables, num_test_cases):
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json"
    }

    variable_descriptions = "\n".join(f"{var['variable']}: {var['description']}" for var in input_variables)

    data = {
        "model": TEST_CASE_MODEL,
        "max_tokens": 1500,
        "temperature": TEST_CASE_MODEL_TEMPERATURE,
        "system": f"""You are an expert at generating test cases for evaluating AI-generated content.
Your task is to generate a list of {num_test_cases} test case prompts based on the given description and input variables.
Each test case should be a JSON object with a 'test_design' field containing the overall idea of this test case, and a list of additional JSONs for each input variable, called 'variables'.
The test cases should be diverse, covering a range of topics and styles relevant to the description.
Here are the input variables and their descriptions:
{variable_descriptions}
Return the test cases as a JSON list, with no other text or explanation.""",
        "messages": [
            {"role": "user", "content": f"Description: {description.strip()}\n\nGenerate the test cases. Make sure they are really, really great and diverse:"},
        ]
    }

    response = requests.post("https://api.anthropic.com/v1/messages", headers=headers, json=data)
    message = response.json()

    response_text = message['content'][0]['text']

    test_cases = json.loads(response_text)

    print('Here are the test cases:', test_cases)

    return test_cases
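
The rating functions above follow the standard Elo update: each prompt's rating moves by `K` times the gap between its actual and expected score. A small self-contained sketch of one round, which redefines the two functions with `k` as a parameter so it runs on its own. The `combine_judgments` helper is a hypothetical name for the averaging that `test_candidate_prompts` does inline, where each pair is judged twice with the positions swapped:

```python
def expected_score(r1: float, r2: float) -> float:
    return 1 / (1 + 10 ** ((r2 - r1) / 400))


def update_elo(r1: float, r2: float, score1: float, k: float = 32):
    e1 = expected_score(r1, r2)
    e2 = expected_score(r2, r1)
    return r1 + k * (score1 - e1), r2 + k * ((1 - score1) - e2)


def combine_judgments(first: str, swapped: str) -> float:
    """Average two verdicts for prompt 1; in the swapped call prompt 1 is 'B'."""
    s1 = 1 if first == "A" else 0 if first == "B" else 0.5
    s2 = 1 if swapped == "B" else 0 if swapped == "A" else 0.5
    return (s1 + s2) / 2


# Equal ratings: a win in both orderings moves each rating by K/2 = 16 points.
win = update_elo(1200, 1200, combine_judgments("A", "B"))
# The judge picked position 'A' both times, so the verdicts cancel to a draw.
draw = update_elo(1200, 1200, combine_judgments("A", "A"))
print(win, draw)  # (1216.0, 1184.0) (1200.0, 1200.0)
```

Judging each pair in both orders means a ranking model that simply favors one position produces a draw rather than a spurious win.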

# In the cell below, fill in your description and input variables

In [6]:
"""## Example usage
description = "Given a prompt, generate a personalized email response." # this style of description tends to work well

input_variables = [
    {"variable": "SENDER_NAME", "description": "The name of the person who sent the email."},
    {"variable": "RECIPIENT_NAME", "description": "The name of the person receiving the email."},
    {"variable": "TOPIC", "description": "The main topic or subject of the email. One to two sentences."}
]"""


In [7]:
description = """

Given a prompt, generate a list of JSON objects that classify a candidate's CV against multiple job descriptions. The purpose is to classify the CV's suitability for each job opening into one of five categories: Highly Suitable, Moderately Suitable, Potentially Suitable, Marginally Suitable, or Not Suitable. The classification is based on the candidate's years of experience and the match between their skills, qualifications, and the job requirements. To perform this task, you will receive a candidate's CV delimited by #### characters, along with job IDs delimited by <> characters and their corresponding job descriptions delimited by ---- characters. The output must be a JSON object containing the job ID, suitability category, and a brief explanation for each classification.
"""

input_variables = [
    {"variable": "id", "description": "A key name of each JSON object within the list, representing the unique identifier for each job opening, and it corresponds to the one delimited by <> characters"},
    {"variable": "suitability", "description": "A key name of each JSON object within the list, representing one of five categories (Highly Suitable, Moderately Suitable, Potentially Suitable, Marginally Suitable, or Not Suitable) which determines the level of compatibility between a user's CV and a job description."},
    {"variable": "explanation", "description": "A key name of each JSON object within the list, representing a brief explanation behind the chosen suitability category."}
]

In [8]:
if use_wandb:
    wandb.config.update({"description": description,
                         "input_variables": input_variables,
                         "num_test_cases": NUMBER_OF_TEST_CASES,
                         "number_of_prompts": NUMBER_OF_PROMPTS})

## Run this cell to start the prompt engineering process!

In [9]:
generate_optimal_prompt(description, input_variables, NUMBER_OF_TEST_CASES, NUMBER_OF_PROMPTS, use_wandb)

Here are the test cases: [{'test_design': 'Basic test case with a single job opening and a well-matched candidate CV', 'variables': [{'id': '<job1>', 'suitability': 'Highly Suitable', 'explanation': 'The candidate has 5+ years of relevant experience and possesses all the required skills for the job.'}]}, {'test_design': 'Test case with multiple job openings and a candidate CV that matches some but not all requirements', 'variables': [{'id': '<job1>', 'suitability': 'Moderately Suitable', 'explanation': 'The candidate has relevant experience but lacks some of the preferred qualifications.'}, {'id': '<job2>', 'suitability': 'Potentially Suitable', 'explanation': "The candidate's experience is in a related field, but they may need additional training to meet all requirements."}, {'id': '<job3>', 'suitability': 'Not Suitable', 'explanation': "The candidate's experience and skills do not align with the job requirements."}]}, {'test_design': 'Test case with a candidate CV that has no relevan

  2%|▊                                 | 2/80 [00:27<18:03, 13.89s/it]

Winner: 
You are an AI assistant that classifies a candidate's suitability for multiple job openings based on their CV. Your input will include:
- The candidate's CV, delimited by #### characters 
- One or more job IDs, each delimited by <> characters
- The corresponding job description for each job ID, delimited by ---- characters

For each job ID provided, analyze the candidate's CV against the job description and requirements. Classify the candidate's suitability into one of five categories:
- Highly Suitable 
- Moderately Suitable
- Potentially Suitable
- Marginally Suitable
- Not Suitable

In addition to the classification, provide a brief explanation for why you chose that suitability level, considering factors like:
- The candidate's years of relevant experience 
- The match between the candidate's skills and qualifications and the job requirements
- Any transferable skills from related industries
- Gaps in required or preferred qualifications

Generate your output as a list of 

  4%|█▎                                | 3/80 [00:54<24:49, 19.34s/it]

Winner: 
You are an AI assistant that classifies a candidate's suitability for multiple job openings based on their CV. Your input will include:
- The candidate's CV, delimited by #### characters 
- One or more job IDs, each delimited by <> characters
- The corresponding job description for each job ID, delimited by ---- characters

For each job ID provided, analyze the candidate's CV against the job description and requirements. Classify the candidate's suitability into one of five categories:
- Highly Suitable 
- Moderately Suitable
- Potentially Suitable
- Marginally Suitable
- Not Suitable

In addition to the classification, provide a brief explanation for why you chose that suitability level, considering factors like:
- The candidate's years of relevant experience 
- The match between the candidate's skills and qualifications and the job requirements
- Any transferable skills from related industries
- Gaps in required or preferred qualifications

Generate your output as a list of 

RetryError: RetryError[<Future at 0x10db12d20 state=finished raised KeyError>]
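
The `RetryError` above wraps a `KeyError` raised inside one of the retried functions. One plausible cause (an assumption, since the traceback is not shown) is that the API returned an error payload without a `content` key, for example after a rate limit, so `message['content'][0]['text']` failed on every attempt. A hedged sketch of a hypothetical `extract_text` helper that surfaces the API's error message instead. It assumes the Anthropic Messages API error shape `{"type": "error", "error": {...}}`, and the sample payloads are made up:

```python
class AnthropicResponseError(RuntimeError):
    """Raised when a response has no usable text (hypothetical helper type)."""


def extract_text(message: dict) -> str:
    """Return the first text block of a Messages API response, or raise."""
    if message.get("type") == "error":
        err = message.get("error", {})
        raise AnthropicResponseError(f"{err.get('type')}: {err.get('message')}")
    content = message.get("content") or []
    if not content or "text" not in content[0]:
        raise AnthropicResponseError(f"No text block in response: {message}")
    return content[0]["text"]


# Made-up payloads for illustration.
ok = {"type": "message", "content": [{"type": "text", "text": "A"}]}
failed = {"type": "error", "error": {"type": "rate_limit_error", "message": "slow down"}}

print(extract_text(ok))
try:
    extract_text(failed)
except AnthropicResponseError as exc:
    print("API error:", exc)
```

Calling a helper like this in place of `message['content'][0]['text']` in `get_score` and `get_generation` would still let `tenacity` retry. Passing `reraise=True` to `@retry` would then surface the underlying message rather than a bare `RetryError`.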