<a href="https://colab.research.google.com/github/MissaouiAhmed/langsmith-samples/blob/main/Prompts/assisted_prompt_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Bootstrapping with Human Feedback

Prompt engineering isn't always the most fun, especially for tasks where metrics are hard to define. Crafting a prompt is often an iterative process, and it can be hard to get over the initial "cold start problem" of creating good prompts and datasets.

Turns out LLMs can do a [decent job at prompt engineering](https://arxiv.org/abs/2211.01910), especially when incorporating human feedback on representative data. For lack of a better term, I'll call this form of prompt optimization "Prompt Bootstrapping", since it iteratively refines a prompt via instruction tuning distilled from human feedback. Below is an overview of the process.

![Prompt Bootstrapping Diagram](./img/prompt-bootstrapping.png)

LangSmith makes this whole flow very easy. Let's give it a whirl!

This example is based on [@alexalbert's example Claude workflow](https://twitter.com/alexalbert__/status/1767258557039378511?s=20).

In [54]:
%pip install -U langsmith langchain_anthropic langchain arxiv langchain_openai langchain_community pymupdf --quiet

In [55]:
import os
from google.colab import userdata

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('LANGCHAIN_API_KEY')
# Model provider keys. The chains below use OpenAI; the Anthropic key is only
# needed if you swap in an Anthropic model.
os.environ["ANTHROPIC_API_KEY"] = userdata.get('ANTHROPIC_API_KEY')
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [56]:
from langsmith import Client

client = Client()

## 1. Pick a task

Let's say I want to write a tweet generator for academic papers, one that is catchy but neither impersonal nor laden with buzzwords. Let's see if we can "optimize" a prompt without having to engineer it ourselves.

We will use the meta-prompt ([wfh/metaprompt](https://smith.langchain.com/hub/wfh/metaprompt)) from the Hub to generate our first prompt candidate to solve this task.

In [57]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

task = (
    "Generate a tweet to market an academic paper or open source project. It should be"
    " well crafted but avoid gimmicks or over-reliance on buzzwords."
)

# See: https://smith.langchain.com/hub/wfh/metaprompt
prompt = hub.pull("wfh/metaprompt")
# Temperature 0 keeps the generated prompt as repeatable as possible.
llm = ChatOpenAI(model="gpt-4o", temperature=0)


def get_instructions(gen: str):
    return gen.split("<Instructions>")[1].split("</Instructions>")[0]


meta_prompter = prompt | llm | StrOutputParser() | get_instructions
recommended_prompt = meta_prompter.invoke(
    {
        "task": task,
        "input_variables": """
{paper}
""",
    }
)
print(recommended_prompt)


You are tasked with generating a tweet to market an academic paper or open source project. The content of the paper or project is provided below:

<content>
{paper}
</content>

Your goal is to craft a well-structured tweet that effectively communicates the essence of the paper or project. Here are the key elements to include in the tweet:

- A brief summary of the paper or project, highlighting its main focus or findings.
- The significance or potential impact of the work, explaining why it matters.
- A call to action, encouraging the audience to read the paper or explore the project further.

When crafting the tweet, avoid using gimmicks or over-relying on buzzwords. Instead, focus on clarity, relevance, and engaging the audience with the core message of the paper or project.

Write your tweet inside <tweet> tags.



OK so it's a fine-not-great prompt. Let's see how it does!
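One caveat before moving on: `get_instructions` indexes into the result of `split`, so it raises `IndexError` if the model ever omits the `<Instructions>` tags. Below is a minimal sketch of a more forgiving extractor, assuming it is acceptable to fall back to the raw text when the tags are missing. `extract_tag` is a hypothetical helper, not part of LangChain, and unlike the original it also strips surrounding whitespace.

```python
import re


def extract_tag(text: str, tag: str) -> str:
    """Return the contents of the first <tag>...</tag> pair, or the stripped text if absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()


sample = "Some preamble\n<Instructions>\nWrite a tweet about {paper}.\n</Instructions>"
print(extract_tag(sample, "Instructions"))
print(extract_tag("no tags here ", "tweet"))
```

The same helper could stand in for `parse_tweet` and `extract_new_prompt` later in the notebook, since all three pull text out of a single pair of tags.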

## 2. Dataset

For some tasks you can generate the data yourself. For this notebook, we fetch up to 10 papers from arXiv and, to keep the run small, add only the first one to the dataset.

In [58]:
from itertools import islice

from langchain_community.utilities.arxiv import ArxivAPIWrapper

wrapper = ArxivAPIWrapper(doc_content_chars_max=200_000)
docs = list(islice(wrapper.lazy_load("Self-Replicating Language model Agents"), 10))

In [59]:
ds_name = "Tweet Generator"
ds = client.create_dataset(dataset_name=ds_name)
# To upload every fetched paper instead of just the first:
# client.create_examples(
#     inputs=[{"paper": doc.page_content} for doc in docs],
#     dataset_id=ds.id,
# )

client.create_examples(
    inputs=[{"paper": docs[0].page_content}],
    dataset_id=ds.id,
)



{'example_ids': ['fdf4d8d7-ce6f-4680-b20e-c860b9bf7670'], 'count': 1}

## 3. Predict

We will refrain from defining metrics for now (it's quite subjective). Instead we will run the first version of the generator against the dataset and manually review + provide feedback on the results.

In [60]:
from langchain_core.prompts import PromptTemplate


def parse_tweet(response: str):
    try:
        return response.split("<tweet>")[1].split("</tweet>")[0].strip()
    except IndexError:
        # No <tweet> tag in the response: fall back to the raw text.
        return response.strip()


def create_tweet_generator(prompt_str: str):
    prompt = PromptTemplate.from_template(prompt_str)
    return prompt | llm | StrOutputParser() | parse_tweet


tweet_generator = create_tweet_generator(recommended_prompt)

# Example
prediction = tweet_generator.invoke({"paper": docs[0].page_content})
print(prediction)

🚀 Dive into the world of mobile agents with Steven Versteeg's comprehensive thesis! Explore how languages like Java, Telescript, and Agent Tcl empower agents to migrate, communicate, and interact securely across networks. Discover the future of network computing and why these innovations matter. 📚 Read more: [link] #MobileAgents #NetworkComputing #ProgrammingLanguages


In [61]:
res = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

View the evaluation results for project 'helpful-map-8' at:
https://smith.langchain.com/o/9f68f02e-02f8-464e-810c-347a3f2f34b3/datasets/b8baa542-a501-43bf-b2cc-8879111223fd/compare?selectedSessions=edac48fd-a333-4fe4-b1ee-91ab4e03120b

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/9f68f02e-02f8-464e-810c-347a3f2f34b3/datasets/b8baa542-a501-43bf-b2cc-8879111223fd
[------------------------------------------------->] 1/1

## 4. Label

Now, we will use an annotation queue to score + add notes to the results. We will use this to iterate on our prompt!

For this notebook, I will be logging two types of feedback:

`note` - freeform comments on the runs

`tweet_quality` - a 0-4 score of the generated tweet based on my subjective preferences

In [63]:
q = client.create_annotation_queue(name="Tweet Generator")

In [65]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(project_name=res["project_name"], execution_order=1)
    ],
)

Now, go through the runs to label them. Return to this notebook when you are finished.

![Queue](./img/queue.png)

## 5. Update

With the human feedback in place, let's update the prompt and try again.

In [66]:
from collections import defaultdict


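def format_feedback(single_feedback, max_score=4):
    if single_feedback.score is None:
        score = ""
    else:
        score = f"\nScore:[{single_feedback.score}/{max_score}]"
    # Strip the comment before adding the newline so it is not glued to the
    # score line, and skip it entirely when no comment was left.
    comment = (single_feedback.comment or "").strip()
    comment = f"\n{comment}" if comment else ""
    return f"""<feedback key={single_feedback.key}>{score}{comment}
</feedback>"""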


def format_run_with_feedback(run, feedback):
    all_feedback = "\n".join([format_feedback(f) for f in feedback])
    return f"""<example>
<tweet>
{run.outputs["output"]}
</tweet>
<annotations>
{all_feedback}
</annotations>
</example>"""


def get_formatted_feedback(project_name: str):
    traces = list(client.list_runs(project_name=project_name, execution_order=1))
    feedbacks = defaultdict(list)
    for f in client.list_feedback(run_ids=[r.id for r in traces]):
        feedbacks[f.run_id].append(f)
    return [
        format_run_with_feedback(r, feedbacks[r.id])
        for r in traces
        if r.id in feedbacks
    ]

In [67]:
formatted_feedback = get_formatted_feedback(res["project_name"])
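
To preview what the optimizer will receive without calling the API, the formatting can be exercised on stand-in objects. This sketch uses `SimpleNamespace` in place of LangSmith's run and feedback objects together with a condensed copy of the formatting logic. The attribute names mirror the ones used above; the sample tweet and annotations are made up.

```python
from types import SimpleNamespace


def format_feedback(fb, max_score=4):
    # The score line and the comment line are each optional.
    parts = [f"<feedback key={fb.key}>"]
    if fb.score is not None:
        parts.append(f"Score:[{fb.score}/{max_score}]")
    if fb.comment:
        parts.append(fb.comment.strip())
    parts.append("</feedback>")
    return "\n".join(parts)


def format_run_with_feedback(run, feedback):
    all_feedback = "\n".join(format_feedback(f) for f in feedback)
    return (
        "<example>\n<tweet>\n"
        f"{run.outputs['output']}\n"
        "</tweet>\n<annotations>\n"
        f"{all_feedback}\n"
        "</annotations>\n</example>"
    )


# Made-up stand-ins for a LangSmith run and its feedback records.
run = SimpleNamespace(outputs={"output": "New paper on mobile agents. Read more: [link]"})
feedback = [
    SimpleNamespace(key="tweet_quality", score=2, comment=None),
    SimpleNamespace(key="note", score=None, comment="Too many hashtags."),
]
preview = format_run_with_feedback(run, feedback)
print(preview)
```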

LLMs are especially good at 2 things:
1. Generating grammatical text
2. Summarization

Now that we've left a mixture of scores and free-form comments, we can use an "optimizer prompt" ([wfh/optimizerprompt](https://smith.langchain.com/hub/wfh/optimizerprompt)) to incorporate the feedback into an updated prompt.


In [68]:
# See: https://smith.langchain.com/hub/wfh/optimizerprompt
optimizer_prompt = hub.pull("wfh/optimizerprompt")


def extract_new_prompt(gen: str):
    return gen.split("<improved_prompt>")[1].split("</improved_prompt>")[0].strip()


optimizer = optimizer_prompt | llm | StrOutputParser() | extract_new_prompt

In [69]:
current_prompt = recommended_prompt
new_prompt = optimizer.invoke(
    {
        "current_prompt": current_prompt,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)

In [70]:
print("Original Prompt\n\n" + current_prompt)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt)

Original Prompt


You are tasked with generating a tweet to market an academic paper or open source project. The content of the paper or project is provided below:

<content>
{paper}
</content>

Your goal is to craft a well-structured tweet that effectively communicates the essence of the paper or project. Here are the key elements to include in the tweet:

- A brief summary of the paper or project, highlighting its main focus or findings.
- The significance or potential impact of the work, explaining why it matters.
- A call to action, encouraging the audience to read the paper or explore the project further.

When crafting the tweet, avoid using gimmicks or over-relying on buzzwords. Instead, focus on clarity, relevance, and engaging the audience with the core message of the paper or project.

Write your tweet inside <tweet> tags.

********************************************************************************
New Prompt

You are tasked with generating a tweet to market an academic 

## 6. Repeat!

Now that we have an "upgraded" prompt, we can test it out again and repeat until we are satisfied with the result.

If you find the prompt isn't converging to something you want, you can manually update the prompt (you are the optimizer in this case) and/or be more explicit in your free-form note feedback.
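
One rough way to judge whether a round improved things is to compare the average `tweet_quality` score between rounds. The sketch below assumes feedback records expose `key` and `score` attributes, as they do in the formatting code above. `mean_score` is a hypothetical helper and the sample records are made up; real ones would come from `client.list_feedback`.

```python
from statistics import mean
from types import SimpleNamespace


def mean_score(feedback, key="tweet_quality"):
    """Average the numeric scores logged under `key`; None if there are none."""
    scores = [f.score for f in feedback if f.key == key and f.score is not None]
    return mean(scores) if scores else None


# Made-up feedback for two rounds of annotation.
round_1 = [
    SimpleNamespace(key="tweet_quality", score=1),
    SimpleNamespace(key="note", score=None),
]
round_2 = [
    SimpleNamespace(key="tweet_quality", score=3),
    SimpleNamespace(key="tweet_quality", score=4),
]
print(mean_score(round_1), mean_score(round_2))
```

With a dataset this small the average is noisy, so treat it as a sanity check alongside the free-form notes rather than as a stopping rule.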

In [71]:
tweet_generator = create_tweet_generator(new_prompt)

updated_results = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)


View the evaluation results for project 'earnest-wish-26' at:
https://smith.langchain.com/o/9f68f02e-02f8-464e-810c-347a3f2f34b3/datasets/b8baa542-a501-43bf-b2cc-8879111223fd/compare?selectedSessions=c3ba6aaa-e1ad-4650-b5b7-19c920ea5d6c

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/9f68f02e-02f8-464e-810c-347a3f2f34b3/datasets/b8baa542-a501-43bf-b2cc-8879111223fd


Error Type: KeyError, Message: 'Input to PromptTemplate is missing variables {"paper\'s main focus", \'field/area\', \'key finding\', \'potential impact\'}.  Expected: [\'field/area\', \'key finding\', \'paper\', "paper\'s main focus", \'potential impact\'] Received: [\'paper\']\nNote: if you intended {paper\'s main focus} to be part of the string and not a variable, please escape it with double curly braces like: \'{{paper\'s main focus}}\'.\nFor troubleshooting, visit: https://python.langchain.com/docs/troubleshooting/errors/INVALID_PROMPT_INPUT '


[>                                                 ] 0/1[------------------------------------------------->] 1/1
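
The `KeyError` above arises because the optimized prompt contains placeholders such as `{key finding}`, which `PromptTemplate` treats as input variables. One possible workaround, sketched here under the assumption that `{paper}` is the only real variable and that the prompt contains no already-escaped braces, is to double every other brace before building the template. `escape_unknown_braces` is a hypothetical helper, and `str.format` stands in for the template rendering.

```python
import re


def escape_unknown_braces(prompt_str: str, allowed=("paper",)) -> str:
    """Double all braces except those wrapping an allowed variable name."""
    placeholder = re.compile(r"\{(" + "|".join(map(re.escape, allowed)) + r")\}")
    pieces, last = [], 0
    for m in placeholder.finditer(prompt_str):
        # Escape literal braces in the text between allowed placeholders.
        literal = prompt_str[last:m.start()]
        pieces.append(literal.replace("{", "{{").replace("}", "}}"))
        pieces.append(m.group(0))  # keep the real variable untouched
        last = m.end()
    pieces.append(prompt_str[last:].replace("{", "{{").replace("}", "}}"))
    return "".join(pieces)


raw = "Summarize {paper}. Mention the {key finding} and {potential impact}."
safe = escape_unknown_braces(raw)
print(safe)
print(safe.format(paper="<paper text>"))
```

Applied to this notebook, that would mean calling `create_tweet_generator(escape_unknown_braces(new_prompt))`. Stating in your `note` feedback that the prompt must not introduce new variables is another option.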

In [84]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(project_name=updated_results["project_name"], execution_order=1)
    ],
)


LangSmithError: Failed to POST /annotation-queues/b1e83d55-4f10-4821-af13-c46da6847292/runs in LangSmith API. HTTPError('400 Client Error: Bad Request for url: https://api.smith.langchain.com/annotation-queues/b1e83d55-4f10-4821-af13-c46da6847292/runs', '{"detail":"At least one of \'session\', \'id\', \'parent_run\', \'trace\' or \'reference_example\' must be specified"}')

Then review/provide feedback/repeat.

Once you've provided feedback, you can continue here:

In [None]:
formatted_feedback = get_formatted_feedback(updated_results["project_name"])

In [None]:
# Swap them out
current_prompt = new_prompt
new_prompt = optimizer.invoke(
    {
        "current_prompt": current_prompt,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)

In [None]:
print("Previous Prompt\n\n" + current_prompt)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt)

## Conclusion

Congrats! You've "optimized" a prompt on a subjective task using human feedback and an automatic prompt engineer flow. LangSmith makes it easy to score and improve LLM systems even when it is hard to define a quantitative metric.

You can push the optimized version of your prompt to the hub (here and in future iterations) to version each change.

#### Extensions:

We haven't optimized the meta-prompts above - feel free to make them your own by forking and updating them!
Some easy extensions you could try out include:
1. Including the full history of previous prompts and annotations (or the most recent N prompts with feedback) in the "optimizer prompt" step. This may help it converge better, especially if you're using a small dataset.
2. Updating the optimizer prompt to encourage usage of few-shot examples, or to encourage other prompting tricks.
3. Incorporating an LLM judge by giving it the annotations as few-shot examples and instructing it to critique the generated outputs. This could help speed up the human annotation process.
4. Generating and including a validation set, to avoid over-fitting to this training dataset.
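
As a starting point for the first extension, here is a minimal sketch that keeps a history of prompts with their annotations and renders the most recent N rounds as one string. The `PromptRound` class, the `<round>` tags, and the idea of passing the result to the optimizer as an extra variable are all assumptions; the `wfh/optimizerprompt` template would need to be forked to accept such an input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptRound:
    prompt: str
    annotated_predictions: List[str] = field(default_factory=list)


def render_history(history: List[PromptRound], last_n: int = 3) -> str:
    """Render the most recent `last_n` rounds, oldest first."""
    recent = history[-last_n:] if last_n > 0 else []
    # Number rounds by their position in the full history, starting at 1.
    start = len(history) - len(recent) + 1
    blocks = []
    for i, rnd in enumerate(recent, start=start):
        annotations = "\n\n".join(rnd.annotated_predictions)
        blocks.append(
            f"<round number={i}>\n<prompt>\n{rnd.prompt}\n</prompt>\n"
            f"<annotations>\n{annotations}\n</annotations>\n</round>"
        )
    return "\n\n".join(blocks)


history = [
    PromptRound("v1 prompt", ["<example>tweet A</example>"]),
    PromptRound("v2 prompt", ["<example>tweet B</example>"]),
    PromptRound("v3 prompt", ["<example>tweet C</example>"]),
]
rendered = render_history(history, last_n=2)
print(rendered)
```

Capping the history at the last few rounds keeps the optimizer's context short while still letting it see which earlier changes helped or hurt.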