# Prompt Bootstrapping with LangSmith + Claude

Prompt engineering can be frustrating, especially for tasks where metrics are hard to define. Crafting a prompt is often an iterative process of refining it over multiple examples.

Turns out LLMs can do a [decent job at prompt engineering](https://arxiv.org/abs/2211.01910), especially when incorporating human feedback on representative data. 

In this notebook, we will walk through "prompt bootstrapping", where you will iteratively refine a prompt by providing unstructured feedback over a dataset. Below is an overview of the process.

![Prompt Bootstrapping Diagram](./img/prompt-bootstrapping.png)

LangSmith makes this whole flow very easy. Let's give it a whirl!

This example is based on [@alexalbert's example Claude workflow](https://x.com/alexalbert__/status/1767258557039378511?s=20).

In [None]:
%pip install -U langsmith langchain_anthropic langchain arxiv

In [27]:
import os

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key
# We are using Anthropic here as well
os.environ["ANTHROPIC_API_KEY"] = "YOUR API KEY"

In [28]:
from langsmith import Client

client = Client()
prompt_name = "YOUR HUB REPO HERE"  # Example: wfh/tweet-generator

## 1. Pick a task

Let's say I want to write a tweet generator for academic papers, one that is catchy but neither impersonal nor laden with buzzwords. Let's see if we can "optimize" a prompt without having to engineer it ourselves.

We will use the meta-prompt ([wfh/metaprompt](https://smith.langchain.com/hub/wfh/metaprompt)) from the Hub to generate our first prompt candidate to solve this task.

In [29]:
from langchain import hub
from langchain_anthropic import ChatAnthropic
from langchain_core.output_parsers import StrOutputParser

task = (
    "Generate a tweet to market an academic paper or open source project. It should be"
    " well crafted but avoid gimmicks or over-reliance on buzzwords."
)


# See: https://smith.langchain.com/hub/wfh/metaprompt
prompt = hub.pull("wfh/metaprompt")
llm = ChatAnthropic(model="claude-3-opus-20240229")


def get_instructions(gen: str):
    return gen.split("<Instructions>")[1].split("</Instructions>")[0]


meta_prompter = prompt | llm | StrOutputParser() | get_instructions

In [4]:
from langchain_core.prompts import ChatPromptTemplate

recommended_prompt_str = meta_prompter.invoke(
    {
        # This is the high level purpose of the system
        "task": task,
        # These are the values your system will accept
        "input_variables": """
{paper}
""",
    }
)

# We'll commit each version of our prompt to the Hub
# so you can track or revisit each iteration.
recommended_prompt = ChatPromptTemplate.from_messages(
    [("user", recommended_prompt_str)]
)
hub.push(prompt_name, recommended_prompt)

'https://smith.langchain.com/hub/wfh/academic-tweet-generator/034c6661'

OK so it's a fine-not-great prompt. Let's see how it does!

## 2. Dataset

For some tasks you can write the dataset examples yourself. For this notebook, we create a 5-datapoint dataset from papers fetched from arXiv.
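If you do write the inputs by hand, each example only needs the keys your prompt expects (here, `paper`). The following is a minimal sketch, assuming your texts are already loaded as strings; `build_inputs` and the sample strings are hypothetical and not part of the original notebook.

```python
def build_inputs(texts, max_chars=200_000):
    """Turn raw texts into dataset inputs keyed by the prompt variable `paper`.

    Hypothetical helper: trims whitespace, caps the length, and skips
    empty or duplicate entries.
    """
    seen = set()
    inputs = []
    for text in texts:
        cleaned = text.strip()[:max_chars]
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            inputs.append({"paper": cleaned})
    return inputs


# Made-up sample texts, for illustration only.
sample_inputs = build_inputs(["  First abstract.  ", "First abstract.", "", "Second abstract."])
print(sample_inputs)
```

The resulting list has the same shape as the `inputs` argument passed to `client.create_examples` in this section.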

In [30]:
from itertools import islice

from langchain_community.utilities.arxiv import ArxivAPIWrapper

wrapper = ArxivAPIWrapper(doc_content_chars_max=200_000)
docs = list(islice(wrapper.lazy_load("Self-Replicating Language model Agents"), 5))

In [31]:
print(docs[0].page_content[:300])

 
Languages for Mobile Agents 
Steven Versteeg 
 
Supervisor: Leon Sterling 
 
433­463 Thesis 
Department of Computer Science and Software Engineering 
University of Melbourne 
 25 August, 1997 
 ​Abstract 
Mobile agents represent a new model for network computing.  Many different languages 
have be


In [32]:
ds_name = "Tweet Generator"
ds = client.create_dataset(dataset_name=ds_name)
client.create_examples(
    inputs=[{"paper": doc.page_content} for doc in docs], dataset_id=ds.id
)

## 3. Predict

We will refrain from defining metrics for now (it's quite subjective). Instead we will run the first version of the generator against the dataset and manually review + provide feedback on the results.

In [34]:
def parse_tweet(response: str):
    try:
        return response.split("<tweet>")[1].split("</tweet>")[0].strip()
    except IndexError:  # no <tweet> tag found; fall back to the raw text
        return response.strip()


def create_tweet_generator(prompt):
    return prompt | llm | StrOutputParser() | parse_tweet


tweet_generator = create_tweet_generator(recommended_prompt)

# Example
prediction = tweet_generator.invoke({"paper": docs[0].page_content})
print(prediction)

What makes a programming language suitable for writing mobile agents? Key factors:
- Migration support
- Agent communication 
- Interfaces to server resources
- Security
- Efficiency & portability
Java, Telescript, Agent Tcl & others compared in this '97 paper
http://www.cs.mu.oz.au/~scv/433-463/thesis_only.pdf
#MobileAgents #ProgrammingLanguages
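One thing to watch for: `parse_tweet` returns the contents of the first `<tweet>` block. If the generated prompt asks for a draft and then a refined tweet in separate `<tweet>` tags (as the original prompt printed later in this notebook does), the first block may be the draft. A possible alternative, sketched here under that assumption, is to take the last non-empty block; `parse_final_tweet` is a hypothetical name and not part of the notebook.

```python
import re

# Non-greedy match so each <tweet>...</tweet> pair is captured separately.
TWEET_RE = re.compile(r"<tweet>(.*?)</tweet>", re.DOTALL)


def parse_final_tweet(response: str) -> str:
    """Return the last non-empty <tweet> block, or the stripped response if none exists."""
    blocks = [block.strip() for block in TWEET_RE.findall(response)]
    non_empty = [block for block in blocks if block]
    return non_empty[-1] if non_empty else response.strip()


# Made-up response with a draft and a final tweet.
sample_response = "Draft:\n<tweet>draft text</tweet>\nFinal:\n<tweet>\nfinal text\n</tweet>"
print(parse_final_tweet(sample_response))
```

It could be swapped in for `parse_tweet` inside `create_tweet_generator` if you find the drafts are being reviewed instead of the final tweets.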


In [35]:
res = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

View the evaluation results for project 'sunny-stem-33' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0433274f-e08b-43a1-9fea-018375080012/compare?selectedSessions=dfaa7ab6-b211-4e56-a6d2-caed9a10f68a

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0433274f-e08b-43a1-9fea-018375080012
[------------------------------------------------->] 5/5

## 4. Label

Now, we will use an annotation queue to score + add notes to the results. We will use this to iterate on our prompt!

For this notebook, I will be logging two types of feedback:

`note` - freeform comments on the runs

`tweet_quality` - a 0-4 score of the generated tweet based on my subjective preferences

In [36]:
q = client.create_annotation_queue(name="Tweet Generator")

In [37]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(project_name=res["project_name"], execution_order=1)
    ],
)

Now, go through the runs to label them. Return to this notebook when you are finished.

![Queue](./img/queue.png)

## 4. Update

With the human feedback in place, let's update the prompt and try again.

In [39]:
from collections import defaultdict


def format_feedback(single_feedback, max_score=4):
    if single_feedback.score is None:
        score = ""
    else:
        score = f"\nScore:[{single_feedback.score}/{max_score}]"
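    # Keep the leading newline so the comment starts on its own line,
    # and skip the comment entirely when none was left.
    comment = f"\n{single_feedback.comment.strip()}" if single_feedback.comment else ""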
    return f"""<feedback key={single_feedback.key}>{score}{comment}
</feedback>"""


def format_run_with_feedback(run, feedback):
    all_feedback = "\n".join([format_feedback(f) for f in feedback])
    return f"""<example>
<tweet>
{run.outputs["output"]}
</tweet>
<annotations>
{all_feedback}
</annotations>
</example>"""


def get_formatted_feedback(project_name: str):
    traces = list(client.list_runs(project_name=project_name, execution_order=1))
    feedbacks = defaultdict(list)
    for f in client.list_feedback(run_ids=[r.id for r in traces]):
        feedbacks[f.run_id].append(f)
    return [
        format_run_with_feedback(r, feedbacks[r.id])
        for r in traces
        if r.id in feedbacks
    ]

In [40]:
formatted_feedback = get_formatted_feedback(res["project_name"])
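To get a feel for what the optimizer will receive, it can help to render one annotated example offline. The sketch below is a stand-alone approximation of the formatting helpers, not the notebook's exact code: it uses stand-in objects with the same `key`, `score`, and `comment` attributes, and the tweet text and annotations are made up.

```python
from types import SimpleNamespace


def render_feedback(fb, max_score=4):
    """Render one feedback item as a <feedback> block, one field per line."""
    lines = [f"<feedback key={fb.key}>"]
    if fb.score is not None:
        lines.append(f"Score:[{fb.score}/{max_score}]")
    if fb.comment:
        lines.append(fb.comment.strip())
    lines.append("</feedback>")
    return "\n".join(lines)


def render_example(tweet, feedback):
    """Wrap a generated tweet and its annotations in an <example> block."""
    annotations = "\n".join(render_feedback(fb) for fb in feedback)
    return "\n".join(
        ["<example>", "<tweet>", tweet, "</tweet>", "<annotations>", annotations, "</annotations>", "</example>"]
    )


# Stand-in feedback objects with made-up values.
sample_feedback = [
    SimpleNamespace(key="tweet_quality", score=2, comment=None),
    SimpleNamespace(key="note", score=None, comment="Too many hashtags."),
]
example_block = render_example("A made-up tweet about mobile agents.", sample_feedback)
print(example_block)
```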

LLMs are especially good at 2 things:
1. Generating grammatical text
2. Summarization

Now that we've left a mixture of scores and free-form comments, we can use an "optimizer prompt" ([wfh/optimizerprompt](https://smith.langchain.com/hub/wfh/optimizerprompt)) to incorporate the feedback into an updated prompt.


In [53]:
# See: https://smith.langchain.com/hub/wfh/optimizerprompt
optimizer_prompt = hub.pull("wfh/optimizerprompt")


def extract_new_prompt(gen: str):
    return gen.split("<improved_prompt>")[1].split("</improved_prompt>")[0].strip()


optimizer = optimizer_prompt | llm | StrOutputParser() | extract_new_prompt
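Because the optimizer's output becomes the next prompt template, it may be worth checking that the rewritten prompt still contains the `{paper}` placeholder before using or pushing it. The following is a minimal sketch using only the standard library; `template_variables` and `missing_variables` are hypothetical helpers, and the check assumes the prompt uses f-string-style `{name}` placeholders.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the names of the {placeholders} found in an f-string-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def missing_variables(template: str, required: set) -> set:
    """Return the required variable names that the template no longer contains."""
    return set(required) - template_variables(template)


# Made-up candidate prompts, for illustration only.
good_prompt = "Read the paper below and write a tweet.\n\n{paper}"
bad_prompt = "Read the paper and write a tweet."

print(missing_variables(good_prompt, {"paper"}))
print(missing_variables(bad_prompt, {"paper"}))
```

If the second call reports a missing variable for your new prompt, you could rerun the optimizer or restore the placeholder by hand before continuing.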

In [45]:
current_prompt_str = recommended_prompt_str
new_prompt_str = optimizer.invoke(
    {
        "current_prompt": current_prompt_str,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)
# Check in a new version of the prompt to the Hub
new_prompt = ChatPromptTemplate.from_messages([("user", new_prompt_str)])
hub.push(prompt_name, new_prompt)

'https://smith.langchain.com/hub/wfh/academic-tweet-generator/4c8cbfd0'

In [19]:
print("Original Prompt\n\n" + current_prompt_str)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt_str)

Original Prompt



{paper}

Please carefully read the above paper (or excerpt). Identify the key contributions, insights, or results that would be most interesting to a general technical audience on Twitter.

Draft an engaging, concise tweet summarizing the key interesting points of the paper for a general audience. Do not sensationalize or over-hype the claims - be accurate. But do try to pique the reader's curiosity to learn more. Please include a link to the full paper at the end of the tweet.

Write your draft tweet here:

<tweet>

</tweet>

Now please review your draft tweet. Edit it to:
- Remove/replace technical jargon where possible 
- Add 1-2 relevant hashtags
- @mention any relevant Twitter accounts (e.g. authors, institutions, etc) if you can identify them
- Check that the key claims are accurately conveyed
- Trim the tweet to fit within Twitter's character limit if needed

Here is the final, refined tweet:

<tweet>

</tweet>


***********************************************

In [47]:
# Example with the new prompt
tweet_generator = create_tweet_generator(new_prompt)
tweet_generator.invoke({"paper": docs[0].page_content})

'Languages for mobile agents: What features do they need? 🤖🗺️\n- Built-in support for agent migration, communication, security \n- Cross-platform execution on heterogeneous networks\n- Balance of performance & ease of programming\n\nEnabling software agents to roam the internet autonomously, interacting with servers along the way!\n\nhttp://www.cs.mu.oz.au/~sversteeg/research/thesis.html'

## 5. Repeat!

Now that we have an "upgraded" prompt, we can test it out again and repeat until we are satisfied with the result.

If you find the prompt isn't converging to something you want, you can manually update the prompt (you are the optimizer in this case) and/or be more explicit in your free-form note feedback.
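One rough way to judge whether the prompt is converging is to compare the average `tweet_quality` score between rounds. The sketch below assumes feedback objects expose `key` and `score` attributes, as the objects used in `format_feedback` do; `mean_score` is a hypothetical helper and the scores are made up.

```python
from statistics import mean
from types import SimpleNamespace


def mean_score(feedback, key="tweet_quality"):
    """Average the numeric scores logged under `key`; return None if there are none."""
    scores = [fb.score for fb in feedback if fb.key == key and fb.score is not None]
    return mean(scores) if scores else None


# Made-up feedback for two review rounds.
round_1 = [
    SimpleNamespace(key="tweet_quality", score=1),
    SimpleNamespace(key="tweet_quality", score=2),
    SimpleNamespace(key="note", score=None),
]
round_2 = [
    SimpleNamespace(key="tweet_quality", score=3),
    SimpleNamespace(key="tweet_quality", score=3),
]

print(mean_score(round_1), mean_score(round_2))
```

With only a handful of examples per round, treat the average as a loose signal next to the free-form notes rather than as a metric to optimize.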

In [48]:
updated_results = client.run_on_dataset(
    dataset_name=ds_name,
    llm_or_chain_factory=tweet_generator,
)

View the evaluation results for project 'ample-club-93' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0433274f-e08b-43a1-9fea-018375080012/compare?selectedSessions=65b80c98-e8e4-45f5-a11b-d7fe41fb98c8

View all tests for Dataset Tweet Generator at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0433274f-e08b-43a1-9fea-018375080012
[------------------------------------------------->] 5/5

In [49]:
client.add_runs_to_annotation_queue(
    q.id,
    run_ids=[
        r.id
        for r in client.list_runs(
            project_name=updated_results["project_name"], execution_order=1
        )
    ],
)

**Next, review again** and provide feedback. Optionally repeat.

Once you've provided feedback, you can continue here:

In [50]:
formatted_feedback = get_formatted_feedback(updated_results["project_name"])

In [54]:
# Swap them out
current_prompt_str = new_prompt_str
new_prompt_str = optimizer.invoke(
    {
        "current_prompt": current_prompt_str,
        "annotated_predictions": "\n\n".join(formatted_feedback).strip(),
    }
)

In [55]:
print("Previous Prompt\n\n" + current_prompt_str)
print("*" * 80 + "\nNew Prompt\n\n" + new_prompt_str)

Previous Prompt

Please carefully read the paper (or excerpt) below and identify the key contributions, insights, or results that would be most interesting and valuable to share with a general technical audience on Twitter:

{paper}

Your goal is to write an informative and engaging long-form tweet (500-750 characters) that accurately conveys the paper's main ideas, key details and results, and significance. 

The tweet should follow this structure:

1. Key message: Start with an attention-grabbing title or key takeaway that draws the reader in and makes them want to learn more. Keep it concise yet compelling.

2. Main body: Provide specific details about the methods, results, novelty or implications that a broad audience would find most interesting and valuable. Include data, numbers, key findings, hypotheses tested, etc. Give enough information that the reader gets substantial insight into the work without needing to read the full paper. Use clear language but don't oversimplify.

3.

## Conclusion

Congrats! You've "optimized" a prompt on a subjective task using human feedback and an automatic prompt engineer flow. LangSmith makes it easy to score and improve LLM systems even when a quantitative metric is hard to define.

You can push the optimized version of your prompt to the Hub (here and in future iterations) to version each change.

In [56]:
new_prompt = ChatPromptTemplate.from_messages([("user", new_prompt_str)])
hub.push(prompt_name, new_prompt)

'https://smith.langchain.com/hub/wfh/academic-tweet-generator/5bfbea74'

In [58]:
tweet_generator = create_tweet_generator(new_prompt)
result = tweet_generator.invoke({"paper": docs[0].page_content})
print(result)

What makes a programming language well-suited for developing mobile agents? 🤔 This 1997 thesis examines the essential characteristics:

Key requirements include support for agent migration, inter-agent communication, interfacing with host resources, security mechanisms, execution efficiency, cross-platform availability, and ease of programming. 📋

The paper compares languages like Telescript, Java, Agent Tcl and Obliq. Telescript was designed specifically for mobile agents and elegantly handles migration, communication and security. ✅ 

But Java, despite being general-purpose, provides the core capabilities and holds the advantage of being an open standard supported across many platforms. ☕

The work anticipates a future where mobile agents roam the internet to search for information, monitor data, engage in e-commerce and distribute computation. Though 25+ years old, it identifies issues still relevant as multi-agent systems become prevalent. 🔮

While many competing languages existed,

#### Extensions:

We haven't optimized the meta-prompts above - feel free to make them your own by forking and updating them!
Some easy extensions you could try out include:
1. Including the full history of previous prompts and annotations (or the most recent N prompts with feedback) in the "optimizer prompt" step. This may help it converge better (especially if you're using a small dataset).
2. Updating the optimizer prompt to encourage usage of few-shot examples, or to encourage other prompting tricks.
3. Incorporating an LLM judge by including the annotation few-shot examples and instructing it to critique the generated outputs. This could help speed up the human annotation process.
4. Generating and including a validation set (to avoid over-fitting this training dataset).
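As a starting point for the first extension, the sketch below keeps the most recent N rounds of prompts and annotations and renders them into a single string. Everything here is an assumption: `PromptHistory`, the `<round>` and `<prompt>` tag names, and the idea of passing the result to the optimizer are illustrative, and the optimizer prompt would need to be forked to accept such a variable.

```python
from collections import deque


class PromptHistory:
    """Keep the most recent `max_rounds` (prompt, annotations) pairs."""

    def __init__(self, max_rounds=3):
        # A bounded deque drops the oldest round automatically.
        self._rounds = deque(maxlen=max_rounds)

    def add(self, prompt, annotated_predictions):
        self._rounds.append((prompt, annotated_predictions))

    def render(self):
        """Render the retained rounds, oldest first, as tagged text blocks."""
        blocks = []
        for index, (prompt, annotations) in enumerate(self._rounds, start=1):
            blocks.append(
                f"<round index={index}>\n<prompt>\n{prompt}\n</prompt>\n{annotations}\n</round>"
            )
        return "\n\n".join(blocks)


# Made-up prompts and annotations for three rounds; only the last two are kept.
history = PromptHistory(max_rounds=2)
history.add("prompt v1", "<example>notes on v1</example>")
history.add("prompt v2", "<example>notes on v2</example>")
history.add("prompt v3", "<example>notes on v3</example>")
rendered_history = history.render()
print(rendered_history)
```

In the notebook's loop, `add` would be called with `current_prompt_str` and the joined `formatted_feedback` after each review round.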