
As AI language models become more sophisticated, the quality of prompts used to interact with them becomes increasingly important. Optimized prompts can lead to more accurate, relevant, and useful responses, enhancing the overall performance of AI applications. This tutorial aims to equip learners with practical techniques to systematically improve their prompts.

## Key Components

1. **A/B Testing Prompts**: A method to compare the effectiveness of different prompt variations.
2. **Iterative Refinement**: A strategy for gradually improving prompts based on feedback and results.
3. **Performance Metrics**: Ways to measure and compare the quality of responses from different prompts.
4. **Practical Implementation**: Hands-on examples using Google's Gemini model and LangChain.

## Method Details

1. **Setup**: We'll start by setting up our environment with the necessary libraries and API keys.

2. **A/B Testing**:
   - Define multiple versions of a prompt
   - Generate responses for each version
   - Compare results using predefined metrics

3. **Iterative Refinement**:
   - Start with an initial prompt
   - Generate responses and evaluate
   - Identify areas for improvement
   - Refine the prompt based on insights
   - Repeat the process to continuously enhance the prompt

4. **Performance Evaluation**:
   - Define relevant metrics (e.g., relevance, specificity, coherence)
   - Implement scoring functions
   - Compare scores across different prompt versions
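
The A/B testing and evaluation steps outlined above can be sketched without any API access. This is a minimal illustration only: `fake_llm`, `keyword_score`, and `ab_test` are hypothetical stand-ins (not part of LangChain or of the cells that follow) for a real model call, a real scoring function, and the comparison loop.

```python
from statistics import mean

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: returns a canned reply."""
    if "example" in prompt:
        return "Machine learning finds patterns in data. Example: spam filters."
    return "Machine learning finds patterns in data."

def keyword_score(response: str, keywords: list[str]) -> float:
    """Toy metric: fraction of expected keywords present in the response."""
    text = response.lower()
    return mean(1.0 if kw in text else 0.0 for kw in keywords)

def ab_test(templates: dict[str, str], topic: str, keywords: list[str]) -> dict[str, float]:
    """Generate one response per prompt variant and score each one."""
    return {
        name: keyword_score(fake_llm(template.format(topic=topic)), keywords)
        for name, template in templates.items()
    }

templates = {
    "A": "Explain {topic} in simple terms.",
    "B": "Explain {topic} with key concepts and an example.",
}
scores = ab_test(templates, "machine learning", ["patterns", "example"])
best = max(scores, key=scores.get)
print(scores, best)
```

Keeping the model call and the metric as separate functions means either one can be swapped for a real implementation without touching the comparison loop.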


In [1]:
! pip install langchain langchain-google-genai


In [2]:
import re
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
import numpy as np

os.environ['GOOGLE_API_KEY'] = ''  # Paste your Google API key here

llm = ChatGoogleGenerativeAI(model='gemini-1.5-flash')

def generate_response(prompt):
  """Generate a response using the language model.

  Args:
      prompt (str): The input prompt.

  Returns:
      str: The generated response.
  """
  return llm.invoke(prompt).content

# A/B Testing Prompt
We ask the LLM to generate answers from two prompt variations, then ask it to rate each answer on a scale of 1 to 10 against several criteria. The prompt with the higher average score is treated as the better prompt for producing relevant answers.

In [3]:
# Define prompt variations
prompt_a = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in simple terms."
)

prompt_b = PromptTemplate(
    input_variables=["topic"],
    template="Provide a beginner-friendly explanation of {topic}, including key concepts and an example."
)

# Updated function to evaluate response quality
def evaluate_response(response, criteria):
    """Evaluate the quality of a response based on given criteria.

    Args:
        response (str): The generated response.
        criteria (list): List of criteria to evaluate.

    Returns:
        float: The average score across all criteria.
    """
    scores = []
    for criterion in criteria:
        print(f"Evaluating response based on {criterion}...")
        prompt = f"On a scale of 1-10, rate the following response on {criterion}. Start your response with the numeric score:\n\n{response}"
        # Store the rating in its own variable so `response` is not overwritten;
        # otherwise later criteria would rate the previous rating text instead.
        rating = generate_response(prompt)
        # Use regex to find the first number in the rating
        score_match = re.search(r'\d+', rating)
        if score_match:
            score = int(score_match.group())
            scores.append(min(score, 10))  # Ensure score is not greater than 10
        else:
            print(f"Warning: Could not extract numeric score for {criterion}. Using default score of 5.")
            scores.append(5)  # Default score if no number is found
    return np.mean(scores)

# Perform A/B test
topic = "machine learning"
response_a = generate_response(prompt_a.format(topic=topic))
response_b = generate_response(prompt_b.format(topic=topic))

criteria = ["clarity", "informativeness", "engagement"]
score_a = evaluate_response(response_a, criteria)
score_b = evaluate_response(response_b, criteria)

print(f"Prompt A score: {score_a:.2f}")
print(f"Prompt B score: {score_b:.2f}")
print(f"Winning prompt: {'A' if score_a > score_b else 'B'}")

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Prompt A score: 9.00
Prompt B score: 9.00
Winning prompt: B
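
Both prompts score 9.00 here, yet B is reported as the winner because the check `score_a > score_b` sends a tie to the `else` branch. One way to make ties explicit and to smooth out run-to-run variation is to average several trials per prompt. The sketch below is an assumption-laden illustration: `compare_prompts`, `score_fn`, `trials`, and `margin` are hypothetical names, and `score_fn` is assumed to be any callable that returns a numeric score for a prompt (for example, a wrapper around `generate_response` and `evaluate_response`).

```python
from statistics import mean
from typing import Callable

def compare_prompts(
    score_fn: Callable[[str], float],
    prompt_a: str,
    prompt_b: str,
    trials: int = 3,
    margin: float = 0.25,
) -> tuple[str, float, float]:
    """Average each prompt's score over several trials and name a winner.

    Returns ("A" | "B" | "tie", mean_a, mean_b). A difference smaller
    than `margin` is reported as a tie instead of an arbitrary winner.
    """
    mean_a = mean(score_fn(prompt_a) for _ in range(trials))
    mean_b = mean(score_fn(prompt_b) for _ in range(trials))
    if abs(mean_a - mean_b) < margin:
        return "tie", mean_a, mean_b
    return ("A" if mean_a > mean_b else "B"), mean_a, mean_b

# Deterministic stand-in scorer, for demonstration only.
def constant_scorer(prompt: str) -> float:
    return 9.0

print(compare_prompts(constant_scorer, "prompt a text", "prompt b text"))
```

The `margin` value is arbitrary; a suitable threshold depends on how noisy the scorer is.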


# Iterative Refinement
* Ask the LLM to generate answers from two different prompts.
* Ask it to rate each answer on a scale of 1 to 10.
* Pick the prompt whose output scores highest as the best prompt.
* Ask the LLM for feedback on, and improvements to, the best prompt's output.
* Ask the LLM to refine the prompt according to that feedback and generate a new output.

In [4]:
def refine_prompt(initial_prompt, topic, iterations=3):
    """Refine a prompt through multiple iterations.

    Args:
        initial_prompt (PromptTemplate): The starting prompt template.
        topic (str): The topic to explain.
        iterations (int): Number of refinement iterations.

    Returns:
        PromptTemplate: The final refined prompt template.
    """
    current_prompt = initial_prompt
    for i in range(iterations):
        try:
            response = generate_response(current_prompt.format(topic=topic))
        except KeyError as e:
            print(f"Error in iteration {i+1}: Missing key {e}. Adjusting prompt...")
            # Remove the problematic placeholder
            current_prompt.template = current_prompt.template.replace(f"{{{e.args[0]}}}", "relevant example")
            response = generate_response(current_prompt.format(topic=topic))

        # Generate feedback and suggestions for improvement
        feedback_prompt = f"Analyze the following explanation of {topic} and suggest improvements to the prompt that generated it:\n\n{response}"
        feedback = generate_response(feedback_prompt)

        # Use the feedback to refine the prompt
        # (named `refine_request` to avoid shadowing the enclosing `refine_prompt` function)
        refine_request = f"Based on this feedback: '{feedback}', improve the following prompt template. Ensure to only use the variable {{topic}} in your template:\n\n{current_prompt.template}"
        refined_template = generate_response(refine_request)

        current_prompt = PromptTemplate(
            input_variables=["topic"],
            template=refined_template
        )

        print(f"Iteration {i+1} prompt: {current_prompt.template}")

    return current_prompt

# Perform A/B test
topic = "machine learning"
response_a = generate_response(prompt_a.format(topic=topic))
response_b = generate_response(prompt_b.format(topic=topic))

criteria = ["clarity", "informativeness", "engagement"]
score_a = evaluate_response(response_a, criteria)
score_b = evaluate_response(response_b, criteria)

print(f"Prompt A score: {score_a:.2f}")
print(f"Prompt B score: {score_b:.2f}")
print(f"Winning prompt: {'A' if score_a > score_b else 'B'}")

# Start with the winning prompt from A/B testing (a tie goes to B, matching the printed winner)
initial_prompt = prompt_a if score_a > score_b else prompt_b
refined_prompt = refine_prompt(initial_prompt, "machine learning")

print("\nFinal refined prompt:")
print(refined_prompt.template)

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Prompt A score: 9.00
Prompt B score: 9.00
Winning prompt: B
Iteration 1 prompt: Here are a few improved prompt templates based on the feedback, using only the `{topic}` variable:

**Option 1 (More detailed):**

Explain {topic}, including its core principles, different types, the role of training data, the importance of generalization, and potential challenges.  Use a simple analogy to illustrate the process.

**Option 2 (Focus on a specific aspect – suitable if {topic} is a specific type of machine learning):**

Explain {topic}, detailing its process, the role of training data, model training, model evaluation, and the concept of generalization.  Discuss potential challenges such as overfitting and bias.

**Option 3 (Comparative
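
The output above shows the model replying with several options plus commentary, and that whole reply becomes the next prompt template. A minimal sketch of a guard that checks which placeholders a candidate template uses before accepting it, relying only on `string.Formatter` from the standard library. The helper names `template_fields` and `is_valid_template` are hypothetical, and the check does not detect surrounding commentary, only unexpected or missing placeholders.

```python
from string import Formatter

def template_fields(template: str) -> set[str]:
    """Return the set of placeholder names used in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field is not None}

def is_valid_template(template: str, allowed: frozenset[str] = frozenset({"topic"})) -> bool:
    """Accept a template only if it parses, uses `topic`, and uses nothing else."""
    try:
        fields = template_fields(template)
    except ValueError:  # unbalanced braces such as a lone "{"
        return False
    return "topic" in fields and fields <= allowed

print(is_valid_template("Explain {topic} with one analogy."))
print(is_valid_template("Explain {topic} using {example}."))
```

A refinement loop could fall back to the previous template whenever this check fails, rather than catching `KeyError` after the fact.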

# Comparing Original and Refined Prompts

In [5]:
original_response = generate_response(initial_prompt.format(topic="machine learning"))
refined_response = generate_response(refined_prompt.format(topic="machine learning"))

original_score = evaluate_response(original_response, criteria)
refined_score = evaluate_response(refined_response, criteria)

print(f"Original prompt score: {original_score:.2f}")
print(f"Refined prompt score: {refined_score:.2f}")
print(f"Improvement: {(refined_score - original_score):.2f} points")

Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Evaluating response based on clarity...
Evaluating response based on informativeness...
Evaluating response based on engagement...
Original prompt score: 9.00
Refined prompt score: 9.00
Improvement: 0.00 points
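
Score extraction influences every number reported above. In `evaluate_response`, `re.search(r'\d+', ...)` accepts the first run of digits anywhere in the reply, so a reply beginning "85% of readers..." would be clamped to 10, and a digit appearing later in a reply that does not start with a score would still be accepted. A minimal sketch of a stricter parser follows; `parse_score` is a hypothetical helper that assumes the score leads the reply (as the rating prompt requests) and returns `None` rather than a default when no valid score is found.

```python
import re
from typing import Optional

# A number at the very start of the reply, allowing leading punctuation such as "**".
_SCORE_AT_START = re.compile(r"^\W*(\d+(?:\.\d+)?)")

def parse_score(reply: str, low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Parse a leading numeric score, returning None if absent or out of range."""
    match = _SCORE_AT_START.match(reply)
    if match is None:
        return None
    score = float(match.group(1))
    return score if low <= score <= high else None

print(parse_score("8/10. The response is clear."))
print(parse_score("**7.5** Well organized."))
print(parse_score("The response deserves a 9."))
print(parse_score("85% of readers would follow this."))
```

Returning `None` lets the caller decide whether to retry the rating request or skip the criterion, instead of silently averaging in a default score.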
