# Evaluating Prompt Effectiveness
Prompt wording has a direct effect on the quality of a model's output, so it is worth having systematic methods for assessing how well a prompt works. With such methods, developers and researchers can compare prompt variants, refine them, and get more reliable outputs.

## Key Components
1. Metrics for measuring prompt performance
2. Manual evaluation techniques
3. Automated evaluation techniques
4. Practical examples using Google Gemini and LangChain

In [1]:
! pip install langchain langchain-google-genai sentence-transformers scikit-learn numpy


In [2]:
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate

# Paste your Google API key between the quotes
os.environ['GOOGLE_API_KEY']=''

# Instantiate the LLM
llm=ChatGoogleGenerativeAI(model='gemini-1.5-flash')


# Initialize sentence transformer for semantic similarity
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    embeddings = sentence_model.encode([text1, text2])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

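To make the similarity measure concrete, here is a minimal sketch of the cosine computation that `semantic_similarity` relies on, written with the standard library only. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the helper name `cosine` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (illustrative values, not real model output)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # points in the same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

same_direction = cosine(a, b)
orthogonal = cosine(a, c)
print(f"same direction: {same_direction:.2f}, orthogonal: {orthogonal:.2f}")
```

Vectors that point the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, which is why the measure suits comparing texts of different lengths.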

In [3]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)

def specificity_score(response):
    """Calculate specificity as the ratio of unique words to total words (lexical diversity)."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0
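The ratio computed by `specificity_score` treats `The`, `the`, and `the.` as three different words, because it splits on whitespace only. The sketch below repeats that function and compares it with a hypothetical variant, `normalized_specificity`, that lowercases words and strips punctuation first. The variant is an assumption for illustration, not part of the original metric.

```python
import string

def specificity_score(response):
    """Unique whitespace-separated tokens divided by total tokens."""
    words = response.split()
    return len(set(words)) / len(words) if words else 0

def normalized_specificity(response):
    """Hypothetical variant: lowercase and strip punctuation before counting."""
    table = str.maketrans("", "", string.punctuation)
    words = [w.translate(table).lower() for w in response.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return len(set(words)) / len(words) if words else 0

text = "The model learns. The model predicts; the model improves."
raw = specificity_score(text)
normalized = normalized_specificity(text)
print(f"raw: {raw:.2f}, normalized: {normalized:.2f}")
```

On this sentence the raw score counts 6 unique tokens out of 9, while the normalized variant counts 5 out of 9, so repeated words are no longer hidden by capitalization or trailing punctuation.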

# Manual Evaluation

In [4]:
def manual_evaluation(prompt, response, criteria):
    """Simulate manual evaluation of a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    for criterion in criteria:
        score = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {score}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")

# Example usage
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Imagine you have a puppy you're trying to teach to sit.  You show it the "sit" command, and if it sits, you give it a treat (positive reinforcement). If it doesn't, you don't. Over time, the puppy learns to associate the command with the action that gets it a treat.

Machine learning is similar.  Instead of a puppy, we have a computer program. Instead of treats, we have data.  We "train" the program by feeding it lots of data, and it learns patterns and relationships within that data.  Then, when we give it new, unseen data, it can use what it learned to make predictions or decisions.

For example:

* **Spam filter:**  Trained on lots of spam and non-spam emails, it learns to identify characteristics of spam (certain words, senders, etc.) and filter them out.
* **Image recognition:** Trained on millions of images of cats and dogs, it learns to distinguish between them and can identify a cat or dog in a new pictu
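`manual_evaluation` passes whatever the rater types straight to `float()` and does not keep the scores. One possible refinement, sketched below, validates each entry against the 0-10 range and returns the scores for later analysis. The names `collect_scores` and `mean_score` and the `ask` parameter are hypothetical; `ask` defaults to `input` and is swapped for scripted answers here so the example runs without a human rater.

```python
def collect_scores(criteria, ask=input, low=0.0, high=10.0):
    """Ask for one score per criterion, re-asking until it is a number in [low, high]."""
    scores = {}
    for criterion in criteria:
        while True:
            raw = ask(f"Score for {criterion} ({low:g}-{high:g}): ")
            try:
                value = float(raw)
            except ValueError:
                print("Please enter a number.")
                continue
            if low <= value <= high:
                scores[criterion] = value
                break
            print(f"Score must be between {low:g} and {high:g}.")
    return scores

def mean_score(scores):
    """Average of the collected scores, or 0.0 if there are none."""
    return sum(scores.values()) / len(scores) if scores else 0.0

# Scripted answers stand in for a human rater; "eleven" and "11" are rejected.
answers = iter(["8", "eleven", "11", "9", "7.5"])
scores = collect_scores(["Clarity", "Accuracy", "Simplicity"], ask=lambda _: next(answers))
average = mean_score(scores)
print(scores, f"mean: {average:.2f}")
```

Returning a dictionary makes it straightforward to store ratings from several reviewers and compare them with the automated metrics.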

# Automated Evaluation

In [5]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")

    return {"relevance": relevance, "specificity": specificity}

# Example usage
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning:**  The algorithm learns from a labeled dataset, meaning the data includes both input features and the corresponding correct output.  The algorithm learns to map inputs to outputs based on this labeled data. Examples include image classification (labeling images with categories) and spam detection (classifying emails as spam or not spam).

2. **Unsupervised Learning:** The algorithm learns from an unlabeled dataset, meaning the data only contains input features without corresponding correct outputs. The algorithm tries to find patterns, structures, or relationships within the data. Examples include clustering (grouping similar data points together) and dimensionality reduction (reducing the number of variables while preserving important information).

3. **Reinforcement Learning:** The algorithm learns through trial and error by interacting with a

{'relevance': np.float32(0.82206047), 'specificity': 0.6625766871165644}
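`automated_evaluation` reports relevance and specificity for a single response; consistency needs several responses to the same prompt. The sketch below shows the pairwise-mean idea behind `consistency_score` in a form that runs offline. It assumes a simple word-overlap (Jaccard) similarity as a stand-in for the embedding-based `semantic_similarity`, and the responses are hand-written placeholders rather than model output.

```python
from itertools import combinations
from statistics import mean

def jaccard_similarity(text1, text2):
    """Stand-in similarity (assumption): shared words divided by all words."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def consistency_score(responses, similarity=jaccard_similarity):
    """Mean pairwise similarity; 1.0 when there are fewer than two responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return mean(similarity(x, y) for x, y in pairs)

# Placeholder responses standing in for repeated calls with the same prompt
responses = [
    "supervised unsupervised and reinforcement learning",
    "supervised unsupervised and reinforcement learning",
    "supervised and unsupervised learning",
]
score = consistency_score(responses)
print(f"Consistency: {score:.2f}")
```

In the notebook itself, the same pattern would presumably use `similarity=semantic_similarity` and a list built from repeated `llm.invoke(prompt).content` calls.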

# Comparing Prompts

In [6]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})

    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)

    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")

    return sorted_results

# Example usage
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Machine learning can be broadly categorized in several ways, and these categories often overlap.  Here are some common types:

**Based on Learning Style:**

* **Supervised Learning:** The algorithm learns from a labeled dataset, where each data point is tagged with the correct answer.  The goal is to learn a mapping from inputs to outputs.  Examples include:
    * **Regression:** Predicting a continuous output (e.g., house price prediction).
    * **Classification:** Predicting a categorical output (e.g., spam detection).

* **Unsupervised Learning:** The algorithm learns from an unlabeled dataset, identifying patterns and structures without explicit guidance. Examples include:
    * **Clustering:** Grouping similar data points together (e.g., customer segmentation).
    * **Dimensionality Reduction:** Reducing the number of variables while preserving important information (e.g., principal component analysis).
    * **Association Ru

[{'prompt': 'List the types of machine learning.',
  'relevance': np.float32(0.6978775),
  'specificity': 0.5761421319796954},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': np.float32(0.652776),
  'specificity': 0.5704225352112676},
 {'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': np.float32(0.5918125),
  'specificity': 0.6416184971098265}]
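`compare_prompts` ranks by relevance alone, yet the prompt with the lowest relevance above has the highest specificity. One way to take both metrics into account is a weighted sum, sketched below using the scores from the output above. The 0.7/0.3 weights and the helper name `combined_score` are assumptions chosen for illustration, not recommended values.

```python
def combined_score(result, weights):
    """Weighted sum of the metrics named in `weights`."""
    return sum(weight * float(result[name]) for name, weight in weights.items())

# Scores copied from the comparison output above
results = [
    {"prompt": "List the types of machine learning.",
     "relevance": 0.6978775, "specificity": 0.5761421319796954},
    {"prompt": "Explain the different approaches to machine learning.",
     "relevance": 0.652776, "specificity": 0.5704225352112676},
    {"prompt": "What are the main categories of machine learning algorithms?",
     "relevance": 0.5918125, "specificity": 0.6416184971098265},
]

weights = {"relevance": 0.7, "specificity": 0.3}  # assumed weights
ranked = sorted(results, key=lambda r: combined_score(r, weights), reverse=True)

for position, result in enumerate(ranked, 1):
    print(f"{position}. {combined_score(result, weights):.3f}  {result['prompt']}")
```

With these weights the order matches the relevance-only ranking; a heavier weight on specificity could change it, so the weights should reflect what matters for the task. Calling `float()` on each metric also converts the `np.float32` values returned by the notebook into plain Python floats.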

In [7]:
def evaluate_prompt(prompt, expected_content, manual_criteria=('Clarity', 'Accuracy', 'Relevance')):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content

    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)

    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)

    return {"prompt": prompt, "response": response, **auto_results}

# Example usage

prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting in machine learning occurs when a model learns the training data *too well*.  Instead of learning the underlying patterns and relationships in the data that generalize to new, unseen data (which is the goal), it memorizes the specific details and noise present in the training set.  This results in a model that performs exceptionally well on the training data but poorly on any new, unseen data.

Think of it like this:  Imagine you're trying to learn the relationship between hours studied and exam scores.  Overfitting would be like memorizing the exact scores of every student in your training dataset instead of understanding the general trend that more study time tends to lead to better scores.  You'd perfectly predict the scores of the students you already know, but you'd be terrible at predicting the score of a new student.

Here's a breakdown of the key aspects:

* **High traini

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting in machine learning occurs when a model learns the training data *too well*.  Instead of learning the underlying patterns and relationships in the data that generalize to new, unseen data (which is the goal), it memorizes the specific details and noise present in the training set.  This results in a model that performs exceptionally well on the training data but poorly on any new, unseen data.\n\nThink of it like this:  Imagine you're trying to learn the relationship between hours studied and exam scores.  Overfitting would be like memorizing the exact scores of every student in your training dataset instead of understanding the general trend that more study time tends to lead to better scores.  You'd perfectly predict the scores of the students you already know, but you'd be terrible at predicting the score of a new student.\n\nHere's a breakdown of the key aspects:\n\n* **High training acc