# Prompt Effectiveness Evaluation

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

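# Fail early with a clear message: assigning None to os.environ raises a TypeError
api_key = os.getenv("GROQ_API_KEY")
if not api_key:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file.")
os.environ["GROQ_API_KEY"] = api_key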

In [2]:
from langchain_groq import ChatGroq
from langchain_core.prompts import PromptTemplate

llm = ChatGroq(model="llama3-8b-8192", max_tokens=1000)

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import numpy as np

sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(text1, text2):
    """Calculate semantic similarity between two texts using cosine similarity."""
    embeddings = sentence_model.encode([text1, text2])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]



# Metrics for Measuring Prompt Performance

In [4]:
def relevance_score(response, expected_content):
    """Calculate relevance score based on semantic similarity to expected content."""
    return semantic_similarity(response, expected_content)

In [5]:
def consistency_score(responses):
    """Calculate consistency score based on similarity between multiple responses."""
    if len(responses) < 2:
        return 1.0  # Perfect consistency if there's only one response
    similarities = []
    for i in range(len(responses)):
        for j in range(i+1, len(responses)):
            similarities.append(semantic_similarity(responses[i], responses[j]))
    return np.mean(similarities)
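
The pairwise averaging in `consistency_score` can be tried without loading the embedding model. The sketch below is a minimal illustration, not the notebook's metric: `jaccard_similarity` and `pairwise_consistency` are hypothetical names, and token-set overlap stands in for the cosine similarity of sentence embeddings, so the numbers are not comparable with those from `semantic_similarity`.

```python
from itertools import combinations
from statistics import mean


def jaccard_similarity(text1, text2):
    """Stand-in similarity (assumption): token-set overlap in [0, 1]."""
    a, b = set(text1.lower().split()), set(text2.lower().split())
    if not a and not b:
        return 1.0  # Two empty texts are treated as identical
    return len(a & b) / len(a | b)


def pairwise_consistency(responses, similarity=jaccard_similarity):
    """Mean similarity over every unordered pair of responses."""
    if len(responses) < 2:
        return 1.0  # Same convention as consistency_score above
    return mean(similarity(x, y) for x, y in combinations(responses, 2))


# Three hypothetical responses to the same prompt; the third one drifts.
samples = [
    "supervised unsupervised reinforcement",
    "supervised unsupervised reinforcement",
    "supervised unsupervised clustering",
]
score = pairwise_consistency(samples)
print(f"Consistency: {score:.2f}")
```

With n responses there are n(n-1)/2 pairs, so the cost grows quadratically with the number of sampled responses.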

In [6]:
def specificity_score(response):
    """Calculate specificity score as the ratio of unique words to total words."""
    words = response.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0
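
The ratio of unique words to total words is sensitive to tokenization: splitting on whitespace treats `cat`, `cat,` and `Cat` as different words. The sketch below restates `specificity_score` so the block runs on its own and compares it with `normalized_specificity`, a hypothetical variant (not part of the notebook) that lowercases tokens and strips surrounding punctuation.

```python
import string


def specificity_score(response):
    """Ratio of unique words to total words, as defined above."""
    words = response.split()
    return len(set(words)) / len(words) if words else 0


def normalized_specificity(response):
    """Hypothetical variant: lowercase and strip punctuation before counting."""
    words = [w.strip(string.punctuation).lower() for w in response.split()]
    words = [w for w in words if w]  # Drop tokens that were only punctuation
    return len(set(words)) / len(words) if words else 0


text = "The cat sat. the cat, sat"
raw = specificity_score(text)        # All six raw tokens are distinct
norm = normalized_specificity(text)  # Three distinct words out of six
print(f"raw={raw:.2f} normalized={norm:.2f}")
```

Either way, the ratio tends to fall as a response gets longer, because common words repeat. Comparing scores across responses of very different lengths should be done with care.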

# Manual Evaluation Techniques

In [7]:
def manual_evaluation(prompt, response, criteria):
    """Simulate manual evaluation of a prompt-response pair."""
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print("\nEvaluation Criteria:")
    for criterion in criteria:
        score = float(input(f"Score for {criterion} (0-10): "))
        print(f"{criterion}: {score}/10")
    print("\nAdditional Comments:")
    comments = input("Enter any additional comments: ")
    print(f"Comments: {comments}")
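
`manual_evaluation` prints the scores but does not keep them, and `float(input(...))` raises a `ValueError` on non-numeric input. The sketch below shows one possible way to validate and return the scores. `collect_scores` and its `ask` parameter are hypothetical and not part of the notebook; `ask` defaults to `input` and is injectable so the helper can be exercised without a human rater.

```python
def collect_scores(criteria, ask=input):
    """Hypothetical helper: ask for a 0-10 score per criterion and return them."""
    scores = {}
    for criterion in criteria:
        while True:
            raw = ask(f"Score for {criterion} (0-10): ")
            try:
                value = float(raw)
            except ValueError:
                continue  # Not a number: ask again
            if 0 <= value <= 10:
                scores[criterion] = value
                break  # Valid score: move on to the next criterion
    return scores


# Scripted answers stand in for a human rater; the first two are rejected.
answers = iter(["eleven", "11", "8", "9.5"])
scores = collect_scores(["Clarity", "Accuracy"], ask=lambda _: next(answers))
print(scores)
```

Returning a dictionary makes it possible to store manual scores alongside the automated metrics.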

In [8]:
prompt = "Explain the concept of machine learning in simple terms."
response = llm.invoke(prompt).content
criteria = ["Clarity", "Accuracy", "Simplicity"]
manual_evaluation(prompt, response, criteria)

Prompt: Explain the concept of machine learning in simple terms.
Response: Machine learning is a way for computers to learn and improve on their own, without being explicitly programmed. Here's a simple explanation:

Imagine you're trying to teach a child to recognize different types of animals. You show them a picture of a cat, and say "this is a cat." Then, you show them a picture of a dog, and say "this is a dog." The child looks at the pictures and tries to figure out what makes a cat different from a dog.

As the child sees more and more pictures, they start to notice patterns. They might notice that cats have pointy ears and dogs have floppy ears. They might notice that cats are usually small and dogs are usually big. The child is learning from the pictures, and their brain is forming connections between the different features they see.

Machine learning is similar, but instead of a child, it's a computer program that's learning from data. The data is like a big collection of pic

# Automated Evaluation Techniques

In [9]:
def automated_evaluation(prompt, response, expected_content):
    """Perform automated evaluation of a prompt-response pair."""
    relevance = relevance_score(response, expected_content)
    specificity = specificity_score(response)
    
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"\nRelevance Score: {relevance:.2f}")
    print(f"Specificity Score: {specificity:.2f}")
    
    return {"relevance": relevance, "specificity": specificity}

In [10]:
prompt = "What are the three main types of machine learning?"
expected_content = "The three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
response = llm.invoke(prompt).content
automated_evaluation(prompt, response, expected_content)

Prompt: What are the three main types of machine learning?
Response: The three main types of machine learning are:

1. **Supervised Learning**: In supervised learning, the algorithm is trained on labeled data, meaning that each example in the training data is accompanied by a target or response variable. The goal is to learn a mapping between input data and output labels, so that the algorithm can make predictions on new, unseen data. Examples of supervised learning tasks include image classification, speech recognition, and sentiment analysis.
2. **Unsupervised Learning**: In unsupervised learning, the algorithm is trained on unlabeled data, and the goal is to discover patterns or structure in the data. This type of learning is often used for clustering, dimensionality reduction, and anomaly detection. Examples of unsupervised learning tasks include customer segmentation, recommender systems, and image segmentation.
3. **Reinforcement Learning**: In reinforcement learning, the algorit

{'relevance': np.float32(0.7931731), 'specificity': 0.5844748858447488}

# Comparative Analysis

In [11]:
def compare_prompts(prompts, expected_content):
    """Compare the effectiveness of multiple prompts for the same task."""
    results = []
    for prompt in prompts:
        response = llm.invoke(prompt).content
        evaluation = automated_evaluation(prompt, response, expected_content)
        results.append({"prompt": prompt, **evaluation})
    
    # Sort results by relevance score
    sorted_results = sorted(results, key=lambda x: x['relevance'], reverse=True)
    
    print("Prompt Comparison Results:")
    for i, result in enumerate(sorted_results, 1):
        print(f"\n{i}. Prompt: {result['prompt']}")
        print(f"   Relevance: {result['relevance']:.2f}")
        print(f"   Specificity: {result['specificity']:.2f}")
    
    return sorted_results

In [12]:
prompts = [
    "List the types of machine learning.",
    "What are the main categories of machine learning algorithms?",
    "Explain the different approaches to machine learning."
]
expected_content = "The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning."
compare_prompts(prompts, expected_content)

Prompt: List the types of machine learning.
Response: Here are the main types of machine learning:

1. **Supervised Learning**: In this type of learning, the algorithm is trained on labeled data, where the output is already known. The goal is to learn a mapping between input data and output labels, so the algorithm can make predictions on new, unseen data.

Example: Image classification, sentiment analysis, spam detection

2. **Unsupervised Learning**: In this type of learning, the algorithm is trained on unlabeled data, and it must find patterns or structure in the data on its own. The goal is to discover hidden relationships or groupings in the data.

Example: Clustering, dimensionality reduction, anomaly detection

3. **Reinforcement Learning**: In this type of learning, the algorithm learns by interacting with an environment and receiving feedback in the form of rewards or penalties. The goal is to learn a policy that maximizes the rewards and minimizes the penalties.

Example: Gam

[{'prompt': 'List the types of machine learning.',
  'relevance': np.float32(0.82228476),
  'specificity': 0.4608030592734226},
 {'prompt': 'What are the main categories of machine learning algorithms?',
  'relevance': np.float32(0.73073),
  'specificity': 0.5818181818181818},
 {'prompt': 'Explain the different approaches to machine learning.',
  'relevance': np.float32(0.7171119),
  'specificity': 0.4584717607973422}]
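
`compare_prompts` ranks by relevance alone. If both metrics should count, one option is a weighted sum. The sketch below is only an illustration: `rank_by_combined_score` and the 0.7/0.3 weights are assumptions rather than anything the notebook prescribes, and the input values are copied from the output above as plain floats.

```python
def rank_by_combined_score(results, relevance_weight=0.7, specificity_weight=0.3):
    """Hypothetical ranking: weighted sum of relevance and specificity."""
    scored = [
        {
            **r,
            "combined": relevance_weight * float(r["relevance"])
            + specificity_weight * float(r["specificity"]),
        }
        for r in results
    ]
    return sorted(scored, key=lambda r: r["combined"], reverse=True)


# Values copied from the comparison output above.
results = [
    {"prompt": "List the types of machine learning.",
     "relevance": 0.82228476, "specificity": 0.4608030592734226},
    {"prompt": "What are the main categories of machine learning algorithms?",
     "relevance": 0.73073, "specificity": 0.5818181818181818},
    {"prompt": "Explain the different approaches to machine learning.",
     "relevance": 0.7171119, "specificity": 0.4584717607973422},
]

ranked = rank_by_combined_score(results)
for i, r in enumerate(ranked, 1):
    print(f"{i}. {r['prompt']} (combined: {r['combined']:.2f})")
```

With these weights the order matches the relevance-only ranking. Because model responses vary between runs, scores from a single run per prompt are best read as indicative rather than conclusive.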

# All Together

In [13]:
def evaluate_prompt(prompt, expected_content, manual_criteria=("Clarity", "Accuracy", "Relevance")):
    """Perform a comprehensive evaluation of a prompt using both manual and automated techniques."""
    response = llm.invoke(prompt).content
    
    print("Automated Evaluation:")
    auto_results = automated_evaluation(prompt, response, expected_content)
    
    print("\nManual Evaluation:")
    manual_evaluation(prompt, response, manual_criteria)
    
    return {"prompt": prompt, "response": response, **auto_results}

In [14]:
prompt = "Explain the concept of overfitting in machine learning."
expected_content = "Overfitting occurs when a model learns the training data too well, including its noise and fluctuations, leading to poor generalization on new, unseen data."
evaluate_prompt(prompt, expected_content)

Automated Evaluation:
Prompt: Explain the concept of overfitting in machine learning.
Response: Overfitting is a common problem in machine learning where a model becomes too specialized in fitting the training data, resulting in poor performance on new, unseen data. In other words, the model becomes too complex and starts to memorize the training data rather than learning generalizable patterns.

Here are some key characteristics of overfitting:

1. **High training accuracy**: The model performs very well on the training data, often with high accuracy.
2. **Poor generalization**: The model performs poorly on new, unseen data, including validation and test sets.
3. **High variance**: The model is highly sensitive to small changes in the training data, making it prone to overfitting.

Overfitting occurs when a model is too complex, such as:

1. **Too many parameters**: A model with too many parameters can fit the noise in the training data, leading to overfitting.
2. **Too much regulariz

{'prompt': 'Explain the concept of overfitting in machine learning.',
 'response': "Overfitting is a common problem in machine learning where a model becomes too specialized in fitting the training data, resulting in poor performance on new, unseen data. In other words, the model becomes too complex and starts to memorize the training data rather than learning generalizable patterns.\n\nHere are some key characteristics of overfitting:\n\n1. **High training accuracy**: The model performs very well on the training data, often with high accuracy.\n2. **Poor generalization**: The model performs poorly on new, unseen data, including validation and test sets.\n3. **High variance**: The model is highly sensitive to small changes in the training data, making it prone to overfitting.\n\nOverfitting occurs when a model is too complex, such as:\n\n1. **Too many parameters**: A model with too many parameters can fit the noise in the training data, leading to overfitting.\n2. **Too much regulariza