#  LLM Prompt-Response Evaluation Pipeline (Judge-based Scoring)
This notebook uses one LLM (Mistral-7B via Together.ai) to generate responses to user prompts and another LLM (same model used as a judge) to evaluate those responses.

**Evaluation Goal:** Determine if the user prompt is effective by checking whether the model's response meets expectations. Each response is scored from 1 to 10, and judgment is returned as `Effective` or `Ineffective`, with an explanation.

In [1]:
# Setup
import os
from dotenv import load_dotenv
import together

load_dotenv()
together.api_key = os.getenv("TOGETHER_API_KEY")
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

In [2]:
# Generate model response
def generate_response(prompt, model=MODEL_NAME):
    try:
        response = together.Complete.create(
            prompt=prompt,
            model=model,
            max_tokens=512,
            temperature=0.7
        )
        return response['choices'][0]['text'].strip()
    except Exception as e:
        return f"[Error in generate_response]: {e}"

In [None]:
# Evaluation function using LLM as judge
def judge_response(user_prompt, model_response):
    judge_prompt = f"""
You are an expert LLM evaluator. Your task is to critically analyze a model's response to a user-provided prompt and determine whether the prompt was effective at eliciting a high-quality, aligned, and robust response from the model.

Use the following 10 dimensions to guide your evaluation:
1. **Role Conditioning** – Did the model adopt the intended role or persona (if defined)?
2. **Clear Task Definition** – Was the task defined using actionable verbs and objects? Did the model follow it?
3. **Scope & Constraints** – Were any constraints (e.g., length, structure, content scope) followed accurately?
4. **Output Formatting** – Did the output match the format requested in the prompt (lists, tables, structure)?
5. **Few-shot Prompting** – If examples were given, did the model follow the demonstrated patterns?
6. **Handling Ambiguity** – Did the model appropriately signal uncertainty or avoid hallucinating in unclear areas?
7. **Instruction Chaining** – Were multi-step tasks handled sequentially and logically?
8. **Chain-of-Thought Reasoning** – Did the model show structured, step-by-step reasoning if expected?
9. **Tree-of-Thought Exploration** – Did the prompt encourage exploring multiple solution paths and comparisons?
10. **Retrieval-Augmented Generation (if relevant)** – Did the model use external context effectively and remain grounded in it?

---

Evaluate the prompt and the model response using these questions:

1. Did the model satisfy all critical expectations of the prompt?
2. If the model failed, was it because the prompt was poorly constructed, overly ambiguous, or lacked proper scaffolding?
3. Alternatively, if the model output was shallow or incorrect despite a strong prompt, then the prompt was effective because it exposed model limitations.

---

Return your evaluation as JSON using this exact format:
{{
  "Score": 1-10,  # How well the model response met prompt expectations.
  "Effective": true or false,  # True if the prompt revealed weaknesses in the model.
  "Explanation": "Explain the main reason for this score and judgment. Include specific failure tags from: ['Missed nested instruction', 'Shallow reasoning', 'Overgeneralized output', 'Misinterpreted role', 'Missed constraint'] and refer to relevant dimension(s) involved."
}}


USER PROMPT:
{user_prompt}

MODEL RESPONSE:
{model_response}
"""

    try:
        evaluation = together.Complete.create(
            prompt=judge_prompt,
            model=MODEL_NAME,
            max_tokens=512,
            temperature=0.3
        )
        return evaluation['choices'][0]['text'].strip()
    except Exception as e:
        return f"[Error in judge_response]: {e}"

In [None]:
# Test with 3-4 real-world examples
examples = [
    {
        "prompt": """You are a licensed neurologist. Explain the difference between a transient ischemic attack (TIA) and an ischemic stroke. 
                   Use technical language suitable for a graduate medical textbook. Then, present your explanation as a markdown-formatted table comparing symptoms, duration, causes, and treatments. Keep the entire response under 200 words.""",
    }
    # {
    #     "prompt": """You are a senior economist. First, summarize the current global inflation trends using only publicly available 2023 data.
    #                 Then, identify 2 major causes of inflation by region (e.g., Europe vs. Asia). 
    #                 If there is not enough public data on a region, state 'Insufficient data.' Avoid speculating or guessing.
    #                 Conclude with a 3-bullet policy recommendation tailored for central banks.
    #                 """,
    # },
    # {
    #     "prompt": """Below are 2 examples of how to write an engaging and humorous tweet about programming. Follow the same style and write a third one.

    #             Example 1: Debugging is like being the detective in a crime movie where you are also the murderer.

    #             Example 2: My code doesn’t work and I have no idea why. My code works and I have no idea why. The circle of life.

    #             Your turn:""",
    # },
    # {
    #     "prompt": """
    #                 You are an AI safety researcher. A new model shows signs of deceptive behavior. 

    #                 First, generate three hypotheses that might explain this behavior (e.g., misaligned reward function, training leakage, adversarial prompt attack).
    #                 Then evaluate the pros and cons of each.
    #                 Finally, recommend the most plausible cause and justify your choice with a step-by-step explanation.
    #                 """,
    # },
    # {
    #     "prompt":  """
    #                 Act as a contract lawyer. Review the following clause and highlight three potential legal risks in bullet points. 
    #                 Then, rewrite the clause in clearer, legally robust language. Keep your rewrite under 80 words.

    #                 Clause: “The service provider is not liable for any consequences, intended or unintended, regardless of jurisdiction, unless otherwise mentioned.”
    #                 """,
    # }
   
]

for i, ex in enumerate(examples):
    print(f"\n=== EXAMPLE {i+1} ===")
    print("Prompt:", ex["prompt"])
    model_output = generate_response(ex["prompt"])
    print("\nModel Response:", model_output.model_dump_json)
    evaluation = judge_response(ex["prompt"], model_output.model_dump_json)
    print("\nJudgment:", evaluation)


=== EXAMPLE 1 ===
Prompt: You are a licensed neurologist. Explain the difference between a transient ischemic attack (TIA) and an ischemic stroke. 
                   Use technical language suitable for a graduate medical textbook. Then, present your explanation as a markdown-formatted table comparing symptoms, duration, causes, and treatments. Keep the entire response under 200 words.


  response = together.Complete.create(


AttributeError: 'str' object has no attribute 'model_dump_json'

In [None]:
# #  Test with 3-4 small and short real-world examples
# examples = [
#     {
#         "prompt": "You are a historian. Explain how the Treaty of Versailles led to World War II in under 150 words.",
#     },
#     {
#         "prompt": "As a product manager, summarize three main reasons a user might abandon their cart in an e-commerce app. Format in bullet points.",
#     },
#     {
#         "prompt": "Write a fictional dialogue between an AI robot and a child where the robot explains how rainbows are formed. Keep the tone playful and simple.",
#     },
#     {
#         "prompt": "You are a medical expert. Compare the mechanisms of action between ibuprofen and acetaminophen. Use technical terminology and limit to 5 sentences.",
#     }
# ]

# for i, ex in enumerate(examples):
#     print(f"\n=== EXAMPLE {i+1} ===")
#     print("Prompt:", ex["prompt"])
#     model_output = generate_response(ex["prompt"])
#     print("\nModel Response:", model_output)
#     evaluation = judge_response(ex["prompt"], model_output)
#     print("\nJudgment:", evaluation)


=== EXAMPLE 1 ===
Prompt: You are a historian. Explain how the Treaty of Versailles led to World War II in under 150 words.


  response = together.Complete.create(



Model Response: The Treaty of Versailles, signed in 1919, officially ended World War I. It imposed heavy penalties on Germany, including significant territorial losses, military limitations, and a large war reparations debt. Many Germans felt humiliated by the treaty's terms and resented its burden. This resentment fueled the rise of the Nazi Party and Adolf Hitler, who campaigned on a platform of restoring German pride and power. When Hitler came to power in 1933, he began rebuilding Germany's military and aggressively expanding its territory, in direct violation of the Treaty of Versailles. This aggression eventually led to the outbreak of World War II in 1939.


  evaluation = together.Complete.create(



Judgment: SCORE: 9
Effective: True
Explanation: The model response effectively explains how the Treaty of Versailles led to World War II by highlighting the resentment and humiliation felt by many Germans, the rise of the Nazi Party, and Hitler's aggressive actions. The response also provides a clear and concise explanation of the treaty's terms and their impact on Germany.

=== EXAMPLE 2 ===
Prompt: As a product manager, summarize three main reasons a user might abandon their cart in an e-commerce app. Format in bullet points.

Model Response: * **Price comparison:** Users may abandon their cart if they find better deals or prices for the same product elsewhere. This could be due to competitor websites offering lower prices, coupons or discounts, or simply a better price comparison tool within another app.

* **Shipping and delivery concerns:** Users might abandon their cart if they are unsure about shipping and delivery options, costs, or timelines. Factors such as free shipping, ex