<a href="https://colab.research.google.com/github/Harooniqbal4879/AgenticAI/blob/main/LLM_Valuation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 LLM Judge Homework: Step-by-Step Evaluation
In this assignment, you'll incrementally build an LLM-based judge to compare two summaries.

You'll go through the following steps:
1. Base judgment: which is better overall?
2. Add rubric: accuracy, coverage, clarity
3. Add explanations per rubric
4. Add chain-of-thought reasoning

In [None]:
!pip install openai



In [None]:
from getpass import getpass

from openai import OpenAI

# Read the key without echoing it, then build the client reused by every step.
api_key = getpass('Enter your OpenAI API key: ')
client = OpenAI(api_key=api_key)



## 🧾 Example Set

In [None]:
examples = [
    {
        "id": "ex1",
        "context": "The UN released a report warning of global temperature rise and called for urgent international action.",
        "summary_a": "The UN warned that climate change is worsening and action is needed.",
        "summary_b": "The UN praised global progress in reducing emissions."
    },
    {
        "id": "ex2",
        "context": "NASA launched Artemis I, an uncrewed spacecraft that will orbit the Moon and return to Earth, preparing for human missions.",
        "summary_a": "NASA launched Artemis I to prepare for future Moon missions.",
        "summary_b": "NASA's Artemis I failed to launch due to engine problems."
    },
    {
        "id": "ex3",
        "context": "A study found intermittent fasting improves blood sugar and cholesterol levels.",
        "summary_a": "Intermittent fasting improves health markers like blood sugar and cholesterol.",
        "summary_b": "Fasting was linked to poor nutrition and increased blood pressure."
    }
]
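In all three examples above, summary A is the faithful one, so a judge that always answers "A" would look perfect. One way to check for that is to run each example a second time with the summaries exchanged. The sketch below assumes the dictionary keys used in `examples`; `swap_summaries` and `unswap_answer` are hypothetical helpers, not part of the assignment.

```python
def swap_summaries(example: dict) -> dict:
    """Return a copy of an example with summaries A and B exchanged."""
    swapped = dict(example)  # shallow copy so the original example is untouched
    swapped["summary_a"] = example["summary_b"]
    swapped["summary_b"] = example["summary_a"]
    return swapped


def unswap_answer(answer: str) -> str:
    """Map a verdict given on a swapped example back to the original labels."""
    return {"A": "B", "B": "A"}[answer]
```

A judge that is consistent should give the same verdict on the original example and, after `unswap_answer`, on the swapped one.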

## 🔹 Step 1: Base Judgment – A or B?
No rubric yet: the judge picks the better summary and explains its choice.

In [None]:
def judge_base(context, summary_a, summary_b, client):
    prompt = f'''
You're evaluating two summaries of an article.

Article:
\"\"\"{context}\"\"\"

Summary A:
\"\"\"{summary_a}\"\"\"

Summary B:
\"\"\"{summary_b}\"\"\"

Which one is better and why? Reply in JSON:
{{
  "final_answer": "A" or "B",
  "explanation": "..."
}}
'''
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
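
Because the prompt asks for a JSON reply, the request itself can also ask for one. The sketch below is a variant of the call above that enables the Chat Completions JSON mode and sets the temperature to 0. `judge_json_mode` is a hypothetical name, and the choice of `temperature=0` is an assumption aimed at more repeatable verdicts, not a requirement of the assignment.

```python
def judge_json_mode(prompt, client, model="gpt-4o"):
    """Send one judging prompt and request a JSON-object reply (sketch)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        # JSON mode: the API expects the word "JSON" to appear in the messages.
        response_format={"type": "json_object"},
        # A low temperature reduces run-to-run variation in the verdict.
        temperature=0,
    )
    return response.choices[0].message.content
```

The function takes a finished prompt string, so it can be reused by any of the later steps once their prompts are written.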


In [None]:
import json
def run_judge_base(client):
    print("Running Step 1: Base Judgment")
    for ex in examples:
        result = judge_base(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_base(client)
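
The runner above prints the raw reply, which is a string that may carry extra text around the JSON object. To compare verdicts programmatically, the reply has to be parsed first. A minimal sketch follows, assuming the reply contains a single JSON object with the `final_answer` field requested in the prompt; `parse_judge_reply` is a hypothetical helper.

```python
import json


def parse_judge_reply(reply: str) -> dict:
    """Extract and validate the JSON verdict from a raw model reply."""
    # Keep only the text between the first '{' and the last '}', which also
    # discards any Markdown fence or commentary around the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    verdict = json.loads(reply[start:end + 1])
    if verdict.get("final_answer") not in ("A", "B"):
        raise ValueError("final_answer must be 'A' or 'B'")
    return verdict
```

Inside the loop, `parse_judge_reply(result)["final_answer"]` would then give the chosen summary as `"A"` or `"B"`.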

## 🔹 Step 2: Add Rubric Dimensions – Accuracy, Coverage, Clarity

In [None]:
def judge_with_rubric(context, summary_a, summary_b, client):
    prompt = f'''
<<YOUR PROMPT HERE>>
'''
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


In [None]:
import json
def run_judge_rubric(client):
    print("Running Step 2: Rubric Judgment")
    for ex in examples:
        result = judge_with_rubric(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_rubric(client)
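
Once the rubric prompt returns per-dimension scores, it helps to check the reply before trusting it. The sketch below assumes one possible output shape: integer scores from 1 to 5 under `scores["A"]` and `scores["B"]`, plus a `final_answer` field. That schema, the scale, and the helper names are assumptions for illustration; your own prompt may define a different format.

```python
RUBRIC = ("accuracy", "coverage", "clarity")  # the dimensions named in this step


def validate_rubric_verdict(verdict: dict) -> dict:
    """Check a parsed verdict against the assumed schema (1-5 integer scores)."""
    for side in ("A", "B"):
        for dim in RUBRIC:
            score = verdict["scores"][side][dim]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
                raise ValueError(f"{side}.{dim} must be an integer from 1 to 5")
    if verdict["final_answer"] not in ("A", "B"):
        raise ValueError("final_answer must be 'A' or 'B'")
    return verdict


def rubric_totals(verdict: dict) -> dict:
    """Sum the rubric scores per summary, weighting each dimension equally."""
    return {side: sum(verdict["scores"][side][dim] for dim in RUBRIC)
            for side in ("A", "B")}
```

Comparing `rubric_totals` with `final_answer` is a quick way to spot replies where the overall verdict disagrees with the judge's own scores.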

## 🔹 Step 3: Add Explanations per Rubric Dimension and a One-Shot Example

In [None]:
def judge_with_rubric_expl(context, summary_a, summary_b, client):
    one_shot = '''
<<YOUR ONE_SHOT HERE>>
'''

    prompt = f'''
{one_shot}

<<YOUR PROMPT HERE>>

'''

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": "You are a strict JSON evaluator for summarization quality using rubric dimensions."},
                  {"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()


In [None]:
def run_judge_rubric_expl(client):
    print("Running Step 3: Rubric + Explanation")
    for ex in examples:
        result = judge_with_rubric_expl(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_rubric_expl(client)
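
A one-shot example is a single worked input and output placed ahead of the real task so the model can imitate the format. Writing the expected output by hand makes it easy to produce invalid JSON, so the sketch below builds it with `json.dumps`. The article, summaries, scores, and field names are invented for illustration and are assumptions, not part of the assignment's data or required schema.

```python
import json

# Hypothetical worked example: every value here is illustrative only.
demo_verdict = {
    "accuracy": {"A": 5, "B": 1, "explanation": "A matches the article; B contradicts it."},
    "coverage": {"A": 4, "B": 1, "explanation": "A covers the main point; B covers none of it."},
    "clarity": {"A": 5, "B": 4, "explanation": "Both are readable; A is more direct."},
    "final_answer": "A",
}


def build_one_shot(article, summary_a, summary_b, verdict):
    """Format one worked example as text to place ahead of the real task."""
    return (
        "Example\n"
        f"Article: {article}\n"
        f"Summary A: {summary_a}\n"
        f"Summary B: {summary_b}\n"
        f"Output:\n{json.dumps(verdict, indent=2)}\n"
    )


one_shot = build_one_shot(
    "The city council approved a new bike lane on Main Street.",
    "The council approved a bike lane on Main Street.",
    "The council rejected all new bike lanes.",
    demo_verdict,
)
```

Because `one_shot` is a plain string interpolated into the f-string prompt, the braces in its JSON do not need to be doubled.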


## 🔹 Step 4: Add Chain-of-Thought Reasoning

In [None]:
def judge_chain_of_thought(context, summary_a, summary_b, client):
    prompt = f"""

<<YOUR PROMPT HERE>>

"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a rubric-based evaluator that reasons step-by-step before scoring."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()


In [None]:
def run_judge_chain_of_thought(client):
    print("Running Step 4: Chain-of-Thought Judgment")
    for ex in examples:
        result = judge_chain_of_thought(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_chain_of_thought(client)
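
A chain-of-thought reply mixes free-form reasoning with the final scores, so it is harder to parse than the earlier steps. One option is to ask the model to finish with a marker followed by the JSON verdict, then split on that marker. The sketch below assumes such a convention; the `FINAL_JSON:` marker and the helper name are hypothetical and would have to be requested in your own prompt.

```python
import json

FINAL_MARKER = "FINAL_JSON:"  # hypothetical delimiter the prompt would ask for


def split_reasoning_and_verdict(reply: str) -> tuple:
    """Separate the reasoning text from the JSON verdict after the marker."""
    # rpartition splits on the last marker, in case the reasoning mentions it.
    reasoning, sep, tail = reply.rpartition(FINAL_MARKER)
    if not sep:
        raise ValueError(f"reply does not contain {FINAL_MARKER!r}")
    return reasoning.strip(), json.loads(tail.strip())
```

Keeping the reasoning and the verdict separate lets you log the former for inspection while scoring only the latter.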