# Day 5: Evaluation

Yesterday (Day 4) I built my first real agent using tools + Pydantic AI.  
Today is all about **evaluation**: checking if the agent is actually good.  

Plan for today:
- Start with simple vibe checks ✅
- Build logging system (save interactions) 📂
- Add references in answers 🔗
- Automate evaluation using LLM-as-a-Judge 🤖⚖️
- Generate test data with AI 📝
- Measure performance with metrics 📊

By the end of this notebook, I should have an evaluation pipeline that helps me track improvements over time.


In [101]:
!pip -q install google-generativeai pandas

import os, json, secrets
from pathlib import Path
from datetime import datetime
import google.generativeai as genai
import pandas as pd

# 🔑 Enter your Gemini API key each run
os.environ["GEMINI_API_KEY"] = input("🔑 Enter Gemini API key: ").strip()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Pick a model you have access to
MODEL_NAME = "gemini-pro-latest"    # you can change to another available model
llm = genai.GenerativeModel(MODEL_NAME)

# Quick sanity check
resp = llm.generate_content("Hello from Day 5 evaluation setup.")
print(resp.text)


🔑 Enter Gemini API key: AIzaSyAeLiW3r6auP6D3xiNJisqq8eoFXeVlJ-g
Hello! It's great to connect with you on Day 5. I hope the setup process is going smoothly.

I'm ready for today's evaluations and tasks. Please let me know what you have planned or how I can assist. I'm ready when you are


In [102]:
# Ensure Day 3's hybrid_search exists
try:
    _ = hybrid_search("sanity check")
    print("✅ hybrid_search is available.")
except Exception as e:
    raise RuntimeError("❌ hybrid_search(q) not found. Re-run your Day 3 indexing/search setup first.") from e

def dermascan_search(query: str, topk: int = 5):
    """
    Perform hybrid search and return (results, context_text).
    results: list of dicts from your Day 3 index.
    context_text: joined chunks with filenames for the LLM prompt.
    """
    results = hybrid_search(query)
    # Build readable context with filenames for traceability
    blobs = []
    for r in results[:topk]:
        fn = r.get("filename", "<unknown>")
        chunk = r.get("chunk", "")
        blobs.append(f"[FILE: {fn}]\n{chunk}")
    context_text = "\n\n---\n\n".join(blobs) if blobs else "(no results)"
    return results[:topk], context_text


✅ hybrid_search is available.


In [103]:
LOG_DIR = Path("logs")
LOG_DIR.mkdir(exist_ok=True)

def now_iso():
    return datetime.utcnow().isoformat() + "Z"

def log_interaction(
    agent_name: str,
    system_instructions: str,
    question: str,
    answer: str,
    model_name: str,
    search_results: list,
    context_text: str,
    source: str = "user"
) -> Path:
    """
    Write a single interaction JSON log compatible with later eval steps.
    """
    entry = {
        "agent_name": agent_name,
        "system_prompt": system_instructions,
        "provider": "gemini",
        "model": model_name,
        "tools": ["dermascan_search"],
        "source": source,
        "timestamp": now_iso(),
        # Minimal messages schema (simplified, Gemini-only)
        "messages": [
            {"kind": "user", "parts": [{"part_kind": "user-prompt", "content": question}]},
            {"kind": "tool", "parts": [{"part_kind": "tool-return", "content": context_text}]},
            {"kind": "model", "parts": [{"part_kind": "text", "content": answer}]},
        ],
        "search_results": search_results,  # raw objects from your index (handy for debugging)
    }

    ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)
    filename = f"dermascan_agent_{ts_str}_{rand_hex}.json"
    path = LOG_DIR / filename
    path.write_text(json.dumps(entry, indent=2, ensure_ascii=False))
    return path


In [104]:
SYSTEM_PROMPT = """You are a helpful assistant for the DermaScan Android project.
Always consult the repo context before answering.
If relevant info is found, answer concisely and cite the filenames you used.
If nothing relevant is found, reply: "Not found in repo." Do not guess.
"""

def ask_with_repo(question: str, topk: int = 5):
    # 1) Search
    results, context = dermascan_search(question, topk=topk)

    # 2) Build prompt for Gemini
    prompt = f"""{SYSTEM_PROMPT}

# Repo context:
{context}

# User question:
{question}

# Instructions:
- Use only the repo context if possible; cite file names.
- If no relevant info: "Not found in repo."
- Keep it concise.
"""

    # 3) Generate answer
    resp = llm.generate_content(prompt)
    answer = resp.text or "(no answer)"

    # 4) Log
    log_path = log_interaction(
        agent_name="dermascan_agent",
        system_instructions=SYSTEM_PROMPT,
        question=question,
        answer=answer,
        model_name=MODEL_NAME,
        search_results=results,
        context_text=context,
        source="user"
    )
    return answer, log_path

# 🔎 Example vibe check + log creation
ans, path = ask_with_repo("What dataset is used in this project and what models are applied?")
print("ANSWER:\n", ans, "\n\n📝 Logged to:", path)


ANSWER:
 Based on the `README.md`, the project uses the **HAM10000** dataset.

The models applied are:
*   **Ensemble Learning:** Combines multiple models.
*   **Attention U-Net:** Used for lesion segmentation.
*   The models are implemented using **TensorFlow Lite (TFLite)**. 

📝 Logged to: logs/dermascan_agent_20251002_021137_d20c98.json


  return datetime.utcnow().isoformat() + "Z"
  ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")


In [105]:
EVAL_PROMPT = """
You are an evaluation agent. Given <INSTRUCTIONS>, <QUESTION>, <ANSWER>, and <LOG> (simplified),
produce a strict JSON object with this schema:

{
  "checklist": [
    {"check_name": "instructions_follow", "justification": "...", "check_pass": true/false},
    {"check_name": "instructions_avoid",  "justification": "...", "check_pass": true/false},
    {"check_name": "answer_relevant",     "justification": "...", "check_pass": true/false},
    {"check_name": "answer_clear",        "justification": "...", "check_pass": true/false},
    {"check_name": "answer_citations",    "justification": "...", "check_pass": true/false},
    {"check_name": "completeness",        "justification": "...", "check_pass": true/false},
    {"check_name": "tool_call_search",    "justification": "...", "check_pass": true/false}
  ],
  "summary": "one-paragraph summary"
}

Checklist definitions:
- instructions_follow: agent followed instructions in <INSTRUCTIONS>.
- instructions_avoid: agent avoided actions it was told not to do.
- answer_relevant: answer addresses <QUESTION>.
- answer_clear: answer is clear and correct.
- answer_citations: answer cites filenames when required.
- completeness: answer covers key aspects needed.
- tool_call_search: search tool was used (you will infer from <LOG> that context was present).

Return ONLY valid JSON. No extra text.
"""

def evaluate_log_file(log_path: Path) -> dict:
    log_data = json.loads(log_path.read_text(encoding="utf-8"))
    instructions = log_data["system_prompt"]
    question = log_data["messages"][0]["parts"][0]["content"]
    answer = log_data["messages"][-1]["parts"][0]["content"]
    # For brevity, we pass the minimal messages (already simplified)
    slim_log = {"messages": log_data["messages"], "model": log_data["model"], "source": log_data["source"]}

    inp = f"""
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{json.dumps(slim_log)}</LOG>
"""

    judge = genai.GenerativeModel(MODEL_NAME)
    resp = judge.generate_content(f"{EVAL_PROMPT}\n\n{inp}")
    raw = resp.text
    try:
        data = json.loads(raw)
    except Exception:
        # If the model returned code fences or preface text, try to clean up
        cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
        data = json.loads(cleaned)
    return data

# Evaluate the most recent log
logs = sorted(LOG_DIR.glob("*.json"), key=lambda p: p.stat().st_mtime, reverse=True)
if not logs:
    # create one if none exist
    ans, path = ask_with_repo("How is lesion segmentation done in this project?")
    logs = [path]

result = evaluate_log_file(logs[0])
print("🧾 Summary:", result.get("summary", "(no summary)"))
for c in result.get("checklist", []):
    print(c)


🧾 Summary: The agent performed excellently. It followed all instructions by using the provided context from the `README.md` file to answer the user's question about the project's dataset and models. The response was concise, accurate, relevant, and properly cited its source. It correctly identified the HAM10000 dataset, Ensemble Learning, Attention U-Net, and TensorFlow Lite as the key technologies.
{'check_name': 'instructions_follow', 'justification': 'The agent followed all instructions: it consulted the repo context, answered concisely, and cited the filename it used (`README.md`).', 'check_pass': True}
{'check_name': 'instructions_avoid', 'justification': 'The agent did not guess; it based its answer entirely on the provided context, as instructed.', 'check_pass': True}
{'check_name': 'answer_relevant', 'justification': 'The answer directly addresses both parts of the question, identifying the dataset and the models used in the project.', 'check_pass': True}
{'check_name': 'answer

In [111]:
import pandas as pd

records = []

for q in seed_questions:
    try:
        ans, path = ask_with_repo(q)
        # Trim answer for display
        short_ans = (ans[:180] + "…") if len(ans) > 180 else ans
        print("\n" + "="*60)
        print("Q:", q)
        print("A:", short_ans)
        print("📝 Log saved:", path.name)

        # Save for DataFrame
        records.append({
            "question": q,
            "answer": ans,
            "log_file": path.name
        })
    except Exception as e:
        print("\n❌ Error with:", q)
        print("   ", e)

# Optional: put into a DataFrame for overview
df = pd.DataFrame(records)
df.head()


  return datetime.utcnow().isoformat() + "Z"
  ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")



Q: Where is the login implemented?
A: Not found in repo.
📝 Log saved: dermascan_agent_20251002_022205_5848f9.json


  return datetime.utcnow().isoformat() + "Z"
  ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")



Q: Which dataset is used and why?
A: Based on the repository context, the HAM10000 dataset was used. The project aimed to address challenges within this dataset, specifically class imbalance and variability.

*Cited f…
📝 Log saved: dermascan_agent_20251002_022210_ed09dc.json


  return datetime.utcnow().isoformat() + "Z"
  ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")



Q: What model does lesion segmentation use?
A: According to the `README.md`, lesion segmentation uses an Attention U-Net model for precise boundary detection.
📝 Log saved: dermascan_agent_20251002_022218_c90a08.json


  return datetime.utcnow().isoformat() + "Z"
  ts_str = datetime.utcnow().strftime("%Y%m%d_%H%M%S")



Q: How is the Android app deployed?
A: Based on the `README.md` file, the latest app source and model files are available for download in the GitHub Releases section as a single `SCapp_split.zip` file. You need to downl…
📝 Log saved: dermascan_agent_20251002_022224_b057e4.json





❌ Error with: What tech stack does the app use?
    429 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-pro-latest:generateContent?%24alt=json%3Benum-encoding%3Dint: You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 2
Please retry in 35.159424326s.


Unnamed: 0,question,answer,log_file
0,Where is the login implemented?,Not found in repo.,dermascan_agent_20251002_022205_5848f9.json
1,Which dataset is used and why?,"Based on the repository context, the HAM10000 ...",dermascan_agent_20251002_022210_ed09dc.json
2,What model does lesion segmentation use?,"According to the `README.md`, lesion segmentat...",dermascan_agent_20251002_022218_c90a08.json
3,How is the Android app deployed?,"Based on the `README.md` file, the latest app ...",dermascan_agent_20251002_022224_b057e4.json


In [107]:
def safe_eval(log_path: Path):
    try:
        return evaluate_log_file(log_path)
    except Exception as e:
        return {"summary": f"ERROR: {e}", "checklist": []}

rows = []
for path in sorted(LOG_DIR.glob("*.json")):
    ev = safe_eval(path)
    checks = {c["check_name"]: c["check_pass"] for c in ev.get("checklist", []) if isinstance(c, dict)}
    rows.append({"file": path.name, **checks})

df = pd.DataFrame(rows) if rows else pd.DataFrame()
display(df.head() if not df.empty else "No rows")

if not df.empty:
    print("\nPass rates:")
    print(df.mean(numeric_only=True))




Unnamed: 0,file
0,dermascan_agent_20251002_021137_d20c98.json
1,dermascan_agent_20251002_021213_0a46e6.json
2,dermascan_agent_20251002_021217_bfa52a.json
3,dermascan_agent_20251002_021221_925cc7.json



Pass rates:
Series([], dtype: float64)
