# Human in the Loop
1. The user provides a prompt as normal.
2. Prompt-based evaluation is used to evaluate the responses.
3. Get the best answer.
4. Check the confidence of the answer based on how close the ratings from each LLM are to one another.
5. If the answer is "low confidence", take the reasoning from each of the models and ask the user which one most accurately grades the low-confidence response. Then replace the averaged rating with the corresponding grade.
6. Go back to step 3.
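
Steps 3 to 6 can be sketched as a small loop. This is only an illustration under stated assumptions, not part of `PromptBasedEvaluator`: the names `pick_best`, `ask_user`, and `MAX_SPREAD` are hypothetical, "low confidence" is assumed to mean that the gap between the highest and lowest judge rating exceeds a threshold, and the ratings in the usage lines are made up.

```python
from statistics import mean

# Hypothetical threshold: the largest gap between judge ratings that still
# counts as agreement. The notebook does not specify a value.
MAX_SPREAD = 5.0


def pick_best(ratings_by_model, ask_user, max_spread=MAX_SPREAD):
    """Pick the best answer, deferring to the user when the judges disagree.

    ratings_by_model: {answering_model: {judge_model: score}}
    ask_user: callback (model, ratings) -> name of the judge the user trusts
    """
    # Start from the averaged rating of every answer.
    final = {m: mean(r.values()) for m, r in ratings_by_model.items()}
    resolved = set()  # answers the user has already adjudicated
    while True:
        best = max(final, key=final.get)  # step 3: get the best answer
        ratings = ratings_by_model[best]
        spread = max(ratings.values()) - min(ratings.values())  # step 4
        if best in resolved or spread <= max_spread:
            return best, final[best]
        # Step 5: low confidence, so the user's chosen grade replaces the average.
        final[best] = ratings[ask_user(best, ratings)]
        resolved.add(best)  # step 6: loop back to step 3


# Made-up ratings; the lambda stands in for a real prompt to the user.
example = {
    "A": {"judge_1": 90, "judge_2": 70},
    "B": {"judge_1": 78, "judge_2": 80},
}
best, score = pick_best(example, ask_user=lambda model, ratings: "judge_2")
print(best, score)
```

Marking an answer as resolved after the user has graded it keeps the loop from asking about the same answer twice, so it always terminates.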

In [5]:
from prompt_based_evaluator import PromptBasedEvaluator
import json
import pandas as pd

In [6]:
user_query = "How many gigabytes of VRAM should I have for 1080p gaming?"

rubric = """Correctness & Accuracy (25 points) — Ensures claims are factually accurate and verifiable, addressing the most critical concern of hallucination-free responses. This is weighted highest because inaccurate information undermines all other qualities.

Completeness (20 points) - Verifies the answer addresses all aspects of the query without significant omissions. This prevents shallow or partial responses that technically answer only part of the question.

Clarity & Coherence (18 points) - Assesses whether the answer is well-organized with logical flow. Research shows that coherence and relevance are strong signals of problem-solving quality.

Relevance (18 points) - Ensures all information pertains to the question, avoiding tangential content that confuses the issue. This maintains focus and efficiency.

Conciseness (10 points) - Rewards efficiency by penalizing unnecessary verbosity or repetition while maintaining completeness. This balances against verbose but complete responses.

Appropriateness for Context (9 points) — Checks whether tone, depth, and format match what the questioner likely needs. Technical questions require different treatment than conversational ones."""

In [7]:
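# Models that both answer the query and judge each other's answers
evaluator = PromptBasedEvaluator(
    "TNG: DeepSeek R1T2 Chimera (free)",
    "Meta: Llama 3.3 70B Instruct (free)",
    "AllenAI: Molmo2 8B (free)"
)
# Top-level await works here because this runs in a notebook cell
await evaluator.n_sq_evaluate(user_query, rubric)
# Judge-by-evaluated-model table of scores (displayed below)
df = evaluator.score_table()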

Got user query answers.
Got scoring results.


In [8]:
print(json.dumps(evaluator.user_query_answers, indent=4))

[
    {
        "model": "TNG: DeepSeek R1T2 Chimera (free)",
        "response": "The amount of VRAM (Video RAM) you need for **1080p gaming** depends on the **games you play**, **graphics settings**, and **future-proofing goals**. Here's a simplified breakdown:\n\n### **General Recommendations**\n1. **Minimum (Low Settings / Older Games):**  \n   - **4GB VRAM** is the **absolute minimum** for playable 1080p gaming in less demanding or older titles (e.g., eSports games like *CS2*, *Valorant*, or indie games).  \n   - *Downsides:* Modern AAA games will struggle, even at medium settings. Expect stuttering or texture issues.\n\n2. **Standard for 1080p (Medium-High Settings):**  \n   - **6GB VRAM**: Works smoothly for most games at **medium-high settings** (e.g., *Fortnite*, *Apex Legends*, *Elden Ring*).  \n   - *Downsides:* Newer AAA games (e.g., *Cyberpunk 2077*, *Hogwarts Legacy*) may require lowering textures or settings.\n\n3. **Ideal for Future-Proofing (Ultra Settings / Mods / AAA

In [9]:
print(json.dumps(evaluator.evaluation_query_answers, indent=4))

[
    {
        "evaluated_model": "Meta: Llama 3.3 70B Instruct (free)",
        "evaluating_model": "TNG: DeepSeek R1T2 Chimera (free)",
        "score": 85,
        "reasoning": "Response provides generally accurate VRAM recommendations with minor underestimation at lower tiers (2-4GB being borderline adequate for modern titles) and one non-critical GDDR spec error (GTX 1070 highlighted as GDDR5 instead of GDDR5X), but demonstrates strong completeness (covers settings tiers and system factors), clarity, and appropriateness - though Correctness deduction dominates due to rubric priority."
    },
    {
        "evaluated_model": "AllenAI: Molmo2 8B (free)",
        "evaluating_model": "TNG: DeepSeek R1T2 Chimera (free)",
        "score": 84,
        "reasoning": "The candidate's recommendation of 16GB VRAM for 1080p with ray tracing is factually excessive (inaccuracy in Correctness & Accuracy, weighted 25%), but other elements show solid completeness, clarity, and relevance."
    },
 

In [10]:
df

| Judge model (row) / Evaluated model (column) | TNG: DeepSeek R1T2 Chimera (free) | Meta: Llama 3.3 70B Instruct (free) | AllenAI: Molmo2 8B (free) |
| --- | --- | --- | --- |
| TNG: DeepSeek R1T2 Chimera (free) | | 85.0 | 84.0 |
| Meta: Llama 3.3 70B Instruct (free) | 96.0 | | 87.0 |
| AllenAI: Molmo2 8B (free) | 93.0 | 94.0 | |
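
As a rough illustration of the "best answer" and "confidence" steps applied to the scores above, the sketch below rebuilds the table as a flat DataFrame (the exact structure returned by `evaluator.score_table()` is not shown here, so this is an assumption) and summarizes each column. The threshold `MAX_SPREAD` and the use of max minus min as the disagreement measure are hypothetical choices, not something the evaluator is known to do.

```python
import numpy as np
import pandas as pd

models = [
    "TNG: DeepSeek R1T2 Chimera (free)",
    "Meta: Llama 3.3 70B Instruct (free)",
    "AllenAI: Molmo2 8B (free)",
]

# Rows are judges, columns are evaluated models; NaN where a model would judge itself.
scores = pd.DataFrame(
    [
        [np.nan, 85.0, 84.0],
        [96.0, np.nan, 87.0],
        [93.0, 94.0, np.nan],
    ],
    index=models,
    columns=models,
)

# Hypothetical threshold for how far apart two judges may be.
MAX_SPREAD = 5.0

mean_score = scores.mean(axis=0)                  # average rating per evaluated model (NaN skipped)
spread = scores.max(axis=0) - scores.min(axis=0)  # disagreement between the judges
low_confidence = spread > MAX_SPREAD

summary = pd.DataFrame(
    {"mean": mean_score, "spread": spread, "low_confidence": low_confidence}
)
best_model = mean_score.idxmax()
print(summary)
print("Best answer:", best_model)
```

With these numbers the DeepSeek answer has the highest average (94.5) and its two judges differ by only 3 points, while the Llama answer's ratings differ by 9 points and would be flagged under this threshold.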
