# Rationale-Based Self-Critique Loops (RL4F)

**Key idea:** After initial grading, generate a *critique* of the grade in natural language, then allow the model to revise its score in a second pass.

### Mechanics

1. **Grade Prompt**  
   > “Here is your score and rationale.”

2. **Critique Prompt**  
   > “Critique your previous rationale: were you too harsh or lenient?”

3. **Revision**  
   > “Based on your critique, update the score and rationale.”

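**Effect:** Mimics a human review loop, in which a grader rereads their own reasoning before committing to a score. Critique-and-revise approaches of this kind have been reported to improve output quality across generation tasks.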
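
The three steps above can be sketched as a small loop. This is a minimal illustration rather than the notebook's implementation: `call_model` is a hypothetical callable (prompt in, text out) standing in for whatever LLM client is used, and both the prompt wording around the quoted lines and the `Score: <n>` output convention are assumptions.

```python
import re
from typing import Callable, List, Optional, Tuple


def parse_score(text: str) -> Optional[float]:
    """Extract the number from a 'Score: <n>' line (assumed output convention)."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def critique_and_revise(
    call_model: Callable[[str], str],
    answer: str,
    rubric: str,
    iterations: int = 1,
) -> Tuple[Optional[float], List[str]]:
    """Grade once, then run `iterations` critique/revision rounds."""
    # Step 1: initial grade with rationale.
    grade = call_model(
        f"Rubric:\n{rubric}\n\nAnswer:\n{answer}\n\n"
        "Grade the answer. Give a rationale and end with 'Score: <0-100>'."
    )
    history = [grade]
    for _ in range(iterations):
        # Step 2: the model critiques its own rationale.
        critique = call_model(
            f"Here is your score and rationale:\n{grade}\n\n"
            "Critique your previous rationale: were you too harsh or lenient?"
        )
        # Step 3: the model revises the grade in light of the critique.
        grade = call_model(
            f"Previous grade:\n{grade}\n\nYour critique:\n{critique}\n\n"
            "Based on your critique, update the score and rationale. "
            "End with 'Score: <0-100>'."
        )
        history.append(grade)
    return parse_score(grade), history


def fake_model(prompt: str) -> str:
    """Scripted stand-in so the sketch runs without network access."""
    if prompt.startswith("Rubric:"):
        return "Rationale: mostly correct.\nScore: 90"
    if "Critique your previous rationale" in prompt:
        return "I was slightly lenient on completeness."
    return "Rationale: minor omissions.\nScore: 85"


final_score, history = critique_and_revise(
    fake_model, "8 GB is enough.", "Correctness (100 points)"
)
print(final_score)  # 85.0
```

Keeping every intermediate grade in `history` makes it possible to inspect how far the score moved between passes.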


In [None]:
from prompt_based_evaluator import PromptBasedEvaluator

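class RL4FEvaluator(PromptBasedEvaluator):
    """Evaluator intended to add a self-critique pass on top of the base grading."""

    def __init__(self, *model_names):
        super().__init__(*model_names)

    async def evaluate(self, user_query, rubric, iterations=2):
        # Initial grading pass only; the scores reported below come from this call.
        await self.n_sq_evaluate(user_query, rubric)
        # await self.n_evaluate(user_query, rubric)

        # TODO: implement iterations-based self-critique loop
        # (`iterations` is accepted but not used yet)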
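
One way the `iterations` argument might drive the TODO above, sketched independently of `PromptBasedEvaluator` (whose methods are not shown here). `ask` is a hypothetical coroutine `(model_name, prompt) -> str`, the `Score: <n>` convention is assumed, and the early stop when the score no longer changes is an added assumption rather than part of the notebook.

```python
import asyncio
import re
from typing import Awaitable, Callable, Dict, Optional

Ask = Callable[[str, str], Awaitable[str]]  # hypothetical (model, prompt) -> reply


def score_of(text: str) -> Optional[float]:
    """Read the number from a 'Score: <n>' line, or None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


async def self_critique(ask: Ask, judge: str, first_grade: str, iterations: int = 2) -> str:
    """Run up to `iterations` critique/revision rounds for one judge."""
    grade = first_grade
    for _ in range(iterations):
        critique = await ask(judge, f"Critique your previous rationale:\n{grade}")
        revised = await ask(
            judge,
            f"Critique:\n{critique}\n\nUpdate the score and rationale:\n{grade}",
        )
        # Stop early once a revision leaves the score unchanged.
        converged = score_of(revised) == score_of(grade)
        grade = revised
        if converged:
            break
    return grade


async def critique_all(ask: Ask, first_grades: Dict[str, str], iterations: int = 2) -> Dict[str, str]:
    """Run the loop for every judge concurrently."""
    judges = list(first_grades)
    results = await asyncio.gather(
        *(self_critique(ask, judge, first_grades[judge], iterations) for judge in judges)
    )
    return dict(zip(judges, results))


async def fake_ask(model: str, prompt: str) -> str:
    """Scripted stand-in for a real model call."""
    await asyncio.sleep(0)
    if prompt.startswith("Critique your previous rationale"):
        return "Slightly too lenient."
    return "Revised rationale.\nScore: 90"


final = asyncio.run(
    critique_all(fake_ask, {"judge-a": "Score: 95", "judge-b": "Score: 90"})
)
print({judge: score_of(text) for judge, text in final.items()})
```

Inside a notebook cell, where an event loop is already running, `await critique_all(...)` would be used in place of `asyncio.run(...)`.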

In [2]:
import json

user_query = "How many gigabytes of VRAM should I have for 1080p gaming?"

rubric = """Correctness & Accuracy (25 points) - Ensures claims are factually accurate and verifiable, addressing the most critical concern of hallucination-free responses. This is weighted highest because inaccurate information undermines all other qualities.

Completeness (20 points) - Verifies the answer addresses all aspects of the query without significant omissions. This prevents shallow or partial responses that technically answer only part of the question.

Clarity & Coherence (18 points) - Assesses whether the answer is well-organized with logical flow. Research shows that coherence and relevance are strong signals of problem-solving quality.

Relevance (18 points) - Ensures all information pertains to the question, avoiding tangential content that confuses the issue. This maintains focus and efficiency.

Conciseness (10 points) - Rewards efficiency by penalizing unnecessary verbosity or repetition while maintaining completeness. This balances against verbose but complete responses.

Appropriateness for Context (9 points) - Checks whether tone, depth, and format match what the questioner likely needs. Technical questions require different treatment than conversational ones."""

eval = RL4FEvaluator(
    "TNG: DeepSeek R1T2 Chimera (free)",
    "Meta: Llama 3.3 70B Instruct (free)",
    "AllenAI: Molmo2 8B (free)"
)

await eval.evaluate(user_query, rubric, iterations=2)
table = eval.score_table()

Got user query answers.
Got scoring results.
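
Because the rubric's weights are meant to form a 100-point scale, a quick check can catch arithmetic slips before any judge sees it. A minimal sketch, assuming each criterion declares its weight as `(<n> points)` at the start of its line; `rubric_weights` is a hypothetical helper, and the rubric is abbreviated here to criterion names and weights.

```python
import re
from typing import Dict

# Abbreviated copy of the rubric above: names and weights only.
rubric_outline = """Correctness & Accuracy (25 points)
Completeness (20 points)
Clarity & Coherence (18 points)
Relevance (18 points)
Conciseness (10 points)
Appropriateness for Context (9 points)"""


def rubric_weights(rubric: str) -> Dict[str, int]:
    """Map each criterion name to its declared point weight."""
    pattern = re.compile(r"^(.+?) \((\d+) points\)", re.MULTILINE)
    return {name.strip(): int(points) for name, points in pattern.findall(rubric)}


weights = rubric_weights(rubric_outline)
total = sum(weights.values())
print(total)  # 100
```

The same pattern also matches the full rubric text, since each criterion line begins with its name followed by the weight in parentheses.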


In [6]:
print("-------------------------------- USER QUERY ANSWERS --------------------------------\n\n")
print(json.dumps(eval.user_query_answers, indent=4))
print("\n\n-------------------------------- SCORING RESULTS --------------------------------\n\n")
print(json.dumps(eval.evaluation_query_answers, indent=4))

-------------------------------- USER QUERY ANSWERS --------------------------------


[
    {
        "model": "TNG: DeepSeek R1T2 Chimera (free)",
        "response": "\n\nFor **1080p gaming**, the ideal VRAM (Video RAM) depends on the level of detail, game optimization, and your settings. Here's a concise breakdown:\n\n### **Recommended VRAM for 1080p Gaming:**\n1. **Minimum (Low Settings/Esports):**  \n   - **4GB:**  \n     Works for older/less-demanding games (e.g., *CS2*, *Valorant*, *Fortnite* at competitive settings).  \n     *Avoid for newer AAA titles*, as textures may suffer or stutter.\n\n2. **Sweet Spot (Medium/High Settings):**  \n   - **6GB\u20138GB:**  \n     Handles **most modern AAA games** (e.g., *Cyberpunk 2077*, *Elden Ring*) smoothly at **High settings** without ray tracing.  \n     **Examples:** NVIDIA RTX 3060 (12GB), AMD RX 6600 (8GB).\n\n3. **Future-Proofing/Ultra + Ray Tracing:**  \n   - **8GB\u201312GB:**  \n     Required for **max settings + ray tracing** i

In [8]:
display(table)

| Judge Model (Row) / Evaluated Model (Column) | TNG: DeepSeek R1T2 Chimera (free) | Meta: Llama 3.3 70B Instruct (free) | AllenAI: Molmo2 8B (free) |
|---|---|---|---|
| TNG: DeepSeek R1T2 Chimera (free) | | 96.0 | 96.0 |
| Meta: Llama 3.3 70B Instruct (free) | | | |
| AllenAI: Molmo2 8B (free) | 100.0 | 95.0 | |
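
To compare models, the judge-by-model grid can be collapsed into one mean score per evaluated model, skipping empty cells (no judge appears to score itself here, and one judge's row is empty). A minimal sketch in plain Python using the values shown above; the short keys abbreviate the model names and `mean_received` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict

# judge -> {evaluated model -> score}; empty cells are simply omitted.
scores: Dict[str, Dict[str, float]] = {
    "DeepSeek R1T2 Chimera": {"Llama 3.3 70B": 96.0, "Molmo2 8B": 96.0},
    "Llama 3.3 70B": {},
    "Molmo2 8B": {"DeepSeek R1T2 Chimera": 100.0, "Llama 3.3 70B": 95.0},
}


def mean_received(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the scores each model received across the judges that scored it."""
    received: Dict[str, list] = {}
    for row in scores.values():
        for evaluated, score in row.items():
            received.setdefault(evaluated, []).append(score)
    return {model: mean(values) for model, values in received.items()}


averages = mean_received(scores)
for model, value in averages.items():
    print(f"{model}: {value}")
```

With this data the averages are 100.0 for DeepSeek R1T2 Chimera, 95.5 for Llama 3.3 70B, and 96.0 for Molmo2 8B. Each rests on only one or two judgments, so the ranking should be read with caution.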
