# Rationale-Based Self-Critique Loops (RL4F)

**Key idea:** After initial grading, generate a *critique* of the grade in natural language, then allow the model to revise its score in a second pass.

### Mechanics

1. **Grade Prompt**  
   > “Here is your score and rationale.”

2. **Critique Prompt**  
   > “Critique your previous rationale: were you too harsh or lenient?”

3. **Revision**  
   > “Based on your critique, update the score and rationale.”

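**Effect:** Mimics a human review loop in which a grader re-reads and corrects their own first judgment. Critique-and-revise schemes of this kind have been reported to improve output quality on generation tasks.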
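
The three steps above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's implementation: `ask` is a hypothetical callable standing in for a judge-model API call, `fake_judge` is a stub with canned replies, and the prompts are shortened. Unlike the class below, which merges critique and revision into one prompt, the sketch keeps them as separate calls to mirror the list.

```python
import json
from typing import Callable


def critique_loop(ask: Callable[[str], str], answer: str, rounds: int = 1) -> dict:
    """Grade an answer, then critique and revise the grade `rounds` times.

    `ask` is a hypothetical callable: it sends a prompt to a judge model and
    returns the reply. Grade and revision replies are assumed to be JSON
    objects with "score" and "reasoning" keys.
    """
    # Step 1: initial grade.
    grade = json.loads(ask(f"Grade this answer from 0 to 100 with a rationale:\n{answer}"))
    history = [grade]
    for _ in range(rounds):
        # Step 2: critique the previous rationale in natural language.
        critique = ask(
            "Critique your previous rationale: were you too harsh or lenient?\n"
            f"Previous score: {grade['score']}\n"
            f"Previous rationale: {grade['reasoning']}"
        )
        # Step 3: revise the score and rationale in light of the critique.
        grade = json.loads(ask(
            "Based on your critique, update the score and rationale.\n"
            f"Critique: {critique}"
        ))
        history.append(grade)
    return {"final": grade, "history": history}


def fake_judge(prompt: str) -> str:
    """Stub judge with canned replies, used only to make the sketch runnable."""
    if prompt.startswith("Grade"):
        return json.dumps({"score": 100, "reasoning": "Flawless."})
    if prompt.startswith("Critique"):
        return "I was slightly too lenient on Correctness."
    return json.dumps({"score": 98, "reasoning": "Minor Correctness deduction."})


result = critique_loop(fake_judge, "8 GB is the sweet spot for 1080p.")
print([g["score"] for g in result["history"]])  # [100, 98]
```

Keeping the full `history` rather than only the final grade makes it possible to inspect how much each round moved the score.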


In [1]:
from prompt_based_evaluator import PromptBasedEvaluator
import textwrap

class RL4FEvaluator(PromptBasedEvaluator):
    def __init__(self, *model_names):
        super().__init__(*model_names)
        # critique_history: list of "rounds". Each round is a list of dicts, one per
        # (evaluating_model, evaluated_model), with before_*, raw_response, after_*.
        self.critique_history = []

    def format_critique_entry(self, item):
        """Format one critique-history entry for readable before/after display."""
        lines = [
            f"Judge: {item['evaluating_model']}  →  Candidate: {item['evaluated_model']}",
            "  Before:",
            f"    Score: {item['before_score']}",
            f"    Reasoning: {item['before_reasoning']}",
            "  Raw response (critique + revision):",
            "    " + item["raw_response"].replace("\n", "\n    "),
            "  After:",
            f"    Score: {item.get('after_score')}",
            f"    Reasoning: {item.get('after_reasoning')}",
        ]
        return "\n".join(lines)

    def _self_critique_and_revision_prompt(self, user_query, rubric, response, reasoning, score):
        """
        Combined prompt: critique your previous evaluation, then output revised
        reasoning and score as JSON. Saves one API call per pair per iteration.
        """
        return textwrap.dedent(f"""\
        You previously evaluated a candidate's response and gave a score with a rationale. Now critique your evaluation and then provide a revised score and rationale.

        QUERY:
        {user_query}

        CANDIDATE RESPONSE (that you evaluated):
        {response}

        RUBRIC:
        {rubric}

        YOUR PREVIOUS EVALUATION:
        - Reasoning: {reasoning}
        - Score: {score} (out of 100)

        Instructions:

        1. Critique: Briefly critique your previous rationale and score. Consider whether you were too harsh or lenient, missed rubric criteria, or misapplied weightings. Be specific (e.g., "I may have been too strict on Completeness").

        2. Revision: After your critique, output your revised evaluation as a single JSON object. You must end your response with exactly one line that is only this JSON object (no other text on that line):
        {{"reasoning": "<one-sentence revised justification referencing rubric>", "score": <integer 0-100>}}

        Apply the rubric strictly. Correctness & Accuracy has the highest impact on the overall score.
        """)

    async def _critique_rationale(self, user_query, rubric):
        """
        For each (evaluating_model, evaluated_model) entry, ask the evaluating
        model to critique and revise its rationale/score; update entries in place.
        Uses batched query_models for one API call per pair.
        """
        if not self.evaluation_query_answers or not self.user_query_answers:
            return
        response_by_model = {item["model"]: item["response"] for item in self.user_query_answers}
        model_names = []
        queries = []
        for entry in self.evaluation_query_answers:
            response = response_by_model.get(entry["evaluated_model"], "")
            reasoning = entry.get("reasoning", "")
            score = entry.get("score", 0)
            prompt = self._self_critique_and_revision_prompt(user_query, rubric, response, reasoning, score)
            model_names.append(entry["evaluating_model"])
            queries.append(prompt)
        results = await self.query_models(model_names, queries)
        round_data = []
        for i, entry in enumerate(self.evaluation_query_answers):
            before_reasoning = entry.get("reasoning", "")
            before_score = entry.get("score", 0)
            raw = results[i]["response"]
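            revised = self.extract_outermost_json(raw)
            after_reasoning = revised.get("reasoning") if revised else None
            # A malformed score (e.g. "98/100" or null) leaves the previous evaluation untouched.
            try:
                after_score = int(revised["score"]) if revised and "score" in revised else None
            except (TypeError, ValueError):
                after_score = None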
            round_data.append({
                "evaluating_model": entry["evaluating_model"],
                "evaluated_model": entry["evaluated_model"],
                "before_reasoning": before_reasoning,
                "before_score": before_score,
                "raw_response": raw,
                "after_reasoning": after_reasoning,
                "after_score": after_score,
            })
        self.critique_history.append(round_data)
        for i, entry in enumerate(self.evaluation_query_answers):
            if round_data[i]["after_reasoning"] is not None and round_data[i]["after_score"] is not None:
                entry["score"] = round_data[i]["after_score"]
                entry["reasoning"] = round_data[i]["after_reasoning"]

    async def evaluate(self, user_query, rubric, iterations=2):
        self.critique_history = []
        # Step 1: Initial evaluation (populates user_query_answers and evaluation_query_answers)
        await self.n_sq_evaluate(user_query, rubric)
        # Step 2: Refine via critique-and-revision (iterations - 1 rounds)
        for _ in range(iterations - 1):
            await self._critique_rationale(user_query, rubric)
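
The prompt asks the judge to end its reply with a single JSON line, and `_critique_rationale` relies on the inherited `extract_outermost_json` to recover it. That helper is not shown in this notebook, so the sketch below is only an assumption about what a strict parser for this reply format could look like; `parse_revision` is a hypothetical name, not part of `PromptBasedEvaluator`.

```python
import json
from typing import Optional


def parse_revision(raw: str) -> Optional[dict]:
    """Return {"reasoning": str, "score": int} from the last valid JSON line, or None."""
    # Scan from the end, because the prompt asks for the JSON on the final line.
    for line in reversed(raw.strip().splitlines()):
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(obj, dict):
            continue
        reasoning = obj.get("reasoning")
        try:
            score = int(obj.get("score"))
        except (TypeError, ValueError):
            continue
        # Reject scores outside the 0-100 range the prompt specifies.
        if isinstance(reasoning, str) and 0 <= score <= 100:
            return {"reasoning": reasoning, "score": score}
    return None


raw = (
    "I may have been too lenient on Correctness.\n"
    '{"reasoning": "Minor deduction in Correctness.", "score": 98}'
)
print(parse_revision(raw))             # {'reasoning': 'Minor deduction in Correctness.', 'score': 98}
print(parse_revision("no JSON here"))  # None
```

Returning `None` on any failure matches how the class treats a missing revision: the previous score and reasoning stay in place.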

In [2]:
import json

user_query = "How many gigabytes of VRAM should I have for 1080p gaming?"

rubric = """Correctness & Accuracy (25 points) - Ensures claims are factually accurate and verifiable, addressing the most critical concern of hallucination-free responses. This is weighted highest because inaccurate information undermines all other qualities.

Completeness (20 points) - Verifies the answer addresses all aspects of the query without significant omissions. This prevents shallow or partial responses that technically answer only part of the question.

Clarity & Coherence (18 points) - Assesses whether the answer is well-organized with logical flow. Research shows that coherence and relevance are strong signals of problem-solving quality.

Relevance (18 points) - Ensures all information pertains to the question, avoiding tangential content that confuses the issue. This maintains focus and efficiency.

Conciseness (10 points) - Rewards efficiency by penalizing unnecessary verbosity or repetition while maintaining completeness. This balances against verbose but complete responses.

Appropriateness for Context (9 points) - Checks whether tone, depth, and format match what the questioner likely needs. Technical questions require different treatment than conversational ones."""

eval = RL4FEvaluator(
    "TNG: DeepSeek R1T2 Chimera (free)",
    "StepFun: Step 3.5 Flash (free)",
    "AllenAI: Molmo2 8B (free)"
)

await eval.evaluate(user_query, rubric, iterations=2)
table = eval.score_table()

Got user query answers.
Got scoring results.


In [3]:
print("-------------------------------- USER QUERY ANSWERS --------------------------------\n\n")
print(json.dumps(eval.user_query_answers, indent=4))
print("\n\n-------------------------------- SCORING RESULTS --------------------------------\n\n")
print(json.dumps(eval.evaluation_query_answers, indent=4))

-------------------------------- USER QUERY ANSWERS --------------------------------


[
    {
        "model": "TNG: DeepSeek R1T2 Chimera (free)",
        "response": "\n\nFor **1080p gaming** (1920x1080 resolution), the ideal amount of VRAM depends on the games you play, your desired graphics settings, and future-proofing. Here's a breakdown:\n\n### **1. Minimum: 4GB VRAM**  \n- Can run **older or less demanding games** at medium settings (e.g., _CS:GO_, _Rocket League_, _League of Legends_).  \n- **Not recommended** for modern AAA titles (e.g., _Cyberpunk 2077_, _Hogwarts Legacy_), as these often require 6GB+ even at 1080p.\n\n### **2. Recommended: 6GB VRAM**  \n- Handles **most modern AAA games** smoothly at **medium-to-high settings**.  \n- Still might struggle with **ultra settings or future games**, especially those using high-resolution textures.  \n- Good budget option for casual gamers.\n\n### **3. Ideal: 8GB VRAM**  \n- **Sweet spot for 1080p gaming** today.  \n- Allows **u

In [4]:
display(table)

| Judge Model (Row) / Evaluated Model (Column) | TNG: DeepSeek R1T2 Chimera (free) | StepFun: Step 3.5 Flash (free) | AllenAI: Molmo2 8B (free) |
| --- | --- | --- | --- |
| TNG: DeepSeek R1T2 Chimera (free) | | 98.0 | 60.0 |
| StepFun: Step 3.5 Flash (free) | 95.0 | | 73.0 |
| AllenAI: Molmo2 8B (free) | 91.0 | 96.0 | |
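
One way to read this matrix is to average the scores each candidate received from the other two judges. The sketch below is an assumption about how the table might be summarized, not something `score_table()` is known to do: `mean_score_per_candidate` is a hypothetical helper, and the scores are copied by hand from the table, with the empty self-evaluation cells left out.

```python
from statistics import mean

# scores[judge][candidate], copied from the table; self-evaluations are absent.
scores = {
    "TNG: DeepSeek R1T2 Chimera (free)": {
        "StepFun: Step 3.5 Flash (free)": 98.0,
        "AllenAI: Molmo2 8B (free)": 60.0,
    },
    "StepFun: Step 3.5 Flash (free)": {
        "TNG: DeepSeek R1T2 Chimera (free)": 95.0,
        "AllenAI: Molmo2 8B (free)": 73.0,
    },
    "AllenAI: Molmo2 8B (free)": {
        "TNG: DeepSeek R1T2 Chimera (free)": 91.0,
        "StepFun: Step 3.5 Flash (free)": 96.0,
    },
}


def mean_score_per_candidate(scores: dict) -> dict:
    """Average the scores each candidate received, skipping self-evaluations."""
    received = {}
    for judge, row in scores.items():
        for candidate, score in row.items():
            if candidate != judge:
                received.setdefault(candidate, []).append(score)
    return {candidate: mean(values) for candidate, values in received.items()}


averages = mean_score_per_candidate(scores)
for candidate, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{candidate}: {avg:.1f}")
# StepFun: Step 3.5 Flash (free): 97.0
# TNG: DeepSeek R1T2 Chimera (free): 93.0
# AllenAI: Molmo2 8B (free): 66.5
```

With only two judges per candidate, a single disagreement moves the mean a lot: the two scores for Molmo2 8B are 13 points apart.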


In [6]:
# Print the before/after critique for every refinement round and every judge/candidate pair
for round_idx, round_data in enumerate(eval.critique_history):
    print(f"=== Refinement round {round_idx + 1} ===")
    for item in round_data:
        print(eval.format_critique_entry(item))
        print("---")

=== Refinement round 1 ===
Judge: TNG: DeepSeek R1T2 Chimera (free)  →  Candidate: StepFun: Step 3.5 Flash (free)
  Before:
    Score: 100
    Reasoning: The response provides accurate, up-to-date VRAM recommendations for 1080p gaming (25/25 Correctness), thoroughly addresses all query aspects including edge cases like modding (20/20 Completeness), uses clear headers/logical flow (18/18 Clarity), maintains strict focus on VRAM needs (18/18 Relevance), delivers comprehensive information without fluff (10/10 Conciseness), and adopts a perfect technical-but-approachable tone (9/9 Appropriateness).
  Raw response (critique + revision):
    {"reasoning": "Minor deduction in Correctness (23/25) due to overstatement about 8GB universally handling Ultra settings in 2023-2024 AAA games at 1080p, slightly affecting factual precision; other rubric categories remain flawless.", "score": 98}
  After:
    Score: 98
    Reasoning: Minor deduction in Correctness (23/25) due to overstatement about 8GB 