# Rationale-Based Self-Critique Loops (RL4F)

**Key idea:** After initial grading, generate a *critique* of the grade in natural language, then allow the model to revise its score in a second pass.

### Mechanics

1. **Grade Prompt**  
   > “Here is your score and rationale.”

2. **Critique Prompt**  
   > “Critique your previous rationale: were you too harsh or lenient?”

3. **Revision**  
   > “Based on your critique, update the score and rationale.”

**Effect:** Mimics human feedback loops; critique-and-revise passes have been reported to improve output quality on generation tasks, although the gain depends on the model and task.
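
The three steps above can be sketched as a small control loop. This is a minimal illustration, not the notebook's implementation: `self_critique_loop` and the `toy_*` functions are hypothetical names, and the toy functions stand in for what would be LLM calls in a real setup.

```python
from typing import Callable, List, Tuple

Grade = Tuple[float, str]  # (score, rationale)

def self_critique_loop(
    grade: Callable[[], Grade],
    critique: Callable[[Grade], str],
    revise: Callable[[Grade, str], Grade],
    iterations: int = 1,
) -> List[Grade]:
    """Return the grade history: the initial grade plus one revision per iteration."""
    history = [grade()]                                 # Step 1: initial grade
    for _ in range(iterations):
        feedback = critique(history[-1])                # Step 2: critique the latest grade
        history.append(revise(history[-1], feedback))   # Step 3: revise
    return history

# Toy stand-ins for model calls (assumptions for illustration only).
def toy_grade() -> Grade:
    return (60.0, "Placeholder rationale.")

def toy_critique(g: Grade) -> str:
    return "too harsh" if g[0] < 70 else "about right"

def toy_revise(g: Grade, feedback: str) -> Grade:
    score, rationale = g
    if feedback == "too harsh":
        return (score + 10.0, rationale + " Revised upward after critique.")
    return g

history = self_critique_loop(toy_grade, toy_critique, toy_revise, iterations=2)
print([score for score, _ in history])  # [60.0, 70.0, 70.0]
```

Keeping the full history, rather than only the final grade, makes it possible to see whether scores drift or stabilize across iterations.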


In [None]:
from prompt_based_evaluator import PromptBasedEvaluator

class RL4FEvaluator(PromptBasedEvaluator):
    """Grade, critique the grade, then re-grade (rationale-based self-critique)."""

    def __init__(self, *model_names):
        super().__init__(*model_names)

    def _self_critique_prompt(self, user_query, rubric, response, reasoning, score):
        # TODO: Implement the prompt template for self-critique
        pass

    async def _critique_rationale(self):
        # TODO: Ask the judge (evaluating) model to critique its own "reasoning"
        #       and "score" for the evaluated model's "response" to the user query
        pass

    async def evaluate(self, user_query, rubric, iterations=2):
        # Step 1: Initial evaluation
        await self.n_sq_evaluate(user_query, rubric)
        # await self.n_evaluate(user_query, rubric)

        for _ in range(iterations):
            # Step 2: Critique the rationale (a no-op until the TODOs above are implemented)
            await self._critique_rationale()
            # Step 3: Re-evaluation
            await self.n_sq_evaluate(user_query, rubric)
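
One possible shape for the `_self_critique_prompt` TODO is sketched below as a standalone function. This is an assumption, not the notebook author's template: the function name `build_self_critique_prompt`, the wording, and the JSON reply format are all hypothetical, and the parameters simply mirror the method signature above.

```python
def build_self_critique_prompt(user_query, rubric, response, reasoning, score):
    """Assemble a critique-and-revise prompt for the judge model (hypothetical template)."""
    return (
        "You previously graded a response to a user query against a rubric.\n\n"
        f"User query:\n{user_query}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response being graded:\n{response}\n\n"
        f"Your previous rationale:\n{reasoning}\n\n"
        f"Your previous score: {score}\n\n"
        "Critique your previous rationale: were you too harsh or too lenient? "
        "Based on your critique, update the score and rationale.\n"
        # Assumed reply format; a plain (non-f) string so the braces stay literal.
        'Reply as JSON: {"critique": "...", "score": <number>, "reasoning": "..."}'
    )

# Example call with placeholder inputs.
prompt = build_self_critique_prompt(
    user_query="How many gigabytes of VRAM should I have for 1080p gaming?",
    rubric="(rubric text)",
    response="(evaluated model's answer)",
    reasoning="(judge's earlier rationale)",
    score=94.0,
)
print(prompt)
```

Asking for a structured reply makes the revised score easy to parse, but the judge models may not always comply, so a real implementation would need to handle malformed replies.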

In [2]:
import json

user_query = "How many gigabytes of VRAM should I have for 1080p gaming?"

rubric = """Correctness & Accuracy (25 points) — Ensures claims are factually accurate and verifiable, addressing the most critical concern of hallucination-free responses. This is weighted highest because inaccurate information undermines all other qualities.

Completeness (20 points) - Verifies the answer addresses all aspects of the query without significant omissions. This prevents shallow or partial responses that technically answer only part of the question.

Clarity & Coherence (18 points) - Assesses whether the answer is well-organized with logical flow. Research shows that coherence and relevance are strong signals of problem-solving quality.

Relevance (18 points) - Ensures all information pertains to the question, avoiding tangential content that confuses the issue. This maintains focus and efficiency.

Conciseness (10 points) - Rewards efficiency by penalizing unnecessary verbosity or repetition while maintaining completeness. This balances against verbose but complete responses.

Appropriateness for Context (9 points) — Checks whether tone, depth, and format match what the questioner likely needs. Technical questions require different treatment than conversational ones."""

eval = RL4FEvaluator(
    "TNG: DeepSeek R1T2 Chimera (free)",
    "StepFun: Step 3.5 Flash (free)",
    "AllenAI: Molmo2 8B (free)"
)

await eval.evaluate(user_query, rubric, iterations=2)
table = eval.score_table()

Got user query answers.
Got scoring results.


In [3]:
print("-------------------------------- USER QUERY ANSWERS --------------------------------\n\n")
print(json.dumps(eval.user_query_answers, indent=4))
print("\n\n-------------------------------- SCORING RESULTS --------------------------------\n\n")
print(json.dumps(eval.evaluation_query_answers, indent=4))

-------------------------------- USER QUERY ANSWERS --------------------------------


[
    {
        "model": "TNG: DeepSeek R1T2 Chimera (free)",
        "response": "\n\nFor **1080p gaming**, the ideal amount of VRAM depends on your gaming preferences, settings, and future-proofing goals. Here's a concise breakdown:\n\n### **Minimum Recommendation**\n- **6GB VRAM**  \n  - *Suitable for*: Older or less demanding games, esports titles (e.g., CS:GO, Valorant, Fortnite), or medium settings in modern AAA games.  \n  - *Risk*: May struggle with newer AAA titles at **Ultra settings** or with mods/texture packs.\n\n---\n\n### **Sweet Spot for Most Gamers**\n- **8GB VRAM**  \n  - *Why*: Modern AAA games (e.g., *Cyberpunk 2077*, *Call of Duty*, *Elden Ring*) often use **6\u20138GB+** at Ultra settings.  \n  - *Benefits*: Future-proofing for upcoming games, support for high-res textures, and headroom for multitasking (e.g., streaming).  \n  - *GPU Examples*: RTX 3050/3060, RX 6600/7600, Arc A

In [4]:
display(table)

| Judge Model (Row) \ Evaluated Model (Column) | TNG: DeepSeek R1T2 Chimera (free) | StepFun: Step 3.5 Flash (free) | AllenAI: Molmo2 8B (free) |
|---|---|---|---|
| TNG: DeepSeek R1T2 Chimera (free) | | 99.0 | 34.0 |
| StepFun: Step 3.5 Flash (free) | 94.0 | | 28.0 |
| AllenAI: Molmo2 8B (free) | 91.0 | 95.0 | |
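
To read the score table above at a glance, the scores each model received can be averaged across its judges. The sketch below copies the values from the table into a plain dictionary; the helper name `mean_score_per_model` is hypothetical and is not part of `PromptBasedEvaluator`.

```python
from statistics import mean

# scores[judge][evaluated]; models do not judge themselves, so the diagonal is absent.
scores = {
    "TNG: DeepSeek R1T2 Chimera (free)": {
        "StepFun: Step 3.5 Flash (free)": 99.0,
        "AllenAI: Molmo2 8B (free)": 34.0,
    },
    "StepFun: Step 3.5 Flash (free)": {
        "TNG: DeepSeek R1T2 Chimera (free)": 94.0,
        "AllenAI: Molmo2 8B (free)": 28.0,
    },
    "AllenAI: Molmo2 8B (free)": {
        "TNG: DeepSeek R1T2 Chimera (free)": 91.0,
        "StepFun: Step 3.5 Flash (free)": 95.0,
    },
}

def mean_score_per_model(scores):
    """Average the scores each evaluated model received from its judges."""
    received = {}
    for row in scores.values():
        for evaluated, score in row.items():
            received.setdefault(evaluated, []).append(score)
    return {model: mean(values) for model, values in received.items()}

averages = mean_score_per_model(scores)
for model, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {avg:.1f}")
# StepFun: Step 3.5 Flash (free): 97.0
# TNG: DeepSeek R1T2 Chimera (free): 92.5
# AllenAI: Molmo2 8B (free): 31.0
```

With only two judges per model these averages are coarse, but they show that the two judges for each model agree closely on its relative standing.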
