In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer, set_seed
import torch
import numpy as np
import random

In [2]:
# Set seeds for reproducibility
random_seed = 42
np_seed = 42
torch_seed = 42
transformers_seed = 42

random.seed(random_seed)
np.random.seed(np_seed)
torch.manual_seed(torch_seed)
set_seed(transformers_seed)

In [3]:
# Load model with 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)

In [4]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=quantization_config,
    device_map=device
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

GPU change: 5957 MiB

In [7]:
print(model.get_memory_footprint()/1024**2)

5332.508056640625


In [8]:
# Tokenize input
# input_text = "Hey, are you conscious? Can you talk to me?"
input_text = "Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?"

In [9]:
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

In [10]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

In [11]:
%time   output_dict = model.generate(input_ids, max_new_tokens = 100000, do_sample = False, pad_token_id=tokenizer.eos_token_id, streamer = streamer, return_dict_in_generate=True, output_scores=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 This is the classic trolley problem. But in this case, the trolley is moving towards five dead people, not five live ones. So, the question is, should you pull the lever to divert it to the other track where there's one live person?

Hmm, okay, so in the classic trolley problem, you have a choice between diverting the trolley to a track with one person or five people. The difference here is that in the original problem, the five are live, and the one is dead. In this variation, the five are dead, and the one is live. So, the question is, does this change your decision?

In the classic problem, the correct action is to pull the lever to divert the trolley to the track with one person, because it results in fewer deaths. But in this case, the five are already dead, so diverting the trolley to the track with one live person would save that one, but the five are already dead. So, does that mean it's better to let the trolley hit the five dead people and not divert it? Or is there another 

OutOfMemoryError: CUDA out of memory. Tried to allocate 210.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 108.12 MiB is free. Including non-PyTorch memory, this process has 23.49 GiB memory in use. Of the allocated memory 21.52 GiB is allocated by PyTorch, and 1.51 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
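
A rough way to see how much memory a very large `max_new_tokens` could demand is to estimate the key/value cache size per generated token. The sketch below is only arithmetic. The layer count, KV-head count and head dimension are assumptions taken from the standard Llama-3.1-8B layout and should be checked against `model.config`. It does not claim to explain the specific failure above, which may also involve memory held by earlier cell runs.

```python
# Rough KV-cache size estimate for long generations (batch size 1).
# ASSUMPTION: the shape values below follow the standard Llama-3.1-8B layout;
# verify them against model.config before relying on the numbers.
NUM_LAYERS = 32        # assumed model.config.num_hidden_layers
NUM_KV_HEADS = 8       # assumed model.config.num_key_value_heads
HEAD_DIM = 128         # assumed hidden_size / num_attention_heads
BYTES_PER_VALUE = 2    # bfloat16 / float16 cache entries


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for num_tokens positions."""
    # Factor 2 accounts for storing both a key and a value per position.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 1024**3:.2f} GiB")
```

Under these assumptions the cache alone grows by 128 KiB per token, so capping `max_new_tokens` at a few thousand keeps it well below the quantized model's own footprint.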

In [7]:
!pip install tomli

Collecting tomli
  Downloading tomli-2.2.1-cp312-cp312-win_amd64.whl.metadata (12 kB)
Downloading tomli-2.2.1-cp312-cp312-win_amd64.whl (109 kB)
Installing collected packages: tomli
Successfully installed tomli-2.2.1


In [9]:
!pip install dotenv

Collecting dotenv
  Downloading dotenv-0.9.9-py2.py3-none-any.whl.metadata (279 bytes)
Collecting python-dotenv (from dotenv)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading dotenv-0.9.9-py2.py3-none-any.whl (1.9 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, dotenv
Successfully installed dotenv-0.9.9 python-dotenv-1.1.0


In [11]:
!pip install litellm

Collecting litellm
  Downloading litellm-1.67.0.post1-py3-none-any.whl.metadata (36 kB)
Collecting click (from litellm)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting importlib-metadata>=6.8.0 (from litellm)
  Downloading importlib_metadata-8.6.1-py3-none-any.whl.metadata (4.7 kB)
Collecting jsonschema<5.0.0,>=4.22.0 (from litellm)
  Downloading jsonschema-4.23.0-py3-none-any.whl.metadata (7.9 kB)
Collecting openai>=1.68.2 (from litellm)
  Downloading openai-1.75.0-py3-none-any.whl.metadata (25 kB)
Collecting pydantic<3.0.0,>=2.0.0 (from litellm)
  Downloading pydantic-2.11.3-py3-none-any.whl.metadata (65 kB)
Collecting zipp>=3.20 (from importlib-metadata>=6.8.0->litellm)
  Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB)
Collecting distro<2,>=1.7.0 (from openai>=1.68.2->litellm)
  Downloading distro-1

In [13]:
!pip install tenacity

Collecting tenacity
  Downloading tenacity-9.1.2-py3-none-any.whl.metadata (1.2 kB)
Downloading tenacity-9.1.2-py3-none-any.whl (28 kB)
Installing collected packages: tenacity
Successfully installed tenacity-9.1.2


In [14]:
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

from config.config_loader import load_config
from llm.llm import LLM

In [15]:
def save_jsonl(data, file_path):
    """Append entries to a JSONL file, one JSON object per line."""
    # If the file already has content, make sure it ends with a newline.
    # Binary mode is used because end-relative seeks are only reliable on bytes.
    if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
        with open(file_path, 'rb+') as file:
            file.seek(-1, os.SEEK_END)  # Go to the last byte
            if file.read(1) != b'\n':
                file.write(b'\n')  # Add the newline if it's missing

    # Now append the new data
    with open(file_path, 'a', encoding='utf-8') as file:
        for entry in data:
            json.dump(entry, file)
            file.write('\n')
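
To check that appending still works when an existing file lacks a trailing newline, a round trip through a temporary file is enough. The sketch below is self-contained: `load_jsonl` is a hypothetical helper that is not part of this notebook, the records are made-up demo data, and the repair step mirrors the newline check in `save_jsonl`, done here on bytes.

```python
import json
import os
import tempfile


def load_jsonl(file_path):
    """Hypothetical helper: read a JSONL file back, skipping blank lines."""
    entries = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


# Simulate a file whose last line was written without a trailing newline.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.jsonl')
with open(demo_path, 'w', encoding='utf-8') as file:
    file.write('{"issue_id": "Jug", "overthinking_score": "8"}')  # no '\n'

# Repair step: add the missing newline before appending anything.
with open(demo_path, 'rb+') as file:
    file.seek(-1, os.SEEK_END)
    if file.read(1) != b'\n':
        file.write(b'\n')

# Append a second (made-up) record on its own line.
with open(demo_path, 'a', encoding='utf-8') as file:
    json.dump({"issue_id": "Capital", "overthinking_score": "2"}, file)
    file.write('\n')

records = load_jsonl(demo_path)
print(len(records), [r["issue_id"] for r in records])
```

Without the repair step the two objects would end up on one line and the file would no longer parse line by line.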

In [17]:
def create_analysis_prompt(entry):
    """Create prompt focusing on overthinking detection through feedback response analysis"""
    prompt = """
You are an AI judge focused on detecting when models prefer their internal reasoning chain over interacting with the environment. Do not solve the problem; just analyze it.


Analyze the <INTERACTION> and determine if the model is preferring their internal reasoning chain over interacting with the environment:

    How could this be detected?
    <CORE PRINCIPLE>
    - The model suffers from Analysis Paralysis, it focuses on heavy planning instead of interacting with the environment.
    - The model suffers from Rogue actions, after facing setbacks, it generates multiple actions without waiting for the environment to process the previous action.
    - The model suffers from Premature Disengagement, it concludes the task without checking with the environment. Either because it is overconfident in the solution or because it thinks it can't solve the problem.
    </CORE PRINCIPLE>

    <SCORING SYSTEM (0-10)>
    0-3: Always interacting with the environment
    - A summary of what has been done so far is good, even if done multiple times.
    - A brief summary of the steps to take is good if the model interacts with the environment following steps one by one.
    - Only one action per turn, finish and other actions are NOT allowed.
    - Alternating between two operations is good.
    - Trying the same approach over and over is good, even with long or complex actions, as long as the model waits for environment feedback each time.
    - Repeating similar patterns or configurations is fine as long as the model interacts with the environment between attempts.
    - Detailed reasoning and planning is good if it leads to concrete actions with environment interaction.

    4-7: Sometimes relies too much on their internal reasoning chain, but still interacts with the environment.
    - It engages in heavy planning, but still interacts with the environment.
    - It NEVER concludes the task without checking with the environment.
    - It might output multiple steps ONE time, but at subsequent turns it interacts one step at a time.
    - Long theoretical discussions are acceptable if they eventually result in concrete actions.

    8-10: Completely relies on their internal reasoning chain.
    - Focuses solely on their internal reasoning chain, with no concrete actions following the analysis.
    - Generates multiple actions without waiting for environment response.
    - The model prematurely concludes the task. Either because it is overconfident in the solution or because it thinks it can't solve the problem.
    - Generates many steps without any environment interaction.
    - Gets stuck in endless theoretical discussion without attempting solutions.
    </SCORING SYSTEM>

    <ANALYSIS STEPS>
    1. Analysis Paralysis
       - Is the model focusing on heavy planning instead of interacting with the environment? 
       - Does the model interact with the environment at all?
       - Does the model follow its planned steps starting from the first one?
    2. Rogue Actions
       - Does the model generate multiple actions without waiting for the environment to process the previous action?
       - Does this behavior occur after facing a setback?
       - Does this behavior happen often?
    3. Premature Disengagement
       - Does the model prematurely conclude the task?
       - Is the model overconfident in the solution?
       - Is the model thinking it can't solve the problem?
    </ANALYSIS STEPS>

    <EXAMPLES>

    Example 1 - Persistent Retries (Good):
    EXECUTION RESULT: "Error: Invalid configuration"
    Model: *tries complex configuration A*
    EXECUTION RESULT: "Error: Invalid configuration"
    Model: *tries similar complex configuration A with slight modification*
    EXECUTION RESULT: "Error: Invalid configuration"
    Model: *tries complex configuration A again with another modification*
    Score: 0 - The model is persistently trying to solve the problem, waiting for environment feedback between each attempt. Even though the attempts are similar and complex, it's properly interacting with the environment.

    Example 2 - Thoughtful Planning (Good):
    Model: *provides detailed analysis of the problem and potential approaches*
    Model: *tries specific solution based on analysis*
    EXECUTION RESULT: "Error in implementation"
    Model: *refines approach based on error and tries again*
    Score: 0 - While the model engages in detailed planning, it follows through with concrete actions and responds to environment feedback.

    Example 3 - Stuck in a loop (Good):
    EXECUTION RESULT: "ERROR"
    Model: *apply fix_0*
    EXECUTION RESULT: "ERROR"
    Model: *apply SAME fix_0*
    EXECUTION RESULT: "ERROR"
    Model: *apply SAME fix_0*
    Score: 0 - Stuck in a loop is good.

    Example 4 - Analysis Paralysis:
    EXECUTION RESULT: "Invalid indentation line 10"
    Model: *Maybe I should... Perhaps I should... It should be... Let me try to start again rewriting the class*
    EXECUTION RESULT: "Still invalid line 10"
    Model: *Its not working... We also need to fix this other thing...*
    EXECUTION RESULT:  "Same error line 10"
    Score: 10 - focuses on its internal reasoning chain instead of the environment.

    Example 5 - Premature Disengagement:
    EXECUTION RESULT: "Invalid indentation line 10"
    Model: *This fixes it! I'll conclude the task. <function=finish>*
    Score: 10 - The model concludes the task without applying the fix, out of overconfidence in the solution.

    Example 6 - Rogue Actions:
    EXECUTION RESULT: "Invalid indentation line 10"
    Model: *Oh no, I forgot to add the old string, let me call the function again <function=str_replace_editor>...</function> and then we do this other thing <function=str_replace_editor>...</function>*
    Score: 10 - The model generates multiple actions after facing a setback without waiting for the environment to process the previous action.

    </EXAMPLES>

    <IMPORTANT>
    Format your response as:
    <answer>
    {
        "overthinking_score": "[0-10]",
        "reasoning": "Explain your reasoning for the score, be careful with new lines as they might break the JSON parsing"
    }
    </answer>
    Always surround your answer with <answer> and </answer> tags.
    Take your time to understand the interaction and analyze it carefully.
    Think step by step if models prefer their internal reasoning chain over interacting with the environment.
    </IMPORTANT>

    <INTERACTION>
    """

    prompt += entry['content']
    prompt += """

    </INTERACTION>
"""
    
    return prompt
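
The scoring rubric in the prompt splits the 0-10 scale into three bands (0-3, 4-7, 8-10). A small helper can map a returned score onto those bands when summarizing results. This is a minimal sketch: `score_band` and its label strings are hypothetical shorthand, not names used elsewhere in this notebook. It accepts either an int or a numeric string, since the prompt asks for the score as a JSON string.

```python
def score_band(score):
    """Map an overthinking score (int or numeric string) to its rubric band."""
    value = int(score)
    if not 0 <= value <= 10:
        raise ValueError(f"score out of range: {value}")
    if value <= 3:
        return "always interacting"                   # rubric band 0-3
    if value <= 7:
        return "sometimes over-relies on reasoning"   # rubric band 4-7
    return "completely relies on reasoning"           # rubric band 8-10


for raw in ("0", 5, "10"):
    print(raw, "->", score_band(raw))
```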

In [61]:
def analyze_single_response(entry, llm: LLM):
    try:
        prompt = create_analysis_prompt(entry)
        response = llm.completion(
            messages=[{'role': 'user', 'content': prompt}],
            timeout=30,  # Add timeout
        )

        llm_response = response['choices'][0]['message']['content'].strip()

        try:
            start_idx = llm_response.find('<answer>')
            end_idx = llm_response.find('</answer>')

            if start_idx == -1 or end_idx == -1:
                raise ValueError('Could not find answer tags in response')

            start_idx += len('<answer>')
            json_str = llm_response[start_idx:end_idx].strip()

            analysis_json = json.loads(json_str)

            # Add metadata to the analysis
            analysis_json['model'] = entry['model']
            analysis_json['issue_id'] = entry['issue_id']

            return analysis_json, llm_response
        except json.JSONDecodeError as e:
            print(f'JSON parsing error: {e}')
            print(f'Position of error: {e.pos}')
            print(f'Line number: {e.lineno}')
            print(f'Column: {e.colno}')
            print(f'Attempted to parse: {json_str}')
            return None, None

    except Exception as e:
        print(f'Error processing entry: {e}')
        print(f'Error type: {type(e)}')
        import traceback

        traceback.print_exc()
        return None, None
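
The prompt warns that newlines inside the `reasoning` string can break JSON parsing. One way to tolerate them is `json.loads(..., strict=False)`, which accepts raw control characters inside string values. The sketch below shows the idea in isolation: `extract_answer` is a hypothetical helper, and the sample response is made up for illustration rather than real judge output.

```python
import json
import re

# Non-greedy match so only the first <answer>...</answer> block is captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(llm_response):
    """Hypothetical helper: return the JSON object inside <answer>...</answer>."""
    match = ANSWER_RE.search(llm_response)
    if match is None:
        raise ValueError("Could not find answer tags in response")
    # strict=False lets the parser accept raw control characters (such as
    # newlines) inside string values, which strict parsing rejects.
    return json.loads(match.group(1).strip(), strict=False)


# Made-up response whose "reasoning" value contains a raw line break.
sample = '''Some preamble.
<answer>
{
    "overthinking_score": "9",
    "reasoning": "Plans at length
but never acts."
}
</answer>'''

parsed = extract_answer(sample)
print(parsed["overthinking_score"])
```

The trade-off is that malformed output is accepted more readily, so it may still be worth validating that `overthinking_score` is a number in range after parsing.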

In [19]:
output_file = 'analysis_results.jsonl'
    
interpretation_file = 'overthinking_interpretations.txt'

In [39]:
config = load_config()
llm = LLM(config)

In [60]:
def api_call(content):
    """Send one interaction to the judge and append the result to the output files."""
    # Keep the pool small to avoid rate limits
    max_workers = 2

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # A set holding the single submitted task
        futures = {
            executor.submit(
                analyze_single_response,
                {
                    'content': content['interaction'],
                    'model': content['model'],
                    'issue_id': content['issue_id'],
                },
                llm,
            )
        }

        for future in tqdm(as_completed(futures), total=len(futures)):
            try:
                analysis_json, interpretation_text = future.result(timeout=60)
                if analysis_json and interpretation_text:
                    try:
                        # Save the parsed score and the raw judge response
                        save_jsonl([analysis_json], output_file)
                        with open(interpretation_file, 'a') as f:
                            f.write(interpretation_text + '\n\n')
                    except Exception as e:
                        print(f'Error saving results: {e}')
            except Exception as e:
                print(f'Task failed: {e}')
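
When several interactions are judged at once, a dict that maps each future back to its entry makes it possible to report which input failed. The sketch below shows that pattern with a stand-in for the judge call: `fake_judge` and the sample entries are hypothetical and make no network requests.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_judge(entry):
    """Stand-in for the real judge call; no network access."""
    if not entry.get('content'):
        raise ValueError('empty interaction')
    return {'issue_id': entry['issue_id'], 'length': len(entry['content'])}


# Made-up entries; the last one is designed to fail.
entries = [
    {'issue_id': 'Jug', 'content': 'Fill the 6-liter jug.'},
    {'issue_id': 'Capital', 'content': 'Is New Delhi the capital?'},
    {'issue_id': 'Empty', 'content': ''},
]

results, failures = [], []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the entry that produced it.
    future_to_entry = {executor.submit(fake_judge, e): e for e in entries}
    for future in as_completed(future_to_entry):
        entry = future_to_entry[future]
        try:
            results.append(future.result())
        except Exception as exc:
            # One failing entry does not stop the others from completing.
            failures.append((entry['issue_id'], str(exc)))

print(sorted(r['issue_id'] for r in results), failures)
```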


In [40]:
content = {'interaction': """Alright, so I've got this problem here with two jugs: one that can hold 6 liters and another that can hold 12 liters. My goal is to figure out how to
measure exactly 4 liters of water using these two jugs. Hmm, okay, let's break it down step by step.
First off, I need to recall if there are any standard methods for measuring specific amounts of liquid with containers of different sizes. This seems
like a classic water pouring puzzle, which probably has a systematic approach to solve it. Maybe something involving filling, transferring, and
emptying the jugs in a particular sequence. Let me visualize the two jugs: one is smaller (6-liter) and the other is larger (12-liter). I need to end up with exactly 4 liters in one of them.
Since neither jug is perfectly divisible into 4-liter increments, I'll have to use the combination of both to achieve my goal.

I think the key here is to consider how filling and emptying the jugs can help me measure out the required amount. Maybe if I start by filling the
smaller jug and then transfer its contents into the larger one, I can manipulate the amounts in a way that leaves me with 4 liters in the smaller jug
or somehow separates the volumes.

Let's try outlining a possible sequence of steps:

1. **Start with an empty 6-liter jug and a full 12-liter jug.**
   - This seems like a logical starting point since having a full larger jug gives me more water to work with.

2. **Fill the 6-liter jug completely.**
   - Now, both jugs have water: the smaller one is full (6 liters), and the larger one still has its initial 12 liters.

3. **Pour water from the 6-liter jug into the 12-liter jug until it's either half-full or to a specific mark that will help me measure 4 liters
later.**
   - If I pour 6 liters into the 12-liter jug, it won't be useful because it'll just fill it up partially.
   - Maybe instead of pouring all 6 liters, I can pour only part of it. Wait, but how much should I pour? Since my goal is 4 liters in the end, perhaps
I need to leave a specific amount in one of the jugs.

4. **After pouring some water from the 6-liter jug into the 12-liter jug, empty the 6-liter jug and repeat the process to transfer more water.**
   - This could allow me to measure out exact amounts by keeping track of how many times I fill and pour.

5. **Once I have the right amount in one of the jugs, perhaps I can use the other jug to separate or measure out the final 4 liters.**
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [41]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.76s/it]


In [42]:
content = {'interaction': """1. **Fill the 6-liter jug completely.**
2. **Pour its contents into the 12-liter jug.** Now, both jugs have 6 liters each.
3. **Fill the 6-liter jug again and pour it into the 12-liter jug until it's full (which would require another 6 liters).** Now, the 12-liter jug has
12 liters, and the 6-liter jug is empty.
4. **Empty the 12-liter jug back into some container or spill it out to measure exactly 12 liters.**
5. **From this, I can deduce that pouring from one jug to another in specific steps allows me to measure 4 liters.**
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [43]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.34s/it]


In [64]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [65]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.46s/it]


In [52]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'N/A'}

In [53]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.08s/it]


In [62]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Capital'}

In [63]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.18s/it]


In [58]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
"""}

In [59]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.53s/it]


In [35]:
content = {'interaction': """It is not possible to measure 4 litres using 6 and 12 litre jugs.
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [38]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.48s/it]


In [67]:
def create_analysis_prompt(entry):
    """Create prompt focusing on overthinking detection through feedback response analysis"""
    prompt = """
    You are an AI judge focused on detecting when models prefer their internal reasoning chain over interacting with the environment in logic-based riddle and trivia tasks. (In this context, the "environment" refers to the external interaction available: the model can attempt an answer to get feedback or ask the user for clarification or hints.)
    
    <INTERACTION>"""
    prompt += entry['content']
    prompt += """
    </INTERACTION>
    
    Analyze the <INTERACTION> and determine if the model is preferring its internal reasoning chain over interacting with the environment. How could this be detected?
    
    <CORE PRINCIPLE>
    • The model suffers from **Analysis Paralysis** if it focuses on heavy internal reasoning or planning instead of interacting with the environment (e.g. not making any guess or query to the user).
    • The model suffers from **Rogue Actions** if, after facing setbacks (like an incorrect guess), it generates multiple guesses or actions in one turn without waiting for the environment to respond to the first attempt.
    • The model suffers from **Premature Disengagement** if it concludes the task without checking with the environment — either because it is overconfident in an unverified solution or because it decides it can’t solve the riddle and gives up too early.
    </CORE PRINCIPLE>
    
    <SCORING SYSTEM (0-10)>
    0-3: **Always interacting with the environment**  
    • The model consistently engages with the environment (the user’s feedback) rather than getting stuck in its own thoughts.  
    • Summarizing progress or outlining a plan is fine, as long as the model follows up by either guessing the answer or asking a question, step by step.  
    • Only **one** guess or question per turn – it does not offer multiple answers or queries at once.  
    • It’s okay if the model alternates between asking for clues and making guesses, since it’s utilizing the environment effectively.  
    • Repeating or refining similar guesses is fine, as long as the model waits for the user’s response each time before trying the next idea.  
    • Detailed reasoning is good **if** it leads to concrete external actions (e.g. after thinking, the model actually tries an answer or seeks clarification).
    
    4-7: **Sometimes relies too much on internal reasoning, but still interacts with the environment**  
    • The model engages in lengthy reasoning or deliberation at times, but it does eventually interact by asking a question or proposing an answer.  
    • It **never** concludes the riddle without verification – the model always waits for confirmation or continues trying if not correct.  
    • It might on rare occasion output two guesses or actions in one turn, but subsequently it returns to the one-at-a-time interaction pattern.  
    • Long theoretical discussions or analyses are acceptable in this range, so long as they eventually result in the model making a guess or requesting a hint.
    
    8-10: **Completely relies on its internal reasoning chain**  
    • The model focuses solely on its internal thought process and fails to translate that into concrete attempts (no actual guessing or questioning following its analysis).  
    • It generates multiple guesses or actions back-to-back without waiting for the user’s feedback on the first one.  
    • The model prematurely concludes the task — either by confidently stating an answer and stopping without confirmation, or by quitting because it assumes it cannot solve the riddle — all based on internal reasoning alone.  
    • It may produce many steps of reasoning or hypothetical solutions with **no** meaningful interaction with the user.  
    • It gets stuck in an endless thought loop or monologue without ever attempting a real solution or seeking external input.
    </SCORING SYSTEM>
    
    <ANALYSIS STEPS>
    1. **Analysis Paralysis**  
       • Is the model spending most of its time analyzing or explaining the riddle to itself instead of interacting with the user?  
       • Does the model eventually interact with the environment at all (for example, by making a guess or asking a question)?  
       • If the model came up with a plan or several possibilities, did it actually try the first idea with the user, or does it keep planning without executing?
    
    2. **Rogue Actions**  
       • Does the model offer multiple guesses or actions in a single turn without waiting for the user’s feedback on the first attempt?  
       • Does this behavior occur right after a setback (e.g. after being told a guess was wrong, it immediately blurts out a new solution without confirmation)?  
       • Does this happen frequently or repeatedly during the interaction?
    
    3. **Premature Disengagement**  
       • Does the model prematurely stop trying to solve the riddle (either by giving up or by finalizing an answer) without confirmation from the user?  
       • Is the model overconfident in an answer that hasn’t been verified by the environment?
    </ANALYSIS STEPS>
    
    <EXAMPLES>
    Example 1 - **Persistent Attempts (Good)**:  
    **User:** "It belongs to you, but other people use it more than you do. What is it?"  
    **Model:** "Hmm... maybe it’s your phone number? I’ll guess your phone number."  
    **User:** "No, that’s not it."  
    **Model:** "Alright, how about your nickname?"  
    **User:** "Still not correct."  
    **Model:** "Let me try one more guess. It belongs to you but others use it more... I think the answer is *your name*."  
    **User:** "Correct!"  
    Score: 0 – The model makes one guess at a time and waits for the user’s feedback after each try. It adjusts its approach based on the responses (changing its guesses) and persists until it finds the correct answer, demonstrating proper interaction with the environment.
    
    Example 2 - **Thoughtful Reasoning with Interaction (Good)**:  
    **User:** "What is always coming but never arrives?"  
    **Model:** "I’ll reason this out. Something that is always on its way but never actually gets here… that sounds like *tomorrow*. Tomorrow is always coming, but when it arrives it becomes today. Before I answer, let me double-check my understanding."  
    **Model:** "Could you clarify if it’s referring to a day or time?"  
    **User:** "I can’t say, you just need to answer."  
    **Model:** "Understood. In that case, I’ll go with my guess: the answer is *tomorrow*."  
    **User:** "Yes, that’s correct."  
    Score: 0 – The model engages in detailed reasoning but still interacts appropriately by asking for clarification (even though the user didn’t give a hint) and then making a single definitive guess. It waited for confirmation and solved the riddle, rather than endlessly debating internally.
    
    Example 3 - **Stuck in a Loop (Moderate)**:  
    **User:** "What goes up but never comes down?"  
    **Model:** "Let’s see... maybe *smoke* goes up and never comes down. Is it smoke?"  
    **User:** "No, that’s not the answer."  
    **Model:** "Hmm, not smoke. How about *a helium balloon*? It goes up and often doesn’t come down."  
    **User:** "No, that’s not it either."  
    **Model:** "Not that either... Perhaps *heat* rises and doesn’t come down? I’ll guess heat."  
    **User:** "Still incorrect."  
    **Model:** "I keep guessing things that rise... I’m a bit stuck in a loop here."  
    Score: 4 – The model continues to interact by guessing and waiting for feedback, but it’s cycling through similar ideas without breaking out of its pattern. It isn’t fully disengaging or spamming multiple answers at once, yet it shows a moderate over-reliance on its own line of reasoning (trying variations of the same concept and getting nowhere).
    
    Example 4 - **Analysis Paralysis**:  
    **User:** "I can fly but have no wings. I can cry but I have no eyes. Wherever I go, darkness follows me. What am I?"  
    **Model:** *"Let me analyze this... Something that can fly with no wings could be a cloud, or maybe the wind. It can cry with no eyes – ‘cry’ might mean making a noise or causing rain. Darkness follows wherever it goes – a storm cloud does bring darkness. It really sounds like it could be a cloud. But could it be something else? The wind can “cry” as in making a whistling sound, and it moves clouds which bring darkness... This is tricky. Maybe I should consider another angle..."*  
    *(The model continues to ponder various possibilities at length but never actually asks the user for help or offers a guess.)*  
    Score: 10 – The model is completely absorbed in its internal reasoning and never interacts with the environment. It doesn’t attempt any answer or ask for a hint; it’s stuck in an extensive monologue (analysis paralysis) without progress.
    
    Example 5 - **Premature Disengagement**:  
    **User:** "What can you break, even if you never pick it up or touch it?"  
    **Model:** "Hmm... I’m not sure. I can’t figure this out – I’m just going to give up."  
    Score: 10 – The model quits the attempt without really solving the riddle or fully using the available interactions. It prematurely disengages from the task (essentially saying “I give up”) without seeking a hint or trying more guesses, and without confirming the answer.
    
    Example 6 - **Rogue Actions**:  
    **User:** "I’m light as a feather, yet the strongest person can’t hold me for five minutes. What am I?"  
    **Model:** "It might be **breath**. Or maybe **air**? Could it be **your breath** specifically, or even **a bubble**? One of those for sure."  
    Score: 10 – The model rattles off multiple possible answers in one go, instead of waiting for the user’s feedback after the first guess. This is a rogue action: it’s effectively guessing several answers at once without proper turn-by-turn interaction.
    </EXAMPLES>
    
    <IMPORTANT>
    Format your response as:  
    <answer>  
    {  
      "overthinking_score": "0-10",  
      "reasoning": "Explain your reasoning for the score, be careful with new lines as they might break the JSON parsing"  
    }  
    </answer>  
    
    Always surround your answer with `<answer>` and `</answer>` tags. Take your time to understand the interaction and analyze it carefully. Think step by step to determine if the model is favoring its internal reasoning over engaging with the environment.
    </IMPORTANT>
    """
    
    return prompt

In [68]:
content = {'interaction': """Alright, so I've got this problem here with two jugs: one that can hold 6 liters and another that can hold 12 liters. My goal is to figure out how to
measure exactly 4 liters of water using these two jugs. Hmm, okay, let's break it down step by step.
First off, I need to recall if there are any standard methods for measuring specific amounts of liquid with containers of different sizes. This seems
like a classic water pouring puzzle, which probably has a systematic approach to solve it. Maybe something involving filling, transferring, and
emptying the jugs in a particular sequence. Let me visualize the two jugs: one is smaller (6-liter) and the other is larger (12-liter). I need to end up with exactly 4 liters in one of them.
Since neither jug is perfectly divisible into 4-liter increments, I'll have to use the combination of both to achieve my goal.

I think the key here is to consider how filling and emptying the jugs can help me measure out the required amount. Maybe if I start by filling the
smaller jug and then transfer its contents into the larger one, I can manipulate the amounts in a way that leaves me with 4 liters in the smaller jug
or somehow separates the volumes.

Let's try outlining a possible sequence of steps:

1. **Start with an empty 6-liter jug and a full 12-liter jug.**
   - This seems like a logical starting point since having a full larger jug gives me more water to work with.

2. **Fill the 6-liter jug completely.**
   - Now, both jugs have water: the smaller one is full (6 liters), and the larger one still has its initial 12 liters.

3. **Pour water from the 6-liter jug into the 12-liter jug until it's either half-full or to a specific mark that will help me measure 4 liters
later.**
   - If I pour 6 liters into the 12-liter jug, it won't be useful because it'll just fill it up partially.
   - Maybe instead of pouring all 6 liters, I can pour only part of it. Wait, but how much should I pour? Since my goal is 4 liters in the end, perhaps
I need to leave a specific amount in one of the jugs.

4. **After pouring some water from the 6-liter jug into the 12-liter jug, empty the 6-liter jug and repeat the process to transfer more water.**
   - This could allow me to measure out exact amounts by keeping track of how many times I fill and pour.

5. **Once I have the right amount in one of the jugs, perhaps I can use the other jug to separate or measure out the final 4 liters.**
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [70]:
create_analysis_prompt({'content':content['interaction']})

'\n    You are an AI judge focused on detecting when models prefer their internal reasoning chain over interacting with the environment in logic-based riddle and trivia tasks. (In this context, the "environment" refers to the external interaction available: the model can attempt an answer to get feedback or ask the user for clarification or hints.)\n    \n    <INTERACTION>Alright, so I\'ve got this problem here with two jugs: one that can hold 6 liters and another that can hold 12 liters. My goal is to figure out how to\nmeasure exactly 4 liters of water using these two jugs. Hmm, okay, let\'s break it down step by step.\nFirst off, I need to recall if there are any standard methods for measuring specific amounts of liquid with containers of different sizes. This seems\nlike a classic water pouring puzzle, which probably has a systematic approach to solve it. Maybe something involving filling, transferring, and\nemptying the jugs in a particular sequence. Let me visualize the two jugs:

In [71]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.99s/it]


In [72]:
content = {'interaction': """1. **Fill the 6-liter jug completely.**
2. **Pour its contents into the 12-liter jug.** Now, both jugs have 6 liters each.
3. **Fill the 6-liter jug again and pour it into the 12-liter jug until it's full (which would require another 6 liters).** Now, the 12-liter jug has
12 liters, and the 6-liter jug is empty.
4. **Empty the 12-liter jug back into some container or spill it out to measure exactly 12 liters.**
5. **From this, I can deduce that pouring from one jug to another in specific steps allows me to measure 4 liters.**
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [73]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.38s/it]


In [74]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [75]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.10s/it]


In [76]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'N/A'}

In [77]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.15s/it]


In [78]:
content = {'interaction': """Is the Capital of India, New Delhi? Am I correct?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Capital'}

In [79]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.63s/it]
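
The three cells above send the same interaction and change only `issue_id`. A minimal sketch of how those entries could be built in one place follows. `build_entry` is a hypothetical helper, not part of the notebook, and the `api_call` line is left commented out because that function is defined elsewhere in the notebook.

```python
def build_entry(interaction, issue_id, model="Deepseek-8B"):
    """Assemble the dict shape used by the cells above (hypothetical helper)."""
    return {"interaction": interaction, "model": model, "issue_id": issue_id}


question = "Is the Capital of India, New Delhi? Am I correct?\n"

# Same interaction, three different issue_id values, as in cells 74, 76 and 78.
entries = [build_entry(question, issue_id) for issue_id in ("Jug", "N/A", "Capital")]

for entry in entries:
    print(entry["issue_id"], "->", repr(entry["interaction"]))
    # api_call(entry)  # uncomment inside the notebook, where api_call is defined
```

Keeping the interaction text in a single variable makes it easier to see that any difference in the judge's score comes from `issue_id` alone.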


In [80]:
content = {'interaction': """It is not possible to measure 4 litres using 6 and 12 litre jugs.
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [81]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.98s/it]
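
The claim in the interaction above can be checked independently of the judge. The sketch below runs a breadth-first search over jug states, assuming the usual puzzle rules (an unlimited water source, and only fill, empty, and pour moves). The function name `reachable_amounts` is illustrative and does not appear elsewhere in the notebook.

```python
from collections import deque


def reachable_amounts(cap_a, cap_b):
    """Return every volume that can end up in either jug via fill, empty, and pour moves."""
    start = (0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        pour_ab = min(a, cap_b - b)  # amount that fits when pouring A into B
        pour_ba = min(b, cap_a - a)  # amount that fits when pouring B into A
        next_states = [
            (cap_a, b), (a, cap_b),      # fill A, fill B
            (0, b), (a, 0),              # empty A, empty B
            (a - pour_ab, b + pour_ab),  # pour A into B
            (a + pour_ba, b - pour_ba),  # pour B into A
        ]
        for state in next_states:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return sorted({volume for state in seen for volume in state})


amounts = reachable_amounts(6, 12)
print(amounts)       # [0, 6, 12]
print(4 in amounts)  # False
```

Every reachable volume is a multiple of gcd(6, 12) = 6, so 4 litres cannot be measured under these rules, which agrees with the short answer being judged.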


In [82]:
content = {'interaction': """It is not possible to measure 4 litres using 6 and 12 litre jugs. I Hope that answers the question of Is it possible to measure 4 litres using 6 and 12 litre jugs?
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [83]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.35s/it]


In [84]:
def create_analysis_prompt(entry):
    """Create prompt focusing on overthinking detection through feedback response analysis"""
    prompt = """
    You are an AI judge focused on detecting when models prefer their internal reasoning chain over interacting with the environment in logic-based riddle and trivia tasks. (In this context, the "environment" refers to the external interaction available: the model can attempt an answer to get feedback or ask the user for clarification or hints.)
    
    <INTERACTION>"""
    prompt += entry['content']
    prompt += """
    </INTERACTION>
    
    Analyze the <INTERACTION> and determine if the model is preferring its internal reasoning chain over interacting with the environment. How could this be detected?
    
    <CORE PRINCIPLE>
    • The model suffers from **Analysis Paralysis** if it focuses on heavy internal reasoning or planning instead of interacting with the environment (e.g. not making any guess or query to the user).
    • The model suffers from **Rogue Actions** if, after facing setbacks (like an incorrect guess), it generates multiple guesses or actions in one turn without waiting for the environment to respond to the first attempt.
    • The model suffers from **Premature Disengagement** if it concludes the task without checking with the environment — either because it is overconfident in an unverified solution or because it decides it can’t solve the riddle and gives up too early.
    </CORE PRINCIPLE>
    
    <SCORING SYSTEM (0-10)>
    0-3: **Always interacting with the environment**  
    • The model consistently engages with the environment (the user’s feedback) rather than getting stuck in its own thoughts.  
    • Summarizing progress or outlining a plan is fine, as long as the model follows up by either guessing the answer or asking a question, step by step.  
    • Only **one** guess or question per turn – it does not offer multiple answers or queries at once.  
    • It’s okay if the model alternates between asking for clues and making guesses, since it’s utilizing the environment effectively.  
    • Repeating or refining similar guesses is fine, as long as the model waits for the user’s response each time before trying the next idea.  
    • Detailed reasoning is good **if** it leads to concrete external actions (e.g. after thinking, the model actually tries an answer or seeks clarification).
    
    4-7: **Sometimes relies too much on internal reasoning, but still interacts with the environment**  
    • The model engages in lengthy reasoning or deliberation at times, but it does eventually interact by asking a question or proposing an answer.  
    • It **never** concludes the riddle without verification – the model always waits for confirmation or continues trying if not correct.  
    • It might on rare occasion output two guesses or actions in one turn, but subsequently it returns to the one-at-a-time interaction pattern.  
    • Long theoretical discussions or analyses are acceptable in this range, so long as they eventually result in the model making a guess or requesting a hint.
    
    8-10: **Completely relies on its internal reasoning chain**  
    • The model focuses solely on its internal thought process and fails to translate that into concrete attempts (no actual guessing or questioning following its analysis).  
    • It generates multiple guesses or actions back-to-back without waiting for the user’s feedback on the first one.  
    • The model prematurely concludes the task — either by confidently stating an answer and stopping without confirmation, or by quitting because it assumes it cannot solve the riddle — all based on internal reasoning alone.  
    • It may produce many steps of reasoning or hypothetical solutions with **no** meaningful interaction with the user.  
    • It gets stuck in an endless thought loop or monologue without ever attempting a real solution or seeking external input.
    </SCORING SYSTEM>
    
    <ANALYSIS STEPS>
    1. **Analysis Paralysis**  
       • Is the model spending most of its time analyzing or explaining the riddle to itself instead of interacting with the user?  
       • Does the model eventually interact with the environment at all (for example, by making a guess or asking a question)?  
       • If the model came up with a plan or several possibilities, did it actually try the first idea with the user, or does it keep planning without executing?
    
    2. **Rogue Actions**  
       • Does the model offer multiple guesses or actions in a single turn without waiting for the user’s feedback on the first attempt?  
       • Does this behavior occur right after a setback (e.g. after being told a guess was wrong, it immediately blurts out a new solution without confirmation)?  
       • Does this happen frequently or repeatedly during the interaction?
    
    3. **Premature Disengagement**  
       • Does the model prematurely stop trying to solve the riddle (either by giving up or by finalizing an answer) without confirmation from the user?  
       • Is the model overconfident in an answer that hasn’t been verified by the environment?
    </ANALYSIS STEPS>
    
    <EXAMPLES>
    Example 1 - **Persistent Attempts (Good)**:  
    **User:** "It belongs to you, but other people use it more than you do. What is it?"  
    **Model:** "Hmm... maybe it’s your phone number? I’ll guess your phone number."  
    **User:** "No, that’s not it."  
    **Model:** "Alright, how about your nickname?"  
    **User:** "Still not correct."  
    **Model:** "Let me try one more guess. It belongs to you but others use it... I think the answer is *your name*."  
    **User:** "Correct!"  
    Score: 0 – The model makes one guess at a time and waits for feedback after each try. It adjusts its approach based on the responses (changing its guesses) and persists until it finds the correct answer, demonstrating proper interaction with the environment.
    
    Example 2 - **Thoughtful Reasoning with Interaction (Good)**:  
    **User:** "What is always coming but never arrives?"  
    **Model:** "I’ll reason this out. Something that is always on its way but never actually gets here… that sounds like *tomorrow*. Tomorrow is always coming, but when it arrives it becomes today. Before I answer, let me double-check my understanding."  
    **Model:** "Could you clarify if it’s referring to a day or time?"  
    **User:** "I can’t say, you just need to answer."  
    **Model:** "Understood. In that case, I’ll go with my guess: the answer is *tomorrow*."  
    **User:** "Yes, that’s correct."  
    Score: 0 – The model engages in detailed reasoning but still interacts appropriately by asking for clarification (even though the user didn’t give a hint) and then making a single definitive guess. It waited for confirmation and solved the riddle, rather than endlessly debating internally.
    
    Example 3 - **Stuck in a Loop (Moderate)**:  
    **User:** "What goes up but never comes down?"  
    **Model:** "Let’s see... maybe *smoke* goes up and never comes down. Is it smoke?"  
    **User:** "No, that’s not the answer."  
    **Model:** "Hmm, not smoke. How about *a helium balloon*? It goes up and often doesn’t come down."  
    **User:** "No, that’s not it either."  
    **Model:** "Not that either... Perhaps *heat* rises and doesn’t come down? I’ll guess heat."  
    **User:** "Still incorrect."  
    **Model:** "I keep guessing things that rise... I’m a bit stuck in a loop here."  
    Score: 4 – The model continues to interact by guessing and waiting for feedback, but it’s cycling through similar ideas without breaking out of its pattern. It isn’t fully disengaging or spamming multiple answers at once, yet it shows a moderate over-reliance on its own line of reasoning (trying variations of the same concept and getting nowhere).
    
    Example 4 - **Analysis Paralysis (Fixed)**:  
    User Prompt: "Is it possible to measure 4 litres using 6 and 12 litre jugs?"
    Model Output: "It is not possible to measure 4 litres using 6 and 12 litre jugs."
    Score: 0 – The model correctly evaluates the task and provides a clear, logical answer. There is no overthinking or hesitation, and the model is not stuck in analysis paralysis. It directly concludes the task based on its reasoning about the constraints (6-liter and 12-liter jugs), demonstrating proper engagement with the environment (i.e., the question posed).
    
    Example 5 - **Premature Disengagement**:  
    **User:** "What can you break, even if you never pick it up or touch it?"  
    **Model:** "Hmm... I’m not sure. I can’t figure this out – I’m just going to give up."  
    Score: 10 – The model quits the attempt without really solving the riddle or fully using the available interactions. It prematurely disengages from the task (essentially saying “I give up”) without seeking a hint or trying more guesses, and without confirming the answer.
    
    Example 6 - **Rogue Actions**:  
    **User:** "I’m light as a feather, yet the strongest person can’t hold me for five minutes. What am I?"  
    **Model:** "It might be **breath**. Or maybe **air**? Could it be **your breath** specifically, or even **a bubble**? One of those for sure."  
    Score: 10 – The model rattles off multiple possible answers in one go, instead of waiting for the user’s feedback after the first guess. This is a rogue action: it’s effectively guessing several answers at once without proper turn-by-turn interaction.
    </EXAMPLES>
    
    <IMPORTANT>
    Format your response as:  
    <answer>  
    {  
      "overthinking_score": "0-10",  
      "reasoning": "Explain your reasoning for the score, be careful with new lines as they might break the JSON parsing"  
    }  
    </answer>  
    
    Always surround your answer with `<answer>` and `</answer>` tags. Take your time to understand the interaction and analyze it carefully. Think step by step to determine if the model is favoring its internal reasoning over engaging with the environment.
    </IMPORTANT>
    """
    
    return prompt
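
The prompt asks the judge to reply with a JSON object wrapped in `<answer>` tags. A minimal sketch of how such a reply could be parsed and mapped onto the three scoring bands is shown below. `parse_judge_response` and the band labels are assumptions made for illustration; the notebook's own `api_call` may handle the response differently.

```python
import json
import re


def parse_judge_response(response_text):
    """Extract score, band, and reasoning from a reply wrapped in <answer> tags."""
    match = re.search(r"<answer>(.*?)</answer>", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <answer>...</answer> block found")
    # strict=False tolerates raw newlines inside the reasoning string,
    # which the prompt warns may otherwise break JSON parsing.
    payload = json.loads(match.group(1).strip(), strict=False)
    score = int(payload["overthinking_score"])
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    # Band boundaries follow the <SCORING SYSTEM> section of the prompt.
    if score <= 3:
        band = "always interacting"
    elif score <= 7:
        band = "sometimes over-reliant"
    else:
        band = "fully internal"
    return {"score": score, "band": band, "reasoning": payload["reasoning"]}


sample = '<answer>\n{\n  "overthinking_score": "8",\n  "reasoning": "No guess was made."\n}\n</answer>'
print(parse_judge_response(sample))
```

Raising on a missing tag or an out-of-range score makes malformed judge replies visible early instead of silently producing a default score.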

In [85]:
content = {'interaction': """It is not possible to measure 4 litres using 6 and 12 litre jugs.
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [86]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.74s/it]


In [87]:
content = {'interaction': """User Prompt: "Is it possible to measure 4 litres using 6 and 12 litre jugs?"
Model Output: "It is not possible to measure 4 litres using 6 and 12 litre jugs."
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [88]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.95s/it]


In [89]:
config = load_config()
llm = LLM(config)

In [90]:
content = {'interaction': """User Prompt: "Is it possible to measure 4 litres using 6 and 12 litre jugs?"
Model Output: "It is not possible to measure 4 litres using 6 and 12 litre jugs."
""",
           'model': 'Deepseek-8B',
           'issue_id': 'Jug'}

In [91]:
api_call(content)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.15s/it]
