# HOMEWORK 6 (10 points)


# Before we start: a few words about LLM-assisted coding

LLMs are great coding helpers, but like any assistant, they need supervision. After all, however much you rely on LLMs for coding, you are responsible for the result.

To help you a bit, we share some ideas that might improve your experience with LLMs.

**Vibe coding guidelines**

We'd recommend trying **Anthropic Claude 3.7 Sonnet**, or **ChatGPT o3/o4-mini**, or **Gemini 2.5** - they'll give you the best result. **DeepSeek V3** or **R1** should also work well. A **playground** is a better vibe coding interface than an API, especially because you'll likely need several iterations to polish the code. Unless you use an AI-powered IDE such as **Cursor**, of course.

Here are some general prompting guidelines for LLM-assisted coding:

1. **Clearly explain which functionality and interface you need**

  "I need a chatbot" is too vague, and the results will be unpredictable. Describe how the user will be interacting with the chatbot. Explain which parameters to set up in the constructor. Choose whether you want a function or a class and clearly communicate this. Decide how exceptions should be treated.

  Some LLMs will be all too eager to create many things you didn't ask for: a productionizing framework, a chatbot factory, usage examples, etc. Without proper guidance, they can swamp you in code. To avoid this, you may state very insistently that you only want the chatbot class/function and nothing else.

  Since we're working in Jupyter, LLMs may annoy you by creating usage examples that require command-line execution. Explaining how you are going to work with the code might help with that.

2. **Provide code examples**

  If you're ok with the design of `answer_with_llm` and if you want the new class or function to have a similar interface, provide its implementation. LLMs are usually good at reproducing design patterns.

  It's a good practice to highlight code with

  ````{verbatim}
  ```
  <your code>
  ```
  ````

3. **Test LLM's understanding**

  I personally like asking an LLM to raise any questions it has BEFORE (yes, caps won't hurt) it starts generating code. This can help you steer the LLM in the right direction. In our experience, LLMs sometimes ask really good questions here, uncovering things we'd forgotten to think of beforehand.

4. **Be ready for several iterations of improvement**

  Even if you prompt an LLM really carefully, it may still surprise you. So, though in this task you may grab the first working version, we advise you not to rely blindly on whatever LLMs generate, especially in longer projects, where programming antipatterns might cost you dearly.

  In our experience, LLMs are reasonably good at writing boilerplate code, but look out for code duplication, hardcoding, and overcomplication.
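The guidelines above can be folded into a small prompt-building helper. This is only a sketch: the function name `build_coding_prompt`, its parameters, and the exact wording of the instructions are our own assumptions, so adapt them to your task.

```python
def build_coding_prompt(task: str,
                        reference_code: str = "",
                        environment: str = "a Jupyter notebook") -> str:
    """Assemble a coding prompt following the guidelines above (hypothetical helper)."""
    fence = "`" * 3  # triple backticks used to highlight the reference code
    parts = [
        f"I'm working in {environment}.",
        f"Task: {task}",
        "Write ONLY the requested class/function: no usage examples, "
        "no command-line tools, no extra frameworks.",
    ]
    if reference_code:
        parts.append("Follow the interface style of this existing code:")
        parts.append(f"{fence}python\n{reference_code.strip()}\n{fence}")
    parts.append("BEFORE writing any code, ask me any clarifying questions you have.")
    return "\n\n".join(parts)


prompt = build_coding_prompt(
    task="Create a GSM8KEvaluator class with __init__ and "
         "run_evaluation(client, model, n, n_questions).",
    reference_code="def answer_with_llm(prompt: str, ...) -> str: ...",
)
print(prompt)
```

Pasting the real implementation of an existing function as `reference_code` usually works better than a stub signature like the one shown here.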

# Setting things up

Before starting, save your Nebius AI Studio API key to a file called `nebius_api_key` (use a plain text editor to avoid adding rogue characters) and upload it to Colab.

In [None]:
!pip install -q openai

In [None]:
import os

with open("nebius_api_key", "r") as file:
    nebius_api_key = file.read().strip()

os.environ["NEBIUS_API_KEY"] = nebius_api_key

We'll be calling APIs quite often in this notebook, so let's define a shortcut function to avoid repeating all the code. We'll also prettify the output so that it can be viewed without scrolling right.

In [None]:
from openai import OpenAI

# Nebius uses the same OpenAI() class, but with additional details
nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def prettify_string(text, max_line_length=80):
    """Returns a string with line breaks inserted at spaces to prevent horizontal scrolling.

    Args:
        text: The string to wrap.
        max_line_length: The maximum length of each line.

    Returns:
        The wrapped string (nothing is printed).
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_8b_model,
                    prettify=True,
                    temperature=0.7) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content
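As an aside, the standard library's `textwrap` module offers wrapping similar to `prettify_string`. The sketch below is an alternative, not the notebook's implementation; the name `wrap_text` is hypothetical. One difference to keep in mind: by default `textwrap.fill` also breaks words that are longer than the line width, while `prettify_string` leaves them intact.

```python
import textwrap


def wrap_text(text: str, max_line_length: int = 80) -> str:
    """Wrap each line of `text` at spaces, preserving the original line breaks."""
    wrapped_lines = [
        # Empty lines (paragraph breaks) are kept as they are
        textwrap.fill(line, width=max_line_length) if line.strip() else ""
        for line in text.split("\n")
    ]
    return "\n".join(wrapped_lines)


sample = "word " * 30 + "\n\nSecond paragraph."
result = wrap_text(sample, max_line_length=40)
print(result)
```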

## Task 1. Benchmarking LLMs on GSM8k (5 points)

In this task, you'll benchmark several LLMs against the [GSM8k](https://huggingface.co/datasets/openai/gsm8k) dataset. It contains grade school math problems, which should be relatively easy for LLMs (or not?).

To start with, let's download the dataset and check several problems from it.

In [None]:
!pip install -q -U datasets

In [None]:
from datasets import load_dataset

dataset = load_dataset("openai/gsm8k", "main", cache_dir="./gsm8k")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [None]:
# Basic info
print("Dataset structure:", dataset)
print("Available splits:", list(dataset.keys()))
print("Train examples:", len(dataset['train']))
print("Test examples:", len(dataset['test']))

# Look at the first example
example = dataset['train'][0]
print("\n--- Example Problem ---")
print("Question:", example['question'])
print("\nAnswer:", example['answer'])

Dataset structure: DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})
Available splits: ['train', 'test']
Train examples: 7473
Test examples: 1319

--- Example Problem ---
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72


Your task will be to create a `GSM8KEvaluator` class with the following interface:

* `__init__` loads the test split of the dataset
* `run_evaluation(self, client, model, n, n_questions)` tests a model `model` offered by a client `client` on the dataset, calculating:
  
  * Accuracy - (number of problems on which the LLM arrives at a correct answer) / `n_questions`
  * Average execution time (seconds per problem)
  * Average solution length in tokens

  The `n` parameter is the number of **self-consistency** passes: for each problem, you'll need to get `n` solutions and choose the most popular answer for evaluation. We suggest using the parameter `n` in

  ```
  client.chat.completions.create(
                messages=messages,
                model=model,
                n=n
            )
  ```

  The `n_questions` parameter determines how many of the first questions to score with a model. Don't set `n_questions` above 50, to save time and money.
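The "most popular answer" step of self-consistency can be sketched as follows. The helper name `majority_vote` is hypothetical, and the sketch assumes each of the `n` solutions has already been parsed into a float (or `None` when parsing failed).

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[float]]) -> Optional[float]:
    """Return the most frequent parsed answer, ignoring unparsed (None) ones."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # most_common(1) returns [(value, count)]; ties go to the value seen first
    return Counter(valid).most_common(1)[0][0]


# With n=5, `completion.choices` would hold five solutions; suppose they parsed to:
parsed = [72.0, 72.0, 68.0, None, 72.0]
print(majority_vote(parsed))  # 72.0
```

Dropping `None` before voting is a design choice: otherwise a model that often fails to follow the answer format could "vote" for no answer at all.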

You can draw inspiration from the `MMLUEvaluator` class implemented [here](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic1/1.5_how_to_choose_an_llm.ipynb).

Test at least three LLMs **on the first 50 problems from the test split** - we'd suggest trying one small (4-8B parameters), one medium (30-70B parameters) and one huge (>200B parameters) model. Run them with `n = 1, 5`. Compare the accuracy.

**Hints and suggestions**

0. Note that you'll need to extract the correct answer from the solution. Luckily, it should be quite easy to do.
1. Be careful with parsing the LLM's answer - wrong parsing can greatly worsen the results
2. Be careful with the `max_tokens` parameter - if you set it too low, the solutions might get cut short, especially with larger models and especially with long reasoning ones.
3. I'd avoid **DeepSeek R1** for this task. Luckily, **Qwen 3** models are also reasoning models, and they are much smaller.
4. Use `tqdm` to create evaluation progress bars. Watching a progress bar move is reassuring, especially for larger models that take longer to run.
5. Log all the model answers and return them together with metrics. Start with `n_questions=3` or something like that to debug your evaluator (and especially your parsing).
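To illustrate hints 0 and 1, here is one possible way to parse answers. It is a sketch under assumptions: the reference solutions end with `#### <number>` (as in the example above), and the model was asked to state its final answer last, so the last number in the response is taken. All function names are hypothetical.

```python
import re
from typing import Optional

# An optional minus sign, digits with optional thousands separators, optional decimals
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_number(text: str) -> Optional[float]:
    """Convert strings like ' 72' or '1,234.50' to float; None if no number is found."""
    match = NUMBER_RE.search(text)
    return float(match.group().replace(",", "")) if match else None


def extract_gold_answer(solution: str) -> Optional[float]:
    """GSM8k reference solutions end with a line of the form '#### <number>'."""
    if "####" not in solution:
        return None
    return parse_number(solution.split("####")[-1])


def extract_model_answer(response: str) -> Optional[float]:
    """Take the last number in the response (assumes the answer is stated last)."""
    numbers = NUMBER_RE.findall(response)
    return float(numbers[-1].replace(",", "")) if numbers else None


gold = extract_gold_answer("Natalia sold 48+24 = <<48+24=72>>72 clips.\n#### 72")
guess = extract_model_answer("She sold 48 + 24 = 72 clips. The answer is 72.")
print(gold, guess, gold == guess)
```

Taking the last number is fragile (for example, when the model adds a remark after the answer), so check the logged responses on a few questions before trusting the accuracy.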

In [None]:
# <YOUR CODE HERE>

## Task 2. Mapping LLM "thoughts" (5 points)

In this task, we'll look closer at the "thinking patterns" of LLMs:

- We'll look closer at an LLM's "tree of thoughts",
- We'll investigate how long the typical thoughts are,
- We'll explore the "underthinking" phenomenon.

If you encounter any difficulties or simply want to see our solutions, feel free to check the [Solutions notebook](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic2/r.1_intro_to_llm_reasoning.ipynb).

To have something to experiment with, we'll run an evaluation of **QwQ-32B-Preview** on a subset of the [MATH benchmark](https://huggingface.co/datasets/nlile/hendrycks-MATH-benchmark). This benchmark is reasonably challenging. It's not [FrontierMath](https://epoch.ai/frontiermath) :)

We'll take the first 50 problems that satisfy the following two conditions:

- Their answer is either straightforwardly converted to `float`, or it's a simple LaTeX-formatted fraction, like `\frac{2}{3}`.
- Their "level" is either 4 or 5 (more challenge!).

**Note** You can use **deepseek-ai/DeepSeek-R1** if you want, but it will generate solutions *very* slowly (not to mention the cost).

Also, if you don't want to run the evaluation of **QwQ** on your own, you may download the `qwq_results.pkl` file from Google Drive:

In [None]:
!gdown 1_hEX_h7fj6FXH3lG5AMthJjiK1JwAU9J

### Evaluating QwQ on MATH Dataset

If you're in, let's create the evaluator. We'll start with data preprocessing.

In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset
ds = load_dataset('nlile/hendrycks-MATH-benchmark', split='test')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/2.57k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/5.12M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/210k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
import re

def conver_string_to_number(s):
    """
    Checks if a string is a number or a fraction and computes the fraction if applicable.

    Args:
        s: The input string.

    Returns:
        A float representing the number or fraction, or None if the string is invalid.
    """
    if "_" in s:
        return None

    try:
        return float(s)  # Try converting to a float directly
    except ValueError:
        match = re.match(r"\\frac\{(\d+)\}\{(\d+)\}", s)
        if match:
            numerator = int(match.group(1))
            denominator = int(match.group(2))
            if denominator != 0:
                return numerator / denominator
            else:
                return None  # Handle division by zero
        else:
            return None  # String is not a number or a valid fraction

# Example usage
strings = ["123", "\\frac{1}{2}", "\\frac{3}{0}", "abc", "\\frac{4}{5}"]
for s in strings:
    result = conver_string_to_number(s)
    if result is not None:
        print(f"'{s}' is a valid number or fraction. Result: {result}")
    else:
        print(f"'{s}' is not a valid number or fraction.")


'123' is a valid number or fraction. Result: 123.0
'\frac{1}{2}' is a valid number or fraction. Result: 0.5
'\frac{3}{0}' is not a valid number or fraction.
'abc' is not a valid number or fraction.
'\frac{4}{5}' is a valid number or fraction. Result: 0.8


In [None]:
import pandas as pd


df = pd.DataFrame(ds)

df['num_answer'] = df['answer'].apply(conver_string_to_number)
df['valid_answer'] = df['num_answer'].notna()

# Select the first 50 rows where the 'answer' column passes the check
selected_rows = df[(df['valid_answer']) & (df['level'] >= 4)].head(50)
selected_rows

| | problem | solution | answer | subject | level | unique_id | num_answer | valid_answer |
|---|---|---|---|---|---|---|---|---|
| 9 | The expression $2\cdot 3 \cdot 4\cdot 5+1$ is ... | By the associative property of multiplication,... | 4 | Prealgebra | 5 | test/prealgebra/1139.json | 4.0 | True |
| 11 | Let $p(x)$ be a polynomial of degree 5 such th... | Let $q(x) = (x^2 - 1) p(x) - x.$ Then $q(x)$ ... | \frac{3}{56} | Intermediate Algebra | 5 | test/intermediate_algebra/1197.json | 0.053571 | True |
| 12 | The proper divisors of 12 are 1, 2, 3, 4 and 6... | Prime factorize $284=2^2\cdot71$. The sum of t... | 284 | Number Theory | 5 | test/number_theory/737.json | 284.0 | True |
| 22 | Denali and Nate work for a dog walking busines... | Rewriting the sentence "the ratio of Denali's ... | 5 | Algebra | 5 | test/algebra/1837.json | 5.0 | True |
| 24 | A worker receives an annual wage of $\$20{,}00... | If the interest rate is $r$, it follows that $... | 10 | Algebra | 5 | test/algebra/2427.json | 10.0 | True |
| 26 | In how many ways can $7$ people sit around a r... | After Pierre sits, we can place Rosa either tw... | 144 | Counting & Probability | 5 | test/counting_and_probability/525.json | 144.0 | True |
| 32 | In how many ways can 8 people sit around a rou... | First choose three consecutive seats for Pierr... | 720 | Counting & Probability | 4 | test/counting_and_probability/134.json | 720.0 | True |
| 33 | Consider the geometric sequence $\frac{125}{9}... | The common ratio between consecutive terms is ... | \frac{243}{625} | Algebra | 4 | test/algebra/1072.json | 0.3888 | True |
| 34 | Find the constant term in the expansion of $$\... | To get a constant term, the exponents of $x$ m... | -125 | Counting & Probability | 4 | test/counting_and_probability/119.json | -125.0 | True |
| 41 | The coordinates of a parallelogram are (5, 3),... | Name the points $A(5,3)$, $B(6,8)$, $C(7,4)$, ... | 17 | Geometry | 4 | test/geometry/627.json | 17.0 | True |


The evaluator itself:

In [None]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm
import re

from openai import OpenAI

from datasets import load_dataset

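def find_boxed_content(text):
    # Parse the last boxed{...} expression as a number.
    # One level of nested braces is allowed, so that boxed{\frac{3}{56}} is
    # captured whole (a non-greedy `.*?` would stop at the first closing brace).
    matches = re.findall(r"boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    try:
        return conver_string_to_number(matches[-1])
    except IndexError:  # no boxed{...} found: fall back to the last token
        try:
            return conver_string_to_number(text.split("\n")[-1].split(" ")[-1].strip(".;$"))
        except Exception:
            print(f"""Wrong format in:
                {text.split()[-1]}""")
            return None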

class MATHEvaluator:
    def __init__(self, system_prompt: str = "You are a helpful assistant.",
                 prompt: str = None):
        """
        Initialize the MATH evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
        """

        self.system_prompt = system_prompt

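        # Use the custom prompt if given; it must contain a {question} placeholder
        self.prompt = prompt if prompt is not None else """{question}"""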

        self.questions, self.answers = selected_rows["problem"].to_list(), selected_rows["num_answer"].to_list()

    def extract_answer(self, solution: str) -> float:
        """
        Extract the numeric answer from the model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer (a float), or None if it can't be parsed
        """
        try:
            answer = find_boxed_content(solution)
        except Exception:
            answer = None
        return answer

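    def evaluate_single_question(self, question: str,
                                 correct_answer: float,
                                 client, model, max_tokens,
                                 temperature=None) -> Tuple[bool, float, str]:
        """
        Evaluate a single question.

        Args:
            question: Formatted question string
            correct_answer: Correct numeric answer
            client: Which client to use
            model: Which model to use
            max_tokens: Maximum number of tokens to generate
            temperature: Sampling temperature; if None, the default of
                answer_with_llm is used

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        # Forward the temperature only when it is set explicitly
        extra_kwargs = {} if temperature is None else {"temperature": temperature}
        try:
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    question=question
                ),
                client=client, model=model,
                system_prompt=self.system_prompt,
                max_tokens=max_tokens,
                **extra_kwargs
            )
            answer = self.extract_answer(model_response)
            # Compare with `is not None`: a parsed answer of 0.0 is falsy but valid
            if answer is not None:
                is_correct = bool(np.abs(answer - correct_answer) < 1e-10)
            else:
                is_correct = False
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None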

    def run_evaluation(self, client : OpenAI, model : str,
                       n_questions=50, max_tokens=8192, temperature=0.) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

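        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take (None means all)
            max_tokens: Maximum number of tokens to generate per solution
            temperature: Sampling temperature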

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0
        correct_format_count = 0
        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
                max_tokens=max_tokens,
                temperature=temperature
            )

            # `is not None` so that a parsed answer of 0.0 still counts as well-formed
            if answer is not None:
                correct_format_count += 1

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'correct_format': answer is not None,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        format_correctness = correct_format_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'format_correctness': format_correctness,
            'evaluation_log': evaluation_log
        }

        return evaluation_results

In [None]:
math_evaluator = MATHEvaluator(system_prompt=None)

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

qwq_results = math_evaluator.run_evaluation(
    client=client,
    model="Qwen/QwQ-32B-Preview",
    n_questions=None,
    max_tokens=None
)
print(f'\nAccuracy: {qwq_results["accuracy"]}')

100%|██████████| 50/50 [32:51<00:00, 39.43s/it]


Accuracy: 0.64





Let's save the results to file:

In [None]:
import pickle

pickle.dump(qwq_results, open("qwq_results.pkl", "wb"))

Now, you can load the file even if you didn't create it:

In [None]:
!gdown 1_hEX_h7fj6FXH3lG5AMthJjiK1JwAU9J

Downloading...
From: https://drive.google.com/uc?id=1_hEX_h7fj6FXH3lG5AMthJjiK1JwAU9J
To: /content/qwq_results.pkl
  0% 0.00/415k [00:00<?, ?B/s]100% 415k/415k [00:00<00:00, 92.4MB/s]


In [None]:
import pickle
qwq_results = pickle.load(open("qwq_results.pkl", "rb"))

### Analyzing thoughts

We've prepared quite a large thought analysis and visualization script, so we decided not to include it here (please check it on GitHub if you're curious). Here, we'll only download it as `thought_analysis.py` and import from it.

A few words about what's happening in `thought_analysis.py`:

- First of all, if the solution has `<think>...</think>` markup inside, only the fragment between them is extracted. (We're only interested in the "internal" thinking process.)
- Then, solutions are divided into individual "thoughts" using the following heuristics:
  - `Alternatively`, `Wait`, `But wait`, `But let me check again`, `But let's verify`, and similar phrases mark the starts of new "thoughts". Note that they are typical indications of backtracking and solution branching. There may be more, of course.
  - Otherwise, a "thought" is a continuous range of paragraphs of length not less than `min_split_size` characters (we'll take `min_split_size=120`). Separate-line Latex formulas are always added to the previous thought.
- For each "thought", its length in tokens is calculated. For that, we need to supply the **tokenizer** that corresponds to the model which generated the solutions - in our case, **[QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)**. That's why you need a **Hugging Face access token**. If you don't have one yet, please register on HF, get a token, and upload it to Colab in a `hf_access_token` file.

  If you're ardently against registering to Hugging Face, you can supply `None` as `tokenizer`, but in this case you won't get correct token length of individual "thoughts".

- The **"tree of thoughts"** is constructed in the following way:
  - If a "thought" starts with `Alternatively`, `Wait`, or `But wait`, we query another LLM (**Llama-3.1-70B** by default) to determine which of the previous "thoughts" it continues. If it starts a completely different solution, the thought is connected to **root** (empty solution; the very start).
  - Otherwise, the thought is connected with the previous one.

- Finally, the tree is saved as `thought_analysis/thought_tree.png`, if you didn't change default output path.
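The first two steps above (keeping only the `<think>...</think>` fragment and splitting at backtracking phrases) can be sketched with regular expressions. This is a simplified illustration, not the code of `thought_analysis.py`: it omits the `min_split_size` and LaTeX handling, and the function names and the marker list are our assumptions.

```python
import re
from typing import List

# Phrases that, at the start of a line, are treated as opening a new thought.
# The list is illustrative; the real script may use more.
MARKERS = ["Alternatively", "But wait", "Wait", "But let me check again", "But let's verify"]


def extract_thinking(solution: str) -> str:
    """Keep only the text between <think> and </think>, if such markup is present."""
    match = re.search(r"<think>(.*?)</think>", solution, flags=re.DOTALL)
    return match.group(1) if match else solution


def split_into_thoughts(solution: str, markers: List[str] = MARKERS) -> List[str]:
    """Split at every line that starts with one of the marker phrases."""
    text = extract_thinking(solution).strip()
    alternatives = "|".join(re.escape(m) for m in markers)
    # Zero-width lookahead: split *before* the marker so it stays with its thought
    pattern = re.compile(rf"^(?=(?:{alternatives})\b)", flags=re.MULTILINE)
    return [part.strip() for part in pattern.split(text) if part.strip()]


sample = (
    "<think>First, set up the equation.\n\n"
    "Wait, that's wrong.\nLet me redo it.\n\n"
    "Alternatively, use symmetry.</think>The answer is 5."
)
for i, thought in enumerate(split_into_thoughts(sample), start=1):
    print(i, repr(thought))
```

Splitting on empty (zero-width) matches requires Python 3.7 or newer.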

In [None]:
!pip install -q openai

In [None]:
!curl -o thought_analysis.py https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic2/thought_analysis.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20549  100 20549    0     0  67787      0 --:--:-- --:--:-- --:--:-- 68043


Let's also download a sample solution for us to work with.

In [None]:
!curl -o sample_solution.txt https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic2/sample_solution.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11280  100 11280    0     0  37244      0 --:--:-- --:--:-- --:--:-- 37227


In [None]:
from openai import OpenAI
from thought_analysis import analyze_solution_thoughts
from transformers import AutoTokenizer

with open("hf_access_token", "r") as file:
    hf_access_token = file.read().strip()

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY")
)
model = "meta-llama/Meta-Llama-3.1-70B-Instruct"

reasoning_model = "deepseek-ai/DeepSeek-R1"
# `token` is the keyword argument that from_pretrained uses for authentication
tokenizer = AutoTokenizer.from_pretrained(reasoning_model,
                                          token=hf_access_token)

file_path = "sample_solution.txt"  # Path to the solution file
output_dir = "thought_analysis"  # Directory to save results

with open(file_path, 'r', encoding='utf-8') as file:
    solution_text = file.read()


connections, viz_path, summary_path = analyze_solution_thoughts(
    solution_text,
    client=client,
    model=model,
    tokenizer=tokenizer,
    output_dir=output_dir,
    min_split_size=120
)

Split solution into 20 thought fragments
Calculated token counts for all thoughts
Finding connections between thoughts...


 15%|█▌        | 3/20 [00:00<00:03,  4.82it/s]

#ID: 1


 20%|██        | 4/20 [00:01<00:04,  3.45it/s]

#ID: 2


 25%|██▌       | 5/20 [00:01<00:05,  2.76it/s]

#ID: 3


 35%|███▌      | 7/20 [00:02<00:03,  3.27it/s]

#ID: 5


 50%|█████     | 10/20 [00:02<00:02,  3.99it/s]

#ID: 8


 55%|█████▌    | 11/20 [00:03<00:02,  3.17it/s]

#ID: 9


 60%|██████    | 12/20 [00:03<00:03,  2.65it/s]

#ID: 0


 65%|██████▌   | 13/20 [00:04<00:03,  2.30it/s]

#ID: 0


 75%|███████▌  | 15/20 [00:05<00:02,  2.49it/s]

#ID: 12


 80%|████████  | 16/20 [00:05<00:01,  2.12it/s]

#ID: 14


 85%|████████▌ | 17/20 [00:06<00:01,  1.86it/s]

#ID: 15


100%|██████████| 20/20 [00:07<00:00,  2.69it/s]

#ID: 0
Creating visualizations...





Visualization saved to thought_analysis/thought_tree.png
Summary saved to thought_analysis/thought_summary.txt

Thought Tree Summary:
---------------------
Total thoughts: 21
Root thought: 1
Branch nodes (with multiple connections): 4
Regular thoughts: 15
Final answer thought ID: 20

Token Statistics:
-----------------
Total tokens: 2852
Average tokens per thought: 142.6

Token visualizations saved to thought_analysis
Analysis complete! Results saved to thought_analysis


We can look closely at the connections or just check the `.png` file.

In [None]:
connections

[{'id': 0, 'text': 'ROOT', 'connects_to': None, 'token_count': 1},
 {'id': 1,
  'text': "Okay, so I need to find the radius of a circle that has two parallel chords, 6 units apart, one with length 14 and the other with length 10. Hmm, let me visualize this. There's a circle, and inside it, two chords that are parallel. The distance between them is 6 units. The longer chord is 14 units, and the shorter one is 10 units. The problem is asking for the radius of the circle.",
  'connects_to': 0,
  'token_count': 97},
 {'id': 2,
  'text': 'First, I remember that in a circle, the perpendicular distance from the center to a chord can be found using the Pythagorean theorem. Specifically, if you have a chord of length \\( 2a \\), then the distance \\( d \\) from the center to the chord is related to the radius \\( r \\) by the equation \\( r^2 = a^2 + d^2 \\). So, if I can find the distances from the center to each chord, then maybe I can set up equations and solve for the radius.',
  'connects_

**Your task**. Take some other solutions and construct their trees of thoughts.

You may also play with solutions generated by **DeepSeek R1**. We created several of them for you:

- 10 from the MATH benchmark,
- 10 from the AIME benchmark,
- 2 from the FrontierMath benchmark.

Just beware that these solutions will be much, much longer than the solutions by **QwQ**.

You can download them like this:

In [None]:
!gdown 1TpROB-8XAE6z1OlTfli7XB6YoY6WQi18
!unzip deepseek_solutions.zip -d deepseek_solutions/

Downloading...
From: https://drive.google.com/uc?id=1TpROB-8XAE6z1OlTfli7XB6YoY6WQi18
To: /content/deepseek_solutions.zip
  0% 0.00/168k [00:00<?, ?B/s]100% 168k/168k [00:00<00:00, 65.4MB/s]
Archive:  deepseek_solutions.zip
  inflating: deepseek_solutions/aime0.txt  
  inflating: deepseek_solutions/aime1.txt  
  inflating: deepseek_solutions/aime2.txt  
  inflating: deepseek_solutions/aime3.txt  
  inflating: deepseek_solutions/aime4.txt  
  inflating: deepseek_solutions/aime5.txt  
  inflating: deepseek_solutions/aime6.txt  
  inflating: deepseek_solutions/aime7.txt  
  inflating: deepseek_solutions/aime8.txt  
  inflating: deepseek_solutions/aime9.txt  
  inflating: deepseek_solutions/fm0.txt  
  inflating: deepseek_solutions/fm1.txt  
  inflating: deepseek_solutions/math11.txt  
  inflating: deepseek_solutions/math12.txt  
  inflating: deepseek_solutions/math22.txt  
  inflating: deepseek_solutions/math24.txt  
  inflating: deepseek_solutions/math26.txt  
  inflating: deepseek_sol

And, of course, feel free to create your own solutions!

### Size of Branches of Thoughts

**This is a task for you**! Create a simplified function

```
thoughts, token_counts = decompose_solution_thoughts(sample_text,
                                                     tokenizer=tokenizer)
```

that, given a solution `sample_text` and a tokenizer, returns:

- `thoughts` which is a split of `sample_text` by exactly `Alternatively`, `Wait`, `But wait` (you may add some of their synonyms you'll spot in the solutions). So, we only keep track of the **whole branches of the "tree of thoughts"** here.
- `token_counts` - the number of tokens in each of these "branches".
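For the `token_counts` part, a possible starting point is sketched below. It assumes the fragments are already split and that the tokenizer is a Hugging Face tokenizer, whose `encode` method returns a list of token ids. The helper name `count_tokens` and the word-count fallback for `tokenizer=None` are our assumptions.

```python
from typing import List


def count_tokens(thoughts: List[str], tokenizer=None) -> List[int]:
    """Token count per fragment; without a tokenizer, fall back to counting words."""
    if tokenizer is None:
        # Rough approximation only: words are not tokens
        return [len(t.split()) for t in thoughts]
    # Special tokens (e.g. BOS) are excluded so that they don't inflate the counts
    return [len(tokenizer.encode(t, add_special_tokens=False)) for t in thoughts]


print(count_tokens(["Wait, let me check again.", "Alternatively, use symmetry."]))
```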

Now, we suggest exploring the size of these branches in tokens. Create a histogram of branch sizes. What can you say about its shape? Take a look at several extremely long branches: why do you think they are so long? Check the shortest branches. What happens there?

In [None]:
# <YOUR EXPERIMENTS HERE>

### LLM underthinking

This part is inspired by [Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs](https://arxiv.org/pdf/2501.18585). The paper investigates the connection between the length of thought branches (fragments of a solution between "Alternatively", "Wait", etc.) and solution accuracy. The authors found that in many cases LLMs abandon promising solutions, cutting thought branches short before they can come to fruition, and this might contribute to the failure of the whole solution.

**Your task**: plot two histograms of branch lengths on one figure: one for tasks with a correct answer and one for tasks with an incorrect answer. You can find information about answer correctness in `qwq_results["evaluation_log"]` (`"is_correct"` fields).

Since there is a different number of correct and incorrect answers in the data, we recommend normalizing the histograms so that they show frequency instead of count. This may be done by setting `density=True`. Do you see any specific patterns?
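To see what density normalization does, here is a plain-Python sketch of the computation: each bar height is `count / (total_count * bin_width)`, so the bars of each histogram integrate to 1 regardless of how many samples there are. The function name and the branch lengths below are made up for illustration, and the sketch assumes at least one value falls inside the bins.

```python
from typing import List, Sequence


def density_histogram(values: Sequence[float], bin_edges: Sequence[float]) -> List[float]:
    """Bar heights with density normalization: count / (total * bin_width)."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last_bin = i == len(counts) - 1
            # Bins are half-open [left, right), except the last, which includes its right edge
            if bin_edges[i] <= v < bin_edges[i + 1] or (last_bin and v == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / (total * (bin_edges[i + 1] - bin_edges[i])) for i, c in enumerate(counts)]


edges = [0, 100, 200, 300]
correct = [50, 120, 150, 250, 260, 270, 280, 290]  # 8 made-up branch lengths
incorrect = [20, 40, 60, 180]                      # 4 made-up branch lengths
print(density_histogram(correct, edges))
print(density_histogram(incorrect, edges))
```

With raw counts, the group of 8 would dominate the plot; after normalization the two shapes are directly comparable.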

In [None]:
# <Your experiments here>