<a href="https://colab.research.google.com/github/Silvmike/LLM-Engineering-Essentials/blob/main/topic2/2.3_intro_to_llm_reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course github: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

The course is in development now, with more materials coming soon. [Subscribe to stay updated](https://academy.nebius.com/llm-engineering-essentials/update/)

# 2.3. Intro to LLM Reasoning

LLMs' reasoning ability has long fascinated people, drawing attention for its practical implications and sparking intriguing theoretical questions. In this series of notebooks, we'll explore what reasoning brings to the table, how our understanding of it has evolved over time, and why it's been generating so much hype lately.

The plan is:

* **R1. Intro to LLM Reasoning**. What LLM reasoning is, for which tasks it is useful (and for which it is not). Have LLMs really learnt to "think" like humans (not exactly).
* **R2. Inference-time compute**. How to make an LLM smarter with orchestration.
* **R3. Establishing non-linear reasoning capabilities**: how DeepSeek R1 was trained and some other approaches.

## Getting ready

In [None]:
!pip install -q openai

In [None]:
from google.colab import userdata
import os

os.environ['NEBIUS_API_KEY'] = userdata.get("nebius_api_key")

We'll be calling APIs quite often in this notebook, so let's define a shortcut function to avoid repeating all the code. Also, we'll prettify the output so that it can be viewed without scrolling right.

In [None]:
from openai import OpenAI

# Nebius uses the same OpenAI() class, but with additional details
nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_8b_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def prettify_string(text, max_line_length=80):
    """Prints a string with line breaks at spaces to prevent horizontal scrolling.

    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_8b_model,
                    prettify=True,
                    temperature=0.6) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content

# What is LLM Reasoning

There are several ways in which LLMs may respond to your queries.

Sometimes, they just give a direct answer.

In [None]:
result = answer_with_llm("""Who is the main character of the Mistborn trilogy?""",
                         model="meta-llama/Meta-Llama-3.1-405B-Instruct")
print(result)

However, if you give a math task to one of today's LLMs, you'll notice that it doesn't just spit out an answer - it provides a full solution instead.

In [None]:
result = answer_with_llm("""In the fantasy world of Xu, they have unique math system:
- "a + b" means min(a,b)
- "a*b" means a + b
Solve the equation x*x + 2*x + 1 = 0""",
                         model="meta-llama/Meta-Llama-3.1-405B-Instruct")
print(result)

**Note**. If you're interested in math, Xu's math system is actually called [Tropical Geometry](https://en.wikipedia.org/wiki/Tropical_geometry).
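To check whatever answer the model gives, here is a minimal sketch that evaluates the left-hand side under Xu's rules. It assumes the usual precedence (`*` binds tighter than `+`) and treats the constants 2, 1, and 0 as ordinary numbers. The helper name `xu_lhs` and the search grid are our own choices for illustration.

```python
def xu_lhs(x: float) -> float:
    """Left-hand side of x*x + 2*x + 1 under Xu's rules.

    "a*b" means a + b and "a + b" means min(a, b),
    so the expression becomes min(x + x, 2 + x, 1).
    """
    return min(x + x, 2 + x, 1)

# Scan a grid of candidates in [-5, 5] and keep those where the left-hand side equals 0
candidates = [k / 10 for k in range(-50, 51)]
solutions = [x for x in candidates if xu_lhs(x) == 0]
print(solutions)
```

A grid scan is enough here because the left-hand side is negative for every negative `x` and positive for every positive `x`.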

Back in 2023, LLM researchers and users noticed that, at least for mathematical tasks, LLMs produced more accurate answers when generating a full solution rather than simply providing a direct answer. At that time, LLMs didn't always automatically write solutions, and you might have needed to prompt them with `"Take a deep breath and solve this problem step by step."`

Getting detailed, step-by-step solutions was an exciting development. Such solutions became known as **Chains of Thought (CoT)**, and wide discussion began about **LLM reasoning capabilities**.

Today, math reasoning is a standard out-of-the-box capability of the vast majority of LLMs. But now, a more powerful paradigm is gaining popularity: **non-linear reasoning**.

### Non-linear reasoning

Soon after Chains of Thought (CoT) emerged, it became clear that they were not enough. Indeed, the CoT paradigm assumes that an LLM is able to generate the correct solution on the first attempt, while the human way of thinking involves checking several ideas, experimenting, self-criticizing, and backtracking before arriving at the final solution.

So, the **non-linear reasoning** approach was born.

For a couple of years, non-linear reasoning was achieved with the help of orchestration. Mechanisms such as [Tree of Thoughts](https://arxiv.org/pdf/2305.10601) or [Graph of Thoughts](https://arxiv.org/pdf/2308.09687) were suggested for solving complex problems. We'll discuss them in detail in the **Inference-time compute** notebook.

<center>
<img src="https://drive.google.com/uc?export=view&id=1WZWjI7aY3Vu0zEsAO8u7R73iwsC6KJeq" width=600 />

[Source](https://arxiv.org/pdf/2308.09687)
</center>

A general idea of such approaches is to generate a solution step by step (one prompt = one logical step, unlike CoT), while exploring several reasoning paths and somehow scoring individual steps or whole branches to eventually select the optimal reasoning path.
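The general idea can be sketched as a beam-style search over partial solutions. This is not the algorithm from either paper, only an illustration of "propose steps, score them, keep the best branches". The `propose` and `score` callables are hypothetical: in a real orchestration each of them would be an LLM call, while here they are replaced by a toy arithmetic puzzle so that the code runs offline.

```python
from typing import Callable, List, Tuple

State = Tuple[int, List[str]]  # (current value, steps taken so far)

def beam_search(root: State,
                propose: Callable[[State], List[State]],
                score: Callable[[State], float],
                depth: int,
                beam_width: int) -> State:
    """Expand every kept state, score the children, keep the best `beam_width`."""
    beam = [root]
    for _ in range(depth):
        children = [child for state in beam for child in propose(state)]
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

# Toy stand-ins for the LLM calls: reach 10 starting from 1 using "+1" and "*2"
def toy_propose(state: State) -> List[State]:
    value, steps = state
    return [(value + 1, steps + ["+1"]), (value * 2, steps + ["*2"])]

def toy_score(state: State) -> float:
    return -abs(10 - state[0])  # closer to 10 is better

best = beam_search((1, []), toy_propose, toy_score, depth=5, beam_width=2)
print(best)
```

Keeping more than one branch alive is what distinguishes this from a single chain: a step that looks slightly worse now can still be extended later.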

However, as often happens in Machine Learning, orchestration strategies eventually give way to end-to-end ones. And it seems that we're almost at the point where LLMs are able to perform non-linear reasoning on their own.

To illustrate this, let's compare outputs of Phi-4, Llama, and DeepSeek R1, which is a popular non-linear reasoning model.

**The task is:** Imagine that my binary classifier got recall 0.8 on a dataset with balanced classes (same number of class 0 and class 1 objects). What could be its minimal and maximal precision?

<details>
    <summary> Click to see the solution </summary>

Let $x$ be the number of class 1 objects. Then recall 0.8 means that 80% of them are classified as class 1 (that's TP) and 20% as class 0 (that's FN). Let's populate the magic table:

|                | Classified as class 1 | Classified as class 0 |
| :---------------- | :------: | ----: |
| Class 1        |   $0.8x$   | $0.2x$ |
| Class 0           |   ???   | ??? |

Since the dataset is balanced, Class 0 also contains $x$ elements. So, we get some

|                | Classified as class 1 | Classified as class 0 |
| :---------------- | :------: | ----: |
| Class 1        |   $0.8x$   | $0.2x$ |
| Class 0           |   $\alpha x$   | $(1 - \alpha)x$ |

where $0\leqslant \alpha \leqslant 1$ (and that's all we know about $\alpha$). Now, the precision is
$$\frac{0.8x}{0.8x + \alpha x} = \frac{0.8}{0.8 + \alpha},\quad 0\leqslant\alpha\leqslant1$$

Now, some math establishes the answer:
$$0\leqslant\alpha\leqslant1 \Rightarrow 0.8\leqslant 0.8 + \alpha\leqslant 1.8 \Rightarrow$$

$$\Rightarrow\frac1{1.8}\leqslant\frac1{0.8 + \alpha} \leqslant \frac1{0.8}
\Rightarrow \frac49=\frac{0.8}{1.8}\leqslant\frac{0.8}{0.8 + \alpha} \leqslant \frac{0.8}{0.8} = 1$$
</details>
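If you want to double-check the bounds numerically after reading the solution, a minimal sketch is to sweep $\alpha$, the share of class 0 objects classified as class 1, over a grid and look at the extremes. The function name `precision` and the grid step are our own choices.

```python
def precision(recall: float, alpha: float) -> float:
    """Precision on a balanced dataset.

    alpha is the share of class 0 objects that are classified as class 1,
    so precision = recall / (recall + alpha).
    """
    return recall / (recall + alpha)

recall = 0.8
# Sweep alpha from 0 to 1 in steps of 0.01
values = [precision(recall, a / 100) for a in range(101)]
print(f"min precision: {min(values):.4f}")
print(f"max precision: {max(values):.4f}")
```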

In [None]:
result = answer_with_llm("""Imagine that my binary classifier got recall 0.8 on a dataset with balanced classes (same number of class 0 and class 1 objects).
What could be its minimal and maximal precision?""",
                model="microsoft/phi-4",
                system_prompt=None,
                max_tokens=4096)
print(result)

In [None]:
result = answer_with_llm("""Imagine that my binary classifier got recall 0.8 on a dataset with balanced classes (same number of class 0 and class 1 objects).
What could be its minimal and maximal precision?""",
                model="meta-llama/Meta-Llama-3.1-70B-Instruct",
                system_prompt=None,
                max_tokens=4096)
print(result)

In [None]:
result = answer_with_llm("""Imagine that my binary classifier got recall 0.8 on a dataset with balanced classes (same number of class 0 and class 1 objects).
What could be its minimal and maximal precision?""",
                model="deepseek-ai/DeepSeek-R1",
                system_prompt=None,
                max_tokens=4096)
print(result)

Let's briefly analyze the outputs.

* **Phi-4** by Microsoft and **Llama-3.1-70B** provide typical Chain-of-Thought solutions.
* **DeepSeek R1**'s reasoning features several backtracking episodes which definitely characterize it as non-linear:

  ```
  Okay, let's try to figure out the minimal and maximal precision for a binary
classifier....

  But wait, the classifier's predictions are also influenced by how many class 0
samples it classifies correctly or incorrectly...

  Wait, but can the classifier actually have FP=50?..

  Alternatively, perhaps there's a confusion matrix here...

  Wait, but wait, when FP is zero, that means the classifier predicted all class
0 samples as class 0.... --> FINAL ANSWER
  ```

Note also that **DeepSeek R1** outputs reasoning in `<think>...</think>`, and only after that it gives the final solution.
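When you only need the final solution, the reasoning part can be separated from it. Below is a minimal sketch, assuming the response contains a single `<think>...</think>` block followed by the answer. The helper name `split_reasoning` and the sample string are ours.

```python
import re

def split_reasoning(response: str):
    """Split a response into (reasoning, final_answer).

    Assumes at most one <think>...</think> block; if there is none,
    the whole response is treated as the final answer.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match is None:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

# Made-up sample, not a real model output
sample = "<think>\nWait, can FP be zero?\nYes.\n</think>\n\nThe precision is between 4/9 and 1."
reasoning, final_answer = split_reasoning(sample)
print(final_answer)
```

If a response is cut off by `max_tokens` before `</think>` appears, this sketch returns the whole text as the answer, so it is worth checking for that case separately.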


There are already quite a lot of non-linear reasoning models, both proprietary and open source. They include:
  - OpenAI's **o1** and **o3**.
  - Anthropic's **Claude 3.7 Sonnet**
  - Google's Experimental Thinking **Gemini 1.5** and **2**
  - **Grok 3** by xAI
  - **DeepSeek R1**, the open-source model that generated a lot of hype for several reasons: its reported training and inference costs were strikingly low, and its training strategy is both fairly transparent and unexpected. We'll discuss it in the third of the reasoning-related notebooks.
  - **QWQ** by Alibaba.

In this notebook and in the next two, we'll investigate non-linear reasoning, both native and orchestrated.

**Note**. We used an empty `system_prompt` for a reason. It didn't affect the Machine Learning theory task, because LLMs don't seem to recognize it as a math task, but notice the difference in output patterns for a math task with and without the helpful assistant system prompt:

In [None]:
result = answer_with_llm("""Inside a circle, two parallel chords are 6 units apart. One chord has length 14 and the other has length 10. Find the radius of the circle.""",
                model="meta-llama/Meta-Llama-3.1-70B-Instruct",
                system_prompt="You are a helpful assistant",
                max_tokens=4096)
print(result)

In [None]:
result = answer_with_llm("""Inside a circle, two parallel chords are 6 units apart. One chord has length 14 and the other has length 10. Find the radius of the circle.""",
                model="meta-llama/Meta-Llama-3.1-70B-Instruct",
                system_prompt=None,
                max_tokens=4096)
print(result)

# When is reasoning useful?

It is clear that reasoning significantly boosts LLM performance on certain tasks. LLMs incorporating non-linear reasoning have achieved remarkable breakthroughs on several benchmarks once thought to be beyond the reach of AI. In particular, the emergence of **o3** has shaken two of the most challenging benchmarks: [ARC-AGI](https://arcprize.org/arc) and [FrontierMath](https://epoch.ai/frontiermath).

<center>
<img src="https://pbs.twimg.com/media/Gi03TkpbMAAun6w?format=jpg&name=large" width=600 />

[Source](https://x.com/andrewwhite01/status/1886225029006062051)
</center>

However, while reasoning is great in math tasks, in some other cases it may be useless or even harmful. [To CoT or not to CoT](https://arxiv.org/pdf/2409.12183) is one of the papers investigating that. The authors did a number of experiments on different tasks and came to a conclusion that

* CoT is (unsurprisingly) quite useful in math tasks and tasks involving symbolic computations.
* CoT is not useful in tasks that check factual knowledge or involve commonsense reasoning.

<center>
<img src="https://drive.google.com/uc?export=view&id=1pQYlQOrOLPyv9EgGBdQTVivNtR5WUnnn" width=600 />

[Source](https://arxiv.org/pdf/2409.12183)
</center>

Another insightful paper showing the downsides of CoT, this time in visual tasks, is [Mind your Step (by Step): Chain-of-Thought can reduce performance on tasks where thinking makes humans worse](https://arxiv.org/pdf/2410.21333). The name is quite self-explanatory. Among other tasks, the authors consider facial recognition: both humans and multimodal LLMs do it better when not prompted to reason.

Let's also run some experiments!

We'll use the [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu), which contains tasks in many areas, from International Law to Abstract Algebra. Let's check the scores of **Llama-3.1-8B**, **Llama-3.1-70B**, and **Qwen-2.5-32B** in **High School Math** and **High School History** in two modes:

* First, when the LLM is prompted to only give the answer,
* Second, when the LLM is prompted to perform a step-by-step reasoning before giving the final answer.

We'll create an `MMLUEvaluator` class to streamline evaluation. If you create a similar evaluation class, be careful with the `max_tokens` parameter in the `answer_with_llm` function. It should be large enough; otherwise solutions may be cut short, resulting in surprisingly low accuracy with CoT. This is especially true for non-linear reasoning models.
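One way to notice truncated solutions is to look at the `finish_reason` field of the chat completion response, which the OpenAI API sets to `"length"` when generation stops at the token limit. We assume here that the Nebius endpoint behaves the same way. The sketch below uses a stand-in object instead of a real API call, and the helper name `was_truncated` is ours.

```python
from types import SimpleNamespace

def was_truncated(completion) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return completion.choices[0].finish_reason == "length"

# Stand-in mimicking the shape of a chat completion response,
# so the sketch runs without an API key
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(
        finish_reason="length",
        message=SimpleNamespace(content="Step 1: let's consider..."),
    )]
)

if was_truncated(fake_completion):
    print("The solution was cut short; consider increasing max_tokens.")
```

Counting how many responses were truncated during an evaluation run helps to tell a weak model apart from a `max_tokens` value that is too small.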

In [None]:
!pip install -q datasets

In [None]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm

from datasets import load_dataset
from openai import OpenAI

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

class MMLUEvaluator:
    def __init__(self, system_prompt: str = None, prompt: str = None,
                 topic: str = "high_school_mathematics"):
        """
        Initialize the MMLU evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
            topic: Which topic to choose
        """

        self.topic = topic
        self.topic_prettified = topic.replace("_", " ")
        self.system_prompt = system_prompt or f"You are an expert in {self.topic_prettified}."

        if not prompt:
            self.prompt = """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""
        else:
            self.prompt = prompt

        self.questions, self.choices, self.answers = self.load_mmlu_data(topic=self.topic)

    def load_mmlu_data(self, topic: str) -> Tuple[pd.Series, pd.DataFrame, pd.Series]:
        """
        Load MMLU test data on a given topic.

        Args:
            topic: Which topic to choose

        Returns:
            Tuple of (questions, choices, answers)
        """

        dataset = load_dataset("cais/mmlu", topic, split="test")
        dataset = pd.DataFrame(dataset)

        # Load questions and choices separately
        questions = dataset["question"]
        choices = pd.DataFrame(
            data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
        )
        # In the dataset, true answer labels are in 0-3 format;
        # We convert it to A-D
        answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

        return questions, choices, answers

    def extract_answer(self, solution: str) -> str:
        """
        Extract the letter answer from the model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer letter (A, B, C, D) or "Failed to parse"
        """
        # Both prompts ask the model to finish with the answer letter,
        # so take the last character, ignoring trailing whitespace and periods
        cleaned = solution.strip().rstrip('.').rstrip()
        if not cleaned:
            return "Failed to parse"
        return cleaned[-1]

    def evaluate_single_question(self, question: str, choices: Dict[str, str],
                                 correct_answer: str,
                                 client, model) -> Tuple[bool, str, str]:
        """
        Evaluate a single question.

        Args:
            question: Question text
            choices: Answer options keyed by "A", "B", "C", "D"
            correct_answer: Correct answer letter
            client: API client to use
            model: Which model to use

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    topic_prettified=self.topic_prettified,
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D']
                ),
                client=client, model=model,
                system_prompt=self.system_prompt,
                max_tokens=4096,
                temperature=0.
            )
            answer = self.extract_answer(model_response)
            is_correct = (answer.upper() == correct_answer.upper())
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client: OpenAI=nebius_client,
                       model: str="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       n_questions: int = 50) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0

        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                choices=self.choices.iloc[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
            )

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'evaluation_log': evaluation_log
        }

        return evaluation_results

We'll create different prompts for No-CoT and CoT scenarios:

In [None]:
evaluation_results = {}

prompts = {"No CoT": """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
Output only the correct answer label, one of the letters A, B, C, or D.
Only output one letter - A, B, C, or D.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
""",
"With CoT": """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""}

Finally, let's look at the numbers:

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

for topic in ["high_school_world_history", "high_school_mathematics"]:
    for mode in ["No CoT", "With CoT"]:

        evaluator = MMLUEvaluator(topic=topic,
                          prompt=prompts[mode])

        results = evaluator.run_evaluation(
            client=client,
            model="meta-llama/Meta-Llama-3.1-70B-Instruct",
            n_questions=50
            )
        evaluation_results[(topic, mode)] = results["accuracy"]
        print(f"For topic {topic}, mode {mode}")
        print(f'\nAccuracy: {results["accuracy"]}')

In [None]:
for topic in ["high_school_world_history", "high_school_mathematics"]:
    for mode in ["No CoT", "With CoT"]:

        evaluator = MMLUEvaluator(topic=topic,
                          prompt=prompts[mode])

        results = evaluator.run_evaluation(
            client=client,
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
            n_questions=50
            )
        evaluation_results[(topic, mode)] = results["accuracy"]
        print(f"For topic {topic}, mode {mode}")
        print(f'\nAccuracy: {results["accuracy"]}')

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

for topic in ["high_school_world_history", "high_school_mathematics"]:
    for mode in ["No CoT", "With CoT"]:

        evaluator = MMLUEvaluator(topic=topic,
                          prompt=prompts[mode])

        results = evaluator.run_evaluation(
            client=client,
            model="Qwen/Qwen2.5-32B-Instruct",
            n_questions=50
            )
        evaluation_results[(topic, mode)] = results["accuracy"]
        print(f"For topic {topic}, mode {mode}")
        print(f'\nAccuracy: {results["accuracy"]}')

As you can see, while the effect of CoT in **High School Math** is very significant, for **High School World History** it doesn't improve anything, and in some experiment runs it may even slightly worsen the result.

Of course, one experiment is not enough to *establish* a law, but it's a good illustration.
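To get a feeling for how much uncertainty 50 questions leave, one option is to compute a confidence interval for the accuracy. Below is a minimal sketch using the Wilson score interval at roughly 95% confidence. The counts in the example are hypothetical, not results from the runs above.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Hypothetical outcome: 40 correct answers out of 50
low, high = wilson_interval(40, 50)
print(f"accuracy 0.80, interval: [{low:.3f}, {high:.3f}]")
```

With only 50 questions the interval spans around ten percentage points in each direction, so small differences between modes should not be over-interpreted.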

Another downside of reasoning models is that their output tends to get bloated even for relatively simple tasks for which other models would give a shorter and straightforward solution. This, of course, doesn't help the fact that they are generally slow and expensive.

A couple of examples:

Who is the main character of the Mistborn trilogy?

In [None]:
result = answer_with_llm("""Who is the main character of the Mistborn trilogy?""",
                         model="deepseek-ai/DeepSeek-R1")
print(result)

In [None]:
result = answer_with_llm("""Jack the Sparrow has 12 sailors and Davy Jones has 125 sailors.
If they join their crews, how many sailors will they have together?""",
                         model="deepseek-ai/DeepSeek-R1")
print(result)

I can't help thinking that **R1** is making fun of me with the sailor example... It even failed to answer within a reasonable `max_tokens` limit. But don't assume that **R1** will be as indecisive with arithmetic in more challenging problems. It's just that the LLM is trained to produce long solutions. We'll return to this idea in our third notebook.

# Why does LLM reasoning work?

It's very tempting to say "Because LLMs learn to reason like humans". But is it really so? In this section we'll discuss several papers investigating this.

## Let's reason dot by dot

The authors of the [Let's reason dot by dot](https://arxiv.org/pdf/2404.15758) paper conducted a curious experiment. They fine-tuned an LLM to output dots (literally `"."` tokens) instead of actual reasoning tokens:

<center>
<img src="https://drive.google.com/uc?export=view&id=1G9g8zDC1wsr9YvdXC49c2nebb3Og38by" width=400 />
</center>

You'd expect that such a "stupid", dot-minded model wouldn't be good at anything, but in reality it performs significantly better than the same LLM fine-tuned to give an immediate answer without reasoning.

<center>
<img src="https://drive.google.com/uc?export=view&id=1VCz8aDXVckgG5Eh08xMNjoFbUROvBX7i" width=600 />
</center>

<details>
<summary> Click to see the description of the 3SUM task
</summary>

We'll have to explain a few things first. $\mathbb{Z}_{10}$ is the group of **remainders modulo $10$**, that is $\mathbb{Z}_{10} = \{\overline{0}, \overline{1},\ldots, \overline{9}\}$, where addition is performed modulo $10$. For example,
$$\overline{3} + \overline{4} = \overline{7},$$
$$\overline{7} + \overline{8} = \overline{5},$$
because $7 + 8 = 15 \equiv 5\, (mod\, 10)$ (meaning: 15 and 5 give same remainders when divided by 10: `15%10 = 5%10`).
Now, in the 3SUM task we're given a tuple $(x_0,\ldots,x_n)$ of pairs $x_i = (x_i', x_i'')\in\mathbb{Z}_{10}\times\mathbb{Z}_{10}$.

The task is to determine whether there are three pairs $x_i, x_j, x_k$ among them such that
$$x_i + x_j + x_k = 0\,(mod\,10).$$

For example, if we have pairs
$$x_0 = (\overline{5}, \overline{4}),$$
$$x_1 = (\overline{7}, \overline{3}),$$
$$x_2 = (\overline{0}, \overline{1}),$$
$$x_3 = (\overline{8}, \overline{3}),$$
then
$$x_0 + x_1 + x_3 = (\overline{5} + \overline{7} + \overline{8},\
\overline{4} + \overline{3} + \overline{3}) = (\overline{0}, \overline{0}),$$
because
$$5 + 7 + 8 = 20 \equiv 0\, (mod\, 10)$$
$$4 + 3 + 3 = 10 \equiv 0\, (mod\, 10)$$

This task is good, because you can't solve it in one pass over the dataset, it really requires some computations.

The figure above shows that, as $n$ grows, an immediately-answering model performs worse and worse, while the dot-reasoning model keeps the same quality.

</details>
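For reference, the 3SUM task described in the collapsed block above can be solved by brute force in a few lines. This is a minimal sketch, assuming the three pairs must have distinct indices. The function name `has_3sum` is ours, and the example pairs are the ones from the description.

```python
from itertools import combinations

def has_3sum(pairs, modulus=10):
    """Check whether some three distinct pairs sum to (0, 0) componentwise modulo `modulus`."""
    for a, b, c in combinations(pairs, 3):
        first = (a[0] + b[0] + c[0]) % modulus
        second = (a[1] + b[1] + c[1]) % modulus
        if first == 0 and second == 0:
            return True
    return False

pairs = [(5, 4), (7, 3), (0, 1), (8, 3)]
print(has_3sum(pairs))
```

The loop goes through all triples, so the amount of work grows cubically with the number of pairs, which gives an idea of why a single immediate answer becomes harder as $n$ grows.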

**Takeaways**. It turns out that it's not absolutely necessary for an LLM to output human-readable reasoning in order to solve problems. This suggests the hypothesis that the real "thought" process happens somewhere deep inside the LLM, in the realm of vectors and matrices. If that's true, the neat verbal reasoning might be more of a byproduct. A bit later, we'll discuss a paper which leverages this, taking LLM reasoning even further from human readability. But now, let's do an experiment of our own!

Let's run a very naive experiment. In the original paper, the authors fine-tuned the model, but we'll just prompt **Llama-3.1-8B** to output dots instead of the actual reasoning.

In [None]:
dot_prompt = """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
Instead of reasoning, output dots. Then, after #ANSWER: only output one of the letters - A, B, C, or D, the correct answer label.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""

dot_evaluator = MMLUEvaluator(topic="high_school_mathematics",
                          prompt=dot_prompt)

dot_results = dot_evaluator.run_evaluation(
    client=client,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    n_questions=50)
print(f'\nAccuracy: {dot_results["accuracy"]}')

Of course, the accuracy is much lower than with CoT, but it's still quite consistently 10 percentage points higher than with immediate answering (26% -> 36%)!

Let's also check that the model really outputs dots. (It does!)

In [None]:
dot_results['evaluation_log'][0]

In [None]:
dot_results['evaluation_log'][1]

There seems to be no connection between the number of dots and the length of the actual CoT solution. In our experiments, the number of dots remained stable for each problem, which is probably not very surprising given that the temperature is 0, but still curious.

Please, also note that this experiment, although fun, doesn't reproduce well with other LLMs. So, there is something peculiar about Llama-3.1-8B here.

# Coconut (Chain of Continuous Thought)

In the [Training Large Language Models to Reason in a Continuous Latent Space](https://arxiv.org/pdf/2412.06769) paper by Meta, the authors try to totally get rid of human-readable reasoning.

To understand their idea, let's recall a couple of things about LLM architecture and generation process.

**1. Token embeddings**. Like any neural network, LLMs can only take vectors as inputs. So, instead of tokens themselves, LLMs consume their **vector embeddings**.

<center>
<img src="https://drive.google.com/uc?export=view&id=1-j02KittQhnV-feGdsh7A1bv2aJzwGJz" width=600 />
</center>

All the embeddings of a prompt go through a number of transformer blocks to become transformer **output** vectors (in grey). The last one of those is additionally passed through an **LM head** (also known as an **unembedding layer**), which is a multiclass classifier that predicts next token probabilities.

The newly generated token is then appended to the prompt and passed back to the LLM for it to produce the next one:

* ...
* $x_1x_2\ldots x_i\phantom{x_{i+1}}\longrightarrow x_{i+1}$
* $x_1x_2\ldots x_ix_{i+1}\longrightarrow x_{i+2}$
* ...

COCONUT suggests getting rid of completion tokens that correspond to the reasoning part of the completion. Instead, **the last transformer output vector (grey) becomes the new blue, i.e. is appended to the transformer input as the next embedding vector**:

<center>
<img src="https://drive.google.com/uc?export=view&id=1-MJLKdp2H443HoEVrHptbpcwE32j6XDe" width=800 />

[Source](https://arxiv.org/pdf/2412.06769)
</center>
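The difference between the two generation modes can be sketched with a toy model. Everything below is an illustrative assumption: the weights are random, `toy_transformer` replaces attention with a simple average, and the number of latent steps is fixed in advance. It is not the authors' implementation, only a picture of what is appended to the input at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 20, 8

# Toy stand-ins for the real components (random weights, no training)
embedding = rng.normal(size=(vocab_size, d_model))   # token id -> input vector
W_mix = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
lm_head = rng.normal(size=(d_model, vocab_size))     # output vector -> logits

def toy_transformer(inputs: np.ndarray) -> np.ndarray:
    """Map a (seq_len, d_model) input to the output vector of the last position.

    A real transformer uses attention; here we average the sequence
    and apply one non-linear layer.
    """
    return np.tanh(inputs.mean(axis=0) @ W_mix)

def generate(prompt_ids, n_latent_steps, n_answer_tokens):
    inputs = embedding[prompt_ids]                    # (seq_len, d_model)
    # Latent mode: the output vector is appended as the next input, no token is chosen
    for _ in range(n_latent_steps):
        hidden = toy_transformer(inputs)
        inputs = np.vstack([inputs, hidden])
    # Language mode: the usual loop, choose a token and append its embedding
    answer = []
    for _ in range(n_answer_tokens):
        hidden = toy_transformer(inputs)
        token = int(np.argmax(hidden @ lm_head))
        answer.append(token)
        inputs = np.vstack([inputs, embedding[token]])
    return answer, inputs.shape[0]

answer, seq_len = generate([1, 2, 3], n_latent_steps=4, n_answer_tokens=2)
print(answer, seq_len)
```

Note that in latent mode the LM head is never applied, so the intermediate "thoughts" are never forced into a single token.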


Of course, an LLM has to be trained for such a mode of operation, and the authors do it in several stages, starting with natural language reasoning and gradually replacing its steps with the steps of the new procedure (**“latent thoughts”**).

<center>
<img src="https://drive.google.com/uc?export=view&id=1u0jW8ALA6v4RUdDuNuxlQoh0mwQ5HX3A" width=800 />

[Source](https://arxiv.org/pdf/2412.06769)
</center>

They also introduce additional `<bot>` and `<eot>` (“beginning/end of thought”) tokens to indicate that it's time to start a thought or to go back to producing natural language output.

During the training process, the loss is masked on both questions and latent thoughts. This means that the latent thoughts aren't trained to repeat the initial natural language reasoning; only to facilitate further reasoning. Therefore, it's possible for the LLM to learn more effective representations of reasoning steps compared to human language.
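Loss masking itself is a standard trick and can be sketched as follows. This is a minimal NumPy illustration with made-up shapes and random numbers, not the authors' training code: the cross-entropy is averaged only over positions where the mask is on, so predictions at question and latent-thought positions don't contribute to the loss.

```python
import numpy as np

def masked_cross_entropy(logits, targets, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is True."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return token_losses[loss_mask].mean()

rng = np.random.default_rng(0)
# Hypothetical 6-position sequence: question, latent thoughts, remaining text
segments = np.array(["question", "question", "latent", "latent", "text", "text"])
loss_mask = segments == "text"

logits = rng.normal(size=(6, 10))
# Targets at masked positions are dummies: they never enter the loss
targets = rng.integers(0, 10, size=6)
loss = masked_cross_entropy(logits, targets, loss_mask)

# Changing predictions at masked positions leaves the loss untouched
perturbed = logits.copy()
perturbed[~loss_mask] += rng.normal(size=(4, 10))
print(np.isclose(loss, masked_cross_entropy(perturbed, targets, loss_mask)))
```

The latent thoughts still influence the loss indirectly, because the later text positions are computed from them.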

Because latent thoughts aren't mapped to tokens, the `<eot>` token isn't produced naturally at inference time. Instead, the authors suggest one of two ways of escaping latent reasoning:

1.	training a binary classifier on latent thoughts to predict `<eot>`/not `<eot>`,
2.	always padding the latent thoughts to a constant length.

The results are quite solid:

<center>
<img src="https://drive.google.com/uc?export=view&id=1u87jI_AStdWR0cXbvqcbniANF1OCsePg" width=800 />

[Source](https://arxiv.org/pdf/2412.06769)
</center>

# Practice: Exploring non-linear reasoning

If you encounter any difficulties or simply want to see our solutions, feel free to check the [Solutions notebook](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic2/r.1_intro_to_llm_reasoning_solutions.ipynb).

## Task 1. Mapping LLM "thoughts"

In this task, we'll look closer at the "thinking patterns" of LLMs:

- We'll look closer at an LLM's "tree of thoughts",
- We'll investigate how long the typical thoughts are,
- We'll explore the "underthinking" phenomenon.

To have something to experiment with, we'll run evaluation of **QwQ-32B-Preview** on a subset of [MATH benchmark](https://huggingface.co/datasets/nlile/hendrycks-MATH-benchmark). This benchmark is relatively challenging, but to a reasonable extent. It's not [FrontierMath](https://epoch.ai/frontiermath) :)

We'll take the first 50 problems that satisfy the following two conditions:

- Their answer either converts directly to `float`, or it's a simple LaTeX-formatted fraction, like `\frac{2}{3}`.
- Their "level" is either 4 or 5 (more challenge!).

**Note**: You can use **deepseek-ai/DeepSeek-R1** if you want, but it will generate solutions *very* slowly (not to mention the cost).

Also, if you don't want to run the evaluation of **QWQ** on your own, you may download the `qwq_results.pkl` file from Google drive:

In [None]:
!gdown 1_hEX_h7fj6FXH3lG5AMthJjiK1JwAU9J

### Evaluating QWQ on MATH Dataset

If you're in, let's create the evaluator. We'll start with data preprocessing.

In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset
ds = load_dataset('nlile/hendrycks-MATH-benchmark', split='test')

In [None]:
import re

def conver_string_to_number(s):
    """
    Checks if a string is a number or a fraction and computes the fraction if applicable.

    Args:
        s: The input string.

    Returns:
        A float representing the number or fraction, or None if the string is invalid.
    """
    if "_" in s:
        return None

    try:
        return float(s)  # Try converting to a float directly
    except ValueError:
        # fullmatch rejects strings with trailing content, e.g. "\frac{1}{2}\pi"
        match = re.fullmatch(r"\\frac\{(\d+)\}\{(\d+)\}", s.strip())
        if match:
            numerator = int(match.group(1))
            denominator = int(match.group(2))
            if denominator != 0:
                return numerator / denominator
            else:
                return None  # Handle division by zero
        else:
            return None  # String is not a number or a valid fraction

# Example usage
strings = ["123", "\\frac{1}{2}", "\\frac{3}{0}", "abc", "\\frac{4}{5}"]
for s in strings:
    result = conver_string_to_number(s)
    if result is not None:
        print(f"'{s}' is a valid number or fraction. Result: {result}")
    else:
        print(f"'{s}' is not a valid number or fraction.")


In [None]:
import pandas as pd


df = pd.DataFrame(ds)

df['num_answer'] = df['answer'].apply(conver_string_to_number)
df['valid_answer'] = df['num_answer'].notna()

# Select the first 50 rows where the 'answer' column passes the check
selected_rows = df[(df['valid_answer']) & (df['level'] >= 4)].head(50)
selected_rows

The evaluator itself:

In [None]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm
import re

from openai import OpenAI

from datasets import load_dataset

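def find_boxed_content(text):
    # Allow one level of nested braces so that \boxed{\frac{2}{3}} is captured whole
    matches = re.findall(r"boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    if matches:
        return conver_string_to_number(matches[-1])
    try:
        # Fallback: take the last word of the last line
        return conver_string_to_number(text.split("\n")[-1].split(" ")[-1].strip(".;$"))
    except Exception:
        print(f"""Wrong format in:
            {text.split()[-1]}""")
        return None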

class MATHEvaluator:
    def __init__(self, system_prompt: str = "You are a helpful assistant.",
                 prompt: str = None):
        """
        Initialize the MATH evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
        """

        self.system_prompt = system_prompt

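        # Use the custom prompt if one is given; it must contain a {question} placeholder
        self.prompt = prompt if prompt is not None else "{question}"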

        self.questions, self.answers = selected_rows["problem"].to_list(), selected_rows["num_answer"].to_list()

    def extract_answer(self, solution: str) -> float:
        """
        Extract the numeric answer from the model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer (a float), or None if no answer was found
        """
        try:
            answer = find_boxed_content(solution)
        except Exception:
            answer = None
        return answer

    def evaluate_single_question(self, question: str,
                                 correct_answer: float,
                                 client, model, max_tokens,
                                 temperature=None) -> Tuple[bool, float, str]:
        """
        Evaluate a single question.

        Args:
            question: Problem statement
            correct_answer: Correct numeric answer
            client: API client to use
            model: Which model to use
            max_tokens: Maximum number of tokens to generate
            temperature: Sampling temperature (accepted, but not forwarded
                to answer_with_llm in this implementation)

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    question=question
                ),
                client=client, model=model,
                system_prompt=self.system_prompt,
                max_tokens=max_tokens
            )
            answer = self.extract_answer(model_response)
            # Compare with None explicitly: a valid answer of 0 is falsy
            if answer is not None:
                is_correct = np.abs(answer - correct_answer) < 1e-10
            else:
                is_correct = False
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client : OpenAI, model : str,
                       n_questions=50, max_tokens=8192, temperature=0.) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0
        correct_format_count = 0
        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
                max_tokens=max_tokens,
                temperature=temperature
            )

            # Compare with None explicitly: a valid answer of 0 is falsy
            if answer is not None:
                correct_format_count += 1

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'correct_format': answer is not None,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        format_correctness = correct_format_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'format_correctness': format_correctness,
            'evaluation_log': evaluation_log
        }

        return evaluation_results

In [None]:
math_evaluator = MATHEvaluator(system_prompt=None)

In [None]:
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

qwq_results = math_evaluator.run_evaluation(
    client=client,
    model="Qwen/QwQ-32B-Preview",
    n_questions=None,
    max_tokens=None
)
print(f'\nAccuracy: {qwq_results["accuracy"]}')

Let's save the results to file:

In [None]:
import pickle

# The context manager closes the file even if pickling fails
with open("qwq_results.pkl", "wb") as f:
    pickle.dump(qwq_results, f)

Now, you can load the file even if you didn't create it:

In [None]:
!gdown 1_hEX_h7fj6FXH3lG5AMthJjiK1JwAU9J

In [None]:
import pickle
with open("qwq_results.pkl", "rb") as f:
    qwq_results = pickle.load(f)

### Analyzing thoughts

We've prepared a fairly large thought analysis and visualization script, so we decided not to include it here (check it out on GitHub if you're curious). Here, we'll only download `thought_analysis.py` and import from it.

A few words about what's happening in `thought_analysis.py`:

- First of all, if the solution has `<think>...</think>` markup inside, only the fragment between them is extracted. (We're only interested in the "internal" thinking process.)
- Then, solutions are divided into individual "thoughts" using the following heuristics:
  - `Alternatively`, `Wait`, `But wait`, `But let me check again`, `But let's verify`, and similar phrases mark the start of a new "thought". Note that they are typical indications of backtracking and solution branching. There may be more, of course.
  - Otherwise, a "thought" is a continuous range of paragraphs at least `min_split_size` characters long (we'll take `min_split_size=120`). Separate-line LaTeX formulas are always added to the previous thought.
- For each "thought", its length in tokens is calculated. For that, we need to supply the right **tokenizer**, the one that corresponds to the model which generated the solutions - in our case, **[QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)**. That's why you need a **Hugging Face access token**. If you haven't done so yet, please register on Hugging Face, get a token, and store it in Colab under the name `hf_access_token` (the code below reads it with `userdata.get`).

  If you're ardently against registering to Hugging Face, you can supply `None` as `tokenizer`, but in this case you won't get correct token length of individual "thoughts".

- The **"tree of thoughts"** is constructed in the following way:
  - If a "thought" starts with `Alternatively`, `Wait`, or `But wait`, we query another LLM (**Llama-3.1-70B** by default) to determine which of the previous "thoughts" it continues. If it starts a completely different solution, the thought is connected to the **root** (the empty solution; the very start).
  - Otherwise, the thought is connected with the previous one.

- Finally, the tree is saved as `thought_analysis/thought_tree.png`, if you didn't change default output path.
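
To give a feel for the steps above, here is a minimal sketch of the extraction and tree-building logic. It is not the code from `thought_analysis.py`: the function names are made up, and `choose_parent` is a hypothetical callback that stands in for the LLM query (here replaced by a trivial rule).

```python
import re

BACKTRACK_PREFIXES = ("Alternatively", "Wait", "But wait")

def extract_thinking(solution: str) -> str:
    """Return the text between <think> and </think>, or the whole solution if absent."""
    match = re.search(r"<think>(.*?)</think>", solution, flags=re.DOTALL)
    return match.group(1).strip() if match else solution.strip()

def build_tree(thoughts, choose_parent):
    """Return a list of (parent_index, child_index) edges; index -1 is the root."""
    edges = []
    for i, thought in enumerate(thoughts):
        if i > 0 and thought.startswith(BACKTRACK_PREFIXES):
            # Backtracking thought: ask which earlier thought it continues (-1 for root)
            parent = choose_parent(thoughts[:i], thought)
        else:
            # Ordinary thought: continues the previous one (the root for the first thought)
            parent = i - 1
        edges.append((parent, i))
    return edges

thoughts = [
    "Let x = 2.",
    "Then x^2 = 4.",
    "Wait, x could also be negative.",
    "Alternatively, plot the function.",
]
# Trivial stand-in for the LLM: "Wait" continues thought 0, anything else restarts from the root
toy_choose_parent = lambda previous, thought: 0 if thought.startswith("Wait") else -1

edges = build_tree(thoughts, toy_choose_parent)
print(edges)  # [(-1, 0), (0, 1), (0, 2), (-1, 3)]
```

In the real script the parent decision is the expensive part, since every backtracking thought requires an LLM call with the previous thoughts as context.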

In [None]:
!pip install -q openai

In [None]:
!curl -o thought_analysis.py https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic2/thought_analysis.py

Let's also download a sample solution for us to work with.

In [None]:
!curl -o sample_solution.txt https://raw.githubusercontent.com/Nebius-Academy/LLM-Engineering-Essentials/main/topic2/sample_solution.txt

In [None]:
from google.colab import files
files.upload()

In [None]:
from thought_analysis import analyze_solution_thoughts
from transformers import AutoTokenizer
from google.colab import userdata
import os

hf_access_token = userdata.get('hf_access_token')
os.environ["HF_TOKEN"] = hf_access_token

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY")
)
model = "meta-llama/Meta-Llama-3.1-70B-Instruct"

reasoning_model = "deepseek-ai/DeepSeek-R1"
# `token` is the keyword that from_pretrained uses for the access token
tokenizer = AutoTokenizer.from_pretrained(reasoning_model,
                                          token=hf_access_token)

file_path = "sample_solution.txt"  # Path to the solution file
#file_path = "deepseek_solutions/math41.txt"  # requires downloading zip file below
output_dir = "thought_analysis"  # Directory to save results

with open(file_path, 'r', encoding='utf-8') as file:
    solution_text = file.read()


connections, viz_path, summary_path = analyze_solution_thoughts(
    solution_text,
    client=client,
    model=model,
    tokenizer=tokenizer,
    output_dir=output_dir,
    min_split_size=120
)

In [None]:
from IPython.display import Image
Image('thought_analysis/thought_tree.png')

We can take a closer look at the connections, or just check the `.png` file.

In [None]:
connections

**Your task**. Take some other solutions and construct their trees of thoughts.

You may also play with solutions generated by **DeepSeek R1**. We created several of them for you:

- 10 from the MATH benchmark,
- 10 from the AIME benchmark,
- 2 from the Frontier Math benchmark.

Just beware that these solutions will be much, much longer than the solutions by **QWQ**.

You can download them like this:

In [None]:
!gdown 1TpROB-8XAE6z1OlTfli7XB6YoY6WQi18
!unzip deepseek_solutions.zip -d deepseek_solutions/

And, of course, feel free to create your own solutions!

### Size of Branches of Thoughts

**This is a task for you**! Create a simplified function

```
thoughts, token_counts = decompose_solution_thoughts(sample_text,
                                                     tokenizer=tokenizer)
```

that, given a solution `sample_text` and a tokenizer, returns:

- `thoughts` which is a split of `sample_text` by exactly `Alternatively`, `Wait`, `But wait` (you may add some of their synonyms you'll spot in the solutions). So, we only keep track of the **whole branches of the "tree of thoughts"** here.
- `token_counts` - the number of tokens in each of these "branches".

Now, we suggest that you explore the size of these branches in tokens. Create a histogram of branch sizes. What can you say about its shape? Take a look at several extremely long branches: why do you think they are so long? Check the shortest branches. What happens there?

In [None]:
import re
import matplotlib.pyplot as plt

# Split right before each marker (zero-width lookahead), so the marker stays with its branch.
# Matching is case-sensitive, so a lowercase "wait" in mid-sentence doesn't start a branch.
BRANCH_MARKERS = r"(?=\b(?:Alternatively|But wait|Wait)\b)"

def decompose_solution_thoughts(sample_text: str, tokenizer=tokenizer):
  # Drop empty pieces, e.g. when the text starts with a marker
  thoughts = [t for t in re.split(BRANCH_MARKERS, sample_text) if t.strip()]
  # Count tokens per branch without special tokens such as BOS
  token_counts = [len(tokenizer.encode(t, add_special_tokens=False)) for t in thoughts]
  return (thoughts, token_counts)

with open("deepseek_solutions/math41.txt", 'r', encoding='utf-8') as file:
  solution_text = file.read()

(thoughts, token_counts) = decompose_solution_thoughts(solution_text)

plt.hist(token_counts, density=True)

plt.xlabel('Branch length (tokens)')
plt.ylabel('Density')

plt.show()

### LLM underthinking

This part is inspired by [Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs](https://arxiv.org/pdf/2501.18585). This paper investigates the connection between the length of thought branches (fragments of a solution between "Alternatively", "Wait", etc.) and solution accuracy. The authors found that in many cases LLMs abandon promising solutions, cutting thought branches short before they can come to fruition, and this might contribute to the failure of the whole solution.

**Your task**: on one figure, plot two histograms of branch lengths - one for tasks with a correct answer and one for tasks with an incorrect answer. You can find information about answer correctness in `qwq_results["evaluation_log"]` (`"is_correct"` fields).

Since there is a different number of correct and incorrect answers in the data, we recommend normalizing the histograms so that they show frequency instead of count. This may be done by setting `density=True`. Do you see any specific patterns?
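
To see what this normalization does, here is a small pure-Python sketch of a density histogram: each bin's count is divided by the total number of values times the bin width, so the bars of each histogram cover a total area of 1 regardless of the sample size. The helper `density_histogram`, the bin edges, and the branch lengths are made up for the illustration.

```python
def density_histogram(values, bin_edges):
    """Counts per bin divided by (total count * bin width); the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / (total * (bin_edges[i + 1] - bin_edges[i])) for i, c in enumerate(counts)]

edges = [0, 50, 100]
correct_lengths = [10, 20, 30, 60]  # made-up branch lengths, 4 values
incorrect_lengths = [10, 70]        # made-up branch lengths, 2 values

correct_density = density_histogram(correct_lengths, edges)
incorrect_density = density_histogram(incorrect_lengths, edges)
print(correct_density, incorrect_density)
```

Although the first group has twice as many values, both results are on the same scale, so their shapes can be compared directly.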

In [None]:
import matplotlib.pyplot as plt

# Pool branch lengths over all solutions, separately for correct and incorrect answers
correct_sizes, incorrect_sizes = [], []
for log in qwq_results['evaluation_log']:
  if log['model_response'] is None:
    continue  # no response was recorded for this problem
  (_, token_counts) = decompose_solution_thoughts(log['model_response'])
  if log['is_correct']:
    correct_sizes.extend(token_counts)
  else:
    incorrect_sizes.extend(token_counts)

fig = plt.figure()
correct_plt = fig.add_subplot(2, 1, 1)
correct_plt.hist(correct_sizes, density=True)
correct_plt.set_title("Correct")
correct_plt.set_ylabel('Density')

# Share the x axis so that the two histograms are directly comparable
incorrect_plt = fig.add_subplot(2, 1, 2, sharex=correct_plt)
incorrect_plt.hist(incorrect_sizes, density=True)
incorrect_plt.set_title("Incorrect")
incorrect_plt.set_xlabel('Branch length (tokens)')
incorrect_plt.set_ylabel('Density')

fig.tight_layout()

plt.show()

## Task 2. Convince me with smiles

This is a continuation of the "reasoning dot by dot" section.

**Your task**: still working with **Llama-3.1-8B**, try to prompt it to output other things instead of reasoning. We personally recommend trying dots, commas, pluses, and smiles - they'll produce diverse and interesting patterns. However, feel free to try whatever you like! Don't forget to print several examples of this "reasoning". They might give you some additional insights.

Also, try running the same experiment with **Qwen/Qwen2.5-32B-Instruct**. Do you see a similar effect? Try changing not only the model, but also the prompt. Check how it affects the metrics.

In [None]:
# <YOUR EXPERIMENTS HERE>