We will build a multimodal RAG system based on Gemini 2.0 Flash (I used gemini-2.5-flash-preview-05-20 for the OCR step), following this tutorial: https://medium.com/@christopher.henkel.ai/building-a-powerful-and-scalable-multimodal-rag-system-with-gemini-2-0-flash-88086536f237

First, load the necessary libraries:

In [6]:
%pip install -q google-genai pdf2image chromadb sentence_transformers PyMuPDF python-dotenv requests pillow

Note: you may need to restart the kernel to use updated packages.


In [7]:
from google import genai
from google.genai import types
# from sentence_transformers import SentenceTransformer
import requests
from PIL import Image
import pymupdf
import chromadb
import io, re, os
from dotenv import load_dotenv
from IPython.display import display, Markdown
import time

load_dotenv()
key = os.getenv("GEMINI_API_KEY")

# print(key)
client = genai.Client(api_key=key)
response = client.models.generate_content(
    model="gemini-2.0-flash", contents="Explain how AI works in a few words. Let's think step by step."
)
print(response.text)

Okay, let's break down how AI works in simple steps:

1.  **Data:** AI learns from data (lots of it!).
2.  **Patterns:** It finds patterns and relationships in that data.
3.  **Models:** It creates a model based on those patterns.
4.  **Prediction/Action:**  It uses the model to predict outcomes or take actions on new data.
5.  **Improvement:** It gets feedback and adjusts the model to improve its performance over time.

So, in a few words: **AI learns patterns from data to predict or act.**



Download the PDF from a URL, render each page, and store the pages as PIL images in a list.

In [8]:
pdf_url = "https://arxiv.org/pdf/2501.12948"

# download the PDF (fail early on HTTP errors)
response = requests.get(pdf_url, timeout=60)
response.raise_for_status()
print(response)
# open the PDF
pdf_document = pymupdf.open(stream=response.content, filetype="pdf")
print(pdf_document)
# save every pdf page as an image in a list
page_images = []
for page_number in range(pdf_document.page_count):
    page = pdf_document[page_number]
    print(page)
    pix = page.get_pixmap()

    # Convert PyMuPDF pixmap into PIL Image
    img_data = pix.tobytes("png")
    img = Image.open(io.BytesIO(img_data))
    page_images.append(img)

pdf_document.close()



<Response [200]>
Document('None', <memory, doc# 2>)
page 0 of <memory, doc# 2>
page 1 of <memory, doc# 2>
page 2 of <memory, doc# 2>
page 3 of <memory, doc# 2>
page 4 of <memory, doc# 2>
page 5 of <memory, doc# 2>
page 6 of <memory, doc# 2>
page 7 of <memory, doc# 2>
page 8 of <memory, doc# 2>
page 9 of <memory, doc# 2>
page 10 of <memory, doc# 2>
page 11 of <memory, doc# 2>
page 12 of <memory, doc# 2>
page 13 of <memory, doc# 2>
page 14 of <memory, doc# 2>
page 15 of <memory, doc# 2>
page 16 of <memory, doc# 2>
page 17 of <memory, doc# 2>
page 18 of <memory, doc# 2>
page 19 of <memory, doc# 2>
page 20 of <memory, doc# 2>
page 21 of <memory, doc# 2>
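With no arguments, `get_pixmap()` renders at 72 DPI, which can leave small text and table digits hard to read. A minimal sketch of rendering at a higher resolution follows. The helper names `zoom_for_dpi` and `render_pages` and the default of 150 DPI are illustrative assumptions, not part of the tutorial, and larger images take longer to upload and process.

```python
def zoom_for_dpi(dpi: int) -> float:
    """PDF user space is 72 units per inch, so the zoom factor is dpi / 72."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


def render_pages(pdf_bytes: bytes, dpi: int = 150) -> list:
    """Render every page of an in-memory PDF to a PIL image at the given DPI."""
    # Imported lazily so zoom_for_dpi can be used without these packages.
    import io
    import pymupdf
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    images = []
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            # Scale both axes by the same factor to keep the aspect ratio.
            pix = page.get_pixmap(matrix=pymupdf.Matrix(zoom, zoom))
            images.append(Image.open(io.BytesIO(pix.tobytes("png"))))
    return images
```

In this notebook it could be called as `page_images = render_pages(response.content)` in place of the loop above.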


Visualize each page image to confirm that the parsing process actually worked.

In [9]:
# for idx, img in enumerate(page_images):
#     print(f"Displaying page {idx + 1}")
#     img.show()

Displaying page 1
Displaying page 2
Displaying page 3
Displaying page 4
Displaying page 5
Displaying page 6
Displaying page 7
Displaying page 8
Displaying page 9
Displaying page 10
Displaying page 11
Displaying page 12
Displaying page 13
Displaying page 14
Displaying page 15
Displaying page 16
Displaying page 17
Displaying page 18
Displaying page 19
Displaying page 20
Displaying page 21
Displaying page 22


OCR with Gemini 2.0 Flash (gemini-2.5-flash-preview-05-20 in my case)

Next, prompt the multimodal LLM to perform OCR and chunking in one pass. The model should split the text into meaningful chunks delimited by `<chunk>` and `</chunk>` tags. Tables should be formatted as HTML and additionally summarized; the summary gives retrieval more text to match against. Figures get the same treatment.

In [10]:
# Maybe improve this prompt
OCR_PROMPT = """\
OCR the following document into Markdown. 
Chunk the document into sections of roughly 250 - 1000 words. 
The chunks should be meaningful and complete with no breaking of sentences.
Surround each chunk with <chunk> and </chunk> tags. 
If there are tables, format them as HTML. 
Additionally, include a summary of the findings for each table with numbers.
If there are figures or graphs, summarize the key insights using numerical data. 
Before the summary, indicate whether it is a graph and include its graph number. 
Ensure that the summary appears in the same section as the corresponding table or figure.
"""

OCR_PROMPT_2 = """\
You are an expert document processor. Your task is to OCR a document and convert it into Markdown format. The process involves the following steps:

1. Chunking: Divide the OCRed text into meaningful sections, each ranging from 250 to 1000 words. Ensure each chunk is complete and doesn't break sentences. Enclose each chunk within <chunk> and </chunk> tags.

2. Table Formatting: Format all tables as HTML. After each table, provide a summary of the table's key findings, emphasizing numerical data.

3. Figure/Graph Summarization: For each figure or graph, summarize the key insights using numerical data. Before each summary, indicate the graph type and its number (e.g., "Graph 1:"). The summary should appear in the same section as the corresponding figure/graph.

4. Output: The final output should be a Markdown document with the described chunking, table formatting, and figure/graph summarization.
"""

# Added chunking 128-512 tokens
OCR_PROMPT_3 = """
You are an expert document processor designed to extract information from images of document pages and convert it into a structured Markdown format suitable for a Retrieval-Augmented Generation (RAG) system.
Your task involves processing an image of a document page using OCR, chunking the resulting text, formatting tables, and summarizing figures and graphs. Follow these steps precisely:

1. OCR and Initial Cleanup: Perform OCR on the input image to extract the text content of the document page. Correct any obvious OCR errors (e.g., character substitutions, misspellings) where possible, but do not spend excessive time on this. The goal is to ensure the text is reasonably accurate.

2. Chunking: Divide the OCRed text into meaningful sections. Each chunk should contain between 128 and 512 tokens (350 words). The chunks should be contextually relevant and maintain sentence integrity; do not break sentences mid-way. If a sentence needs to be split across chunks, ensure the overlap between chunks includes enough preceding text to maintain context. If a paragraph exceeds 512 tokens, split it at a semantically appropriate point, ensuring no information is lost and context is preserved within each chunk. Enclose each chunk within <chunk> and </chunk> tags. Ensure all Markdown formatting is contained within the chunk.

3. Table Formatting and Summarization: Identify tables within the OCRed text. Format each table as an HTML table. Immediately following each table, provide a concise summary (2-3 sentences) of the table's key findings, emphasizing numerical data, trends, and significant values. If possible, include units. Example: "Table 1 shows the performance metrics of different models. Model A achieved the highest accuracy at 95%, while Model B had the fastest inference speed at 120ms."

4. Figure/Graph Summarization: Identify figures and graphs within the OCRed text. For each figure or graph, provide a concise summary (2-3 sentences) of the key insights derived from the visual, emphasizing numerical data, trends, and significant data points. Label each summary with the graph type and number (e.g., "Graph 1: Accuracy vs. Training Time"). Example: "Graph 2: shows that increasing the number of layers improves model performance up to 10 layers, but after that performance plateaus." The summary should appear immediately after the figure or graph description.

5. Target Audience Considerations: While the primary audience is professionals in the generative AI space, be mindful that some users may have less familiarity. Explain concepts clearly and avoid overly technical jargon where possible in your summaries.

6. Markdown Consistency: Use standard Markdown formatting where applicable. No specific Markdown flavor is required beyond basic formatting.

7. Error Handling: If the OCR process introduces errors that prevent accurate table or figure analysis, make a best effort to interpret the data, but prioritize accurate information when possible. Note any ambiguities or potential inaccuracies briefly within the corresponding summary.

8. Output: Return the entire output as a single string containing the Markdown-formatted document, including the <chunk> tags, HTML tables, and summaries, for use within a RAG system.
"""

def ocr_page(page_num, image):
    """OCR one page image with Gemini and return a list of {"id", "text"} chunk dicts.

    Returns an empty list if the request or parsing fails.
    """
    try:
        response = client.models.generate_content(
            model="gemini-2.5-flash-preview-05-20",
            contents=[OCR_PROMPT_3, image]
        )
        page_in_ocr_format = response.text

        # parse <chunk> blocks
        chunks = re.findall(r"<chunk>(.*?)</chunk>", page_in_ocr_format, re.DOTALL)

        if not chunks:
            # fall back to paragraph splitting if no chunk tags were created
            chunks = page_in_ocr_format.split("\n\n")
        
        chunk_blocks = []
        for idx, chunk_text in enumerate(chunks):
            # store Id, chunk text
            chunk_blocks.append({
                "id": f"page_{page_num}_chunk_{idx}",
                "text": chunk_text.strip()
            })
        print(f"Successfully processed page {page_num}")  
        return chunk_blocks
    except Exception as e:
        print(f"Error processing page {page_num}: {e}")
        return []
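To see what the parsing step in `ocr_page` does without calling the API, the same regex and fallback can be run on a hand-written string. This is a sketch: `parse_chunks` and the sample text are made up for illustration, and unlike `ocr_page` it skips empty pieces.

```python
import re

CHUNK_RE = re.compile(r"<chunk>(.*?)</chunk>", re.DOTALL)


def parse_chunks(page_num: int, ocr_text: str) -> list[dict]:
    """Split one page of model output into chunk records."""
    chunks = CHUNK_RE.findall(ocr_text)
    if not chunks:
        # No tags at all: fall back to blank-line-separated paragraphs.
        chunks = ocr_text.split("\n\n")
    records = []
    for chunk_text in chunks:
        text = chunk_text.strip()
        if text:  # skip empty pieces (an addition, not done in ocr_page)
            records.append({"id": f"page_{page_num}_chunk_{len(records)}", "text": text})
    return records


sample = "<chunk>\n# Title\nFirst part.\n</chunk>\nstray text\n<chunk>Second part.</chunk>"
for record in parse_chunks(3, sample):
    print(record)
```

Note that once at least one `<chunk>` tag is found, any text outside the tags (here `stray text`) is silently dropped; the paragraph fallback only applies when the model emits no tags at all.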

Don't run the cell below: not all pages get processed.

In [11]:
# # ocr and chunks each pdf page
# all_chunks = []
# for i, image in enumerate(page_images):
#     page_chunks = ocr_page(i, image)
#     all_chunks.extend(page_chunks)

# print(f"Total chunks: {len(all_chunks)}")
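Because `ocr_page` returns an empty list when a request fails, a failed page leaves no trace in the chunk list. One way to find out which pages need another attempt is to compare the page numbers embedded in the chunk ids against the page count. The helper below is a hypothetical sketch that relies only on the `page_{n}_chunk_{i}` id format used in `ocr_page`; the example records are invented.

```python
import re

ID_RE = re.compile(r"^page_(\d+)_chunk_\d+$")


def pages_without_chunks(chunks: list[dict], page_count: int) -> list[int]:
    """Return the page numbers in range(page_count) that produced no chunk."""
    seen = set()
    for chunk in chunks:
        match = ID_RE.match(chunk["id"])
        if match:
            seen.add(int(match.group(1)))
    return [n for n in range(page_count) if n not in seen]


example = [
    {"id": "page_0_chunk_0", "text": "..."},
    {"id": "page_0_chunk_1", "text": "..."},
    {"id": "page_2_chunk_0", "text": "..."},
]
print(pages_without_chunks(example, page_count=4))  # [1, 3]
```

With the real data this would be `pages_without_chunks(all_chunks, len(page_images))`, and only the returned pages would need to be sent through `ocr_page` again.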

#### The OCR run above did not finish: the model returned overload errors, so not all pages were processed. (I am using a free account!) The next cell displays the chunks returned by the OCR step as Markdown, which helps us see what the OCR actually captured beyond the HTML tags.
#### For now, we will use whatever was processed for the rest of the RAG system.

#### I had to restart the Python kernel, so these cells are now commented out!


In [12]:
# for j in all_chunks:
#     display(Markdown(j["id"]))
#     display(Markdown(j["text"]))
#     # print("Text: " + chunk.text)
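Since the failures came from transient overload errors, one alternative to a fixed pause is to retry a failed request with exponentially growing delays. The sketch below is an illustration only: `call_with_backoff` and its parameters are hypothetical names, not part of the google-genai SDK, and it retries on any exception for simplicity, where real code would likely catch only the SDK's overload or rate-limit errors.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, delay * 0.1)  # small jitter
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Demo with a stand-in for the API call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("model overloaded")
    return "ok"


# The no-op sleep keeps the demo instant; omit it to really wait.
print(call_with_backoff(flaky_request, sleep=lambda seconds: None))
```

To use it in this notebook, the `client.models.generate_content(...)` call inside `ocr_page` could be wrapped in a lambda and passed as `fn`. Wrapping `ocr_page` itself would never trigger a retry, because `ocr_page` catches exceptions and returns an empty list.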


## Let's try this again, but with a wait between requests so that Gemini won't cry model overload.

In [13]:
import time

# OCR and chunk each PDF page with OCR_PROMPT_3, sleeping 20 seconds between requests
all_chunks_again = []
for i, image in enumerate(page_images):
    page_chunks = ocr_page(i, image)
    all_chunks_again.extend(page_chunks)
    time.sleep(20)

print(f"Total chunks: {len(all_chunks_again)}")

for j in all_chunks_again:
    display(Markdown(j[id]))
    display(Markdown(j[text]))

Successfully processed page 0
Successfully processed page 1
Successfully processed page 2
Successfully processed page 3
Successfully processed page 4
Successfully processed page 5
Successfully processed page 6
Successfully processed page 7
Successfully processed page 8
Successfully processed page 9
Successfully processed page 10
Successfully processed page 11
Successfully processed page 12
Successfully processed page 13
Successfully processed page 14
Successfully processed page 15
Successfully processed page 16
Successfully processed page 17
Successfully processed page 18
Successfully processed page 19
Successfully processed page 20
Successfully processed page 21
Total chunks: 90


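KeyError: <built-in function id>

The display loop above fails because `j[id]` uses the bare name `id`, which resolves to Python's built-in function rather than the string key `'id'` (and `j[text]` would raise a `NameError` next). Quoting the keys fixes it: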

In [14]:
for j in all_chunks_again:
    # print (j)
    display(Markdown(j['id']))
    display(Markdown(j['text']))

page_0_chunk_0

# DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

**DeepSeek-AI**
research@deepseek.com

## Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

page_1_chunk_0

# Contents

*   1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
    *   1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
    *   1.2 Summary of Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

*   2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    *   2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    *   2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model . . . . . . . . . . . 5
        *   2.2.1 Reinforcement Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 5
        *   2.2.2 Reward Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
        *   2.2.3 Training Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
        *   2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero 6
    *   2.3 DeepSeek-R1: Reinforcement Learning with Cold Start . . . . . . . . . . . . . . . . 9
        *   2.3.1 Cold Start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
        *   2.3.2 Reasoning-oriented Reinforcement Learning . . . . . . . . . . . . . . . . . . 10
        *   2.3.3 Rejection Sampling and Supervised Fine-Tuning . . . . . . . . . . . . . . . 10
        *   2.3.4 Reinforcement Learning for all Scenarios . . . . . . . . . . . . . . . . . . . . 11
    *   2.4 Distillation: Empower Small Models with Reasoning Capability . . . . . . . . . . . 11

*   3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
    *   3.1 DeepSeek-R1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
    *   3.2 Distilled Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

*   4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
    *   4.1 Distillation v.s. Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 14
    *   4.2 Unsuccessful Attempts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

*   5 Conclusion, Limitations, and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 16

*   A Contributions and Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

page_2_chunk_0

# 1. Introduction

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

page_2_chunk_1

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources against pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.

page_2_chunk_2

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits super performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting, the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.

page_2_chunk_3

However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

page_2_chunk_4

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

page_3_chunk_0

## 1.1. Contributions
### Post-Training: Large-Scale Reinforcement Learning on the Base Model
*   We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
*   We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.

page_3_chunk_1

### Distillation: Smaller Models Can Be Powerful Too
*   We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
*   Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024, 94.3% on MATH-500, and 57.2% on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

page_3_chunk_2

## 1.2. Summary of Evaluation Results
*   **Reasoning tasks:** (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.
*   **Knowledge:** On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses 40 on this benchmark.

page_4_chunk_0

*   **Others:** DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of 87.6% on AlpacaEval 2.0 and a win-rate of 92.3% on ArenaHard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

page_4_chunk_1

## 2. Approach

### 2.1. Overview
Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples. 3) Distill the reasoning capability from DeepSeek-R1 to small dense models.

page_4_chunk_2

### 2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

page_4_chunk_3

#### 2.2.1. Reinforcement Learning Algorithm
**Group Relative Policy Optimization** In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question q, GRPO samples a group of outputs {o₁, o₂, ..., o_G} from the old policy π_θ_old and then optimizes the policy model π_θ by maximizing the following objective:

Equation 1: Group Relative Policy Optimization (GRPO) Objective
$$J_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)\right] \frac{1}{G} \sum_{i=1}^G \left( \min\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \varepsilon, 1 + \varepsilon\right) A_i \right) - \beta D_{KL}(\pi_{\theta} || \pi_{ref})\right) \quad (1)$$

This equation defines the objective function for Group Relative Policy Optimization (GRPO). It aims to maximize the expected reward while controlling the policy update step size through a clipped ratio and a KL divergence regularization term. The objective considers a group of outputs and uses an advantage estimate (A_i) to guide the optimization.

Equation 2: KL Divergence Term
$$D_{KL}(\pi_{\theta} || \pi_{ref}) = \frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - \log \frac{\pi_{ref}(o_i|q)}{\pi_{\theta}(o_i|q)} - 1 \quad (2)$$

This equation specifies a particular form of KL divergence (Kullback-Leibler divergence) used as a regularization term in the GRPO objective. It measures the difference between the current policy (π_θ) and a reference policy (π_ref), helping to prevent large policy shifts during training.

where ε and β are hyper-parameters, and A_i is the advantage, computed using a group of rewards {r₁, r₂, ..., r_G} corresponding to the outputs within each group:

Equation 3: Advantage Calculation
$$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \dots, r_G\})}{\text{std}(\{r_1, r_2, \dots, r_G\})} \quad (3)$$

This equation defines how the advantage (A_i) for each output is calculated. It normalizes the reward (r_i) by subtracting the mean and dividing by the standard deviation of rewards within the same group, helping to provide a more stable learning signal.

page_5_chunk_0

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within `<think>` `</think>` and `<answer>` `</answer>` tags, respectively, i.e., `<think>` reasoning process here `</think>` `<answer>` answer here `</answer>`. User: **prompt**. Assistant:
Table 1 | Template for DeepSeek-R1-Zero. **prompt** will be replaced with the specific reasoning question during training.

**Table 1 Summary:**
Table 1 is described as a template for DeepSeek-R1-Zero. This template illustrates how the placeholder `**prompt**` is replaced with a specific reasoning question during the training process. As no table data is provided in the document image, a detailed HTML table cannot be generated.

page_5_chunk_1

### 2.2.2. Reward Modeling
The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:
*   **Accuracy rewards:** The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.
*   **Format rewards:** In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘`<think>`’ and ‘`</think>`’ tags.
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

page_5_chunk_2

### 2.2.3. Training Template
To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model’s natural progression during the RL process.

page_5_chunk_3

### 2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
**Performance of DeepSeek-R1-Zero** Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI’s o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model’s performance over time.

**Graph 2: Performance Trajectory of DeepSeek-R1-Zero on AIME 2024 Benchmark Summary:**
This figure (not shown) depicts DeepSeek-R1-Zero's consistent performance enhancement during its RL training on the AIME 2024 benchmark. The model achieved a significant increase in its average pass@1 score, rising from an initial 15.6% to 71.0%. This improvement positions its performance on par with OpenAI's o1-0912 model.

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI’s o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers

**Table 2 Summary:**
Table 2 is described as presenting a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across various reasoning benchmarks. The text indicates that the findings demonstrate reinforcement learning's effectiveness, but the description of the findings is incomplete as the page ends. Since no table data is provided, an HTML table cannot be generated.

page_6_chunk_0

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.

<table border="1">
<thead>
<tr>
<th>Model</th>
<th colspan="2">AIME 2024</th>
<th>MATH-500</th>
<th>GPQA Diamond</th>
<th>LiveCode Bench</th>
<th>CodeForces</th>
</tr>
<tr>
<th></th>
<th>pass@1</th>
<th>cons@64</th>
<th>pass@1</th>
<th>pass@1</th>
<th>pass@1</th>
<th>rating</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpenAI-o1-mini</td>
<td>63.6</td>
<td>80.0</td>
<td>90.0</td>
<td>60.0</td>
<td>53.8</td>
<td>1820</td>
</tr>
<tr>
<td>OpenAI-o1-0912</td>
<td>74.4</td>
<td>83.3</td>
<td>94.8</td>
<td>77.3</td>
<td>63.4</td>
<td>1843</td>
</tr>
<tr>
<td>DeepSeek-R1-Zero</td>
<td>71.0</td>
<td>86.7</td>
<td>95.9</td>
<td>73.3</td>
<td>50.0</td>
<td>1444</td>
</tr>
</tbody>
</table>

Table 2 provides a comparison of DeepSeek-R1-Zero's performance against two OpenAI o1 models on various reasoning-related benchmarks. DeepSeek-R1-Zero achieved the highest pass@1 score on MATH-500 (95.9%) and the highest cons@64 score on AIME 2024 (86.7%), demonstrating superior performance in these specific areas compared to both OpenAI models. While excelling in some metrics, DeepSeek-R1-Zero had a lower rating on CodeForces (1444) and a lower pass@1 on LiveCode Bench (50.0%) than the OpenAI models.
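The per-benchmark leaders named in the summary above can be checked directly against the table values. The following is a minimal sketch: the numbers are copied from Table 2, while the variable names (`columns`, `table2`, `leaders`) are illustrative choices and not part of the source.

```python
# Values copied from Table 2; higher is better for every column.
columns = [
    "AIME 2024 pass@1",
    "AIME 2024 cons@64",
    "MATH-500 pass@1",
    "GPQA Diamond pass@1",
    "LiveCode Bench pass@1",
    "CodeForces rating",
]
table2 = {
    "OpenAI-o1-mini": [63.6, 80.0, 90.0, 60.0, 53.8, 1820],
    "OpenAI-o1-0912": [74.4, 83.3, 94.8, 77.3, 63.4, 1843],
    "DeepSeek-R1-Zero": [71.0, 86.7, 95.9, 73.3, 50.0, 1444],
}

# For each column, pick the model with the highest score.
leaders = {
    col: max(table2, key=lambda model: table2[model][i])
    for i, col in enumerate(columns)
}

for col, model in leaders.items():
    print(f"{col}: {model}")
```

Under these values, DeepSeek-R1-Zero leads on AIME 2024 cons@64 and MATH-500, and OpenAI-o1-0912 leads on the remaining columns.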

page_6_chunk_1

Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

Graph 2: AIME Accuracy During Training illustrates the accuracy progression of DeepSeek-R1-Zero on the AIME benchmark during its training, alongside baseline performances of OpenAI-o1-0912. The `r1-zero-cons@16` (red line) consistently improved and surpassed the `o1-0912-cons@64` baseline (purple dashed line), reaching approximately 0.85 accuracy by 8000 steps. Similarly, the `r1-zero-pass@1` (blue line) showed a steady upward trend toward the `o1-0912-pass@1` baseline (green dashed line); Table 2 reports final pass@1 scores of 71.0% for DeepSeek-R1-Zero and 74.4% for OpenAI-o1-0912, so the blue line approaches but does not exceed that baseline. The trend highlights DeepSeek-R1-Zero's effective learning without supervised fine-tuning.

page_6_chunk_2

DeepSeek-R1-Zero attains robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.
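Majority voting, as mentioned above, can be illustrated with a short sketch: several responses are sampled for the same question and the most frequent final answer is taken as the prediction. The function name `majority_vote` and the sample answers are hypothetical; the text does not describe how answers are extracted or how ties are resolved, so the tie rule here (earliest-seen answer wins) is an assumption.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers extracted from five samples of one question.
sampled = ["204", "204", "210", "204", "96"]
print(majority_vote(sampled))  # 204
```

A single sample could have returned "210" or "96"; voting over many samples smooths out such individual errors.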

page_6_chunk_3

**Self-evolution Process of DeepSeek-R1-Zero** The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks. As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process.

page_7_chunk_0

Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

Graph 3: DeepSeek-R1-Zero Average Response Length during Training
This line graph displays the average response length of the DeepSeek-R1-Zero model across approximately 8000 training steps during the reinforcement learning (RL) process. A significant increasing trend is evident, showing that the model's average response length grows from roughly 1,000 tokens at the start to over 9,000 tokens by the end of the observed training period. This growth suggests that as the model trains, it inherently allocates more "thinking time" to solve reasoning tasks, reflecting an increasing complexity and depth in its thought processes.

page_7_chunk_1

This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

page_7_chunk_2

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.

page_7_chunk_3

Aha Moment of DeepSeek-R1-Zero
A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an “aha moment”. This moment, as illustrated in Table 3 (Note: Table 3 is referenced in the text but not present on this page), occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model’s growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

page_7_chunk_4

This moment is not only an “aha moment” for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives and it autonomously develops advanced problem-solving strategies. The “aha moment” serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

page_8_chunk_0

Question: If a > 1, then the sum of the real solutions of `sqrt(a - sqrt(a+x)) = x` is equal to
Response: <think>
To solve the equation `sqrt(a - sqrt(a+x)) = x`, let's start by squaring both ...
`(sqrt(a - sqrt(a+x)))^2 = x^2` => `a - sqrt(a+x) = x^2`.
Rearrange to isolate the inner square root term:
`(a - x^2)^2 = a+x` => `a^2 - 2ax^2 + (x^2)^2 = a+x` => `x^4 - 2ax^2 - x + (a^2 - a) = 0`
...
Wait, wait. Wait. That’s an aha moment I can flag here.
Let's reevaluate this step-by-step to identify if the correct sum can be …
We started with the equation:
`sqrt(a - sqrt(a+x)) = x`
First, let's square both sides:
`a - sqrt(a+x) = x^2` => `sqrt(a+x) = a - x^2`
Next, I could square both sides again, treating the equation: …
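The algebra in the trace above, going from `(a - x^2)^2 = a+x` to the quartic `x^4 - 2ax^2 - x + (a^2 - a) = 0`, can be sanity-checked numerically. The sketch below only verifies that the two expressions agree on a grid of integer values; it does not solve the equation or reconstruct the elided parts of the response. The helper names `squared_form` and `quartic` are illustrative.

```python
def squared_form(a, x):
    # (a - x^2)^2 - (a + x): the twice-squared equation moved to one side
    return (a - x**2) ** 2 - (a + x)

def quartic(a, x):
    # x^4 - 2ax^2 - x + (a^2 - a): the expanded form quoted in the trace
    return x**4 - 2 * a * x**2 - x + (a**2 - a)

# Integer inputs keep the comparison exact (no floating-point rounding).
for a in (2, 3, 5):
    for x in (-2, -1, 0, 1, 2, 3):
        assert squared_form(a, x) == quartic(a, x)

print("expansion agrees on all tested points")
```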

page_8_chunk_1

**Concept 3: "Aha Moment" in DeepSeek-R1-Zero**
This section describes an interesting "aha moment" from an intermediate version of DeepSeek-R1-Zero, where the model learns to rethink its approach using an anthropomorphic, or human-like, tone. This conceptual breakthrough highlights the model's evolving capabilities and the power of reinforcement learning in enabling sophisticated and adaptive behaviors. It allows observers to appreciate the advanced learning processes at play.

page_8_chunk_2

### Drawback of DeepSeek-R1-Zero
Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes Reinforcement Learning (RL) with human-friendly cold-start data.

page_8_chunk_3

### 2.3. DeepSeek-R1: Reinforcement Learning with Cold Start
Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise regarding its development: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1, which consists of four stages, outlined below.

page_8_chunk_4

#### 2.3.1. Cold Start
Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of Reinforcement Learning (RL) training from the base model, for DeepSeek-R1 we construct and collect a small amount of long Chains of Thought (CoT) data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.

page_8_chunk_5

In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for Reinforcement Learning (RL). Compared to DeepSeek-R1-Zero, the advantages of cold start data [text continues on next page].

page_9_chunk_0

include:
*   **Readability:** A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack Markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as `|special_token| <reasoning_process> |special_token| <summary>`, where the reasoning process is the Chain-of-Thought (CoT) for the query, and the summary is used to summarize the reasoning results.
*   **Potential:** By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models.
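The readable output pattern described in the list above, `|special_token| <reasoning_process> |special_token| <summary>`, can be illustrated with a small parser. This is a sketch under a stated assumption: the actual special token is not given in the text, so the literal string `|special_token|` is used as a placeholder, and `split_response` is a hypothetical helper name.

```python
# Placeholder delimiter; the real token is not specified in the text.
DELIM = "|special_token|"

def split_response(text, delim=DELIM):
    """Split a response into (reasoning, summary) or raise if malformed."""
    parts = text.split(delim)
    # Expected shape after splitting: ["", reasoning, summary]
    if len(parts) != 3 or parts[0].strip():
        raise ValueError("response does not match the expected pattern")
    return parts[1].strip(), parts[2].strip()

reasoning, summary = split_response(
    "|special_token| 2 + 2 means adding two and two. |special_token| The answer is 4."
)
print("reasoning:", reasoning)
print("summary:", summary)
```

A check like this could also serve as a filter for responses that do not follow the pattern, in the spirit of the filtering the text mentions.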

page_9_chunk_1

#### 2.3.2. Reasoning-oriented Reinforcement Learning

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning (RL) training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that Chain-of-Thought (CoT) often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.
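The reward described above can be sketched as follows: the language consistency reward is the proportion of target-language words in the CoT, and the final reward is its direct sum with the accuracy reward. The word tokenizer and the `is_english_word` test are crude assumptions made only for illustration; the text does not say how words are split or how their language is identified.

```python
import re

def language_consistency(cot, is_target_word):
    """Fraction of words in the CoT that belong to the target language."""
    words = re.findall(r"\w+", cot)
    if not words:
        return 0.0
    return sum(1 for w in words if is_target_word(w)) / len(words)

def is_english_word(word):
    # Assumption: ASCII-only alphabetic tokens count as English.
    # Digits and non-ASCII tokens count as non-target under this rule.
    return word.isascii() and word.isalpha()

def final_reward(accuracy, cot, is_target_word=is_english_word):
    # The text says the two rewards are combined by directly summing them.
    return accuracy + language_consistency(cot, is_target_word)

cot = "First compute 答案 then check"
print(language_consistency(cot, is_english_word))  # 4 of 5 words are English
print(final_reward(1.0, cot))
```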

page_9_chunk_2

#### 2.3.3. Rejection Sampling and Supervised Fine-Tuning

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

**Reasoning data** We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning related training samples.
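The sampling-and-filtering loop described above can be sketched in a few lines: sample several responses per prompt, keep only the correct ones, and drop those that fail a readability filter. Everything here is hypothetical scaffolding: `generate`, `is_correct`, and `keep_filter` stand in for the model, the reward or judgment step, and the readability filter, none of which are specified in code form in the text.

```python
def rejection_sample(prompt, generate, is_correct, keep_filter, n_samples=4):
    """Sample n_samples responses and keep those that are correct and readable."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)
        if is_correct(prompt, response) and keep_filter(response):
            kept.append(response)
    return kept

# A Markdown code-fence marker, built this way so the example stays readable.
FENCE = "`" * 3

# Canned responses stand in for model samples so the example is deterministic.
canned = iter([
    "The answer is 4",
    "The answer is 5",
    FENCE + "print(2 + 2)" + FENCE + " gives 4",
    "So 2 + 2 = 4",
])

kept = rejection_sample(
    prompt="What is 2 + 2?",
    generate=lambda prompt: next(canned),
    is_correct=lambda prompt, response: response.endswith("4"),
    keep_filter=lambda response: FENCE not in response,
)
print(kept)
```

Here the second response is rejected as incorrect and the third is rejected by the filter because it contains a code block, leaving two retained samples.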

page_10_chunk_0

**Non-Reasoning data** For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as "hello", we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.
We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

page_10_chunk_1

#### 2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains.

page_10_chunk_2

For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
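The difference in scope between the two assessments can be made concrete with a small sketch: the helpfulness scorer sees only the final summary, while the harmlessness scorer sees the reasoning and the summary together. The reward models here are toy stand-ins, and the function `score_general_response` is hypothetical; the text does not state how the two scores are combined, so the sketch returns them separately.

```python
def score_general_response(reasoning, summary, helpfulness_rm, harmlessness_rm):
    """Route the right text to each reward model and return both scores."""
    return {
        # Helpfulness is judged on the final summary only.
        "helpfulness": helpfulness_rm(summary),
        # Harmlessness is judged on the entire response.
        "harmlessness": harmlessness_rm(reasoning + "\n" + summary),
    }

# Toy reward models that record what text they were shown.
seen = {}

def toy_helpfulness_rm(text):
    seen["helpfulness"] = text
    return 1.0

def toy_harmlessness_rm(text):
    seen["harmlessness"] = text
    return 1.0

scores = score_general_response(
    reasoning="Let me think about the request.",
    summary="Here is a short answer.",
    helpfulness_rm=toy_helpfulness_rm,
    harmlessness_rm=toy_harmlessness_rm,
)
print(scores)
print(seen)
```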

page_10_chunk_3

### 2.4. Distillation: Empower Small Models with Reasoning Capability
To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in §2.3.3. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1.

page_10_chunk_4

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.

## 3. Experiment
**Benchmarks** We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,

page_11_chunk_0

2024d), Aider ¹, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces ², Chinese National High School Mathematics Olympiad (CNMO 2024) ³, and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

page_11_chunk_1

**Evaluation Prompts** Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting. The CoT in few-shot may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench verified results are obtained via the agentless framework (Xia et al., 2024). AIDER-related benchmarks are measured using a ‘diff’ format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark.

page_11_chunk_2

**Baselines** We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen, 2024a).

**Evaluation Setup** We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@k evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-p value of 0.95 to generate k responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as

```
pass@1 = (1/k) * Sum_{i=1}^{k} p_i
```

where p_i denotes the correctness of the i-th response. This method provides more reliable performance estimates. For AIME 2024, we also report consensus (majority vote) results (Wang et al., 2022) using 64 samples, denoted as cons@64.
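The pass@1 estimate above is simply the mean correctness over the k sampled responses for a question. A minimal sketch follows; the function names are illustrative, and averaging the per-question values into a single benchmark score is an assumption, since the text defines the formula only for a single question.

```python
def pass_at_1(correct_flags):
    """Mean correctness over the k sampled responses for one question.

    correct_flags holds p_i for i = 1..k, each 0 or 1 (or False/True).
    """
    k = len(correct_flags)
    if k == 0:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / k

def benchmark_pass_at_1(per_question_flags):
    # Assumption: the benchmark score is the mean of per-question pass@1.
    scores = [pass_at_1(flags) for flags in per_question_flags]
    return sum(scores) / len(scores)

# Two hypothetical questions with k = 4 samples each.
flags = [[1, 1, 0, 1], [0, 0, 1, 0]]
print([pass_at_1(f) for f in flags])   # [0.75, 0.25]
print(benchmark_pass_at_1(flags))      # 0.5
```

Averaging over several samples at a non-zero temperature gives a less noisy estimate than scoring one greedy response per question, which is the motivation the text gives for this setup.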

Footnotes:
¹ https://aider.chat
² https://codeforces.com
³ https://www.cms.org.cn/Home/comp/comp/cid/12.html

page_12_chunk_0

## 3.1. DeepSeek-R1 Evaluation

### Table 4 | Comparison between DeepSeek-R1 and other representative models.

<table>
  <thead>
    <tr>
      <th>Benchmark (Metric)</th>
      <th>Claude-3.5 Sonnet-1022</th>
      <th>GPT-4o 0513</th>
      <th>DeepSeek V3</th>
      <th>OpenAI o1-mini</th>
      <th>OpenAI o1-1217</th>
      <th>DeepSeek R1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="7"><b>Architecture</b></td>
    </tr>
    <tr>
      <td># Activated Params</td>
      <td>-</td>
      <td>-</td>
      <td>37B</td>
      <td>-</td>
      <td>-</td>
      <td>37B</td>
    </tr>
    <tr>
      <td># Total Params</td>
      <td>-</td>
      <td>-</td>
      <td>671B</td>
      <td>-</td>
      <td>-</td>
      <td>671B</td>
    </tr>
    <tr>
      <td colspan="7"><b>English</b></td>
    </tr>
    <tr>
      <td>MMLU (Pass@1)</td>
      <td>88.3</td>
      <td>87.2</td>
      <td>88.5</td>
      <td>85.2</td>
      <td>91.8</td>
      <td>90.8</td>
    </tr>
    <tr>
      <td>MMLU-Redux (EM)</td>
      <td>88.9</td>
      <td>88.0</td>
      <td>89.1</td>
      <td>86.7</td>
      <td>-</td>
      <td>92.9</td>
    </tr>
    <tr>
      <td>MMLU-Pro (EM)</td>
      <td>78.0</td>
      <td>72.6</td>
      <td>75.9</td>
      <td>80.3</td>
      <td>-</td>
      <td>84.0</td>
    </tr>
    <tr>
      <td>DROP (3-shot F1)</td>
      <td>88.3</td>
      <td>83.7</td>
      <td>91.6</td>
      <td>83.9</td>
      <td>90.2</td>
      <td>92.2</td>
    </tr>
    <tr>
      <td>IF-Eval (Prompt Strict)</td>
      <td>86.5</td>
      <td>84.3</td>
      <td>86.1</td>
      <td>84.8</td>
      <td>-</td>
      <td>83.3</td>
    </tr>
    <tr>
      <td>GPQA Diamond (Pass@1)</td>
      <td>65.0</td>
      <td>49.9</td>
      <td>59.1</td>
      <td>60.0</td>
      <td>75.7</td>
      <td>71.5</td>
    </tr>
    <tr>
      <td>SimpleQA (Correct)</td>
      <td>28.4</td>
      <td>38.2</td>
      <td>24.9</td>
      <td>7.0</td>
      <td>47.0</td>
      <td>30.1</td>
    </tr>
    <tr>
      <td>FRAMES (Acc.)</td>
      <td>72.5</td>
      <td>80.5</td>
      <td>73.3</td>
      <td>76.9</td>
      <td>-</td>
      <td>82.5</td>
    </tr>
    <tr>
      <td>AlpacaEval2.0 (LC-winrate)</td>
      <td>52.0</td>
      <td>51.1</td>
      <td>70.0</td>
      <td>57.8</td>
      <td>-</td>
      <td>87.6</td>
    </tr>
    <tr>
      <td>ArenaHard (GPT-4-1106)</td>
      <td>85.2</td>
      <td>80.4</td>
      <td>85.5</td>
      <td>92.0</td>
      <td>-</td>
      <td>92.3</td>
    </tr>
    <tr>
      <td colspan="7"><b>Code</b></td>
    </tr>
    <tr>
      <td>LiveCodeBench (Pass@1-COT)</td>
      <td>38.9</td>
      <td>32.9</td>
      <td>36.2</td>
      <td>53.8</td>
      <td>63.4</td>
      <td>65.9</td>
    </tr>
    <tr>
      <td>Codeforces (Percentile)</td>
      <td>20.3</td>
      <td>23.6</td>
      <td>58.7</td>
      <td>93.4</td>
      <td>96.6</td>
      <td>96.3</td>
    </tr>
    <tr>
      <td>Codeforces (Rating)</td>
      <td>717</td>
      <td>759</td>
      <td>1134</td>
      <td>1820</td>
      <td>2061</td>
      <td>2029</td>
    </tr>
    <tr>
      <td>SWE Verified (Resolved)</td>
      <td>50.8</td>
      <td>38.8</td>
      <td>42.0</td>
      <td>41.6</td>
      <td>48.9</td>
      <td>49.2</td>
    </tr>
    <tr>
      <td>Aider-Polyglot (Acc)</td>
      <td>45.3</td>
      <td>16.0</td>
      <td>49.6</td>
      <td>32.9</td>
      <td>61.7</td>
      <td>53.3</td>
    </tr>
    <tr>
      <td colspan="7"><b>Math</b></td>
    </tr>
    <tr>
      <td>AIME 2024 (Pass@1)</td>
      <td>16.0</td>
      <td>9.3</td>
      <td>39.2</td>
      <td>63.6</td>
      <td>79.2</td>
      <td>79.8</td>
    </tr>
    <tr>
      <td>MATH-500 (Pass@1)</td>
      <td>78.3</td>
      <td>74.6</td>
      <td>90.2</td>
      <td>90.0</td>
      <td>96.4</td>
      <td>97.3</td>
    </tr>
    <tr>
      <td>CNMO 2024 (Pass@1)</td>
      <td>13.1</td>
      <td>10.8</td>
      <td>43.2</td>
      <td>67.6</td>
      <td>-</td>
      <td>78.8</td>
    </tr>
    <tr>
      <td colspan="7"><b>Chinese</b></td>
    </tr>
    <tr>
      <td>CLUEWSC (EM)</td>
      <td>85.4</td>
      <td>87.9</td>
      <td>90.9</td>
      <td>89.9</td>
      <td>-</td>
      <td>92.8</td>
    </tr>
    <tr>
      <td>C-Eval (EM)</td>
      <td>76.7</td>
      <td>76.0</td>
      <td>86.5</td>
      <td>68.9</td>
      <td>-</td>
      <td>91.8</td>
    </tr>
    <tr>
      <td>C-SimpleQA (Correct)</td>
      <td>55.4</td>
      <td>58.7</td>
      <td>68.0</td>
      <td>40.3</td>
      <td>-</td>
      <td>63.7</td>
    </tr>
  </tbody>
</table>

Table 4 summarizes the performance of DeepSeek-R1 against various models, including Claude-3.5, GPT-4o, DeepSeek V3, and OpenAI's o1-mini and o1-1217. DeepSeek-R1 consistently achieves high scores across diverse benchmarks, notably leading in MMLU-Redux (92.9), MATH-500 (97.3), and LiveCodeBench (65.9). While generally strong, DeepSeek-R1 shows lower scores in IF-Eval (83.3) and SimpleQA (30.1) compared to some competitors.

page_12_chunk_1

For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.

page_12_chunk_2

DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1's strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that

page_13_chunk_0

DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks.

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited.

page_13_chunk_1

## 3.2 Distilled Model Evaluation

### Table 5 | Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks.

<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th colspan="2">AIME 2024</th>
      <th rowspan="2">MATH-500<br>pass@1</th>
      <th rowspan="2">GPQA Diamond<br>pass@1</th>
      <th rowspan="2">LiveCode Bench<br>pass@1</th>
      <th rowspan="2">CodeForces<br>rating</th>
    </tr>
    <tr>
      <th>pass@1</th>
      <th>cons@64</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-4o-0513</td>
      <td>9.3</td>
      <td>13.4</td>
      <td>74.6</td>
      <td>49.9</td>
      <td>32.9</td>
      <td>759</td>
    </tr>
    <tr>
      <td>Claude-3.5-Sonnet-1022</td>
      <td>16.0</td>
      <td>26.7</td>
      <td>78.3</td>
      <td>65.0</td>
      <td>38.9</td>
      <td>717</td>
    </tr>
    <tr>
      <td>OpenAI-o1-mini</td>
      <td>63.6</td>
      <td>80.0</td>
      <td>90.0</td>
      <td>60.0</td>
      <td>53.8</td>
      <td>1820</td>
    </tr>
    <tr>
      <td>QwQ-32B-Preview</td>
      <td>50.0</td>
      <td>60.0</td>
      <td>90.6</td>
      <td>54.5</td>
      <td>41.9</td>
      <td>1316</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Qwen-1.5B</td>
      <td>28.9</td>
      <td>52.7</td>
      <td>83.9</td>
      <td>33.8</td>
      <td>16.9</td>
      <td>954</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Qwen-7B</td>
      <td>55.5</td>
      <td>83.3</td>
      <td>92.8</td>
      <td>49.1</td>
      <td>37.6</td>
      <td>1189</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Qwen-14B</td>
      <td>69.7</td>
      <td>80.0</td>
      <td>93.9</td>
      <td>59.1</td>
      <td>53.1</td>
      <td>1481</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Qwen-32B</td>
      <td>72.6</td>
      <td>83.3</td>
      <td>94.3</td>
      <td>62.1</td>
      <td>57.2</td>
      <td>1691</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Llama-8B</td>
      <td>50.4</td>
      <td>80.0</td>
      <td>89.1</td>
      <td>49.0</td>
      <td>39.6</td>
      <td>1205</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Llama-70B</td>
      <td>70.0</td>
      <td>86.7</td>
      <td>94.5</td>
      <td>65.2</td>
      <td>57.5</td>
      <td>1633</td>
    </tr>
  </tbody>
</table>

Table 5 presents a comparison of DeepSeek-R1 distilled models and other models on various reasoning benchmarks. Notably, DeepSeek-R1-Distill-Qwen-32B achieved strong performance with 72.6% pass@1 on AIME 2024 and 94.3% on MATH-500. The DeepSeek-R1-Distill-Llama-70B model demonstrated leading results across several metrics, including 94.5% on MATH-500 and 65.2% on GPQA Diamond, often surpassing other DeepSeek-R1 variants and general models like OpenAI-o1-mini.

page_13_chunk_2

As shown in Table 5, simply distilling DeepSeek-R1's outputs enables the efficient DeepSeek-R1-7B (i.e., DeepSeek-R1-Distill-Qwen-7B, abbreviated similarly below) to outperform non-reasoning models like GPT-4o-0513 across the board. DeepSeek-R1-14B surpasses QwQ-32B-Preview on all evaluation metrics, while DeepSeek-R1-32B and DeepSeek-R1-70B significantly exceed o1-mini on most benchmarks. These results demonstrate the strong potential of distillation. Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.

page_13_chunk_3

## 4. Discussion

### 4.1 Distillation vs. Reinforcement Learning

In Section 3.2, we can see that by distilling DeepSeek-R1, the small model can achieve impressive results. However, there is still one question left: can the model achieve comparable performance through the large-scale RL training discussed in the paper without distillation?

To answer this question, we conduct large-scale RL training on Qwen-32B-Base using math, code, and STEM data, training for over 10K steps, resulting in DeepSeek-R1-Zero-Qwen-32B. The experimental results, shown in Table 6, demonstrate that the 32B base model, after large-scale

page_14_chunk_0

<table border="1">
  <thead>
    <tr>
      <th>Model</th>
      <th colspan="2">AIME 2024</th>
      <th>MATH-500</th>
      <th>GPQA Diamond</th>
      <th>LiveCodeBench</th>
    </tr>
    <tr>
      <th></th>
      <th>pass@1</th>
      <th>cons@64</th>
      <th>pass@1</th>
      <th>pass@1</th>
      <th>pass@1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>QwQ-32B-Preview</td>
      <td>50.0</td>
      <td>60.0</td>
      <td>90.6</td>
      <td>54.5</td>
      <td>41.9</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Zero-Qwen-32B</td>
      <td>47.0</td>
      <td>60.0</td>
      <td>91.6</td>
      <td>55.0</td>
      <td>40.2</td>
    </tr>
    <tr>
      <td>DeepSeek-R1-Distill-Qwen-32B</td>
      <td>72.6</td>
      <td>83.3</td>
      <td>94.3</td>
      <td>62.1</td>
      <td>57.2</td>
    </tr>
  </tbody>
</table>

**Table 6: Comparison of Distilled and RL Models on Reasoning-Related Benchmarks**
Table 6 presents a comparison of different models, highlighting performance on various reasoning benchmarks. The DeepSeek-R1-Distill-Qwen-32B model consistently demonstrates superior performance across all metrics, achieving 72.6% pass@1 on AIME 2024 and 57.2% pass@1 on LiveCodeBench, which are significantly higher than the other models. This suggests that the distillation approach used in DeepSeek-R1-Distill-Qwen-32B leads to better results than other architectures like QwQ-32B-Preview and DeepSeek-R1-Zero-Qwen-32B across complex reasoning tasks.

RL training achieves performance on par with QwQ-32B-Preview. However, DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks.

Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.

page_14_chunk_1

### 4.2. Unsuccessful Attempts

In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.

**Process Reward Model (PRM)** PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline. In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

page_14_chunk_2

**Monte Carlo Tree Search (MCTS)** Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), we explored using Monte Carlo Tree Search (MCTS) to enhance test-time compute scalability. This approach involves breaking answers into smaller parts to allow the model to explore the solution space systematically. To facilitate this, we prompt the model to generate multiple tags that correspond to specific reasoning steps necessary for the search. For training, we first use collected prompts to find answers via MCTS guided by a pre-trained value model. Subsequently, we use the resulting question-answer pairs to train both the actor model and the value model, iteratively refining the process.

However, this approach encounters several challenges when scaling up the training. First, unlike chess, where the search space is relatively well-defined, token generation presents an

page_15_chunk_0

exponentially larger search space. To address this, we set a maximum extension limit for each node, but this can lead to the model getting stuck in local optima. Second, the value model directly influences the quality of generation since it guides each step of the search process. Training a fine-grained value model is inherently difficult, which makes it challenging for the model to iteratively improve. While AlphaGo's core success relied on training a value model to progressively enhance its performance, this principle proves difficult to replicate in our setup due to the complexities of token generation.

In conclusion, while MCTS can improve performance during inference when paired with a pre-trained value model, iteratively boosting model performance through self-search remains a significant challenge.

page_15_chunk_1

## 5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data, achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distilling the reasoning capability into small dense models. We use DeepSeek-R1 as the teacher model to generate 800K training samples, and fine-tune several small dense models. The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.

page_15_chunk_2

In the future, we plan to invest in research across the following directions for DeepSeek-R1.

*   **General Capability:** Currently, the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output. Moving forward, we plan to explore how long CoT can be leveraged to enhance tasks in these fields.
*   **Language Mixing:** DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses, even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates.

page_15_chunk_3

*   **Prompting Engineering:** When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results.
*   **Software Engineering Tasks:** Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.

page_16_chunk_0

References
AI@Meta. Llama 3.1 model card, 2024. URL https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md.
Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

page_16_chunk_1

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, Y. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang. Alphazero-like tree-search can guide large language model decoding and training, 2024. URL https://arxiv.org/abs/2309.17179.
L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization, 2022. URL https://arxiv.org/abs/2210.10760.
A. P. Gema, J. O. J. Leang, G. Hong, A. Devoto, A. C. M. Mancino, R. Saxena, X. He, Y. Zhao, X. Du, M. R. G. Madani, C. Barale, R. McHardy, J. Harris, J. Kaddour, E. van Krieken, and P. Minervini. Are we done with mmlu? CoRR, abs/2406.04127, 2024. URL https://doi.org/10.48550/arXiv.2406.04127.

page_16_chunk_2

Google. Our next-generation model: Gemini 1.5, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024.
Y. He, S. Li, J. Liu, Y. Tan, W. Wang, H. Huang, X. Bu, H. Guo, C. Hu, B. Zheng, et al. Chinese simpleqa: A chinese factuality evaluation for large language models. arXiv preprint arXiv:2411.07140, 2024.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, J. Lei, et al. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322, 2023.
N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024. URL https://doi.org/10.48550/arXiv.2403.07974.

page_17_chunk_0

S. Krishna, K. Krishna, A. Mohananey, S. Schwarcz, A. Stambler, S. Upadhyay, and M. Faruqui. Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation. CoRR, abs/2409.12941, 2024. doi: 10.48550/ARXIV.2409.12941. URL https://doi.org/10.48550/arXiv.2409.12941.

page_17_chunk_1

A. Kumar, V. Zhuang, R. Agarwal, Y. Su, J. D. Co-Reyes, A. Singh, K. Baumli, S. Iqbal, C. Bishop, R. Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.

page_17_chunk_2

H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. CMMLU: Measuring massive multitask language understanding in Chinese. arXiv preprint arXiv:2306.09212, 2023.

page_17_chunk_3

T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.

page_17_chunk_4

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023.

page_17_chunk_5

B. Y. Lin. ZeroEval: A Unified Framework for Evaluating Language Models, July 2024. URL https://github.com/WildEval/ZeroEval.

page_17_chunk_6

MAA. American invitational mathematics examination - aime. In American Invitational Mathematics Examination - AIME 2024, February 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime.

page_17_chunk_7

OpenAI. Hello GPT-4o, 2024a. URL https://openai.com/index/hello-gpt-4o/.

page_17_chunk_8

OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/.

page_17_chunk_9

OpenAI. Introducing SimpleQA, 2024c. URL https://openai.com/index/introducing-simpleqa/.

page_17_chunk_10

OpenAI. Introducing SWE-bench verified we're releasing a human-validated subset of swe-bench that more, 2024d. URL https://openai.com/index/introducing-swe-bench-verified/.

page_17_chunk_11

Qwen. Qwq: Reflect deeply on the boundaries of the unknown, 2024a. URL https://qwenlm.github.io/blog/qwq-32b-preview/.

page_17_chunk_12

Qwen. Qwen2.5: A party of foundation models, 2024b. URL https://qwenlm.github.io/blog/qwq-2.5.

page_17_chunk_13

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023.

page_17_chunk_14

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

page_17_chunk_15

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017a. URL http://arxiv.org/abs/1712.01815.

page_18_chunk_0

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Liu, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nat., 550(7676):354–359, 2017b. doi: 10.1038/NATURE24270. URL https://doi.org/10.1038/nature24270

page_18_chunk_1

C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.

page_18_chunk_2

T. Trinh, Y. Wu, Q. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 2024. doi: 10.1038/s41586-023-06747-5.

page_18_chunk_3

J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

page_18_chunk_4

P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.

page_18_chunk_5

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

page_18_chunk_6

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. CoRR, abs/2406.01574, 2024. URL https://doi.org/10.48550/arXiv.2406.01574.

page_18_chunk_7

C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024.

page_18_chunk_8

H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024. URL https://arxiv.org/abs/2408.08152.

page_18_chunk_9

J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.

page_19_chunk_0

# Appendix
## A. Contributions and Acknowledgments

This section lists individuals who contributed to the document, categorized into core contributors and general contributors.

### Core Contributors

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Daya Guo</td>
      <td>Hui Li</td>
    </tr>
    <tr>
      <td>Dejian Yang</td>
      <td>Jianzhong Guo</td>
    </tr>
    <tr>
      <td>Haowei Zhang</td>
      <td>Jiashi Li</td>
    </tr>
    <tr>
      <td>Junxiao Song</td>
      <td>Jingchang Chen</td>
    </tr>
    <tr>
      <td>Ruoyu Zhang</td>
      <td>Jingyang Yuan</td>
    </tr>
    <tr>
      <td>Runxin Xu</td>
      <td>Jinhao Tu</td>
    </tr>
    <tr>
      <td>Qihao Zhu</td>
      <td>Junjie Qiu</td>
    </tr>
    <tr>
      <td>Shirong Ma</td>
      <td>Junlong Li</td>
    </tr>
    <tr>
      <td>Peiyi Wang</td>
      <td>J.L. Cai</td>
    </tr>
    <tr>
      <td>Xiao Bi</td>
      <td>Jiaqi Ni</td>
    </tr>
    <tr>
      <td>Xiaokang Zhang</td>
      <td>Jian Liang</td>
    </tr>
    <tr>
      <td>Xingkai Yu</td>
      <td>Jin Chen</td>
    </tr>
    <tr>
      <td>Yu Wu</td>
      <td>Kai Dong</td>
    </tr>
    <tr>
      <td>Z.F. Wu</td>
      <td>Kai Hu*</td>
    </tr>
    <tr>
      <td>Zhibin Gou</td>
      <td>Kaichao You</td>
    </tr>
    <tr>
      <td>Zhihong Shao</td>
      <td>Kaige Gao</td>
    </tr>
    <tr>
      <td>Zhuoshu Li</td>
      <td>Kang Guan</td>
    </tr>
    <tr>
      <td>Ziyi Gao</td>
      <td>Kexin Huang</td>
    </tr>
    <tr>
      <td></td>
      <td>Kuai Yu</td>
    </tr>
  </tbody>
</table>

This table lists the "Core Contributors" to the document. It details 18 individuals who are primarily responsible for the work, alongside 19 additional individuals who are also considered core to the project. The asterisk (*) next to some names, like Kai Hu*, may indicate a special role or affiliation.

### Contributors

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Aixin Liu</td>
      <td>Lean Wang</td>
    </tr>
    <tr>
      <td>Bing Xue</td>
      <td>Lecong Zhang</td>
    </tr>
    <tr>
      <td>Bingxuan Wang</td>
      <td>Liang Zhao</td>
    </tr>
    <tr>
      <td>Bochao Wu</td>
      <td>Litong Wang</td>
    </tr>
    <tr>
      <td>Bei Feng</td>
      <td>Liyu Zhang</td>
    </tr>
    <tr>
      <td>Chengda Lu</td>
      <td>Lei Xu</td>
    </tr>
    <tr>
      <td>Chenggang Zhao</td>
      <td>Leyl Xia</td>
    </tr>
    <tr>
      <td>Chengqi Deng</td>
      <td>Mingchuan Zhang</td>
    </tr>
    <tr>
      <td>Chong Ruan</td>
      <td>Minghua Zhang</td>
    </tr>
    <tr>
      <td>Damai Dai</td>
      <td>Minghui Tang</td>
    </tr>
    <tr>
      <td>Deli Chen</td>
      <td>Mingxu Zhou</td>
    </tr>
    <tr>
      <td>Dongjie Ji</td>
      <td>Meng Li</td>
    </tr>
    <tr>
      <td>Erhang Li</td>
      <td>Miaojun Wang</td>
    </tr>
    <tr>
      <td>Fangyun Lin</td>
      <td>Mingming Li</td>
    </tr>
    <tr>
      <td>Fucong Dai</td>
      <td>Ning Tian</td>
    </tr>
    <tr>
      <td>Fuli Luo*</td>
      <td>Panpan Huang</td>
    </tr>
    <tr>
      <td>Guangbo Hao</td>
      <td>Peng Zhang</td>
    </tr>
    <tr>
      <td>Guanting Chen</td>
      <td>Qiancheng Wang</td>
    </tr>
    <tr>
      <td>Guowei Li</td>
      <td>Qinyu Chen</td>
    </tr>
    <tr>
      <td>H. Zhang</td>
      <td>Qiushi Du</td>
    </tr>
    <tr>
      <td>Hanwei Xu</td>
      <td>Ruiqi Ge*</td>
    </tr>
    <tr>
      <td>Honghui Ding</td>
      <td>Ruisong Zhang</td>
    </tr>
    <tr>
      <td>Huazuo Gao</td>
      <td>Ruizhe Pan</td>
    </tr>
    <tr>
      <td>Hui Qu</td>
      <td>Runji Wang</td>
    </tr>
    <tr>
      <td></td>
      <td>R.J. Chen</td>
    </tr>
    <tr>
      <td></td>
      <td>R.L. Jin</td>
    </tr>
  </tbody>
</table>

This table provides a comprehensive list of "Contributors" to the document, featuring 25 individuals in the left column and 25 in the right, indicating a broader group involved in the project. Similar to the core contributors, certain names like Fuli Luo* and Ruiqi Ge* are marked with an asterisk, potentially denoting specific roles or affiliations within this larger group.

page_20_chunk_0

<table>
  <thead>
    <tr>
      <th>Names (Column 1)</th>
      <th>Names (Column 2)</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Ruyi Chen</td><td>Y.X. Wei</td></tr>
    <tr><td>Shanghao Lu</td><td>Yang Zhang</td></tr>
    <tr><td>Shangyan Zhou</td><td>Yanhong Xu</td></tr>
    <tr><td>Shanhuang Chen</td><td>Yao Li</td></tr>
    <tr><td>Shengfeng Ye</td><td>Yao Zhao</td></tr>
    <tr><td>Shiyu Wang</td><td>Yaofeng Sun</td></tr>
    <tr><td>Shuiping Yu</td><td>Yaohui Wang</td></tr>
    <tr><td>Shunfeng Zhou</td><td>Yi Yu</td></tr>
    <tr><td>Shuting Pan</td><td>Yichao Zhang</td></tr>
    <tr><td>S.S. Li</td><td>Yifan Shi</td></tr>
    <tr><td>Shuang Zhou</td><td>Yiliang Xiong</td></tr>
    <tr><td>Shaoqing Wu</td><td>Ying He</td></tr>
    <tr><td>Shengfeng Ye</td><td>Yishi Piao</td></tr>
    <tr><td>Tao Yun</td><td>Yisong Wang</td></tr>
    <tr><td>Tian Pei</td><td>Yixuan Tan</td></tr>
    <tr><td>Tianyu Sun</td><td>Yiyang Ma*</td></tr>
    <tr><td>T. Wang</td><td>Yiyuan Liu</td></tr>
    <tr><td>Wangding Zeng</td><td>Yongqiang Guo</td></tr>
    <tr><td>Wen Liu</td><td>Yuan Ou</td></tr>
    <tr><td>Wenfeng Liang</td><td>Yuduan Wang</td></tr>
    <tr><td>Wenjun Gao</td><td>Yue Gong</td></tr>
    <tr><td>Wenqin Yu*</td><td>Yuheng Zou</td></tr>
    <tr><td>Wentao Zhang</td><td>Yujia He</td></tr>
    <tr><td>W.L. Xiao</td><td>Yunfan Xiong</td></tr>
    <tr><td>Wei An</td><td>Yuxiang Luo</td></tr>
    <tr><td>Xiaodong Liu</td><td>Yuxiang You</td></tr>
    <tr><td>Xiaohan Wang</td><td>Yuxuan Liu</td></tr>
    <tr><td>Xiaokang Chen</td><td>Yuyang Zhou</td></tr>
    <tr><td>Xiaotao Nie</td><td>Y.X. Zhu</td></tr>
    <tr><td>Xin Cheng</td><td>Yanping Huang</td></tr>
    <tr><td>Xin Liu</td><td>Yaohui Li</td></tr>
    <tr><td>Xin Xie</td><td>Yi Zheng</td></tr>
    <tr><td>Xingchao Liu</td><td>Yuchen Zhu</td></tr>
    <tr><td>Xinyu Yang</td><td>Yunxian Ma</td></tr>
    <tr><td>Xinyuan Li</td><td>Ying Tang</td></tr>
    <tr><td>Xuecheng Su</td><td>Yukun Zha</td></tr>
    <tr><td>Xuheng Lin</td><td>Yuting Yan</td></tr>
    <tr><td>X.Q. Li</td><td>Z.Z. Ren</td></tr>
    <tr><td>Xiangyue Jin</td><td>Zehui Ren</td></tr>
    <tr><td>Xiaojin Shen</td><td>Zhangli Sha</td></tr>
    <tr><td>Xiaosha Chen</td><td>Zhe Fu</td></tr>
    <tr><td>Xiaowen Sun</td><td>Zhean Xu</td></tr>
    <tr><td>Xiaoxiang Wang</td><td>Zhenda Xie</td></tr>
    <tr><td>Xinnan Song</td><td>Zhengyan Zhang</td></tr>
    <tr><td>Xinyi Zhou</td><td>Zhewen Hao</td></tr>
    <tr><td>Xianzu Wang</td><td>Zhicheng Ma</td></tr>
    <tr><td>Xinxia Shan</td><td>Zhigang Yan</td></tr>
    <tr><td>Y.K. Li</td><td>Zhiyu Wu</td></tr>
    <tr><td>Y.Q. Wang</td><td>Zihui Gu</td></tr>
  </tbody>
</table>
This table presents a comprehensive list of 98 names, organized into two columns with 49 names each. These names likely represent individuals such as authors or contributors to a document. Notably, two names, 'Wenqin Yu*' and 'Yiyang Ma*', are marked with an asterisk, which may indicate a specific affiliation or special designation.


page_21_chunk_0

<table>
  <thead>
    <tr>
      <th>Author List 1</th>
      <th>Author List 2</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Zijia Zhu</td>
      <td>Zhen Huang</td>
    </tr>
    <tr>
      <td>Zijun Liu*</td>
      <td>Zhipeng Xu</td>
    </tr>
    <tr>
      <td>Zilin Li</td>
      <td>Zhongyu Zhang</td>
    </tr>
    <tr>
      <td>Ziwei Xie</td>
      <td>Zhen Zhang</td>
    </tr>
    <tr>
      <td>Ziyang Song</td>
      <td></td>
    </tr>
    <tr>
      <td>Zizheng Pan</td>
      <td></td>
    </tr>
  </tbody>
</table>

This table presents a list of authors, organized into two distinct columns. A total of ten authors are listed across both columns. Notably, Zijun Liu is marked with an asterisk, indicating their departure from the team.

Within each role, authors are listed alphabetically by the first name. Names marked with * denote individuals who have departed from our team.


Now, let’s set up the vector database ChromaDB:

In [15]:
# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="documents")

print(f"ChromaDB contains {collection.count()} documents.")

ChromaDB contains 89 documents.


For the embedding model we use the Gemini API with the retrieval task types (`RETRIEVAL_DOCUMENT` for the chunks, `RETRIEVAL_QUERY` for the query), which should improve the quality of the embeddings for search. Any other embedding model, such as a BERT-based one, would also work here.

For each chunk, create an embedding and store it together with the chunk text in the vector database:

In [16]:
chunk_texts = [ch["text"] for ch in all_chunks_again]

embedding_response = client.models.embed_content(
    model = "text-embedding-004",
    contents = chunk_texts,
    config = types.EmbedContentConfig(task_type = "RETRIEVAL_DOCUMENT")
)

# embedding_response is a Pydantic model; each entry in .embeddings holds its vector in .values
actual_embeddings = [e.values for e in embedding_response.embeddings]

# Upsert the chunk texts along with their embeddings into ChromaDB
collection.upsert(
    ids = [str(i) for i in range(len(chunk_texts))],
    embeddings = actual_embeddings,
    metadatas = [{"text": doc} for doc in chunk_texts]
)
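
Embedding endpoints usually cap how many texts a single request may carry, so for longer documents the one-shot call above may need to be split into batches. The following is a minimal sketch, not part of the original tutorial: `embed_in_batches` is a hypothetical helper, and the default batch size of 100 is an assumption that should be checked against the current limits of the model in use.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 100,  # assumed limit, not taken from the notebook
) -> List[Vector]:
    """Embed texts in fixed-size batches while preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        # Guard against a response that does not line up with the request.
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding call returned an unexpected number of vectors")
        vectors.extend(batch_vectors)
    return vectors
```

In this notebook, `embed_fn` could wrap the existing `client.models.embed_content(...)` call and return `[e.values for e in response.embeddings]`, so the rest of the cell stays unchanged.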

In [17]:
print(f"ChromaDB contains {collection.count()} documents.")

ChromaDB contains 90 documents.
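
Because the IDs in the upsert are list positions, entries are overwritten by position on every run, and the collection size depends on how many chunks each run produced (the count moved from 89 to 90 here). One possible alternative, sketched below under the assumption that identical chunk text should always map to the same entry, is to derive IDs from the content. `content_ids` is a hypothetical helper, not a ChromaDB function.

```python
import hashlib
from typing import Dict, List


def content_ids(texts: List[str]) -> List[str]:
    """Return one deterministic ID per text, derived from its SHA-256 hash."""
    ids: List[str] = []
    seen: Dict[str, int] = {}
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        # Identical chunks share a digest; suffix repeats so IDs stay unique.
        count = seen.get(digest, 0)
        seen[digest] = count + 1
        ids.append(digest if count == 0 else f"{digest}-{count}")
    return ids
```

The result could be passed as `ids = content_ids(chunk_texts)` in place of the positional IDs.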


Now we can embed a user query and run a similarity search against all embedded chunks to find the top-k best matches.

In [None]:
def retrieve_relevant_documents(query, top_k=10):
    """Return the texts of the top_k chunks most similar to the query."""
    # Generate the embedding for the query
    query_embedding = client.models.embed_content(
        model = "text-embedding-004",
        contents = [query],
        config = types.EmbedContentConfig(task_type="RETRIEVAL_QUERY")
    )

    query_embedding_values = query_embedding.embeddings[0].values

    # Perform similarity search in ChromaDB
    results = collection.query(
        query_embeddings = [query_embedding_values],
        n_results = top_k
    )

    # Extract results
    retrieved_documents = [doc["text"] for doc in results["metadatas"][0]]

    return retrieved_documents    
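
To make the retrieval step concrete, here is a brute-force sketch of what a top-k similarity search computes. It uses cosine similarity as an illustrative assumption; ChromaDB relies on its own index and configured distance metric, so its ranking may differ. The names `cosine_similarity` and `top_k` are hypothetical helpers for this example only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    doc_vectors: Sequence[Sequence[float]],
    k: int = 10,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A vector database does the same job at scale, trading the exhaustive scan for an index that avoids comparing the query with every stored vector.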

Final RAG part: generate the answer

Finally, we can feed the top k retrieved documents as context along with the query to the LLM to generate the final answer:

In [19]:
query = "Why is DeepSeek considered a major breakthrough? How much better did DeepSeek-R1-Zero perform on the benchmarks compared to other models?"

relevant_docs = retrieve_relevant_documents(query)

print("\nTop relevant documents:")
for n, doc in enumerate(relevant_docs):
    print("\n\nDocument #", n)
    print("*", doc)

context = "\n\n".join(relevant_docs)

# Build the prompt: retrieved chunks as context, followed by the question
prompt = f"""Use the following context to answer the question:
Context:
{context}
Question:
{query}
Answer:
"""

response = client.models.generate_content(
    model = "gemini-2.5-flash-preview-05-20",
    contents = prompt,
)

print("\n=== Gemini 2.5 Flash Answer using relevant parts of the pdf document ===")
print(response.text)


Top relevant documents:


Document # 0
* DeepSeek-R1-Zero aims to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.


Document # 1
* DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to 
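
The prompt in cell 19 joins every retrieved chunk into a single context string. A slightly more defensive variant is sketched below: it numbers the chunks so the answer can refer to them and stops adding chunks once a character budget is reached. `build_prompt`, the budget of 20,000 characters, and the added "say so" instruction are assumptions made for illustration, not part of the original tutorial.

```python
from typing import List


def build_prompt(query: str, docs: List[str], max_context_chars: int = 20000) -> str:
    """Assemble a RAG prompt from numbered chunks within a character budget."""
    parts: List[str] = []
    used = 0
    for n, doc in enumerate(docs, start=1):
        block = f"[{n}] {doc.strip()}"
        # Stop before the context exceeds the (assumed) budget.
        if used + len(block) > max_context_chars:
            break
        parts.append(block)
        used += len(block)

    context = "\n\n".join(parts)
    return (
        "Use the following context to answer the question. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question:\n{query}\n"
        "Answer:\n"
    )
```

The returned string could be passed as `contents` to `client.models.generate_content(...)` exactly like the prompt in cell 19.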

Displaying again with Markdown enabled.

In [25]:
print("\nTop relevant documents:")
for n, doc in enumerate(relevant_docs):
    print("\n\nDocument #", n)
    display(Markdown("* " + doc))

print("\n=== Gemini 2.5 Flash Answer using relevant parts of the pdf document ===")
display(Markdown(response.text))


Top relevant documents:


Document # 0


* DeepSeek-R1-Zero aims to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from 71.0% to 86.7%, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.



Document # 1


* DeepSeek-R1 also delivers impressive results on IF-Eval, a benchmark designed to assess a model's ability to follow format instructions. These improvements can be linked to the inclusion of instruction-following data during the final stages of supervised fine-tuning (SFT) and RL training. Furthermore, remarkable performance is observed on AlpacaEval2.0 and ArenaHard, indicating DeepSeek-R1's strengths in writing tasks and open-domain question answering. Its significant outperformance of DeepSeek-V3 underscores the generalization benefits of large-scale RL, which not only boosts reasoning capabilities but also improves performance across diverse domains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an average of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that



Document # 2


* ### 2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
Performance of DeepSeek-R1-Zero Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial 15.6% to an impressive 71.0%, reaching performance levels comparable to OpenAI’s 01-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model’s performance over time.

**Graph 2: Performance Trajectory of DeepSeek-R1-Zero on AIME 2024 Benchmark Summary:**
This figure (not shown) depicts DeepSeek-R1-Zero's consistent performance enhancement during its RL training on the AIME 2024 benchmark. The model achieved a significant increase in its average pass@1 score, rising from an initial 15.6% to 71.0%. This improvement positions its performance on par with OpenAI's 01-0912 model.

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI’s 01-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers

**Table 2 Summary:**
Table 2 is described as presenting a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across various reasoning benchmarks. The text indicates that the findings demonstrate reinforcement learning's effectiveness, but the description of the findings is incomplete because the page ends. Since no table data is provided, an HTML table cannot be generated.



Document # 3


* For education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 demonstrates superior performance compared to DeepSeek-V3. This improvement is primarily attributed to enhanced accuracy in STEM-related questions, where significant gains are achieved through large-scale reinforcement learning. Additionally, DeepSeek-R1 excels on FRAMES, a long-context-dependent QA task, showcasing its strong document analysis capabilities. This highlights the potential of reasoning models in AI-driven search and data analysis tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark. However, DeepSeek-R1 performs worse than DeepSeek-V3 on the Chinese SimpleQA benchmark, primarily due to its tendency to refuse answering certain queries after safety RL. Without safety RL, DeepSeek-R1 could achieve an accuracy of over 70%.



Document # 4


* Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

Graph 2: AIME Accuracy During Training illustrates the accuracy progression of DeepSeek-R1-Zero on the AIME benchmark during its training, alongside baseline performances of OpenAI-o1-0912. The `r1-zero-cons@16` (red line) consistently improved and surpassed the `o1-0912-cons@64` baseline (purple dashed line), reaching approximately 0.85 accuracy by 8000 steps. Similarly, the `r1-zero-pass@1` (blue line) showed a steady upward trend, eventually exceeding the `o1-0912-pass@1` baseline (green dashed line) and reaching about 0.65 accuracy, highlighting DeepSeek-R1-Zero's effective learning without supervised fine-tuning.



Document # 5


* Improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.



Document # 6


* ## 1.2. Summary of Evaluation Results
*   **Reasoning tasks:** (1) DeepSeek-R1 achieves a score of 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217. On MATH-500, it attains an impressive score of 97.3%, performing on par with OpenAI-o1-1217 and significantly outperforming other models. (2) On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks as it achieves 2,029 Elo rating on Codeforces outperforming 96.3% human participants in the competition. For engineering-related tasks, DeepSeek-R1 performs slightly better than DeepSeek-V3, which could help developers in real world tasks.
*   **Knowledge:** On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3, demonstrating its capability in handling fact-based queries. A similar trend is observed where OpenAI-o1 surpasses GPT-4o on this benchmark.



Document # 7


* DeepSeek-R1 avoids introducing length bias during GPT-based evaluations, further solidifying its robustness across multiple tasks.

On math tasks, DeepSeek-R1 demonstrates performance on par with OpenAI-o1-1217, surpassing other models by a large margin. A similar trend is observed on coding algorithm tasks, such as LiveCodeBench and Codeforces, where reasoning-focused models dominate these benchmarks. On engineering-oriented coding tasks, OpenAI-o1-1217 outperforms DeepSeek-R1 on Aider but achieves comparable performance on SWE Verified. We believe the engineering performance of DeepSeek-R1 will improve in the next version, as the amount of related RL training data currently remains very limited.



Document # 8


* Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.

<table border="1">
<thead>
<tr>
<th>Model</th>
<th colspan="2">AIME 2024</th>
<th>MATH-500</th>
<th>GPQA Diamond</th>
<th>LiveCode Bench</th>
<th>CodeForces</th>
</tr>
<tr>
<th></th>
<th>pass@1</th>
<th>cons@64</th>
<th>pass@1</th>
<th>pass@1</th>
<th>pass@1</th>
<th>rating</th>
</tr>
</thead>
<tbody>
<tr>
<td>OpenAI-o1-mini</td>
<td>63.6</td>
<td>80.0</td>
<td>90.0</td>
<td>60.0</td>
<td>53.8</td>
<td>1820</td>
</tr>
<tr>
<td>OpenAI-o1-0912</td>
<td>74.4</td>
<td>83.3</td>
<td>94.8</td>
<td>77.3</td>
<td>63.4</td>
<td>1843</td>
</tr>
<tr>
<td>DeepSeek-R1-Zero</td>
<td>71.0</td>
<td>86.7</td>
<td>95.9</td>
<td>73.3</td>
<td>50.0</td>
<td>1444</td>
</tr>
</tbody>
</table>

Table 2 provides a comparison of DeepSeek-R1-Zero's performance against two OpenAI o1 models on various reasoning-related benchmarks. DeepSeek-R1-Zero achieved the highest pass@1 score on MATH-500 (95.9%) and the highest cons@64 score on AIME 2024 (86.7%), demonstrating superior performance in these specific areas compared to both OpenAI models. While excelling in some metrics, DeepSeek-R1-Zero had a lower rating on CodeForces (1444) and a lower pass@1 on LiveCode Bench (50.0%) than the OpenAI models.



Document # 9


* However, DeepSeek-R1-Zero encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.


=== Gemini 2.5 Flash Answer using relevant parts of the pdf document ===


DeepSeek is considered a major breakthrough primarily because **DeepSeek-R1-Zero demonstrates robust reasoning capabilities achieved solely through Reinforcement Learning (RL), without the need for any supervised fine-tuning (SFT) data**. This highlights its ability to learn and generalize effectively through RL alone, showcasing an intrinsic development within the model to solve complex reasoning tasks.

Furthermore, DeepSeek-R1 (the more advanced version building upon R1-Zero's foundations) delivers impressive generalized performance:
*   It achieves performance on par with or even slightly surpasses top models like OpenAI-o1-1217 on various reasoning tasks (e.g., AIME 2024 and MATH-500).
*   It shows remarkable performance on instruction-following (IF-Eval), writing tasks (AlpacaEval2.0, ArenaHard), open-domain question answering, and long-context QA (FRAMES).
*   It significantly outperforms DeepSeek-V3 across diverse domains and knowledge benchmarks (MMLU, MMLU-Pro, GPQA Diamond, SimpleQA), showcasing the generalization benefits of large-scale RL.
*   It demonstrates expert-level performance in coding competition tasks, achieving a 2,029 Elo rating on Codeforces, outperforming 96.3% of human participants.

Regarding **DeepSeek-R1-Zero's performance compared to other models on the benchmarks (specifically OpenAI-o1-0912 and OpenAI-o1-mini),** based on Table 2 and accompanying text:

*   **AIME 2024:**
    *   **pass@1:** DeepSeek-R1-Zero achieved **71.0%**, which was lower than OpenAI-o1-0912 (74.4%) but higher than OpenAI-o1-mini (63.6%).
    *   **cons@64:** DeepSeek-R1-Zero achieved the highest score at **86.7%**, surpassing both OpenAI-o1-0912 (83.3%) and OpenAI-o1-mini (80.0%). The text also notes that this 86.7% was achieved through majority voting and exceeded OpenAI-o1-0912's performance.
*   **MATH-500 (pass@1):** DeepSeek-R1-Zero achieved the highest score at **95.9%**, outperforming OpenAI-o1-0912 (94.8%) and OpenAI-o1-mini (90.0%).
*   **GPQA Diamond (pass@1):** DeepSeek-R1-Zero scored **73.3%**, which was higher than OpenAI-o1-mini (60.0%) but lower than OpenAI-o1-0912 (77.3%).
*   **LiveCode Bench (pass@1):** DeepSeek-R1-Zero scored **50.0%**, which was lower than both OpenAI-o1-0912 (63.4%) and OpenAI-o1-mini (53.8%).
*   **CodeForces (rating):** DeepSeek-R1-Zero had a rating of **1444**, which was lower than both OpenAI-o1-0912 (1843) and OpenAI-o1-mini (1820).

In summary, DeepSeek-R1-Zero showed superior performance in some key areas like AIME 2024 (cons@64) and MATH-500, while demonstrating competitive but sometimes lower performance in other benchmarks like LiveCode Bench and CodeForces, compared to its OpenAI o1 counterparts. Its initial AIME 2024 pass@1 score improved significantly from 15.6% to 71.0% during its RL training process, making it comparable to OpenAI’s o1-0912.
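
The answer above contrasts pass@1 with cons@64, and the retrieved Figure 2 caption says that several responses are sampled per question and their accuracy is averaged. As a rough illustration of how the two metrics differ, here is a minimal sketch. It assumes answers can be compared by exact string match, and the helper names and toy data are hypothetical; this is not the evaluation code used in the paper.

```python
from collections import Counter


def pass_at_1(samples, reference):
    """Fraction of sampled answers that match the reference (averaged pass@1)."""
    return sum(answer == reference for answer in samples) / len(samples)


def cons_at_k(samples, reference):
    """1.0 if the majority-vote answer over the k samples matches the reference."""
    # Ties are broken by first occurrence, which is how Counter orders equal counts.
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)


# Hypothetical toy data: (sampled answers, reference answer) per question.
questions = [
    (["42", "42", "17", "42"], "42"),  # majority vote is correct
    (["7", "9", "9", "3"], "7"),       # majority vote is wrong
]

mean_pass_at_1 = sum(pass_at_1(s, r) for s, r in questions) / len(questions)
mean_cons_at_k = sum(cons_at_k(s, r) for s, r in questions) / len(questions)

print(f"pass@1: {mean_pass_at_1:.2f}")
print(f"cons@4: {mean_cons_at_k:.2f}")
```

Majority voting rewards a model whose most frequent answer is correct even when individual samples are noisy, which is one way a cons@64 score can sit above the averaged pass@1 score for the same model.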

In [None]:
# Show which Python interpreter the notebook kernel is running
import sys
print(sys.executable)

c:\Program Files\Python311\python.exe
