# RAG Agent

Retrieval Augmented Generation (RAG) is one of the most useful applications of AI agents. 

It consists of grounding the agent's answers in a given knowledge base, which the agent can access through tools.

This knowledge base - i.e., our collection of documents - is embedded in a vector store for efficient search. 
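To make the idea of embedding-based search concrete, here is a minimal, dependency-free sketch. The bag-of-words `embed` function, the `ToyVectorStore` class, and the sample sentences are all hypothetical stand-ins: a real pipeline would use a learned embedding model and a proper vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them by similarity."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("traffic signals can be optimized with reinforcement learning")
store.add("OCR extracts text from scanned documents")
top_hit = store.search("how are traffic signals optimized", k=1)
print(top_hit)
```

The query shares three words with the first sentence and none with the second, so the first sentence is ranked highest. Real embeddings capture meaning rather than exact word overlap, but the search step has the same shape.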

This method reduces hallucinations and substantially improves the accuracy of the agentic system.

Steps in this implementation:

1. Get a knowledge base: we do that manually, and for this example we use only one document for simplicity.

2. Perform OCR on our documents to extract as much information as possible.

3. Embed our documents in a vector database (vector store).

4. Construct a tool for searching the database.

5. Create the graph with grading and retries.

## 1. Knowledge Base

We selected scientific papers on the topic **"LLM-Agents for Urban Mobility & Traffic Engineering"**.

We stored them in the folder `documents/`.

## 2. OCR (Optical Character Recognition)

As noted above, we usually need to perform OCR on a knowledge base composed of documents.

> Note that we can skip this step if our knowledge base already consists of plain text (for example, when it is built through web scraping, which usually returns plain text or Markdown).

In this case our papers are complex and contain mathematical expressions, tables, and images, so OCR is not a step we can skip here.

That's why we are going to use what we consider the best OCR model available at the time of writing (January 2026): Mistral OCR 3. You can find a usage example in the [mistral_ocr](./mistral_ocr.ipynb) notebook, and a full example from Mistral's cookbook here: [link](https://colab.research.google.com/github/mistralai/cookbook/blob/main/mistral/ocr/data_extraction.ipynb).

The latter showcases how to use their `Annotations` API to also annotate the detected bounding boxes. We will use this feature to include images in our RAG pipeline. 

> **Note:** You can replace Mistral's OCR API with a free process provided by LangChain, using `PyMuPDF4LLM` and the `Upstage Document Parse API`. Results will not be as good as using Mistral but probably will be good enough for many applications (and it's all free). Check the full tutorial here: [Multimodal RAG tutorial](https://langchain-opentutorial.gitbook.io/langchain-opentutorial/19-cookbook/06-multimodal/10-geminimultimodalrag#layout-parsing-to-extract-image-from-pdf-using-upstage-document-parse-api). 

### 2.1 Mistral OCR with Annotations

The Mistral Document AI API adds two annotation functionalities:

- `document_annotation`: returns an annotation of the entire document, structured according to the input schema.
- `bbox_annotation`: returns an annotation of each bounding box (bbox) extracted by the OCR model (charts, figures, etc.), structured according to the user's requirements. For instance, the user may ask for a description or caption of each figure.

In [3]:
%pip install -q -U mistralai 

Note: you may need to restart the kernel to use updated packages.


A helper function to encode the PDF in base64:

In [39]:
import base64

def encode_pdf(pdf_path):
    """Encode the pdf to base64."""
    try:
        with open(pdf_path, "rb") as pdf_file:
            return base64.b64encode(pdf_file.read()).decode('utf-8')
    except FileNotFoundError:
        print(f"Error: The file {pdf_path} was not found.")
        return None
    except Exception as e:  # Added general exception handling
        print(f"Error: {e}")
        return None

In [7]:
base64_pdf = encode_pdf("RAG/documents/KimiK2.pdf")
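The base64 string is embedded in a `data:` URL when it is sent to the OCR endpoint later in this notebook. The sketch below shows that construction together with a round-trip check; the helper name `to_pdf_data_url` and the sample bytes are illustrative assumptions, not part of the Mistral SDK.

```python
import base64


def to_pdf_data_url(pdf_bytes: bytes) -> str:
    """Wrap raw PDF bytes in a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(pdf_bytes).decode("utf-8")
    return f"data:application/pdf;base64,{encoded}"


# Stand-in bytes; a real call would read them from the PDF file.
sample_bytes = b"%PDF-1.7 sample"
data_url = to_pdf_data_url(sample_bytes)

# Decoding the payload after the comma must give back the original bytes.
payload = data_url.split(",", 1)[1]
decoded = base64.b64decode(payload)
print(data_url[:28], decoded == sample_bytes)
```

Base64 inflates the payload by roughly a third, which is worth keeping in mind when sending large PDFs inline.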

First, we need to define our annotation formats; we recommend using pydantic for this.

For this example, we will extract the image type and a description of each bbox, as well as the authors and a summary of the full document.

In [67]:
from pydantic import BaseModel, Field
from enum import Enum

class ImageType(str, Enum):
    GRAPH = "graph"
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"

class Image(BaseModel):
    image_type: ImageType = Field(..., description="The type of the image. Must be one of 'graph', 'text', 'table' or 'image'.")
    description: str = Field(..., description="A description of the image.")

class Document(BaseModel):
    summary: str = Field(..., description="A summary of the document.")
    authors: list[str] = Field(..., description="A list of authors who contributed to the document.")
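Deriving `ImageType` from both `str` and `Enum` is a useful choice: members compare equal to plain strings and serialize to JSON without a custom encoder, while unknown labels are rejected. The standard-library sketch below illustrates this behaviour; the `parse_image_type` helper and the `"diagram"` label are hypothetical.

```python
import json
from enum import Enum
from typing import Optional


class ImageType(str, Enum):
    GRAPH = "graph"
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"


def parse_image_type(label: str) -> Optional[ImageType]:
    """Return the matching member, or None if the label is not allowed (hypothetical helper)."""
    try:
        return ImageType(label)
    except ValueError:
        return None


graph = parse_image_type("graph")
unknown = parse_image_type("diagram")
serialized = json.dumps({"image_type": graph})
print(graph == "graph", unknown, serialized)
```

Restricting the field to a closed set of labels keeps the annotations easy to filter later, for example when only tables or graphs should be indexed.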

Now that the pydantic models for our annotations are defined, we can call the OCR endpoint.

The objective is to annotate and extract information from our document and from the bboxes/images detected in it.

In [68]:
from mistralai.extra import response_format_from_pydantic_model
from dotenv import load_dotenv
import os
import json

# Initialize Mistral client with API key
from mistralai import Mistral
load_dotenv()
client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

# OCR Call with Annotations
annotations_response = client.ocr.process(
    model="mistral-ocr-latest",
    pages=list(range(16, 24)),  # Document annotations are limited to 8 pages, so we recommend splitting longer documents; bbox annotations do not have the same limit
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{base64_pdf}"
    },
    bbox_annotation_format=response_format_from_pydantic_model(Image),  # use the Image model defined above
    document_annotation_format=response_format_from_pydantic_model(Document),
    include_image_base64=True, # Let's also include the images in the response
    table_format="html"
  )

# Convert response to JSON format
response_dict = json.loads(annotations_response.model_dump_json())

print(json.dumps(response_dict, indent=4))

{
    "pages": [
        {
            "index": 16,
            "markdown": "##### Agentic Tool Use\n\nOn multi-turn tool-use benchmarks, Kimi-K2-Instruct sets a new standard. It achieves 66.1 Pass@1 on $\\tau^{2}$-Bench and 76.5 on ACEBench, substantially outperforming all baselines. These results affirm its strength in grounded, controlled, and agent-driven tool orchestration across domains.\n\n##### General Capabilities\n\nKimi-K2-Instruct exhibits strong, balanced performance across general knowledge, math, instruction following, and long-context tasks. It surpasses open-source peers on SimpleQA (31.0%), MMLU (89.5%) and MMLU-Redux (92.7%), and leads all models on instruction benchmarks (IFEval: 89.8%, Multi-Challenge: 54.1%). In math and STEM, it achieves top-tier scores (AIME 2024: 69.6%, GPQA-Diamond: 75.1%), and remains competitive on long-context factuality and retrieval (DROP: 93.5%, MRCR: 55.0%). These results position Kimi-K2-Instruct as a well-rounded and capable generalis

Let's first split the PDF into batches of 8 pages, as Mistral advises:

In [11]:
%pip install -q -U pypdf

Note: you may need to restart the kernel to use updated packages.


In [43]:
from pypdf import PdfReader, PdfWriter

def split_pdf(input_path, chunk_size=8, output_dir="RAG/documents/chunks"):
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(input_path)
    for i in range(0, len(reader.pages), chunk_size):
        writer = PdfWriter()
        for page in reader.pages[i : i + chunk_size]:
            writer.add_page(page)
        
        chunk_filename = f"chunk_{i//chunk_size}.pdf"
        chunk_path = os.path.join(output_dir, chunk_filename)
        with open(chunk_path, "wb") as f:
            writer.write(f)
        print(f"Chunk {i//chunk_size} saved to {chunk_path}")

split_pdf("RAG/documents/KimiK2.pdf")

Chunk 0 saved to RAG/documents/chunks/chunk_0.pdf
Chunk 1 saved to RAG/documents/chunks/chunk_1.pdf
Chunk 2 saved to RAG/documents/chunks/chunk_2.pdf
Chunk 3 saved to RAG/documents/chunks/chunk_3.pdf
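Writing chunk files is one way to respect the 8-page limit. Since the earlier call already passes a `pages` list, another option might be to compute page batches and pass one batch per request; the sketch below only computes the batches. The function name `page_batches` is hypothetical, and zero-based page indices are an assumption consistent with the `"index": 16` returned for `pages=list(range(16, 24))`.

```python
def page_batches(total_pages: int, batch_size: int = 8) -> list[list[int]]:
    """Split page indices 0..total_pages-1 into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        list(range(start, min(start + batch_size, total_pages)))
        for start in range(0, total_pages, batch_size)
    ]


batches = page_batches(20)
for batch in batches:
    print(batch[0], "-", batch[-1], f"({len(batch)} pages)")
```

Splitting into files, as done above, has the advantage that each request uploads only the pages it needs.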


Now let's run OCR on every chunk, annotating the images and extracting the tables as well:

In [69]:
chunk_dir = "RAG/documents/chunks"
responses = []

# Sort the list to ensure pages stay in order
for chunk_filename in sorted(os.listdir(chunk_dir)):
    # Construct the full path
    chunk_path = os.path.join(chunk_dir, chunk_filename)
    
    # Skip directories or non-pdf files if any exist
    if not chunk_filename.endswith(".pdf"):
        continue

    with open(chunk_path, "rb") as f:
        # Correctly encode the specific chunk
        base64_chunk = base64.b64encode(f.read()).decode('utf-8')
        print(f"Processing: {chunk_filename}")

    try:
        # OCR Call
        annotations_response = client.ocr.process(
            model="mistral-ocr-latest",
            # No 'pages' argument here: each chunk file already holds at most 8 pages
            document={
                "type": "document_url",
                "document_url": f"data:application/pdf;base64,{base64_chunk}"
            },
            bbox_annotation_format=response_format_from_pydantic_model(Image),
            document_annotation_format=response_format_from_pydantic_model(Document),
            include_image_base64=True,
            table_format="html"  # take out tables as well
        )
        
        response_dict = annotations_response.model_dump()
        # Parse nested JSON strings in document_annotation
        if isinstance(response_dict.get("document_annotation"), str):
            response_dict["document_annotation"] = json.loads(response_dict["document_annotation"])
        # Parse nested JSON strings in image annotations
        for page in response_dict.get("pages", []):
            for img in page.get("images", []):
                if isinstance(img.get("image_annotation"), str):
                    img["image_annotation"] = json.loads(img["image_annotation"])

        responses.append(response_dict)
        print(f"Successfully processed {chunk_filename}")

    except Exception as e:
        print(f"Error processing {chunk_filename}: {e}")

# Save the responses
output_path = "RAG/OCR/responses.json"
os.makedirs(os.path.dirname(output_path), exist_ok=True) # Ensure directory exists
with open(output_path, "w") as f:
    json.dump(responses, f, indent=4)

Processing: chunk_0.pdf
Successfully processed chunk_0.pdf
Processing: chunk_1.pdf
Successfully processed chunk_1.pdf
Processing: chunk_2.pdf
Successfully processed chunk_2.pdf
Processing: chunk_3.pdf
Successfully processed chunk_3.pdf
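The annotations come back as JSON-encoded strings nested inside the response, which is why the loop above decodes them before saving. The same normalization can be sketched on a mock response; `normalize_annotations` and the mock payload are hypothetical, and the field names are taken from the loop above.

```python
import json


def normalize_annotations(response: dict) -> dict:
    """Decode JSON-string annotations in place so they become regular dicts."""
    doc_anno = response.get("document_annotation")
    if isinstance(doc_anno, str):
        response["document_annotation"] = json.loads(doc_anno)
    for page in response.get("pages", []):
        for img in page.get("images", []):
            if isinstance(img.get("image_annotation"), str):
                img["image_annotation"] = json.loads(img["image_annotation"])
    return response


# Mock response shaped like the fields used in this notebook (not real API output).
mock = {
    "document_annotation": '{"summary": "A short summary.", "authors": ["A. Author"]}',
    "pages": [
        {
            "images": [
                {
                    "id": "img-0.jpeg",
                    "image_annotation": '{"image_type": "graph", "description": "A plot."}',
                }
            ]
        }
    ],
}
clean = normalize_annotations(mock)
print(clean["document_annotation"]["authors"])
print(clean["pages"][0]["images"][0]["image_annotation"]["image_type"])
```

Decoding once at save time means downstream code can use `.get(...)` on the annotations directly instead of parsing strings at every read.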


Since we split the document into several parts, the image and table indices start over at each chunk, which gives us repeated IDs across chunks. We do not want that, so we re-index with the following function:

In [7]:
import json

def reindex_ocr_responses(responses_list: list[dict]) -> list[dict]:
    """
    Re-indexes images and tables across all OCR responses to have globally unique IDs.
    Updates both the ID fields in objects and all markdown references.
    """
    global_image_counter = 0
    global_table_counter = 0
    
    for response in responses_list:
        for page in response.get("pages", []):
            # Create mapping of old IDs to new IDs for this page
            image_id_map = {}
            table_id_map = {}
            
            # Re-index images
            for img in page.get("images", []):
                old_id = img["id"]
                # Get file extension
                ext = old_id.split('.')[-1] if '.' in old_id else 'jpeg'
                new_id = f"img-{global_image_counter}.{ext}"
                
                # Update the image ID in the object
                img["id"] = new_id
                image_id_map[old_id] = new_id
                global_image_counter += 1
            
            # Re-index tables
            for table in page.get("tables", []):
                old_id = table["id"]
                # Get file extension
                ext = old_id.split('.')[-1] if '.' in old_id else 'html'
                new_id = f"tbl-{global_table_counter}.{ext}"
                
                # Update the table ID in the object
                table["id"] = new_id
                table_id_map[old_id] = new_id
                global_table_counter += 1
            
            # Update markdown to reflect new IDs
            markdown = page.get("markdown", "")
            
            # Replace image references: ![img-0.jpeg](img-0.jpeg) format
            for old_id, new_id in image_id_map.items():
                markdown = markdown.replace(f"![{old_id}]({old_id})", f"![{new_id}]({new_id})")
            
            # Replace table references: [tbl-0.html](tbl-0.html) format
            for old_id, new_id in table_id_map.items():
                markdown = markdown.replace(f"[{old_id}]({old_id})", f"[{new_id}]({new_id})")
            
            page["markdown"] = markdown
    
    print(f"Re-indexed {global_image_counter} images and {global_table_counter} tables")
    return responses_list

In [8]:
# Load the original JSON
input_file = "RAG/OCR/responses.json"
output_file = "RAG/OCR/responses_reindexed.json"  # or use the same file to overwrite

with open(input_file, "r") as f:
    responses_list = json.load(f)

# Re-index
responses_list = reindex_ocr_responses(responses_list)

# Save the re-indexed version
with open(output_file, "w") as f:
    json.dump(responses_list, f, indent=4)

print(f"Saved re-indexed responses to {output_file}")

Re-indexed 18 images and 7 tables
Saved re-indexed responses to RAG/OCR/responses_reindexed.json
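One caveat with chained `str.replace` calls: if a new ID equals another old ID on the same page (for instance `img-0.jpeg` becoming `img-3.jpeg` while `img-3.jpeg` is also being renamed), a later replacement can rewrite a reference that was already updated. A possible safeguard is to substitute all references in a single regex pass, as sketched below; `remap_references` is a hypothetical helper, not part of the notebook's code.

```python
import re


def remap_references(markdown: str, id_map: dict[str, str]) -> str:
    """Rewrite [old](old) references to [new](new) in one pass over the text."""
    if not id_map:
        return markdown
    alternation = "|".join(re.escape(old_id) for old_id in id_map)
    # \1 requires the link target to repeat the link text, as in ![img-0.jpeg](img-0.jpeg)
    pattern = re.compile(r"\[(" + alternation + r")\]\(\1\)")

    def swap(match: re.Match) -> str:
        new_id = id_map[match.group(1)]
        return f"[{new_id}]({new_id})"

    return pattern.sub(swap, markdown)


def remap_sequential(markdown: str, id_map: dict[str, str]) -> str:
    """Chained replacement, shown only for comparison."""
    for old_id, new_id in id_map.items():
        markdown = markdown.replace(f"[{old_id}]({old_id})", f"[{new_id}]({new_id})")
    return markdown


page_md = "![img-0.jpeg](img-0.jpeg) then ![img-3.jpeg](img-3.jpeg)"
id_map = {"img-0.jpeg": "img-3.jpeg", "img-3.jpeg": "img-6.jpeg"}
single_pass = remap_references(page_md, id_map)
chained = remap_sequential(page_md, id_map)
print(single_pass)
print(chained)
```

In the single-pass version the two references stay distinct, whereas the chained version collapses both onto `img-6.jpeg`. Because the pattern matches the bracketed part only, the leading `!` of image references is left untouched and the same helper covers table references.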


Let's check the results:

In [49]:
import json
from IPython.display import Markdown, display

def replace_images_in_markdown_annotated(
    markdown_str: str,
    images_dict: dict,
    tables_dict: dict = None,
    include_images: bool = True,
    include_tables: bool = True
) -> str:
    """
    Replaces images and tables in the markdown string with their content/descriptions.

    Args:
        markdown_str: The markdown string to replace images in.
        images_dict: A dictionary of images to replace, with their names as keys and data as values.
        tables_dict: A dictionary of tables to replace, with their names as keys and data as values.
        include_images: Whether to include images base64 data in the output.
        include_tables: Whether to include table HTML content in the output.

    Returns:
        The markdown string with images and tables replaced.
    """
    
    # Replace images: ![img-0.jpeg](img-0.jpeg) format
    for img_name, data in images_dict.items():
        placeholder = f"![{img_name}]({img_name})"
        
        # Get annotation description
        annotation = data.get('annotation', {})      
        description = annotation.get('description', '')
        
        if include_images:
            replacement = f"![{img_name}]({data['image']})\n\n**{description}**"
        else:
            replacement = f"**Figure: {img_name}**\n\n**{description}**"
        
        markdown_str = markdown_str.replace(placeholder, replacement)
    
    # Replace tables: [tbl-0.html](tbl-0.html) format (no exclamation mark!)
    if tables_dict:
        for tbl_name, data in tables_dict.items():
            placeholder = f"[{tbl_name}]({tbl_name})"
            
            if include_tables:
                # Insert the actual HTML table content
                replacement = f"\n\n{data['content']}\n\n"
            else:
                replacement = f"**Table: {tbl_name}**"
            
            markdown_str = markdown_str.replace(placeholder, replacement)
    
    return markdown_str

def process_saved_ocr_json(json_path: str, range: list[int] = None, include_images: bool = True, include_tables: bool = True) -> str:
    """
    Reads the saved JSON list of responses and merges them into one Markdown string.

    Args:
        json_path: Path to the JSON file containing OCR responses
        range: Optional [start, stop] pair used to slice the list of responses (stop is exclusive); None processes all responses
        include_images: Whether to include images in the output
        include_tables: Whether to include tables in the output

    Returns:
        Combined Markdown string with all OCR responses
    """
    with open(json_path, "r") as f:
        responses_list = json.load(f)

    full_markdown_parts = []

    # Handle None range - process all responses
    responses_to_process = responses_list[range[0]:range[1]] if range is not None else responses_list

    for resp in responses_to_process:
        # 1. Add the Document-level Annotation/Summary for this chunk
        doc_anno = resp.get("document_annotation", "")
        if doc_anno:
            full_markdown_parts.append(f"## Document Summary\n**{doc_anno.get('summary', '')}**\n ## Authors \n **{doc_anno.get('authors', '')}**")

        # 2. Iterate through pages in this chunk
        for page in resp.get("pages", []):
            image_data = {}
            # Extract image data for replacement
            for img in page.get("images", []):
                image_data[img["id"]] = {
                    "image": img.get("image_base64", ""), 
                    "annotation": img.get("image_annotation", {})
                }
            
            table_data = {}
            for tbl in page.get("tables", []):
                table_data[tbl["id"]] = {
                    "content": tbl.get("content", ""),
                }
            
            # 3. Process the markdown for this specific page
            page_md = page.get("markdown", "")
            processed_page = replace_images_in_markdown_annotated(
                page_md, image_data, table_data, include_images, include_tables
            )
            full_markdown_parts.append(processed_page)

    return "\n\n---\n\n".join(full_markdown_parts)
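The `range` argument is applied as a half-open slice over the list of saved responses, so `[3, 4]` selects only the fourth chunk. The sketch below isolates that logic; `select_responses` and the placeholder chunk names are hypothetical.

```python
from typing import Optional


def select_responses(responses: list, bounds: Optional[list[int]] = None) -> list:
    """Return responses[start:stop] for bounds=[start, stop], or everything when bounds is None."""
    if bounds is None:
        return responses
    start, stop = bounds
    return responses[start:stop]


chunks = ["chunk_0", "chunk_1", "chunk_2", "chunk_3"]
fourth_only = select_responses(chunks, [3, 4])
first_two = select_responses(chunks, [0, 2])
everything = select_responses(chunks)
print(fourth_only, first_two, len(everything))
```

Using a name such as `bounds` also avoids shadowing Python's built-in `range` inside the function.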

In [52]:
# Display only the fourth chunk (responses[3:4]), with tables inlined and images replaced by their descriptions
final_markdown = process_saved_ocr_json("RAG/OCR/responses_reindexed.json", range=[3, 4], include_images=False, include_tables=True)
display(Markdown(final_markdown))

## Document Summary
**The document presents a technical report on Kimi K2, detailing its contributions, token template for tool calling, evaluation details, and other technical aspects. It includes a list of authors, an explanation of the token structure for tool calling, and performance metrics across various benchmarks. The report highlights Kimi K2's superior performance in coding tasks, tool use tasks, math and STEM tasks, general tasks, and long context and factuality tasks. It also discusses the minimal impact of QK-Clip on model quality and the engine switching pipeline for RL training.**
 ## Authors 
 **['Yifan Bai', 'Yiping Bao', 'Guanduo Chen', 'Jiahao Chen', 'Ningxin Chen', 'Ruijue Chen', 'Yanru Chen', 'Yuankun Chen', 'Yutian Chen', 'Zhuofu Chen*', 'Jialei Cui', 'Hao Ding', 'Mengnan Dong', 'Ang`ang Du', 'Chenzhuang Du', 'Dikang Du', 'Yulun Du', 'Yu Fan', 'Yichen Feng', 'Kelin Fu', 'Bofei Gao', 'Hongcheng Gao', 'Peizhong Gao', 'Tong Gao', 'Xinran Gu', 'Longyu Guan', 'Haiqing Guo*', 'Jianhang Guo', 'Hao Hu', 'Xiaoru Hao', 'Tianhong He', 'Weiran He', 'Wenyang He', 'Chao Hong', 'Yangyang Hu', 'Zhenxing Hu', 'Weixiao Huang', 'Zhiqi Huang', 'Zihao Huang', 'Tao Jiang', 'Zhejun Jiang', 'Xinyi Jin', 'Yongsheng Kang*', 'Guokun Lai', 'Cheng Li', 'Fang Li', 'Haoyang Li', 'Ming Li', 'Wentao Li', 'Yanhao Li', 'Yiwei Li', 'Zhaowei Li', 'Zheming Li', 'Xiaohan Lin', 'Zongyu Lin', 'Chengyin Liu', 'Chenyu Liu', 'Dikang Liu', 'Jingyuan Liu*', 'Junqi Liu', 'Liang Liu', 'Shaowei Liu', 'T.Y. Liu', 'Tianwei Liu', 'Weizhou Liu', 'Yangyang Liu', 'Yibo Liu', 'Yiping Liu', 'Yue Liu', 'Zhengying Liu', 'Enzhe Lu', 'Lijun Lu', 'Shengling Ma', 'Xinyu Ma', 'Yingwei Ma', 'Shaoguang Mao', 'Jie Mei', 'Xin Men', 'Yibo Miao', 'Jinjing Xu', 'L.H. Xu', 'Lin Xu', 'Ruoyu Qin', 'Bowen Qu', 'Zeyu Shang', 'Lidong Shi', 'Shengyuan Shi', 'Feifan Song', 'Flood Sung', 'Heyu Tang', 'Jiawen Tao', 'Qifeng Teng', 'Chensi Wang', 'Dinglu Wang', 'Feng Wang', 'Haiming Wang', 'Jianzhou Wang*', 'Jiaxing Wang', 'Jinhong Wang', 'Shengjie Wang', 'Shuyi Wang', 'Yao Wang', 'Yejie Wang', 'Yiqin Wang', 'Yuxin Wang', 'Yuzhi Wang', 'Zhaoji Wang', 'Zhengtao Wang', 'Zhexu Wang', 'Chu Wei', 'Qianqian Wei', 'Wenhao Wu', 'Xingzhe Wu', 'Yuxin Wu', 'Yutong Zhang', 'Haotian Zhao', 'Yikai Zhao', 'Yuang Zhang', 'Yizhi Zhang', 'Yongting Zhang', 'Yu Zhang', 'Yutao Zhang', 'Zhen Zhang', 'Huabin Zheng', 'Shaojie Zheng', 'Jianren Zhou', 'Xinyu Zhou', 'Zaida Zhou', 'Zhen Zhu', 'Weiyu Zhuang', 'Xinxing Zu', 'Kimi K2']**

---

Kimi K2

TECHNICAL REPORT

# Appendix

# A Contributions

The listing of authors is in alphabetical order based on their last names. Names marked with an asterisk (*) indicate people who are no longer part of our team.



<table><tr><td>Yifan Bai</td><td>Guokun Lai</td><td>Shengyuan Shi</td><td>Ziyao Xu</td></tr><tr><td>Yiping Bao</td><td>Cheng Li</td><td>Feifan Song</td><td>Junjie Yan</td></tr><tr><td>Guanduo Chen</td><td>Fang Li</td><td>Jianlin Su</td><td>Yuzi Yan</td></tr><tr><td>Jiahao Chen</td><td>Haoyang Li</td><td>Zhengyuan Su</td><td>Xiaofei Yang</td></tr><tr><td>Ningxin Chen</td><td>Ming Li</td><td>Xinjie Sun*</td><td>Ying Yang</td></tr><tr><td>Ruijue Chen</td><td>Wentao Li</td><td>Flood Sung</td><td>Zhen Yang</td></tr><tr><td>Yanru Chen</td><td>Yanhao Li</td><td>Heyi Tang</td><td>Zhilin Yang</td></tr><tr><td>Yuankun Chen</td><td>Yiwei Li</td><td>Jiawen Tao</td><td>Zonghan Yang</td></tr><tr><td>Yutian Chen</td><td>Zhaowei Li</td><td>Qifeng Teng</td><td>Haotian Yao</td></tr><tr><td>Zhuofu Chen*</td><td>Zheming Li</td><td>Chensi Wang</td><td>Xingcheng Yao</td></tr><tr><td>Jialei Cui</td><td>Hongzhan Lin*</td><td>Dinglu Wang</td><td>Wenjie Ye</td></tr><tr><td>Hao Ding</td><td>Xiaohan Lin</td><td>Feng Wang</td><td>Zhuorui Ye</td></tr><tr><td>Mengnan Dong</td><td>Zongyu Lin</td><td>Haiming Wang</td><td>Bohong Yin</td></tr><tr><td>Ang'ang Du</td><td>Chengyin Liu</td><td>Jianzhou Wang*</td><td>Longhui Yu</td></tr><tr><td>Chenzhuang Du</td><td>Chenyu Liu</td><td>Jiaxing Wang</td><td>Enming Yuan</td></tr><tr><td>Dikang Du</td><td>Hongzhang Liu</td><td>Jinhong Wang</td><td>Hongbang Yuan*</td></tr><tr><td>Yulun Du</td><td>Jingyuan Liu*</td><td>Shengjie Wang</td><td>Mengjie Yuan</td></tr><tr><td>Yu Fan</td><td>Junqi Liu</td><td>Shuyi Wang</td><td>Haobing Zhan</td></tr><tr><td>Yichen Feng</td><td>Liang Liu</td><td>Yao Wang</td><td>Dehao Zhang</td></tr><tr><td>Kelin Fu</td><td>Shaowei Liu</td><td>Yejie Wang</td><td>Hao Zhang</td></tr><tr><td>Bofei Gao</td><td>T.Y. 
Liu</td><td>Yiqin Wang</td><td>Wanlu Zhang</td></tr><tr><td>Hongcheng Gao</td><td>Tianwei Liu</td><td>Yuxin Wang</td><td>Xiaobin Zhang</td></tr><tr><td>Peizhong Gao</td><td>Weizhou Liu</td><td>Yuzhi Wang</td><td>Yangkun Zhang</td></tr><tr><td>Tong Gao</td><td>Yangyang Liu</td><td>Zhaoji Wang</td><td>Yizhi Zhang</td></tr><tr><td>Xinran Gu</td><td>Yibo Liu</td><td>Zhengtao Wang</td><td>Yongting Zhang</td></tr><tr><td>Longyu Guan</td><td>Yiping Liu</td><td>Zhexu Wang</td><td>Yu Zhang</td></tr><tr><td>Haiqing Guo*</td><td>Yue Liu</td><td>Chu Wei</td><td>Yutao Zhang</td></tr><tr><td>Jianhang Guo</td><td>Zhengying Liu</td><td>Qianqian Wei</td><td>Yutong Zhang</td></tr><tr><td>Hao Hu</td><td>Enzhe Lu</td><td>Wenhao Wu</td><td>Zheng Zhang</td></tr><tr><td>Xiaoru Hao</td><td>Lijun Lu</td><td>Xingzhe Wu</td><td>Haotian Zhao</td></tr><tr><td>Tianhong He</td><td>Shengling Ma</td><td>Yuxin Wu</td><td>Yikai Zhao</td></tr><tr><td>Weiran He</td><td>Xinyu Ma</td><td>Chenjun Xiao</td><td>Huabin Zheng</td></tr><tr><td>Wenyang He</td><td>Yingwei Ma</td><td>Xiaotong Xie</td><td>Shaojie Zheng</td></tr><tr><td>Chao Hong</td><td>Shaoguang Mao</td><td>Weimin Xiong*</td><td>Jianren Zhou</td></tr><tr><td>Yangyang Hu</td><td>Jie Mei</td><td>Boyu Xu</td><td>Xinyu Zhou</td></tr><tr><td>Zhenxing Hu</td><td>Xin Men</td><td>Jing Xu*</td><td>Zaida Zhou</td></tr><tr><td>Weixiao Huang</td><td>Yibo Miao</td><td>Jinjing Xu</td><td>Zhen Zhu</td></tr><tr><td>Zhiqi Huang</td><td>Siyuan Pan</td><td>L.H. Xu</td><td>Weiyu Zhuang</td></tr><tr><td>Zihao Huang</td><td>Yebo Peng</td><td>Lin Xu</td><td>Xinxing Zu</td></tr><tr><td>Tao Jiang</td><td>Ruoyu Qin</td><td>Suting Xu</td><td>Kimi K2</td></tr><tr><td>Zhejun Jiang</td><td>Bowen Qu</td><td>Weixin Xu</td><td></td></tr><tr><td>Xinyi Jin</td><td>Zeyu Shang</td><td>Xinran Xu</td><td></td></tr><tr><td>Yongsheng Kang*</td><td>Lidong Shi</td><td>Yangchuan Xu</td><td></td></tr></table>



---

B Token Template of Tool Calling

There are three components in the token structure for tool-calling:

- Tool declaration message: defines the list of available tools and the schema of the arguments;
- Tool invoking section in assistant message: encodes the model’s request to invoke tools;
- Tool result message: encapsulates the invoked tool’s execution result.

The raw tokens of the tool declaration message are formatted as follows:

⬇
<|im_begin|>
tool_declare
<|im_middle|>
# Tools

{{ tool declaration content }}
<|im_end|>

The blue highlighted marks represent special tokens, and the green part, quoted by brackets, is the tool declaration content. We use TypeScript to express the tool declaration content, since TypeScript is a concise language with a comprehensive type system, able to express the types and constraints of tool parameters with brief text. The code 1 shows an example for two simple tools in JSON format compatible with OpenAI’s chat completion API, as a comparison, the same tools defined in TypeScript (listed in Code 2) is much shorter. To improve compatibility, part of our training data also uses JSON as the tool declaration language, so that 3rd-party frameworks need not additional development to support our tool calling scheme.

Listing 1: Tool definition with JSON in OpenAI compatible API
⬇
[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a location and date",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country e.g. Beijing, China"
},
"date": {
"type": "string",
"description": "Date to query, format in ‘%Y-%m-%d’"
}
},
"required": [
"location"
]
}
}
},
{
"type": "function",
"function": {
"name": "Calculator",
"description": "Simple calculator",
"parameters": {
"properties": {
"expr": {
"type": "string",
"description": "Arithmetic expression in javascript"
}
},

---

Kimi K2

TECHNICAL REPORT

```txt
"type": "object"
}
}
```

Listing 2: Tool definition in TypeScript
```txt
namespace functions {
// Get weather for a location and date
type get_weather = (_: {
// City and country e.g. Beijing, China
location: string,
// Date to query, format in '%Y-%m-%d'
date?: string
}) =&gt; any;
// Simple calculator
type Calculator = (_: {
// Arithmetic expression in javascript
expr?: string
}) =&gt; any;
}
```

The token template of the tool invoking section in the model's response messages is listed as follows:

```txt
<tool_call_section_begin|>
<|tool_call_begin|>
// call_id part
functions.{{tool name}}:{{counter}}
<|toolarguments_begin|>
{json serialized call arguments}
<|tool_call_end|>
<|tool_call_begin|>
// more tool calls
<|tool_call_end|>
<|tool_call_section_end|>
```

As shown in the template, we support parallel tool calling by placing multiple tool calls in a single response turn. Each tool call has a unique call id, formatted as functions.{tool-name}:{counter}, where tool-name is the name of the tool, and counter is an auto-increasing counter of all tool calls starting from 0 in the dialog.

During inference, the model may occasionally generate unexpected tokens, leading to format errors when parsing a tool call. To solve this issue, we developed a constrained decoding module named enforcer, inspired by lm-format-enforcer $^6$ . When a <tool_call_section_begin|> token is generated, it ensures that the upcoming tool-related tokens follow the predefined template, and the JSON argument string follows the declared schema.

The tool result message is simply a text message encoded with the tool's call id and the corresponding results.

```txt
&lt;|im_begin|&gt;
tool
&lt;|im_middle|&gt;
## Results of {{call_id}}
{execution result content}
&lt;|im_end|&gt;
```

# C Evaluation Details

Coding Tasks. We evaluate Kimi-K2-Instruct's capabilities on competitive coding benchmarks, LiveCodeBench and OJBench, where Kimi-K2-Instruct attains superior performance with scores of  $53.7\%$  and  $27.1\%$ , respectively. This excellence spans both medium-level coding challenges, such as LeetCode and AtCoder, and hard-level contests like NOI and ICPC, outperforming leading open-source and proprietary models. For multilingual programming proficiency, we employ MultiPL-E, covering languages including C++, C#, Java, JavaScript, PHP, Go, Kimi-K2-Instruct surpasses top

---

open-source models with an accuracy of 85.7%, compared with 83.1% for DeepSeek-V3-0324 and 78.2% for Qwen3-235B-A22B. In software engineering tasks, Kimi-K2-Instruct demonstrates robust performance on SWE-bench Verified (Python), SWE-lancer (Python), SWE-bench Multilingual, and Multi-SWE-bench datasets. It significantly outperforms open-source counterparts in resolving real-world code repository issues and notably narrows the performance gap with proprietary models. For example:

- SWE-bench Verified (multiple attempts): 71.6% (Kimi-K2-Instruct) vs. 80.2% (Claude 4 Sonnet)
- SWE-bench Multilingual: 47.3% (Kimi-K2-Instruct) vs. 51.0% (Claude 4 Sonnet)
- SWE-lancer: 39.1% (Kimi-K2-Instruct) vs. 40.8% (Claude 4 Sonnet)

On PaperBench, Kimi-K2-Instruct achieves an accuracy of 27.8%, closely matching GPT-4.1 and outperforming DeepSeek-V3-0324 (12.2%) and Qwen3-235B-A22B (8.2%) by a substantial margin. In terminal interaction tasks measured by TerminalBench, Kimi-K2-Instruct attains 25.0% using the default Terminus framework and rises to 30% within Moonshot’s in-house agentic framework, underscoring its capabilities in real-world agentic programming scenarios. Moreover, on the Aider-Polyglot benchmark, Kimi-K2-Instruct attains a 60.0% accuracy while employing rigorous decontamination procedures, further illustrating its strength and reliability across diverse coding environments.

##### Tool Use Tasks.

We evaluate multi-turn tool use with two complementary suites: $\tau^{2}$-Bench and ACEBench. $\tau^{2}$-Bench extends the original $\tau$-bench single-control setup to a dual-control environment in which both the agent and an LLM-simulated user have constrained tool affordances over a shared state, adding a realistic Telecom troubleshooting domain alongside the prior Airline/Retail TAU tasks and enabling analysis of coordination vs. pure reasoning. ACEBench is a large bilingual (En/Zh) API-grounded benchmark (4.5K APIs across 8 domains; 2K annotated eval items) partitioned into Normal (basic/personalized/atomic), Special (imperfect or out-of-scope inputs), and Agent (scenario-driven multi-turn, multi-step sandbox) tracks with automated grading of calls and outcomes. All models run in non-thinking mode; we set the temperature to 0.0, use deterministic tool adapters, score $\tau^{2}$ Airline/Retail/Telecom under Avg@4 seeds with Pass@1/4, and report overall on ACEBench English. Kimi-K2-Instruct averages 66.1 micro Pass@1 across $\tau^{2}$ vs DeepSeek-V3-0324 48.8 / Qwen3-235B-A22B 37.3. On ACEBench Overall Kimi-K2-Instruct scores 76.5 vs DeepSeek 72.7 / Qwen 70.5 and remains competitive with GPT-4.1 (80.1).

##### Math & STEM & Logical Tasks.

For Math tasks, Kimi-K2-Instruct achieves consistently strong performance, averaging over Geimini-2.5-Flash by 5.3 percentage points, over DeepSeek-V3-0324 by 5.5 points and over GPT4.1 by 15.8 points. For example, on AIME 2024, Kimi-K2-Instruct scores 69.6%, outperforming another two top open-source models by a large margin, DeepSeek-V3-0324 by 10.2 points and Qwen3-235B-A22B by 29.5 points. In STEM evaluations, Kimi-K2-Instruct achieves 75.1% on GPQA-Diamond, outperforming DeepSeek-V3-0324 (68.4%) and all non-thinking baselines by at least 5 percentage points. On SuperGPQA, it also exceeds the previous best open-source model, DeepSeek-V3-0324, by 3.5 points. Kimi-K2-Instruct also surpasses the other two leading models in logical reasoning. It achieves 89.0% on ZebraLogic and 89.5% on AutoLogi, exceeding DeepSeek-V3-0324 (84.0%, 88.9%) and substantially outperforming Qwen3-235B-A22B (37.7%, 83.3%).

##### General Tasks.

Kimi-K2-Instruct ties DeepSeek-V3-0324 on MMLU and MMLU-Pro, and takes the lead on MMLU-Redux with a 92.7 EM score—slightly ahead of GPT-4.1 (92.4) and just 1.5 points behind Claude-Opus-4. Beyond multiple-choice tasks, the model achieves 31.0% accuracy on the short-answer SimpleQA—3.3 points above DeepSeek-V3-0324 and more than twice that of Qwen3-235B-A22B—though still below GPT-4.1 (42.3%). On the adversarial free-response LiveBench (2024-11-25 snapshot), it reaches 76.4%, surpassing Claude-Sonnet 4 (74.8%) and leading Gemini 2.5 Flash Preview by 8.6 points. Across this challenging triad measuring breadth, depth, and robustness of world knowledge, Kimi-K2-Instruct secures a top-tier position among open-source models. We evaluate instruction-following with IFEval and Multi-Challenge. On IFEval, Kimi-K2-Instruct scores 89.8%, higher than DeepSeek-V3-0324 (81.1%) and GPT-4.1 (88.0%). On Multi-Challenge, which involves multi-turn dialogues with conflicting instructions, it achieves 54.1%, outperforming DeepSeek-V3-0324 (31.4%), GPT-4.1 (36.4%), and Claude-Opus-4 (49.0%). These results demonstrate that Kimi-K2-Instruct integrates strong factual knowledge with consistent instruction adherence across both single- and multi-turn settings, supporting robust and reliable real-world deployment.

##### Long Context and Factuality Tasks.

To evaluate the factuality of Kimi-K2-Instruct, we employ three benchmarks: FACTS Grounding, which measures adherence to provided documents using the proprietary models GPT-4o, Gemini 1.5 Pro and Claude 3.5 Sonnet; HHEM, which assesses summarization quality via the open-source HHEM-2.1-Open judge; and FaithJudge, which analyzes faithfulness in RAG tasks with o3-mini as the judge. Kimi-K2-Instruct scores 88.5 on FACTS Grounding, substantially outperforming all open-source rivals and even surpassing the closed-source Gemini 2.5 Flash. With HHEM-2.1-Open it achieves a hallucination rate of 1.1 %, reported in the tables as 1 minus the

---

Kimi K2

TECHNICAL REPORT

**Figure: img-13.jpeg**

**The image is a bar chart titled 'Kimi-K2-Instruct Open-Ended Evaluation (aggregated)'. It compares the performance of Kimi-K2-Instruct against three other models: DeepSeek-V3-0324, Claude-Sonnet-4, and ChatGPT-4o-latest. The chart is divided into three horizontal bars, each representing a different comparison. Each bar is segmented into three sections: Win (blue), Tie (gray), and Loss (red). The percentages for each segment are as follows: Kimi-K2-Instruct vs DeepSeek-V3-0324 has 59.6% Win, 23.5% Tie, and 16.9% Loss; Kimi-K2-Instruct vs Claude-Sonnet-4 has 64.6% Win, 18.8% Tie, and 16.6% Loss; Kimi-K2-Instruct vs ChatGPT-4o-latest has 65.4% Win, 17.6% Tie, and 17.0% Loss. The x-axis represents the percentage win rate, ranging from 0% to 100%.**
Figure 11: Chinese in-house benchmark evaluation.

rate, i.e. 98.9. On FaithJudge's RAG tasks the hallucination rate is $7.4\%$, likewise presented as 92.6 for table consistency. For long-context capabilities, Kimi-K2-Instruct outperforms all open source and proprietary models on DROP $(93.5\%)$, and exceeds DeepSeek-V3-0324 on the retrieval task MRCR $(55.0\%$ vs $50.8\%)$. For the long-context reasoning tasks FRAMES and LongBench v2, Kimi-K2-Instruct $(77.1\%, 49.1\%)$ lags slightly behind DeepSeek-V3-0324 by around $2\%$.

##### Open-Ended Evaluation

Beyond static, closed-ended benchmarks, we evaluate the model's performance on open-ended, nuanced tasks that more closely resemble real-world usage.

For English scenarios, we leverage the Arena-Hard-Auto v2.0 benchmark, which uses LLM-as-a-judge protocols to assess generation quality across diverse, open-ended prompts [42]. These evaluations cover a wide range of high-difficulty prompts and are widely recognized in the research community. On Arena-Hard-Auto v2.0, Kimi-K2-Instruct achieves state-of-the-art win-rate on both hard prompts (54.5%) and creative writing tasks (85.0%), outperforming all open-source models and rivaling top proprietary systems such as GPT-4.1 and Claude Sonnet. These results underscore the model's strength in handling complex reasoning and nuanced generation under diverse, unconstrained settings.

However, Arena-Hard-Auto provides limited coverage of Chinese-specific tasks. To address this gap, we developed an in-house held-out benchmark grounded in authentic user queries. To safeguard the integrity of the evaluation, the benchmark data is access-restricted, thereby eliminating the risk of overfitting.

As shown in Figure 11, Kimi-K2-Instruct shows strong performance across all comparisons on Chinese in-house benchmarks. It outperforms ChatGPT-4o-latest with a  $65.4\%$  win rate, Claude Sonnet 4 with  $64.6\%$ , and DeepSeek-V3-0324 with  $59.6\%$ . In all cases, the loss rate stays low (around  $17\%$ ), indicating that Kimi-K2-Instruct rarely falls behind. The high win rates and consistent margins demonstrate its strong ability on open-ended Chinese tasks.

In addition to controlled evaluations, we also consider real-world user preference through public human assessments. As of July 17, 2025, Kimi-K2-Instruct ranked as the top open-source model and fifth overall on the LMSYS Arena leaderboard $^{7}$ , based on over 3,000 blind votes from real users. Unlike LLM-as-a-judge protocols, this leaderboard reflects direct human preference on diverse, user-submitted prompts, providing a complementary perspective on practical model performance.

The results on Arena-Hard-Auto, our in-house benchmark and votes from LMSYS Arena collectively offer a comprehensive view of Kimi-K2-Instruct's open-ended capabilities, showing that it is a highly preferred model in real-world user experience across English and Chinese.

# D QK-Clip Does Not Impair Model Quality

The QK-Clip design follows a minimal intervention principle: it activates only when necessary, and deactivates after training stabilizes. Empirical evidence and analysis converge on its negligible impact on model quality.

---

**Figure: img-14.jpeg**

**This graph shows the validation loss over training steps for two different scenarios: with QK-Clip (w/ QK-Clip) and without QK-Clip (w/o QK-Clip). The x-axis represents the number of training steps, ranging from 0 to 20,000, while the y-axis represents the validation loss, ranging from 1.6 to 2.8. The graph indicates that the validation loss decreases as the number of training steps increases. The scenario with QK-Clip consistently shows a lower validation loss compared to the scenario without QK-Clip, suggesting that QK-Clip helps in achieving better model performance.**
Figure 12: Applying QK-Clip to Muon in a small-scale setting with an aggressive threshold  $(\tau = 30)$  has negligible impact on loss, indicating that it is a safe and effective method for constraining attention logits.

##### Small-Scale Ablations

We train two small-scale MoE models (0.5B activated, 3B total parameters), one with vanilla Muon and the other with MuonClip using a low clipping threshold  $(\tau = 30)$ . As shown in Figure 12, applying MuonClip has negligible effects on the loss curve, indicating that even aggressive clipping does not impair convergence or training dynamics. Furthermore, evaluation on downstream tasks reveals no statistically significant degradation in performance. These results collectively demonstrate that MuonClip is a safe and effective method for bounding attention logits without compromising model quality.

##### Self-deactivation

In Kimi K2, QK-Clip was only transiently active:

- Initial 70000 steps:  $12.7\%$  of attention heads triggered QK-Clip at least once, clamping  $S_{\mathrm{max}}$  to 100.
- Post-70000 steps: All heads at some point reduced their  $S_{\mathrm{max}}$  below 100, rendering QK-Clip inactive.

When QK-Clip is active, it is applied per-head (rather than per-layer) to minimize potential over-regularization on other heads. After training stabilizes, QK-Clip is deactivated and has no effect at all.
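To make the per-head behaviour concrete, here is a minimal sketch of what such a clipping step could look like. It is not the report's implementation: the function name `qk_clip_per_head`, the array shapes, the toy data, and in particular the choice to split the rescaling factor evenly between the query and key weights (a square root on each) are assumptions made for illustration.

```python
import numpy as np

def qk_clip_per_head(w_q, w_k, s_max, tau=100.0):
    """Rescale per-head query/key weights so the max logit is capped at tau.

    w_q, w_k: arrays of shape (num_heads, d_model, d_head)
    s_max:    array of shape (num_heads,), observed max logit per head
    Returns the rescaled weights and a boolean mask of the heads that were clipped.
    """
    # gamma < 1 only for heads whose max logit exceeds tau; other heads keep gamma = 1
    gamma = np.minimum(1.0, tau / np.maximum(s_max, 1e-12))
    # Assumption: split the factor evenly, so q.k is scaled by sqrt(gamma)**2 = gamma
    scale = np.sqrt(gamma)[:, None, None]
    return w_q * scale, w_k * scale, gamma < 1.0

# Toy setup (hypothetical sizes): 16 tokens, d_model = 8, 2 heads with d_head = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
w_q = rng.normal(size=(2, 8, 4))
w_k = rng.normal(size=(2, 8, 4))
w_q[0] *= 50.0   # head 0 produces very large logits
w_q[1] *= 0.1    # head 1 stays far below the threshold

def max_logit(w_q, w_k):
    q = np.einsum("td,hde->hte", x, w_q)   # per-head queries
    k = np.einsum("td,hde->hte", x, w_k)   # per-head keys
    return np.einsum("hte,hse->hts", q, k).max(axis=(1, 2))

s_before = max_logit(w_q, w_k)
w_q_c, w_k_c, clipped = qk_clip_per_head(w_q, w_k, s_before, tau=100.0)
s_after = max_logit(w_q_c, w_k_c)
print(clipped, s_before, s_after)
```

Only the head that exceeds the threshold is rescaled; the other head's weights are returned unchanged, which is the point of clipping per head rather than per layer.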

# E Why Muon is More Prone to Logit Explosion

Logit explosion occurs when the largest pre-softmax attention score

$$
S _ {\max } = \max  _ {i, j} \left(q _ {i} \cdot k _ {j}\right) \tag {1}
$$

grows unboundedly during training. Since

$$
\left| q _ {i} \cdot k _ {j} \right| \leq \| q _ {i} \| \| k _ {j} \| \leq \| x _ {i} \| \| x _ {j} \| \| \mathbf {W} _ {q} \| \| \mathbf {W} _ {k} \|, \tag {2}
$$

and RMS-Norm keeps  $\| x_{i}\| \| x_{j}\|$  bounded, the phenomenon is primarily driven by the growing spectral-norm of  $\mathbf{W}_q$  or  $\mathbf{W}_k$ . Empirically, we found that Muon is more susceptible to logit explosion. We give our hypothesis below.

##### Structural difference in updates

Muon produces a weight update coming from the msign operation; as a result, all singular values of the update matrix are equal, so its effective rank is full. In contrast, a typical update matrix produced by Adam exhibits a skewed spectrum: a few large singular values dominate, and the effective rank is low. This low-rank assumption for Adam is not new; higher-order muP makes the same assumption.

This phenomenon is verified on the 16B Moonlight model, whose weights trained with Muon exhibit higher singular-value entropy (i.e. higher effective rank) than those trained with Adam, corroborating the theoretical intuition.

##### SVD formulation

Let the parameter matrix at step  $t - 1$  have the singular value decomposition

$$
\mathbf {W} _ {t - 1} = \sum_ {i} \sigma_ {i} u _ {i} v _ {i} ^ {\top} \tag {3}
$$

---

We write the update matrices as

$$
\Delta \mathbf{W}_{t} = \sum_{j} \bar{\sigma} \, \bar{u}_{j} \bar{v}_{j}^{\top} \tag{4}
$$

The next parameter update is therefore

$$
\mathbf{W}_{t} \leftarrow \sum_{i} \sigma_{i} u_{i} v_{i}^{\top} + \sum_{j} \bar{\sigma} \, \bar{u}_{j} \bar{v}_{j}^{\top} \tag{5}
$$

In Muon, since both the weights and the updates have a higher effective rank than with Adam, we hypothesize that a singular-vector pair $u_{i}v_{i}^{\top}$ is more likely to align with some $\bar{u}_{j}\bar{v}_{j}^{\top}$. This could cause the corresponding singular value of $\mathbf{W}_{t}$ to increase additively.
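The two ingredients of this hypothesis can be checked numerically. The sketch below assumes that msign is computed from an SVD by setting every singular value to 1 (the usual definition of the matrix sign), and uses small random matrices with hypothetical sizes. It shows that such an update has a flat spectrum, and that an update component aligned with a singular-vector pair of the weights adds directly to that singular value.

```python
import numpy as np

rng = np.random.default_rng(0)

def msign(m):
    """Matrix sign via SVD: keep the singular vectors, set every singular value to 1."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

# 1) An msign-based update has a flat spectrum (all singular values equal).
grad = rng.normal(size=(6, 4))
update = msign(grad)
update_sv = np.linalg.svd(update, compute_uv=False)

# 2) An update aligned with a singular-vector pair of W raises that
#    singular value additively, as in the sum of Eq. (5).
w = rng.normal(size=(6, 4))
u, s, vt = np.linalg.svd(w, full_matrices=False)
sigma_bar = 0.5
aligned = sigma_bar * np.outer(u[:, 0], vt[0])   # shares u_1 v_1^T with W
top_before = s[0]
top_after = np.linalg.svd(w + aligned, compute_uv=False)[0]

print(update_sv)              # flat spectrum
print(top_before, top_after)  # top singular value grows by sigma_bar
```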

##### Attention-specific amplification

Attention logits are computed via the bilinear form

$$
q_{i} \cdot k_{j} = \left(x_{i} \mathbf{W}_{q}\right) \cdot \left(x_{j} \mathbf{W}_{k}\right). \tag{6}
$$

The product $\mathbf{W}_{q}\mathbf{W}_{k}^{\top}$ squares the spectral norm, so any singular-value increase in either matrix is compounded. Muon’s tendency to enlarge singular values therefore translates into a higher risk of logit explosion.
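The bound in Eq. (2) and the compounding effect described above can be illustrated with a few lines of NumPy. This is only a toy check on random matrices with hypothetical sizes; the helper names `spectral_norm` and `max_abs_logit` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))    # 32 token representations, one per row
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(16, 8))

def spectral_norm(m):
    return np.linalg.norm(m, ord=2)   # largest singular value

def max_abs_logit(w_q, w_k):
    # Largest |q_i . k_j| over all token pairs, with q = x W_q and k = x W_k
    return np.abs((x @ w_q) @ (x @ w_k).T).max()

# Right-hand side of Eq. (2), using the largest token norm for both x_i and x_j
row_norms = np.linalg.norm(x, axis=1)
bound = row_norms.max() ** 2 * spectral_norm(w_q) * spectral_norm(w_k)
s_max = max_abs_logit(w_q, w_k)

# Growing both projection matrices by a factor c multiplies every logit by c**2
c = 3.0
s_max_scaled = max_abs_logit(c * w_q, c * w_k)

print(s_max, bound)              # s_max stays below the bound
print(s_max_scaled / s_max)      # ratio equals c**2
```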

## Appendix F K2 Critic Rubrics for General RL

### F.1 Core Rubrics

- Clarity and Relevance: Assesses the extent to which the response is succinct while fully addressing the user’s intent. The focus is on eliminating unnecessary detail, staying aligned with the central query, and using efficient formats such as brief paragraphs or compact lists. Unless specifically required, long itemizations should be avoided. When a choice is expected, the response should clearly offer a single, well-defined answer.
- Conversational Fluency and Engagement: Evaluates the response’s contribution to a natural, flowing dialogue that extends beyond simple question-answering. This includes maintaining coherence, showing appropriate engagement with the topic, offering relevant observations or insights, potentially guiding the conversation constructively when appropriate, using follow-up questions judiciously, handling hypothetical or personal-analogy queries gracefully, and adapting tone effectively to suit the conversational context (e.g., empathetic, formal, casual).
- Objective and Grounded Interaction: Assesses the response’s ability to maintain an objective and grounded tone, focusing squarely on the substance of the user’s request. It evaluates the avoidance of both metacommentary (analyzing the query’s structure, topic combination, perceived oddity, or the nature of the interaction itself) and unwarranted flattery or excessive praise directed at the user or their input. Excellent responses interact respectfully but neutrally, prioritizing direct, task-focused assistance over commentary on the conversational dynamics or attempts to curry favor through compliments.

### F.2 Prescriptive Rubrics

- Initial Praise: Responses must not begin with compliments directed at the user or the question (e.g., “That’s a beautiful question”, “Good question!”).
- Explicit Justification: Any sentence or clause that explains why the response is good or how it successfully fulfilled the user’s request. This is different from simply describing the content.

### F.3 Limitations

One potential side effect of this evaluation framework is that it may favor responses that appear confident and assertive, even in contexts involving ambiguity or subjectivity. This stems from two key constraints in the current rubric:

- Avoidance of Self-Qualification: The prescriptive rules prohibit self-assessments, explicit disclaimers, or hedging language (e.g., “this may not be accurate”, “I might be wrong”). While these phrases can reflect epistemic humility, they are often penalized as non-informative or performative.
- Preference for Clarity and Singularity: The rubric rewards direct, decisive answers when users ask for a recommendation or explanation. In complex or open-ended scenarios, this may disincentivize appropriately cautious or multi-perspective responses.

---

As a result, the model may occasionally overstate certainty in areas where ambiguity, nuance, or epistemic modesty would be more appropriate. Future iterations of the framework may incorporate more fine-grained handling of calibrated uncertainty.

# G Engine Switching Pipeline for RL Training

**Figure: img-15.jpeg**

**This image is a Gantt chart that illustrates the scheduling of various operations across four devices. The chart uses different colors to represent different types of operations: H2D Buffer (blue), IPC Buffer (light orange), Reload weights (dark orange), Broadcast (src) (yellow), and Broadcast (dst) (light beige). Each row corresponds to a different device (Device 0 to Device 3), and the horizontal axis represents time. The chart shows the sequence and overlap of these operations, indicating how tasks are managed and synchronized across multiple devices.**
(a) Theoretical perfect three-stage pipeline weight update

**Figure: img-16.jpeg**

**The image depicts a Gantt chart, which is a type of bar chart that illustrates a project schedule. Each horizontal bar represents a task, with the length of the bar corresponding to the duration of the task. The chart includes overlapping and sequential tasks, indicated by the alignment and spacing of the bars. Different colors are used to distinguish between different tasks or phases of the project. The chart helps in visualizing the start and end dates of tasks, their dependencies, and the overall project timeline.**
(b) A PCIE bounded three-stage pipeline

**Figure: img-17.jpeg**

**The image shows a sequence of Gantt charts, each representing a different project timeline. The charts are composed of horizontal bars divided into segments, with different colors indicating various stages or activities within each project. The blue segments likely represent planned or scheduled activities, while the orange and beige segments could signify different phases or statuses of the tasks, such as in progress or completed. Each row in the image represents a different project, showing how tasks overlap and progress over time.**
(c) Fixed two-stage pipeline
Figure 13: pipeline for RL weight update

The checkpoint engine manages three equal-size device buffers on each GPU: an H2D buffer for loading the offloaded model parameters, and two IPC buffers for GPU-to-GPU broadcast. The IPC buffers are shared with the inference engines, allowing them to directly access the same physical memory. These three buffers allow us to arrange the three steps in a pipeline.

Theoretical three-stage pipeline. As illustrated in Figure 13a, a three-stage pipeline is introduced. (1) H2D: a shard of the latest weights is copied into the H2D buffer asynchronously. (2) Broadcast: once the copy completes, the shard is copied to one of the IPC buffers and broadcast to all devices. (3) Reload: inference engines simultaneously load parameters from the other IPC buffer.

Two-stage pipeline due to PCIe saturation. On NVIDIA H800 clusters, concurrent H2D and broadcast saturate the shared PCIe fabric, collapsing the three stages into a sequential procedure (Figure 13b). We therefore adopt a simpler, two-stage scheme (Figure 13c): (1) All devices perform a single, synchronous H2D transfer. (2) The broadcast and reload proceed in parallel.

The two-stage pipeline is bound by multiple synchronous H2D copy operations. At large device counts, however, the model is split into shards small enough that the entire parameter set fits into the H2D buffer in one transfer, and this overhead disappears.

By overlapping H2D, Broadcast, and Reload weights, we can obtain a high bandwidth to reshard the weights from train engines to all inference engines.

In [25]:
# authors are on the last page:
final_page = process_saved_ocr_json("RAG/OCR/responses.json", range=[3, 4])
print(final_page[:1000])

## Document Summary
**{'summary': "The document presents a technical report on Kimi K2, detailing its contributions, token template for tool calling, evaluation details, and other technical aspects. It includes a list of authors, an explanation of the token structure for tool calling, and performance metrics across various benchmarks. The report highlights Kimi K2's superior performance in coding tasks, tool use tasks, math and STEM tasks, general tasks, and long context and factuality tasks. It also discusses the minimal impact of QK-Clip on model quality and the engine switching pipeline for RL training.", 'authors': ['Yifan Bai', 'Yiping Bao', 'Guanduo Chen', 'Jiahao Chen', 'Ningxin Chen', 'Ruijue Chen', 'Yanru Chen', 'Yuankun Chen', 'Yutian Chen', 'Zhuofu Chen*', 'Jialei Cui', 'Hao Ding', 'Mengnan Dong', 'Ang`ang Du', 'Chenzhuang Du', 'Dikang Du', 'Yulun Du', 'Yu Fan', 'Yichen Feng', 'Kelin Fu', 'Bofei Gao', 'Hongcheng Gao', 'Peizhong Gao', 'Tong Gao', 'Xinran Gu', 'Longyu Guan', '

### 2.2 Parsing Metadata

Metadata is an important part of a RAG system: we want to present retrieved content to the user together with its source, ideally the page it came from, plus any images and tables we parsed.

In [36]:
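import json


def load_ocr_json(json_path: str, range: list[int] | None = None) -> list[dict]:
    """
    Loads the saved JSON list of OCR responses.

    Args:
        json_path: Path to the JSON file containing OCR responses
        range: Optional range [start, end] to slice the responses list
               (the name shadows the builtin; kept so existing calls still work)

    Returns:
        List of OCR response dictionaries
    """
    with open(json_path, "r", encoding="utf-8") as f:
        responses_list = json.load(f)

    # Return sliced or full list
    return responses_list[range[0]:range[1]] if range is not None else responses_list


def load_ocr_text(json_path: str, range: list[int] | None = None) -> list[str]:
    """
    Loads the saved JSON list of OCR responses and returns the text only.

    Args:
        json_path: Path to the JSON file containing OCR responses
        range: Optional range [start, end] to slice the responses list

    Returns:
        The text of the pages, with images and tables replaced by their indices.
    """
    # Assumption: each response has a "pages" list whose entries keep the page
    # text under "markdown", where images and tables are already referenced by
    # index placeholders (e.g. img-0.jpeg).
    responses = load_ocr_json(json_path, range)
    return [page["markdown"] for resp in responses for page in resp["pages"]]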

In [37]:
# Load all responses
all_responses = load_ocr_json("RAG/OCR/responses.json")
# Access the data
for resp in all_responses:
    print(resp["document_annotation"])

{'summary': 'The document introduces Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. It highlights the MuonClip optimizer, which improves upon the Muon optimizer with a novel QK-clip technique to enhance training stability and token efficiency. Kimi K2 was pre-trained on 15.5 trillion tokens and underwent a multi-stage post-training process, including a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage. The model demonstrates state-of-the-art performance on various benchmarks, particularly excelling in agentic capabilities and tasks related to software engineering. The document also discusses the technical aspects of the model, including the MuonClip optimizer, the architecture of Kimi K2, and the training infrastructure used. It concludes by mentioning the release of the base and post-trained model checkpoints to facilitate future research and applications of agentic intel

This is one part of our metadata. Since we split the document into several parts, we may want to merge the author fields of all parts, but take the summary only from the first part (where the abstract is), or ask an LLM to condense the per-part summaries into one. Another piece of metadata is the page number within the document, which is easy to retrieve.
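A minimal sketch of the first option (merge the authors, keep the first summary) could look like the following. The helper name `merge_document_annotations` and the toy inputs are hypothetical; the sketch assumes each annotation is either a dict or a JSON string with `summary` and `authors` keys, as in the output printed above.

```python
import json

def merge_document_annotations(annotations):
    """Fuse per-part annotations into one metadata dict.

    Authors are concatenated in order with duplicates dropped; the summary
    is taken from the first part, where the abstract usually lives.
    """
    # Accept both dicts and JSON strings
    parsed = [json.loads(a) if isinstance(a, str) else a for a in annotations]
    authors = []
    for part in parsed:
        for name in part.get("authors", []):
            if name not in authors:
                authors.append(name)
    summary = parsed[0].get("summary", "") if parsed else ""
    return {"summary": summary, "authors": authors}

# Toy annotations standing in for resp["document_annotation"] of each part
parts = [
    {"summary": "Abstract-level summary.", "authors": []},
    '{"summary": "Appendix summary.", "authors": ["Yifan Bai", "Yiping Bao"]}',
    {"summary": "Last pages.", "authors": ["Yiping Bao", "Guanduo Chen"]},
]
metadata = merge_document_annotations(parts)
print(metadata)
```

Keeping the authors in a list rather than a set preserves the order in which they appear in the paper.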
 
But another important part of our metadata is the images and the tables. 

We want to replace images and tables with a textual description for our RAG. Then we'll map to the actual base64/html encoding with metadata.  
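As a rough sketch of this replacement step: the function below swaps each placeholder for a short textual description and records which items were found, so that the original base64/HTML payload can be attached as metadata. The placeholder syntax (`![img-0.jpeg](img-0.jpeg)` and `[tbl-0.html](tbl-0.html)`), the helper name `describe_placeholders`, and the sample data are assumptions for illustration.

```python
import re

# Assumed placeholder syntax: ![img-0.jpeg](img-0.jpeg) or [tbl-0.html](tbl-0.html)
PLACEHOLDER = re.compile(r"!?\[((?:img|tbl)-\d+\.\w+)\]\(\1\)")

def describe_placeholders(markdown, descriptions):
    """Replace image/table placeholders with text; return new text and the ids found."""
    found = []

    def _swap(match):
        item_id = match.group(1)
        found.append(item_id)
        kind = "Figure" if item_id.startswith("img") else "Table"
        text = descriptions.get(item_id, "no description available")
        return f"**{kind}: {item_id}**\n\n**{text}**"

    return PLACEHOLDER.sub(_swap, markdown), found

# Toy page and payloads (hypothetical content)
page = "Intro.\n\n![img-0.jpeg](img-0.jpeg)\n\nResults in [tbl-0.html](tbl-0.html)."
descriptions = {"img-0.jpeg": "A bar chart of benchmark scores."}
payloads = {"img-0.jpeg": "<base64 data>", "tbl-0.html": "<table>...</table>"}

text, found = describe_placeholders(page, descriptions)
# Keep the heavy payloads out of the embedded text and carry them as metadata
metadata = {item_id: payloads[item_id] for item_id in found}
print(text)
```

Only the descriptions end up in the text that gets embedded; the encoded images and tables travel alongside each chunk and can be shown to the user at answer time.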

In [None]:
first_img = all_responses[0]["pages"][0]["images"][0]
print(first_img['id'])
print(first_img['image_annotation'])

img-0.jpeg
{
  "image_type": "graph",
  "description": "This image contains a series of bar charts comparing the performance of various AI models across different benchmarks. The benchmarks are categorized into 'Agentic and Competitive Coding' and 'Math & STEM'. Each bar chart represents a specific benchmark, such as SWE-bench Verified, SWE-bench Multilingual, LiveCodeBench v6, OJBench, Tau2-bench micro-average, AceBench (en), AIME 2025, and GPQA-Diamond. The AI models compared include Kimi-K2-Instruct, DeepSeekV3-0324, Owen3-2535B-A22B, OpenAI GPT-4.1, Claude 4 Opus, Claude 4 Sonnet, and Gemini 2.5 Flash non-thinking. The performance scores for each model are displayed on the y-axis, with higher scores indicating better performance. The charts show that different models perform variably across different benchmarks, with Kimi-K2-Instruct generally performing well across most benchmarks."
}


The functions above already parse our OCR'd document nicely: we get the full markdown with indexed placeholders for images and tables, together with the images' descriptions.

But what about table descriptions?

Well, we may have to do this manually, since Mistral does not implement that specific function. 

*My idea is to give the PDF of the paper, together with the extracted tables, to a multimodal model that returns a summary of each table along with its index. Then we swap those summaries in for the tables extracted by Mistral.*

Let's do that:

In [71]:
from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from dotenv import load_dotenv
import os
from pydantic import BaseModel, Field, SecretStr

load_dotenv()
llm = ChatOpenAI(
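    # OpenRouter model IDs carry a provider prefix; the bare "gemini-2.5-flash" is rejected with a 400 error
    model="google/gemini-2.5-flash",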
    
    # redirect LangChain to OpenRouter
    base_url="https://openrouter.ai/api/v1",

    # pass the OpenRouter key
    api_key=SecretStr(os.environ["OPENROUTER_API_KEY"])
)

prompt = """
You are a document indexing assistant. 
I will provide a document and its corresponding OCR Markdown. 
Your task is to find every table placeholder (e.g., [tbl-x.html]) in the text, identify what that table represents, and return a JSON mapping.

Specifically, you must return an answer composed of 2 parts:
- title: the title of the table, extracted from the document.
- description: the description of the table. 
This description must be thorough and include all the information in the table, highlighting the main points and the context.
"""

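class ResponseFormat(BaseModel):
    # Field's first positional argument is the default value, so the text must go in `description`
    title: str = Field(description="The title of the table, extracted from the document.")
    description: str = Field(description="The description of the table.")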

agent = create_agent(
    model=llm,
    system_prompt=prompt,
    response_format=ResponseFormat,
)

In [72]:
from langchain_core.messages import HumanMessage

# get the full text with no base64 image data but with table contents (as html)
# we want this llm to check the tables content and have full context 
full_text = process_saved_ocr_json("RAG/OCR/responses_reindexed.json", include_images=False, include_tables=True)
# we also pass the pdf in order to have full context (maybe not needed)
base64_pdf = encode_pdf("RAG/documents/KimiK2.pdf")

message = HumanMessage(
    content=[
        {
            "type": "text",
            "text": f"Here is the structured OCR text and tables for context:\n\n{full_text}"
        },
        {
            # The PDF is passed as a base64 data URL in an image_url block;
            # some providers expect a dedicated file block for PDFs instead
            "type": "image_url",
            "image_url": {
                "url": f"data:application/pdf;base64,{base64_pdf}"
            }
        },
    ]
)

input_state = {"messages" : [message]}

In [73]:
result = agent.invoke(input_state)

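BadRequestError: Error code: 400 - {'error': {'message': 'gemini-2.5-flash is not a valid model ID', 'code': 400}, 'user_id': 'user_36hWo69Fljn0qd35HPzkWJkn7jD'}

> **Note:** this 400 error comes from passing the bare ID `gemini-2.5-flash`: OpenRouter expects provider-prefixed model IDs such as `google/gemini-2.5-flash`.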

In [None]:
print(result['structured_response'])