# How to split JSON data
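This JSON splitter [splits](/docs/concepts/text_splitters/) JSON data while allowing control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but splits them when that is needed to keep chunk sizes between `min_chunk_size` and `max_chunk_size`.
If a value is not nested JSON but a very large string, that string is not split. If you need a hard cap on chunk size, consider running a recursive text splitter over those chunks. There is also an optional pre-processing step for splitting lists: each list is first converted to JSON (dict) form and then split like any other dict.
1. How the text is split: by JSON value.
2. How the chunk size is measured: by number of characters.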
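
The depth-first behaviour described above can be pictured with a small standalone sketch. It is an illustration only, not the library's implementation: the names `json_size`, `set_nested`, and `split_nested` are hypothetical, chunk size is assumed to be the character length of `json.dumps` output, and `min_chunk_size` is ignored for brevity.

```python
import json
from typing import Any


def json_size(obj: Any) -> int:
    """Chunk size measured as the character length of the serialized JSON."""
    return len(json.dumps(obj))


def set_nested(target: dict, path: list, value: Any) -> None:
    """Store value under the full key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_nested(data: Any, max_size: int, path=None, chunks=None) -> list:
    """Depth-first split of a nested dict; every chunk keeps its key path."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict) and data:
        for key, value in data.items():
            new_path = path + [key]
            remaining = max_size - json_size(chunks[-1])
            if json_size({key: value}) < remaining:
                # The whole subtree fits, so keep it intact in the current chunk.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a fresh chunk (unless the current one is still empty)
                # and descend into the value to split it further.
                if chunks[-1]:
                    chunks.append({})
                split_nested(value, max_size, new_path, chunks)
    elif path:
        # Leaf value (string, number, list, ...): stored whole, even if oversized.
        set_nested(chunks[-1], path, data)
    return chunks


sample = {
    "a": {"x": "1" * 10, "y": "2" * 10},
    "b": {"z": "3" * 10},
}
chunks = split_nested(sample, max_size=30)
for chunk in chunks:
    print(json_size(chunk), chunk)
```

Because leaves are never cut, a single long string or an unconverted list can still push a chunk past `max_size` in this sketch, which mirrors the caveat above.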

In [None]:
%pip install -qU langchain-text-splitters

First, we load some JSON data:

In [1]:
import json

import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

## Basic usage
Specify `max_chunk_size` to constrain chunk sizes:

In [2]:
from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

To obtain JSON chunks, use the `.split_json` method:

In [3]:
# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)

for chunk in json_chunks[:3]:
    print(chunk)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'servers': [{'url': 'https://api.smith.langchain.com', 'description': 'LangSmith API endpoint.'}]}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.', 'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}


To obtain LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects, use the `.create_documents` method:

In [4]:
# The splitter can also output documents
docs = splitter.create_documents(texts=[json_data])

for doc in docs[:3]:
    print(doc)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'


Or use `.split_text` to obtain the string content directly:

In [5]:
texts = splitter.split_text(json_data=json_data)

print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "servers": [{"url": "https://api.smith.langchain.com", "description": "LangSmith API endpoint."}]}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}


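## How to manage chunk sizes from list content
Note that one of the chunks in this example (469 characters, at index 3) exceeds the specified `max_chunk_size` of 300. Printing that chunk shows that it contains a list object: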

In [6]:
print([len(text) for text in texts][:10])
print()
print(texts[3])

[171, 231, 126, 469, 210, 213, 237, 271, 191, 232]

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}
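
A quick way to spot such chunks is to filter by length. The helper below is a hypothetical convenience, not part of the splitter's API, and the stand-in strings only reproduce the first few lengths printed above.

```python
def oversized(texts, max_chunk_size):
    """Return (index, length) pairs for chunks longer than max_chunk_size."""
    return [(i, len(t)) for i, t in enumerate(texts) if len(t) > max_chunk_size]


# Stand-in strings with the same lengths as the first five chunks above.
sample_texts = ["x" * n for n in [171, 231, 126, 469, 210]]
flagged = oversized(sample_texts, 300)
print(flagged)
```

With the real `texts` list from `.split_text`, the same call would list every chunk that still needs attention.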


By default, the JSON splitter does not split lists.
Specify `convert_lists=True` to preprocess the JSON, converting list content to dicts with `index:item` as `key:val` pairs:

In [7]:
texts = splitter.split_text(json_data=json_data, convert_lists=True)

Let's look at the chunk sizes. The ones shown are now all under the maximum:

In [8]:
print([len(text) for text in texts][:10])

[176, 236, 141, 203, 212, 221, 210, 213, 242, 291]
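
The list pre-processing can be sketched as a recursive conversion in which each list becomes a dict keyed by the item's index. This is a minimal illustration under that assumption; `lists_to_dicts` is a hypothetical name and not the library's own function.

```python
from typing import Any


def lists_to_dicts(data: Any) -> Any:
    """Recursively replace every list with a dict of {index: item} pairs."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # String indices keep the result valid as JSON object keys.
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    return data


converted = lists_to_dicts(
    {"tags": ["tracer-sessions"], "security": [{"API Key": []}]}
)
print(converted)
```

Once lists are dicts, the splitter can descend into them like any other nested object, which is why the oversized chunk can now be broken up.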


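The list has been converted to a dict, but it retains all the needed contextual information even when it is split across several chunks: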

In [9]:
print(texts[1])

{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": {"0": "tracer-sessions"}, "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}


In [10]:
# We can also look at the documents
docs[1]

Document(page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session.", "operationId": "read_tracer_session_api_v1_sessions__session_id__get"}}}}')