How to split JSON data

This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between a min_chunk_size and the max_chunk_size.
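
To make the traversal concrete, here is a minimal pure-Python sketch of a depth-first splitter of this kind. It is an approximation written for illustration, not the library's implementation: the helper names `json_size`, `set_nested`, and `split_depth_first` are hypothetical, and it assumes the top-level value is a dict and that sizes are measured on the default `json.dumps` output.

```python
import json
from typing import Any


def json_size(data: Any) -> int:
    """Size of a value, taken as the length of its JSON serialization."""
    return len(json.dumps(data))


def set_nested(target: dict, path: list, value: Any) -> None:
    """Store value under the given key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_depth_first(data, max_chunk_size, min_chunk_size, path=None, chunks=None):
    """Hypothetical sketch: walk a dict depth first and fill chunks as it goes."""
    path = path or []
    chunks = chunks if chunks is not None else [{}]
    if isinstance(data, dict) and data:
        for key, value in data.items():
            new_path = path + [key]
            used = json_size(chunks[-1])
            needed = json_size({key: value})
            if needed < max_chunk_size - used:
                # The whole value fits in the current chunk: keep it intact.
                set_nested(chunks[-1], new_path, value)
            else:
                # Start a new chunk only once the current one is large enough.
                if used >= min_chunk_size:
                    chunks.append({})
                # Descend so a nested object can be spread over several chunks.
                split_depth_first(value, max_chunk_size, min_chunk_size, new_path, chunks)
    elif path:
        # Leaf value (string, number, list, empty dict): stored whole, never split.
        set_nested(chunks[-1], path, data)
    return chunks


example = {"a": {"b": "x" * 30, "c": "y" * 30}, "d": "z"}
chunks = split_depth_first(example, max_chunk_size=50, min_chunk_size=20)
```

Each chunk repeats the full key path down to the values it holds, so the nested object under `"a"` ends up spread over two chunks that both start at `"a"`. Because leaf values are stored whole in this sketch, a single long string can still push a chunk past the maximum size.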

If the value is not a nested JSON object but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this with a recursive text splitter on those chunks. There is an optional pre-processing step to split lists by first converting them to JSON objects (dicts) and then splitting them as such.
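
The list pre-processing step can be pictured with a short sketch. The helper `lists_to_dicts` below is hypothetical and uses the element index (as a string) for each key, which is an assumption made for illustration. In `langchain_text_splitters` the step is switched on with the `convert_lists=True` argument of `split_json`, `split_text`, or `create_documents`; it is worth checking the exact behavior against the version you have installed.

```python
from typing import Any


def lists_to_dicts(data: Any) -> Any:
    """Recursively replace every list with a dict keyed by element index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        # Assumption: string indices "0", "1", ... stand in for list positions.
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data


record = {"skills": ["Python", "Django"], "salary": 85000}
converted = lists_to_dicts(record)
```

Once a list has become a dict, a dict-based splitter can spread its elements over several chunks instead of keeping the whole list in one.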

How the text is split: by JSON value.
How the chunk size is measured: by number of characters.
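
As a rough illustration of what counting characters means for a JSON chunk, the sketch below serializes a small dict and takes the length of the resulting string. It assumes the size is taken on the default `json.dumps` output, which includes spaces after `:` and `,`; that detail is an assumption here and may differ between versions.

```python
import json

chunk = {"status": "success", "timestamp": "2025-09-13T14:30:00Z"}

# Default serialization, with a space after each ":" and ",".
default_size = len(json.dumps(chunk))

# Compact serialization, shown only to make clear that the character
# count depends on how the chunk is written out.
compact_size = len(json.dumps(chunk, separators=(",", ":")))
```

Keys, quotes, braces, and separators all count toward the size, so deeply nested data uses up the budget faster than the raw values alone would suggest.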

In [1]:
json_data = {
  "status": "success",
  "timestamp": "2025-09-13T14:30:00Z",
  "data": {
    "company": {
      "id": "COMP123",
      "name": "TechNova Solutions",
      "location": "San Francisco, CA",
      "industry": "Software Development"
    },
    "employees": [
      {
        "id": "EMP001",
        "name": "Alice Johnson",
        "role": "Software Engineer",
        "department": "Engineering",
        "email": "alice.johnson@technova.com",
        "salary": 85000,
        "skills": ["Python", "Django", "React"]
      },
      {
        "id": "EMP002",
        "name": "Michael Smith",
        "role": "Data Scientist",
        "department": "AI Research",
        "email": "michael.smith@technova.com",
        "salary": 92000,
        "skills": ["Python", "TensorFlow", "NLP"]
      },
      {
        "id": "EMP003",
        "name": "Sophia Brown",
        "role": "HR Manager",
        "department": "Human Resources",
        "email": "sophia.brown@technova.com",
        "salary": 75000,
        "skills": ["Recruitment", "Employee Relations", "Compliance"]
      }
    ]
  }
}


In [3]:
from langchain_text_splitters import RecursiveJsonSplitter

# Split the nested dict into a list of smaller dicts.
json_splitter = RecursiveJsonSplitter(max_chunk_size=10)
json_chunk = json_splitter.split_json(json_data=json_data)
json_chunk

[{'status': 'success', 'timestamp': '2025-09-13T14:30:00Z'},
 {'data': {'company': {'id': 'COMP123', 'name': 'TechNova Solutions'}}},
 {'data': {'company': {'location': 'San Francisco, CA'}}},
 {'data': {'company': {'industry': 'Software Development'}}},
 {'data': {'employees': [{'id': 'EMP001',
     'name': 'Alice Johnson',
     'role': 'Software Engineer',
     'department': 'Engineering',
     'email': 'alice.johnson@technova.com',
     'salary': 85000,
     'skills': ['Python', 'Django', 'React']},
    {'id': 'EMP002',
     'name': 'Michael Smith',
     'role': 'Data Scientist',
     'department': 'AI Research',
     'email': 'michael.smith@technova.com',
     'salary': 92000,
     'skills': ['Python', 'TensorFlow', 'NLP']},
    {'id': 'EMP003',
     'name': 'Sophia Brown',
     'role': 'HR Manager',
     'department': 'Human Resources',
     'email': 'sophia.brown@technova.com',
     'salary': 75000,
     'skills': ['Recruitment', 'Employee Relations', 'Compliance']}]}}]
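
Each chunk in this output repeats the full key path down to the values it carries, so the pieces can be recombined. The sketch below merges the first four chunks from the output above back into one nested dict; `deep_merge` is a hypothetical helper written for this example and is not part of the library.

```python
import copy


def deep_merge(target: dict, source: dict) -> dict:
    """Merge source into target, descending into dicts present on both sides."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            # Copy so that later merges never mutate the original chunks.
            target[key] = copy.deepcopy(value)
    return target


chunks = [
    {"status": "success", "timestamp": "2025-09-13T14:30:00Z"},
    {"data": {"company": {"id": "COMP123", "name": "TechNova Solutions"}}},
    {"data": {"company": {"location": "San Francisco, CA"}}},
    {"data": {"company": {"industry": "Software Development"}}},
]

merged = {}
for chunk in chunks:
    deep_merge(merged, chunk)
```

After the loop, `merged["data"]["company"]` holds all four company fields again, which is a quick way to check that no values were lost during splitting.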

In [4]:
json_doc = json_splitter.create_documents([json_data])
json_doc[:3]

[Document(metadata={}, page_content='{"status": "success", "timestamp": "2025-09-13T14:30:00Z"}'),
 Document(metadata={}, page_content='{"data": {"company": {"id": "COMP123", "name": "TechNova Solutions"}}}'),
 Document(metadata={}, page_content='{"data": {"company": {"location": "San Francisco, CA"}}}')]
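
The `page_content` of each document shown above is a JSON string, so it can be parsed back into a dict when structured access is needed. The sketch below copies the three strings from that output and uses only the standard library; the variable names are illustrative.

```python
import json

# page_content values copied from the three documents shown above.
page_contents = [
    '{"status": "success", "timestamp": "2025-09-13T14:30:00Z"}',
    '{"data": {"company": {"id": "COMP123", "name": "TechNova Solutions"}}}',
    '{"data": {"company": {"location": "San Francisco, CA"}}}',
]

# Parse each string back into a Python dict.
parsed = [json.loads(text) for text in page_contents]
company_name = parsed[1]["data"]["company"]["name"]
```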