#### How to split JSON data
`RecursiveJsonSplitter` splits JSON data while giving you control over chunk sizes. It traverses the JSON data depth first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but will split them if needed to keep chunks between `min_chunk_size` and `max_chunk_size`.

If a value is not a nested JSON object but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a recursive text splitter applied to those chunks. There is also an optional preprocessing step that splits lists by first converting them to JSON objects (dicts) and then splitting them as such.
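The list preprocessing step can be pictured with a short sketch. It assumes each list is rewritten as a dict keyed by the stringified index of its items; `lists_to_dicts` is a hypothetical helper written for illustration, not the library's own function.

```python
def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by stringified index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    return data  # scalars (str, int, bool, None) pass through unchanged


# Hypothetical sample shaped like the time slots used later in this notebook
sample = {"intervals": [{"start": "08:00"}, {"start": "07:00"}]}
converted = lists_to_dicts(sample)
print(converted)
```

Once every list is a dict, a splitter that only knows how to descend into dicts can break a long list apart entry by entry.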

- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.
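To make the depth-first idea concrete, here is a simplified sketch that measures size as the character count of the serialized JSON. It is an illustration under stated assumptions, not the library's implementation: `split_json_sketch`, `json_size`, and `set_nested` are hypothetical names, the sample data is made up, and `min_chunk_size` is ignored.

```python
import json


def json_size(data):
    """Chunk size is the number of characters in the serialized JSON."""
    return len(json.dumps(data))


def set_nested(target, path, value):
    """Store value in target under the nested key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size=80, path=(), chunks=None):
    """Depth-first split of a dict into chunks of at most max_chunk_size characters."""
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = (*path, key)
        wrapped = {}
        set_nested(wrapped, new_path, value)  # the value with its full key path
        item_size = json_size(wrapped)
        room_left = max_chunk_size - json_size(chunks[-1])
        if item_size <= room_left:
            # Fits next to what is already in the current chunk: keep it whole
            set_nested(chunks[-1], new_path, value)
        elif isinstance(value, dict) and value and item_size > max_chunk_size:
            # Nested object too large for any single chunk: descend and split it
            split_json_sketch(value, max_chunk_size, new_path, chunks)
        else:
            # Fits in a fresh chunk, or is a leaf (such as a long string)
            # that is never split and may therefore exceed the cap
            if chunks[-1]:
                chunks.append({})
            set_nested(chunks[-1], new_path, value)
    return chunks


sample = {
    "name": "Dr. A",
    "speciality": "Medicine",
    "slots": {
        "monday": {"start": "08:00", "end": "10:00"},
        "tuesday": {"start": "09:00", "end": "11:00"},
    },
}
chunks = split_json_sketch(sample, max_chunk_size=80)
for chunk in chunks:
    print(json_size(chunk), chunk)
```

Each chunk keeps the full key path of its values, so it can be read on its own. A single string longer than `max_chunk_size` still ends up in one oversized chunk, which is why a text splitter is needed for a hard cap.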


API reference: [RecursiveJsonSplitter](https://python.langchain.com/api_reference/text_splitters/json/langchain_text_splitters.json.RecursiveJsonSplitter.html#langchain_text_splitters.json.RecursiveJsonSplitter)


In [31]:
# add this import for running in jupyter notebook
import nest_asyncio
import json

nest_asyncio.apply()

from langchain_community.document_loaders.mongodb import MongodbLoader

loader = MongodbLoader(
    # Placeholder credentials: substitute your own and never commit real secrets
    connection_string="mongodb+srv://<username>:<password>@meditouch-backend.ogsmo.mongodb.net/meditouch?retryWrites=true",
    db_name="meditouch",
    collection_name="doctors",
    field_names=["name", "email", "phone", "speciality", "all_timeslots"],
)

docs = loader.load()

docs


[Document(metadata={'database': 'meditouch', 'collection': 'doctors'}, page_content=" Dr. Arnob connect.arnob@gmail.com 01738439423 Medicine [{'date': '2024-12-10', 'intervals': [{'start': '08:00', 'end': '10:00', '_id': ObjectId('6755f64a492b8447dbc94edf')}, {'start': '07:00', 'end': '11:00', '_id': ObjectId('6755f64a492b8447dbc94ee0')}, {'start': '04:00', 'end': '12:00', '_id': ObjectId('6755f64a492b8447dbc94ee1')}], 'timePerInterval': 60, '_id': ObjectId('6755f82511985b3589e5aaa6')}, {'date': '2024-15-10', 'intervals': [{'start': '09:00', 'end': '10:00', '_id': ObjectId('6755f89211985b3589e5aab2')}, {'start': '11:00', 'end': '12:00', '_id': ObjectId('6755f89211985b3589e5aab4')}], 'timePerInterval': 15, '_id': ObjectId('6755f89211985b3589e5aab1')}, {'date': '2024-19-10', 'intervals': [{'start': '09:00', 'end': '10:00', '_id': ObjectId('6755f89711985b3589e5aac1')}, {'start': '10:00', 'end': '11:00', '_id': ObjectId('6755f89711985b3589e5aac2')}, {'start': '11:00', 'end': '12:00', '_id'

In [32]:
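from langchain_text_splitters import RecursiveJsonSplitter

# split_json expects a dict; passing the list of Document objects raises
# "IndexError: list index out of range", so wrap the page contents in a dict first
doctors_json = {"doctors": [doc.page_content for doc in docs]}

json_splitter = RecursiveJsonSplitter()
# convert_lists=True turns the list into a dict so it can be split entry by entry
json_chunks = json_splitter.split_json(json_data=doctors_json, convert_lists=True)
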

In [None]:
json_chunks

In [5]:
for chunk in json_chunks[:3]:
    print(chunk)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'], 'summary': 'Read Tracer Session', 'description': 'Get a specific session.'}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}, {'API Key': []}]}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'include_stats', 'in': 'query', 'required': False, 'schema': {'type': 'boolean', 'default': False, 'title': 'Include Stats'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}


In [6]:
# The splitter can also output documents.
# json_data is the dict to split; the output below comes from the LangSmith OpenAPI spec
docs = json_splitter.create_documents(texts=[json_data])
for doc in docs[:3]:
    print(doc)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}, {"API Key": []}]}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "include_stats", "in": "query", "required": false, "schema": {"type": "boolean", "default": false, "title": "Include Stats"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'


In [7]:
texts = json_splitter.split_text(json_data)
print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"], "summary": "Read Tracer Session", "description": "Get a specific session."}}}}
{"paths": {"/api/v1/sessions/{session_id}": {"get": {"operationId": "read_tracer_session_api_v1_sessions__session_id__get", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}, {"API Key": []}]}}}}
