#### Text Splitting from Documents: Recursive Character Text Splitters
This text splitter is the recommended one for generic text. It is parameterized by a list of characters
and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].
This keeps paragraphs (and then sentences, and then words) together as long as possible,
since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
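
To make the "split on separators in order" idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not the library's implementation: the function name `recursive_split` is hypothetical, and chunk overlap is left out entirely.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: try the coarsest separator first, recurse on oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut into fixed-size slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # Flush what we have, then split the oversized part with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
        elif not current:
            current = part
        elif len(current) + len(sep) + len(part) <= chunk_size:
            # Merge small neighbours back together using the same separator.
            current += sep + part
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits and stays whole; the second is too long, so it falls through to the space separator and is split between words.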

##### RecursiveCharacterTextSplitter

In [None]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("../data/sample.pdf")
pdf_document = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(pdf_document)
print(texts[0].page_content)
print("-----")
print(texts[1].page_content)
print("-----")

##### CharacterTextSplitter

In [17]:

from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders.text import TextLoader

loader = TextLoader("../data/sample.txt")
text_document = loader.load()
# CharacterTextSplitter splits on a single separator ("\n\n" by default),
# so a piece with no separator inside it can exceed chunk_size (see the warning in the output).
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
texts = text_splitter.split_documents(text_document)
print(texts[0].page_content)
print("-----")
print(texts[1].page_content)
print("-----")



Created a chunk of size 167, which is longer than the specified 100


Langchain is specially used for LLM powered applications.
-----
Langchain  -> Chaining  -> Focus on Sequential Execution of process
* In Langchain, everything happens in sequential order So we call it as Directed Acyclic Graph(DAG)
-----


##### How to split by HTML header
HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
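
The idea of attaching the "relevant" headers to each chunk can be sketched with the standard library alone. This is a rough illustration under simplifying assumptions, not how HTMLHeaderTextSplitter is implemented: the class name `HeaderChunker` is hypothetical, and headers are scoped by document order only, so text that follows a closed section keeps that section's headers.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Sketch: emit (metadata, text) pairs, tracking the headers seen so far."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # tag -> metadata key
        self.levels = list(self.header_names)          # tags, outermost first
        self.active = {}                               # metadata key -> header text
        self.current_tag = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in self.header_names:
            # A new header replaces any header at the same or a deeper level.
            level = self.levels.index(self.current_tag)
            for tag in self.levels[level:]:
                self.active.pop(self.header_names[tag], None)
            self.active[self.header_names[self.current_tag]] = text
        self.chunks.append((dict(self.active), text))


sample_html = "<h1>Foo</h1><p>Intro</p><h2>Bar</h2><p>About Bar</p><h2>Baz</h2><p>About Baz</p>"
chunker = HeaderChunker([("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
chunker.feed(sample_html)
for metadata, text in chunker.chunks:
    print(metadata, "->", text)
```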

In [18]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Foo'}, page_content='Foo'),
 Document(metadata={'Header 1': 'Foo'}, page_content='Some intro text about Foo.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Bar main section'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}, page_content='Some intro text about Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Bar subsection 1'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}, page_content='Some text about the first subtopic of Bar.'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Bar subsection 2'),
 Document(metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}, page_content='Some text about the second subtopic of Bar.'),
 Document(metadata={'Header 

#### How to split JSON data
This JSON splitter splits JSON data while allowing control over chunk sizes. It traverses the data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.

If a value is not nested JSON but rather a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this with a recursive text splitter on those chunks. There is an optional pre-processing step to split lists by first converting them to JSON (dict) and then splitting them as such.

- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.

Ref: https://python.langchain.com/docs/how_to/recursive_json_splitter/
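
The depth-first traversal described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the RecursiveJsonSplitter source: the names `split_json`, `_size`, and `_set_nested` are hypothetical, min_chunk_size and list handling are omitted, and size is measured as the length of the serialized JSON.

```python
import json


def _size(obj):
    """Chunk size measured as the number of characters in the serialized JSON."""
    return len(json.dumps(obj))


def _set_nested(target, path, value):
    """Write value into target at the given key path, creating dicts on the way."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data, max_chunk_size, path=(), chunks=None):
    """Sketch: walk the dict depth first and open a new chunk when a value does not fit."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = path + (key,)
        remaining = max_chunk_size - _size(chunks[-1])
        if _size({key: value}) < remaining:
            # Fits: keep the (possibly nested) value whole in the current chunk.
            _set_nested(chunks[-1], new_path, value)
        elif isinstance(value, dict) and value:
            # Too large: descend and place its children one by one.
            split_json(value, max_chunk_size, new_path, chunks)
        else:
            # A leaf that does not fit starts a new chunk; long strings are not split.
            if chunks[-1]:
                chunks.append({})
            _set_nested(chunks[-1], new_path, value)
    return chunks


sample = {"a": {"x": "1111111111", "y": "2222222222"}, "b": "3333333333"}
small_chunks = split_json(sample, max_chunk_size=30)
for chunk in small_chunks:
    print(chunk)
```

Each chunk repeats the key path down to its values, which is why the `paths` prefix reappears in every chunk of the notebook output.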

In [20]:
import requests
from langchain_text_splitters import RecursiveJsonSplitter

json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_splitter = RecursiveJsonSplitter(max_chunk_size=300)
json_chunks = json_splitter.split_json(json_data)
print(json_chunks[0])
print("-----")
print(json_chunks[1])
print("-----")
print(json_chunks[2])

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}, 'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'tags': ['tracer-sessions'], 'summary': 'Get Tracing Project Prebuilt Dashboard', 'description': 'Get a prebuilt dashboard for a tracing project.'}}}}
-----
{'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'operationId': 'get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post', 'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}}
-----
{'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'parameters': [{'name': 'session_id', 'in': 'path', 'required': True, 'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}}, {'name': 'accept', 'in': 'header', 'required': False, 'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Accept'}}]}}}}


In [21]:
# Create Document objects directly from the JSON data
docs = json_splitter.create_documents(texts=[json_data])

for doc in docs[:3]:
    print(doc)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}, "paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"tags": ["tracer-sessions"], "summary": "Get Tracing Project Prebuilt Dashboard", "description": "Get a prebuilt dashboard for a tracing project."}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"operationId": "get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post", "security": [{"API Key": []}, {"Tenant ID": []}, {"Bearer Auth": []}]}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"parameters": [{"name": "session_id", "in": "path", "required": true, "schema": {"type": "string", "format": "uuid", "title": "Session Id"}}, {"name": "accept", "in": "header", "required": false, "schema": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Accept"}}]}}}}'


#### Semantic Similarity Text Splitter
At a high level, this splits the text into sentences, then groups them into groups of 3 sentences, and then merges the ones that are similar in the embedding space.

Ref: https://python.langchain.com/docs/how_to/semantic-chunker/
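
The first two steps (sentence splitting and grouping into windows of 3) and the distance between neighbouring groups can be sketched without any model. Everything here is an assumption made for illustration: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and the sentence regex and window size are simple choices, not taken from the SemanticChunker source.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Split after ., ? or ! followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def windows(sentences, buffer_size=1):
    """Combine each sentence with its neighbours (up to 3 sentences per group)."""
    groups = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)
        hi = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[lo:hi]))
    return groups


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


demo_text = "Cats purr. Cats sleep a lot. Stocks fell today. Stocks may rise tomorrow."
sentences = split_sentences(demo_text)
groups = windows(sentences)
distances = [
    cosine_distance(toy_embed(groups[i]), toy_embed(groups[i + 1]))
    for i in range(len(groups) - 1)
]
print(distances)
```

The largest distance falls between the second and third groups, where the topic shifts from cats to stocks; that is the kind of position a semantic splitter would choose as a breakpoint.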

In [None]:
!pip3 install langchain_experimental

In [26]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

# Local sentence-transformers model, so no API key is needed
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

text_splitter = SemanticChunker(embeddings)
loader = PyPDFLoader("../data/sample.pdf")
pdf_document = loader.load()

# Concatenate all pages into one string so chunks can span page boundaries
doc_data = " "
for doc in pdf_document:
    doc_data += doc.page_content
texts = text_splitter.create_documents([doc_data])
texts

[Document(metadata={}, page_content=" AAPL Stock Analysis - Initial Findings:\nAverage analyst rating: Buy\n12-month average price target: $237.36\nPotential upside from current price (approx. $196.58): +20.75%\nAnalyst Opinions and Sentiment (from Yahoo Finance):\nStrong Buy, Buy, Hold, Underperform, Sell ratings are present. Earnings Per Share (EPS) estimates for current and next quarters/years are\navailable. Revenue estimates for current and next quarters/years are available. Sales Growth (year/est) is positive for current and next years. Several top analysts have 'Outperform' or 'Strong Buy' ratings with price targets\nranging from $230 to $300. Recent Company Developments and Financial Performance (from Apple's Q2 2025 \nearnings report):\nQ2 2025 revenue: $95.4 billion, up 5% year over year. Q2 2025 diluted earnings per share: $1.65, up 8% year over year. Double-digit growth in Services."),
 Document(metadata={}, page_content='Introduced iPhone 16e, new Macs, and iPads. Cut carb

The default way to split is based on percentile. In this method, all differences between consecutive sentences are calculated,
and a split is made wherever a difference is greater than the Xth percentile.
The default value for X is 95.0; it can be adjusted with the keyword argument breakpoint_threshold_amount, which expects a number between 0.0 and 100.0.
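
As a rough illustration of the percentile rule, the sketch below takes a list of precomputed distances between consecutive sentences and splits wherever a distance exceeds the chosen percentile. The helper names `percentile` and `chunk_by_percentile` and the sample distances are hypothetical; the percentile uses plain linear interpolation, which is an assumption about how the cutoff is computed.

```python
def percentile(values, q):
    """Percentile with linear interpolation; q is between 0.0 and 100.0."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def chunk_by_percentile(sentences, distances, threshold_amount=95.0):
    """distances[i] is the distance between sentences[i] and sentences[i + 1]."""
    cutoff = percentile(distances, threshold_amount)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > cutoff:
            # Break after sentence i: everything since the last break is one chunk.
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


demo_sentences = ["A1.", "A2.", "B1.", "B2.", "C1."]
demo_distances = [0.1, 0.8, 0.2, 0.3]

print(chunk_by_percentile(demo_sentences, demo_distances, threshold_amount=95.0))
print(chunk_by_percentile(demo_sentences, demo_distances, threshold_amount=50.0))
```

Lowering the threshold produces more, smaller chunks, because more of the distances count as breakpoints.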

In [37]:
text_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    min_chunk_size=100,
)

docs = text_splitter.create_documents([doc_data])
print(docs[0].page_content)
print("-----")
print(len(docs))
print("-----")

# Run the splitter again over the semantic chunks; each chunk is re-split independently
texts = text_splitter.split_documents(docs)
print(texts[0].page_content)
print("-----")
print(texts[1].page_content)
print("-----")

 AAPL Stock Analysis - Initial Findings:
Average analyst rating: Buy
12-month average price target: $237.36
Potential upside from current price (approx. $196.58): +20.75%
Analyst Opinions and Sentiment (from Yahoo Finance):
Strong Buy, Buy, Hold, Underperform, Sell ratings are present. Earnings Per Share (EPS) estimates for current and next quarters/years are
available. Revenue estimates for current and next quarters/years are available. Sales Growth (year/est) is positive for current and next years. Several top analysts have 'Outperform' or 'Strong Buy' ratings with price targets
ranging from $230 to $300. Recent Company Developments and Financial Performance (from Apple's Q2 2025 
earnings report):
Q2 2025 revenue: $95.4 billion, up 5% year over year. Q2 2025 diluted earnings per share: $1.65, up 8% year over year. Double-digit growth in Services.
-----
2
-----
 AAPL Stock Analysis - Initial Findings:
Average analyst rating: Buy
12-month average price target: $237.36
Potential upside