### Practicing Text Splitting Techniques in LangChain

In [1]:
## Reading pdf files
from langchain_community.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader("Attention_Paper.pdf")
pdf_documents = pdf_loader.load()

## Recursive Text Splitter

In [6]:
## splitting the pdf documents
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)
final_docs = text_splitter.split_documents(pdf_documents)
# split_documents takes the Document objects returned by a loader and keeps their metadata.
# create_documents takes raw strings instead, for example:
# final_docs = text_splitter.create_documents([doc.page_content for doc in pdf_documents])
# Either way, final_docs is a list of Document objects.
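
To make the idea behind the recursive splitter concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap and `length_function` are left out, and the separator order (`"\n\n"`, `"\n"`, `" "`, `""`) is assumed to mirror the splitter's documented defaults. The point is only to show the fallback from coarse separators to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified model of recursive splitting (no overlap, hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    if separator not in text:
        # This separator does not occur, so try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Greedily merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
chunks = recursive_split(sample, chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within 20 characters and stays whole, while the second is too long and is split again on spaces. Text with no separators at all falls through to the hard cut.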




In [7]:
print(final_docs[0])
print(final_docs[1])
print(type(final_docs[2]))

page_content='Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗' metadata={'producer': 'PyPDF2', 'creator': 'PyPDF', 'creationdate': '', 'subject': 'Neural Information Processing Systems http://nips.cc/', 'publisher': 'Curran Associates, Inc.', 'language': 'en-US', 'created': '2017', 'eventtype': 'Poster', 'description-abstract': 'The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms.  We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEngli

In [8]:
## Loading text file
with open("speech.txt", "r") as file:
    speech = file.read()

from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
speech_docs = loader.load()

## Character Text Splitter

In [None]:
## Character text splitter applied to the Documents loaded by TextLoader
from langchain_text_splitters import CharacterTextSplitter
character_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len
)
splitted_docs = character_splitter.split_documents(speech_docs)
splitted_docs


Created a chunk of size 349, which is longer than the specified 100
Created a chunk of size 309, which is longer than the specified 100
Created a chunk of size 379, which is longer than the specified 100
Created a chunk of size 294, which is longer than the specified 100
Created a chunk of size 279, which is longer than the specified 100
Created a chunk of size 336, which is longer than the specified 100
Created a chunk of size 203, which is longer than the specified 100
Created a chunk of size 312, which is longer than the specified 100
Created a chunk of size 292, which is longer than the specified 100
Created a chunk of size 312, which is longer than the specified 100
Created a chunk of size 248, which is longer than the specified 100
Created a chunk of size 198, which is longer than the specified 100
Created a chunk of size 335, which is longer than the specified 100
Created a chunk of size 321, which is longer than the specified 100
Created a chunk of size 302, which is longer tha

[Document(metadata={'source': 'speech.txt'}, page_content='DEMOCRACY: THE CORNERSTONE OF HUMAN PROGRESS AND FREEDOM'),
 Document(metadata={'source': 'speech.txt'}, page_content='Honorable guests, distinguished colleagues, and fellow citizens,\nINTRODUCTION'),
 Document(metadata={'source': 'speech.txt'}, page_content="Today, I stand before you to speak about one of humanity's greatest achievements: democracy. More than just a political system, democracy represents the collective aspiration of billions of people for freedom, equality, and self-determination. It is the foundation upon which modern civilization has built its most cherished values and institutions."),
 Document(metadata={'source': 'speech.txt'}, page_content='Democracy, derived from the Greek words "demos" meaning people and "kratos" meaning power, literally translates to "power of the people." This simple yet profound concept has shaped the course of human history, sparked revolutions, and continues to inspire movements fo

In [11]:
## Character text splitter using raw text
speech_text = character_splitter.create_documents([speech])
speech_text[0]

Created a chunk of size 349, which is longer than the specified 100
Created a chunk of size 309, which is longer than the specified 100
Created a chunk of size 379, which is longer than the specified 100
Created a chunk of size 294, which is longer than the specified 100
Created a chunk of size 279, which is longer than the specified 100
Created a chunk of size 336, which is longer than the specified 100
Created a chunk of size 203, which is longer than the specified 100
Created a chunk of size 312, which is longer than the specified 100
Created a chunk of size 292, which is longer than the specified 100
Created a chunk of size 312, which is longer than the specified 100
Created a chunk of size 248, which is longer than the specified 100
Created a chunk of size 198, which is longer than the specified 100
Created a chunk of size 335, which is longer than the specified 100
Created a chunk of size 321, which is longer than the specified 100
Created a chunk of size 302, which is longer tha

Document(metadata={}, page_content='DEMOCRACY: THE CORNERSTONE OF HUMAN PROGRESS AND FREEDOM')
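
The "Created a chunk of size ..." warnings above appear because the character splitter only breaks text at the one separator it was given (`"\n"` here). A line longer than `chunk_size` has no other place to be cut, so it is kept whole. The sketch below models that behaviour in plain Python. It is an assumption-laden illustration, not the library's code: `split_on_separator` is a hypothetical helper, overlap is ignored, and the warning text is copied from the output above.

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(message)s")
logger = logging.getLogger("single_separator_sketch")


def split_on_separator(text, separator="\n", chunk_size=100):
    """Split on one separator only, merging short neighbours (hypothetical helper)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # No finer separator is available, so the piece stays oversized.
            logger.warning(
                "Created a chunk of size %d, which is longer than the specified %d",
                len(piece), chunk_size,
            )
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "TITLE\n" + "x" * 150 + "\nshort line\nanother short line"
chunks = split_on_separator(text, separator="\n", chunk_size=100)
print([len(c) for c in chunks])
```

Short neighbouring lines are merged into one chunk, just as 'Honorable guests, ...\nINTRODUCTION' was above, while the 150-character line passes through unchanged. When a hard size limit matters, the recursive splitter is usually the better fit because it can fall back to finer separators.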

## HTML Header Text Splitter

HTMLHeaderTextSplitter is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.

It is analogous to the MarkdownHeaderTextSplitter for Markdown files.
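
The header-tracking idea can be sketched with the standard library alone. The code below is a simplified model, not how LangChain implements it: `HeaderSectionParser` and `combine_by_metadata` are hypothetical names, only `<p>` elements are treated as content, and header text is recorded as metadata without also being emitted as its own chunk. It shows both modes described above: element by element, then combined by identical metadata.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Collect <p> text together with the headers currently in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}     # header tag -> header text currently in scope
        self.sections = []   # (metadata, text) pairs, one per <p> element
        self._tag = None     # tag whose text is being buffered
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names or tag == "p":
            self._tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag in self.header_names:
            # A new header replaces its own level and drops deeper levels.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = text
        elif text:
            metadata = {self.header_names[t]: v for t, v in self.active.items()}
            self.sections.append((metadata, text))
        self._tag = None


def combine_by_metadata(sections):
    """Merge consecutive sections that share exactly the same metadata."""
    combined = []
    for metadata, text in sections:
        if combined and combined[-1][0] == metadata:
            combined[-1] = (metadata, combined[-1][1] + "\n" + text)
        else:
            combined.append((metadata, text))
    return combined


html = (
    "<h1>Intro</h1><p>First.</p><p>Second.</p>"
    "<h2>Details</h2><p>Third.</p>"
    "<h1>Next</h1><p>Fourth.</p>"
)
parser = HeaderSectionParser([("h1", "Header 1"), ("h2", "Header 2")])
parser.feed(html)
combined = combine_by_metadata(parser.sections)
for metadata, text in combined:
    print(metadata, repr(text))
```

Note how the second `<h1>` clears the earlier `Header 2` entry, so a chunk never inherits a subheading from a previous section.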

In [14]:
sample_html = """<html>
<head>
<title>Sample HTML</title>
</head>

<body>
<h1>Welcome to LangChain</h1>
<p>This is a sample HTML document.</p>
<p>It contains multiple paragraphs and headings.</p>
<h2>Subheading</h2>
<p>Here is another paragraph under a subheading.</p>
<p>LangChain is a framework for building applications with LLMs.</p>
<p>It supports various document loaders and text splitters.</p>
<p>Enjoy exploring LangChain!</p>
</body>
</html>"""

from langchain_text_splitters import HTMLHeaderTextSplitter
headers_to_split = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split
)
html_header_split = html_splitter.split_text(sample_html)
html_header_split

[Document(metadata={'Header 1': 'Welcome to LangChain'}, page_content='Welcome to LangChain'),
 Document(metadata={'Header 1': 'Welcome to LangChain'}, page_content='This is a sample HTML document.  \nIt contains multiple paragraphs and headings.'),
 Document(metadata={'Header 1': 'Welcome to LangChain', 'Header 2': 'Subheading'}, page_content='Subheading'),
 Document(metadata={'Header 1': 'Welcome to LangChain', 'Header 2': 'Subheading'}, page_content='Here is another paragraph under a subheading.  \nLangChain is a framework for building applications with LLMs.  \nIt supports various document loaders and text splitters.  \nEnjoy exploring LangChain!')]

In [15]:
## splitting from url
url = "https://lilianweng.github.io/posts/2023-06-23-agent/"

split_from_url = html_splitter.split_text_from_url(url)
for i, doc in enumerate(split_from_url):
    print(f"Document {i+1}:\n{doc}\n")

Document 1:
page_content='if (localStorage.getItem("pref-theme") === "dark") {
        document.body.classList.add('dark');
    } else if (localStorage.getItem("pref-theme") === "light") {
        document.body.classList.remove('dark')
    } else if (window.matchMedia('(prefers-color-scheme: dark)').matches) {
        document.body.classList.add('dark');
    }  
MathJax = {
    tex: {
      inlineMath: [['$', '$'], ['\\(', '\\)']],
      displayMath: [['$$','$$'], ['\\[', '\\]']],
      processEscapes: true,
      processEnvironments: true
    },
    options: {
      skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
  };

  window.addEventListener('load', (event) => {
      document.querySelectorAll("mjx-container").forEach(function(x){
        x.parentElement.classList += 'has-jax'})
    });  
Lil'Log  
|  
Posts  
Archive  
Search  
Tags  
FAQ'

Document 2:
page_content='LLM Powered Autonomous Agents' metadata={'Header 1': 'LLM Powered Autonomous Agents'}

Docume

## JSON Recursive Splitter

#### class langchain_text_splitters.json.RecursiveJsonSplitter(max_chunk_size: int = 2000, min_chunk_size: int | None = None)
Splits JSON data into smaller, structured chunks while preserving hierarchy.

This class provides methods to split JSON data into smaller dictionaries or JSON-formatted strings based on configurable maximum and minimum chunk sizes. It supports nested JSON structures, optionally converts lists into dictionaries for better chunking, and allows the creation of document objects for further use.
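
As a rough illustration of what "preserving hierarchy" means, the sketch below walks a nested dictionary and packs leaf values into chunks, recreating the full key path inside each chunk. It is a simplified model under stated assumptions, not the library's algorithm: the helper names (`split_nested`, `set_nested`, `deep_merge`) are hypothetical, size is measured as the length of `json.dumps` output, lists are treated as indivisible leaves, and there is no `min_chunk_size` handling.

```python
import copy
import json


def json_size(data):
    """Size of a value, measured as the length of its JSON text."""
    return len(json.dumps(data))


def set_nested(target, path, value):
    """Write value into target at the given key path, creating parents."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_nested(data, max_chunk_size, path=(), chunks=None):
    """Pack leaves into dict chunks of at most max_chunk_size where possible."""
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = path + (key,)
        if isinstance(value, dict) and value:
            split_nested(value, max_chunk_size, new_path, chunks)
            continue
        trial = copy.deepcopy(chunks[-1])
        set_nested(trial, new_path, value)
        if chunks[-1] and json_size(trial) > max_chunk_size:
            # The current chunk is full: start a new one with the same key path.
            chunks.append({})
            set_nested(chunks[-1], new_path, value)
        else:
            chunks[-1] = trial
    return chunks


def deep_merge(target, source):
    """Merge source into target recursively (used to check nothing was lost)."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            target[key] = value
    return target


data = {
    "openapi": "3.1.0",
    "info": {"title": "LangSmith", "version": "0.1.0"},
    "paths": {
        "/a": {"get": {"summary": "A"}},
        "/b": {"get": {"summary": "B"}},
    },
}
chunks = split_nested(data, max_chunk_size=80)
rebuilt = {}
for chunk in chunks:
    deep_merge(rebuilt, copy.deepcopy(chunk))
for chunk in chunks:
    print(json_size(chunk), json.dumps(chunk))
```

Because every chunk repeats the path from the root down to its values, each one is valid JSON on its own, and merging the chunks back together reproduces the original structure.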

In [16]:
import json
import requests
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()
json_data
 

{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'paths': {'/api/v1/sessions/{session_id}/dashboard': {'post': {'tags': ['tracer-sessions'],
    'summary': 'Get Tracing Project Prebuilt Dashboard',
    'description': 'Get a prebuilt dashboard for a tracing project.',
    'operationId': 'get_tracing_project_prebuilt_dashboard_api_v1_sessions__session_id__dashboard_post',
    'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}],
    'parameters': [{'name': 'session_id',
      'in': 'path',
      'required': True,
      'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
     {'name': 'accept',
      'in': 'header',
      'required': False,
      'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
       'title': 'Accept'}}],
    'requestBody': {'required': True,
     'content': {'application/json': {'schema': {'$ref': '#/components/schemas/CustomChartsSectionRequest'}}}},
    'responses': {'200': {'description': 'Succ

In [17]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter = RecursiveJsonSplitter(
    max_chunk_size=200,
    min_chunk_size=None
)
json_chunks = json_splitter.split_json(json_data)
json_chunks[0]

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}}

In [19]:
## Make Docs using splitter
docs = json_splitter.create_documents(texts=[json_data])
docs[0]

Document(metadata={}, page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}}')

In [20]:
texts = json_splitter.split_text(json_data)
print(texts[0])
print(texts[1])

{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}}
{"paths": {"/api/v1/sessions/{session_id}/dashboard": {"post": {"tags": ["tracer-sessions"], "summary": "Get Tracing Project Prebuilt Dashboard"}}}}
