## Text splitters
Splitting a long document into smaller chunks that can fit into your model's context window is a necessary task. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

#### Levels Of Text Splitting
(Taken from: https://github.com/FullStackRetrieval-com)

<b>Level 1: Character Splitting</b> - Simple static character chunks of data

<b>Level 2: Recursive Character Text Splitting</b> - Recursive chunking based on a list of separators

<b>Level 3: Document Specific Splitting</b> - Various chunking methods for different document types (PDF, Python, Markdown)

<b>Level 4: Semantic Splitting</b> - Embedding-walk-based chunking

<b>Level 5: Agentic Splitting</b> - Experimental method of splitting text with an agent-like system

<b>Level 6: Alternative Representation Chunking + Indexing</b> - Derivative representations of your raw text that will aid in retrieval and indexing

In [3]:
import os
# Disable pip version check
os.environ['PIP_DISABLE_PIP_VERSION_CHECK'] = '1'
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install -qU langchain-text-splitters

##### Character Splitting

In [7]:
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
text = "Data/Api.txt"
with open(text) as f:
    api_text = f.read()
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
    strip_whitespace=False
)
texts = text_splitter.create_documents([api_text])
print(texts[0])

page_content='Request Structure: Test the correct formation and syntax of API requests, including headers, query parameters, and request bodies.\n\nResponse Validation: Verify the accuracy and completeness of the API responses. Validate the response status codes, headers, and payload.\n\nData Format and Encoding: Ensure that the API handles data formats correctly, such as JSON, XML, or others. Validate that the encoding and decoding processes work as expected.\n\nError Handling: Test how the API handles error scenarios and responds with appropriate error codes, messages, and error structures. Check if error conditions are handled gracefully.\n\nAuthentication and Authorization: Validate the authentication mechanisms provided by the API, such as API keys, tokens, or OAuth. Test different authentication scenarios and authorization levels to ensure access control is properly enforced.'
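To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of the simplest form of Level 1: fixed windows of characters that share a few characters with the previous window. The helper name `split_fixed` is hypothetical. Note that `CharacterTextSplitter` above first splits on `separator` and then merges the pieces up to `chunk_size`, so this is an illustration of the idea rather than the library's implementation.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = split_fixed("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap repeats the tail of one chunk at the head of the next, which helps when a sentence straddles a chunk boundary.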


##### Recursive Character Text Splitting
The problem with Level 1 is that it ignores the structure of the document entirely. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we specify a series of separators, tried in order, which are used to split our docs.

The default separators in LangChain are listed below. Let's take a look at them one by one.

"\n\n" - Double new line, or most commonly paragraph breaks

"\n" - New lines

" " - Spaces

"" - Characters

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([api_text])
print(texts[0])
print(texts[1])

page_content='Request Structure: Test the correct formation and syntax of API requests, including headers, query'
page_content='headers, query parameters, and request bodies.'
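The separator cascade described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the hypothetical `recursive_split` helper drops the separators, never merges small pieces back together, and ignores overlap, all of which the real splitter handles.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator, recursing with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        # Pieces that are still too long are retried with the next separator
        chunks.extend(recursive_split(piece, chunk_size, finer))
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\nIt has two lines."
pieces = recursive_split(sample, chunk_size=40)
print(pieces)
# ['First paragraph.', 'Second paragraph is a bit longer.', 'It has two lines.']
```

The first paragraph fits after the `"\n\n"` split, while the second one is too long and falls through to the `"\n"` separator.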


##### Split by HTML header

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter


fname = "Data/mdguide.html"

# Read the HTML source; the context manager closes the file afterwards
with open(fname, "r", encoding="utf-8") as html_file:
    source_code = html_file.read()

headers_to_split_on = [
    #("h1", "Header 1"),
    #("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(source_code)
html_header_splits
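To show the idea behind header-based splitting, here is a rough standard-library sketch: text is collected until a tracked header appears, and each section is stored together with the headers it sits under. The class `HeaderSplitter` and the function `split_html_by_headers` are hypothetical names for this illustration and do not reflect how `HTMLHeaderTextSplitter` is implemented.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect (text, metadata) sections, starting a new one at each tracked header."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h2": "Header 2"}
        self.sections = []
        self.metadata = {}
        self.buffer = []
        self.header_text = []
        self.in_header = None

    def flush(self):
        # Store the buffered body text with a snapshot of the current headers
        text = " ".join("".join(self.buffer).split())
        if text:
            self.sections.append((text, dict(self.metadata)))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush()
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.metadata[self.header_names[tag]] = "".join(self.header_text).strip()
            # A new header invalidates any deeper headers seen before it
            for other, name in self.header_names.items():
                if other > tag:
                    self.metadata.pop(name, None)
            self.in_header = None

    def handle_data(self, data):
        (self.header_text if self.in_header else self.buffer).append(data)


def split_html_by_headers(html, headers_to_split_on):
    parser = HeaderSplitter(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.sections


html = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install it.</p><h2>Usage</h2><p>Run it.</p>"
sections = split_html_by_headers(html, [("h1", "Header 1"), ("h2", "Header 2")])
for text, meta in sections:
    print(text, meta)
```

Each section carries its header path as metadata, which is what makes these chunks useful for filtering at retrieval time.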

In [None]:
url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

HTMLHeaderTextSplitter, which splits based on HTML headers, can be composed with another splitter which constrains splits based on character lengths, such as RecursiveCharacterTextSplitter.

This can be done using the .split_documents method of the second splitter:

In [17]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
 Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 Th

##### Splitting code

In [18]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

###### Python splitters:

    \nclass - Classes first
    \ndef - Functions next
    \n\tdef - Indented functions
    \n\n - Double New lines
    \n - New Lines
    " " - Spaces
    "" - Characters

In [23]:
from langchain_text_splitters import PythonCodeTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
python_splitter.create_documents([PYTHON_CODE])


[Document(page_content='def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()')]

##### Markdown

In [28]:
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain


```python
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs

[Document(page_content='# 🦜️🔗 LangChain'),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='## Quick Install\n\n```bash'),
 Document(page_content="# Hopefully this code block isn't split"),
 Document(page_content='pip install langchain'),
 Document(page_content='```python\ndef hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]

In [29]:
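from langchain_text_splitters import PythonCodeTextSplitter

# This cell applies the Python splitter to the Markdown sample, so the text is
# cut on Python separators (note the break right before "def" in the output).
# Swap in MarkdownTextSplitter to split on Markdown separators instead.
md_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
md_splitter.create_documents([markdown_text])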

[Document(page_content='# 🦜️🔗 LangChain\n\n⚡ Building applications with LLMs through composability ⚡\n\n## Quick Install'),
 Document(page_content="```bash\n# Hopefully this code block isn't split\npip install langchain\n\n\n```python"),
 Document(page_content='def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()')]

##### Splitting PDF

In [36]:
# ! pip install unstructured_pytesseract -q
! pip install pdf2image pypdfium2 -q
## Install poppler


In [2]:
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json



In [None]:
filename = "Data/output_parser.pdf"

# Extracts the elements from the PDF
elements = partition_pdf(
    filename=filename,

    # Unstructured Helpers
    strategy="hi_res", 
    infer_table_structure=True, 
    model_name="yolox"
)
elements
elements[-4].metadata.text_as_html

##### Splitting text based on semantic similarity

In [7]:
from dotenv import load_dotenv, dotenv_values
import google.generativeai as genai
from IPython.display import Markdown, display
import os

# Load GOOGLE_API_KEY from a local .env file and configure the Gemini client
load_dotenv()
my_api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=my_api_key)

In [11]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_google_genai import GoogleGenerativeAIEmbeddings


text = "Data/state_of_the_union.txt"
# Read as UTF-8 so that curly apostrophes are decoded correctly
with open(text, encoding="utf-8") as f:
    api_text = f.read()
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

text_splitter = SemanticChunker(embeddings)
docs = text_splitter.create_documents([api_text])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students

###### Breakpoints
This chunker works by deciding where to "break" between sentences. It compares the embeddings of consecutive sentences, and when the difference exceeds some threshold, the text is split at that point. There are different kinds of breakpoints:
1. Percentile
The default way to split is based on percentile. In this method, all differences between sentences are calculated, and the text is split wherever a difference is greater than the Xth percentile.

In [14]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="percentile"
)

In [16]:
docs = text_splitter.create_documents([api_text])
len(docs)

26
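The percentile rule can be sketched without any embeddings, assuming we already have a list of distances where `distances[i]` is the difference between sentence `i` and sentence `i + 1`. The helper names and the default of 95 are illustrative assumptions, not a copy of the `SemanticChunker` code.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def percentile_breakpoints(distances, pct=95):
    """Return the indices whose distance exceeds the pct-th percentile."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy distances: two clear jumps, after sentence 2 and after sentence 5
distances = [0.10, 0.12, 0.55, 0.08, 0.11, 0.60, 0.09]
percentile_result = percentile_breakpoints(distances, pct=70)
print(percentile_result)
# [2, 5]
```

A lower percentile produces more breakpoints and therefore more, smaller chunks.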

2. Standard Deviation

In this method, the text is split at any difference that is more than X standard deviations above the mean.

In [17]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="standard_deviation"
)
docs = text_splitter.create_documents([api_text])
len(docs)

10
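As a rough sketch of the standard deviation rule, assume the threshold is the mean distance plus `k` standard deviations. The helper name `stddev_breakpoints`, the parameter `k`, and the toy distances are hypothetical and only meant to show the arithmetic.

```python
from statistics import mean, pstdev


def stddev_breakpoints(distances, k=3.0):
    """Return the indices whose distance exceeds mean + k * standard deviation."""
    threshold = mean(distances) + k * pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy distances: two clear jumps, after sentence 2 and after sentence 5
distances = [0.10, 0.12, 0.55, 0.08, 0.11, 0.60, 0.09]
loose_result = stddev_breakpoints(distances, k=1.0)
strict_result = stddev_breakpoints(distances, k=3.0)
print(loose_result)   # [2, 5]
print(strict_result)  # []
```

A large `k` makes the rule conservative, which is consistent with this method producing fewer chunks than the percentile method on the same text.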

3. Interquartile

In this method, the interquartile distance is used to split chunks.

In [18]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="interquartile"
)
docs = text_splitter.create_documents([api_text])
len(docs)

29
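The interquartile rule can be sketched in the same way. The threshold used here, mean distance plus `scale` times the interquartile range, is an assumption made for illustration, and `iqr_breakpoints` is a hypothetical helper name.

```python
from statistics import mean, quantiles


def iqr_breakpoints(distances, scale=1.5):
    """Return the indices whose distance exceeds mean + scale * (Q3 - Q1)."""
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")
    threshold = mean(distances) + scale * (q3 - q1)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy distances: two clear jumps, after sentence 2 and after sentence 5
distances = [0.10, 0.12, 0.55, 0.08, 0.11, 0.60, 0.09]
iqr_default = iqr_breakpoints(distances)
iqr_loose = iqr_breakpoints(distances, scale=1.0)
print(iqr_default)  # [5]
print(iqr_loose)    # [2, 5]
```

Because the interquartile range ignores the extreme values, a few very large jumps do not inflate the threshold the way they inflate a standard deviation.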

4. Gradient

In this method, the gradient of the distances is used together with the percentile method to split chunks. It is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.

In [20]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="gradient"
)
docs = text_splitter.create_documents([api_text])
print(len(docs))
print(docs[0].page_content)
print(docs[1].page_content)
print(docs[10].page_content)

26
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Smartphones. The Internet. Technology we have yet to invent. But that’s just the beginning. Intel’s CEO, Pat Gelsinger, who is here tonight, told me they are ready to increase their investment from  
$20 billion to $100 billion. That would be one of the biggest investments in manufacturing in American history. And all they’re waiting for is for you to pass this bill. So let’s not wait any longer. Send it to my desk. I’ll sign it. And we will really take off. And Intel is not alone.
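The gradient method described above can be sketched as a percentile rule applied to the slope of the distance series instead of the distances themselves. This is a simplified illustration with hypothetical helper names: the gradient uses central differences in the interior and one-sided differences at the ends, and the percentile value is an arbitrary choice for the toy data.

```python
def gradient(values):
    """Slope of a sequence: one-sided at the ends, central in the interior."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    inner = [(values[i + 1] - values[i - 1]) / 2 for i in range(1, len(values) - 1)]
    return [values[1] - values[0]] + inner + [values[-1] - values[-2]]


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def gradient_breakpoints(distances, pct=95):
    """Return the indices where the slope of the distances exceeds the pct-th percentile."""
    slopes = gradient(distances)
    threshold = percentile(slopes, pct)
    return [i for i, s in enumerate(slopes) if s > threshold]


# Toy distances: two clear jumps, at positions 2 and 5
distances = [0.10, 0.12, 0.55, 0.08, 0.11, 0.60, 0.09]
gradient_result = gradient_breakpoints(distances, pct=70)
print(gradient_result)
# [1, 4]
```

On this toy data the rule flags the positions where the distance starts rising sharply, one step before each peak, since it reacts to the change in distance and not to its absolute level.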
