# **Document Splitting**

In [1]:
! pip install langchain


We are going to compare two splitters:
- `RecursiveCharacterTextSplitter`
- `CharacterTextSplitter`

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

In [3]:
# CharacterTextSplitter
c_splitter = CharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap
)

# RecursiveCharacterTextSplitter
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap
)

Now, let's examine the splitters on our simple text:

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

print(len(text1))
print(len(text2))

26
33


In [8]:
# Try RecursiveCharacterTextSplitter on the first text:
print(r_splitter.split_text(text1))

# Try RecursiveCharacterTextSplitter on the second text:
print(r_splitter.split_text(text2))

['abcdefghijklmnopqrstuvwxyz']
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
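
The second result shows the overlap at work: `text2` contains no separators, so the splitter can only cut between characters, and the second chunk starts with the last 4 characters of the first (`wxyz`). Below is a minimal sketch of that windowing, assuming plain character slicing. `char_windows` is a hypothetical helper written for illustration, not part of LangChain.

```python
def char_windows(text, chunk_size, chunk_overlap):
    """Slice text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


print(char_windows("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(char_windows("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

With `chunk_size = 26` and `chunk_overlap = 4`, each new window advances by 22 characters, which is why the second chunk begins at `w`.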


Let's examine more complex text:

In [13]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

print(f"RecursiveCharacterTextSplitter: {r_splitter.split_text(text3)}")
print(f"CharacterTextSplitter: {c_splitter.split_text(text3)}")

RecursiveCharacterTextSplitter: ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
CharacterTextSplitter: ['a b c d e f g h i j k l m n o p q r s t u v w x y z']


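`CharacterTextSplitter` splits on a single separator, which defaults to `"\n\n"`. Since `text3` contains no blank lines, it comes back as one chunk. Setting the separator to a space changes that: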

In [14]:
c_splitter = CharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    separator = ' '
)

c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
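
The chunks above can be understood as a two-step process: split the text on the separator, then greedily merge the pieces back together until the next piece would push the chunk past `chunk_size`, carrying a short tail over as overlap. The sketch below is a simplified illustration of that idea, not the library's actual implementation. `merge_pieces` is a hypothetical name.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks = []
    current = []  # pieces in the chunk being built
    total = 0     # length of separator.join(current)
    for piece in pieces:
        extra = len(separator) if current else 0
        if current and total + extra + len(piece) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remaining tail fits in the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                total > chunk_overlap
                or total + len(separator) + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                total -= len(removed) + (len(separator) if current else 0)
        current.append(piece)
        total += len(piece) + (len(separator) if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_pieces(text3.split(" "), " ", chunk_size=26, chunk_overlap=4))
```

Thirteen letters plus twelve spaces make 25 characters, so a fourteenth letter would not fit in 26. The tail kept as overlap is `l m` (3 characters), the longest tail that stays within the 4-character overlap.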

## **Recursive splitting details**

`RecursiveCharacterTextSplitter` is recommended for generic text.

In [16]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

print(len(some_text))

496


Now, let's split `some_text` with both splitters:

In [18]:
## CharacterTextSplitter:
c_splitter = CharacterTextSplitter(
    chunk_size = 450,
    chunk_overlap = 0,
    separator = ' '
)


## RecursiveCharacterTextSplitter:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 450,
    chunk_overlap = 0,
    separators = ["\n\n", "\n", " ", ""]
)

In [31]:
chunks = c_splitter.split_text(some_text)

print(f"Number Of Chunks: {len(chunks)}")

for chunk in chunks:
    print(chunk)
    print("-----------------")

Number Of Chunks: 2
When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. 

 Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,
-----------------
have a space.and words are separated by space.
-----------------


In [30]:
chunks = r_splitter.split_text(some_text)

print(f"Number Of Chunks: {len(chunks)}")

for chunk in chunks:
    print(chunk)
    print("-----------------")

Number Of Chunks: 2
When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.
-----------------
Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.
-----------------
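
The two outputs differ because the recursive splitter tries its separators in order: it first cuts at `"\n\n"` (the paragraph break), and only falls back to finer separators for pieces that are still too long. The character splitter only knows about spaces, so it cuts mid-sentence. A simplified sketch of the recursive idea follows, assuming no overlap and no whitespace stripping. `recursive_split` is a hypothetical function, not the LangChain code.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator found; recurse on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the first separator present in the text; "" means single characters.
    sep, remaining = "", []
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, remaining = candidate, separators[i + 1:]
            break
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        joined = piece if not current else current + sep + piece
        if len(joined) <= chunk_size:
            current = joined  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: split it with the finer separators.
            current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, ["\n\n", " ", ""], chunk_size=7))
```

In the sample, the first paragraph fits in one chunk and is kept whole, while the second is too long and gets split again on spaces.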


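Let's reduce the chunk size and add a sentence-boundary pattern (a period followed by a space) to our separators:

In [37]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap = 0,
    # Raw string avoids the invalid "\." escape. Note: separators appear to be
    # matched as literal text unless is_separator_regex=True is passed, so this
    # lookbehind likely never matches here and the splits below fall on spaces.
    separators = ["\n\n", "\n", r"(?<=\. )", " ", ""]
)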

chunks = r_splitter.split_text(some_text)

# Output the chunks:
for chunk in chunks:
    print(chunk)
    print("-------------")

When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,
-------------
closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.
-------------
Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this
-------------
string. Sentences have a period at the end, but also, have a space.and words are separated by space.
-------------
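
The separator `(?<=\. )` is a regular-expression lookbehind: it matches the empty position right after a period and a space, so a split there keeps the period attached to its sentence. The short sketch below uses only Python's standard `re` module (splitting on empty matches needs Python 3.7 or later) and contrasts the pattern used as a regex with the same pattern escaped into literal text.

```python
import re

text = "One. Two. Three."
pattern = r"(?<=\. )"  # zero-width match right after "period + space"

# Used as a regex, the pattern cuts after each sentence and keeps the period.
print(re.split(pattern, text))

# Escaped, it is searched for as literal characters and never matches here.
print(re.split(re.escape(pattern), text))
```

If a splitter escapes its separators before matching, only the second behavior applies, and the text ends up being cut by the next separator in the list.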


## **PDFs**

So far, we have seen how to split plain text into smaller chunks. Next, we turn to another type of data: PDF documents.

In [40]:
! pip install pypdf langchain_community


In [41]:
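# PyPDFLoader lives in langchain_community (installed above); the older
# langchain.document_loaders import path is deprecated.
from langchain_community.document_loaders import PyPDFLoader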

loader = PyPDFLoader('MachineLearning-Lecture01.pdf')
pages = loader.load()

In [44]:
print(f"Number of pages: {len(pages)}")

Number of pages: 22


Next, we create a simple `CharacterTextSplitter` that splits on newlines:

In [45]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap= 150,
    length_function = len,
    separator = '\n'
)

In [47]:
chunks = text_splitter.split_documents(pages)

print(f"Number of chunks: {len(chunks)}")

Number of chunks: 77


We started with $22$ pages and ended up with $77$ chunks, because each page is split separately into pieces of about 1000 characters.
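
The page-to-chunk relationship can be sketched as follows: every page is split on its own, and each resulting chunk keeps a copy of the metadata of the page it came from. This is a simplified illustration under stated assumptions. `Page` is a hypothetical stand-in for a loaded document, and `split_pages` and `fixed_width` are names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_pages(pages, split_text):
    """Split each page independently; every chunk inherits its page's metadata."""
    chunks = []
    for page in pages:
        for piece in split_text(page.page_content):
            chunks.append(Page(piece, dict(page.metadata)))
    return chunks


def fixed_width(text, width=3):
    """Toy splitter used only for this demonstration."""
    return [text[i:i + width] for i in range(0, len(text), width)]


pages = [Page("abcdefgh", {"page": 0}), Page("ij", {"page": 1})]
for chunk in split_pages(pages, fixed_width):
    print(chunk.page_content, chunk.metadata)
```

Because splitting never crosses a page boundary, the number of chunks is at least the number of non-empty pages, and a chunk's `page` value always identifies a single source page.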

## **Token splitting**

- We can also split on token count explicitly, if we want.

- This can be useful because LLM context windows are usually measured in tokens.

- A token is often about 4 characters long.
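
The 4-characters-per-token rule of thumb is enough for a quick budget check before running a real tokenizer. The sketch below is only a heuristic: `estimate_tokens` and `fits_context` are hypothetical helpers, and actual token counts depend on the tokenizer and the text.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count; real tokenizers will differ."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text, context_window, chars_per_token=4):
    """Check whether the estimated token count fits a given context window."""
    return estimate_tokens(text, chars_per_token) <= context_window


sample = "foo bar bazzyfoo"
print(len(sample), "characters ->", estimate_tokens(sample), "estimated tokens")
print(fits_context(sample, context_window=3))
```

An estimate like this is useful for sizing chunks roughly. For exact limits, count tokens with the tokenizer that the target model uses.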

In [49]:
from langchain.text_splitter import TokenTextSplitter

Now, let's create our splitter. We also need to install `tiktoken`:

In [52]:
! pip install tiktoken



In [53]:
text_splitter = TokenTextSplitter(
    chunk_size = 1,
    chunk_overlap = 0
)

text4 = "foo bar bazzyfoo"

In [54]:
## split the text:
text_splitter.split_text(text4)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

We can see that tokens are different from characters: a single word can be broken into several tokens.
Now we apply this to the PDF document.

In [61]:
text_splitter = TokenTextSplitter(
    chunk_size = 10,
    chunk_overlap = 0,
)

docs = text_splitter.split_documents(pages)

In [62]:
for i in range(5):
    print(docs[i].page_content)
    print(docs[i].metadata)
    print("---------------")

MachineLearning-Lecture01  

{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
---------------
Instructor (Andrew Ng):  Okay. Good
{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
---------------
 morning. Welcome to CS229, the machine 
{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
---------------

learning class. So what I wanna do today
{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
---------------
 is ju st spend a little time going over the
{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
---------------


## **Context aware splitting**

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [63]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n
## Chapter 2\n\n \
Hi this is Molly"""

In [64]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [66]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on = headers_to_split_on
)

md_header_splits = markdown_splitter.split_text(markdown_document)

In [70]:
for split in md_header_splits:
    print(split.page_content)
    print(split.metadata)
    print("----------------")

Hi this is Jim  
Hi this is Joe
{'Header 1': 'Title', 'Header 2': 'Chapter 1'}
----------------
Hi this is Lance
{'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}
----------------
Hi this is Molly
{'Header 1': 'Title', 'Header 2': 'Chapter 2'}
----------------
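
The metadata in the output above follows a simple rule: each chunk records the headers it sits under, and a new header replaces any header at the same or a deeper level. The sketch below illustrates that bookkeeping in plain Python. `split_by_headers` is a hypothetical function written for this example and is not how `MarkdownHeaderTextSplitter` is implemented.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group non-header lines into sections tagged with their active headers."""
    # Check longer prefixes first so "###" is not mistaken for "#".
    by_length = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections = []
    active = {}  # header level -> (metadata key, header title)
    body = []    # content lines collected since the last header

    def flush():
        if body:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            sections.append(("\n".join(body), metadata))
            body.clear()

    for raw in markdown.split("\n"):
        line = raw.strip()
        for prefix, key in by_length:
            if line.startswith(prefix + " "):
                flush()
                level = len(prefix)
                # A new header closes headers at the same or deeper level.
                for old in [lvl for lvl in active if lvl >= level]:
                    del active[old]
                active[level] = (key, line[len(prefix):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return sections


doc = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for content, metadata in split_by_headers(doc, headers):
    print(content)
    print(metadata)
    print("----------------")
```

When `## Chapter 2` appears, both `Chapter 1` and its `Section` are dropped from the active headers, which is why the last chunk carries only `Header 1` and `Header 2`.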
