# Document Splitting

![](images/doc_split.png)

All of LangChain's text splitters work on the same basis: they split text into chunks of a given chunk size, with a given chunk overlap between consecutive chunks.\
The diagram below shows what that looks like.

|             |                            |
| ----------------------------------------------- | ------------------------------------- |
| ![](images/splitter.png) | ![](images/types_of_splitters.png) |

In [2]:
import os
import openai
import sys
sys.path.append('../..')

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ.get('OPENAI_API_KEY')

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [4]:
chunk_size = 26
chunk_overlap = 4

In [5]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

In [7]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

When we split it with the recursive character text splitter, it still ends up as one string. The string is 26 characters long and we've specified a chunk size of 26,\
so there's no need to do any splitting here.

Now let's try a slightly longer string, one that exceeds the 26 characters we specified as the chunk size.


In [8]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [9]:
len(text2)

33

In [10]:
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Here we can see that two chunks are created.
- The first one ends at `z`, so it is 26 characters long. The next one starts with `w`, `x`, `y`, `z`.
- Those four characters are the **chunk overlap**, and the chunk then continues with the rest of the string.
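
The arithmetic behind this result can be sketched in a few lines of plain Python. This is a simplified sliding window for a string with no separators, not LangChain's implementation, and the function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text1 = 'abcdefghijklmnopqrstuvwxyz'
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

print(sliding_chunks(text1, 26, 4))  # ['abcdefghijklmnopqrstuvwxyz']
print(sliding_chunks(text2, 26, 4))  # ['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
```

The second window starts at index 22 (26 minus 4), which is the letter `w`, so the last four characters of the first chunk are repeated.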

Let's take a look at a slightly more complex string where we have a bunch of spaces between characters.

In [11]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [12]:
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

It is now split into three chunks, because the spaces make the string longer.
- Looking at the overlap, `l` and `m` appear at the end of the first chunk and again at the start of the second.
- That looks like only two characters, but the space between them counts too, so the shared text `l m` is three characters. Carrying over one more letter together with its space would make five, which exceeds the **chunk overlap** of 4, so the overlap acts as an upper bound rather than an exact length.
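
To see where the three-character overlap comes from, here is a minimal sketch of merging separator-delimited pieces into chunks. It is a simplification written for this example (the name `merge_pieces` is hypothetical), and it assumes the overlap is treated as a maximum, which is consistent with the output above.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks, carrying over at most chunk_overlap characters."""
    sep_len = len(separator)
    chunks = []
    current = []  # pieces in the chunk being built
    total = 0     # length of separator.join(current)
    for piece in pieces:
        extra = len(piece) + (sep_len if current else 0)
        if current and total + extra > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits in the overlap
            # and leaves room for the incoming piece.
            while current and (total > chunk_overlap
                               or total + len(piece) + sep_len > chunk_size):
                removed = current.pop(0)
                total -= len(removed) + (sep_len if current else 0)
        current.append(piece)
        total += len(piece) + (sep_len if len(current) > 1 else 0)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
result = merge_pieces(text3.split(" "), " ", chunk_size=26, chunk_overlap=4)
print(result)
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

Each letter costs two characters once its space is included, so the carried-over text shrinks from 5 characters (`k l m`) straight to 3 (`l m`) and never lands on exactly 4.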

Let's now try with the **CharacterTextSplitter**.

In [13]:
c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

When we run it, it doesn't split the text at all.\
The reason is that the **character text splitter splits on a single separator**, and by default that separator is a **double newline (`"\n\n"`)**. Here there are no newlines at all.\
If we set the separator to a single space, we can see what happens.

In [14]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)

c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Here it's split in the same way as before.
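
Why the separator matters can be checked with plain `str.split`, used here only as a stand-in for the splitter's first step (cutting the text into pieces before merging them into chunks).

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# No double newline in the text: there is only one piece, so nothing can be split.
pieces_default = text3.split("\n\n")
print(len(pieces_default))  # 1

# Splitting on a space yields one piece per letter, which can then be merged into chunks.
pieces_space = text3.split(" ")
print(len(pieces_space))  # 26
```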

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [15]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

Now let's try it out on a more realistic example.\
The long text above contains a double newline (`\n\n`) in the middle, which is a typical separator between paragraphs.

In [16]:
len(some_text)

496

Now let's define our two text splitters.\
We'll use the **CharacterTextSplitter** as before, with a space as the separator, and then initialize the **RecursiveCharacterTextSplitter**.

In [17]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

Here we pass in a list of separators. These are the **default separators**; we spell them out in this notebook to better show what's going on.\
The list is double newline, single newline, space, and finally an empty string.
- This means that when splitting a piece of text, it first tries to split on **double newlines (`"\n\n"`)**.
- If the resulting chunks still need to be split further, it moves on to **single newlines (`"\n"`)**.
- If that is still not enough, it moves on to **spaces (`" "`)**.
- Finally, it goes character by character if it really needs to.
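
The fallback order described above can be sketched as a small recursive function. This is an illustration only, with a made-up name (`recursive_split`) and no chunk overlap; LangChain's actual splitter handles more cases.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so every chunk is at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so fall through to the next one.
        return recursive_split(text, rest, chunk_size)
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


separators = ["\n\n", "\n", " ", ""]
print(recursive_split("one two\n\nthree four five", separators, 10))
# ['one two', 'three four', 'five']
print(recursive_split("abcdefghijkl", separators, 5))
# ['abcde', 'fghij', 'kl']
```

In the first call the paragraph break is tried first, and only the second paragraph, which is still too long, is split again on spaces. In the second call no separator matches, so the text is cut character by character.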

In [18]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [19]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

- Looking at how these perform on the text above, the **CharacterTextSplitter** splits on spaces, so we end up with an awkward break in the middle of a sentence.
- The **RecursiveCharacterTextSplitter** first tries to split on double newlines, so here it splits the text into its two paragraphs.
  - Even though the first chunk is shorter than the 450 characters we specified, this is probably a better split, because each paragraph now sits in its own chunk instead of being cut in the middle of a sentence.

Let's now split it into even smaller chunks to get a better intuition for what's going on.\
We'll also add a **period separator**, aimed at splitting between sentences.


In [20]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid-escape warning
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

The aim of this text splitter is to split on sentences, but with a plain `\. ` pattern the periods end up in the wrong places.
- This is because of the regex splitting that goes on under the hood: the match consumes the period and the space, so the period is no longer attached to the end of its sentence. To fix this, we can specify a slightly more complicated regex with a lookbehind.
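
The difference between the two patterns can be seen with Python's `re.split` alone, independent of LangChain. The sample sentence is made up for this sketch.

```python
import re

text = "First sentence. Second sentence. Third one."

# Plain pattern: the period and the space are consumed by the split.
plain = re.split(r"\. ", text)
print(plain)
# ['First sentence', 'Second sentence', 'Third one.']

# Lookbehind: a zero-width match right after ". ", so nothing is consumed
# and each period stays with the sentence it closes.
lookbehind = re.split(r"(?<=\. )", text)
print(lookbehind)
# ['First sentence. ', 'Second sentence. ', 'Third one.']
```

Splitting on a zero-width match like this requires Python 3.7 or later.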

In [21]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

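With the lookbehind applied as a regex, the split point falls just after the period and space, so the text is split into sentences with the periods in the right places. Note that the output captured here is identical to the previous run and still breaks at spaces (for example after `For example,`), which suggests the separators were treated as literal strings in this run; in LangChain versions that escape separators by default, pass `is_separator_regex=True` to have them interpreted as regexes.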

Let's now do this on an even more real-world example, with one of the PDFs that we worked with in the first document loading section.

In [23]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()

In [24]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

Here we pass the **length_function**.
- This is `len`, the Python built-in.
- It is the **default**, but we specify it to make clearer what's going on under the hood: chunk length is measured in characters.
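
To illustrate what a length function changes, here is a small sketch of a packer whose notion of "size" is pluggable. The function `pack` and the sample lines are hypothetical and much simpler than the real splitter; the point is only that the same `chunk_size` means different things under different length functions.

```python
def pack(pieces, chunk_size, length_function=len, separator="\n"):
    """Join pieces into chunks whose measured length stays within chunk_size."""
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


lines = ["a b c d", "e f", "longword"]

# Default: size is measured in characters.
by_chars = pack(lines, chunk_size=8)
print(by_chars)  # ['a b c d', 'e f', 'longword']

# Custom length function: size is measured in whitespace-separated words.
by_words = pack(lines, chunk_size=4, length_function=lambda s: len(s.split()))
print(by_words)  # ['a b c d', 'e f\nlongword']
```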
 

In [25]:
docs = text_splitter.split_documents(pages)

If we compare the number of resulting documents with the number of original pages, we can see that the splitting has created many more documents.

In [26]:
len(docs)

77

In [27]:
len(pages)

22

We can do a similar thing with the Notion DB that we used in the first lecture as well.

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [None]:
docs = text_splitter.split_documents(notion_db)

In [None]:
len(notion_db)

In [None]:
len(docs)

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated by token count.

Tokens are often ~4 characters.

In [31]:
from langchain.text_splitter import TokenTextSplitter

To really get a sense of the difference between tokens and characters,\
let's initialize the token text splitter with a chunk size of 1 and a chunk overlap of 0. This will split any text into a list of its individual tokens.

In [32]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

Let's create a fun made-up text. When we split it, we can see that it's split into a bunch of different tokens, which vary in length and in the number of characters they contain.

In [33]:
text1 = "foo bar bazzyfoo"

In [34]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

- The first token is just `foo`, then comes a space followed by `bar`, then a space followed by just `b`, then `az`, then `zy`, and then `foo` again.
- This shows a little of the difference between splitting on characters and splitting on tokens.
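
Taking the token list printed above as given, a few lines of plain Python show how token-based chunk sizes relate to character counts. No tokenizer is run here; the helper `group_tokens` is made up for this sketch and simply groups the tokens from the output.

```python
# Tokens copied from the splitter output above.
tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']

# The tokens concatenate back to the original text.
print("".join(tokens))  # foo bar bazzyfoo

# Tokens vary in length; for this made-up text the average is below 4 characters.
lengths = [len(t) for t in tokens]
print(lengths)                                # [3, 4, 2, 2, 2, 3]
print(round(sum(lengths) / len(tokens), 2))   # 2.67


def group_tokens(tokens, chunk_size):
    """Group tokens into chunks of chunk_size tokens each (no overlap)."""
    return ["".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]


chunks = group_tokens(tokens, 2)
print(chunks)  # ['foo bar', ' baz', 'zyfoo']
```

Chunks with the same token count can have quite different character counts, which is why token-based splitting is the safer choice when a model's limit is expressed in tokens.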

Let's apply this to the documents that we loaded above. In a similar way, we can call `split_documents` on the pages and then take a look at the first document.

In [35]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [36]:
docs = text_splitter.split_documents(pages)

In [37]:
docs[0]

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0})

- If we take a look at the first document, we have our new split document, with the page content being roughly the title, plus the metadata of the source and the page it came from.
- The source and page metadata of the chunk are the same as those of the original document; checking the metadata of `pages[0]` confirms that they line up.
- This is good: the metadata is carried through to each chunk appropriately. There are also cases where you want to add more metadata to the chunks as you split them.

In [38]:
pages[0].metadata

{'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0}
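
The idea of carrying metadata through to every chunk can be sketched without LangChain. The `Doc` class, the `split_docs` helper, and the sample file name below are stand-ins invented for this example; they only mimic the shape of a document with `page_content` and `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, chunk_size):
    """Cut each document into fixed-size chunks, copying its metadata to every chunk."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            chunks.append(Doc(
                page_content=doc.page_content[start:start + chunk_size],
                metadata=dict(doc.metadata),  # copy, so chunks do not share one dict
            ))
    return chunks


sample_pages = [Doc("abcdefghij", {"source": "docs/example.pdf", "page": 0})]
sample_chunks = split_docs(sample_pages, chunk_size=4)

print([c.page_content for c in sample_chunks])  # ['abcd', 'efgh', 'ij']
print(sample_chunks[0].metadata == sample_pages[0].metadata)  # True
```

Copying the dictionary per chunk also makes it safe to add chunk-specific metadata later without changing the original page.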

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [39]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [40]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [41]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [42]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [43]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

In [44]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
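
How header metadata can be attached to each chunk is sketched below in plain Python. This is a simplified illustration, not the library's implementation: the function name `split_markdown_by_headers` is made up, chunks are plain dictionaries, and edge cases such as headers inside code blocks are ignored.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group Markdown lines into chunks tagged with the headers they sit under."""
    chunks = []
    active = {}   # header name -> title, for the headers currently in effect
    buffer = []   # body lines collected since the last header

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(active)})
        buffer.clear()

    for raw_line in text.split("\n"):
        line = raw_line.strip()
        match = next(((marker, name) for marker, name in headers_to_split_on
                      if line.startswith(marker + " ")), None)
        if match:
            flush()
            marker, name = match
            # A new header replaces any header at the same or a deeper level.
            for other_marker, other_name in headers_to_split_on:
                if len(other_marker) >= len(marker):
                    active.pop(other_name, None)
            active[name] = line[len(marker):].strip()
        elif line:
            buffer.append(line)
    flush()
    return chunks


sample_md = ("# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
             "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly")
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]

sample_splits = split_markdown_by_headers(sample_md, headers)
for split in sample_splits:
    print(split)
```

When `## Chapter 2` begins, the `Header 3` entry from the previous section is dropped, so the last chunk carries only `Header 1` and `Header 2`.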

In [None]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [None]:
md_header_splits = markdown_splitter.split_text(txt)

In [None]:
md_header_splits[0]