 # <center style="font-family: consolas; font-size: 32px; font-weight: bold;">  Hands-On LangChain for LLM Applications Development: Documents Splitting </center>
***

Once you’ve loaded documents, you’ll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model’s context window. 

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. 

LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents. In this two-part practical article, we will look at why document splitting matters, survey the available LangChain text splitters, and explore four of them in depth. 

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">
<ul>
    <li><a href="#1" target="_self" rel=" noreferrer nofollow">1. Why do we need document splitting? </a></li> 
    <li><a href="#2" target="_self" rel=" noreferrer nofollow">2. Different types of LangChain splitters </a></li> 
    <li><a href="#3" target="_self" rel=" noreferrer nofollow">3. Introduction to recursive character text splitter & the character text splitter </a></li> 
    <li><a href="#4" target="_self" rel=" noreferrer nofollow">4. Diving deep in recursive splitting </a></li>
    <li><a href="#5" target="_self" rel=" noreferrer nofollow">5. PDF loading & splitting </a></li>
    <li><a href="#6" target="_self" rel=" noreferrer nofollow">6. Token splitting </a></li>
    <li><a href="#7" target="_self" rel=" noreferrer nofollow">7. Context-aware splitting </a></li>
</ul>
</div>

***

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1.  Why do we need document splitting? </b></div>


Document splitting happens after you load your data into the document format, but before it goes into the vector store. This may seem really simple. 

You could just split the text into chunks of a fixed number of characters or something like that. But to see why this is both trickier and very important down the line, let’s take a look at an example.

We’ve got a sentence about an LLM model and some specifications. And if we did a simple splitting, we could end up with part of the sentence in one chunk, and the other part of the sentence in another chunk.

Here is the whole sentence:

**Falcon LLM is a generative large language model (LLM) that helps advance applications and use cases to future-proof our world. Today the Falcon 180B, 40B, 7.5B, and 1.3B parameter AI models.**


Let's split it into two chunks:

First chunk: 

**Falcon LLM is a generative large language model (LLM) that helps advance applications and use cases to future-proof our world**

Second chunk:

**Today the Falcon 180B, 40B, 7.5B, and 1.3B parameter AI models.**

When we later try to answer a question about the specifications of the Falcon LLM, we don’t have the right information in either chunk, because the relevant sentence has been split apart. So, we wouldn’t be able to answer this question correctly. There is a lot of nuance and importance in splitting the text so that semantically related pieces end up in the same chunk.

All the text splitters in LangChain are based on splitting text into chunks of some chunk size with some chunk overlap. The figure below shows what that looks like. The chunk size is the size of a chunk, which can be measured in a few different ways. 

A chunk overlap is generally kept as a little overlap between two chunks, like a sliding window as we move from one to the other. This allows for the same piece of context to be at the end of one chunk and at the start of the other and helps create some notion of consistency.
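
To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function name `sliding_window_chunks` is made up for illustration, and it measures size in characters only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the shared context that the overlap provides.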

<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2.  Different types of LangChain splitters </b></div>


There are a lot of different types of splitters in LangChain. We’ll cover a few of them in this article to give you an idea of how they work, and you are encouraged to try the rest on your own. These text splitters vary across a number of dimensions. They vary in how they split the chunks and which characters they split on. They also vary in how they measure the length of the chunks: is it by characters, or by tokens? There are even some that use other, smaller models to determine where the end of a sentence might be and use that as a way of splitting chunks.

Another important part of splitting into chunks is the metadata: maintaining the same metadata across all chunks, but also adding new pieces of metadata when relevant.

**Here is a summary of the splitters available in the LangChain package:**

* **CharacterTextSplitter():** Implementation of splitting text that looks at characters. 
* **MarkdownHeaderTextSplitter():** Implementation of splitting markdown files based on specified headers. 
* **TokenTextSplitter():** Implementation of splitting text that looks at tokens. 
* **SentenceTransformersTokenTextSplitter():** Implementation of splitting text that looks at tokens, using the tokenizer of a sentence-transformers model. 
* **RecursiveCharacterTextSplitter():** Implementation of splitting text that looks at characters. Recursively tries to split by different characters to find one that works.
* **Language:** An enum of supported languages (CPP, Python, Ruby, Markdown, etc.), used with `RecursiveCharacterTextSplitter.from_language()` to split source code on language-specific separators. 
* **NLTKTextSplitter():** Implementation of splitting text that looks at sentences using NLTK (Natural Language Toolkit). 
* **SpacyTextSplitter():** Implementation of splitting text that looks at sentences using spaCy.

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3.  Introduction to recursive character text splitter & the character text splitter </b></div>


Let's start with two of the most common types of text splitters in LangChain: the recursive character text splitter and the character text splitter. We are going to play around with a few toy use cases to get a sense of how they work. We are going to set a relatively small chunk size of 26, and an even smaller chunk overlap of 4, so that we can see what these can do.

Let’s initialize these text splitters as R splitter and a C splitter. And then let’s take a look at a few different use cases.



In [1]:
!pip install LangChain -q
!pip install -U langchain-community -q

from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Let’s load in the first string. A, B, C, D, all the way down to Z. And let’s look at what happens when we use the various splitters.

In [2]:
text1 = 'abcdefghijklmnopqrstuvwxyz'

Let's start with the RecursiveCharacterTextSplitter and see the output:

In [3]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

When we split it with the RecursiveCharacterTextSplitter it still ends up as one string. This is because this is 26 characters long and we’ve specified a chunk size of 26. So, there’s actually no need to even do any splitting here. Now, let’s do it on a slightly longer string where it’s longer than the 26 characters that we’ve specified as the chunk size.


In [4]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Here we can see that two different chunks are created. The first one ends at Z, so that’s 26 characters. The next one starts with W, X, Y, Z. Those four characters are the chunk overlap, and then it continues with the rest of the string.

Let’s take a look at a slightly more complex string where we have a bunch of spaces between characters.

In [5]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

We can now see that it’s split into three chunks: because of the spaces, the string takes up more room. If we look at the overlap, the first chunk ends with L and M, and L and M are also present at the start of the second one. That looks like only two characters, but the space between them counts too, so the overlap is three characters. The splitter only carries over whole pieces, and adding K plus another space would bring the overlap to five characters, which is above the chunk overlap of four.

Let’s now try the CharacterTextSplitter on the same string. When we run it with its default settings, it doesn’t actually split the text at all.

The reason for this is that the CharacterTextSplitter splits on a single separator, and by default that separator is a double newline (`"\n\n"`). But here, there are no newlines.
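
The effect of the separator can be illustrated with Python's built-in `str.split`, used here only as a stand-in for the first step of a single-separator splitter (it is not what LangChain runs internally). If the separator never occurs in the text, there is nothing to split on:

```python
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

# A newline-based separator never occurs in this text, so we get one piece back.
pieces_newline = text3.split("\n\n")

# A space occurs between every letter, so we get one piece per letter.
pieces_space = text3.split(" ")

print(len(pieces_newline), len(pieces_space))
# 1 26
```

With only one piece available, there is nothing for the splitter to regroup into smaller chunks, which is why the text comes back unsplit.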

We can set the separator to a single space and see what happens then:

In [6]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

We can see that here it’s split in the same way as before after we change the separator parameter to separate on space.

<a id="4"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 4.  Diving deep in recursive splitting </b></div>

Now, let’s try it out on some more real-world examples. The text below has two paragraphs, and in the middle of it there is a double newline, which is a typical separator between paragraphs.

In [7]:
text = """As the world grapples with the challenges of climate change, \
renewable energy emerges as a beacon of hope. Solar and wind power, \
in particular, are transforming the energy landscape, offering sustainable \
alternatives to traditional fossil fuels. \n\n  \
Governments and businesses globally are investing in clean energy \
initiatives to reduce carbon footprints and mitigate environmental impact. \
The shift towards renewables not only addresses environmental concerns \
but also fosters innovation, creating a brighter and more sustainable \
future for generations to come."""


Let’s check the length of this text. We can see that it’s a little under 600 characters. 

In [8]:
len(text)

563

Let’s define two text splitters: a Character Text Splitter and a Recursive Character Text Splitter. We will use the character text splitter as before, with the space as a separator, and then we’ll initialize the recursive character text splitter. Here, we pass in a list of separators. These are the default separators, but we pass them explicitly to show better what’s going on.

In [9]:
# Character Text Splitter
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)

# Recursive Character Text Splitter

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

For the Recursive Character Text Splitter, we have passed a list of separators. This list is double newline, single newline, space, and then nothing, an empty string. 

This means that when you’re splitting a piece of text it will first try to split it by double newlines. Then, if it still needs to split the individual chunks more it will go on to single newlines. Then, if it still needs to do more it goes on to the space. Finally, it will go character by character if it really needs to do that.
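
The fallback order described above can be sketched in plain Python. This is a simplified illustration with no chunk overlap, not LangChain's actual algorithm, and the function name `recursive_split` is made up for this example:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; retry oversized pieces with the remaining ones."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut character by character into fixed-size windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "first paragraph here\n\nsecond one is a bit longer than the first"
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
print(result)
# ['first paragraph here', 'second one is a bit longer', 'than the first']
```

The first paragraph fits within the limit and is kept whole, while the second paragraph is too long and gets split again on spaces.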

Let's apply these two splitters to the text above and look at how they perform. 

In [11]:
c_splitter.split_text(text)

['As the world grapples with the challenges of climate change, renewable energy emerges as a beacon of hope. Solar and wind power, in particular, are transforming the energy landscape, offering sustainable alternatives to traditional fossil fuels. \n\n Governments and businesses globally are investing in clean energy initiatives to reduce carbon footprints and mitigate environmental impact. The shift towards renewables not only addresses',
 'environmental concerns but also fosters innovation, creating a brighter and more sustainable future for generations to come.']

We can see that the character text splitter splits on the spaces. So, we end up with this weird separation in the middle of the sentence. 

The Recursive Character Text Splitter first tries to split on double newlines, and so here it splits the text into two paragraphs. Even though the first one is shorter than the 450 characters we specified, this is probably a better split, because each paragraph now sits in its own chunk instead of the text being split in the middle of a sentence.

In [12]:
r_splitter.split_text(text)

['As the world grapples with the challenges of climate change, renewable energy emerges as a beacon of hope. Solar and wind power, in particular, are transforming the energy landscape, offering sustainable alternatives to traditional fossil fuels.',
 'Governments and businesses globally are investing in clean energy initiatives to reduce carbon footprints and mitigate environmental impact. The shift towards renewables not only addresses environmental concerns but also fosters innovation, creating a brighter and more sustainable future for generations to come.']

Now let’s split it into even smaller chunks just to get a better understanding of how it works. We’ll also add a period separator. This is aimed at splitting in between sentences. 

In [13]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid-escape warning
)
r_splitter.split_text(text)

['As the world grapples with the challenges of climate change, renewable energy emerges as a beacon of hope. Solar and wind power, in particular, are',
 'transforming the energy landscape, offering sustainable alternatives to traditional fossil fuels.',
 'Governments and businesses globally are investing in clean energy initiatives to reduce carbon footprints and mitigate environmental impact. The',
 'shift towards renewables not only addresses environmental concerns but also fosters innovation, creating a brighter and more sustainable future for',
 'generations to come.']

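Looking at the output above, the text is still broken on spaces rather than at sentence boundaries. In recent LangChain versions, `RecursiveCharacterTextSplitter` treats each separator as a literal string unless you pass `is_separator_regex=True`, so the pattern `\. ` never matches anything in the text. With regex separators enabled, a plain `\. ` separator does split between sentences, but the periods tend to end up in the wrong places, attached to the start of the following chunk. To fix that, we can specify a slightly more complicated regex with a lookbehind, `(?<=\. )`, which splits right after the period and the space so that each period stays at the end of its sentence. The next cell uses this lookbehind pattern. Since it is again passed without `is_separator_regex=True`, its output is the same as before.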

In [14]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning
)
r_splitter.split_text(text)

['As the world grapples with the challenges of climate change, renewable energy emerges as a beacon of hope. Solar and wind power, in particular, are',
 'transforming the energy landscape, offering sustainable alternatives to traditional fossil fuels.',
 'Governments and businesses globally are investing in clean energy initiatives to reduce carbon footprints and mitigate environmental impact. The',
 'shift towards renewables not only addresses environmental concerns but also fosters innovation, creating a brighter and more sustainable future for',
 'generations to come.']
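
To see what the lookbehind changes, here is a small sketch using Python's standard `re` module directly, outside LangChain. The sample sentence is made up for illustration:

```python
import re

sample = "Solar is growing. Wind is growing too. Both help."

# Plain pattern: the matched ". " is consumed, so the periods disappear.
consumed = re.split(r"\. ", sample)

# Lookbehind: the split happens at the empty position right after ". ",
# so every period stays at the end of its own sentence.
kept = re.split(r"(?<=\. )", sample)

print(consumed)
# ['Solar is growing', 'Wind is growing too', 'Both help.']
print(kept)
# ['Solar is growing. ', 'Wind is growing too. ', 'Both help.']
```

Splitting on an empty match like this requires Python 3.7 or later.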

<a id="5"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 5.  PDF loading & splitting </b></div>

Let’s now apply the document splitters on an even more real-world example with one of the PDFs that we worked with in the document loading article. Let’s load it in, and then let’s define the text splitter here.

In [15]:
# Document loaders live in the langchain-community package installed earlier
from langchain_community.document_loaders import PyPDFLoader

pdf_file = '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf'
loader = PyPDFLoader(pdf_file)
pages = loader.load()

For the text splitter, we first set the separator to the newline character. The chunk size is 1000 and the overlap is 150. Finally, we pass the length function, using Python's built-in len function. This is the default, but we’re specifying it to make clearer what’s going on behind the scenes: the length of each chunk is counted in characters. 



In [16]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)


Because we now want to use documents, we’re using the split documents method, and we’re passing in a list of documents.



In [17]:
docs = text_splitter.split_documents(pages)

If we compare the number of resulting documents to the number of original pages, we can see that the splitting has created many more documents.

In [18]:
len(docs)

80

In [19]:
len(pages)

41

Finally, let's print the content of the third document:

In [20]:
docs[2].page_content

'PAGE 3Table of \nContentsIntroduction: Coding AI is the New Literacy.\nChapter 1: Three Steps to Career Growth.\nChapter 2: Learning Technical Skills for a \nPromising AI Career.\nChapter 3: Should You Learn Math to Get a Job \nin AI?\nChapter 4: Scoping Successful AI Projects.\nChapter 5: Finding Projects that Complement \nYour Career Goals.\nChapter 6: Building a Portfolio of Projects that \nShows Skill Progression.\nChapter 7: A Simple Framework for Starting Your AI \nJob Search.\nChapter 8: Using Informational Interviews to Find \nthe Right Job.\nChapter 9: Finding the Right AI Job for You.\nChapter 10: Keys to Building a Career in AI.\nChapter 11: Overcoming Imposter Syndrome.\nFinal Thoughts: Make Every Day Count.LEARNING\nPROJECTS\nJOB'

<a id="6"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 6.  Token splitting </b></div>

In the previous sections, we have done all the splitting based on characters. But there’s another way to split your text: based on tokens. LLMs often have context windows that are defined by token count, so it’s important to know what the tokens are and where they appear. Then, we can split on them to get a slightly more representative idea of how the LLM would view the text.

To have a better understanding of the difference between tokens and character splitters we will apply both to a piece of text and compare the results.

Let’s initialize the token text splitter with a chunk size of 1 and a chunk overlap of 0. So, this will split any text into a list of the relevant tokens.

In [21]:
!pip install tiktoken
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

Let’s create a small text example, and when we split it, we can see that it’s split into a bunch of different tokens, and they’re all a little bit different in terms of their length and the number of characters in them.

In [22]:
text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']

So, the first token is just foo. Then you’ve got a space together with bar, then a space together with just the B, then AZ, then ZY, and then foo again. This shows a little bit of the difference between splitting on characters versus splitting on tokens.
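
To make the difference between character counts and token counts tangible, here is a toy sketch. It is not how tiktoken works: real tokenizers use learned byte-pair merges, whereas this uses a greedy longest-match over a tiny hypothetical vocabulary chosen only to reproduce the split shown above.

```python
def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer over a hypothetical vocabulary (not real BPE)."""
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Unknown single characters are emitted as their own token.
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_vocab = {"foo", " bar", " b", "az", "zy"}  # made up for this example
sample_text = "foo bar bazzyfoo"
tokens = toy_tokenize(sample_text, toy_vocab)

print(tokens)
# ['foo', ' bar', ' b', 'az', 'zy', 'foo']
print(len(sample_text), "characters vs", len(tokens), "tokens")
# 16 characters vs 6 tokens
```

The same text has very different lengths depending on the unit, which is why a chunk size in tokens is a closer match to an LLM's context window than a chunk size in characters.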

Let’s apply this to the PDF that we loaded above. In a similar way, we can call split_documents on the pages. If we take a look at the first document, the page content is just a single token (`'PA'`, since the chunk size is 1), and the metadata holds the source and the page it came from.

In [23]:
docs = text_splitter.split_documents(pages)
docs[0]

Document(page_content='PA', metadata={'source': '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf', 'page': 0})

You can see here that the source and page metadata in the chunk are the same as in the original document. To make sure, we can look at `pages[0].metadata` and see that it lines up.

In [24]:
pages[0].metadata

{'source': '/kaggle/input/how-to-build-a-career-in-ai-pdf/eBook-How-to-Build-a-Career-in-AI.pdf',
 'page': 0}

This is good: the splitter carries the metadata through to each chunk appropriately. But there can also be cases where you actually want to add more metadata to the chunks as you split them. 

This can include information such as where in the document the chunk came from, or where it sits relative to other things or concepts in the document. Generally, this information can be used when answering questions to provide more context about what the chunk is exactly. 
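
As a rough illustration of both ideas (carrying metadata through and adding new fields), here is a sketch in plain Python. The `Doc` class is a hypothetical stand-in for a document object, and the `chunk_index` and `start_char` fields are invented for this example. They are not fields that LangChain adds by default.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(doc: Doc, chunk_size: int) -> list[Doc]:
    """Cut a document into fixed-size chunks, copying and extending its metadata."""
    chunks = []
    for index, start in enumerate(range(0, len(doc.page_content), chunk_size)):
        metadata = dict(doc.metadata)    # every chunk inherits the parent's metadata
        metadata["chunk_index"] = index  # extra field: position among the chunks
        metadata["start_char"] = start   # extra field: offset in the original text
        chunks.append(Doc(doc.page_content[start:start + chunk_size], metadata))
    return chunks


page = Doc("abcdefghij", {"source": "example.pdf", "page": 0})
chunks = split_with_metadata(page, chunk_size=4)
for chunk in chunks:
    print(chunk.page_content, chunk.metadata)
```

Copying the dictionary for each chunk matters: if all chunks shared one dictionary, the per-chunk fields would overwrite each other.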

To see a concrete example of this, let’s look at another type of text splitter that adds information to the metadata of each chunk.

<a id="7"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 7.  Context-aware splitting </b></div>

Text splitting often uses sentences or other delimiters to keep related text together. However, many documents (such as Markdown) have structure (headers) that can be explicitly used in splitting. 

We can use MarkdownHeaderTextSplitter to preserve header metadata in our chunks. This splitter will split a markdown file based on the header or any subheaders and then it will add those headers as content to the metadata fields and that will get passed on along to any chunks that originate from those splits.

**Let’s import the MarkdownHeaderTextSplitter using the code below:**


In [25]:
from langchain.text_splitter import MarkdownHeaderTextSplitter


Let's take this example to illustrate how it works. We have a title and then a subheader of Chapter 1. We then have some sentences there, followed by another section with an even smaller subheader. Then we jump back out to Chapter 2, with some sentences there:

In [26]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Chapter 1\n\n Hi this is Section 1\n\n \
### Section \n\n \
Hi this is Section 2 \n\n 
## Chapter 2\n\n \
Hi this is Chapter 2"""

Next, we define a list of the headers we want to split on and the names to give them. A single hash (#) is called Header 1, two hashes Header 2, and three hashes Header 3. We can then initialize the MarkdownHeaderTextSplitter with those headers and split the example we have above.

In [27]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

Now let's split the example above on the headers we defined in the previous code snippet:

In [28]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)


Let's print the output to get a better understanding of how it works:

In [29]:
md_header_splits[0]


Document(page_content='Hi this is Chapter 1  \nHi this is Section 1', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

Looking at the first split, the content is “Hi this is Chapter 1” followed by “Hi this is Section 1”. In the metadata, Header 1 is Title and Header 2 is Chapter 1, which comes straight from the headers in the example document above.

Let’s take a look at the next one:

In [30]:
md_header_splits[1]

Document(page_content='Hi this is Section 2', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})

We can see here that we’ve jumped down into an even smaller subsection. The content is “Hi this is Section 2”, and the metadata now holds not only Header 1 but also Header 2 and Header 3. This again comes from the content and header names in the markdown document above.
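
The way headers accumulate in the metadata can be sketched in plain Python. This is a simplified illustration, not the MarkdownHeaderTextSplitter implementation: the function name `split_by_headers` is made up, sections are returned as plain dictionaries, and special cases such as `#` characters inside code blocks are ignored.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group lines under their headers and record the active headers as metadata."""
    marker_to_name = dict(headers)
    active: dict[str, str] = {}  # headers currently in effect
    buffer: list[str] = []       # body lines collected since the last header
    sections: list[dict] = []

    def flush() -> None:
        content = "\n".join(buffer).strip()
        if content:
            sections.append({"content": content, "metadata": dict(active)})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        marker, _, title = stripped.partition(" ")
        if marker in marker_to_name and title:
            flush()
            # A new header replaces any header at the same or a deeper level.
            for other, name in marker_to_name.items():
                if len(other) >= len(marker):
                    active.pop(name, None)
            active[marker_to_name[marker]] = title.strip()
        elif stripped:
            buffer.append(stripped)
    flush()
    return sections


example = (
    "# Title\n\n## Chapter 1\n\nHi this is Chapter 1\n\n"
    "### Section\n\nHi this is Section 2\n\n## Chapter 2\n\nHi this is Chapter 2"
)
levels = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
sections = split_by_headers(example, levels)
for section in sections:
    print(section)
```

When Chapter 2 starts, the sketch drops both the old Header 2 and the Header 3 beneath it, so the last section carries only Title and Chapter 2.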

Let’s try this out on a real-world example. In a previous article, we loaded a Notion directory using the Notion directory loader. This loaded the files as markdown, which makes them a good fit for the markdown header splitter.

**Let’s load those documents using the NotionDirectoryLoader:**

In [31]:
from langchain_community.document_loaders import NotionDirectoryLoader


# Load the documents using NotionDirectoryLoader
loader = NotionDirectoryLoader("/kaggle/input/notiondb/Notion_DB")
docs = loader.load()

# Concatenate the content of all pages into a single string
notion_text = ' '.join([d.page_content for d in docs])

Now we will define the **MarkdownHeaderTextSplitter** with header 1 as a single hashtag and header 2 as a double hashtag. We split the text and we get our splits.

In [32]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(notion_text)

Let's now print the first of these splits:

In [33]:
md_header_splits[0]

Document(page_content='↓ Click the button below at the start of every week to clear the current meal plan.', metadata={'Header 1': 'Meal Planner'})

If we take a look at this result, we can see that the first split holds content from the Notion page. Looking at the metadata, we can see that Header 1 has been loaded as Meal Planner. 

If we print the second split, we can see that Header 1 in the metadata changes to Weekly Plan:

In [34]:
md_header_splits[1]

Document(page_content='To edit meals from a specific day of the week, click on the meal entry you want to modify.  \n[Weekly Plan](Meal%20Planner%2078127b2f0a0943808b94d7ff51873450/Weekly%20Plan%20f0e6db90b62b4b3fa432f1ba0cca0039.csv)', metadata={'Header 1': 'Weekly Plan'})

In this notebook, we have now gone over how to get semantically relevant chunks with appropriate metadata. In the next article, we will cover the next step of moving those chunks of data into a vector store.
