# Text Splitters

---

Alejandro Ricciardi (Omegapy)  
created date: 01/23/2024   
[GitHub](https://github.com/Omegapy)  

Credit: [LangChain](https://python.langchain.com/docs/expression_language/)

<br>

--- 

 
Project Description:  
**LangChain** is a framework for developing applications powered by language models.  
**In this project:** This project is a series of tutorials on LangChain text splitters for LLMs, written as Jupyter notebooks.  
The tutorials are a series of LangChain Python code examples from the https://python.langchain.com/ website, specifically from the [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) section.

Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.

When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means could depend on the type of text. This notebook showcases several ways to do that.

At a high level, text splitters work as follows:

1.	Split the text up into small, semantically meaningful chunks (often sentences).
2.	Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
3.	Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).

That means there are two different axes along which you can customize your text splitter:
1.	How the text is split
2.	How the chunk size is measured
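The three steps and the two customization axes above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the function names `split_into_sentences` and `merge_with_overlap` are hypothetical, the sentence splitting is a naive regex, and the overlap is counted in whole pieces rather than characters.

```python
import re


def split_into_sentences(text):
    """Step 1: break the text into small units (naive sentence splitting)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def merge_with_overlap(pieces, chunk_size, overlap_pieces=1, length_fn=len):
    """Steps 2 and 3: grow a chunk until the next piece would exceed
    chunk_size, emit it, then start the next chunk with some overlap."""
    chunks, current = [], []
    for piece in pieces:
        if current and length_fn(" ".join(current + [piece])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last pieces over to keep context between chunks.
            current = current[-overlap_pieces:] if overlap_pieces else []
            # Drop the overlap if it would already overflow the new chunk.
            if length_fn(" ".join(current + [piece])) > chunk_size:
                current = []
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One fish. Two fish. Red fish. Blue fish."
chunks = merge_with_overlap(split_into_sentences(text), chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

Swapping `split_into_sentences` changes how the text is split, and passing a different `length_fn` (for example, one that counts words or tokens) changes how the chunk size is measured.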

**Types of Text Splitters**
LangChain offers many different types of text splitters. Below is a table listing all of them, along with a few characteristics:

**Name**: Name of the text splitter

**Splits On**: How this text splitter splits text

**Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from.

**Description**: Description of the splitter, including recommendation on when to use it.

<p></p>
<table>
<thead>
<tr><th>Name</th><th>Splits On</th><th>Adds Metadata</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>Recursive</td><td>A list of user-defined characters</td><td></td><td>Recursively splits text. Splitting text recursively serves the purpose of trying to keep related pieces of text next to each other. This is the recommended way to start splitting text.</td></tr>
<tr><td>HTML</td><td>HTML-specific characters</td><td>✅</td><td>Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML).</td></tr>
<tr><td>Markdown</td><td>Markdown-specific characters</td><td>✅</td><td>Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown).</td></tr>
<tr><td>Code</td><td>Code (Python, JS) specific characters</td><td></td><td>Splits text based on characters specific to coding languages. The version used in this notebook lists 20 supported languages (see Split code).</td></tr>
<tr><td>Token</td><td>Tokens</td><td></td><td>Splits text on tokens. There exist a few different ways to measure tokens.</td></tr>
<tr><td>Character</td><td>A user-defined character</td><td></td><td>Splits text based on a user-defined character. One of the simpler methods.</td></tr>
<tr><td>[Experimental] Semantic Chunker</td><td>Sentences</td><td></td><td>First splits on sentences. Then combines adjacent sentences if they are semantically similar enough. Taken from <a href="https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb" target="_blank" rel="noopener noreferrer">Greg Kamradt</a>.</td></tr>
</tbody>
</table>

**Evaluate text splitters**
You can evaluate text splitters with the [Chunkviz utility](https://chunkviz.up.railway.app/) created by Greg Kamradt. ```Chunkviz``` is a great tool for visualizing how your text splitter is working. It shows how your text is being split up and helps you tune the splitting parameters.

**Other Document Transforms**
Text splitting is only one example of transformations that you may want to do on documents before passing them to an LLM. Head to Integrations for documentation on built-in document transformer [integrations](https://python.langchain.com/docs/integrations/document_transformers/) with 3rd-party tools.

<p></p>
<b style="font-size:15;">
⚠️ This project requires an OpenAI key.
</b>


##### Project Map  
- [API Keys](#api-keys)  
- [HTMLHeaderTextSplitter](#htmlheadertextsplitter)
    - [Base Example](#base-example-htmlheadertextsplitter)
    - [Limitations](#limitations)
- [Split by character](#split-by-character)
- [Split code](#split-code)
    - [Python](#python)
    - [JS](#js)
    - [TS](#ts)
    - [Markdown](#markdown)
    - [Latex](#latex)
    - [HTML](#html)
    - [Solidity](#solidity)
    - [C#](#c)
- [MarkdownHeaderTextSplitter](#markdownheadertextsplitter)
- [Recursively split by character](#recursively-split-by-character)
- [Semantic Chunking](#semantic-chunking)
- [Split by tokens](#split-by-tokens)
    - [tiktoken](#tiktoken)
    - [spaCy](#spacy)
    - [SentenceTransformers](#sentencetransformers)
    - [NLTK](#nltk)
    - [Hugging Face tokenizer](#hugging-face-tokenizer)

<br>

---


#### API Keys

In [4]:
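import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from the nearest .env file
load_dotenv(find_dotenv())
# The key is read from the OPEN_AI_KEY variable defined in .env.
# Note: OpenAIEmbeddings() looks for the OPENAI_API_KEY environment variable
# by default unless a key is passed explicitly.
OPENAI_API_KEY = os.environ.get("OPEN_AI_KEY")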

---
## HTMLHeaderTextSplitter

Similar in concept to the `MarkdownHeaderTextSplitter`, the `HTMLHeaderTextSplitter` is a “structure-aware” chunker that splits text at the element level and adds metadata for each header “relevant” to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
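To make the idea of attaching header metadata concrete, here is a minimal sketch built on the standard-library `html.parser`. It is not how `HTMLHeaderTextSplitter` is implemented: the `HeaderTracker` class is hypothetical, it walks the document in linear order instead of using the element tree, and it ignores nested inline tags inside headers.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect text chunks and tag each with the headers seen above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.names = dict(headers_to_split_on)  # tag -> metadata key
        self.active = {}                        # tag -> current header text
        self.current_tag = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_tag in self.names:
            # A new header replaces headers at the same or a deeper level.
            # "h1" < "h2" < ... compares correctly as plain strings.
            for tag in [t for t in self.active if t >= self.current_tag]:
                del self.active[tag]
            self.active[self.current_tag] = text
        else:
            metadata = {self.names[t]: self.active[t] for t in sorted(self.active)}
            self.chunks.append({"content": text, "metadata": metadata})


html = (
    "<h1>Foo</h1><p>Intro</p>"
    "<h2>Bar</h2><p>About Bar</p>"
    "<h2>Baz</h2><p>About Baz</p>"
)
tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2")])
tracker.feed(html)
tracker.close()
for chunk in tracker.chunks:
    print(chunk)
```

Each chunk carries the headers that were in effect when its text appeared, which is the same kind of metadata shown in the outputs of this section.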

<br>

---

### Base Example (HTMLHeaderTextSplitter)

1) With an HTML string:

In [5]:
from langchain.text_splitter import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]

2) Pipelined to another splitter, with HTML loaded from a web URL:

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

url = "https://plato.stanford.edu/entries/goedel/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# for local file use html_splitter.split_text_from_file(<path_to_file>)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(html_header_splits)
splits[80:85]

[Document(page_content='We see that Gödel first tried to reduce the consistency problem for analysis to that of arithmetic. This seemed to require a truth definition for arithmetic, which in turn led to paradoxes, such as the Liar paradox (“This sentence is false”) and Berry’s paradox (“The least number not defined by an expression consisting of just fourteen English words”). Gödel then noticed that such paradoxes would not necessarily arise if truth were replaced by provability. But this means that arithmetic truth', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 The First Incompleteness Theorem'}),
 Document(page_content='means that arithmetic truth and arithmetic provability are not co-extensive — whence the First Incompleteness Theorem.', metadata={'Header 1': 'Kurt Gödel', 'Header 2': '2. Gödel’s Mathematical Work', 'Header 3': '2.2 The Incompleteness Theorems', 'Header 4': '2.2.1 Th

[Project Map](#project-map)

---

### Limitations
There can be quite a bit of structural variation from one HTML document to another, and while `HTMLHeaderTextSplitter` will attempt to attach all “relevant” headers to any given chunk, it can sometimes miss certain headers. For example, the algorithm assumes an informational hierarchy in which headers are always at nodes “above” associated text, i.e. prior siblings, ancestors, and combinations thereof. In the following news article (as of the writing of this document), the text of the top-level headline, while tagged “h1”, is in a distinct subtree from the text elements that we’d expect it to be “above”. As a result, the “h1” element and its associated text do not show up in the chunk metadata (but, where applicable, we do see “h2” and its associated text).

In [7]:
url = "https://www.cnn.com/2023/09/25/weather/el-nino-winter-us-climate/index.html"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
print(html_header_splits[1].page_content[:500])

No two El Niño winters are the same, but many have temperature and precipitation trends in common.  
Average conditions during an El Niño winter across the continental US.  
One of the major reasons is the position of the jet stream, which often shifts south during an El Niño winter. This shift typically brings wetter and cooler weather to the South while the North becomes drier and warmer, according to NOAA.  
Because the jet stream is essentially a river of air that storms flow through, they c


[Project Map](#project-map)

---

---
## Split by character

This is the simplest method. It splits on a single separator (by default `"\n\n"`) and measures chunk length by number of characters.

1. How the text is split: by single character.
2. How the chunk size is measured: by number of characters.

<br>

---

In [8]:
# This is a long document we can split up.
with open("data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [9]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [10]:
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'


Here’s an example of passing metadata along with the documents. Notice that the metadata is split along with the documents.

In [11]:
metadatas = [{"document": 1}, {"document": 2}]
documents = text_splitter.create_documents(
    [state_of_the_union, state_of_the_union], metadatas=metadatas
)
print(documents[0])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.' metadata={'document': 1}
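The way metadata follows each chunk can be sketched without LangChain. The helper names below (`create_documents_sketch`, `split_on_blank_lines`) are hypothetical and documents are modeled as plain dictionaries; the point is only that every chunk receives a copy of the metadata of the text it came from.

```python
def split_on_blank_lines(text):
    """Stand-in splitter: one piece per blank-line-separated paragraph."""
    return [part for part in text.split("\n\n") if part]


def create_documents_sketch(texts, metadatas, split_fn):
    """Pair each source text with its metadata and copy it onto every chunk."""
    documents = []
    for text, metadata in zip(texts, metadatas):
        for chunk in split_fn(text):
            # dict(...) copies, so chunks do not share one mutable object.
            documents.append({"page_content": chunk, "metadata": dict(metadata)})
    return documents


docs = create_documents_sketch(
    ["first paragraph\n\nsecond paragraph", "another text"],
    [{"document": 1}, {"document": 2}],
    split_on_blank_lines,
)
for doc in docs:
    print(doc)
```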


In [12]:
text_splitter.split_text(state_of_the_union)[0]

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.'

[Project Map](#project-map)

---

---
## Split code

The code text splitter allows you to split code, with multiple languages supported. Import the ```Language``` enum and pass the language to ```RecursiveCharacterTextSplitter.from_language```.

<br>

---

In [13]:
from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter,
)

# Full list of supported languages
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol']

In [14]:
# You can also see the separators used for a given language
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
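The order of this list matters: separators that mark larger units (classes, functions) come before generic ones (blank lines, newlines, spaces). A rough sketch of that selection step follows, using the separator list printed above. The helper `first_matching_separator` is hypothetical and only illustrates the idea; note that `str.split` drops the separator, which a real splitter may handle differently.

```python
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]


def first_matching_separator(text, separators):
    """Return the first separator in the list that occurs in the text."""
    for separator in separators:
        # The empty string always matches and means "split anywhere".
        if separator == "" or separator in text:
            return separator
    return ""


code = "\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
chosen = first_matching_separator(code, PYTHON_SEPARATORS)
pieces = code.split(chosen)
print(repr(chosen))
print(pieces)
```

Because the sample contains no class definition, the function-level separator is the first one that matches, so each function body stays in one piece.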

### Python

In [15]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]

[Project Map](#project-map)

---

### JS

In [16]:
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'),
 Document(page_content='// Call the function\nhelloWorld();')]

### TS

In [17]:
TS_CODE = """
function helloWorld(): void {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS, chunk_size=60, chunk_overlap=0
)
ts_docs = ts_splitter.create_documents([TS_CODE])
ts_docs

[Document(page_content='function helloWorld(): void {'),
 Document(page_content='console.log("Hello, World!");\n}'),
 Document(page_content='// Call the function\nhelloWorld();')]

[Project Map](#project-map)

---

### Markdown

In [18]:
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
# Hopefully this code block isn't split
pip install langchain
```

As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs

[Document(page_content='# 🦜️🔗 LangChain'),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='## Quick Install\n\n```bash'),
 Document(page_content="# Hopefully this code block isn't split"),
 Document(page_content='pip install langchain'),
 Document(page_content='```'),
 Document(page_content='As an open-source project in a rapidly developing field, we'),
 Document(page_content='are extremely open to contributions.')]

### Latex

In [19]:
latex_text = """
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

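# NOTE: this cell uses Language.MARKDOWN, which is what produced the output below.
# Language.LATEX is also available and splits on LaTeX-specific separators instead.
latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)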
latex_docs = latex_splitter.create_documents([latex_text])
latex_docs

[Document(page_content='\\documentclass{article}\n\n\x08egin{document}\n\n\\maketitle'),
 Document(page_content='\\section{Introduction}'),
 Document(page_content='Large language models (LLMs) are a type of machine learning'),
 Document(page_content='model that can be trained on vast amounts of text data to'),
 Document(page_content='generate human-like language. In recent years, LLMs have'),
 Document(page_content='made significant advances in a variety of natural language'),
 Document(page_content='processing tasks, including language translation, text'),
 Document(page_content='generation, and sentiment analysis.'),
 Document(page_content='\\subsection{History of LLMs}'),
 Document(page_content='The earliest LLMs were developed in the 1980s and 1990s,'),
 Document(page_content='but they were limited by the amount of data that could be'),
 Document(page_content='processed and the computational power available at the'),
 Document(page_content='time. In the past decade, however, advanc

### HTML

In [20]:
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs

[Document(page_content='<!DOCTYPE html>\n<html>'),
 Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>'),
 Document(page_content='<style>\n            body {\n                font-family: Aria'),
 Document(page_content='l, sans-serif;\n            }\n            h1 {'),
 Document(page_content='color: darkblue;\n            }\n        </style>\n    </head'),
 Document(page_content='>'),
 Document(page_content='<body>'),
 Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'),
 Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='</p>\n        </div>'),
 Document(page_content='<div>\n            As an open-source project in a rapidly dev'),
 Document(page_content='eloping field, we are extremely open to contributions.'),
 Document(page_content='</div>\n    </body>\n</html>')]

[Project Map](#project-map)

---

### Solidity

In [21]:
SOL_CODE = """
pragma solidity ^0.8.20;
contract HelloWorld {
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)
sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs

[Document(page_content='pragma solidity ^0.8.20;'),
 Document(page_content='contract HelloWorld {\n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}')]

### C#

In [22]:
C_CODE = """
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed

        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
"""
c_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)
c_docs = c_splitter.create_documents([C_CODE])
c_docs

[Document(page_content='using System;'),
 Document(page_content='class Program\n{\n    static void Main()\n    {\n        int age = 30; // Change the age value as needed'),
 Document(page_content='// Categorize the age without any console output\n        if (age < 18)\n        {\n            // Age is under 18'),
 Document(page_content='}\n        else if (age >= 18 && age < 65)\n        {\n            // Age is an adult\n        }\n        else\n        {'),
 Document(page_content='// Age is a senior citizen\n        }\n    }\n}')]

[Project Map](#project-map)

---

---
## MarkdownHeaderTextSplitter

Many chat or Q+A applications involve chunking input documents prior to embedding and vector storage.

<br>

---

For example, these notes from Pinecone provide some useful tips:

As mentioned, chunking often aims to keep text with common context together. With this in mind, we might want to specifically honor the structure of the document itself. For example, a markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To address this challenge, we can use MarkdownHeaderTextSplitter. This will split a markdown file by a specified set of headers.
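The header-grouping idea can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the `MarkdownHeaderTextSplitter` implementation: the function `split_markdown_by_headers` is hypothetical, it only recognizes `#`-style headers at the start of a stripped line, and it returns plain dictionaries.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group content lines under the headers currently in effect."""
    # Check longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    active = {}   # header level -> (metadata key, header title)
    buffer = []   # content lines collected since the last header
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        buffer.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes headers at the same or a deeper level.
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return chunks


sample = "# Foo\n\n## Bar\n\nHi this is Jim\nHi this is Joe\n\n## Baz\n\nHi this is Molly"
sections = split_markdown_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")])
for section in sections:
    print(section)
```

Headers that are not in `headers_to_split_on` are treated as ordinary content in this sketch, so a deeper header simply stays inside the chunk of its parent.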

In [23]:
md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly' 

[("#", "Header 1"),("##", "Header 2")]

[('#', 'Header 1'), ('##', 'Header 2')]

And content is grouped or split by common headers:

In [24]:
{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

{'content': 'Hi this is Molly',
 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}

In [25]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

In [26]:
type(md_header_splits[0])

langchain_core.documents.base.Document

By default, ```MarkdownHeaderTextSplitter``` strips headers being split on from the output chunk’s content. This can be disabled by setting ```strip_headers = False```.

In [27]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(page_content='# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='### Boo  \nHi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='## Baz  \nHi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

Within each markdown group we can then apply any text splitter we want.

In [28]:
markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits

[Document(page_content='# Intro  \n## History  \nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='## Rise and divergence  \nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
 Document(page_content='#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane,

[Project Map](#project-map)

---

---
## Recursively split by character

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ```["\n\n", "\n", " ", ""]```. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

1. How the text is split: by list of characters.
2. How the chunk size is measured: by number of characters.
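The "try each separator in order until the chunks are small enough" behavior can be sketched as a short recursive function. This is a simplified illustration, not LangChain's code: `recursive_split` is a hypothetical name, it has no chunk overlap, and it measures size with `len`.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first usable separator, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, remaining = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, remaining, chunk_size)

    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: fall back to the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=7)
print(result)
```

The first paragraph fits and is kept whole, while the second is too long and is split on spaces, which mirrors the paragraph, then sentence, then word preference described above.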

<br>

---

In [29]:
# This is a long document we can split up.
with open("data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()
    
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and'
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.'


In [30]:
text_splitter.split_text(state_of_the_union)[:2]

['Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and',
 'of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.']

[Project Map](#project-map)

---

---
## Semantic Chunking

Splits the text based on semantic similarity.

Taken from Greg Kamradt’s wonderful notebook: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

All credit to him.

At a high level, this splits the text into sentences, groups them into groups of 3 sentences, and then merges the groups that are similar in the embedding space.
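
The following is a rough sketch of that idea, not the ```SemanticChunker``` algorithm itself. As assumptions for illustration, it replaces the embedding model with a toy bag-of-words vector, splits sentences with a naive regular expression, and uses a hypothetical fixed similarity threshold of 0.3. The names `embed`, `cosine_similarity`, and `semantic_chunks` are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, group_size=3, threshold=0.3):
    # 1. Split into sentences (naive: break after ., ! or ?).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # 2. Group consecutive sentences, `group_size` at a time.
    groups = [
        " ".join(sentences[i:i + group_size])
        for i in range(0, len(sentences), group_size)
    ]
    # 3. Merge a group into the previous chunk when the two are similar.
    chunks = []
    for group in groups:
        if chunks and cosine_similarity(embed(chunks[-1]), embed(group)) >= threshold:
            chunks[-1] += " " + group
        else:
            chunks.append(group)
    return chunks


text = (
    "Cats purr. Cats nap. Cats play. Cats hunt. Cats eat. Cats hide. "
    "Rain fell. Wind blew. Snow came."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With this toy input, the two groups about cats share vocabulary and are merged, while the weather sentences share no words with them and start a new chunk.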

<br>

---

In [31]:
!pip install --quiet langchain_experimental langchain_openai

In [32]:
# This is a long document we can split up.
with open("data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [33]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students t

[Project Map](#project-map)

---

---
## Split by tokens

Language models have a token limit, and you should not exceed it. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers, so when you count the tokens in your text, use the same tokenizer as the language model.
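
A small illustration of why the choice of tokenizer matters: the two toy tokenizers below are hypothetical stand-ins (neither is a real model tokenizer), yet they already disagree on how many tokens the same sentence contains.

```python
import re


def whitespace_tokens(text):
    """Stand-in tokenizer A: split on whitespace only."""
    return text.split()


def word_punct_tokens(text):
    """Stand-in tokenizer B: words and punctuation marks are separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def fits_limit(text, tokenizer, token_limit):
    """True if `text` is within `token_limit` tokens under `tokenizer`."""
    return len(tokenizer(text)) <= token_limit


sample = "Don't split mid-sentence, if you can."
print(len(whitespace_tokens(sample)))  # 6
print(len(word_punct_tokens(sample)))  # 12
print(fits_limit(sample, whitespace_tokens, 8))  # True
print(fits_limit(sample, word_punct_tokens, 8))  # False
```

A chunk that looks safe under one tokenizer can exceed the limit under another, which is why the count should come from the model's own tokenizer.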

<br>

---

### tiktoken

[tiktoken](https://github.com/openai/tiktoken) is a fast BPE tokenizer created by OpenAI.

We can use it to estimate tokens used. It will probably be more accurate for the OpenAI models.

1. How the text is split: by character passed in.
2. How the chunk size is measured: by tiktoken tokenizer.

In [34]:
%pip install --upgrade --quiet  tiktoken

Note: you may need to restart the kernel to use updated packages.


In [35]:
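from langchain.text_splitter import CharacterTextSplitter

# Measure chunk size in tiktoken tokens instead of characters.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)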

In [36]:
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.


Note that if we use ```CharacterTextSplitter.from_tiktoken_encoder```, the text is split only by ```CharacterTextSplitter```, and the tiktoken tokenizer is used only to merge the splits. This means that a split can be larger than the chunk size as measured by the tiktoken tokenizer. We can use ```RecursiveCharacterTextSplitter.from_tiktoken_encoder``` to make sure that no split is larger than the chunk size in tokens allowed by the language model: any split that is too large is recursively split again.
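
The difference can be sketched without any library. In the example below, `count_tokens` is a hypothetical whitespace-based stand-in for a real tokenizer, and `merge_splits`, `split_on_separator`, and `split_recursively` are illustrative names, not LangChain functions. The point is only that merging alone cannot shrink a piece that is already too large.

```python
def count_tokens(text):
    """Stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def merge_splits(splits, chunk_size, separator):
    """Greedily join neighbouring splits while the token count allows it."""
    chunks, current = [], []
    for piece in splits:
        candidate = separator.join(current + [piece])
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


def split_on_separator(text, chunk_size, separator="\n\n"):
    # Split on ONE separator only; the tokenizer is used just for merging.
    return merge_splits(text.split(separator), chunk_size, separator)


def split_recursively(text, chunk_size, separators=("\n\n", " ")):
    # Same as above, but oversized chunks are split again on a finer separator.
    result = []
    for chunk in split_on_separator(text, chunk_size, separators[0]):
        if count_tokens(chunk) > chunk_size and len(separators) > 1:
            result.extend(split_recursively(chunk, chunk_size, separators[1:]))
        else:
            result.append(chunk)
    return result


text = "one two\n\nthree four five six seven eight\n\nnine"
print(split_on_separator(text, chunk_size=4))
# ['one two', 'three four five six seven eight', 'nine']
print(split_recursively(text, chunk_size=4))
# ['one two', 'three four five six', 'seven eight', 'nine']
```

In the first result the middle chunk has six tokens even though the chunk size is four. The recursive variant brings every chunk back under the limit.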

We can also load a tiktoken splitter directly, which ensures each split is smaller than the chunk size.

In [37]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our
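
Splitting directly on tokens amounts to slicing the token sequence into fixed-size windows. The sketch below assumes whitespace-separated words as stand-in tokens (a real tokenizer such as tiktoken would encode the text to integer ids and decode each window back to text), and `token_windows` is a hypothetical helper name.

```python
def token_windows(tokens, chunk_size, chunk_overlap=0):
    """Slice `tokens` into windows of at most `chunk_size` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


words = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.".split()
for window in token_windows(words, chunk_size=5, chunk_overlap=1):
    print(" ".join(window))
# Madam Speaker, Madam Vice President,
# President, our First Lady and
# and Second Gentleman.
```

Because the cut positions depend only on token counts, no window can exceed the chunk size, but a window may end in the middle of a sentence.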


[Project Map](#project-map)

---

### spaCy

[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

Another alternative to NLTK is to use the [spaCy tokenizer](https://spacy.io/api/tokenizer).

1. How the text is split: by spaCy tokenizer.
2. How the chunk size is measured: by number of characters.
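
The two axes above can be sketched as follows: sentences are the unit of splitting, and characters are the unit of measurement. This is only an illustration. A naive regular expression stands in for the spaCy tokenizer, and `split_sentences` and `pack_sentences` are hypothetical names.

```python
import re


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: break after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, chunk_size, separator="\n\n"):
    """Pack whole sentences into chunks of at most `chunk_size` characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Madam Speaker. Members of Congress. My fellow Americans."
print(pack_sentences(text, chunk_size=40))
# ['Madam Speaker.\n\nMembers of Congress.', 'My fellow Americans.']
```

A single sentence longer than `chunk_size` would still come out as one oversized chunk in this sketch, since sentences are never cut.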

In [38]:
#%pip install --upgrade --quiet  spacy
# The following packages are needed to run SpacyTextSplitter
#!python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_sm

In [39]:
# This is a long document we can split up.
with open("data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [40]:
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.  



Last year COVID-19 kept us apart.

This year we are finally together again. 



Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans. 



With a duty to one another to the American people to the Constitution. 



And with an unwavering resolve that freedom will always triumph over tyranny. 



Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated. 



He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined. 



He met the Ukrainian people. 



From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.




[Project Map](#project-map)

---

### SentenceTransformers

The ```SentenceTransformersTokenTextSplitter``` is a specialized text splitter for use with sentence-transformer models.  
The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.

In [41]:
#!conda install -c conda-forge sentence-transformers

In [None]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

In [None]:
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

In [None]:
text_chunks = splitter.split_text(text=text_to_split)

print(text_chunks[1])

[Project Map](#project-map)

---

### NLTK

The [Natural Language Toolkit](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), or more commonly [NLTK](https://www.nltk.org/), is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.

Rather than just splitting on ```"\n\n"```, we can use NLTK to split based on NLTK tokenizers.

1. How the text is split: by NLTK tokenizer.
2. How the chunk size is measured: by number of characters.

In [43]:
#!pip install nltk



In [44]:
# This is a long document we can split up.
with open("data/state_of_the_union.txt") as f:
    state_of_the_union = f.read()

In [45]:
from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.

Members of Congress and the Cabinet.

Justices of the Supreme Court.

My fellow Americans.

Last year COVID-19 kept us apart.

This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents.

But most importantly as Americans.

With a duty to one another to the American people to the Constitution.

And with an unwavering resolve that freedom will always triumph over tyranny.

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.

But he badly miscalculated.

He thought he could roll into Ukraine and the world would roll over.

Instead he met a wall of strength he never imagined.

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies.


[Project Map](#project-map)

---

### Hugging Face tokenizer

[Hugging Face](https://huggingface.co/docs/tokenizers/index) has many tokenizers.

We use a Hugging Face tokenizer, [GPT2TokenizerFast](https://huggingface.co/Ransaka/gpt2-tokenizer-fast), to count the text length in tokens.

1. How the text is split: by character passed in.
2. How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.

In [47]:
#!pip install transformers


In [48]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


In [49]:
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.


[Project Map](#project-map)

---