# Levels Of Text Splitting

In this tutorial we are reviewing the 4 Levels Of Text Splitting. This is an unofficial list put together for fun and educational purposes.

Ever tried to put a long piece of text into ChatGPT, only to be told it's too long? Or tried to give your application better long-term memory, but it's still just not quite working?

One of the most effective strategies to improve the performance of your language model applications is to split your large data into smaller pieces. This is called splitting or chunking (we'll use these terms interchangeably). In the world of multi-modal, splitting also applies to images.

We are going to cover a lot, but if you make it to the end, I guarantee you’ll have a solid grasp on chunking theory, strategies, and resources to learn more.

**Levels Of Text Splitting**
* **Level 1: [Character Splitting](#CharacterSplitting)** - Simple static character chunks of data
* **Level 2: [Recursive Character Text Splitting](#RecursiveCharacterSplitting)** - Recursive chunking based on a list of separators
* **Level 3: [Document Specific Splitting](#DocumentSpecific)** - Various chunking methods for different document types (PDF, Python, Markdown)
* **Level 4: [Semantic Splitting](#SemanticChunking)** - Embedding walk based chunking

**Notebook resources:**
* [Video Overview]() - Walkthrough of this code with commentary
* [ChunkViz.com](https://www.chunkviz.com/) - Visual representation of chunk splitting methods
* [RAGAS](https://github.com/explodinggradients/ragas) - Retrieval evaluation framework

**Evaluations**

It's important to test your chunking strategies in retrieval evals. It doesn't matter how you chunk if the performance of your application isn't great.

Eval Frameworks:

* [LangChain Evals](https://python.langchain.com/docs/guides/evaluation/)
* [Llama Index Evals](https://docs.llamaindex.ai/en/stable/module_guides/evaluating/root.html)
* [RAGAS Evals](https://github.com/explodinggradients/ragas)
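
As a rough illustration of what a retrieval eval measures, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `character_chunks`, `retrieve`, and `hit_rate` helpers, the word-overlap scoring, and the toy corpus and questions are hypothetical and are not part of LangChain, Llama Index, or RAGAS. A real eval would use your actual retriever and a labeled question set.

```python
def character_chunks(text: str, chunk_size: int) -> list[str]:
    """Fixed-size character chunks, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Toy retriever (assumption): rank chunks by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(chunks: list[str], eval_set: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of questions whose expected answer appears in a retrieved chunk."""
    hits = sum(
        any(answer in chunk for chunk in retrieve(question, chunks, k))
        for question, answer in eval_set
    )
    return hits / len(eval_set)


# Toy data, made up for this sketch
corpus = "The cat sat on the mat. The dog slept in the sun all day long."
eval_set = [
    ("where did the cat sit", "on the mat"),
    ("where did the dog sleep", "in the sun"),
]

# Compare a few chunk sizes on the same questions
for size in (10, 30, 80):
    print(size, hit_rate(character_chunks(corpus, size), eval_set))
```

The point is the loop at the end: the chunking strategy is the variable, and the score on a fixed question set is what you compare.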


**The Chunking Commandment:** Your goal is not to chunk for chunking's sake. Your goal is to get your data into a format where it can be retrieved for value later.

## Level 1: Character Splitting <a id="CharacterSplitting"></a>
Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for any applications - but it's a great starting point for us to understand the basics.

* **Pros:** Easy & Simple
* **Cons:** Very rigid and doesn't take into account the structure of your text

Concepts to know:
* **Chunk Size** - The number of characters you would like in your chunks. 50, 100, 100,000, etc.
* **Chunk Overlap** - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.

In [2]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

Then let's split this text manually

In [3]:
# Create a list that will hold your chunks
chunks = []

chunk_size = 35 # Characters

# Step through the text from 0 to its length, advancing chunk_size characters each time
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
chunks

['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']

When working with text in the language model world, we don't deal with raw strings. It is more common to work with documents. Documents are objects that hold the text you're concerned with, but also additional metadata which makes filtering and manipulation easier later.
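
To make the idea concrete, here is a minimal sketch of a document object built with a plain dataclass. `SimpleDocument`, the `source` value `example.txt`, and the `chunk_index` key are hypothetical names chosen for this illustration; this is not LangChain's `Document` class, only the same text-plus-metadata shape.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


raw_chunks = [
    "This is the text I would like to ch",
    "unk up. It is the example text for ",
    "this exercise",
]

# Attach a source name and position to each chunk so we can filter on them later
docs = [
    SimpleDocument(page_content=chunk, metadata={"source": "example.txt", "chunk_index": i})
    for i, chunk in enumerate(raw_chunks)
]

# Metadata makes filtering easy, e.g. keep everything after the first chunk
later_docs = [doc for doc in docs if doc.metadata["chunk_index"] > 0]
print(later_docs)
```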

We could convert our list of strings into documents, but I'd rather start from scratch and create the docs.

Let's load up LangChain's `CharacterTextSplitter` to do this for us

In [4]:
from langchain.text_splitter import CharacterTextSplitter

Then let's load up this text splitter. I need to specify `chunk_overlap` and `separator` or else we'll get funky results. We'll get into those next

In [5]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)

Then we can actually split our text via `create_documents`. Note: `create_documents` expects a list of texts, so if you just have a string (like we do) you'll need to wrap it in `[]`

In [6]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='unk up. It is the example text for '),
 Document(metadata={}, page_content='this exercise')]

Notice how this time we have the same chunks, but they are in documents. These will play nicely with the rest of the LangChain world. Also notice how the trailing whitespace on the end of the 2nd chunk is still there. By default LangChain removes it (see [this line](https://github.com/langchain-ai/langchain/blob/f36ef0739dbb548cabdb4453e6819fc3d826414f/libs/langchain/langchain/text_splitter.py#L167) for where they do it); we avoided that by passing `strip_whitespace=False`.

**Chunk Overlap & Separators**

**Chunk overlap** will blend together our chunks so that the tail of Chunk #1 will be the same thing as the head of Chunk #2 and so on and so forth.

This time I'll load up my overlap with a value of 4, this means 4 characters of overlap

In [7]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=4, separator='')

In [8]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='o chunk up. It is the example text'),
 Document(metadata={}, page_content='ext for this exercise')]

Notice how we have the same chunks, but now there is overlap between 1 & 2 and 2 & 3. The 'o ch' on the tail of Chunk #1 matches the 'o ch' of the head of Chunk #2.
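
The overlap idea can be sketched in plain Python as a sliding window that advances by `chunk_size - chunk_overlap` characters each step. `split_with_overlap` is a hypothetical helper written for this illustration; LangChain's splitter merges pieces differently and strips whitespace by default, so treat this as the idea rather than its implementation.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character chunks where each chunk repeats the last
    `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunks = split_with_overlap(text, chunk_size=35, chunk_overlap=4)

# Show the shared characters between each pair of neighbouring chunks
for left, right in zip(chunks, chunks[1:]):
    print(repr(left[-4:]), "->", repr(right[:4]))
```

Because the window only moves forward by 31 characters, every boundary is covered twice, which is exactly the duplicate data mentioned earlier.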

I wanted a better way to visualize this, so I made [ChunkViz.com](https://www.chunkviz.com/) to help show it. Here's what the same text looks like.

**Separators** are the character sequences you would like to split on. Say you wanted to chunk your data at `ch`; you can specify it.

In [9]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='ch')

In [10]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to'),
 Document(metadata={}, page_content='unk up. It is the example text for this exercise')]
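
Notice that the second chunk is 48 characters long, well past our chunk size of 35. With a single custom separator there is nothing finer to split on, so a piece that is too long stays as it is. A rough plain-Python sketch of the same behaviour (an approximation using `str.split`, not LangChain's code):

```python
text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunk_size = 35

# Split on the separator, then trim surrounding whitespace from each piece
pieces = [piece.strip() for piece in text.split("ch")]

for piece in pieces:
    flag = "over chunk_size" if len(piece) > chunk_size else "ok"
    print(len(piece), flag, repr(piece))
```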

#### Llama Index

[Llama Index](https://www.llamaindex.ai/) is a great choice for flexibility in the chunking and indexing process. They provide node relationships out of the box which can aid in retrieval later.

Let's take a look at their sentence splitter. It is similar to the character splitter, but using its default settings, it'll split on sentences instead.

In [4]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

Load up your splitter

In [5]:
splitter = SentenceSplitter(
    chunk_size=200,
    chunk_overlap=15,
)

Load up your document

In [13]:
documents = SimpleDirectoryReader(
    input_files=["data/popular.txt"]
).load_data()



Create your nodes. Nodes are similar to documents but with more relationship data added to them.

In [14]:
nodes = splitter.get_nodes_from_documents(documents)

Then let's take a look at one

In [15]:
nodes[0]

TextNode(id_='a7c407b0-544c-437e-b087-8bffbd3c2acb', embedding=None, metadata={'file_path': 'data/popular.txt', 'file_name': 'popular.txt', 'file_type': 'text/plain', 'file_size': 43295, 'creation_date': '2025-05-18', 'last_modified_date': '2025-05-18'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4d06bc0d-db85-4c68-8f04-f88226cef9ba', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data/popular.txt', 'file_name': 'popular.txt', 'file_type': 'text/plain', 'file_size': 43295, 'creation_date': '2025-05-18', 'last_modified_date': '2025-05-18'}, hash='183fcc14103b24683d228fa3581151119ed0bbce9b806b38cfbd4c98944b88cd'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='75c7fabd-15c5-42ee-9b

As you can see there is a lot more relationship data held within Llama Index's nodes. We'll talk about those later, I don't want to get ahead of ourselves
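
To give a feel for why relationship data helps, here is a small sketch with a hypothetical `SimpleNode` dataclass that stores links to the previous and next chunk. The class, its field names, and `build_nodes` are assumptions for illustration only; Llama Index's real `TextNode` is richer and structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleNode:
    """Hypothetical node: a chunk of text plus links to its neighbours."""
    node_id: int
    text: str
    source: str
    prev_id: Optional[int] = None
    next_id: Optional[int] = None


def build_nodes(chunks: list[str], source: str) -> list[SimpleNode]:
    """Wrap chunks in nodes and link each one to the chunk before and after it."""
    nodes = [SimpleNode(node_id=i, text=chunk, source=source) for i, chunk in enumerate(chunks)]
    for i, node in enumerate(nodes):
        node.prev_id = i - 1 if i > 0 else None
        node.next_id = i + 1 if i < len(nodes) - 1 else None
    return nodes


nodes = build_nodes(
    ["First sentence.", "Second sentence.", "Third sentence."],
    source="data/popular.txt",
)

# After retrieving the middle node, pull in its neighbours for extra context
hit = nodes[1]
neighbour_ids = (hit.prev_id, hit.node_id, hit.next_id)
context = " ".join(nodes[i].text for i in neighbour_ids if i is not None)
print(context)
```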

Basic Character splitting is likely only useful for a few applications, maybe yours!

## Level 2: Recursive Character Text Splitting
<a id="RecursiveCharacterSplitting"></a>
Let's jump a level of complexity.

The problem with Level #1 is that we don't take into account the structure of our document at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we'll specify a series of separators which will be used to split our docs.

You can see the default separators for LangChain [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L842). Let's take a look at them one by one.

* "\n\n" - Double new line, or most commonly paragraph breaks
* "\n" - New lines
* " " - Spaces
* "" - Characters

I'm not sure why a period (".") isn't included on the list, perhaps it is not universal enough? If you know, let me know.

This is the swiss army knife of splitters and my first choice when mocking up a quick application. If you don't know which splitter to start with, this is a good first bet.

Let's try it out

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

Then let's load up a larger piece of text

In [17]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

Now let's make our text splitter

In [18]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

In [19]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content="One of the most important things I didn't understand about the"),
 Document(metadata={}, page_content='world when I was a child is the degree to which the returns for'),
 Document(metadata={}, page_content='performance are superlinear.'),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(metadata={}, page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(metadata={}, page_content='meant well, but this is rarely true. If your product is only'),
 Document(metadata={}, page_content="half as good as your competitor's, you don't get half as many"),
 Document(metadata={}, page_content='customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are"),
 Document(metadata={}, page_content='superlinear in business. Some think this is a flaw of'),
 Document(metadata=

Notice how now there are more chunks that end with a period ".". This is because those likely are the end of a paragraph and the splitter first looks for double new lines (paragraph break).

Once paragraphs are split, it looks at the chunk size. If a chunk is too big, it'll split by the next separator. If the chunk is still too big, it'll move on to the next one, and so forth.
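
The fallback logic can be sketched in a few lines of plain Python. This is a simplified illustration written for this tutorial, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops the separators at chunk boundaries.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters, trying
    the coarsest separator first and falling back to finer ones."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []

    # Out of separators (or reached ""): cut at fixed character positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits, keep merging
            continue
        if current:
            chunks.append(current)       # flush what we have so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # piece alone is too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph here.\n\nSecond paragraph is somewhat longer than the first one."
for chunk in recursive_split(sample, chunk_size=30):
    print(len(chunk), repr(chunk))
```

The short first paragraph survives intact because it fits after the paragraph-level split, while the longer second paragraph has to drop down to the space separator before its pieces fit.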

For text of this size, let's split on something bigger.

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(metadata={}, page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, th

## Level 3: Document Specific Splitting <a id="DocumentSpecific"></a>

Stepping up our levels ladder, let's start to handle document types other than normal prose in a .txt. What if you have pictures? or a PDF? or code snippets?

Our first two levels wouldn't work great for this so we'll need to find a different tactic.

This level is all about making your chunking strategy fit your different data formats. Let's run through a bunch of examples of this in action

The Markdown, Python, and JS splitters will basically be similar to Recursive Character, but with different separators.

See all of LangChain's document splitters [here](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter) and Llama Index ([HTML](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#htmlnodeparser), [JSON](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#jsonnodeparser), [Markdown](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules.html#markdownnodeparser))

### Markdown

You can see the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1175).

Separators:
* `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
* ```` ```\n ```` - Code blocks
* `\n\\*\\*\\*+\n` - Horizontal Lines
* `\n---+\n` - Horizontal Lines
* `\n___+\n` - Horizontal Lines
* `\n\n` - Double new lines
* `\n` - New line
* `" "` - Spaces
* `""` - Character

In [21]:
from langchain.text_splitter import MarkdownTextSplitter

In [22]:
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

In [23]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [24]:
splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Fun in California\n\n## Driving'),
 Document(metadata={}, page_content='Try driving on the 1 down to San Diego'),
 Document(metadata={}, page_content='### Food'),
 Document(metadata={}, page_content="Make sure to eat a burrito while you're"),
 Document(metadata={}, page_content='there'),
 Document(metadata={}, page_content='## Hiking\n\nGo to Yosemite')]

Notice how the splits gravitate towards markdown sections. However, it's still not perfect. Check out how there is a chunk with just "there" in it. You'll run into this at low-sized chunks.
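
If you care more about keeping sections whole than about hitting a chunk size, one alternative is to split only at header lines. The sketch below uses Python's `re` module for that. It is a different, simpler technique than `MarkdownTextSplitter`'s separator fallback, `split_by_headers` is a hypothetical helper, and it would be fooled by `#` lines inside fenced code blocks.

```python
import re

markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

# A header line starts with 1-6 '#' characters followed by a space.
# The lookahead makes the match zero-width, so the header stays in its section.
HEADER_START = re.compile(r"^(?=#{1,6} )", flags=re.MULTILINE)


def split_by_headers(markdown: str) -> list[str]:
    """Split just before every header line so each section keeps its header."""
    return [part.strip() for part in HEADER_START.split(markdown) if part.strip()]


sections = split_by_headers(markdown_text)
for section in sections:
    print(repr(section))
```

Here "there" stays attached to its sentence, at the cost of chunks whose size you no longer control.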

### Python

See the Python separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L1069)

* `\nclass` - Classes first
* `\ndef` - Functions next
* `\n\tdef` - Indented functions
* `\n\n` - Double New lines
* `\n` - New Lines
* `" "` - Spaces
* `""` - Characters


Let's load up our splitter

In [25]:
from langchain.text_splitter import PythonCodeTextSplitter

In [26]:
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [27]:
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)

In [28]:
python_splitter.create_documents([python_text])

[Document(metadata={}, page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
 Document(metadata={}, page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)')]

Check out how the class stays together in a single document (good), then the rest of the code is in a second document (ok).

I needed to play with the chunk size to get a clean result like that. You'll likely need to do the same for yours which is why using evaluations to determine optimal chunk sizes is crucial.
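
If tuning the chunk size gets fiddly, another option for Python source is to let the parser find the boundaries. The sketch below uses the standard library's `ast` module to return one chunk per top-level statement. This is a swapped-in technique, not what `PythonCodeTextSplitter` does, and `split_top_level` is a hypothetical helper; it also requires the code to be syntactically valid.

```python
import ast

python_text = '''
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
'''


def split_top_level(source: str) -> list[str]:
    """Return the source text of each top-level statement, in order."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


segments = split_top_level(python_text)
for segment in segments:
    print(repr(segment))
```

The class, the assignment, and the loop each come back as their own chunk, with no chunk size involved at all.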

### JS

Very similar to Python. See the separators [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L983).

Separators:
* `\nfunction` - Indicates the beginning of a function declaration
* `\nconst` - Used for declaring constant variables
* `\nlet` - Used for declaring block-scoped variables
* `\nvar` - Used for declaring a variable
* `\nclass` - Indicates the start of a class definition
* `\nif` - Indicates the beginning of an if statement
* `\nfor` - Used for for-loops
* `\nwhile` - Used for while-loops
* `\nswitch` - Used for switch statements
* `\ncase` - Used within switch statements
* `\ndefault` - Also used within switch statements
* `\n\n` - Indicates a larger separation in text or code
* `\n` - Separates lines of code or text
* `" "` - Separates words or tokens in the code
* `""` - Makes every character a separate element

In [29]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

In [30]:
javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

In [31]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)

In [32]:
js_splitter.create_documents([javascript_text])

[Document(metadata={}, page_content='// Function is called, the return value will end up in x'),
 Document(metadata={}, page_content='let x = myFunction(4, 3);'),
 Document(metadata={}, page_content='function myFunction(a, b) {'),
 Document(metadata={}, page_content='// Function returns the product of a and b\n  return a * b;\n}')]

### PDFs w/ tables

Ok now things will get a bit spicier.

PDFs are an extremely common data type for language model work. Often they'll contain tables that contain information.

This could be financial data, studies, academic papers, etc.

Trying to split tables by a character based separator isn't reliable. We need to try out a different method. For a deep dive on this I recommend checking out [Lance Martin's](https://twitter.com/RLanceMartin) [tutorial](https://twitter.com/RLanceMartin/status/1721942636364456336) w/ LangChain.

I'll be going through a text-based method. [Mayo](https://twitter.com/mayowaoshin) has also outlined a GPT-4V method which tries to pull tables via vision rather than text. You can check it out [here](https://twitter.com/mayowaoshin/status/1727399231734886633).

A very convenient way to do this is with [Unstructured](https://unstructured.io/), a library dedicated to making your data LLM ready.

In [1]:
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

Let's load up our PDF and then partition it. The file used here is a note on the Greek economy dated February 14, 2025.

In [4]:
filename = "data/Note_on_the_Greek_economy_14_February_2025.pdf"

# Extracts the elements from the PDF
elements = partition_pdf(
    filename=filename,

    # Unstructured Helpers
    strategy="hi_res", 
    infer_table_structure=True, 
    model_name="yolox"
)

Let's look at our elements

These are just unstructured objects, we could look at them all but I want to look at the table it parsed.

In [16]:
elements[20].text

'Employment increased by 0.8% y-o-y in December 2024, while the monthly unemployment rate (sa) decreased to 9.4%.'

In [49]:
from unstructured.documents.elements import Table, NarrativeText, Title

# Option 1: Safely extract tables
tables = [el for el in elements if isinstance(el, Table)]

# Option 2: All text-based elements
text_elements = [el for el in elements if hasattr(el, 'text') and isinstance(el.text, str)]

# Option 3: Print types to explore what you're working with
for el in elements[:10]:
    print(type(el), getattr(el, "text", "")[:60])


<class 'unstructured.documents.elements.Image'>   
<class 'unstructured.documents.elements.Image'> 
<class 'unstructured.documents.elements.Title'> NOTE ON THE GREEK ECONOMY February 14, 2025
<class 'unstructured.documents.elements.NarrativeText'> Economic Analysis and Research Department
<class 'unstructured.documents.elements.NarrativeText'> Recent Economic Developments: an overview
<class 'unstructured.documents.elements.NarrativeText'> Economic activity continued to expand at a satisfactory pace
<class 'unstructured.documents.elements.NarrativeText'> Looking ahead, according to the latest BoG projections, grow
<class 'unstructured.documents.elements.Text'> ------------------------------------------------------------
<class 'unstructured.documents.elements.Title'> Latest economic information - available in the last three we
<class 'unstructured.documents.elements.Title'> Economic Activity


In [50]:
for i, table in enumerate(tables[:10]):
    print(f"\n--- Table {i+1} ---")
    print(table.text[:1000])  # Show first 1000 characters of the table



--- Table 1 ---
SUMMARY OF ECONOMIC DEVELOPMENTS AND OUTLOOK..............ccccccssesseeeeeeteeerees 4 BACKGROUND INFORMATION .............cccccccsscccsceseseceeseesssesecsessecesecseceeseseeseeeessesseeeseseseeseesags 8 1. ECONOMIC ACTIVITY 0.00... ccc cccccccecsseccessesseeececseseessseesessessesessesseseseseeseaseseesseseseesessesesens 8 2. PRICES AND REAL ESTATE MARKET ...............c:ccccccesssscssescsscssescsesseseseessesessecsesesseseesesseseesasegs 13 3. LABOUR MARKET AND COSTS ...........cccccceccescsseceessecsceececeesecesecseseessecesecsnsesseseeseseeseesanegs 17 4. EXTERNAL BALANCES, COMPETITIVENESS. ...............0:ccsccccsscssessessesseseseeseseeseessseseeseesenee 21 5. FISCAL DEVELOPMENTS..............c:cccccccsscssecsssseesceessesscsessscscsessesecseseessesessessssesseseeseseeseesaaees 27 6. MONEY AND CREDIT ...........:.ccccecsccccccecsecesscsesssesceessesscsecsecsesessesecseseessssessesssseseseeseaseseeseases 32 7. FINANCIAL MARKET DEVELOPMENT G.............cccccccc

In [33]:
from IPython.display import Image as IPyImage, display
from unstructured.documents.elements import Image as UnstructuredImage, NarrativeText

for i, el in enumerate(elements[:10]):
    print(f"--- {type(el).__name__} ---")

    if hasattr(el, "text"):
        print(el.text[:300])
    elif isinstance(el, UnstructuredImage) and el.image_path:
        print(f"[Displaying image {i}]")
        display(IPyImage(filename=el.image_path))
        # Optionally: look ahead for possible caption
        if i + 1 < len(elements) and isinstance(elements[i + 1], NarrativeText):
            print("Caption (approx.):", elements[i + 1].text[:200])
    else:
        print("[No text content]")

    print()


--- Image ---
  

--- Image ---


--- Title ---
NOTE ON THE GREEK ECONOMY February 14, 2025

--- NarrativeText ---
Economic Analysis and Research Department

--- NarrativeText ---
Recent Economic Developments: an overview

--- NarrativeText ---
Economic activity continued to expand at a satisfactory pace in 2024:Q3 (2.4% y-o-y), outperforming the euro area average. HICP inflation came down fast from its 2022 peak due to falling energy prices in 2023, but it remained relatively elevated at 3.0% in 2024 due to persistent services inflation. 

--- NarrativeText ---
Looking ahead, according to the latest BoG projections, growth is expected to peak in 2025 and converge thereafter towards potential growth. Growth will be mainly driven by private consumption and investment supported by available European resources. Inflation is expected to further decelerate in 20

--- Text ---
--------------------------------------------------------------------------------------------------------------------

In [36]:
#!pip3 install "unstructured[all-docs]"
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

In [None]:
# Get elements
raw_pdf_elements = partition_pdf(
    # Path to the PDF, defined in the earlier cell
    filename=filename,
    
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    max_characters=4000,              # Hard max on chunk size
    new_after_n_chars=3800,           # Attempt to start a new chunk after 3800 chars
    combine_text_under_n_chars=2000,  # Attempt to keep chunks > 2000 chars
    image_output_dir_path="static/pdfImages/",
)

Head over to `static/pdfImages/` to check out the images that were parsed.

But the images don't do anything sitting in a folder, we need to do something with them! Though a bit outside the scope of chunking, let's talk about how to work with these.

The common tactics will either use a multi-modal model to generate summaries of the images or use the image itself for your task. Others get embeddings of images (like [CLIP](https://openai.com/research/clip)).

Let's generate summaries so you'll be inspired to take this to the next step. We'll use GPT-4V. Check out other models [here](https://platform.openai.com/docs/model).

## Level 4: Semantic Chunking <a id="SemanticChunking"></a>

- **Definition**: Splitting a document into chunks based on meaning or topic rather than length or layout.

- **Goal**: Make each chunk a self-contained idea or section so it's more useful for search and question-answering.

- **How it's different**:
  - Semantic: "Split here because this section starts talking about inflation."
  - Non semantic: "Split every 500 characters, no matter what it says."

- **Methods used**:
  - Text embeddings + clustering
  - Topic modeling (e.g., LDA)
  - Sentence similarity
  - Section headers or title detection

- **Where it's used**:
  - RAG pipelines (LLMs retrieve meaningful chunks)
  - Document summarization
  - Intelligent document search

- **Benefits**:
  - Better answers from LLMs
  - Reduces irrelevant or partial content
  - Improves retrieval precision
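
The sentence similarity method from the list above can be sketched as follows: embed each sentence, compare every sentence with the one before it, and start a new chunk wherever the similarity drops. To keep the sketch self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model, and the `0.2` threshold, the helper names, and the sample text are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; break where neighbouring sentences stop being similar."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < threshold:
            groups.append([current])      # topic shift: start a new chunk
        else:
            groups[-1].append(current)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


# Made-up sample text with one obvious topic change
sample = (
    "Inflation fell last year. Inflation fell because energy prices dropped. "
    "Tourists love the islands. The islands attract many tourists."
)
for chunk in semantic_chunks(sample):
    print(repr(chunk))
```

With real embeddings the structure stays the same; only `embed` changes, and the threshold needs tuning against your own data.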


In [None]:
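from unstructured.documents.elements import NarrativeText, Title

# Step 1: Input is a list of elements (e.g., from unstructured or a PDF parser).
# We reuse the `elements` list returned by partition_pdf(...) in the PDF section above.

# Helper predicates. Using unstructured's element types here is an assumption;
# swap in whatever rule marks a section boundary in your data.
def is_title(el):
    return isinstance(el, Title)

def is_text(el):
    return isinstance(el, NarrativeText)

# Step 2: Walk the elements, starting a new chunk at every title
chunks = []
current_chunk = {
    "title": "Untitled",
    "content": []
}

for el in elements:
    if is_title(el):
        # Save the previous chunk if it has content
        if current_chunk["content"]:
            chunks.append(current_chunk)
        
        # Start a new chunk
        current_chunk = {
            "title": el.text.strip(),
            "content": []
        }
    elif is_text(el):
        current_chunk["content"].append(el.text.strip())

# Step 3: Save the last chunk
if current_chunk["content"]:
    chunks.append(current_chunk)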

In [None]:
[
  {
    "title": "Executive Summary",
    "content": [
      "The Greek economy grew by 2.4% in Q3...",
      "Inflation fell from 9.2% to 3.1%..."
    ]
  },
  {
    "title": "Macroeconomic Outlook",
    "content": [
      "Growth is expected to slow in 2025...",
      "Private investment remains strong..."
    ]
  }
]