# Chunking
Video: https://www.youtube.com/watch?v=8OJC21T2SL4  
Chunk visualizer: https://chunkviz.up.railway.app/  
Multi-level chunking strategies:  
- Code repository: https://github.com/FullStackRetrieval-com/RetrievalTutorials
- Tutorial site: https://community.fullstackretrieval.com/document-loaders/text-splitting


## Level 1: Character Splitting

In [1]:
# Sample text
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [2]:
chunks = []

# Number of characters per chunk
chunk_size = 35

# Slice the text into pieces of chunk_size characters
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
for doc in chunks:
    print(f"len = {len(doc)}, content = '{doc}'")

len = 35, content = 'This is the text I would like to ch'
len = 35, content = 'unk up. It is the example text for '
len = 13, content = 'this exercise'


In [3]:
# Character splitting with LangChain
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator='',
    strip_whitespace=False,
)

split_docs = text_splitter.create_documents([text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 35, content = 'This is the text I would like to ch'
len = 35, content = 'unk up. It is the example text for '
len = 13, content = 'this exercise'


In [4]:
# Character splitting with overlapping characters between chunks
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=4,  # diff
    separator='',
    strip_whitespace=False,
)

split_docs = text_splitter.create_documents([text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 35, content = 'This is the text I would like to ch'
len = 35, content = 'o chunk up. It is the example text '
len = 21, content = 'ext for this exercise'
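
The overlap can also be reproduced without LangChain. The following is a minimal sketch, assuming the overlap simply means that each window starts `chunk_size - chunk_overlap` characters after the previous one; `split_with_overlap` is a hypothetical helper name, not part of any library.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character chunks; each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
for chunk in split_with_overlap(text, chunk_size=35, chunk_overlap=4):
    print(f"len = {len(chunk)}, content = '{chunk}'")
```

On this sample text the sketch yields the same three chunks as the LangChain cell above; the early `break` avoids emitting a final chunk that is already fully covered by the previous one.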


In [5]:
# Split on a different separator
text_splitter = CharacterTextSplitter(
    chunk_size=35,
    chunk_overlap=0,
    separator='ch',  # diff
    strip_whitespace=False,
)

split_docs = text_splitter.create_documents([text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 33, content = 'This is the text I would like to '
len = 48, content = 'unk up. It is the example text for this exercise'
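
Note that the second chunk is 48 characters long even though `chunk_size=35`. A plausible reading is that the text is first split on the separator and the pieces are then merged greedily up to `chunk_size`, while a single piece that is already too long is kept whole. The sketch below illustrates that reading in plain Python; `split_on_separator` is a hypothetical helper, and the merge rule is an assumption rather than LangChain's actual code.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep growing the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # kept whole, even when longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
for chunk in split_on_separator(text, separator="ch", chunk_size=35):
    print(f"len = {len(chunk)}, content = '{chunk}'")
```

Because the separator is removed at chunk boundaries, the `ch` of "chunk" does not appear in either chunk.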


## Level 2: Recursive Character Text Splitting

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=65,
    chunk_overlap=0
)
split_docs = text_splitter.create_documents([text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 62, content = 'One of the most important things I didn't understand about the'
len = 63, content = 'world when I was a child is the degree to which the returns for'
len = 28, content = 'performance are superlinear.'
len = 64, content = 'Teachers and coaches implicitly told us the returns were linear.'
len = 64, content = '"You get out," I heard a thousand times, "what you put in." They'
len = 60, content = 'meant well, but this is rarely true. If your product is only'
len = 61, content = 'half as good as your competitor's, you don't get half as many'
len = 60, content = 'customers. You get no customers, and you go out of business.'
len = 56, content = 'It's obviously true that the returns for performance are'
len = 53, content = 'superlinear in business. Some think this is a flaw of'
len = 64, content = 'capitalism, and that if we changed the rules it would stop being'
len = 62, content = 'true. But superlinear returns for performance are a feature of'
len = 62, content = 'the wo

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0
)
split_docs = text_splitter.create_documents([text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 155, content = 'One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.'
len = 313, content = 'Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.'
len = 433, content = 'It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]'
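
The idea behind recursive splitting can be sketched in plain Python: try the coarsest separator first (paragraphs), merge pieces while they fit, and fall back to finer separators (lines, words, single characters) only for pieces that are still too long. This is a simplified illustration under assumptions, not LangChain's implementation; `recursive_split` and its default separator list are chosen here for demonstration, and surrounding whitespace is stripped.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the first separator; recurse with finer ones for oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece fits into the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."
for chunk in recursive_split(sample, chunk_size=30):
    print(f"len = {len(chunk)}, content = '{chunk}'")
```

With `chunk_size=30`, the first paragraph fits as a whole, while the second is too long and gets broken at word boundaries instead of mid-word.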


## Level 3: Document Specific Splitting

### Markdown

In [17]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [24]:
splitter = MarkdownTextSplitter(
    chunk_size=40,
    chunk_overlap=0
)
split_docs = splitter.create_documents([markdown_text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 31, content = '# Fun in California

## Driving'
len = 38, content = 'Try driving on the 1 down to San Diego'
len = 8, content = '### Food'
len = 39, content = 'Make sure to eat a burrito while you're'
len = 5, content = 'there'
len = 25, content = '## Hiking

Go to Yosemite'
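
For comparison, a heading-aware split can be approximated with a regular expression. This is only a sketch that cuts before every ATX heading (`#` through `######`) and ignores `chunk_size`; `split_by_headings` is a hypothetical helper, and it would misfire on `#` lines inside fenced code blocks.

```python
import re


def split_by_headings(markdown: str) -> list[str]:
    """Return one section per heading, each starting with its heading line."""
    # Split at every newline that is immediately followed by an ATX heading
    parts = re.split(r"\n(?=#{1,6} )", markdown.strip())
    return [part.strip() for part in parts if part.strip()]


markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

for section in split_by_headings(markdown_text):
    print(repr(section))
```

Unlike the size-limited output above, each section here keeps its heading together with the text beneath it, which is often what a retrieval pipeline wants.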


### Python

In [22]:
from langchain.text_splitter import PythonCodeTextSplitter

python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [23]:
splitter = PythonCodeTextSplitter(
    chunk_size=100,
    chunk_overlap=0
)
split_docs = splitter.create_documents([python_text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 86, content = 'class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age'
len = 58, content = 'p1 = Person("John", 36)

for i in range(10):
    print (i)'
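
A different, structure-aware way to chunk Python source is to parse it and emit one chunk per top-level statement. The sketch below uses the standard-library `ast` module (Python 3.8+); it is an alternative technique, not what `PythonCodeTextSplitter` does, and `split_top_level` is a hypothetical helper name.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Return the source text of each top-level statement (class, function, assignment, ...)."""
    tree = ast.parse(source)
    # get_source_segment slices the original text using each node's line/column span;
    # comments between statements and decorators above a def are not included.
    return [ast.get_source_segment(source, node) for node in tree.body]


python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

for chunk in split_top_level(python_text):
    print(f"len = {len(chunk)}, content = '{chunk}'")
```

This never cuts a class or loop in half, but it also has no size limit, so a very large class would still need a second pass with a size-based splitter.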


### JS

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

In [21]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=65,
    chunk_overlap=0
)
split_docs = js_splitter.create_documents([javascript_text])
for doc in split_docs:
    print(f"len = {len(doc.page_content)}, content = '{doc.page_content}'")

len = 56, content = '// Function is called, the return value will end up in x'
len = 25, content = 'let x = myFunction(4, 3);'
len = 27, content = 'function myFunction(a, b) {'
len = 60, content = '// Function returns the product of a and b
  return a * b;
}'


### PDFs with tables

In [None]:
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

In [None]:
filename = "static/SalesforceFinancial.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    infer_table_structure=True,
    model_name="yolox"
)
elements

In [None]:
elements[-4].metadata.text_as_html

In [None]:
# Open in write mode; the default read mode would make f.write() fail
with open("./pdf_table_to_html.html", "w") as f:
    f.write(elements[-4].metadata.text_as_html)

### Multimodal (text + images)

In [None]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

In [None]:
filepath = "./static/VisualInstruction.pdf"

raw_pdf_elements = partition_pdf(
    filename=filepath,

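    # Use PDF mode to find embedded images
    extract_images_in_pdf=True,

    # Use a vision model (YOLOX) to detect the layout and get table bounding boxes and titles
    infer_table_structure=True,

    # Chunk by title
    chunking_strategy="by_title",

    # Maximum number of characters per text chunk
    max_characters=4000,
    # Try to start a new chunk after this many characters (a soft limit, not a forced break)
    new_after_n_chars=3800,
    # Merge text chunks shorter than this many characters
    combine_text_under_n_chars=2000,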
    image_output_dir_path="static/pdfImages/"
)

## Level 4: Semantic Splitting

## Level 5: Agentic Splitting

## *Bonus Level:* Alternative Representation Chunking + Indexing