<a href="https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_TextSplitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I will show you the main text splitters that the LangChain framework supports.

In [7]:
# !pip install -qU langchain tiktoken

In [8]:
long_text = '''
WASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.

The Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.

The charges stem from Trump's treatment of sensitive government materials he took with him when he left the White House in January 2021.

He is due to make a first court appearance in the case in a Miami court on Tuesday, a day before his 77th birthday.

The indictment of a former U.S. president on federal charges is unprecedented in American history and emerges at a time when Trump is the front-runner for the Republican presidential nomination next year.

Investigators seized roughly 13,000 documents from Trump's Mar-a-Lago estate in Palm Beach, Florida, nearly a year ago. One hundred were marked as classified, even though one of Trump's lawyers had previously said all records with classified markings had been returned to the government.
'''

# CharacterTextSplitter

In [9]:
from langchain.text_splitter import CharacterTextSplitter

In [10]:
text_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
    separator="\n",
)

documents = text_splitter.create_documents([long_text])
print(documents[0].page_content)
print(documents[1].page_content)

Created a chunk of size 283, which is longer than the specified 100
Created a chunk of size 163, which is longer than the specified 100
Created a chunk of size 136, which is longer than the specified 100
Created a chunk of size 115, which is longer than the specified 100
Created a chunk of size 204, which is longer than the specified 100


WASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.
The Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.
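
The warnings above appear because the text is only cut at the separator, so a paragraph longer than `chunk_size` stays in one piece. Below is a minimal pure-Python sketch of that behavior. It is my own simplification, not LangChain's actual implementation: the function name `split_on_separator` is hypothetical, and `chunk_overlap` is ignored for brevity.

```python
def split_on_separator(text, separator="\n", chunk_size=100):
    """Greedy sketch: cut only at `separator`, then merge small pieces."""
    # Drop empty pieces produced by consecutive separators.
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut further, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 150 + "\nanother short line"
chunks = split_on_separator(sample)
for chunk in chunks:
    print(len(chunk), "oversized" if len(chunk) > 100 else "ok")
# 10 ok
# 150 oversized
# 18 ok
```

If chunks must stay under the limit, a splitter that falls back to finer separators, such as the recursive one in the next section, is usually the better fit.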


# RecursiveCharacterTextSplitter

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
    add_start_index=True,
)

documents = text_splitter.create_documents([long_text])
print(documents[0])
print(documents[1])
print(len(documents[1].page_content))

page_content='WASHINGTON (Reuters) -Former U.S. President' metadata={'start_index': 1}
page_content='President Donald Trump faces 37 criminal counts' metadata={'start_index': 35}
47
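
To see how `start_index` and `chunk_overlap` relate, the two chunks printed above can be checked with plain string operations. This sketch uses only the opening fragment of `long_text` (including its leading newline), copied by hand. The variable names are mine.

```python
# Opening fragment of long_text, including the leading newline.
text = "\nWASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts"

# The two chunks printed by the splitter above.
chunk0 = "WASHINGTON (Reuters) -Former U.S. President"
chunk1 = "President Donald Trump faces 37 criminal counts"

start0 = text.find(chunk0)   # position of the first chunk in the text
start1 = text.find(chunk1)   # position of the second chunk in the text
end0 = start0 + len(chunk0)  # one past the last character of the first chunk

# Characters shared by both chunks.
overlap = text[start1:end0]

print(start0, start1)               # 1 35
print(repr(overlap), len(overlap))  # 'President' 9
print(len(chunk1))                  # 47
```

The shared text is 9 characters long, which fits within the configured `chunk_overlap` of 10, and both chunks stay under the `chunk_size` of 50.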


# TokenTextSplitter

In [13]:
from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(
    encoding_name="p50k_base",
    chunk_size=50,
    chunk_overlap=0,
)

In [14]:
documents = text_splitter.create_documents([long_text])
print(documents[0])

page_content='\nWASHINGTON (Reuters) -Former U.S. President Donald Trump faces 37 criminal counts including charges of unauthorized retention of classified documents and conspiracy to obstruct justice after leaving the White House in 2021, according to federal court documents made public on Friday.\n' metadata={}


In [15]:
print(documents[1])

page_content="\nThe Justice Department made the charging documents public on a tumultuous day in which two of Trump's lawyers quit the case and a former aide face charges as well.\n\nThe charges stem from Trump's treatment of sensitive government materials he took with him when" metadata={}


In [17]:
import tiktoken
encoding_for_davinci = tiktoken.encoding_for_model("text-davinci-002")

print(len(encoding_for_davinci.encode(documents[0].page_content)))
print(len(encoding_for_davinci.encode(documents[1].page_content)))
print(len(encoding_for_davinci.encode(documents[2].page_content)))

50
50
50


In [18]:
print(encoding_for_davinci.encode(documents[0].page_content))
print(len(encoding_for_davinci.encode(documents[0].page_content)))

[198, 21793, 357, 12637, 8, 532, 14282, 471, 13, 50, 13, 1992, 3759, 1301, 6698, 5214, 4301, 9853, 1390, 4530, 286, 22959, 21545, 286, 10090, 4963, 290, 10086, 284, 26520, 5316, 706, 4305, 262, 2635, 2097, 287, 33448, 11, 1864, 284, 2717, 2184, 4963, 925, 1171, 319, 3217, 13, 198]
50
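
The counts above suggest that a token splitter works on the encoded token IDs, not on characters: it slides a fixed-size window over the ID list and decodes each window back to text. Here is a minimal sketch of that windowing step. It is a simplification under my own assumptions, not LangChain's code: `window_tokens` is a hypothetical helper, and plain integers stand in for real token IDs so that the example runs without `tiktoken`.

```python
def window_tokens(token_ids, chunk_size, chunk_overlap=0):
    """Slice a list of token IDs into windows of at most `chunk_size` IDs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(token_ids):
            break
    return windows


fake_ids = list(range(120))  # stand-in for the output of an encoder

no_overlap = window_tokens(fake_ids, chunk_size=50, chunk_overlap=0)
print([len(w) for w in no_overlap])    # [50, 50, 20]

with_overlap = window_tokens(fake_ids, chunk_size=50, chunk_overlap=10)
print([len(w) for w in with_overlap])  # [50, 50, 40]
```

Only the last window may be shorter than `chunk_size`, which is consistent with the first three chunks above each containing exactly 50 tokens.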
