In LangChain, text splitters break large texts into smaller, manageable chunks. This matters when working with LLMs (large language models), which have token limits and tend to perform better with appropriately sized inputs.
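
To make the idea of chunking concrete, here is a minimal plain-Python sketch that cuts a string into fixed-size character windows. It is an illustration only, not how LangChain splits text: the function `sliding_window_chunks` is hypothetical, and its `chunk_size` and `chunk_overlap` arguments simply mirror the parameter names used with the LangChain splitters in this notebook.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so context that
    straddles a cut point appears in both neighbouring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "The world must be made safe for democracy."
chunks = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

A window like this cuts words in half, which is why the splitters used in this notebook prefer to cut at separators such as blank lines or spaces.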

In [1]:
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(
    query="Algorithmic_trading",
    load_max_docs=2
)
docs = loader.load()
docs

[Document(metadata={'title': 'Algorithmic trading', 'summary': 'Algorithmic trading is a method of executing orders using automated pre-programmed trading instructions accounting for variables such as time, price, and volume. This type of trading attempts to leverage the speed and computational resources of computers relative to human traders. In the twenty-first century, algorithmic trading has been gaining traction with both retail and institutional traders. A study in 2019 showed that around 92% of trading in the Forex market was performed by trading algorithms rather than humans. \nIt is widely used by investment banks, pension funds, mutual funds, and hedge funds that may need to spread out the execution of a larger order or perform trades too fast for human traders to react to. However, it is also available to private traders using simple retail tools.\nThe term algorithmic trading is often used synonymously with automated trading system. These encompass a variety of trading stra

In [2]:
type(docs[0])

langchain_core.documents.base.Document

#### Recursively Split Text by Characters

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_document = text_splitter.split_documents(docs)

In [4]:
final_document

[Document(metadata={'title': 'Algorithmic trading', 'summary': 'Algorithmic trading is a method of executing orders using automated pre-programmed trading instructions accounting for variables such as time, price, and volume. This type of trading attempts to leverage the speed and computational resources of computers relative to human traders. In the twenty-first century, algorithmic trading has been gaining traction with both retail and institutional traders. A study in 2019 showed that around 92% of trading in the Forex market was performed by trading algorithms rather than humans. \nIt is widely used by investment banks, pension funds, mutual funds, and hedge funds that may need to spread out the execution of a larger order or perform trades too fast for human traders to react to. However, it is also available to private traders using simple retail tools.\nThe term algorithmic trading is often used synonymously with automated trading system. These encompass a variety of trading stra

In [5]:
print(final_document[0])
print(final_document[1])

page_content='Algorithmic trading is a method of executing orders using automated pre-programmed trading instructions accounting for variables such as time, price, and volume. This type of trading attempts to leverage the speed and computational resources of computers relative to human traders. In the twenty-first century, algorithmic trading has been gaining traction with both retail and institutional traders. A study in 2019 showed that around 92% of trading in the Forex market was performed by trading' metadata={'title': 'Algorithmic trading', 'summary': 'Algorithmic trading is a method of executing orders using automated pre-programmed trading instructions accounting for variables such as time, price, and volume. This type of trading attempts to leverage the speed and computational resources of computers relative to human traders. In the twenty-first century, algorithmic trading has been gaining traction with both retail and institutional traders. A study in 2019 showed that around

##### Splitting a Text File

In [6]:
from langchain_community.document_loaders import TextLoader

# Read the file as UTF-8 so characters such as "…" are not garbled.
loader = TextLoader('speech.txt', encoding='utf-8')
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=15)
final_document = text_splitter.split_documents(text_documents)

In [8]:
final_document

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its'),
 Document(metadata={'source': 'speech.txt'}, page_content='democracy. Its peace must be planted upon the'),
 Document(metadata={'source': 'speech.txt'}, page_content='upon the tested foundations of political liberty.'),
 Document(metadata={'source': 'speech.txt'}, page_content='liberty. We have no selfish ends to serve. We'),
 Document(metadata={'source': 'speech.txt'}, page_content='to serve. We desire no conquest, no dominion. We'),
 Document(metadata={'source': 'speech.txt'}, page_content='dominion. We seek no indemnities for ourselves,'),
 Document(metadata={'source': 'speech.txt'}, page_content='for ourselves, no material compensation for the'),
 Document(metadata={'source': 'speech.txt'}, page_content='for the sacrifices we shall freely make. We are'),
 Document(metadata={'source': 'speech.txt'}, page_content='make. We are but one of the champions of the'),
 Document(metad

In [9]:
print(final_document[0])
print(final_document[1])

page_content='The world must be made safe for democracy. Its' metadata={'source': 'speech.txt'}
page_content='democracy. Its peace must be planted upon the' metadata={'source': 'speech.txt'}


#### Character Text Splitter
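This is the simplest method. It splits the text on a single character sequence, which defaults to "\n\n" (a blank line), and measures chunk length in characters. Because it uses only that one separator, a piece that contains no separator is never split further, so a chunk can end up longer than chunk_size.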
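
The behaviour described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the helper `split_on_separator` is hypothetical, it ignores `chunk_overlap`, and it only shows the two core steps (split on one separator, then merge neighbouring pieces while they still fit).

```python
def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 50) -> list[str]:
    """Split on one separator, then greedily merge adjacent pieces.

    A piece longer than chunk_size is kept whole, because there is no
    second separator to fall back on.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size; it is not split further
    if current:
        chunks.append(current)
    return chunks


long_paragraph = "A long paragraph without any blank line inside it. " * 3
text = "Short one.\n\nShort two.\n\n" + long_paragraph
chunks = split_on_separator(text, chunk_size=50)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The oversized last chunk is the situation that LangChain reports with a "Created a chunk of size ..., which is longer than the specified ..." warning.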


In [10]:
from langchain_community.document_loaders import TextLoader

# Read the file as UTF-8 so characters such as "…" are not garbled.
loader = TextLoader('speech.txt', encoding='utf-8')
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

In [11]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=50, chunk_overlap=15)
text_splitter.split_documents(docs)

Created a chunk of size 470, which is longer than the specified 50
Created a chunk of size 347, which is longer than the specified 50
Created a chunk of size 670, which is longer than the specified 50
Created a chunk of size 984, which is longer than the specified 50
Created a chunk of size 791, which is longer than the specified 50


[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'),
 Document(metadata={'source': 'speech.txt'}, page_content

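- If the text is very large and you want a quick split on one known separator → CharacterTextSplitter

- If you want chunks that respect the size limit while keeping paragraphs, lines, and words together where possible → RecursiveCharacterTextSplitter

RecursiveCharacterTextSplitter is usually the preferred choice because it is more flexible: it tries a list of separators in order (by default "\n\n", then "\n", then " ", then "") until the chunks fit within chunk_size.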
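
The difference between the two splitters comes down to what happens when a piece is still too long. The sketch below shows the recursive fallback idea in plain Python. It is a simplified illustration, not LangChain's code: `recursive_split` and `merge_pieces` are hypothetical helpers, overlap is left out, and the separator list is assumed to match the library's default order.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join adjacent pieces while the result fits in chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    small: list[str] = []  # pieces that fit and are waiting to be merged
    for piece in text.split(separator):
        if len(piece) <= chunk_size:
            if piece:
                small.append(piece)
        else:
            # Flush the small pieces, then split the oversized piece with finer separators.
            chunks.extend(merge_pieces(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, chunk_size, finer))
    chunks.extend(merge_pieces(small, separator, chunk_size))
    return chunks


text = (
    "The world must be made safe for democracy. Its peace must be planted "
    "upon the tested foundations of political liberty.\n\n"
    "We have no selfish ends to serve."
)
chunks = recursive_split(text, chunk_size=50)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a single separator, the first paragraph here would stay as one oversized chunk; the fallback to spaces is what keeps every chunk within the limit without cutting words.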