### How to split by character - Character Text Splitter

This is the simplest method. It splits the text on a given character sequence, which defaults to "\n\n", and measures chunk length in number of characters.

1. How the text is split: by a single character separator.
2. How the chunk size is measured: by number of characters.
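
The two points above can be illustrated with a small pure-Python sketch. This is a simplified illustration, not LangChain's actual implementation: the function name split_by_separator is hypothetical, chunk overlap is ignored, and the only behavior it tries to mirror is "split on one separator, then merge pieces up to chunk_size characters".

```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Hypothetical helper: split on one separator, then merge pieces by character count."""
    # Step 1: split on the single separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p]

    chunks, current = [], ""
    for piece in pieces:
        # Step 2: try to append the piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole: there is no
            # second separator to fall back on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\n" + "c" * 30 + "\n\ndddd"
chunks = split_by_separator(sample, chunk_size=12)
print([len(c) for c in chunks])  # [10, 30, 4]
```

The middle chunk is 30 characters long even though chunk_size is 12, because that piece contains no separator. This is the same situation that produces the "Created a chunk of size ..., which is longer than the specified 100" warnings in this notebook.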

In [1]:
from langchain_community.document_loaders import TextLoader

# Load speech.txt as a list containing a single Document
loader = TextLoader("speech.txt")
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be

In [5]:
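from langchain_text_splitters import CharacterTextSplitter

# Split on blank lines; paragraphs longer than chunk_size are kept whole (see the warnings below)
text_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(docs)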

Created a chunk of size 470, which is longer than the specified 100
Created a chunk of size 347, which is longer than the specified 100
Created a chunk of size 668, which is longer than the specified 100
Created a chunk of size 982, which is longer than the specified 100
Created a chunk of size 789, which is longer than the specified 100


[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'),
 Document(metadata={'source': 'speech.txt'}, page_content

In [6]:
# Read the raw text of the speech
with open("speech.txt") as f:
    speech = f.read()

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([speech])  # wrap each chunk of the raw string in a Document
print(text[0])
print(text[1])


Created a chunk of size 470, which is longer than the specified 100
Created a chunk of size 347, which is longer than the specified 100
Created a chunk of size 668, which is longer than the specified 100
Created a chunk of size 982, which is longer than the specified 100
Created a chunk of size 789, which is longer than the specified 100


page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'
page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'


### RecursiveCharacterTextSplitter vs. CharacterTextSplitter

The fundamental difference between LangChain's RecursiveCharacterTextSplitter and CharacterTextSplitter lies in their approach to splitting text into chunks:

CharacterTextSplitter:
Simple Character-Based Splitting:
This splitter splits text on a single specified separator (e.g., a space, newline, or custom delimiter) and then merges the resulting pieces into chunks of up to chunk_size characters.

Less Semantic Awareness:
It knows only that one separator, so it cannot fall back to sentence or word boundaries. A piece that contains no separator is never split further: if it is longer than chunk_size, it is emitted as an oversized chunk with a warning, as the "Created a chunk of size 470, which is longer than the specified 100" messages above show.

Direct Approach:
It's a more straightforward and less sophisticated method, suitable when the text has one reliable delimiter and chunks that exceed chunk_size are acceptable.

RecursiveCharacterTextSplitter:
Hierarchical and Recursive Splitting:
This is the recommended splitter for general-purpose text. It attempts to maintain semantic units by recursively trying a list of separators in a defined order (e.g., ["\n\n", "\n", " ", ""]).

Semantic Coherence:
It prioritizes splitting on larger, more semantically meaningful separators first (like double newlines for paragraphs), then progressively moves to smaller units (single newlines for lines, spaces for words, and finally individual characters) if the chunks are still too large.
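
The fallback order described above can be sketched in plain Python. This is a deliberately simplified illustration under stated assumptions, not the library's code: recursive_split is a hypothetical name, separators are dropped rather than kept, and small pieces are not merged back together up to chunk_size as the real splitter does.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Hypothetical helper: try coarse separators first, finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left to try: return the oversized text unchanged.
        return [text]

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = []
    for piece in text.split(sep):
        # Pieces that are still too long are retried with the finer separators.
        chunks.extend(recursive_split(piece, finer, chunk_size))
    return chunks


sample = "one two three\n\n" + "x" * 12 + "\nshort"
result = recursive_split(sample, chunk_size=8)
print(result)  # ['one', 'two', 'three', 'xxxxxxxx', 'xxxx', 'short']
```

Only the 12-character run with no whitespace is cut at an arbitrary character position; everything else is split at a paragraph, line, or word boundary.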

Intelligent Chunking:
This approach aims to create chunks that are as semantically coherent as possible, making them more useful for tasks like retrieval-augmented generation (RAG) where understanding the context of each chunk is crucial.

Chunk Overlap Management:
Both splitters accept chunk_overlap. Because RecursiveCharacterTextSplitter breaks text into smaller pieces, it more often has pieces short enough to repeat at the start of the next chunk, so its overlaps tend to be more useful in practice.
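
A rough sketch of how overlap can be applied while merging pieces is shown below. It assumes that whole pieces, not raw characters, are carried over into the next chunk; the helper name merge_with_overlap and the joiner parameter are hypothetical, and the library's real merging logic may differ in detail.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, joiner=" "):
    """Hypothetical helper: merge pieces into chunks, repeating trailing pieces as overlap."""

    def joined_length(items):
        return len(joiner.join(items))

    chunks, window = [], []
    for piece in pieces:
        if window and joined_length(window + [piece]) > chunk_size:
            chunks.append(joiner.join(window))
            # Drop leading pieces until the carried tail fits within
            # chunk_overlap and leaves room for the new piece.
            while window and (
                joined_length(window) > chunk_overlap
                or joined_length(window + [piece]) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(joiner.join(window))
    return chunks


short_pieces = merge_with_overlap(["aa", "bb", "cc", "dd", "ee"], chunk_size=8, chunk_overlap=2)
print(short_pieces)  # ['aa bb cc', 'cc dd ee']

long_pieces = merge_with_overlap(["aaaaaaaa", "bbbbbbbb"], chunk_size=8, chunk_overlap=2)
print(long_pieces)  # ['aaaaaaaa', 'bbbbbbbb']
```

In the first call the piece "cc" is short enough to be repeated, so the chunks overlap. In the second call every piece is longer than chunk_overlap, so nothing is carried over. Under this assumption, that would be consistent with the notebook output above, where chunk_overlap=20 is set but the paragraph-sized chunks share no text.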

In essence, CharacterTextSplitter is a simple splitter that works with a single separator, while RecursiveCharacterTextSplitter is a hierarchical splitter that works through a list of separators to keep chunks within chunk_size while preserving semantic units where it can.