# CharacterTextSplitter
CharacterTextSplitter is a tool that splits text on a single separator and then merges the pieces into chunks measured by character count. It does not analyze words, sentences, or paragraphs beyond what the separator itself marks.


## 1. separator (default: "\n\n")
What it does: Tells the splitter where to break the text first.

Example: If set to " ", it splits at spaces.

Set to "" if you want to split purely by characters.
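The effect of the separator can be illustrated without LangChain. The sketch below is a simplified stand-in, not the library's implementation; the helper name `split_on_separator` and the sample string are made up for illustration.

```python
def split_on_separator(text: str, separator: str) -> list[str]:
    """Break text at a literal separator; an empty separator yields single characters."""
    if separator == "":
        # str.split("") raises ValueError, so the empty case is handled explicitly
        return list(text)
    # Drop empty pieces produced by leading, trailing, or repeated separators
    return [piece for piece in text.split(separator) if piece]


sample = "First paragraph.\n\nSecond paragraph has more words."

paragraphs = split_on_separator(sample, "\n\n")  # break at blank lines
words = split_on_separator(sample, " ")          # break at spaces
chars = split_on_separator("abc", "")            # break at every character
```

Note that splitting on " " leaves the "\n\n" inside one of the pieces, because only the chosen separator is ever used as a break point.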

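## 2. chunk_size (default: 4000)
What it does: The target maximum length of each chunk, measured by length_function (characters by default).

If the text is longer, it will be broken into smaller chunks. A piece that contains no separator cannot be broken further, so it can still end up longer than chunk_size.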

## 3. chunk_overlap (default: 200)
What it does: How many characters from the end of one chunk should be repeated at the beginning of the next chunk.

This helps preserve context between chunks.
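To make the overlap concrete, here is a minimal sketch of fixed-size windows with overlap, roughly what happens when the separator is "". It is an illustration under that assumption, not LangChain's code, and `fixed_windows` is a hypothetical helper.

```python
def fixed_windows(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = fixed_windows("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
# Each chunk starts with the last 3 characters of the previous one
```

With chunk_size=10 and chunk_overlap=3, every new chunk advances by 7 characters, so consecutive chunks share 3.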

## 4. length_function (default: len)
What it does: Function used to calculate the length of each chunk.

You can customize this if you're counting tokens instead of characters.

Example (using tiktoken): length_function = lambda x: len(tokenizer.encode(x))
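The sketch below shows how swapping the length function changes what "fits" in a chunk. It uses a whitespace word count as a stand-in for a real tokenizer (an assumption made so the example runs without tiktoken); `count_words` and `fits` are hypothetical helpers, not LangChain APIs.

```python
from typing import Callable


def count_words(text: str) -> int:
    """Stand-in for a tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def fits(text: str, chunk_size: int, length_function: Callable[[str], int] = len) -> bool:
    """Check whether text is within chunk_size under the given length function."""
    return length_function(text) <= chunk_size


sentence = "The world must be made safe for democracy."

fits_by_chars = fits(sentence, chunk_size=10)               # measured in characters
fits_by_words = fits(sentence, chunk_size=10, length_function=count_words)  # measured in words
```

The same chunk_size=10 rejects the sentence when counting characters and accepts it when counting words, which is why chunk_size and chunk_overlap should be chosen in the units of the length function.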

## 5. is_separator_regex (default: False)
What it does: Tells LangChain whether your separator is a regex pattern.

If True, the separator will be treated as a regular expression.

Example: separator = r"\s+" will split on all whitespace.
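The difference between a literal and a regex separator can be sketched with Python's `re` module. This is an illustration of the idea, assuming the literal case is handled by escaping the separator; `split_with_flag` is a hypothetical helper.

```python
import re


def split_with_flag(text: str, separator: str, is_separator_regex: bool = False) -> list[str]:
    """Split on separator, treating it as a regex only when the flag is set."""
    # A literal separator is escaped so characters like "." lose their regex meaning
    pattern = separator if is_separator_regex else re.escape(separator)
    return [piece for piece in re.split(pattern, text) if piece]


whitespace_pieces = split_with_flag("one  two\tthree\nfour", r"\s+", is_separator_regex=True)
literal_dot = split_with_flag("a.b.c", ".", is_separator_regex=False)
regex_dot = split_with_flag("a.b.c", ".", is_separator_regex=True)
```

The last call shows why the flag matters: as a regex, "." matches every character, so nothing is left between the matches.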



## How it works:
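You tell it:

Which separator to split on (default "\n\n").

How many characters per chunk you want (e.g., 100).

How much overlap you want between chunks (optional).

It first splits the text at every separator, then merges neighboring pieces into a chunk until adding the next piece would exceed the chunk size. Only with separator="" does it cut the text like a knife, every N characters, even if that cuts a word or sentence in half.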
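With a non-empty separator, cuts only happen at separator positions, so the process is closer to "split, then pack". The sketch below is a simplified greedy version with overlap left out; it is not LangChain's implementation, and `split_then_merge` is a hypothetical name.

```python
def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Greedy sketch: split on separator, then pack pieces into chunks (overlap omitted)."""
    pieces = [p for p in text.split(separator) if p] if separator else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new chunk (even if oversized)
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbb\n\ncc\n\n" + "d" * 12
by_paragraph = split_then_merge(text, "\n\n", chunk_size=8)
by_character = split_then_merge("abcdefghij", "", chunk_size=4)
```

In `by_paragraph` the last piece has 12 characters and no separator inside it, so it becomes a chunk longer than chunk_size. In `by_character` every chunk is cut at exactly 4 characters.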



## Summary:
| Behavior | Explanation |
|---|---|
| Default | First splits on \n\n, then merges the pieces up to the character limit |
| separator="" | Forces it to split purely by character count |


In [17]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")

documents = loader.load()

In [18]:
documents

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness beca

In [19]:
documents[0]

Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness becau

In [20]:
documents[0].page_content

'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness because we act without animus, not in enmity toward a people o

## Default separator: tries to keep paragraphs together

In [21]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)

texts=text_splitter.split_documents(documents)

Created a chunk of size 818, which is longer than the specified 100
Created a chunk of size 668, which is longer than the specified 100
Created a chunk of size 982, which is longer than the specified 100
Created a chunk of size 789, which is longer than the specified 100
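These warnings appear because the default separator only breaks at "\n\n", and any paragraph longer than chunk_size has nowhere else to be cut. A quick way to anticipate them is to measure the pieces up front. The helper `oversized_pieces` and the sample text below are hypothetical, not part of LangChain or of speech.txt.

```python
def oversized_pieces(text: str, separator: str, chunk_size: int) -> list[int]:
    """Return the lengths of separator-delimited pieces that exceed chunk_size."""
    return [len(piece) for piece in text.split(separator) if len(piece) > chunk_size]


# Made-up sample: one short paragraph, one of 150 characters, one of 40
sample = "short one\n\n" + "x" * 150 + "\n\n" + "y" * 40
too_long = oversized_pieces(sample, "\n\n", chunk_size=100)
```

Any length returned here corresponds to a piece that would come out longer than the requested chunk size unless a finer separator is chosen.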


In [22]:
texts
len(texts)

6

In [23]:
texts[0]

Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.')

## Blindly splits into 100-character chunks

In [24]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(separator="", chunk_size=100, chunk_overlap=20)

texts = text_splitter.split_documents(documents)

In [25]:
len(texts)

46
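The chunk count for separator="" can be estimated from the text length alone. The sketch below assumes plain fixed windows that advance by chunk_size - chunk_overlap characters and ignores details such as whitespace stripping, so treat it as an approximation; `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-size overlapping windows cover text_length characters."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters contributed by each later chunk
    return math.ceil((text_length - chunk_overlap) / step)


small = estimate_chunk_count(26, chunk_size=10, chunk_overlap=3)
large = estimate_chunk_count(3700, chunk_size=100, chunk_overlap=20)
```

With chunk_size=100 and chunk_overlap=20, each chunk after the first adds 80 new characters, so under this estimate a text of 3,700 characters would yield 46 chunks.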