## Text Splitter - Recursive Character Text Splitter
This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the most strongly semantically related pieces of text.
- How the text is split: by a list of characters
- How the chunk size is measured: by number of characters

#### The splitter uses this list in a hierarchical manner:

- First Pass: Try splitting by the first separator in the list ("\n\n").
- Second Pass: If the resulting chunks are still too large, split them by the next separator ("\n").
- Third Pass: If chunks are still too large, split them by spaces (" ").
- Final Pass: If no other options work, split at the character level ("").
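
The passes above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch assumes no chunk overlap and does not merge small neighbouring pieces back together, which the real splitter does.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split `text` into pieces of at most `chunk_size` characters.

    Hypothetical helper for illustration only. It ignores chunk overlap and
    never merges small adjacent pieces.
    """
    # Small enough already: nothing more to do.
    if len(text) <= chunk_size:
        return [text] if text else []

    # Final pass (or no separators left): cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: retry this piece with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


text = "This is a paragraph.\n\n It has multiple sentences.\n Here is another paragraph. Let's split it."
print(recursive_split(text, chunk_size=50))
# ['This is a paragraph.', 'It has multiple sentences.', "Here is another paragraph. Let's split it."]
```

Only the piece that is still too large after the "\n\n" pass is split again on "\n"; the first paragraph already fits and is left alone.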

In [1]:
## Test Example
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "This is a paragraph.\n\n It has multiple sentences.\n Here is another paragraph. Let's split it."
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)
print(chunks)


['This is a paragraph.', 'It has multiple sentences.', "Here is another paragraph. Let's split it."]


In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document with its text and metadata
loader = PyPDFLoader("./../2-Data_Ingestion/Speech.pdf")

docs = loader.load()

docs

[Document(metadata={'source': './../2-Data_Ingestion/Speech.pdf', 'page': 0}, page_content='Goodmorningdistinguishedguests, colleagues, andfriends,\nIt isanhonortostandbeforeyoutodaytodiscussatopicthat isshapingthefutureofhumanity—Artificial Intelligence(AI). Aswenavigatethroughthefourthindustrial revolution,AI standsasbothabeaconof hopeandasubject of caution.\nToday, I will coverthreecritical areas: thetransformativepotential of AI, theethicalconsiderationsit brings, andourcollectiveresponsibilityasstewardsof thispowerfultechnology.\nLet usbeginbyreflectingonhowAI istransformingindustries. Fromhealthcaretoagriculture, AI isenablingunprecedentedadvancements. Forinstance, AI-drivendiagnostictoolsaredetectingdiseasesearlierthaneverbefore, savingcountlesslives. Inagriculture,AI-poweredanalyticsarehelpingfarmersoptimizecropyieldswhileminimizingenvironmental impact. Thesearejust afewexamplesof howAI isdrivingprogress.\nHowever, wemust alsoacknowledgethechallengesahead. Therapidpaceof AI ado

In [29]:
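from langchain.text_splitter import RecursiveCharacterTextSplitter

# Aim for chunks of up to 500 characters, with up to 10 characters of overlap between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)

# split_documents keeps each source Document's metadata on the chunks it produces
final_doc = text_splitter.split_documents(docs)

final_doc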

[Document(metadata={'source': './../2-Data_Ingestion/Speech.pdf', 'page': 0}, page_content='Goodmorningdistinguishedguests, colleagues, andfriends,\nIt isanhonortostandbeforeyoutodaytodiscussatopicthat isshapingthefutureofhumanity—Artificial Intelligence(AI). Aswenavigatethroughthefourthindustrial revolution,AI standsasbothabeaconof hopeandasubject of caution.\nToday, I will coverthreecritical areas: thetransformativepotential of AI, theethicalconsiderationsit brings, andourcollectiveresponsibilityasstewardsof thispowerfultechnology.'),
 Document(metadata={'source': './../2-Data_Ingestion/Speech.pdf', 'page': 0}, page_content='Let usbeginbyreflectingonhowAI istransformingindustries. Fromhealthcaretoagriculture, AI isenablingunprecedentedadvancements. Forinstance, AI-drivendiagnostictoolsaredetectingdiseasesearlierthaneverbefore, savingcountlesslives. Inagriculture,AI-poweredanalyticsarehelpingfarmersoptimizecropyieldswhileminimizingenvironmental impact. Thesearejust afewexamplesof howA

In [7]:
print(final_doc[0])

page_content='Goodmorningdistinguishedguests, colleagues, andfriends,
It isanhonortostandbeforeyoutodaytodiscussatopicthat isshapingthefutureofhumanity—Artificial Intelligence(AI). Aswenavigatethroughthefourthindustrial revolution,AI standsasbothabeaconof hopeandasubject of caution.
Today, I will coverthreecritical areas: thetransformativepotential of AI, theethicalconsiderationsit brings, andourcollectiveresponsibilityasstewardsof thispowerfultechnology.' metadata={'source': './../2-Data_Ingestion/Speech.pdf', 'page': 0}


In [8]:
print(final_doc[1])

page_content='Let usbeginbyreflectingonhowAI istransformingindustries. Fromhealthcaretoagriculture, AI isenablingunprecedentedadvancements. Forinstance, AI-drivendiagnostictoolsaredetectingdiseasesearlierthaneverbefore, savingcountlesslives. Inagriculture,AI-poweredanalyticsarehelpingfarmersoptimizecropyieldswhileminimizingenvironmental impact. Thesearejust afewexamplesof howAI isdrivingprogress.
However, wemust alsoacknowledgethechallengesahead. Therapidpaceof AI adoptionraisesquestionsabout dataprivacy, algorithmictransparency, andequitableaccesstotechnology.' metadata={'source': './../2-Data_Ingestion/Speech.pdf', 'page': 0}


## How to split by character - Character Text Splitter
This is the simplest method. It splits on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.
- How the text is split: by a single character separator
- How the chunk size is measured: by number of characters
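
The idea can be sketched in plain Python as "split on one separator, then greedily merge pieces while they fit". This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and the sketch assumes no chunk overlap.

```python
def split_on_separator(text, separator="\n\n", chunk_size=50):
    """Simplified sketch: split on one separator, then greedily merge pieces.

    Hypothetical helper for illustration only. It ignores chunk overlap.
    """
    # Split once on the separator and drop empty or whitespace-only pieces.
    pieces = [p.strip() for p in text.split(separator)]
    pieces = [p for p in pieces if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may exceed chunk_size; this sketch has no
            # finer separator to fall back on, so it is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "This is a paragraph. It has multiple sentences. \n Here is another paragraph. Let's split it."
print(split_on_separator(text, separator="\n", chunk_size=50))
# ['This is a paragraph. It has multiple sentences.', "Here is another paragraph. Let's split it."]
```

Because there is only one separator, a piece longer than `chunk_size` cannot be reduced further, which is the main difference from the recursive approach.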

In [30]:
## Test Example
from langchain.text_splitter import CharacterTextSplitter

text = "This is a paragraph. It has multiple sentences. \n Here is another paragraph. Let's split it."
splitter = CharacterTextSplitter(separator=" ",chunk_size=10, chunk_overlap=10)
chunks = splitter.split_text(text)
print(chunks)


['This is a', 'paragraph.', 'It has', 'multiple', 'sentences.', 'Here is', 'is another', 'paragraph.', "Let's", 'split it.']


In [3]:
## Test Example
from langchain.text_splitter import CharacterTextSplitter

text = "This is a paragraph. It has multiple sentences. \n Here is another paragraph. Let's split it."
splitter = CharacterTextSplitter(separator="\n",chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)
print(chunks)


['This is a paragraph. It has multiple sentences.', "Here is another paragraph. Let's split it."]
