# Text Splitting From Documents using Recursive Character Splitter

In [3]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../Documnets/deepseek_r1.pdf")
docs = loader.load_and_split()
docs

[Document(metadata={'source': '../Documnets/deepseek_r1.pdf', 'page': 0}, page_content='DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.\nDeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-\nvised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.\nThrough RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing\nreasoning behaviors. However, it encounters challenges such as poor readability, and language\nmixing. To address these issues and further enhance reasoning performance, we introduce\nDeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-\nR1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the\nresearch community, we open-source DeepSeek-R1-Zer

## How to Recursively Split Text by Characters

In [19]:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
final_docs = text_splitter.split_documents(docs)
final_docs


[Document(metadata={'source': '../Documnets/deepseek_r1.pdf', 'page': 0}, page_content='DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via\nReinforcement Learning\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.\nDeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-\nvised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.\nThrough RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing'),
 Document(metadata={'source': '../Documnets/deepseek_r1.pdf', 'page': 0}, page_content='Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing\nreasoning behaviors. However, it encounters challenges such as poor readability, and language\nmixing. To address these issues and further enhance reasoning performance, we introduce\nDeepSeek-R1, which incorporates multi-stage training and c
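The idea behind recursive splitting can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` and the sample text are hypothetical, and overlap handling is deliberately left out. It tries the coarsest separator first (`"\n\n"`, then `"\n"`, then `" "`, then a hard character cut, which matches the documented default separator order of `RecursiveCharacterTextSplitter`) and only falls back to a finer one for pieces that are still longer than `chunk_size`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator first and recurse
    with finer separators only for pieces that are still too long.
    Chunk overlap is intentionally not handled here."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, used only for this illustration.
sample = (
    "Title\n\n"
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer than the limit allows."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraphs stay together because they fit within the limit, while the long paragraph is broken at word boundaries rather than mid-word.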

### Look at the overlapping characters between the first and second chunks

In [20]:
print(final_docs[0], end='\n\n')
print(final_docs[1])

page_content='DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
DeepSeek-AI
research@deepseek.com
Abstract
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without super-
vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities.
Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing' metadata={'source': '../Documnets/deepseek_r1.pdf', 'page': 0}

page_content='Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing
reasoning behaviors. However, it encounters challenges such as poor readability, and language
mixing. To address these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-
R1 achieves performance comparable to OpenAI-o1-1217 on reaso
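To make the overlap explicit rather than spotting it by eye, a small helper can compute the text shared between the end of one chunk and the start of the next. This is a minimal sketch: `shared_overlap` is a hypothetical helper, not part of LangChain, and the two strings below are excerpts copied from the printed chunks above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Tail of the first chunk and head of the second chunk (excerpts only).
first_tail = (
    "vised fine-tuning (SFT) as a preliminary step, demonstrates remarkable "
    "reasoning capabilities.\n"
    "Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing"
)
second_head = (
    "Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing\n"
    "reasoning behaviors. However, it encounters"
)

overlap = shared_overlap(first_tail, second_head)
print(repr(overlap))
print(len(overlap))
```

In the notebook itself the same helper could be called as `shared_overlap(final_docs[0].page_content, final_docs[1].page_content)`. The shared text is a whole line because the splitter breaks on separators, so the overlap can be shorter than `chunk_overlap=100` but should not exceed it.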

In [23]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../Documnets/Global_Warming_Pak.txt")
docs = loader.load()
docs



### Notice how manually reading a .txt file gives us a plain string, not a Document type

In [26]:
docs = ""

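# Read explicitly as UTF-8 so characters such as curly apostrophes are decoded correctly
with open('../Documnets/Global_Warming_Pak.txt', encoding='utf-8') as file: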
    docs = file.read()

docs



In [40]:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text = text_splitter.create_documents([docs])
text

[Document(metadata={}, page_content='Global Warming in Pakistan: A Looming Crisis\n\nIntroduction'),
 Document(metadata={}, page_content='Global warming is one of the most pressing environmental challenges of the 21st century, and'),
 Document(metadata={}, page_content='21st century, and Pakistan is among the countries most vulnerable to its effects. Rising'),
 Document(metadata={}, page_content='its effects. Rising temperatures, melting glaciers, changing precipitation patterns, and increasing'),
 Document(metadata={}, page_content='and increasing frequency of extreme weather events are severely impacting Pakistan’s ecosystem,'),
 Document(metadata={}, page_content='ecosystem, economy, and society. This essay explores the causes, consequences, and potential'),
 Document(metadata={}, page_content='and potential solutions to global warming in Pakistan.'),
 Document(metadata={}, page_content='Causes of Global Warming in Pakistan'),
 Document(metadata={}, page_content='The primary drive
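As a quick sanity check on `chunk_size=100`, the length of each chunk can be compared against the limit. The sketch below uses the first four chunk strings copied from the output above so that it runs without the source file; the variable names are illustrative only.

```python
# First four chunks, copied from the output above.
chunks = [
    "Global Warming in Pakistan: A Looming Crisis\n\nIntroduction",
    "Global warming is one of the most pressing environmental challenges of the 21st century, and",
    "21st century, and Pakistan is among the countries most vulnerable to its effects. Rising",
    "its effects. Rising temperatures, melting glaciers, changing precipitation patterns, and increasing",
]
chunk_size = 100

# The splitter measures length with len() by default, i.e. in characters.
lengths = [len(chunk) for chunk in chunks]
for length, chunk in zip(lengths, chunks):
    print(f"{length:>3}  {chunk[:40]!r}")

within_limit = all(length <= chunk_size for length in lengths)
print("all chunks within limit:", within_limit)
```

In the notebook, the same check could be run over every chunk with `[len(d.page_content) for d in text]`.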

In [41]:
print(text[1])
print(text[2])

page_content='Global warming is one of the most pressing environmental challenges of the 21st century, and'
page_content='21st century, and Pakistan is among the countries most vulnerable to its effects. Rising'


After `create_documents`, the plain string has been converted to the Document type:

In [42]:
type(text[1])

langchain_core.documents.base.Document