# Data Transformation: Text Splitting Using LangChain
- In RAG, the first step is data ingestion.
- Data ingestion: load data from different sources using document loaders.
- Sources include PDF files, web pages, images, tabular data, etc.
- The second step is data transformation.
- Data transformation: convert the data into a format that the model can use.
- This can include text splitting, tokenization, stop-word removal, and stemming or lemmatization.
- In this step we convert the data (for example, PDF documents) into smaller chunks (for example, text chunks).
- We split the text because a model can only process a limited amount of text at a time.
- Every LLM has its own context-size limit; text splitting keeps each chunk within that limit.
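
The idea of chunking with overlap can be illustrated without any library. The following is a minimal sketch, not how LangChain implements it: `chunk_text` is a hypothetical helper that cuts a string into fixed-size windows, where each window repeats the last `chunk_overlap` characters of the previous one.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The splitters used in the rest of this notebook are smarter than this: they prefer to cut at separators such as paragraph breaks rather than at arbitrary character positions.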

# Text Splitting from Documents

In [1]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf")
docs = loader.load()
docs

[Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}, page_content='Published as a conference paper at ICLR 2024\nRAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING\nFOR TREE -ORGANIZED RETRIEVAL\nParth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning\nStanford University\npsarthi@cs.stanford.edu\nABSTRACT\nRetrieval-augmented language models can better adapt to changes in world state\nand incorporate long-tail knowledge. However, most existing methods retrieve\nonly short contiguous chunks from a retrieval corpus, limiting holistic under-\nstanding of the overall document context. We introduce the novel approach of\nrecursively embedding, clustering, and summarizing chunks of text, constructing\na tree with differing levels of summarization from the bottom up. At inference\ntime, our RAPTOR model retrieves from this tree, integrating information across\nlengthy documents at different 

In [3]:
type(docs[0])

langchain_core.documents.base.Document

# How to recursively split text by characters


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=50
)
final_documents = text_splitter.create_documents(docs) # Raises TypeError: create_documents expects a list of strings, not Document objects
final_documents

TypeError: expected string or bytes-like object

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=50
)
final_documents = text_splitter.split_documents(docs)
final_documents

[Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}, page_content='Published as a conference paper at ICLR 2024\nRAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING'),
 Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}, page_content='RAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING\nFOR TREE -ORGANIZED RETRIEVAL'),
 Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}, page_content='Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning'),
 Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}, page_content='Stanford University\npsarthi@cs.stanford.edu\nABSTRACT'),
 Document(metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 
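
`RecursiveCharacterTextSplitter` tries a list of separators in order (by default `"\n\n"`, `"\n"`, `" "`, and finally the empty string) and only falls back to the next separator when a piece is still longer than `chunk_size`. The sketch below illustrates that recursion in plain Python. It is a simplification under stated assumptions: `recursive_split` is a hypothetical helper, it drops the separators, and it neither merges small pieces back together nor applies overlap, all of which the real splitter handles.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the first separator; recurse with finer separators if needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: fall back to hard cuts of chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        # Pieces that are still too long are split again with finer separators.
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, chunk_size=12))
# ['alpha beta', 'gamma', 'delta', 'epsilon']
```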

# Overlap of up to 50 characters between consecutive chunks

In [5]:
print(final_documents[0])
print(final_documents[1])

page_content='Published as a conference paper at ICLR 2024
RAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING' metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}
page_content='RAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING
FOR TREE -ORGANIZED RETRIEVAL' metadata={'source': '/Users/surajbhardwaj/Desktop/Langchain/L1_Langchain/1_2_Data_Ingestion/Raptor.pdf', 'page': 0}
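
Both chunks printed above come from page 0, and the text they share is the overlap. One way to check this is to look for the longest suffix of the first chunk that is also a prefix of the second. `shared_overlap` is a hypothetical helper written for this illustration, and the two strings are copied from the output above.

```python
def shared_overlap(first: str, second: str) -> str:
    """Return the longest suffix of `first` that is also a prefix of `second`."""
    max_len = min(len(first), len(second))
    for size in range(max_len, 0, -1):
        if first.endswith(second[:size]):
            return second[:size]
    return ""


chunk_1 = "Published as a conference paper at ICLR 2024\nRAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING"
chunk_2 = "RAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING\nFOR TREE -ORGANIZED RETRIEVAL"

overlap = shared_overlap(chunk_1, chunk_2)
print(repr(overlap))
print(len(overlap) <= 50)  # chunk_overlap=50 is an upper bound, not an exact length
```

The overlap is shorter than 50 characters here because the splitter cuts at separators (line breaks in this case) instead of counting out exactly 50 characters.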


# Text File

In [7]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("speech.txt")
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.\n\nIn operating systems such as CP/M, where the operating system does \nnot keep track of the file size in bytes, the end of a text file is \ndenoted by placing one or more special characters, \nknown as an end-of-file (EOF) marker, as padding after the last \nline in a text file. In modern operating systems such as DOS, \nMicrosoft Windows and Unix-like systems, \ntext files do not contain any special EOF character, \nbecause file systems on those operating systems \nkeep track of the file size in bytes.\n\nSome operating systems, such as Multics, Unix-like systems, \nCP/M, DOS, the classic Mac OS, and Windows, \nstore text files as a sequence of bytes, \nwith an end-of-line delimiter at the 

In [None]:
speech = ""
with open("speech.txt") as f:
    speech=f.read()

speech

'A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.\n\nIn operating systems such as CP/M, where the operating system does \nnot keep track of the file size in bytes, the end of a text file is \ndenoted by placing one or more special characters, \nknown as an end-of-file (EOF) marker, as padding after the last \nline in a text file. In modern operating systems such as DOS, \nMicrosoft Windows and Unix-like systems, \ntext files do not contain any special EOF character, \nbecause file systems on those operating systems \nkeep track of the file size in bytes.\n\nSome operating systems, such as Multics, Unix-like systems, \nCP/M, DOS, the classic Mac OS, and Windows, \nstore text files as a sequence of bytes, \nwith an end-of-line delimiter at the end of each line. \nOther operating systems, such as OpenV

In [15]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50
)
text = text_splitter.create_documents([speech])
print(text[0])
print(text[1])

page_content='A text file (sometimes spelled textfile; 
an old alternative name is flat file) is a kind of computer file 
that is structured as a sequence of lines of electronic text.'
page_content='A text file exists stored as data within a computer file system.'


In [16]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("speech.txt")
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.\n\nIn operating systems such as CP/M, where the operating system does \nnot keep track of the file size in bytes, the end of a text file is \ndenoted by placing one or more special characters, \nknown as an end-of-file (EOF) marker, as padding after the last \nline in a text file. In modern operating systems such as DOS, \nMicrosoft Windows and Unix-like systems, \ntext files do not contain any special EOF character, \nbecause file systems on those operating systems \nkeep track of the file size in bytes.\n\nSome operating systems, such as Multics, Unix-like systems, \nCP/M, DOS, the classic Mac OS, and Windows, \nstore text files as a sequence of bytes, \nwith an end-of-line delimiter at the 

# Character Text Splitter
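- Separator = "\n\n"
- CharacterTextSplitter splits only on this separator, so a paragraph longer than chunk_size is kept whole and produces the "Created a chunk of size ..." warnings shown below.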

In [20]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=100,
    chunk_overlap=20
)
text_splitter.split_documents(docs)

Created a chunk of size 235, which is longer than the specified 100
Created a chunk of size 499, which is longer than the specified 100
Created a chunk of size 468, which is longer than the specified 100


[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.'),
 Document(metadata={'source': 'speech.txt'}, page_content='In operating systems such as CP/M, where the operating system does \nnot keep track of the file size in bytes, the end of a text file is \ndenoted by placing one or more special characters, \nknown as an end-of-file (EOF) marker, as padding after the last \nline in a text file. In modern operating systems such as DOS, \nMicrosoft Windows and Unix-like systems, \ntext files do not contain any special EOF character, \nbecause file systems on those operating systems \nkeep track of the file size in bytes.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Some operating systems, such as Multics, Unix-like systems, \nCP/M, DO
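
The "Created a chunk of size ..." warnings above appear because `CharacterTextSplitter` splits only on the given separator: a paragraph with no `"\n\n"` inside it cannot be cut further, even if it is longer than `chunk_size`. The sketch below mimics that behaviour in plain Python. It is a simplified model, not the library's code: `split_on_separator` is a hypothetical helper and it ignores `chunk_overlap`.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator and merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
        else:
            if current:
                chunks.append(current)
            # A single piece may be longer than chunk_size; it is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 150 + "\n\n" + "b" * 30 + "\n\n" + "c" * 40
for chunk in split_on_separator(sample, "\n\n", chunk_size=100):
    print(len(chunk))
# 150
# 72
```

The first chunk is 150 characters long although `chunk_size` is 100, which mirrors the oversized chunks reported in the warnings. When a hard upper bound matters, `RecursiveCharacterTextSplitter` is usually the safer choice because it falls back to finer separators.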