<a href="https://colab.research.google.com/github/SarshaDev/GenAICookBooks/blob/main/simple_rag_chat_splitter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install langchain-openai langchain-community tiktoken "deeplake[enterprise]<4.0.0"



**from langchain.document_loaders import TextLoader** - use it to load text data from a file on disk

**from langchain.schema import Document** - use it to wrap text data that is already in a string

***The output of the TextLoader class is a list of Document objects***

In [5]:
Text = """
Kohli was born on 5 November 1988 in Delhi into a Punjabi Hindu family. His mother Saroj Kohli is as a housewife while his father Prem Nath Kohli worked as a criminal lawyer. He has an elder brother Vikas and an elder sister Bhawna. His formative years were spent in Uttam Nagar. His early education was at Vishal Bharti Public School.[4] As per his family, Kohli exhibited an early affinity for cricket as a 3-year-old. He would pick up a bat and request his father bowl to him.[5] In 1998, the West Delhi Cricket Academy was created. In May, his father arranged for him to meet Rajkumar Sharma.[6] Upon the suggestion of their neighbours, Kohli's father considered enrolling his son in a professional cricket academy, as they believed his ability merited more than gully cricket.[7] He was unable to secure a place in the U-14 Delhi team, due to extraneous factors. His father reportedly received offers to relocate his son to influential clubs, which would ensure his selection, but he declined the proposals. Kohli eventually found his way into the U-15 team.[8] He received training at the academy and participated in matches at the Sumeet Dogra Academy located at Vasundhara Enclave.[9] In pursuit of furthering his cricketing career, he transferred to Saviour Convent School during his ninth-grade education.[7] On 18 December 2006, his father died due to a cerebral attack.[10] As per his mother, Kohli's demeanour shifted noticeably after his father's death. He took on cricket with newfound seriousness, prioritizing playing time and dedicating himself fully to the sport.[7] Kohli's family resided in Meera Bagh, Paschim Vihar until the year 2015, after which they relocated to Gurgaon.[11]
"""

In [6]:
from langchain.schema import Document

In [7]:
docs = [Document(page_content=Text)]
len(docs[0].page_content)

1703

In [8]:
from langchain.text_splitter import CharacterTextSplitter

In [9]:
chars_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

In [10]:
doc_chunks = chars_splitter.split_documents(docs)

In [11]:
print(len(doc_chunks))
print(doc_chunks[0])

1
page_content='Kohli was born on 5 November 1988 in Delhi into a Punjabi Hindu family. His mother Saroj Kohli is as a housewife while his father Prem Nath Kohli worked as a criminal lawyer. He has an elder brother Vikas and an elder sister Bhawna. His formative years were spent in Uttam Nagar. His early education was at Vishal Bharti Public School.[4] As per his family, Kohli exhibited an early affinity for cricket as a 3-year-old. He would pick up a bat and request his father bowl to him.[5] In 1998, the West Delhi Cricket Academy was created. In May, his father arranged for him to meet Rajkumar Sharma.[6] Upon the suggestion of their neighbours, Kohli's father considered enrolling his son in a professional cricket academy, as they believed his ability merited more than gully cricket.[7] He was unable to secure a place in the U-14 Delhi team, due to extraneous factors. His father reportedly received offers to relocate his son to influential clubs, which would ensure his selection, bu

We got 1 chunk because **CharacterTextSplitter** splits only at its separator (the default is "\n\n") and then merges the resulting pieces up to chunk_size.
Here the text is one long paragraph with no double newlines, so the splitter sees a single piece and will not break it further, even though it is far longer than chunk_size (1703 characters against a limit of 200). Result: one big chunk.
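
The behaviour described above can be reproduced with a few lines of plain Python. This is a simplified stand-in written for illustration, not LangChain's implementation: the function name `split_on_separator` is hypothetical, and chunk overlap is left out.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Split only at `separator`, then merge neighbouring pieces up to chunk_size.

    A piece that is itself longer than chunk_size is never cut, which is the
    reason a single long paragraph comes back as one oversized chunk.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


one_paragraph = "word " * 100                       # 500 characters, no blank line
two_paragraphs = one_paragraph + "\n\n" + one_paragraph

print(len(split_on_separator(one_paragraph)))       # 1
print(len(split_on_separator(two_paragraphs)))      # 2
```

The only thing that changes the chunk count is the presence of the separator; chunk_size alone never forces a cut.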

For most real-world text, ***RecursiveCharacterTextSplitter*** is the better choice:

1. It tries the separators ["\n\n", "\n", " ", ""] in order, falling back to character-level splitting only when nothing coarser fits
2. It handles text with inconsistent formatting
3. It works well with chunk overlap
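
The fallback order in point 1 can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: `recursive_split` is a hypothetical name, chunk overlap is omitted, and the separator between two chunks is dropped rather than kept.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Try coarse separators first; use a finer one only for pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Kohli exhibited an early affinity for cricket as a 3-year-old. " * 8
chunks = recursive_split(sample.strip(), chunk_size=200)
print(len(chunks), [len(c) for c in chunks])
```

Because the sample has no newlines, the first two separators find nothing to split on and the work is done at the space level, so every chunk ends on a word boundary and stays within the limit.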



In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [13]:
rec_chars_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

In [14]:
doc_rec_split = rec_chars_splitter.split_documents(docs)
len(doc_rec_split)

10

In [15]:
for each in doc_rec_split:
  print(each)

page_content='Kohli was born on 5 November 1988 in Delhi into a Punjabi Hindu family. His mother Saroj Kohli is as a housewife while his father Prem Nath Kohli worked as a criminal lawyer. He has an elder brother'
page_content='an elder brother Vikas and an elder sister Bhawna. His formative years were spent in Uttam Nagar. His early education was at Vishal Bharti Public School.[4] As per his family, Kohli exhibited an'
page_content='Kohli exhibited an early affinity for cricket as a 3-year-old. He would pick up a bat and request his father bowl to him.[5] In 1998, the West Delhi Cricket Academy was created. In May, his father'
page_content='In May, his father arranged for him to meet Rajkumar Sharma.[6] Upon the suggestion of their neighbours, Kohli's father considered enrolling his son in a professional cricket academy, as they'
page_content='academy, as they believed his ability merited more than gully cricket.[7] He was unable to secure a place in the U-14 Delhi team, due to extran

When we need special rules (code blocks, headings, tables), tight control over chunk boundaries, or token-aware limits, we can also write custom splitting logic ourselves instead of relying on a library splitter.
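
As one possible example of such custom logic, the sketch below packs whole sentences into chunks so that no sentence is cut in half. The boundary rule (whitespace that follows `.`, `!`, `?` or a citation marker such as `[4]`) is an assumption chosen to fit the sample text; it would misfire on abbreviations like "Dr.". The names `SENTENCE_BOUNDARY` and `split_by_sentences` are hypothetical.

```python
import re

# Whitespace preceded by sentence punctuation or by the "]" of a citation marker.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?\]])\s+")


def split_by_sentences(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Kohli was born on 5 November 1988 in Delhi into a Punjabi Hindu family. "
    "His formative years were spent in Uttam Nagar. "
    "His early education was at Vishal Bharti Public School.[4] "
    "In 1998, the West Delhi Cricket Academy was created."
)
for chunk in split_by_sentences(sample, max_chars=120):
    print(len(chunk), repr(chunk))
```

Compared with the character-based splitters, this trades evenly sized chunks for chunks that always start and end on a sentence boundary.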



## Vector Embedding

Let's create an embedding for each chunk and store the results in a Deep Lake vector store.

In [23]:
from langchain_openai import OpenAIEmbeddings
import os
from langchain_community.vectorstores import DeepLake
# from langchain-deeplake import DeeplakeVectorStore

In [42]:
os.environ["OPENAI_API_KEY"] = ""
os.environ["ACTIVELOOP_TOKEN"] = ""
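
The empty strings above are placeholders for real credentials. One way to avoid typing keys into a cell is to prompt for them at run time; the sketch below assumes an interactive session, and `ensure_env` is a hypothetical helper, not part of any library used here.

```python
import os
from getpass import getpass


def ensure_env(name, prompt=getpass):
    """Return os.environ[name], asking for it (without echo) only if unset or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]


# In the notebook, these calls would replace the hard-coded assignments:
# ensure_env("OPENAI_API_KEY")
# ensure_env("ACTIVELOOP_TOKEN")
```

The `prompt` parameter exists only so the helper can be exercised without a terminal; in normal use the default `getpass` is enough.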

In [43]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [44]:
active_loop_org_id = ""
dataset_name = "virat_kohli_childhood"

In [45]:
dataset_path = f"hub://{active_loop_org_id}/{dataset_name}"

In [46]:
db = DeepLake(dataset_path=dataset_path, embedding=embeddings)


Deep Lake Dataset in hub://sarshadev/virat_kohli_childhood already exists, loading from the storage


In [47]:
db.add_documents(doc_rec_split)

Creating 10 embeddings in 1 batches of size 10:: 100%|██████████| 1/1 [00:07<00:00,  7.82s/it]

Dataset(path='hub://sarshadev/virat_kohli_childhood', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
 embedding  embedding  (10, 1536)  float32   None   
    id        text      (10, 1)      str     None   
 metadata     json      (10, 1)      str     None   
   text       text      (10, 1)      str     None   





['e5ce06e4-9173-11f0-b280-0242ac1c000c',
 'e5ce1652-9173-11f0-b280-0242ac1c000c',
 'e5ce20d4-9173-11f0-b280-0242ac1c000c',
 'e5ce2278-9173-11f0-b280-0242ac1c000c',
 'e5ce2318-9173-11f0-b280-0242ac1c000c',
 'e5ce23a4-9173-11f0-b280-0242ac1c000c',
 'e5ce2426-9173-11f0-b280-0242ac1c000c',
 'e5ce24bc-9173-11f0-b280-0242ac1c000c',
 'e5ce2548-9173-11f0-b280-0242ac1c000c',
 'e5ce25ca-9173-11f0-b280-0242ac1c000c']

In [48]:
retriever = db.as_retriever()
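
Conceptually, a retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity. It is not what Deep Lake runs internally: the distance metric and the number of results that `as_retriever()` uses by default depend on the library version, and `top_k` with `k=2` is a hypothetical helper with an arbitrary default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k (text, score) pairs whose vectors are most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for the 1536-dimensional embeddings stored above.
stored = [
    ("born in Delhi", [0.9, 0.1, 0.0]),
    ("cricket academy", [0.1, 0.9, 0.1]),
    ("moved to Gurgaon", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]

for text, score in top_k(query, stored, k=2):
    print(f"{score:.3f}  {text}")
```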

In [49]:
from langchain.chains import RetrievalQA

In [50]:
from langchain_openai import OpenAI

In [51]:
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
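
With chain_type="stuff", all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows the idea only; the template wording and the function name `build_stuff_prompt` are assumptions for illustration, not the prompt LangChain actually sends.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate every retrieved chunk into one context block above the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "When was the West Delhi Cricket Academy created?",
    [
        "In 1998, the West Delhi Cricket Academy was created.",
        "Kohli was born on 5 November 1988 in Delhi.",
    ],
)
print(prompt)
```

Since everything goes into one prompt, this approach only works while the retrieved chunks together fit in the model's context window, which is one reason to keep chunk_size small.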



In [55]:
Query = "When was Virat Kohli born, where did his family reside, and when was the West Delhi Cricket Academy created?"

In [56]:
qa_chain.run(Query)

' Virat Kohli was born on 5 November 1988 in Delhi, and his family resided in Meera Bagh, Paschim Vihar. The West Delhi Cricket Academy was created in 1998.'