### Document Retrieval
* Step 1 - Load documents
* Step 2 - Split documents
* Step 3 - Encode and store in a vector store
* Step 4 - Retrieval

#### Load Documents             
* LangChain has multiple built-in document loaders, each of which loads a source into a **langchain_core.documents.base.Document** object.
* Built-in document loaders include:
    * CSV      
    * HTML         
    * JSON        
    * Markdown      
    * PDF      
    * Microsoft Office

In [20]:
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document
from typing import Iterator

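class TextDocumentLoader(BaseLoader):
    """Load a text file line by line (lazy_load/load) or as one Document (full_load)."""

    def __init__(self, path: str):
        self.path = path  # path to the text file

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document per line so the whole file is never held in memory at once.
        with open(self.path, encoding='utf-8') as file:
            for i, line in enumerate(file):
                yield Document(page_content=line, metadata={'line_number': i})

    def load(self) -> list[Document]:
        # BaseLoader.load() already collects lazy_load() into a list.
        return super().load()

    def full_load(self) -> Document:
        # Custom helper (not part of BaseLoader): return the whole file as one Document.
        with open(self.path, encoding='utf-8') as file:
            text = file.read()
            return Document(page_content=text, metadata={'file_path': self.path})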

In [None]:
from pathlib import Path

file_path = Path("text.txt")
# lazy_load() returns a generator, so each line is read only when the loop asks for it.
docs = TextDocumentLoader(file_path).lazy_load()
for doc in docs:
    print(doc)

page_content='Hello this is line 1.
' metadata={'line_number': 0}
page_content='This is the 2nd line of the document.
' metadata={'line_number': 1}
page_content='This is the 3rd line of the document.
' metadata={'line_number': 2}
page_content='This is the 4th and the final line of the document.' metadata={'line_number': 3}


In [17]:
TextDocumentLoader(file_path).load()

[Document(metadata={'line_number': 0}, page_content='Hello this is line 1.\n'),
 Document(metadata={'line_number': 1}, page_content='This is the 2nd line of the document.\n'),
 Document(metadata={'line_number': 2}, page_content='This is the 3rd line of the document.\n'),
 Document(metadata={'line_number': 3}, page_content='This is the 4th and the final line of the document.')]

In [22]:
document = TextDocumentLoader(file_path).full_load()

#### Text Splitter
* After a document is loaded, it must be split into text chunks.
* Different strategies can be used to split the text into chunks.
* The most common text splitters:
    * Character splitter: splits the text on a single given separator, e.g. ('\n', ' ').
    * Recursive character splitter: parameterized by a list of separators. It tries to split on them in order until the chunks are small enough.
    * Semantic chunking: chunks the text based on semantic similarity, placing breaks where the meaning of adjacent sentences changes.
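
The recursive strategy described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops the separator wherever a chunk boundary falls.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits: keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with the next separator in the list.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\ndelta epsilon\n\nzeta eta theta iota kappa"
for chunk in recursive_split(sample, ["\n\n", "\n", " "], chunk_size=16):
    print(repr(chunk))
```

Trying the coarsest separator first keeps paragraphs and lines intact where possible; finer separators are used only for pieces that still exceed the limit.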

In [None]:
file_path = Path("text_para.txt")
with open(file_path, 'r') as file:
    text = file.read()

text, type(text)

("\tThe rapid rise of artificial intelligence has transformed industries across the globe. From healthcare to finance, AI technologies are enabling faster decision-making and automating complex processes. In healthcare, AI-powered systems assist doctors in diagnosing diseases by analyzing medical images, while in finance, algorithms predict market trends and assess risks. As a result, businesses are embracing AI to improve efficiency and gain a competitive edge. \n\tMeanwhile, the development of renewable energy sources is reshaping the energy sector. Solar, wind, and hydropower are becoming more viable options as technology advances and costs decrease. Governments worldwide are setting ambitious goals to reduce carbon emissions and combat climate change. This shift toward renewable energy not only mitigates environmental impact but also opens up new opportunities for economic growth and job creation in sustainable industries.\n\tLastly, digital transformation is redefining how compani

In [62]:
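## Character Text Splitter
from langchain_text_splitters import CharacterTextSplitter

# Splits only on the tab separator, then merges pieces up to chunk_size characters.
# A single piece longer than chunk_size is kept whole and triggers a warning.
character_text_splitter = CharacterTextSplitter(
    separator="\t",
    chunk_size=450,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)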

In [61]:
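# create_documents() wraps each chunk in a Document; split_text() returns plain strings.
# Both calls run the splitter, which is why each size warning below appears twice.
documents = character_text_splitter.create_documents(texts=[text])
texts = character_text_splitter.split_text(text)
texts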

Created a chunk of size 464, which is longer than the specified 450
Created a chunk of size 470, which is longer than the specified 450
Created a chunk of size 464, which is longer than the specified 450
Created a chunk of size 470, which is longer than the specified 450


['The rapid rise of artificial intelligence has transformed industries across the globe. From healthcare to finance, AI technologies are enabling faster decision-making and automating complex processes. In healthcare, AI-powered systems assist doctors in diagnosing diseases by analyzing medical images, while in finance, algorithms predict market trends and assess risks. As a result, businesses are embracing AI to improve efficiency and gain a competitive edge.',
 'Meanwhile, the development of renewable energy sources is reshaping the energy sector. Solar, wind, and hydropower are becoming more viable options as technology advances and costs decrease. Governments worldwide are setting ambitious goals to reduce carbon emissions and combat climate change. This shift toward renewable energy not only mitigates environmental impact but also opens up new opportunities for economic growth and job creation in sustainable industries.',
 "Lastly, digital transformation is redefining how companie

In [64]:
## Recursive Character Text Splitter

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in order: tabs first, then newlines, then spaces.
text_splitter = RecursiveCharacterTextSplitter(
    separators=['\t', '\n', ' '],
    chunk_size=450,
    chunk_overlap=50,
    length_function=len,
)

In [65]:
text_splitter.create_documents(texts=[text,])

[Document(metadata={}, page_content='The rapid rise of artificial intelligence has transformed industries across the globe. From healthcare to finance, AI technologies are enabling faster decision-making and automating complex processes. In healthcare, AI-powered systems assist doctors in diagnosing diseases by analyzing medical images, while in finance, algorithms predict market trends and assess risks. As a result, businesses are embracing AI to improve efficiency and gain a'),
 Document(metadata={}, page_content='are embracing AI to improve efficiency and gain a competitive edge.'),
 Document(metadata={}, page_content='Meanwhile, the development of renewable energy sources is reshaping the energy sector. Solar, wind, and hydropower are becoming more viable options as technology advances and costs decrease. Governments worldwide are setting ambitious goals to reduce carbon emissions and combat climate change. This shift toward renewable energy not only mitigates environmental impact 

In [70]:
## Semantic Text Chunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()  # read OPENAI_API_KEY from a local .env file

# One embeddings client shared by all four splitters.
embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))

# Each splitter uses a different rule to decide where a breakpoint falls.
text_splitter_1 = SemanticChunker(embeddings=embeddings)  # default threshold type: percentile
# Note: percentile amounts are on a 0-100 scale, so 0.5 means the 0.5th percentile.
text_splitter_2 = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type='percentile', breakpoint_threshold_amount=0.5)
text_splitter_3 = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="interquartile")
text_splitter_4 = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type='standard_deviation', breakpoint_threshold_amount=1.9)

In [71]:
text_splitter_1.create_documents(texts=[text,])

[Document(metadata={}, page_content='\tThe rapid rise of artificial intelligence has transformed industries across the globe. From healthcare to finance, AI technologies are enabling faster decision-making and automating complex processes. In healthcare, AI-powered systems assist doctors in diagnosing diseases by analyzing medical images, while in finance, algorithms predict market trends and assess risks. As a result, businesses are embracing AI to improve efficiency and gain a competitive edge. Meanwhile, the development of renewable energy sources is reshaping the energy sector. Solar, wind, and hydropower are becoming more viable options as technology advances and costs decrease. Governments worldwide are setting ambitious goals to reduce carbon emissions and combat climate change. This shift toward renewable energy not only mitigates environmental impact but also opens up new opportunities for economic growth and job creation in sustainable industries. Lastly, digital transformati

In [None]:
text_splitter_2.create_documents(texts=[text,])

In [None]:
text_splitter_3.create_documents(texts=[text,])

In [None]:
text_splitter_4.create_documents(texts=[text,])
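
To illustrate what a percentile breakpoint means, here is a rough, self-contained sketch of semantic chunking. It is only an approximation of the idea, not the `SemanticChunker` implementation: `embed` is a toy word-count embedding standing in for a real embedding model, and `semantic_chunks`, `cosine_distance` and `percentile` are hypothetical helpers. The threshold is given on a 0-100 scale.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: list[float], q: float) -> float:
    """q-th percentile (q on a 0-100 scale) with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(text: str, threshold_percentile: float = 95) -> list[str]:
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            # Meaning shifts more than the cutoff allows: start a new chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


demo = (
    "Solar panels convert sunlight into power. Solar panels need sunlight to work. "
    "Banks use algorithms to assess risk. Banks use algorithms to predict markets."
)
for chunk in semantic_chunks(demo, threshold_percentile=50):
    print(chunk)
```

A lower percentile produces more breakpoints and therefore smaller chunks; a higher one breaks only at the sharpest topic changes.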

#### Vector Stores
* After the text is broken into chunks, the chunks can be converted into embeddings with an embedding model and saved in a vector store.
* Embedding models that can be used include:
    * OpenAI Embeddings
    * RoBERTa
    * BERT
* Vector stores include:
    * Pinecone
    * Chroma
    * FAISS
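
As a rough picture of what a vector store does in steps 3 and 4, the sketch below keeps chunks and their vectors in memory and ranks them by cosine similarity. Everything here is an assumption for illustration: `ToyVectorStore` is a hypothetical class, its method names only echo the style of LangChain's vector store interface, and `embed` is a toy word-count embedding rather than one of the models listed above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store that keeps each chunk next to its vector."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def add_texts(self, texts: list[str]) -> None:
        # Step 3: encode each chunk and store it with its vector.
        for text in texts:
            self._entries.append((text, embed(text)))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        # Step 4: encode the query and return the k most similar chunks.
        query_vector = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "AI helps doctors diagnose diseases from medical images.",
    "Solar and wind power are becoming cheaper.",
    "Cloud computing lets companies scale their services.",
])
print(store.similarity_search("How do doctors use AI on medical images?", k=1))
```

Real vector stores replace the linear scan with an index built for nearest-neighbour search, but the add-then-query flow is the same.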

In [None]:
## Chroma 


In [2]:
#generator function
def count_to_n(n):
    count = 1
    while count<n+1:
        yield count
        count += 1

counter = count_to_n(10)
type(counter)

generator

In [3]:
for count in counter:
    print(f"count = {count}")
    print("Break")

count = 1
Break
count = 2
Break
count = 3
Break
count = 4
Break
count = 5
Break
count = 6
Break
count = 7
Break
count = 8
Break
count = 9
Break
count = 10
Break
