# RAGS- Retrival augumented generation
Is a method where we combine LLMs with a retrival system. This retrival system can search through vast sources of external information like documents, databases or knowledge bases. Whenever an LLM needs additianal sources of information.

This process involves breaking the data into multiple chunks, converting it into embeddings and storing it in a vector database.

# What are Tokens?
Tokens are crucial in that each llm usually has a token limit they can handle at once, which is referred to the context window. This basically means that if we have a pdf that is 1M tokens we cannot feed it to the llm at once.
GPT3.5 has has 16000 tokens and this includes both user message and ai message,
GPT 4 - 130,000 tokens
Gemini - 1,000,000 
# what is Chucking

Chucking is a process where we can split big documents into several smaller chucks, Next this chucks are converted to embeddings by the LLM embedder, this usually costs money. An embedding is a numerical represantation of text. This are then stored in a vector database(lets us search for chuck semantically based on meaning and not exact match)

# Embedding and Vector DBs
Embedding is a mathematical represantion of words, texts or even images. For example the word cat can have a vector embedding that can look like [32,32,321,4,5,3]
a db that stores vector embeddings is called a vector database eg mongo db atlas, chromadb

In [None]:
import os

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

def main():
    # current_dir = os.path.dirname(os.path.abspath(__file__))
    current_dir = os.getcwd()
    file_path = os.path.join(current_dir,"documents","lord_of_the_rings.txt")
    # check if chroma directory exists
    chroma_dir = os.path.join(current_dir,"db", "chroma_db")
    if not os.path.exists(chroma_dir):
        os.makedirs(chroma_dir)
    if not os.path.exists(file_path):
        print(f"File not found: {file_path}")
        raise FileNotFoundError(f"File not found: {file_path}")
    # Load the text file
    loader = TextLoader(file_path, encoding="utf8")
    documents = loader.load()
    # Split the text into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    texts = text_splitter.split_documents(documents)
    print("\n--- Document Chunks Information ---")
    print(f"Number of document chunks: {len(documents)}")
    print(f"Sample chunk:\n{documents[0].page_content}\n")

    # Create embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    print("\n--- Embeddings Information ---")
    print(f"Embedding model: {embeddings.model}")
    # Create a Chroma vector store
    vector_store = Chroma.from_documents(texts, embeddings, persist_directory=chroma_dir)
    print("\n--- Vector Store Information ---")
    print(f"Vector store created with {len(texts)} chunks.")
    # Persist the vector store
    # vector_store.persist()
    print(f"Vector store persisted at: {chroma_dir}")

if __name__ == "__main__":
    main()
# This script loads a text file, splits it into chunks, creates embeddings for those chunks,




Created a chunk of size 1619, which is longer than the specified 1000
Created a chunk of size 1315, which is longer than the specified 1000
Created a chunk of size 1058, which is longer than the specified 1000
Created a chunk of size 1343, which is longer than the specified 1000
Created a chunk of size 1329, which is longer than the specified 1000
Created a chunk of size 1991, which is longer than the specified 1000
Created a chunk of size 1414, which is longer than the specified 1000
Created a chunk of size 1103, which is longer than the specified 1000
Created a chunk of size 1198, which is longer than the specified 1000
Created a chunk of size 1232, which is longer than the specified 1000
Created a chunk of size 1195, which is longer than the specified 1000
Created a chunk of size 1045, which is longer than the specified 1000
Created a chunk of size 1503, which is longer than the specified 1000
Created a chunk of size 1349, which is longer than the specified 1000
Created a chunk of s


--- Document Chunks Information ---
Number of document chunks: 1
Sample chunk:
This book is largely concerned with Hobbits, and from its pages a
reader may discover much of their character and a little of their
history. Further information will also be found in the selection from
the Red Book of Westmarch that has already been published, under
the title of The Hobbit. That story was derived from the earlier chapters of the Red Book, composed by Bilbo himself, the first Hobbit
to become famous in the world at large, and called by him There and
Back Again, since they told of his journey into the East and his return:
an adventure which later involved all the Hobbits in the great events
of that Age that are here related.

The story of the Ring is, of course, known to many, but what most do not realize is that the power of the Ring is subtle and its evil grows with time. That is why the burden falls upon Frodo Baggins, a simple Hobbit from the Shire, to bear it. He will carry the weight of

# RETRIVING THE DATA STORED ABOVE and asking questions

In [13]:
import os
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

def load_vector_store():
    current_dir = os.getcwd()
    chroma_dir = os.path.join(current_dir, "db", "chroma_db")
    if not os.path.exists(chroma_dir):
        raise FileNotFoundError(f"Chroma directory not found: {chroma_dir}")
    
    # Load the vector store
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(persist_directory=chroma_dir, embedding_function=embeddings)
    query = "Where does Gandalf meet Frodo?"
    results = vector_store.as_retriever(search_type="similarity_score_threshold", search_kwargs={"k": 5, "score_threshold": 0.1})
    relevant_docs = results.invoke(query)
    print("\n--- Similarity Search Results ---")
    for i, result in enumerate(relevant_docs, 1):
        print(f"Result {i+1}:")
        print(f"Content: {result.page_content}")
        print(f"Metadata: {result.metadata}\n")
    # Return the vector store
    # and loads it into memory for further use.
    # It also performs a similarity search on the vector store.
    # and prints the results.
    # print(f"Vector store loaded with {len(vector_store)} documents.")
    # return vector_store

if __name__ == "__main__":
    load_vector_store()
    # You can now use `vector_store` for further operations.
    # For example, you can perform more queries or save it to a different format.
    



--- Similarity Search Results ---
Result 2:
Content: It was in the heart of Hobbiton, at the home of Frodo Baggins, that Gandalf came to speak of matters far greater than Frodo could have imagined. This was the moment that Frodo learned of the perilous journey ahead and the burden of being the bearer of the Ring.
Metadata: {'source': '/Users/farhan/codebase/langchain/rags/documents/lord_of_the_rings.txt'}

Result 3:
Content: It was in the heart of Hobbiton, at the home of Frodo Baggins, that Gandalf came to speak of matters far greater than Frodo could have imagined. This was the moment that Frodo learned of the perilous journey ahead and the burden of being the bearer of the Ring.
Metadata: {'source': '/Users/farhan/codebase/langchain/rags/documents/lord_of_the_rings.txt'}

Result 4:
Content: Gandalf had been a friend to the Bagginses for many years, and he came to Hobbiton to visit Frodo one summer day. He found him sitting outside Bag End, the home of the Baggins family. It was her

# RAGS with metadata

In [15]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

def load_and_split_text(file_path):
    # Load the text file
    docs = []
    loader = TextLoader(file_path, encoding="utf8")
    documents = loader.load()

    for doc in documents:
            
            doc.metadata['source'] = file_path
            docs.append(doc)
    
    # Split the text into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    texts = text_splitter.split_documents(docs)
    
    print(f"Number of document chunks: {len(texts)}")
    print(f"Sample chunk:\n{texts[0].page_content}\n")
    
    return texts
def create_vector_store(texts, persist_directory):
    # Create embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    # Create a Chroma vector store
    vector_store = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
    
    print(f"Vector store created with {len(texts)} chunks.")
    
    # Persist the vector store
    # vector_store.persist()
    print(f"Vector store persisted at: {persist_directory}")
    
    return vector_store
def main():
    current_dir = os.getcwd()
    doc_dir = os.path.join(current_dir, "documents")
    db_dir = os.path.join(current_dir, "db", "chroma_db")
    persistant_file = os.path.join(db_dir, "chroma_with_metadata")
    if not os.path.exists(db_dir):
        os.makedirs(db_dir)
    if not os.path.exists(doc_dir):
        print(f"Document directory not found: {doc_dir}")
        raise FileNotFoundError(f"Document directory not found: {doc_dir}")
    
    docs = [doc for doc in os.listdir(doc_dir) if doc.endswith('.txt')]
    if not docs:
        print(f"No text files found in the directory: {doc_dir}")
        raise FileNotFoundError(f"No text files found in the directory: {doc_dir}")
    for doc in docs:
        file_path = os.path.join(doc_dir, doc)
        print(f"Processing file: {file_path}")
        texts = load_and_split_text(file_path)
        vector_store = create_vector_store(texts, persist_directory=persistant_file)
    print(f"Vector store created and persisted for all documents in {doc_dir} at {db_dir}")
if __name__ == "__main__":
    main()
# This script loads a text file, splits it into chunks, creates embeddings for those chunks,
 

Created a chunk of size 1619, which is longer than the specified 1000
Created a chunk of size 1315, which is longer than the specified 1000
Created a chunk of size 1058, which is longer than the specified 1000
Created a chunk of size 1343, which is longer than the specified 1000
Created a chunk of size 1329, which is longer than the specified 1000
Created a chunk of size 1991, which is longer than the specified 1000
Created a chunk of size 1414, which is longer than the specified 1000
Created a chunk of size 1103, which is longer than the specified 1000
Created a chunk of size 1198, which is longer than the specified 1000
Created a chunk of size 1232, which is longer than the specified 1000
Created a chunk of size 1195, which is longer than the specified 1000
Created a chunk of size 1045, which is longer than the specified 1000
Created a chunk of size 1503, which is longer than the specified 1000
Created a chunk of size 1349, which is longer than the specified 1000
Created a chunk of s

Processing file: /Users/farhan/codebase/langchain/rags/documents/lord_of_the_rings.txt
Number of document chunks: 43
Sample chunk:
This book is largely concerned with Hobbits, and from its pages a
reader may discover much of their character and a little of their
history. Further information will also be found in the selection from
the Red Book of Westmarch that has already been published, under
the title of The Hobbit. That story was derived from the earlier chapters of the Red Book, composed by Bilbo himself, the first Hobbit
to become famous in the world at large, and called by him There and
Back Again, since they told of his journey into the East and his return:
an adventure which later involved all the Hobbits in the great events
of that Age that are here related.

The story of the Ring is, of course, known to many, but what most do not realize is that the power of the Ring is subtle and its evil grows with time. That is why the burden falls upon Frodo Baggins, a simple Hobbit from

Created a chunk of size 1610, which is longer than the specified 1000
Created a chunk of size 1562, which is longer than the specified 1000
Created a chunk of size 1063, which is longer than the specified 1000
Created a chunk of size 1543, which is longer than the specified 1000
Created a chunk of size 2597, which is longer than the specified 1000
Created a chunk of size 2613, which is longer than the specified 1000
Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1251, which is longer than the specified 1000
Created a chunk of size 1534, which is longer than the specified 1000
Created a chunk of size 1323, which is longer than the specified 1000
Created a chunk of size 1211, which is longer than the specified 1000
Created a chunk of size 1071, which is longer than the specified 1000
Created a chunk of size 1536, which is longer than the specified 1000
Created a chunk of size 1021, which is longer than the specified 1000
Created a chunk of s

Vector store created with 43 chunks.
Vector store persisted at: /Users/farhan/codebase/langchain/rags/db/chroma_db/chroma_with_metadata
Processing file: /Users/farhan/codebase/langchain/rags/documents/dracula.txt


Created a chunk of size 1860, which is longer than the specified 1000
Created a chunk of size 1873, which is longer than the specified 1000
Created a chunk of size 1591, which is longer than the specified 1000
Created a chunk of size 1053, which is longer than the specified 1000
Created a chunk of size 1085, which is longer than the specified 1000
Created a chunk of size 2918, which is longer than the specified 1000
Created a chunk of size 2112, which is longer than the specified 1000
Created a chunk of size 4144, which is longer than the specified 1000
Created a chunk of size 2179, which is longer than the specified 1000
Created a chunk of size 1262, which is longer than the specified 1000
Created a chunk of size 2192, which is longer than the specified 1000
Created a chunk of size 1376, which is longer than the specified 1000
Created a chunk of size 1073, which is longer than the specified 1000
Created a chunk of size 1076, which is longer than the specified 1000
Created a chunk of s

Number of document chunks: 964
Sample chunk:
The Project Gutenberg eBook of Dracula
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Dracula

Author: Bram Stoker

Release date: October 1, 1995 [eBook #345]
                Most recently updated: November 12, 2023

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team


*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***


                                DRACULA

                                  _by_

                              Bram Stoker

                        [Illustration: colophon]

             

# Load docs with metadata

In [28]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

def load_vector_store(persistant_file):
    # Load the vector store
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(persist_directory=persistant_file, embedding_function=embeddings)
    
    # print(f"Vector store loaded with {len(vector_store)} documents.")
    
    return vector_store
def main():
    current_dir = os.getcwd()
    db_dir = os.path.join(current_dir, "db", "chroma_db")
    persistant_file = os.path.join(db_dir, "chroma_with_metadata")
    print(f"Loading vector store from: {persistant_file}")
    if not os.path.exists(db_dir):
        print(f"Database directory not found: {db_dir}")
        raise FileNotFoundError(f"Database directory not found: {db_dir}")
    if not os.path.exists(persistant_file):
        print(f"Persistant file not found: {persistant_file}")
        raise FileNotFoundError(f"Persistant file not found: {persistant_file}")
    vector_store = load_vector_store(persistant_file)
    query = "Where does Gandalf meet Frodo?"
    results = vector_store.as_retriever(search_type="similarity_score_threshold", search_kwargs={"k": 5, "score_threshold": 0.1})
    relevant_docs = results.invoke(query)
    print("\n--- Similarity Search Results ---")
    # for i, result in enumerate(relevant_docs, 1):
    #     print(f"Result {i}:")
    #     print(f"Content: {result.page_content}")
    #     print(f"Metadata: {result.metadata}\n")

    combined_content = (
    "Here are some documents that might help answer the question: "
    + query
    + "\nrelevant docs\n"
    + "\n".join([doc.page_content for doc in relevant_docs])
    + "\n\nPlease provide a rough answer based only on the provided documents. "
      "If the answer is not found in the documents, respond with 'I'm not sure'."
)

    model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    messages = [
        SystemMessage(content="You are a helpful assistant."),
        HumanMessage(content=combined_content)
    ]
    response = model.invoke(messages)
    print("\n--- Model Response ---")
    print(response.content)




if __name__ == "__main__":
    main()
# and loads it into a Chroma vector store.
# It also performs a similarity search on the vector store.


Loading vector store from: /Users/farhan/codebase/langchain/rags/db/chroma_db/chroma_with_metadata

--- Similarity Search Results ---

--- Model Response ---
Gandalf meets Frodo at Frodo's home in Hobbiton, specifically at Bag End, where Gandalf informs Frodo about the dangers of the One Ring and the perilous journey ahead.
