In [1]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_google_genai.embeddings import GoogleGenerativeAIEmbeddings

In [2]:
load_dotenv()

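# Enable LangSmith tracing (newer and legacy variable names).
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# load_dotenv() has already exported the keys from .env, so only verify they exist.
# Assigning os.getenv(...) back into os.environ raises TypeError when a key is missing.
for key in ("LANGCHAIN_API_KEY", "GROQ_API_KEY", "GOOGLE_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing environment variable: {key}")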


In [3]:
llm_model = ChatGroq(model="openai/gpt-oss-20b", temperature=0.5)
embedding_model = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001")

In [4]:
embedding_model.embed_query("Hola Amigos")

[-0.043887626,
 0.006160656,
 0.0020316977,
 -0.06461505,
 -0.00992698,
 0.00913335,
 -0.016391626,
 0.017985463,
 -0.022929356,
 0.007145739,
 -0.0054923827,
 -0.001151479,
 0.0037405135,
 0.011117922,
 0.11373654,
 0.0004198669,
 0.0007637996,
 -0.016790047,
 -0.0042107184,
 -0.007299354,
 -0.009247868,
 -0.010123065,
 -0.005286812,
 -0.0010112347,
 0.0017003999,
 0.004720265,
 0.005394482,
 0.0028646945,
 0.025659237,
 0.0031024392,
 0.022869011,
 0.0043197228,
 -0.0033043574,
 0.01105133,
 -0.0061619175,
 0.011692537,
 0.0074572014,
 0.010765835,
 -0.010969597,
 0.008745613,
 0.0075684097,
 -0.0113702575,
 -0.01430676,
 0.0052262577,
 -0.0012234662,
 -0.017486593,
 0.011730032,
 0.0098373,
 0.0075083817,
 0.0112323705,
 -0.024936488,
 -0.0072040255,
 -0.005934218,
 -0.15488249,
 -0.0072409953,
 -0.007824247,
 -0.008863838,
 0.016680203,
 0.0015726611,
 -0.0017900111,
 -0.0029041893,
 0.019131612,
 -0.018332131,
 0.0014290505,
 -0.008637148,
 0.026642626,
 0.0008796761,
 0.010393917

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pdf_loader = PyPDFLoader("../data/pdf/1763651405773.pdf")
all_doc = pdf_loader.load()

In [6]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splitted_text = splitter.split_documents(all_doc)
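
To see what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification for illustration only, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each chunk starts with the last four characters of the previous one, which is the role the 200-character overlap plays for the 1000-character chunks above.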

In [7]:
splitted_text[:5]

[Document(metadata={'producer': 'PyPDF', 'creator': 'PyPDF', 'creationdate': '', 'source': '../data/pdf/1763651405773.pdf', 'total_pages': 42, 'page': 1, 'page_label': '2'}, page_content='🚀 AIxFunda  Newsletter                                                           aixfunda.substack.com                           Q1.  Explain  the  requirement  of  RAG  when  LLMs  are  already  powerful.   LLMs  are  powerful,  as  they  are  trained  on  large  volumes  of  data  using  sophisticated  \ntechniques.\n \nHowever,\n \nLLMs\n \nbecause\n \nof\n \nknowledge\n \ncutoff\n \n(static\n \nknowledge),\n \nstruggle\n \nto\n \nanswer\n \nqueries\n \nrelated\n \nto\n \nthe\n \nlatest\n \nevents\n \nor\n \nthe\n \ndata\n \nnot\n \npresent\n \nin\n \ntheir\n \ntraining\n \ncorpus.\n \n \n RAG  addresses  this  challenge  by  retrieving  relevant  context  from  external  knowledge  \nsources,\n \nwhich\n \nallows\n \nLLMs\n \nto\n \nprovide\n \naccurate\n \nresponses.\n \nThis\n \nis\n \nwhy\n 

In [8]:
import re

# Collapse the runs of whitespace and newlines left by PDF extraction.
text = splitted_text[0].page_content
text = re.sub(r'\s+', ' ', text).strip()
text

'🚀 AIxFunda Newsletter aixfunda.substack.com Q1. Explain the requirement of RAG when LLMs are already powerful. LLMs are powerful, as they are trained on large volumes of data using sophisticated techniques. However, LLMs because of knowledge cutoff (static knowledge), struggle to answer queries related to the latest events or the data not present in their training corpus. RAG addresses this challenge by retrieving relevant context from external knowledge sources, which allows LLMs to provide accurate responses. This is why RAG is essential for LLM-based applications that need to be accurate. Otherwise, LLMs alone might provide you answers that are incomplete or outdated. Q2. Is RAG still relevant in the era of long'

In [9]:
# Normalize whitespace in every chunk in place.
for doc in splitted_text:
    doc.page_content = re.sub(r'\s+', ' ', doc.page_content).strip()

In [10]:
splitted_text[0]

Document(metadata={'producer': 'PyPDF', 'creator': 'PyPDF', 'creationdate': '', 'source': '../data/pdf/1763651405773.pdf', 'total_pages': 42, 'page': 1, 'page_label': '2'}, page_content='🚀 AIxFunda Newsletter aixfunda.substack.com Q1. Explain the requirement of RAG when LLMs are already powerful. LLMs are powerful, as they are trained on large volumes of data using sophisticated techniques. However, LLMs because of knowledge cutoff (static knowledge), struggle to answer queries related to the latest events or the data not present in their training corpus. RAG addresses this challenge by retrieving relevant context from external knowledge sources, which allows LLMs to provide accurate responses. This is why RAG is essential for LLM-based applications that need to be accurate. Otherwise, LLMs alone might provide you answers that are incomplete or outdated. Q2. Is RAG still relevant in the era of long')

In [11]:
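# Note: this embeds the first 20 full pages from all_doc, not the chunks in splitted_text.
texts = [re.sub(r'\s+', ' ', doc.page_content).strip() for doc in all_doc[:20]]
print(len(texts))
texts = [t for t in texts if t]  # drop pages with no extractable text
print(len(texts))
embeddings = embedding_model.embed_documents(texts)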

20
19


In [12]:
docs_to_insert = [
    {
        "text": t,
        "embeddings": embed,
    }
    for t,embed in zip(texts, embeddings)
]

In [13]:
docs_to_insert[0]

{'text': '🚀 AIxFunda Newsletter aixfunda.substack.com Q1. Explain the requirement of RAG when LLMs are already powerful. LLMs are powerful, as they are trained on large volumes of data using sophisticated techniques. However, LLMs because of knowledge cutoff (static knowledge), struggle to answer queries related to the latest events or the data not present in their training corpus. RAG addresses this challenge by retrieving relevant context from external knowledge sources, which allows LLMs to provide accurate responses. This is why RAG is essential for LLM-based applications that need to be accurate. Otherwise, LLMs alone might provide you answers that are incomplete or outdated. Q2. Is RAG still relevant in the era of long context LLMs? RAG is still important even with long context LLMs. This is because long-context LLMs without RAG have three big problems: "lost in the middle,", high API costs, and increased latency. Long-context LLMs often struggle to find the most relevant info

In [14]:
from pymongo.mongo_client import MongoClient

# MONGO_URI was already exported from .env by the earlier load_dotenv() call.
uri = os.getenv("MONGO_URI")
if not uri:
    raise RuntimeError("Missing environment variable: MONGO_URI")
client = MongoClient(uri)

collection = client['sample_mflix']["ragpdf"]

In [15]:
results = collection.insert_many(docs_to_insert)

In [16]:
from pymongo.operations import SearchIndexModel
import time

index_name = "vector_embeddings"
search_index_model = SearchIndexModel(
    definition={
        "fields": [{
            "type": "vector",
            "numDimensions": 3072,
            "path": "embeddings",
            "similarity": "cosine"
        }]
    },
    name=index_name,
    type="vectorSearch"
    )
collection.create_search_index(model=search_index_model)

'vector_embeddings'

In [17]:
# Poll until Atlas reports the new search index as queryable.
while True:
    indices = list(collection.list_search_indexes(index_name))
    if indices and indices[0].get("queryable") is True:
        break
    time.sleep(5)
print(index_name + " is ready for querying")

vector_embeddings is ready for querying
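
The loop above waits forever if the index never becomes queryable. A minimal sketch of the same polling idea with a timeout follows. `wait_until` and `fake_index_ready` are hypothetical names, the timeout and interval values are arbitrary, and the stand-in check replaces the real Atlas call so the example runs on its own.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 300.0, interval_s: float = 5.0) -> None:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Stand-in for the index status check: reports ready on the third poll.
polls = {"count": 0}


def fake_index_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


wait_until(fake_index_ready, timeout_s=1.0, interval_s=0.01)
print(polls["count"])
```

In this notebook the check could be something like `lambda: any(ix.get("queryable") for ix in collection.list_search_indexes(index_name))`.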


In [18]:
query_embedding = embedding_model.embed_query("How do you choose the chunk size for a RAG system?")

In [None]:
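# `collection` already points at "ragpdf"; `collection.ragpdf` would query a sub-collection named "ragpdf.ragpdf".
results = collection.aggregate(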
    pipeline=[{
        "$vectorSearch": {
            # Name of the Atlas Vector Search index to use.
            "index": "vector_embeddings",
            # Indexed vectorEmbedding type field to search.
            "path": "embeddings",
            # Array of numbers that represent the query vector.
            # The array size must match the number of vector dimensions specified in the index definition for the field.
            "queryVector": query_embedding,
            # Number of nearest neighbors to use during the search.
            # Value must be less than or equal to (<=) 10000.
            "numCandidates": 3072,
            "limit": 10,
            }}
        ])

<pymongo.synchronous.command_cursor.CommandCursor at 0x1f3c132f7a0>

In [27]:
def get_query_results(query: str):
    query_embedding = embedding_model.embed_query(query)
    pipeline = [{
        "$vectorSearch": {
            "index": "vector_embeddings",
            "path": "embeddings",
            "queryVector": query_embedding,
            "numCandidates": 3072,
            "limit": 10,
        }
    }]
    results = collection.aggregate(pipeline=pipeline)
    # Materialize the cursor so the results can be indexed and reused.
    return list(results)
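
The pipeline above returns whole documents, including the 3072-float `embeddings` field, and no similarity score. Below is a sketch of a pipeline builder that adds a `$project` stage to keep only the text and the `vectorSearchScore` metadata. `build_vector_search_pipeline` is a hypothetical helper, and the default of 100 candidates is an assumption, not a value from this notebook.

```python
def build_vector_search_pipeline(query_vector, limit=10, num_candidates=100,
                                 index="vector_embeddings", path="embeddings"):
    """Return a $vectorSearch pipeline that also projects each hit's similarity score."""
    # Atlas requires limit <= numCandidates, and numCandidates <= 10000.
    if not limit <= num_candidates <= 10000:
        raise ValueError("need limit <= numCandidates <= 10000")
    return [
        {"$vectorSearch": {
            "index": index,
            "path": path,
            "queryVector": list(query_vector),
            "numCandidates": num_candidates,
            "limit": limit,
        }},
        {"$project": {
            "_id": 0,       # drop the ObjectId
            "text": 1,      # keep the page text
            "score": {"$meta": "vectorSearchScore"},  # similarity score of the hit
        }},
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=5)
print(pipeline[1])
```

The result can be passed straight to `collection.aggregate(pipeline)`; each returned document then has only `text` and `score`.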
    

In [28]:
results = get_query_results("How do you choose the chunk size for a RAG system?")

In [30]:
results[0]["text"]

"🚀 AIxFunda Newsletter aixfunda.substack.com The optimal size depends on the use case, document structure, embedding model, and the generator (LLM) model. For example, smaller chunks are suitable for fact-based queries, and more complex queries benefit from larger ones. Q14. What are the potential consequences of having chunks that are too large versus chunks that are too small? Large chunks often mix different topics into one chunk and reduce the chunk's relevance. This can lead to coarse vector representations and less accurate retrieval. Large chunks can also add noise and confuse the model with irrelevant information that isn't important, resulting in a less accurate answer. Small chunk sizes in RAG systems can lead to fragmented context. This fragmentation often leads to poor retrieval quality because information that is semantically related may be split up into chunks that are not retrieved together. Furthermore, smaller chunks mean that there are more chunks in the vector dat

In [31]:
results[1]["text"]

'🚀 AIxFunda Newsletter aixfunda.substack.com AI search engines are a great example of how RAG systems have changed the way people find information online. AI search engines give you accurate, relevant answers by combining information retrieval with generative AI. For instance, RAG-based AI search platforms like Perplexity AI improve the user experience by fetching the most recent and relevant information from large knowledge bases and then giving it back in the format that the user wants. Q11. Explain the steps in the indexing process in a RAG pipeline. There are four steps in the indexing process of a RAG pipeline: parsing, chunking, encoding, and storing. The parsing step deals with extracting the document content. Then, the chunking step splits the extracted content into smaller pieces called chunks. The encoding step uses an embedding model to convert chunks into dense numerical vectors called embeddings. Finally, these embeddings are saved in a vector database for efficient sea

In [41]:
def RAG(query):
    context_doc = get_query_results(query)
    context_string = " ".join([doc["text"] for doc in context_doc])
    
    prompt = f"""
    Use the following context to answer the user's question at the end. Answer only from this context; do not make up information of your own.
    Context: {context_string}
    Question: {query}
    """
    
    response = llm_model.invoke(input=prompt)
    
    return response
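
The `RAG` function above joins all ten retrieved pages into one undelimited string. As one possible refinement, the sketch below numbers each chunk and stops adding chunks once a character budget is reached. `build_prompt` is a hypothetical helper, and the 6000-character default is an arbitrary assumption, not a limit of the model used here.

```python
def build_prompt(query: str, chunks: list[str], max_chars: int = 6000) -> str:
    """Number each retrieved chunk and stop adding chunks once max_chars is reached."""
    parts, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_chars:
            break  # adding this chunk would exceed the budget
        parts.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


demo_prompt = build_prompt(
    "What is RAG?",
    ["RAG retrieves context.", "LLMs have a knowledge cutoff."],
)
print(demo_prompt)
```

Because vector search returns the most similar chunks first, truncating from the end drops the least relevant context.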

In [42]:
answer = RAG("Explain the retrieval process step-by-step in a RAG pipeline")

In [43]:
print(answer.content)

**Retrieval process in a RAG pipeline – step‑by‑step**

1. **Encode the user query**  
   * Convert the query into a dense vector using an embedding model.  

2. **Search the vector database**  
   * Use the query vector to query the vector database that stores embeddings of all document chunks.  

3. **Compute similarity scores**  
   * The database calculates similarity (e.g., cosine similarity) between the query vector and each chunk embedding.  

4. **Return the most relevant chunks**  
   * Based on the similarity scores, the system returns the top‑k most relevant document chunks to the generator.  

These retrieved chunks are then supplied to the LLM as context for answer generation.
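
The four steps in the answer above can be sketched in plain Python as an exhaustive cosine-similarity search over toy 2-dimensional vectors. This only illustrates the idea: Atlas `$vectorSearch` performs an approximate search over `numCandidates` candidates rather than scanning every vector, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Score every chunk against the query and return the k best (index, score) pairs."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy "chunk embeddings" and a query vector that points mostly along the first axis.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], toy_chunks, k=2)
print(hits)
```

Chunk 0 ranks first because it points in nearly the same direction as the query, followed by chunk 2.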
