# **Main Components of RAG (Retrieval-Augmented Generation)**

A RAG system consists of **three major components**:

---

## 1. **Indexing**

The indexing pipeline loads, processes, and stores data for future retrieval. This step is typically done **once**.
It can be broken down into four key stages:

- **Loading Data** – Reading raw documents (e.g., PDFs, web pages)
- **Splitting Data** – Breaking the content into manageable chunks
- **Embedding Data** – Converting text into vector representations using embedding models
- **Storing Data** – Saving the embeddings in a vector store for fast retrieval
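
The splitting stage can be illustrated with a minimal sketch. It is a simplified fixed-width character splitter, not the `RecursiveCharacterTextSplitter` used later in this notebook, which also tries to break on separators such as paragraph boundaries. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.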

---

## 2. **Retrieval**

Given a **user query**, this component:

- Searches the vector store
- Retrieves the **most relevant document chunks** based on similarity
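
As a rough illustration of similarity-based retrieval, the sketch below ranks chunks by cosine similarity between a query vector and precomputed chunk vectors. It assumes the embeddings already exist as plain lists of floats. A real vector store may use a different distance metric and an approximate index, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Toy 2-dimensional "embeddings", for illustration only
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = ["chunk a", "chunk b", "chunk c"]
print(top_k([1.0, 0.0], vectors, texts, k=2))
```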

---

## 3. **Generation**

The retrieved context is passed to a **language model**, which:

- Generates a complete, helpful, and coherent answer
- Uses the retrieved documents to **ground** the response
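
Grounding usually comes down to placing the retrieved chunks in the prompt and telling the model to rely on them. The sketch below shows one possible way to assemble such a prompt. The wording of the instructions and the helper name `build_grounded_prompt` are assumptions for illustration, not the prompt used later in this notebook.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user question into a single prompt string."""
    # Number the passages so the model (and a human reader) can tell them apart
    numbered = "\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"PASSAGES:\n{numbered}\n\n"
        f"QUESTION: {question}\n"
        "ANSWER:"
    )


print(build_grounded_prompt("What is RAG?", ["RAG combines retrieval with generation."]))
```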

---

> **Note:** Only the **retrieval and generation** steps run at query time. Indexing is a one-time setup that needs repeating only when the underlying data changes.
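
One way to decide whether the data has changed is to store a fingerprint of the source files next to the index and compare it on the next run. This is a minimal sketch under that assumption; the marker file and the helpers `fingerprint`, `needs_reindex`, and `record_index` are hypothetical and not part of the notebook below.

```python
import hashlib
from pathlib import Path


def fingerprint(paths) -> str:
    """Return a SHA-256 digest over the names and contents of the source files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_reindex(paths, marker_file) -> bool:
    """True when no fingerprint was recorded yet or the sources differ from it."""
    marker = Path(marker_file)
    if not marker.exists():
        return True
    return marker.read_text().strip() != fingerprint(paths)


def record_index(paths, marker_file) -> None:
    """Store the current fingerprint after a successful indexing run."""
    Path(marker_file).write_text(fingerprint(paths))
```

A caller would check `needs_reindex(...)` before running the indexing pipeline and call `record_index(...)` once it finishes.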



In [1]:
from IPython.display import display
from IPython.display import Markdown
import textwrap
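# Newer LangChain releases ship these in separate packages; older ones used
# langchain.document_loaders and langchain.text_splitter.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter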

In [2]:
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI

In [3]:
import chromadb

In [4]:
# Load the API key from a local file (keep api.txt out of version control)
with open('./api.txt', 'r') as file:
    API_KEY = file.read().strip()
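
A common alternative to a key file is an environment variable, with the file as a fallback. The sketch below assumes a variable named `GOOGLE_API_KEY`; that name, the helper `load_api_key`, and the error message are illustrative choices rather than something this notebook requires.

```python
import os
from pathlib import Path


def load_api_key(env_var="GOOGLE_API_KEY", fallback_file="./api.txt") -> str:
    """Read the key from an environment variable, falling back to a local file."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        path = Path(fallback_file)
        if path.is_file():
            key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or {fallback_file}")
    return key
```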


In [11]:
class Indexing:
    def __init__(self, pdf_path, embedding_model_class, api_key):
        self.pdf_paths = pdf_path
        self.embedding_model_class = embedding_model_class
        self.api_key = api_key

    def load_pdf(self):
        # load() returns one Document per page; chunking is left to split_data(),
        # so the text is not split twice
        self.pages = []
        for path in self.pdf_paths:
            pdfloader = PyPDFLoader(path)
            self.pages.extend(pdfloader.load())

    def split_data(self, chunk_size, chunk_overlap):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        content = "\n\n".join(str(p.page_content) for p in self.pages)
        self.texts = text_splitter.split_text(content)

    def store_data(self, path, name):
        # Adapter that lets Chroma call a LangChain embedder as its embedding function
        class LangChainEmbeddingFunction:
            def __init__(self, langchain_embedder):
                self.langchain_embedder = langchain_embedder

            def __call__(self, input):
                return self.langchain_embedder.embed_documents(input)

        embedder = self.embedding_model_class(model="models/embedding-001", google_api_key=self.api_key)
        embedding_function = LangChainEmbeddingFunction(embedder)

        chroma_client = chromadb.PersistentClient(path=path)

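        # Drop any existing collection with this name so the index is rebuilt from scratch
        try:
            chroma_client.delete_collection(name=name)
        except Exception:
            pass  # the collection did not exist yet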
        db = chroma_client.create_collection(name=name, embedding_function=embedding_function)

        for i, d in enumerate(self.texts):
            db.add(documents=[d], ids=[str(i)])

        self.db_path = path
        self.path = path  
        self.name = name

    def load_db_data_(self):
        class LangChainEmbeddingFunction:
            def __init__(self, langchain_embedder):
                self.langchain_embedder = langchain_embedder

            def __call__(self, input):
                return self.langchain_embedder.embed_documents(input)

        embedder = self.embedding_model_class(model="models/embedding-001", google_api_key=self.api_key)
        embedding_function = LangChainEmbeddingFunction(embedder)

        chroma_client = chromadb.PersistentClient(path=self.path)
        db = chroma_client.get_collection(name=self.name, embedding_function=embedding_function)
        return db
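
`store_data` above adds one chunk per `db.add` call, which embeds each chunk separately. Grouping chunks into batches should reduce the number of embedding requests. The sketch below only prepares the batches; the helper name `iter_batches` and the default batch size of 100 are assumptions, and the right size depends on the embedding API's limits.

```python
def iter_batches(texts, batch_size=100):
    """Yield (ids, documents) pairs that cover texts in order, batch_size at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        docs = texts[start:start + batch_size]
        # Same id scheme as store_data: the chunk's position as a string
        ids = [str(i) for i in range(start, start + len(docs))]
        yield ids, docs


# Hypothetical use inside store_data, replacing the per-chunk loop:
#     for ids, docs in iter_batches(self.texts):
#         db.add(documents=docs, ids=ids)
print(list(iter_batches(["a", "b", "c"], batch_size=2)))
```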

In [12]:
def get_relevant_data(query, db, n_results):
    # Chroma returns one list of documents per query text; only one query is sent here
    results = db.query(query_texts=[query], n_results=n_results)
    return results["documents"][0]
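
The `[0]` index is needed because the query result holds one inner list per query text. The sketch below works on a plain dictionary shaped like such a result and pairs each document with its distance, which can help when inspecting how close the matches are. It assumes the result carries a `distances` entry alongside `documents`; the helper name `pair_documents_with_distances` is hypothetical.

```python
def pair_documents_with_distances(results: dict) -> list[tuple]:
    """Return (document, distance) pairs for the first query in a query result."""
    documents = results["documents"][0]
    # Fall back to None distances if the result does not include them
    distances = (results.get("distances") or [[None] * len(documents)])[0]
    return list(zip(documents, distances))


# Hand-written stand-in for a query result, for illustration only
sample = {
    "ids": [["3", "7"]],
    "documents": [["first chunk", "second chunk"]],
    "distances": [[0.12, 0.48]],
}
print(pair_documents_with_distances(sample))
```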


In [13]:
class Generation():
    def __init__(self,llm):
        self.llm=llm
    def make_prompt(self,query,passage):
        esc=passage.replace("'","").replace('"', "").replace("\n", " ")
        prompt = ("""You are a helpful and informative bot that answers questions using text from the reference passage included below.\
              Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
          However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
          strike a friendly and conversational tone. \
          If the passage is irrelevant to the answer, you may ignore it.
          QUESTION: '{query}'
          PASSAGE: '{esc}'
        
          ANSWER:
          """).format(query=query, esc=esc)
        
        self.prompt=prompt
    def generate_answer(self, db, query, n_results=3):
        # Retrieve the top chunks, merge them into one passage, and ask the LLM
        relevant_text = get_relevant_data(query, db, n_results)
        self.make_prompt(query, passage=' '.join(relevant_text))
        answer = self.llm.invoke(self.prompt)
        return answer

In [14]:
pdf_paths = ['./cancer_and_cure__a_critical_analysis.27.pdf',
             './medical_oncology_handbook_june_2020_edition.pdf']
index_model = Indexing(pdf_paths, GoogleGenerativeAIEmbeddings, API_KEY)
index_model.load_pdf()

In [15]:
index_model.split_data(chunk_size=5000, chunk_overlap=1000)
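
With a chunk size of 5000 and an overlap of 1000, each chunk after the first advances by about 4000 characters. The sketch below turns that into a rough estimate of the chunk count for a text of a given length. It assumes fixed-width chunks, so the real splitter, which breaks on separators and often produces shorter chunks, will likely return more. The helper name `estimate_chunk_count` is hypothetical.

```python
def estimate_chunk_count(text_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many fixed-width overlapping chunks cover text_length characters."""
    if text_length <= 0:
        return 0
    step = chunk_size - chunk_overlap  # characters of new text per chunk
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Ceiling division of (text_length - chunk_overlap) by step, with at least one chunk
    return max(1, -(-(text_length - chunk_overlap) // step))


# 100,000 characters: ceil((100000 - 1000) / 4000) = ceil(24.75) = 25
print(estimate_chunk_count(100_000, chunk_size=5000, chunk_overlap=1000))
```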

In [16]:
index_model.store_data('./new_db/','database_medi')
db=index_model.load_db_data_()

In [17]:
llm=GoogleGenerativeAI(model='gemini-2.0-flash',api_key=API_KEY,temperature=0.3)

In [18]:
qa=Generation(llm)

In [19]:
query='what are the common side effects of systemic therapeutic agents?'
answer=qa.generate_answer(db,query)

In [20]:
Markdown(answer)

Common side effects of systemic therapeutic agents include issues like cardiac dysfunction with Adriamycin/Epirubicin, lung problems with Bleomycin, and kidney issues, nerve damage, and hearing loss with Cisplatin. Carboplatin doses need adjustment based on kidney function, while 5-FU can cause severe diarrhea and heart issues. Gemcitabine may lead to lung inflammation and swelling, Irinotecan can cause diarrhea and flushing, and Taxol/Paclitaxel can result in nerve damage and flu-like symptoms. Taxotere/Docetaxel can cause liver issues, swelling, nerve damage, and rash, while Oxaliplatin may lead to cold-induced nerve issues and spasms. Cyclophosphamide/Ifosfamide can affect kidney function and cause confusion, Capecitabine can cause mouth sores, hand-foot syndrome, rash, angina, and diarrhea, and Trastuzumab can affect heart function. Cetuximab/Panitumumab often cause an acne-like rash, Methotrexate requires folinic acid to mitigate its effects, and Caelyx/Liposomal doxorubicin can cause rash, hand-foot syndrome, and heart issues. Avastin/Bevacizumab may lead to high blood pressure and protein in the urine, Denosumab can cause low calcium levels, Dabrafenib may cause fever, rash, and skin cancers, and Mekinist is often combined with Dabrafenib to reduce the risk of skin cancer. Zolendronic acid can affect kidney function and calcium levels, and checkpoint inhibitors can cause autoimmune issues. Alopecia is also a common side effect with certain medications like those used in breast, ovarian, sarcoma, small cell lung cancer, and testicular regimens.