
In [6]:
# Step 1: Install the required Python packages.
# The leading "!" runs pip as a shell command inside Colab/Jupyter.
!pip install chromadb sentence-transformers transformers torch

# Step 2: Import the libraries used throughout the notebook.
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
# Step 3: Define the RAG Pipeline Class
# Encapsulate the entire RAG logic in a reusable class.
# This makes the pipeline modular and easier to manage.

class RagPipeline:
    """
    This class implements a Retrieval-Augmented Generation (RAG) pipeline.
    It uses Sentence-Transformers for embeddings, ChromaDB for vector storage
    and retrieval, and a FLAN-T5 model for text generation.
    """

    def __init__(self, model_name='all-MiniLM-L6-v2', generator_model_name='google/flan-t5-base'):
        """
        Initializes the RAG pipeline.

        Args:
            model_name (str): The name of the Sentence-Transformer model to use for embeddings.
            generator_model_name (str): The name of the text generation model (e.g., FLAN-T5).
        """
        # --- Embedding Model Initialization ---
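        # Purpose: Load the Sentence-Transformer model, which converts text
        # into numerical vectors (embeddings). This instance is available for
        # encoding text directly; ChromaDB embeds documents and queries through
        # its own wrapper around the same model, created further down.
        # Keyword `SentenceTransformer`: The main class from the sentence-transformers
        # library, used to load pre-trained models.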
        print("Initializing embedding model...")
        self.embedding_model = SentenceTransformer(model_name)

        # --- ChromaDB Initialization ---
        # Purpose: Set up the vector database client and a collection.
        # A collection in ChromaDB is a container for documents and their embeddings.
        # Keyword `chromadb.PersistentClient`: Creates a client that stores its data
        # on disk at `path`, so the collection survives restarts.
        # For a throwaway in-memory instance, use `chromadb.Client()` instead.
        print("Initializing ChromaDB...")
        self.chroma_client = chromadb.PersistentClient(path="./chroma_db")

        # --- Embedding Function for ChromaDB ---
        # Purpose: ChromaDB expects embeddings to come from an embedding-function object.
        # `SentenceTransformerEmbeddingFunction` is ChromaDB's built-in wrapper that loads
        # the named Sentence-Transformer model and exposes it in that format.
        self.embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

        # --- Collection Creation ---
        # Purpose: Create a new collection in ChromaDB to store our documents.
        # If the collection already exists, it will be retrieved.
        # Keyword `get_or_create_collection`: A convenient method to either create a new
        # collection or get a handle to an existing one.
        # Keyword `embedding_function`: We pass the embedding function so ChromaDB knows
        # how to convert text to embeddings automatically, both on insertion and at query time.
        self.collection_name = "document_collection"
        self.collection = self.chroma_client.get_or_create_collection(
            name=self.collection_name,
            embedding_function=self.embedding_function
        )
        print(f"ChromaDB collection '{self.collection_name}' is ready.")


        # --- Text Generation Model Initialization ---
        # Purpose: Load the pre-trained FLAN-T5 model and its tokenizer.
        # This model will generate the final answer based on the user's query
        # and the context retrieved from ChromaDB.
        # Keyword `T5ForConditionalGeneration`: The class from the `transformers` library
        # for loading T5-based models for tasks like summarization and question-answering.
        # Keyword `T5Tokenizer`: The corresponding tokenizer that preprocesses text
        # into a format the T5 model can understand.
        print("Initializing text generation model (FLAN-T5)...")
        self.generator_tokenizer = T5Tokenizer.from_pretrained(generator_model_name)
        self.generator_model = T5ForConditionalGeneration.from_pretrained(generator_model_name)
        print("RAG Pipeline initialization complete.")


    def add_documents(self, documents):
        """
        Adds documents to the ChromaDB collection.

        Args:
            documents (list of str): A list of text documents to be indexed.
        """
        # --- Document Ingestion ---
        # Purpose: Add a list of documents to the ChromaDB collection.
        # ChromaDB will automatically use the configured embedding function
        # to convert these documents into vectors and store them.
        # Keyword `self.collection.add`: The method to add documents.
        # It requires a list of documents and a corresponding list of unique IDs.
        # Caveat: the positional IDs below restart at doc_0 on every call, so a
        # second batch would reuse IDs already stored in the persistent collection.
        print(f"Adding {len(documents)} documents to the collection...")
        self.collection.add(
            documents=documents,
            ids=[f"doc_{i}" for i in range(len(documents))]  # Positional IDs
        )
        print("Documents added successfully.")


    def retrieve(self, query, n_results=3):
        """
        Retrieves relevant documents from ChromaDB based on a query.

        Args:
            query (str): The user's query.
            n_results (int): The number of top relevant documents to retrieve.

        Returns:
            list of str: A list of the most relevant document texts.
        """
        # --- Similarity Search / Retrieval ---
        # Purpose: Find the documents in the collection that are most semantically
        # similar to the user's query.
        # Keyword `self.collection.query`: The method to perform a similarity search.
        # It automatically embeds the query using the same model used for the documents
        # and finds the nearest neighbors in the vector space.
        print(f"Retrieving top {n_results} documents for query: '{query}'")
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results
        )
        # The result is a dictionary; 'documents' holds one list of texts per
        # query, so we take the first (and only) list.
        retrieved_docs = results['documents'][0]
        print("Retrieved documents.")
        return retrieved_docs


    def generate(self, query, context_docs):
        """
        Generates an answer using the FLAN-T5 model based on the query and context.

        Args:
            query (str): The user's original query.
            context_docs (list of str): The list of documents retrieved from ChromaDB.

        Returns:
            str: The generated answer.
        """
        # --- Prompt Engineering ---
        # Purpose: Construct a detailed prompt for the generation model.
        # This prompt includes the retrieved documents as context, instructing
        # the model to answer the user's query based *only* on this context.
        # This is the core of RAG - augmenting the query with retrieved information.
        context = "\n".join(context_docs)
        prompt = f"""
        Answer the following question based only on the context below.
        If the context does not contain the answer, state that you don't know.

        Context:
        {context}

        Question: {query}

        Answer:
        """
        print("Generating response with FLAN-T5...")

        # --- Text Generation ---
        # Purpose: Use the prepared prompt to get the final answer from the FLAN-T5 model.
        # Keyword `self.generator_tokenizer`: Encodes the prompt string into tokens.
        # `return_tensors='pt'` specifies that the output should be PyTorch tensors.
        # Keyword `self.generator_model.generate`: The core generation method.
        # It takes the input tokens and generates a sequence of output tokens.
        # Keyword `self.generator_tokenizer.decode`: Converts the output tokens back into a human-readable string.
        inputs = self.generator_tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
        outputs = self.generator_model.generate(**inputs, max_length=150)
        answer = self.generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
        print("Response generated.")
        return answer


    def ask(self, query):
        """
        The main method to interact with the RAG pipeline.
        It orchestrates the retrieval and generation steps.

        Args:
            query (str): The user's question.

        Returns:
            str: The final, context-aware answer.
        """
        # Step 1: Retrieve relevant documents
        retrieved_docs = self.retrieve(query)

        # Step 2: Generate an answer based on the retrieved context
        answer = self.generate(query, retrieved_docs)

        return answer
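
The `generate` method tokenizes the prompt with `truncation=True` and `max_length=512`, which trims from the end of the input. With a long context, the question and the trailing `Answer:` cue are therefore the first things to be cut. Below is a minimal sketch of one way to guard against this, assuming a rough word-count budget stands in for real token counts. The helper `build_prompt` and its `max_context_words` parameter are hypothetical and not part of the class above.

```python
import textwrap


def build_prompt(query, context_docs, max_context_words=300):
    """Assemble a RAG prompt, trimming the context to a word budget.

    Word counts are only a rough proxy for tokenizer tokens (an assumption
    of this sketch); a real pipeline would count tokens with its tokenizer.
    """
    kept, used = [], 0
    for doc in context_docs:
        words = doc.split()
        remaining = max_context_words - used
        if remaining <= 0:
            break
        # Keep whole documents while they fit; clip the last one if needed.
        kept.append(" ".join(words[:remaining]))
        used += min(len(words), remaining)

    context = "\n".join(kept)
    # Build the template without leading indentation, then fill it in.
    template = textwrap.dedent("""\
        Answer the following question based only on the context below.
        If the context does not contain the answer, state that you don't know.

        Context:
        {context}

        Question: {query}

        Answer:""")
    return template.format(context=context, query=query)


docs = ["Tokyo is the capital of Japan.", "Paris is the capital of France."]
prompt = build_prompt("What is the capital of Japan?", docs, max_context_words=8)
print(prompt)
```

Trimming the context before tokenization keeps the question and the `Answer:` cue at the end of the prompt intact; `textwrap.dedent` also removes the leading indentation that an indented triple-quoted string would otherwise carry into the model input.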

In [7]:
# Step 4: Use the RAG Pipeline

# 1. Instantiate the pipeline
rag_pipeline = RagPipeline()

# 2. Add some documents to the knowledge base
# These are the texts the RAG model will use to answer questions.
documents = [
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
    "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials.",
    "The Colosseum is an oval amphitheatre in the centre of the city of Rome, Italy.",
    "The capital of Japan is Tokyo, a bustling metropolis known for its Imperial Palace and numerous shrines and temples.",
    "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas."
]
rag_pipeline.add_documents(documents)

# 3. Ask a question
# The pipeline will retrieve relevant documents and generate an answer.
query = "What is the capital of Japan?"
answer = rag_pipeline.ask(query)

# Print the final answer
print("\n--- Final Answer ---")
print(answer)

Initializing embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

Initializing ChromaDB...
ChromaDB collection 'document_collection' is ready.
Initializing text generation model (FLAN-T5)...
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
RAG Pipeline initialization complete.
Adding 5 documents to the collection...
Documents added successfully.
Retrieving top 3 documents for query: 'What is the capital of Japan?'
Retrieved documents.
Generating response with FLAN-T5...
Response generated.

--- Final Answer ---
Tokyo
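
The retrieval step logged above ranks stored documents by how close their embeddings are to the query embedding. A toy sketch of that ranking follows, using hand-written 3-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are illustrative assumptions: real embeddings come from the Sentence-Transformer model, and ChromaDB's index and distance metric are configured separately.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Made-up vectors standing in for real embeddings.
doc_vecs = {
    "eiffel": [0.9, 0.1, 0.0],
    "tokyo": [0.1, 0.9, 0.2],
    "everest": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.8, 0.1]  # pretend embedding of a question about Japan

ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)
```

The document whose vector points in nearly the same direction as the query comes first, which is the intuition behind retrieving the Tokyo sentence for a question about Japan's capital.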

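Because the pipeline stores its collection on disk under `./chroma_db`, positional IDs such as `doc_0` are reused every time documents are added, so a second batch can clash with the first. One possible alternative, sketched here with only the standard library, derives each ID from the document text. The `content_id` helper and the 16-character digest length are hypothetical choices for illustration.

```python
import hashlib


def content_id(text, prefix="doc_"):
    """Derive a deterministic ID from the document text itself."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return prefix + digest[:16]  # shortened for readability (assumption)


batch_one = ["The Colosseum is in Rome.", "Tokyo is the capital of Japan."]
batch_two = ["Mount Everest is in the Himalayas.", "Tokyo is the capital of Japan."]

ids_one = [content_id(d) for d in batch_one]
ids_two = [content_id(d) for d in batch_two]

# The same text always maps to the same ID; different texts get different IDs.
print(ids_one[1] == ids_two[1])
print(ids_one[0] == ids_two[0])
```

IDs like these could be passed as `ids=` to the collection's `upsert` method, so that re-adding an unchanged document overwrites its own entry instead of colliding with an unrelated one.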