<a href="https://colab.research.google.com/github/Tanisha1201/RAG-Based-question-answer-system/blob/main/RAG_Based_question_answer_system_iynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install fastapi uvicorn python-multipart pydantic sentence-transformers faiss-cpu PyPDF2 transformers torch


Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.13.2


In [2]:
%%writefile models.py
from pydantic import BaseModel
from typing import List

class QuestionRequest(BaseModel):
    question: str

class MultiQuestionRequest(BaseModel):
    questions: List[str]


Writing models.py


In [3]:
%%writefile ingest.py
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
import os

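# Embedding model; its output size (384 for all-MiniLM-L6-v2) sets the index dimension
model = SentenceTransformer("all-MiniLM-L6-v2")
dimension = model.get_sentence_embedding_dimension()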

# FAISS index
index = faiss.IndexFlatL2(dimension)

# Stores text chunks
documents = []

# Stores file names for each chunk
document_sources = []

def read_file(file_path):
    if file_path.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()

    elif file_path.endswith(".pdf"):
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text() or ""
        return text

    return ""

def chunk_text(text, chunk_size=60):
    words = text.split()
    return [" ".join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

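def ingest_document(file_path):
    """Read a file, embed its chunks, and add them to the shared index."""
    text = read_file(file_path)
    chunks = chunk_text(text)

    if not chunks:
        return

    embeddings = model.encode(chunks)
    index.add(np.array(embeddings).astype("float32"))

    # Append in the same order as index.add so list position i matches FAISS row id i
    source = os.path.basename(file_path)
    for chunk in chunks:
        documents.append(chunk)
        document_sources.append(source)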


Writing ingest.py
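
ingest.py stores the chunk embeddings in `faiss.IndexFlatL2`, an exhaustive index that compares a query against every stored vector. The sketch below imitates that behaviour in plain Python to show what the distances and ids returned by a search mean. `FlatL2Sketch` is a hypothetical name used only for illustration and is not part of FAISS; the `inf` padding distance is also an assumption of the sketch (FAISS uses its own sentinel value), while the `-1` padding id mirrors what FAISS returns when fewer than `k` vectors are stored.

```python
class FlatL2Sketch:
    """Illustrative stand-in for an exhaustive L2 index (not the FAISS API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vectors):
        # Row id of each vector is its position in self.vectors
        for v in vectors:
            if len(v) != self.dimension:
                raise ValueError("vector has wrong dimension")
            self.vectors.append(list(v))

    def search(self, query, k):
        """Return (squared distances, row ids) of the k nearest vectors."""
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(query, v)), i)
            for i, v in enumerate(self.vectors)
        )[:k]
        distances = [d for d, _ in scored]
        ids = [i for _, i in scored]
        # Pad the result when fewer than k vectors are stored
        missing = k - len(ids)
        return distances + [float("inf")] * missing, ids + [-1] * missing


index = FlatL2Sketch(dimension=2)
index.add([[0.0, 0.0], [3.0, 4.0]])
distances, ids = index.search([0.0, 1.0], k=3)
print(distances, ids)  # [1.0, 18.0, inf] [0, 1, -1]
```

Because a padding id of `-1` is a valid Python list index (it selects the last element), code that looks up chunks by id should skip negative ids rather than index with them.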


In [4]:
%%writefile rag.py
from transformers import pipeline
# Reuse the embedding model loaded in ingest.py so queries and chunks share one encoder
from ingest import index, documents, document_sources, model as embed_model

qa_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-large"
)

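def retrieve_and_answer(question, k=3):
    q_embedding = embed_model.encode([question]).astype("float32")
    # search returns distances and row ids; ids are -1 when fewer than k chunks are indexed
    _, ids = index.search(q_embedding, k)

    context = ""
    sources = set()

    for idx in ids[0]:
        if idx < 0:
            # Skip padding ids; documents[-1] would silently return the last chunk
            continue
        context += documents[idx] + "\n"
        sources.add(document_sources[idx])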

    prompt = f"""
Answer the question using ONLY the given context.
If the answer is not clearly present, say "Not found in document".
Return short and exact answers.

Context:
{context}

Question:
{question}
"""

    result = qa_pipeline(prompt, max_length=150)
    return {
        "answer": result[0]["generated_text"],
        "sources": list(sources)
    }

def retrieve_and_answer_multiple(questions):
    output = {}
    for q in questions:
        output[q] = retrieve_and_answer(q)
    return output


Writing rag.py
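
The recorded run further down reports an input of 1501 tokens against a 512-token limit, so the prompt built in `retrieve_and_answer` can exceed what the model's tokenizer expects. One possible remedy, sketched here under stated assumptions, is to add retrieved chunks only while they fit a budget. `build_prompt`, `word_count`, and the `count_tokens` parameter are hypothetical names; the default counter is a whitespace word count, which is only a rough stand-in for real token counts. The sketch also places the question before the context, which differs from the prompt in rag.py, so that trimming affects only the context.

```python
def word_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(question, chunks, max_tokens=512, count_tokens=word_count):
    """Assemble a prompt, adding retrieved chunks only while they fit."""
    header = (
        "Answer the question using ONLY the given context.\n"
        'If the answer is not clearly present, say "Not found in document".\n'
        "Return short and exact answers.\n\n"
        f"Question:\n{question}\n\nContext:\n"
    )
    used = count_tokens(header)
    kept = []
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            # Stop at the first chunk that does not fit, preserving retrieval order
            break
        kept.append(chunk)
        used += cost
    return header + "\n".join(kept), len(kept)


prompt, kept = build_prompt("What is fertilisation?", ["chunk one", "chunk two"])
print(kept)  # 2
```

In real use, `count_tokens` would more accurately be a function that measures length with the generation model's own tokenizer. Separately, passing `max_new_tokens=150` instead of `max_length=150` to the pipeline call would likely avoid the `max_new_tokens`/`max_length` warning shown in that output.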


In [5]:
%%writefile main.py
from fastapi import FastAPI, UploadFile, BackgroundTasks
from models import QuestionRequest, MultiQuestionRequest
from ingest import ingest_document
from rag import retrieve_and_answer, retrieve_and_answer_multiple
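import os
import shutil

app = FastAPI()

@app.post("/upload")
async def upload_file(file: UploadFile, background_tasks: BackgroundTasks):
    # Keep only the base name so a crafted filename cannot write outside the working directory
    file_path = os.path.basename(file.filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    # Indexing runs in the background after the response is sent
    background_tasks.add_task(ingest_document, file_path)
    return {"message": "File uploaded; indexing scheduled"}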

@app.post("/ask")
def ask_question(req: QuestionRequest):
    return retrieve_and_answer(req.question)

@app.post("/ask-multiple")
def ask_multiple(req: MultiQuestionRequest):
    return retrieve_and_answer_multiple(req.questions)


Writing main.py
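
The notebook writes main.py but never starts the server, so the endpoints are not exercised here. As a rough illustration of the request bodies that `/ask` and `/ask-multiple` expect (matching `QuestionRequest` and `MultiQuestionRequest`), the sketch below builds the requests with the standard library without sending them. `BASE_URL` and `json_post` are hypothetical names; the address assumes the app was started with `uvicorn main:app` on uvicorn's default host and port.

```python
import json
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # assumption: uvicorn default host and port


def json_post(path, payload):
    """Build (but do not send) a JSON POST request for the API."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


ask = json_post("/ask", {"question": "What is fertilisation?"})
ask_many = json_post(
    "/ask-multiple",
    {"questions": ["What is fertilisation?", "What is types of Machine learning?"]},
)
print(ask.get_method(), ask.full_url)
```

With the server running, `urllib.request.urlopen(ask)` would send the request and return the JSON answer with its `answer` and `sources` fields.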


In [6]:
from ingest import ingest_document

ingest_document("/content/Flowers.txt")
ingest_document("/content/Machine Learning.pdf")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [7]:
from rag import retrieve_and_answer_multiple

questions = [
    "What is fertilisation?",
    "What is types of Machine learning?",

]

answers = retrieve_and_answer_multiple(questions)

for q, result in answers.items():
    print("Q:", q)
    print("A:", result["answer"])
    print("From files:", result["sources"])
    print("-" * 50)


Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1501 > 512). Running this sequence through the model will result in indexing errors
Both `max_new_tokens` (=256) and `max_length`(=150) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Q: What is fertilisation?
A: Fertilisation is the key step in sexual reproduction
From files: ['Flowers.txt', 'Machine Learning.pdf']
--------------------------------------------------
Q: What is types of Machine learning?
A: Supervised Learning: Models are trained on labeled data sets to predict outcomes or classify data. Examples include regression for predicting numerical values and classification for categorizing data. Unsupervised Learning: Models analyze unlabeled data to uncover patterns, such as clustering similar data points or reducing dimensionity. Reinforcement Learning: Models learn through trial and error by interacting with an environment and receiving rewards or penalties for actions. Semi-Supervised Learning: Combines a small amount of labeled data with a large amount of unlabeled data, useful when labeling is expensive Self-Supervised Learning: A subset of unsupervised learning where models generate their own labels from data, often used in deep learning.
From files: 