<a href="https://colab.research.google.com/github/LoveleenGaur/loveleengaur/blob/main/Capstone_project_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Research Paper Answer Bot (RAG System)**

A Retrieval-Augmented Generation (RAG) system that answers questions about Generative AI research papers. The implementation uses FAISS for vector search, OpenAI embeddings for semantic similarity, and LangChain for document handling, chunking, and model access.

**Install Required Dependencies**

In [2]:
!pip install pypdf langchain faiss-cpu openai chromadb tiktoken sentence-transformers
!pip install langchain-community langchain-openai streamlit -q




[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.3/44.3 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m68.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.3/55.3 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.6/9.6 MB[0m [31m110.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.9/6.9 MB[0m [31m98.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.1/79.1 kB[0m [31m7.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m4.4 MB/s[0m eta [36m0:00:00[0m
[?25h

**Load and Process Research Papers**

In [4]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [5]:
import os
import glob

# Define your Google Drive PDF folder path
pdf_folder = "/content/drive/My Drive/PDF Papers"

# List all PDFs
pdf_files = glob.glob(os.path.join(pdf_folder, "*.pdf"))

if pdf_files:
    print("✅ Found PDFs in Google Drive:")
    for pdf in pdf_files:
        print(pdf)
else:
    print("❌ No PDFs found. Check your folder path.")


✅ Found PDFs in Google Drive:
/content/drive/My Drive/PDF Papers/gpt4.pdf
/content/drive/My Drive/PDF Papers/instructgpt.pdf
/content/drive/My Drive/PDF Papers/attention_paper.pdf
/content/drive/My Drive/PDF Papers/mistral_paper.pdf
/content/drive/My Drive/PDF Papers/gemini_paper.pdf


In [7]:
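# The OCR fallback needs system packages: poppler-utils (for pdf2image) and tesseract-ocr (for pytesseract)
!apt-get -qq install -y poppler-utils tesseract-ocr
!pip install pdfplumber pytesseract pdf2image -q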

import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from langchain.docstore.document import Document

# ✅ Extract text from PDFs
def extract_text_from_pdfs(pdf_folder):
    all_text = []
    pdf_files = glob.glob(os.path.join(pdf_folder, "*.pdf"))

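    for pdf_path in pdf_files:
        page_texts = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # extract_text() returns None for pages without a text layer
                page_texts.append(page.extract_text() or "")
        text = "\n".join(page_texts)  # Keep a line break between pages

        if text.strip():  # If text is found, store it with its source path
            all_text.append(Document(page_content=text, metadata={"source": pdf_path}))
        else:  # If no text layer, fall back to OCR
            all_text.extend(ocr_pdf(pdf_path))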

    return all_text

# ✅ OCR for Scanned PDFs
def ocr_pdf(pdf_path):
    ocr_text = []
    images = convert_from_path(pdf_path)

    for img in images:
        text = pytesseract.image_to_string(img)  # Extract text from image
        if text.strip():
            ocr_text.append(Document(page_content=text))

    return ocr_text

# ✅ Extract text using pdfplumber + OCR
pdf_folder = "/content/drive/My Drive/PDF Papers"  # Update folder path
documents = extract_text_from_pdfs(pdf_folder)

if documents:
    print(f"✅ Successfully loaded {len(documents)} documents.")
    print(documents[0].page_content[:500])  # Preview extracted text
else:
    print("❌ No text extracted. Check PDF format or folder path.")


✅ Successfully loaded 5 documents.
GPT-4 Technical Report
OpenAI∗
Abstract
WereportthedevelopmentofGPT-4,alarge-scale,multimodalmodelwhichcan
acceptimageandtextinputsandproducetextoutputs. Whilelesscapablethan
humansinmanyreal-worldscenarios,GPT-4exhibitshuman-levelperformance
onvariousprofessionalandacademicbenchmarks,includingpassingasimulated
barexamwithascorearoundthetop10%oftesttakers. GPT-4isaTransformer-
basedmodelpre-trainedtopredictthenexttokeninadocument. Thepost-training
alignmentprocessresultsinimprovedperformanceonme


**Split Documents into Chunks**

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define chunking parameters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
doc_chunks = text_splitter.split_documents(documents)

print(f"✅ Split into {len(doc_chunks)} text chunks.")


✅ Split into 597 text chunks.
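
To make the effect of `chunk_size=500` and `chunk_overlap=50` concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-width sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines). The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: each window starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 1200
print([len(c) for c in sliding_window_chunks(sample)])  # [500, 500, 300]
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still fully contained in at least one chunk.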


**Load API Keys for OpenAI**

In [9]:
import yaml

# Load OpenAI credentials
with open('chatgpt_api_credentials.yml', 'r') as file:
    api_creds = yaml.safe_load(file)

os.environ['OPENAI_API_KEY'] = api_creds['openai_key']


**Generate Embeddings and Store in FAISS**

In [10]:
import faiss
import numpy as np
from langchain_openai import OpenAIEmbeddings  # langchain.embeddings.OpenAIEmbeddings is deprecated

# Initialize embedding model
embedding_model = OpenAIEmbeddings()

# Convert text chunks into embeddings (embed_documents batches the API requests)
vector_data = embedding_model.embed_documents([chunk.page_content for chunk in doc_chunks])

# Store embeddings in FAISS
embedding_dim = len(vector_data[0])
index = faiss.IndexFlatL2(embedding_dim)
faiss_data = np.array(vector_data, dtype=np.float32)
index.add(faiss_data)

print("✅ Embeddings stored in FAISS.")

# Save FAISS index for later use
faiss.write_index(index, "vector_db.index")

# Save document store
import pickle
with open("doc_store.pkl", "wb") as f:
    pickle.dump(doc_chunks, f)



✅ Embeddings stored in FAISS.
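
`IndexFlatL2` performs an exact, brute-force search: it compares the query against every stored vector and ranks by squared Euclidean distance. The NumPy sketch below reproduces that idea on toy data so the behaviour is easy to inspect. The function name `flat_l2_search` and the toy vectors are assumptions for illustration and are not part of the notebook.

```python
import numpy as np


def flat_l2_search(vectors, queries, top_k):
    """Exact nearest-neighbour search by squared L2 distance.

    vectors: (n, d) array of stored embeddings
    queries: (m, d) array of query embeddings
    Returns (distances, indices), each of shape (m, top_k).
    """
    # Pairwise differences via broadcasting: shape (m, n, d)
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the top_k smallest distances per query
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order


toy_vectors = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32
)
toy_query = np.array([[0.9, 0.1]], dtype=np.float32)

toy_distances, toy_indices = flat_l2_search(toy_vectors, toy_query, top_k=2)
print(toy_indices)  # nearest stored vector first: [[1 0]]
```

Because every query scans all vectors, cost grows linearly with the number of chunks. That is fine for a few hundred chunks; larger collections usually call for an approximate index.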


**Implement Retrieval Function**

In [11]:
def search_documents(query, top_k=5):
    """Retrieve top-k relevant documents from FAISS."""
    query_embedding = np.array([embedding_model.embed_query(query)], dtype=np.float32)
    _, indices = index.search(query_embedding, top_k)

    results = [doc_chunks[i].page_content for i in indices[0]]
    return results

# Test retrieval
query = "What is the impact of AI in medicine?"
retrieved_docs = search_documents(query)

for i, doc in enumerate(retrieved_docs):
    print(f"🔹 Result {i+1}:\n{doc[:300]}...\n")


🔹 Result 1:
7.1. Impact Assessment
At Google we apply an impact assessment framework throughout the product development lifecycle
related to Google’s AI Principles (Google, 2023). This means we assess the risk and impact of AI
models we’re building at both a model-level (e.g. for Gemini API Ultra 1.0, as deploy...

🔹 Result 2:
and benefit humanity, and we are enthusiastic to see how these models are used by our colleagues
at Google and beyond. We build on many innovations in machine learning, data, infrastructure,
and responsible development – areas that we have been pursuing at Google for over a decade. The
models we pre...

🔹 Result 3:
Google DeepMind Responsible Development and Innovation team, and are reviewed by the Google
DeepMind Responsibility and Safety Council. We draw from various sources in producing impact
assessments, including a wide range of literature, external expertise, and our in-house ethics and
safety research....

🔹 Result 4:
futuresystems. Wewillsoonpublishrecom
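
`search_documents` discards the distances returned by `index.search`, and FAISS pads the result with index `-1` when it finds fewer than `top_k` neighbours. A small post-processing step can keep the distances and skip the padding. The sketch below assumes one row of distances and indices (as in `distances[0]`, `indices[0]`); the helper name `collect_hits` is hypothetical.

```python
def collect_hits(distances, indices, texts):
    """Pair each retrieved text with its distance, skipping FAISS padding.

    distances, indices: one row of the arrays returned by index.search
    texts: list of chunk texts in the same order they were added to the index
    """
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:  # -1 marks "no neighbour found" in FAISS results
            continue
        hits.append((texts[idx], float(dist)))
    return hits


# Toy example: three stored chunks, a top-4 search that found only three hits
toy_texts = ["chunk A", "chunk B", "chunk C"]
print(collect_hits([0.1, 0.4, 0.9, 3.4e38], [2, 0, 1, -1], toy_texts))
```

Keeping the distance makes it possible to drop weak matches by threshold before they reach the language model.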

**Create and Run the Streamlit App (app.py)**

In [12]:
%%writefile app.py
import streamlit as st
import os
import faiss
import numpy as np
import pickle
import yaml
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# ✅ Load API Key
with open('chatgpt_api_credentials.yml', 'r') as file:
    api_creds = yaml.safe_load(file)
os.environ["OPENAI_API_KEY"] = api_creds["openai_key"]

# ✅ Initialize OpenAI Chat Model
llm = ChatOpenAI(model_name="gpt-4", openai_api_key=os.environ["OPENAI_API_KEY"])

# ✅ Load FAISS Index
index = faiss.read_index("vector_db.index")

# ✅ Load stored documents
with open("doc_store.pkl", "rb") as f:
    doc_store = pickle.load(f)

# ✅ Initialize OpenAI Embeddings
embedding_model = OpenAIEmbeddings()

# ✅ Define Retrieval Function
def retrieve_top_k(query, k=3):
    query_embedding = np.array([embedding_model.embed_query(query)], dtype=np.float32)
    _, indices = index.search(query_embedding, k)
    retrieved_docs = [doc_store[i].page_content for i in indices[0]]
    return retrieved_docs

# ✅ Define Answer Generation Function
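def generate_answer(query):
    top_docs = retrieve_top_k(query)
    context = "\n\n".join(top_docs)

    # Send the question together with the retrieved context, not the context alone
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = llm.invoke(prompt)
    return response.content, top_docs  # invoke() returns a message; .content is its text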

# ✅ Streamlit UI
st.title("📚 Research Paper Answer Bot")
query = st.text_input("🔍 Ask a question:")
if st.button("💡 Get Answer") and query.strip():
    answer, sources = generate_answer(query)
    st.write("### 🤖 Answer:")
    st.write(answer)

    st.write("### 🔎 Sources:")
    for i, src in enumerate(sources):
        st.write(f"**Source {i+1}:** {src[:300]}...")  # Show preview


Writing app.py
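
When retrieved chunks are passed to the chat model, it helps to number them and cap the total context length so the prompt stays within the model's limits and the answer can refer to sources by number. The following is a minimal sketch under those assumptions; `build_prompt`, its wording, and the `max_context_chars` budget are illustrative choices, not part of `app.py`.

```python
def build_prompt(question, chunks, max_context_chars=4000):
    """Assemble a grounded prompt from retrieved chunks.

    Chunks are numbered [1], [2], ... in retrieval order and added until
    the character budget would be exceeded.
    """
    parts = []
    used = 0
    for n, chunk in enumerate(chunks, start=1):
        entry = f"[{n}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # Remaining chunks are lower-ranked, so drop them
        parts.append(entry)
        used += len(entry)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the numbered context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


print(build_prompt("What is attention?", ["alpha", "beta"]))
```

A character budget is only a rough proxy for tokens; a tokenizer-based count would be more precise if the limit matters.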


**Run the Streamlit App**

In [17]:
!pip install pyngrok -q
from pyngrok import ngrok

!ngrok authtoken YOUR_NGROK_AUTHTOKEN  # Replace with your ngrok token; never commit a real token


Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
!pkill streamlit  # Stop any running Streamlit instances
!streamlit run app.py &  # Restart Streamlit

# Expose the app via Ngrok
public_url = ngrok.connect(8501).public_url
print(f"🚀 Streamlit App is Live: {public_url}")



Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.147.33.178:8501
