# **Quest Analytics - AI Research Assistant**

# **Setup First**

In [2]:
# Install all required packages
!pip install -q langchain langchain-community langchain-groq
!pip install -q chromadb
!pip install -q sentence-transformers
!pip install -q pypdf
!pip install -q gradio
!pip install -q tiktoken

# **Task 1: Load Document Using LangChain**

In [3]:
# ============================================================
# TASK 1: Load Document Using LangChain (pdf_loader.png)
# ============================================================

from langchain_community.document_loaders import PyPDFLoader

# Path to your PDF in Colab
pdf_path = "/content/A_Comprehensive_Review_of_Low_Rank_Adaptation_in_Large_Language_Models_for_Efficient_Parameter_Tuning-1.pdf"

# Load the PDF
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Display results
print(f"✅ Document loaded successfully!")
print(f"📄 Total pages loaded: {len(documents)}")
print(f"\n--- Preview of Page 1 ---")
print(documents[0].page_content[:500])
print(f"\n--- Metadata ---")
print(documents[0].metadata)

✅ Document loaded successfully!
📄 Total pages loaded: 11

--- Preview of Page 1 ---
A Comprehensive Review of Low-Rank
Adaptation in Large Language Models for
Efficient Parameter Tuning
September 10, 2024
Abstract
Natural Language Processing (NLP) often involves pre-training large
models on extensive datasets and then adapting them for specific tasks
through fine-tuning. However, as these models grow larger, like GPT-3
with 175 billion parameters, fully fine-tuning them becomes computa-
tionally expensive. We propose a novel method called LoRA (Low-Rank
Adaptation) that signifi

--- Metadata ---
{'producer': 'pdfTeX-1.40.26', 'creator': 'TeX', 'creationdate': '2024-09-10T21:50:42+00:00', 'moddate': '2024-09-10T21:50:42+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'trapped': '/False', 'source': '/content/A_Comprehensive_Review_of_Low_Rank_Adaptation_in_Large_Language_Models_for_Efficient_Parameter_Tuning-1.pdf', 'to

# **Task 2: Apply Text Splitting Techniques**

In [6]:
# ============================================================
# TASK 2: Apply Text Splitting Techniques (code_splitter.png)
# ============================================================

# If the import below raises ModuleNotFoundError, install the package first:
# !pip install -q langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Each chunk has max 1000 characters
    chunk_overlap=200,     # 200 character overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split the loaded documents into chunks
chunks = text_splitter.split_documents(documents)

# Display results
print(f"✅ Text splitting complete!")
print(f"📦 Total chunks created: {len(chunks)}")
print(f"\n--- Preview of Chunk 1 ---")
print(chunks[0].page_content)
print(f"\n--- Chunk 1 Metadata ---")
print(chunks[0].metadata)
print(f"\n--- Preview of Chunk 2 ---")
print(chunks[1].page_content[:300])

✅ Text splitting complete!
📦 Total chunks created: 38

--- Preview of Chunk 1 ---
A Comprehensive Review of Low-Rank
Adaptation in Large Language Models for
Efficient Parameter Tuning
September 10, 2024
Abstract
Natural Language Processing (NLP) often involves pre-training large
models on extensive datasets and then adapting them for specific tasks
through fine-tuning. However, as these models grow larger, like GPT-3
with 175 billion parameters, fully fine-tuning them becomes computa-
tionally expensive. We propose a novel method called LoRA (Low-Rank
Adaptation) that significantly reduces the overhead by freezing the orig-
inal model weights and only training small rank decomposition matrices.
This leads to up to 10,000 times fewer trainable parameters and reduces
GPU memory usage by three times. LoRA not only maintains but some-
times surpasses fine-tuning performance on models like RoBERTa, De-
BERTa, GPT-2, and GPT-3. Unlike other methods, LoRA introduces
no extra latency during in
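
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` also tries to break on the separators listed above, so its chunk boundaries will differ. The function name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input (not taken from the PDF): 26 characters, windows of 10, overlap of 4
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window repeats the last 4 characters of the previous one, which is the same idea as the 200-character overlap configured above: text near a chunk boundary appears in both neighbouring chunks.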

# **Task 3: Embed Documents**

In [7]:
# ============================================================
# TASK 3: Embed Documents (embedding.png)
# ============================================================

import os
from google.colab import userdata
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load HuggingFace token
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Test embedding on a sample text
sample_text = chunks[0].page_content
embedding_vector = embedding_model.embed_query(sample_text)

# Display results
print(f"✅ Embedding model loaded: sentence-transformers/all-MiniLM-L6-v2")
print(f"🔢 Embedding vector dimension: {len(embedding_vector)}")
print(f"\n--- First 10 values of the embedding vector ---")
print(embedding_vector[:10])
print(f"\n✅ Total chunks ready for embedding: {len(chunks)}")

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✅ Embedding model loaded: sentence-transformers/all-MiniLM-L6-v2
🔢 Embedding vector dimension: 384

--- First 10 values of the embedding vector ---
[-0.046186309307813644, -0.09967907518148422, 0.01530721690505743, 0.04801216349005699, 0.06438610702753067, 0.03943644464015961, -0.039326027035713196, 0.018974632024765015, 0.003956829663366079, -0.06212154030799866]

✅ Total chunks ready for embedding: 38
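
The `normalize_embeddings=True` setting scales every vector to unit length, so the dot product of two embeddings equals their cosine similarity. The sketch below checks this with plain Python lists. The 3-dimensional vectors are made-up toy values, not model output, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two embeddings
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print("length of a_unit:", math.sqrt(dot(a_unit, a_unit)))
print("cosine(a, b):    ", cosine(a, b))
print("dot of unit vecs:", dot(a_unit, b_unit))
```

The last two printed values agree up to floating-point rounding, which is why normalized embeddings can be compared with a plain dot product.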


# **Task 4: Create and Configure Vector Database**

In [8]:
# ============================================================
# TASK 4: Create & Configure Vector Database (vectordb.png)
# ============================================================

from langchain_community.vectorstores import Chroma

# Define a persistent directory for ChromaDB
persist_directory = "/content/chroma_db"

# Create Chroma vector database from document chunks
print("⏳ Creating Chroma vector database... (this may take a moment)")

vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=persist_directory
)

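# Persist the database. With Chroma 0.4+ data is written to persist_directory
# automatically, so this call is deprecated and only triggers a warning.
vectordb.persist()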

# Display results
print(f"✅ Chroma vector database created and persisted!")
print(f"📁 Storage location: {persist_directory}")
print(f"📊 Total documents stored: {vectordb._collection.count()}")

# Test a similarity search
test_query = "What is Low-Rank Adaptation?"
results = vectordb.similarity_search(test_query, k=2)
print(f"\n--- Test Similarity Search: '{test_query}' ---")
for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print(doc.page_content[:300])

⏳ Creating Chroma vector database... (this may take a moment)
✅ Chroma vector database created and persisted!
📁 Storage location: /content/chroma_db
📊 Total documents stored: 38

--- Test Similarity Search: 'What is Low-Rank Adaptation?' ---

Result 1:
efficiency. However, our approach differs in that we apply low-rank updates to
frozen pre-trained models, making it highly effective for task-specific adapta-
tion. Neural networks with low-rank structures have been shown to outperform
classical methods such as finite-width neural tangent kernels (A

Result 2:
model depth or reducing the usable sequence length. Furthermore, these meth-
ods typically do not perform as well as full fine-tuning, leading to a trade-off
between efficiency and model performance.
Inspired by prior works that demonstrate over-parametrized models often
reside in a low intrinsic di
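
Conceptually, `similarity_search(test_query, k=2)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below is a brute-force version of that idea over toy 2-dimensional unit vectors. It is an illustration under stated assumptions, not how Chroma is implemented: a real vector store typically uses an index rather than scanning every vector, and the names `top_k` and `stored` are hypothetical.

```python
def top_k(query_vec, stored, k):
    """Return the k (text, score) pairs with the highest dot-product score.

    Assumes every vector is already unit-length, so the dot product
    equals cosine similarity.
    """
    scored = [
        (text, sum(q * v for q, v in zip(query_vec, vec)))
        for text, vec in stored
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]


# Made-up chunks and vectors, for illustration only
stored = [
    ("chunk about LoRA", [1.0, 0.0]),
    ("chunk about adapters", [0.6, 0.8]),
    ("chunk about datasets", [0.0, 1.0]),
]
hits = top_k([1.0, 0.0], stored, k=2)
for text, score in hits:
    print(f"{score:.2f}  {text}")
```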



# **Task 5: Develop a Retriever**

In [10]:
# ============================================================
# TASK 5: Develop a Retriever (retriever.png)
# ============================================================

# Create a retriever from the vector database
retriever = vectordb.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Retrieve top 4 most relevant chunks
)

# Test the retriever
query = "What this paper is talking about?"
retrieved_docs = retriever.invoke(query)

# Display results
print(f"✅ Retriever created successfully!")
print(f"🔍 Query: '{query}'")
print(f"📄 Number of documents retrieved: {len(retrieved_docs)}")

for i, doc in enumerate(retrieved_docs):
    print(f"\n{'='*50}")
    print(f"📌 Retrieved Document {i+1}")
    print(f"📄 Source: {doc.metadata.get('source', 'N/A')} | Page: {doc.metadata.get('page', 'N/A')}")
    print(f"{'='*50}")
    print(doc.page_content[:400])

✅ Retriever created successfully!
🔍 Query: 'What this paper is talking about?'
📄 Number of documents retrieved: 4

📌 Retrieved Document 1
📄 Source: /content/A_Comprehensive_Review_of_Low_Rank_Adaptation_in_Large_Language_Models_for_Efficient_Parameter_Tuning-1.pdf | Page: 4
ing it as a low-rank decomposition, W0 + ∆W = W0 + BA, where B ∈ Rd×r,
A ∈ Rr×k, and the rank r ≪ min(d, k). During training, W0 is fixed, and A
and B are the trainable parameters. Both W0 and ∆ W = BA are multiplied
with the input, and their respective outputs are summed element-wise. Thus,
for h = W0x, our updated forward pass becomes:
h = W0x + ∆W x = W0x + BAx
We illustrate this reparametrizat

📌 Retrieved Document 2
📄 Source: /content/A_Comprehensive_Review_of_Low_Rank_Adaptation_in_Large_Language_Models_for_Efficient_Parameter_Tuning-1.pdf | Page: 4
The principles outlined here apply generally to dense layers in neural networks,
although we focus on specific weights in Transformer language models, as these
mod

# **Task 6: Construct a QA Bot with Gradio Interface**

In [12]:
# Run this first
!pip install -q langchain-groq langchain-text-splitters

In [15]:
# ============================================================
# TASK 6: QA Bot with Gradio Interface (QA_bot.png)
# ============================================================

import os
import gradio as gr
from google.colab import userdata
from langchain_groq import ChatGroq
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Load API Keys
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

# ---- Initialize LLM ----
llm = ChatGroq(
    model_name="llama-3.3-70b-versatile",
    temperature=0.2,
    max_tokens=1024
)

# ---- Prompt Template ----
prompt_template = """You are a helpful research assistant for Quest Analytics.
Use the following context from the research paper to answer the question.
If you don't know the answer from the context, say "I don't have enough information from the document to answer this."

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# ---- Global variable to hold QA chain ----
qa_chain = None

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def load_and_process_pdf(pdf_file):
    """Load PDF, create embeddings, and set up QA chain."""
    global qa_chain

    try:
        pdf_path = pdf_file

        # Load PDF
        loader = PyPDFLoader(pdf_path)
        documents = loader.load()

        # Split text
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        chunks = text_splitter.split_documents(documents)

        # Create embeddings
        embedding_model = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            model_kwargs={"device": "cpu"},
            encode_kwargs={"normalize_embeddings": True}
        )

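        # Create vector store in a fresh directory per upload; reusing one
        # persist_directory would mix chunks from earlier PDFs into the results
        import tempfile
        vectordb = Chroma.from_documents(
            documents=chunks,
            embedding=embedding_model,
            persist_directory=tempfile.mkdtemp(prefix="chroma_qa_bot_")
        )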

        # Create retriever
        retriever = vectordb.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}
        )

        # ✅ New LCEL chain replacing deprecated RetrievalQA
        qa_chain = (
            {
                "context": retriever | format_docs,
                "question": RunnablePassthrough()
            }
            | PROMPT
            | llm
            | StrOutputParser()
        )

        return f"✅ PDF loaded successfully!\n📄 Pages: {len(documents)}\n📦 Chunks: {len(chunks)}\n\nYou can now ask questions about the document!"

    except Exception as e:
        return f"❌ Error loading PDF: {str(e)}"


def answer_question(question, history):
    """Answer a question using the QA chain."""
    global qa_chain

    if qa_chain is None:
        history.append(("System", "⚠️ Please upload a PDF document first."))
        return history

    if not question.strip():
        history.append(("System", "⚠️ Please enter a question."))
        return history

    try:
        # ✅ Use invoke() instead of __call__()
        answer = qa_chain.invoke(question)
        history.append((question, answer))
        return history

    except Exception as e:
        history.append((question, f"❌ Error: {str(e)}"))
        return history


# ---- Build Gradio Interface ----
with gr.Blocks(
    theme=gr.themes.Soft(),
    title="Quest Analytics RAG Assistant"
) as demo:

    gr.Markdown("""
    # 🔬 Quest Analytics - AI Research Assistant
    ### Powered by LangChain + Groq (Llama 3.3 70B) + ChromaDB
    Upload a research paper and ask questions about it!
    """)

    with gr.Row():
        with gr.Column(scale=1):
            gr.Markdown("### 📁 Upload Document")
            pdf_input = gr.File(
                label="Upload PDF",
                file_types=[".pdf"],
                type="filepath"
            )
            upload_btn = gr.Button("🚀 Process PDF", variant="primary")
            upload_status = gr.Textbox(
                label="Status",
                lines=4,
                interactive=False,
                value="Waiting for PDF upload..."
            )

        with gr.Column(scale=2):
            gr.Markdown("### 💬 Ask Questions")
            chatbot = gr.Chatbot(
                label="Research Assistant",
                height=400,
                bubble_full_width=False
            )
            with gr.Row():
                question_input = gr.Textbox(
                    label="Your Question",
                    placeholder="e.g., What is this paper about?",
                    lines=2,
                    scale=4
                )
                ask_btn = gr.Button("Ask 🔍", variant="primary", scale=1)

            gr.Examples(
                examples=[
                    ["What is this paper about?"],
                    ["What is Low-Rank Adaptation (LoRA)?"],
                    ["What are the main contributions of this paper?"],
                    ["What datasets were used in the experiments?"],
                    ["What are the limitations mentioned in this paper?"]
                ],
                inputs=question_input
            )

            clear_btn = gr.Button("🗑️ Clear Chat", variant="secondary")

    # ---- Event Handlers ----
    upload_btn.click(
        fn=load_and_process_pdf,
        inputs=[pdf_input],
        outputs=[upload_status]
    )

    ask_btn.click(
        fn=answer_question,
        inputs=[question_input, chatbot],
        outputs=[chatbot]
    ).then(
        fn=lambda: "",
        outputs=[question_input]
    )

    question_input.submit(
        fn=answer_question,
        inputs=[question_input, chatbot],
        outputs=[chatbot]
    ).then(
        fn=lambda: "",
        outputs=[question_input]
    )

    clear_btn.click(
        fn=lambda: [],
        outputs=[chatbot]
    )

    gr.Markdown("---\n*Quest Analytics RAG Assistant | Built with LangChain, Groq, ChromaDB & Gradio*")

# Launch
demo.launch(share=True, debug=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://bf62d87377e4a4ef8a.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1134, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/error

Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://bf62d87377e4a4ef8a.gradio.live
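
The LCEL chain built in `load_and_process_pdf` can be read as four steps: retrieve chunks for the question, join them into one context string, fill the prompt template, and pass the prompt to the model. The sketch below reproduces that data flow with plain functions so it runs without API keys. `fake_retriever` and `fake_llm` are hypothetical stand-ins, the chunks are plain strings rather than LangChain `Document` objects, and the template is a shortened version of the one in Task 6.

```python
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


def fake_retriever(question):
    """Stand-in for the retriever: returns fixed chunks for any question."""
    return [
        "LoRA freezes the pre-trained weights.",
        "Only low-rank matrices are trained.",
    ]


def format_docs(docs):
    """Join retrieved chunks with blank lines (chunks are plain strings here)."""
    return "\n\n".join(docs)


def fake_llm(prompt):
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def build_prompt(question):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()} | PROMPT
    context = format_docs(fake_retriever(question))
    return TEMPLATE.format(context=context, question=question)


def run_chain(question):
    # Mirrors ... | llm | StrOutputParser(): the result is a plain string
    return fake_llm(build_prompt(question))


print(build_prompt("What is LoRA?"))
print(run_chain("What is LoRA?"))
```

Keeping the steps as separate functions makes it easy to print the filled-in prompt when an answer looks wrong, which is often the quickest way to tell a retrieval problem from a model problem.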


