<a href="https://colab.research.google.com/github/Madhusudan3223/CollabGPT/blob/main/collabGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 CollabGPT: A Multi-PDF RAG Chatbot using Cohere Embeddings

Welcome to **CollabGPT**, a Retrieval-Augmented Generation (RAG) chatbot built with **Cohere’s embedding and chat models**. This notebook lets you upload multiple PDF documents and ask natural language questions about them, first through a console loop and then through a Gradio chatbot interface.

### 🚀 What This Notebook Does:
1. 📄 Upload multiple PDFs (research papers, reports, etc.)
2. 🔍 Extract and split text into meaningful chunks
3. 🧠 Generate embeddings using **Cohere’s `embed-english-v3.0` model**
4. 🔎 Match your query with the most relevant chunks using cosine similarity
5. 💬 Use the matched content as context to answer your questions
6. 🧑‍💻 Interact with a **Gradio UI** to ask anything from your documents

### 📚 Use Case Examples:
- Summarize PDF content
- Ask questions like “What are the key findings?”
- Extract definitions or bullet points from dense documents
- Academic literature review
- Product or policy document Q&A




## 📦 Step 1: Install Required Libraries
Before running the chatbot, we need to install the following Python packages:

- `cohere==4.44`: To use Cohere’s embedding and text generation APIs
- `PyMuPDF`: For extracting text from PDF files

In [34]:
!pip install cohere==4.44 PyMuPDF



## 📄 Upload PDF Documents and Extract Text

Use the cell below to upload multiple PDFs into the notebook. These PDFs will act as the knowledge source for the chatbot.

- Uploaded files are moved into a `pdfs/` folder.
- The text of every page of every PDF is then concatenated into a single string, `raw_text`.
- Make sure the uploaded PDFs are relevant to your domain or use case.


In [35]:
import fitz  # PyMuPDF
from google.colab import files
import os

# Create folder
os.makedirs("pdfs", exist_ok=True)

# Upload PDFs
uploaded = files.upload()

# Move them to 'pdfs/' folder
for filename in uploaded.keys():
    os.rename(filename, f"pdfs/{filename}")

# Function to extract text from all PDFs
def extract_all_text(pdf_folder="pdfs"):
    all_text = ""
    for filename in os.listdir(pdf_folder):
        if filename.lower().endswith(".pdf"):  # also matches ".PDF"
            path = os.path.join(pdf_folder, filename)
            with fitz.open(path) as doc:
                for page in doc:
                    all_text += page.get_text()
    return all_text

# Extract and show first few lines
raw_text = extract_all_text()
print("✅ Text extraction complete. Preview:")
print(raw_text[:1000])  # Preview first 1000 characters


Saving AutoGPT+P.pdf to AutoGPT+P.pdf
Saving AUTO GPT.pdf to AUTO GPT.pdf
Saving Large Language Models Survey.pdf to Large Language Models Survey.pdf
Saving Large Language Model.pdf to Large Language Model.pdf
Saving PROMPT DESIGN.pdf to PROMPT DESIGN.pdf
Saving Prompt Report.pdf to Prompt Report.pdf
Saving Prompt Engineering.pdf to Prompt Engineering.pdf
Saving GEN AI.pdf to GEN AI.pdf
Saving RAG.pdf to RAG.pdf
Saving Retrieval-Augmented Generation for Large.pdf to Retrieval-Augmented Generation for Large.pdf
Saving LLm2.pdf to LLm2.pdf
Saving LLM.pdf to LLM.pdf
Saving Low-resolution-25Oct2024-Conversations-for-tomorrow_Edition_9_Report-V2-1.pdf to Low-resolution-25Oct2024-Conversations-for-tomorrow_Edition_9_Report-V2-1.pdf
Saving ai2.pdf to ai2.pdf
Saving ai1.pdf to ai1.pdf
✅ Text extraction complete. Preview:
LLM Multi-Agent Systems: Challenges and Open Problems
Shanshan Han 1 Qifan Zhang 1 Yuhang Yao 2 Weizhao Jin 3 Zhaozhuo Xu 4
Abstract
This paper explores multi-agent systems an

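## 📚 Split Text into Chunks and Embed with Cohere

In this step:

- We split the extracted text into overlapping chunks with a small `chunk_text` helper (500 characters per chunk, with 100 characters of overlap between neighbours).
- Then, we embed the chunks in batches of 100 using Cohere’s `embed-english-v3.0` model, pausing between batches to stay within the API rate limit.

Chunking keeps each piece small enough to embed and retrieve on its own, while the overlap preserves context that would otherwise be cut at a chunk boundary.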


In [36]:
import cohere
import os
import time
from google.colab import userdata

# Set your API key using Colab Secrets Manager
# Add your Cohere API key to the Secrets Manager under the name 'COHERE_API_KEY'
co = cohere.Client(userdata.get('COHERE_API_KEY'))

# Chunk text into overlapping pieces
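def chunk_text(text, chunk_size=500, overlap=100):
    # The window must advance on every iteration, otherwise the loop never ends
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks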

# Chunk the raw text
chunks = chunk_text(raw_text)
print(f"✅ Split into {len(chunks)} chunks.")

# Embed chunks using Cohere with batching
all_embeddings = []
batch_size = 100
# Ceiling division, so an exact multiple of batch_size is not over-counted
num_batches = (len(chunks) + batch_size - 1) // batch_size

for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    print(f"📦 Embedding batch {i // batch_size + 1} of {num_batches}...")

    response = co.embed(
        texts=batch,
        model="embed-english-v3.0",
        input_type="search_document"
    )
    all_embeddings.extend(response.embeddings)

    # Pause to respect rate limit
    if i + batch_size < len(chunks):
        time.sleep(60)

print("✅ All embeddings created!")

✅ Split into 4352 chunks.
📦 Embedding batch 1 of 44...
📦 Embedding batch 2 of 44...
📦 Embedding batch 3 of 44...
📦 Embedding batch 4 of 44...
📦 Embedding batch 5 of 44...
📦 Embedding batch 6 of 44...
📦 Embedding batch 7 of 44...
📦 Embedding batch 8 of 44...
📦 Embedding batch 9 of 44...
📦 Embedding batch 10 of 44...
📦 Embedding batch 11 of 44...
📦 Embedding batch 12 of 44...
📦 Embedding batch 13 of 44...
📦 Embedding batch 14 of 44...
📦 Embedding batch 15 of 44...
📦 Embedding batch 16 of 44...
📦 Embedding batch 17 of 44...
📦 Embedding batch 18 of 44...
📦 Embedding batch 19 of 44...
📦 Embedding batch 20 of 44...
📦 Embedding batch 21 of 44...
📦 Embedding batch 22 of 44...
📦 Embedding batch 23 of 44...
📦 Embedding batch 24 of 44...
📦 Embedding batch 25 of 44...
📦 Embedding batch 26 of 44...
📦 Embedding batch 27 of 44...
📦 Embedding batch 28 of 44...
📦 Embedding batch 29 of 44...
📦 Embedding batch 30 of 44...
📦 Embedding batch 31 of 44...
📦 Embedding batch 32 of 44...
📦 Embedding batch 33 of
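
The chunk and batch counts above follow from simple arithmetic, which can be checked offline. The sketch below is a compact rewrite of `chunk_text` (using `range` instead of the `while` loop) applied to a toy string; `sample` and `pieces` are illustrative names, not part of the notebook, and no Cohere call is made.

```python
import math

def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Toy input: 1,200 characters, so windows start at 0, 400 and 800.
sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample)

print(len(pieces))                          # 3
print([len(p) for p in pieces])             # [500, 500, 400] (the last chunk is shorter)
print(pieces[0][-100:] == pieces[1][:100])  # True: neighbours share 100 characters

# Batch count via ceiling division for 4352 chunks in batches of 100.
print(math.ceil(4352 / 100))                # 44
```

With a step of 400 characters, a text of length `n` yields `ceil(n / 400)` chunks, and 4352 chunks need 44 batches of at most 100, which matches the log above.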

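## 🤖 Retrieve Relevant Chunks and Generate Answers

The cell below defines two helpers. `find_similar_chunks` embeds the query with Cohere’s `embed-english-v3.0` model (using `input_type="search_query"`) and ranks the stored chunk embeddings by cosine similarity. `generate_answer` passes the top-ranked chunks and the question to Cohere’s chat endpoint.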


In [55]:
import numpy as np

# Vector search: get top N similar chunks using cosine similarity
def find_similar_chunks(query, all_embeddings, chunks, co, k=5):
    # Embed the query
    query_embed = co.embed(
        texts=[query],
        model="embed-english-v3.0",
        input_type="search_query"
    ).embeddings[0]

    # Calculate cosine similarity
    similarities = []
    for i, emb in enumerate(all_embeddings):
        sim = np.dot(query_embed, emb) / (np.linalg.norm(query_embed) * np.linalg.norm(emb))
        similarities.append((i, sim))

    # Sort and select top k
    top_k = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
    return [chunks[i] for i, _ in top_k]

# Generate answer using top-k context
def generate_answer(query, context_chunks, co):
    context = "\n".join(context_chunks)
    message = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    response = co.chat(
        message=message,
        temperature=0.3,
        model="command-r" # Ensure correct model for chat
    )
    return response.text.strip()
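
The similarity loop in `find_similar_chunks` computes one dot product per chunk in Python. With a few thousand embeddings, the same ranking can be computed in a single matrix operation. The following is a minimal sketch on toy 2-D vectors; `top_k_cosine` and `toy_docs` are hypothetical names, and it assumes the embeddings fit in memory as a NumPy array.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (index, similarity) pairs for the k rows most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Cosine similarity: dot product divided by the product of the vector norms.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy "embeddings": row 0 points almost the same way as the query.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_cosine([1.0, 0.1], toy_docs, k=2)
print([i for i, _ in ranked])  # [0, 2]
```

In the notebook, `doc_matrix` would be `all_embeddings` and the returned indices would be used to look up entries in `chunks`.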

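## 🧠 Generate Answer using Cohere LLM
This alternative version of `generate_answer` takes the user query and the most relevant document chunks (retrieved earlier) and sends them to Cohere’s `generate` endpoint with the `command` model. Because it reuses the same function name, whichever definition was run last is the one in effect.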

In [41]:
def generate_answer(query, context_chunks, co):
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    response = co.generate(
        model="command",  # Cohere's text-generation model
        prompt=prompt,
        max_tokens=300,
        temperature=0.3
    )
    return response.generations[0].text.strip()


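## 💬 Generate Contextual Answer with co.chat()
This function generates an answer to the user’s question from the most relevant chunks of your uploaded PDFs. It works like this:

✅ Joins the top k relevant chunks into a single context block

✅ Builds a prompt in the format: `Context:` followed by the chunks, a blank line, `Question:` followed by the query, then `Answer:`

✅ Sends the prompt to `co.chat()` with `temperature=0.3` and without specifying a model, so Cohere’s default chat model is used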

In [43]:
def generate_answer(query, context_chunks, co):
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    response = co.chat(
        message=prompt,
        temperature=0.3
    )
    return response.text.strip()
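
To see exactly what text the model receives, the prompt construction can be run on its own. This is a small sketch with made-up chunks; `build_prompt` is a hypothetical helper that isolates the f-string used in `generate_answer`, and no API call is made.

```python
def build_prompt(query, context_chunks):
    """Assemble the Context / Question / Answer prompt used by generate_answer."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

example = build_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation.", "Retrieved text grounds the answer."],
)
print(example)
```

The chunks are separated only by single newlines, so the model cannot tell where one chunk ends and the next begins. If that matters for your documents, a visible separator between chunks may help.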


## 🧠 Ask Questions & Generate Answers (RAG Inference)
This is the main inference pipeline where the chatbot answers your question by combining semantic search and text generation, powered by Cohere:

✅ Query input — You ask a natural language question.

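✅ Step 1: Semantic search — The find_similar_chunks function embeds the query with Cohere and runs a brute-force cosine-similarity comparison against every stored chunk embedding to retrieve the top k most relevant chunks from the uploaded PDFs. No vector index such as FAISS is involved.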

✅ Step 2: Contextual generation — These chunks and the query are passed to the generate_answer function, which uses co.chat() to produce a smart, context-aware answer.

✅ Step 3: Print the result — The final answer is displayed.

In [45]:
# Ask your question here
query = "Summarize the key trends in the Conversations for Tomorrow report."

# Step 1: Find similar chunks
top_chunks = find_similar_chunks(query, all_embeddings, chunks, co, k=5)

# Step 2: Generate the answer
answer = generate_answer(query, top_chunks, co)

# Step 3: Show result
print("📌 Question:", query)
print("\n🧠 Answer:\n", answer)

📌 Question: Summarize the key trends in the Conversations for Tomorrow report.

🧠 Answer:
 The **Conversations for Tomorrow** report highlights several key trends and insights related to the intersection of generative AI (Gen AI), sustainability, and the future of business and society. Here’s a summary of the key trends:

1. **Rapid Adoption of Gen AI**: Organizations globally are rapidly embedding Gen AI across various functions, creating a ripple effect for wider societal impact.  
2. **Carbon Footprint Concerns**: Over one-third of organizations are already tracking their Gen AI carbon emissions, acknowledging the significant environmental impact of AI technologies.  
3. **Focus on Sustainability**: The report emphasizes the dual transition to a digital and sustainable economy, with a spotlight on sustainability and climate tech innovations.  
4. **Leadership and Innovation**: Gen AI is being leveraged for leadership and innovation, with a focus on uncovering innovations that matter
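
When an answer looks wrong, it helps to see which chunks were retrieved for it. The sketch below wraps the two steps in one function that returns the answer together with its sources. `answer_with_sources`, `toy_retrieve`, and `toy_generate` are hypothetical names; the toy functions are offline stand-ins (simple word overlap and string formatting) so the flow can be tried without an API key.

```python
def answer_with_sources(query, retrieve, generate, k=5):
    """Run retrieval, then generation, and return the answer with the chunks used."""
    top_chunks = retrieve(query, k)
    answer = generate(query, top_chunks)
    return {"question": query, "answer": answer, "sources": top_chunks}

# Offline stand-ins (assumptions for illustration, not the Cohere-backed functions).
toy_chunks = [
    "RAG retrieves text before generating.",
    "Gradio builds web UIs.",
    "Cosine similarity compares vectors.",
]

def toy_retrieve(query, k):
    # Rank chunks by how many words they share with the query.
    words = set(query.lower().split())
    def overlap(chunk):
        return len(words & set(chunk.lower().rstrip(".").split()))
    return sorted(toy_chunks, key=overlap, reverse=True)[:k]

def toy_generate(query, chunks):
    return f"Based on {len(chunks)} chunk(s): {chunks[0]}"

result = answer_with_sources("what does rag do before generating", toy_retrieve, toy_generate, k=1)
print(result["answer"])
print(result["sources"])
```

In the notebook, the stand-ins could be replaced with `lambda q, k: find_similar_chunks(q, all_embeddings, chunks, co, k=k)` and `lambda q, c: generate_answer(q, c, co)`.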

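## 💬 Generate Answers using co.chat() with the "command-r" Model
This function uses Cohere's chat() endpoint to generate contextual answers from the top retrieved chunks. It repeats the definition given earlier alongside find_similar_chunks, so running it restores the "command-r" version after the alternative definitions above.

Key actions:

🧩 Combine context chunks into a unified string block.

🧠 Build a message prompt with the format: `Context:` followed by the chunks, a blank line, `Question:` followed by the query, then `Answer:`.

🤖 Send the message to co.chat() with model="command-r" and temperature=0.3, then return the stripped response text.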

In [56]:
def generate_answer(query, context_chunks, co):
    context = "\n".join(context_chunks)
    message = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    response = co.chat(
        message=message,
        temperature=0.3,
        model="command-r" # Ensure correct model for chat
    )
    return response.text.strip()

## 🔁 Interactive Q&A Loop (Console-based)
This loop enables users to interactively ask unlimited questions in the terminal. It retrieves relevant context and generates answers in real-time using Cohere.

How it works:

💬 Prompts the user to input a question (or type 'exit' to stop).

🔎 Uses find_similar_chunks() to get top-k relevant document chunks based on semantic similarity.

🧠 Sends the query + context to generate_answer() using Cohere's chat() API.

🖨️ Displays the AI-generated answer in the console.

♻️ Repeats until the user types 'exit'.

In [57]:
while True:
    query = input("🔍 Ask a question (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        print("👋 Exiting Q&A session.")
        break

    try:
        top_chunks = find_similar_chunks(query, all_embeddings, chunks, co, k=5)
        answer = generate_answer(query, top_chunks, co)
        print("\n🧠 Answer:\n", answer)
        print("\n" + "-"*80 + "\n")
    except Exception as e:
        print("⚠️ Error:", str(e))

🔍 Ask a question (or type 'exit' to quit): hi

🧠 Answer:
 Hello! I hope you're doing well today. Is there a specific question you'd like to ask regarding the text provided? It seems like a collection of references related to AI and natural language processing.

--------------------------------------------------------------------------------

🔍 Ask a question (or type 'exit' to quit): exit
👋 Exiting Q&A session.


## 🧠 Interactive Terminal-Based Q&A + History Logger
This block enables a continuous chat loop in the terminal, allowing you to interactively ask questions and get answers from your uploaded PDFs using Cohere embeddings + generation.

🔄 Key Features:
💬 Ask Anything: User can input any question from the console.

🧠 Find Top Relevant Chunks: Uses semantic search via find_similar_chunks() to pull matching content.

🤖 Generate Accurate Answers: Feeds the context to generate_answer() (powered by Cohere's chat() API).

📜 Log Q&A: Every exchange is saved in the qa_history list for reference or export.

In [58]:
qa_history = []

while True:
    query = input("🔍 Ask a question (or type 'exit' to quit): ")
    if query.lower() == 'exit':
        print("👋 Exiting Q&A session.")
        break

    try:
        top_chunks = find_similar_chunks(query, all_embeddings, chunks, co, k=5)
        answer = generate_answer(query, top_chunks, co)
        print("\n🧠 Answer:\n", answer)
        print("\n" + "-"*80 + "\n")

        # Save each Q&A to history
        qa_history.append(f"Question: {query}\nAnswer: {answer}\n")

    except Exception as e:
        print("⚠️ Error:", str(e))

🔍 Ask a question (or type 'exit' to quit): hi

🧠 Answer:
 Hello! I hope you're doing well today. Is there a specific question you'd like to ask regarding the text provided? It seems like a research article or a summary of various studies related to natural language processing and interactive systems.

--------------------------------------------------------------------------------

🔍 Ask a question (or type 'exit' to quit): exit
👋 Exiting Q&A session.
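
A loop built directly on `input()` is hard to test. One option is to pass the input and output functions in as parameters, so the same loop can be driven by a scripted list of questions. The sketch below assumes a single-argument `answer_fn`; `run_qa_session` and the echo answerer are hypothetical, and the history is stored as dictionaries rather than the formatted strings used above.

```python
def run_qa_session(answer_fn, read_input=input, write=print):
    """Ask questions until the user types 'exit'; return the logged Q&A pairs."""
    history = []
    while True:
        query = read_input("Ask a question (or type 'exit' to quit): ").strip()
        if query.lower() == "exit":
            write("Exiting Q&A session.")
            break
        if not query:
            continue  # ignore empty input instead of sending it to the model
        try:
            answer = answer_fn(query)
        except Exception as exc:
            write(f"Error: {exc}")
            continue
        write(answer)
        history.append({"question": query, "answer": answer})
    return history

# Scripted run: two inputs are consumed, and the second one ends the session.
scripted = iter(["hi", "  EXIT "])
demo_history = run_qa_session(
    lambda q: f"echo: {q}",
    read_input=lambda _prompt: next(scripted),
    write=lambda *_: None,
)
print(demo_history)
```

In the notebook, `answer_fn` could be `lambda q: generate_answer(q, find_similar_chunks(q, all_embeddings, chunks, co, k=5), co)`.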


## 💾 Save Q&A History to a Text File
This optional step allows you to export the entire Q&A session (from the interactive loop) to a .txt file. Each question and answer pair stored in the qa_history list is saved line-by-line for later reference, sharing, or reporting.

In [51]:
# Save to text file (UTF-8 so non-ASCII characters in answers are preserved)
with open("qa_history.txt", "w", encoding="utf-8") as f:
    f.writelines(qa + "\n" for qa in qa_history)

print("✅ Q&A history saved to 'qa_history.txt'")


✅ Q&A history saved to 'qa_history.txt'
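
The plain-text export is easy to read, but it is awkward to parse back because answers can themselves contain blank lines. If the history needs to be reloaded by a program, a JSON export is one alternative. This is a sketch of that alternative, not what the notebook does: `save_history_json`, `load_history_json`, and the record layout are assumptions, and it writes to a temporary directory so it can be run anywhere.

```python
import json
import os
import tempfile

def save_history_json(records, path):
    """Write a list of {"question": ..., "answer": ...} dicts as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_history_json(path):
    """Read the records back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# An answer with a blank line survives the round trip unchanged.
records = [{"question": "hi", "answer": "Hello!\n\nHow can I help?"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qa_history.json")
    save_history_json(records, path)
    restored = load_history_json(path)

print(restored == records)  # True
```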


## 🌐 Install Gradio for Web UI
We'll now install Gradio, a lightweight Python library to build a web-based chatbot interface. This makes it easy to interact with the RAG system directly in your browser—no need to use the terminal.

In [52]:
!pip install gradio --quiet


## 🧠 Launch Chatbot with Gradio Web UI
Now let’s wrap our PDF QA system in a user-friendly chatbot interface using Gradio. This allows you to:

- Interact with your Cohere-powered chatbot in the browser
- Ask natural language questions about your uploaded PDFs
- Share your app via a public link (if `share=True` is enabled)

In [59]:
import gradio as gr

# Define chatbot function
def chatbot_interface(query):
    try:
        # all_embeddings, chunks, and co are module-level names created by the
        # setup cells above; reading them here does not require a `global` statement
        top_chunks = find_similar_chunks(query, all_embeddings, chunks, co, k=5)
        answer = generate_answer(query, top_chunks, co)
        return answer
    except Exception as e:
        return f"⚠️ Error: {str(e)}"

# Launch Gradio UI
gr.Interface(
    fn=chatbot_interface,
    inputs=gr.Textbox(lines=2, placeholder="Ask me anything from the PDFs..."),
    outputs="text",
    title="📚 CollabGPT",
    description="Ask questions based on your uploaded PDFs. Powered by Cohere embeddings + RAG.",
    theme="soft"
).launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7fba9f608539673f23.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


