# 🧠 Study Buddy: Build Your Own RAG Chatbot with Gemini
Upload a PDF, PowerPoint (.pptx), or plain-text file (e.g., course notes, a Wikipedia export, or an article).

Ask questions like:
- “Summarize Chapter 2”
- “What is reinforcement learning?”
- “What’s the main takeaway from this section?”


In [1]:
# 🧩 Step 1: Install dependencies
!pip install -q google-generativeai PyPDF2 faiss-cpu python-pptx

In [2]:
# 🧠 Step 2: Import libraries
import google.generativeai as genai
from getpass import getpass
import PyPDF2
import faiss
import numpy as np
import re

In [3]:
# ⚙️ Step 3: Configure Gemini API
GEMINI_API_KEY = getpass("🔑 Enter your Gemini API key: ")
genai.configure(api_key=GEMINI_API_KEY)

🔑 Enter your Gemini API key: ··········


In [4]:
# 🧾 Step 4: Upload your study material
from google.colab import files
import io
from pptx import Presentation  # python-pptx, used for .pptx files

uploaded = files.upload()

# Only the first uploaded file is processed, so upload one file at a time.
file_name = list(uploaded.keys())[0]
data = uploaded[file_name]  # raw bytes of the uploaded file
ext = file_name.lower()     # match extensions case-insensitively
text = ""

if ext.endswith(".pdf"):
    reader = PyPDF2.PdfReader(io.BytesIO(data))
    for page in reader.pages:
        # Fall back to "" when a page yields no text; the newline keeps
        # words on adjacent pages from running together.
        text += (page.extract_text() or "") + "\n"
elif ext.endswith(".pptx"):
    presentation = Presentation(io.BytesIO(data))
    for slide in presentation.slides:
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                text += shape.text + "\n"
elif ext.endswith(".ppt"):
    # python-pptx cannot read the legacy binary .ppt format.
    raise ValueError("Legacy .ppt files are not supported; save the deck as .pptx first.")
else:  # Treat anything else as UTF-8 text
    text = data.decode("utf-8")

print(f"✅ Loaded {len(text)} characters from {file_name}")

Saving 07 - Web Applications and Attacks.pptx to 07 - Web Applications and Attacks (3).pptx
Saving 08 - Web Application Attacks and Security.pptx to 08 - Web Application Attacks and Security (2).pptx
✅ Loaded 8240 characters from 07 - Web Applications and Attacks (3).pptx
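
The upload log above shows two files being saved, while the cell reads only the first key of `uploaded`. One possible way to load every file is sketched below. `detect_kind`, `load_all`, and the `extractors` mapping are hypothetical names that do not appear in the notebook, and the small byte strings stand in for the `{file name: bytes}` dictionary that `files.upload()` returns.

```python
from pathlib import Path

def detect_kind(file_name):
    """Map a file name to a loader kind by its case-insensitive extension."""
    suffix = Path(file_name).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".pptx":
        return "pptx"
    return "text"

def load_all(uploaded, extractors):
    """Concatenate the text of every uploaded file, not just the first."""
    parts = []
    for name, data in uploaded.items():
        extract = extractors[detect_kind(name)]  # pick the loader for this file type
        parts.append(extract(data))
    return "\n".join(parts)

# Stand-in for files.upload(); only a plain-text extractor is registered here.
fake_uploaded = {"notes.TXT": b"alpha", "more.txt": b"beta"}
extractors = {"text": lambda data: data.decode("utf-8")}
combined = load_all(fake_uploaded, extractors)
print(combined)
```

In the notebook, the `"pdf"` and `"pptx"` entries of `extractors` would wrap the PyPDF2 and python-pptx branches of Step 4.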


In [5]:
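# 🪄 Step 5: Split text into chunks
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # With overlap >= chunk_size, `start` would never advance and the loop would not end.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, keeping `overlap` characters of context
    return chunks

chunks = split_text(text)
print(f"📚 Split into {len(chunks)} chunks")

📚 Split into 11 chunks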
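
To make the overlap arithmetic concrete, here is a small worked example that reuses the same loop as `split_text` on a 26-character string. The values `chunk_size=10` and `overlap=3` are chosen only for illustration, so each chunk starts 7 characters after the previous one.

```python
import re

def split_text(text, chunk_size=1000, overlap=200):
    """Same sliding-window logic as Step 5, repeated so the example runs on its own."""
    text = re.sub(r"\s+", " ", text)
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
small_chunks = split_text(sample, chunk_size=10, overlap=3)
for i, chunk in enumerate(small_chunks):
    print(i, repr(chunk))
# 0 'abcdefghij'
# 1 'hijklmnopq'
# 2 'opqrstuvwx'
# 3 'vwxyz'
```

The last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is more likely to be seen whole by the retriever.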


In [6]:
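# 🧩 Step 6: Create embeddings and index
embed_model = "models/gemini-embedding-001"
embeddings = []

# One API call per chunk; each call returns a dict with an "embedding" list.
for chunk in chunks:
    result = genai.embed_content(model=embed_model, content=chunk)
    embeddings.append(result["embedding"])

# FAISS expects a 2-D float32 array of shape (num_chunks, embedding_dim).
embeddings = np.array(embeddings, dtype="float32")

# IndexFlatL2 does exact (brute-force) search using L2 distance.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
print("✅ Vector index built!")

✅ Vector index built!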
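
For intuition about what a flat L2 index computes, the following is a plain-Python sketch of exhaustive nearest-neighbor search. It is not how FAISS is implemented (FAISS is heavily optimized). The names `squared_l2` and `flat_l2_search` and the toy 2-D vectors are illustrative assumptions. Like `IndexFlatL2`, the sketch reports squared Euclidean distances.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Compare the query with every stored vector and return the k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings" in two dimensions; real embeddings have many more dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
print(indices)  # [1, 0]: vector 1 is closest, then vector 0
```

Because every stored vector is scanned, the cost grows linearly with the number of chunks, which is acceptable for the handful of chunks in this notebook.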


In [7]:
# 💬 Step 7: Define RAG query function
def retrieve(query, k=3):
    """Return the k chunks whose embeddings are closest to the query embedding."""
    q_embed = genai.embed_content(model=embed_model, content=query)["embedding"]
    # FAISS pads results with -1 when k exceeds the index size, and chunks[-1]
    # would silently return the last chunk, so cap k at the number of chunks.
    k = min(k, len(chunks))
    _, idx = index.search(np.array([q_embed], dtype="float32"), k)
    return [chunks[i] for i in idx[0]]

def ask_study_buddy(query):
    """Retrieve relevant chunks and ask Gemini to answer using them as context."""
    docs = retrieve(query)
    context = "\n\n".join(docs)
    prompt = (
        "You are Study Buddy, a helpful assistant for learning.\n"
        "Use the context below to answer the question concisely and clearly.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    model_name = "gemini-2.5-flash"
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text

# 🧪 Step 8: Try asking a question
question = "Summarize Chapter 2"
print(f"🤔 Q: {question}\n")
print("💡 A:", ask_study_buddy(question))

🤔 Q: Summarize Chapter 2

💡 A: Based on the provided context, there is no information or content labeled as "Chapter 2." The context primarily contains slide numbers (e.g., ITSS4360 12-24) discussing topics like data processing, Content Delivery Networks (CDNs), frontend vs. backend, cybersecurity certification roadmaps, and web application security.
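
The answer above illustrates that nearest-neighbor retrieval always returns `k` chunks, even when none of them matches the question. One optional refinement is to drop chunks whose distance exceeds a cutoff. The sketch below assumes the distances and indices come from a search like the one in `retrieve`. `filter_by_distance` and the `max_distance` value are hypothetical, and a useful cutoff would have to be tuned for the embedding model in use.

```python
def filter_by_distance(chunks, distances, indices, max_distance):
    """Keep only retrieved chunks whose distance is within max_distance."""
    kept = []
    for dist, i in zip(distances, indices):
        # Skip padding entries (-1) and chunks that are too far from the query.
        if i >= 0 and dist <= max_distance:
            kept.append(chunks[i])
    return kept

# Toy search result: chunk 2 is close, chunk 0 is far, and -1 is a padding entry.
toy_chunks = ["intro", "cdn basics", "frontend vs. backend"]
kept = filter_by_distance(
    toy_chunks,
    distances=[0.3, 1.7, 9.9],
    indices=[2, 0, -1],
    max_distance=1.0,  # illustrative cutoff, not a recommended value
)
print(kept)  # ['frontend vs. backend']
```

An empty result could then be used to tell the user that the material does not cover the question instead of calling the model.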


In [8]:
question = "Summarize Content Delivery Networks (CDNs)"
print(f"🤔 Q: {question}\n")
print("💡 A:", ask_study_buddy(question))

🤔 Q: Summarize Content Delivery Networks (CDNs)

💡 A: A Content Delivery Network (CDN) is a distributed network of servers that stores cached copies of static content. It improves an application's performance by reducing the latency that occurs when serving content from a single server.
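
The prompt assembly inside `ask_study_buddy` can also be pulled out into a pure function, which makes it easy to inspect what the model receives without calling the API. This is a sketch under stated assumptions: `build_prompt`, the numbered `[n]` labels, and the placeholder used when no chunks are retrieved are illustrative additions, while the instruction text matches the notebook's prompt.

```python
def build_prompt(query, docs):
    """Assemble a RAG prompt with each retrieved chunk labeled by number."""
    if docs:
        context = "\n\n".join(f"[{n}] {doc}" for n, doc in enumerate(docs, start=1))
    else:
        context = "(no relevant context found)"  # hypothetical placeholder text
    return (
        "You are Study Buddy, a helpful assistant for learning.\n"
        "Use the context below to answer the question concisely and clearly.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

demo_prompt = build_prompt(
    "What is a CDN?",
    ["A CDN caches static content.", "Servers are distributed."],
)
print(demo_prompt)
```

Printing `demo_prompt` before sending it is a quick way to check whether a poor answer comes from poor retrieval or from the model itself.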
