# Software Engineering Study Assistant - RAG Pipeline

This notebook implements a Retrieval-Augmented Generation (RAG) chatbot designed to help software engineering students with:
- **Understanding complex topics** from lecture notes and textbooks
- **Solving previous year exam questions** with detailed explanations
- **Getting contextual answers** from course materials and PDFs
- **Study assistance** with proper references to source materials

**Technology Stack:**
- **PyMuPDF** for PDF lecture notes extraction
- **LangChain's RecursiveCharacterTextSplitter** for splitting text into chunks along paragraph and line boundaries
- **Sentence Transformers** (all-MiniLM-L6-v2) for semantic embeddings
- **ChromaDB** for fast similarity search across study materials
- **LangChain's retriever** for relevant content retrieval
- **Gemini Pro** for generating comprehensive answers with context
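
To make the chunking step in the stack above concrete, here is a minimal sketch of fixed-size chunking with overlap, using only the standard library. It is a simplified stand-in, not LangChain's implementation: the real `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line breaks. The function name `chunk_text` and the default sizes are illustrative assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; sizes are in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # How far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.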

In [None]:
## Study Assistant Pipeline Flow

```
Lecture Notes PDFs → PyMuPDF → Text Extraction → RecursiveCharacterTextSplitter → Knowledge Chunks
                                                           ↓
                                              Sentence Transformers → Semantic Embeddings → ChromaDB Knowledge Base
                                                           ↓
Student Question/Problem → Query Embedding → LangChain Retriever → Relevant Study Materials
                                                           ↓
                        Gemini Pro ← Context + Question → Detailed Answer with References
```
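
The retrieval half of the flow above can be sketched without any external services. The toy example below replaces Sentence Transformers with simple word counts and ChromaDB with a Python list, so it only illustrates the shape of the pipeline (embed, rank by similarity, build a prompt), not the quality of real semantic search. All function names here are illustrative assumptions.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the question, best first."""
    query_vec = embed(question)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, chunks: List[str], k: int = 2) -> str:
    """Combine the retrieved context and the question into one LLM prompt."""
    context = "\n\n".join(chunk for _, chunk in retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

In the actual notebook, the prompt would be sent to the language model; here the sketch stops at prompt construction so that it runs offline.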

**Use Cases:**
- "Explain object-oriented programming concepts"
- "How do I solve this data structures problem?"
- "What are the key points about software testing methodologies?"
- "Help me understand this previous year question on algorithms"

In [None]:
## Installation

Install all required packages using the requirements.txt file:

In [None]:
# Install required packages for RAG pipeline using requirements.txt
!pip install -r requirements.txt

In [None]:
## Import Libraries

In [None]:
import fitz  # PyMuPDF
import os
import io
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.schema import Document
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import uuid
from typing import List

In [None]:
## Gemini API Key Setup

Get your free Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)
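
A key pasted into a notebook cell is easy to leak when the notebook is shared or committed. One way to avoid that, sketched below, is to read the key from an environment variable and fall back to a hidden prompt. The helper name `load_api_key` and its parameters are assumptions for illustration; `GOOGLE_API_KEY` is the variable this notebook sets for the Gemini client.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "GOOGLE_API_KEY", prompt_if_missing: bool = False) -> str:
    """Return the API key from the environment, optionally prompting for it.

    Hypothetical helper: raises RuntimeError if no key can be obtained.
    """
    key = os.environ.get(env_var, "").strip()
    if not key and prompt_if_missing:
        # getpass hides the typed value so it never appears in cell output
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running the notebook")
    return key
```

Calling `load_api_key(prompt_if_missing=True)` in an interactive session asks for the key once without echoing it.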

In [None]:
# Set your Gemini API key. Prefer the GOOGLE_API_KEY environment variable;
# never commit a real key to a notebook or repository.
PLACEHOLDER_KEY = "your-gemini-api-key-here"
GEMINI_API_KEY = os.environ.get("GOOGLE_API_KEY", PLACEHOLDER_KEY)  # Or replace with your actual API key

# Verify API key is set
if GEMINI_API_KEY == PLACEHOLDER_KEY:
    print("Please replace 'your-gemini-api-key-here' with your actual Gemini API key")
    print("Get your free API key from: https://makersuite.google.com/app/apikey")
else:
    print("Gemini API key configured")
    # Set environment variable for Google Generative AI
    os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

In [None]:
## PDF Text Extraction with PyMuPDF

This section extracts text from text-based PDFs, meaning PDFs generated from digital documents rather than scans. PyMuPDF reads the embedded text layer directly, which is fast and needs no OCR.

**Supported PDF Types:**
- Documents exported from word processors such as Word
- LaTeX-generated PDFs
- Google Docs exports
- Any PDF with embedded text data

Scanned or image-only pages typically have no text layer, so the extraction code returns no text for them and skips those pages.
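
The extraction function in this section prefixes each non-empty page with a marker of the form `--- Lecture Page N ---`. As a rough sketch of why that is useful, the snippet below splits extracted text back into `(page_number, text)` pairs, which is one way a later step could attach page references to answers. The name `split_by_page` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import List, Tuple

# Matches the marker written by the extraction step, capturing the page number
PAGE_MARKER = re.compile(r"--- Lecture Page (\d+) ---")


def split_by_page(text: str) -> List[Tuple[int, str]]:
    """Split marker-delimited text into (page_number, page_text) pairs."""
    # With one capture group, re.split returns:
    # [preamble, "1", page_1_text, "2", page_2_text, ...]
    parts = PAGE_MARKER.split(text)
    pages = []
    for i in range(1, len(parts) - 1, 2):
        pages.append((int(parts[i]), parts[i + 1].strip()))
    return pages
```

Page numbers come straight from the markers, so pages skipped as empty during extraction simply do not appear in the result.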

In [None]:
import fitz  # PyMuPDF
import os

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract text from text-based PDFs using PyMuPDF
    Use this for PDFs created from digital documents (Word, LaTeX, Google Docs, etc.)
    """
    if not os.path.exists(pdf_path):
        print(f"Warning: PDF file not found: {pdf_path}")
        return ""
    
    print(f"Processing: {os.path.basename(pdf_path)}")
    print("  → Using direct text extraction")

    page_texts = []
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            page_text = page.get_text()

            if page_text.strip():  # Only add non-empty pages
                page_texts.append(f"\n\n--- Lecture Page {page_num} ---\n\n{page_text}")

    return "".join(page_texts)

# Add your text-based PDFs here (created from Word, LaTeX, Google Docs, etc.)
pdf_paths = [
    "./assets/metrics3.pdf", 
    "./assets/Lecture#7.pdf",
    "./assets/Sample.pdf",
    "./assets/GreedyAlgorithms.pdf"
]

all_extracted_text = ""
processed_count = 0  # Number of PDFs found and read

print("=== Processing Text-based PDFs ===")
for pdf_path in pdf_paths:
    if os.path.exists(pdf_path):
        extracted_text = extract_text_from_pdf(pdf_path)
        all_extracted_text += f"\n\n=== SOURCE: {os.path.basename(pdf_path)} ===\n\n" + extracted_text
        processed_count += 1
        print(f"Extracted {len(extracted_text)} characters from {os.path.basename(pdf_path)}")
    else:
        print(f"PDF file not found: {pdf_path}")

if all_extracted_text:
    print("\n=== EXTRACTION SUMMARY ===")
    print(f"Total extracted content: {len(all_extracted_text)} characters")
    print(f"PDFs processed: {processed_count}")
    print(f"First 500 characters:\n{all_extracted_text[:500]}...")
else:
    print("\nNo PDF files were processed.")
    print("Please add your text-based PDFs to the pdf_paths list")