# Software Engineering Study Assistant - RAG Pipeline

This notebook implements a Retrieval-Augmented Generation (RAG) chatbot designed to help software engineering students with:
- **Understanding complex topics** from lecture notes and textbooks
- **Solving previous year exam questions** with detailed explanations
- **Getting contextual answers** from course materials and PDFs
- **Study assistance** with proper references to source materials

**Technology Stack:**
- **PyMuPDF** for PDF lecture notes extraction
- **LangChain's RecursiveCharacterTextSplitter** for separator-aware text chunking with overlap
- **Sentence Transformers** (all-MiniLM-L6-v2) for semantic embeddings
- **ChromaDB** for fast similarity search across study materials
- **LangChain's retriever** for relevant content retrieval
- **Gemini Pro** (`gemini-2.5-pro`) for generating answers grounded in the retrieved context

## Study Assistant Pipeline Flow

```
Lecture Notes PDFs → PyMuPDF → Text Extraction → RecursiveCharacterTextSplitter → Knowledge Chunks
                                                           ↓
                                              Sentence Transformers → Semantic Embeddings → ChromaDB Knowledge Base
                                                           ↓
Student Question/Problem → Query Embedding → LangChain Retriever → Relevant Study Materials
                                                           ↓
                        Gemini Pro ← Context + Question → Detailed Answer with References
```

**Use Cases:**
- "Explain object-oriented programming concepts"
- "How do I solve this data structures problem?"
- "What are the key points about software testing methodologies?"
- "Help me understand this previous year question on algorithms"

## Installation

Install all required packages using the requirements.txt file:

In [6]:
# Install required packages for RAG pipeline using requirements.txt
!pip install -r requirements.txt






## Import Libraries

In [7]:
import fitz  # PyMuPDF
import os
import io
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.schema import Document
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import uuid
from typing import List

## Gemini API Key Setup

Get your free Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)

In [8]:
# Read the Gemini API key from the environment; never commit a real key to a notebook
GEMINI_API_KEY = os.environ.get("GOOGLE_API_KEY", "your-gemini-api-key-here")  # Or paste your key here for local use only

# Verify API key is set
if GEMINI_API_KEY == "your-gemini-api-key-here":
    print("Please replace 'your-gemini-api-key-here' with your actual Gemini API key")
    print("Get your free API key from: https://makersuite.google.com/app/apikey")
else:
    print("Gemini API key configured")
    print(f"API key starts with: {GEMINI_API_KEY[:8]}...")

# Set environment variable for Google Generative AI
os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

Gemini API key configured
API key starts with: AIzaSyCA...


## PDF Text Extraction with PyMuPDF

This section extracts text directly from text-based PDFs, that is, PDFs created from digital documents (Word, LaTeX, Google Docs, etc.) rather than scanned images.

**Supported PDF Types:**
- Documents created from Word processors
- LaTeX-generated PDFs  
- Google Docs exports
- Any PDF with embedded text data

The pipeline uses PyMuPDF for fast and accurate text extraction from digital documents.

In [9]:
import fitz  # PyMuPDF
import os

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract text from text-based PDFs using PyMuPDF
    Use this for PDFs created from digital documents (Word, LaTeX, Google Docs, etc.)
    """
    if not os.path.exists(pdf_path):
        print(f"Warning: PDF file not found: {pdf_path}")
        return ""
    
    print(f"Processing: {os.path.basename(pdf_path)}")
    print(f"  → Using direct text extraction")
    
    doc = fitz.open(pdf_path)
    text = ""
    
    for page_num in range(len(doc)):
        page = doc[page_num]
        page_text = page.get_text()
        
        if page_text.strip():  # Only add non-empty pages
            text += f"\n\n--- Lecture Page {page_num + 1} ---\n\n"
            text += page_text
    
    doc.close()
    return text

# Add your text-based PDFs here (created from Word, LaTeX, Google Docs, etc.)
pdf_paths = [
    "./assets/metrics3.pdf", 
    "./assets/Lecture#7.pdf",
    "./assets/Sample.pdf",
    "./assets/GreedyAlgorithms.pdf"
]

all_extracted_text = ""

print("=== Processing Text-based PDFs ===")
for pdf_path in pdf_paths:
    if os.path.exists(pdf_path):
        extracted_text = extract_text_from_pdf(pdf_path)
        all_extracted_text += f"\n\n=== SOURCE: {os.path.basename(pdf_path)} ===\n\n" + extracted_text
        print(f"Extracted {len(extracted_text)} characters from {os.path.basename(pdf_path)}")
    else:
        print(f"PDF file not found: {pdf_path}")

if all_extracted_text:
    print(f"\n=== EXTRACTION SUMMARY ===")
    print(f"Total extracted content: {len(all_extracted_text)} characters")
    print(f"PDFs processed: {len([p for p in pdf_paths if os.path.exists(p)])}")
    print(f"First 500 characters:\n{all_extracted_text[:500]}...")
else:
    print("\nNo PDF files were processed.")
    print("Please add your text-based PDFs to the pdf_paths list")

=== Processing Text-based PDFs ===
Processing: metrics3.pdf
  → Using direct text extraction
Extracted 17980 characters from metrics3.pdf
Processing: Lecture#7.pdf
  → Using direct text extraction
Extracted 19743 characters from Lecture#7.pdf
Processing: Sample.pdf
  → Using direct text extraction
Extracted 20737 characters from Sample.pdf
Processing: GreedyAlgorithms.pdf
  → Using direct text extraction
Extracted 19614 characters from GreedyAlgorithms.pdf

=== EXTRACTION SUMMARY ===
Total extracted content: 78209 characters
PDFs processed: 4
First 500 characters:


=== SOURCE: metrics3.pdf ===



--- Lecture Page 1 ---

                                                           
 
 
Institute of Information Technology 
University of Dhaka 
 
 
Topic: Function Point Analysis of SPL-2 Project 
Software Metrics (SE-611) 
 
 
 
 
 
  
Submitted to 
Dr. Emon Kumar Dey  
Associate Professor  
IIT, University of Dhaka  
 
 
 
Submitted by 
Md. Shakibul Islam Shakib - BSSE 1404 
Nandan Bhowmi

### Understanding Text-based PDF Processing

**Text-based PDFs**: Created from digital documents (Word, LaTeX, Google Docs, etc.) - contain actual text data that can be directly extracted.

**Processing Features:**
- Fast direct text extraction using PyMuPDF
- Returns plain text with line breaks; visual formatting such as fonts and layout is not preserved
- Works with standard PDFs that contain embedded text
- Marks page boundaries with `--- Lecture Page N ---` separators

**Best Results With:**
- Documents created from word processors (Word, Google Docs, etc.)
- LaTeX-generated academic papers and textbooks
- Exported PDFs from presentation software
- Any PDF with selectable/copyable text

**Note**: This pipeline is optimized for text-based PDFs. If you have scanned documents (images of text), you would need OCR functionality, which can be added later if needed.
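
The note above separates text-based PDFs from scanned ones. A minimal sketch of one way to flag pages that are probably scanned before indexing them, assuming that a page with almost no extractable text is an image. The helper name `find_low_text_pages` and the `min_chars` threshold are hypothetical and not part of this notebook; in the notebook the per-page strings would come from `page.get_text()`.

```python
from typing import List


def find_low_text_pages(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    Assumption: a page with almost no extractable text is likely a scanned image
    and would need OCR. The threshold is an illustrative guess, not a tuned value.
    """
    flagged = []
    for page_number, text in enumerate(page_texts, start=1):
        if len(text.strip()) < min_chars:
            flagged.append(page_number)
    return flagged


# Toy input standing in for [page.get_text() for page in doc]
sample_pages = [
    "Function Point Analysis of SPL-2 Project",
    "   \n  ",
    "Software Metrics (SE-611)",
]
low_text_pages = find_low_text_pages(sample_pages)
print(low_text_pages)
```

Pages reported by such a check could be skipped with a warning or routed to an OCR step if one is added later.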

## Text Chunking with LangChain's RecursiveCharacterTextSplitter

In [10]:
def create_overlapping_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Document]:
    """
    Split lecture notes into overlapping chunks for better retrieval
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    
    # Split text into chunks
    chunks = text_splitter.split_text(text)
    
    # Convert to LangChain Documents with educational metadata
    documents = []
    for i, chunk in enumerate(chunks):
        # Extract source from chunk if available
        source = "Unknown"
        if "=== SOURCE:" in chunk:
            lines = chunk.split('\n')
            for line in lines:
                if "=== SOURCE:" in line:
                    source = line.replace("=== SOURCE:", "").replace("===", "").strip()
                    break
        
        doc = Document(
            page_content=chunk,
            metadata={
                "chunk_id": i,
                "source": source,
                "chunk_size": len(chunk),
                "content_type": "lecture_notes"
            }
        )
        documents.append(doc)
    
    return documents

# Create chunks from all extracted text
if 'all_extracted_text' in locals() and all_extracted_text:
    documents = create_overlapping_chunks(all_extracted_text)
    print(f"Created {len(documents)} knowledge chunks from study materials")
    print(f"Average chunk size: {sum(len(doc.page_content) for doc in documents) // len(documents)} characters")
    
    # Show sources processed
    sources = set(doc.metadata.get('source', 'Unknown') for doc in documents)
    print(f"Sources processed: {', '.join(sources)}")
    
    print(f"\nFirst chunk preview:\n{documents[0].page_content[:300]}...")
else:
    print("No extracted text available for chunking. Please ensure PDFs are properly loaded.")

Created 123 knowledge chunks from study materials
Average chunk size: 686 characters
Sources processed: Unknown, Sample.pdf, Lecture#7.pdf, metrics3.pdf, GreedyAlgorithms.pdf

First chunk preview:
=== SOURCE: metrics3.pdf ===



--- Lecture Page 1 ---

                                                           
 
 
Institute of Information Technology 
University of Dhaka 
 
 
Topic: Function Point Analysis of SPL-2 Project 
Software Metrics (SE-611) 
 
 
 
 
 
  
Submitted to 
Dr. Emon Kumar ...
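
The output above lists `Unknown` among the sources because the `=== SOURCE: ... ===` marker appears only in the first chunk of each file, so every later chunk falls back to the default. A minimal sketch of one possible fix, assuming chunks are processed in document order: remember the last marker seen and carry it forward. The helper `assign_sources` is hypothetical and not part of the notebook; it reuses the marker parsing from `create_overlapping_chunks`.

```python
from typing import List

SOURCE_PREFIX = "=== SOURCE:"


def assign_sources(chunks: List[str], default: str = "Unknown") -> List[str]:
    """Return one source label per chunk, carrying the last seen marker forward."""
    current = default
    labels = []
    for chunk in chunks:
        for line in chunk.split("\n"):
            if SOURCE_PREFIX in line:
                # Same parsing as create_overlapping_chunks; the last marker in a
                # chunk wins, so a chunk spanning two files gets the later one.
                current = line.replace(SOURCE_PREFIX, "").replace("===", "").strip()
        labels.append(current)
    return labels


# Toy chunks in document order (illustrative text, not real extraction output)
toy_chunks = [
    "=== SOURCE: a.pdf ===\n--- Lecture Page 1 ---\nintro",
    "--- Lecture Page 2 ---\nmore",
    "tail of a\n=== SOURCE: b.pdf ===\nstart of b",
    "rest of b",
]
labels = assign_sources(toy_chunks)
print(labels)
```

Inside `create_overlapping_chunks`, the same idea would amount to moving the `source` variable outside the loop so it is not reset to `"Unknown"` for each chunk.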


## Initialize Sentence Transformers for Embeddings

In [11]:
# Initialize Sentence Transformers embeddings
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

print("Sentence Transformers (all-MiniLM-L6-v2) model loaded")
print(f"Embedding dimension: 384")  # all-MiniLM-L6-v2 produces 384-dimensional embeddings



Sentence Transformers (all-MiniLM-L6-v2) model loaded
Embedding dimension: 384


## Create ChromaDB Vector Store

In [12]:
# Set up ChromaDB knowledge base for study materials
persist_directory = "./chroma_db"

# Create or load ChromaDB vector store for educational content
if 'documents' in locals() and documents:
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
        collection_name="software_engineering_knowledge_base"
    )
    
    # Persist the database
    vectorstore.persist()
    
    print(f"Software Engineering Knowledge Base created with {len(documents)} chunks")
    print(f"Database persisted to: {persist_directory}")
    print("Your study materials are now ready for questions!")
else:
    print("No study materials available for knowledge base creation")

Software Engineering Knowledge Base created with 123 chunks
Database persisted to: ./chroma_db
Your study materials are now ready for questions!




## Create LangChain Retriever

In [13]:
# Create a retriever for study materials
if 'vectorstore' in locals():
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}  # Retrieve top 5 most relevant study materials
    )
    
    print("Study Materials Retriever created")
    print("Search type: similarity")
    print("Number of chunks retrieved per query: 5")
    
    # Test the retriever with a typical student question
    test_query = "What are the main concepts in object-oriented programming?"
    retrieved_docs = retriever.invoke(test_query)  # invoke() replaces the deprecated get_relevant_documents()
    print(f"\nTest retrieval for '{test_query}':")
    print(f"Retrieved {len(retrieved_docs)} relevant study materials")
    if retrieved_docs:
        print(f"First retrieved content preview:\n{retrieved_docs[0].page_content[:200]}...")
        print(f"Source: {retrieved_docs[0].metadata.get('source', 'Unknown')}")
else:
    print("Knowledge base not available for retriever creation")

Study Materials Retriever created
Search type: similarity
Number of chunks retrieved per query: 5

Test retrieval for 'What are the main concepts in object-oriented programming?':
Retrieved 5 relevant study materials
First retrieved content preview:
--- Lecture Page 14 ---

Design Size
• Object-oriented designs add new abstraction mechanisms: objects, classes, 
interfaces, operations, methods, associations, inheritance, etc.
• Thus, we will measu...
Source: Unknown
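
To make the `search_type="similarity"` and `k=5` settings concrete, here is a toy sketch of top-k retrieval by cosine similarity over plain Python lists. It is a brute-force illustration under assumptions: the vectors are made up, the function names are hypothetical, and the distance metric and index that ChromaDB actually uses depend on how the collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 5) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional vectors; real embeddings here have 384 dimensions
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.0]
ranked = top_k(toy_query, toy_vectors, k=2)
print([index for index, _ in ranked])
```

The retriever performs the same kind of ranking: the query is embedded with the same model as the chunks, and the `k` closest chunks are returned.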




## Initialize Gemini Pro with API Key

In [14]:
# Initialize Gemini Pro LLM with API key
try:
    llm = ChatGoogleGenerativeAI(
        model="models/gemini-2.5-pro",
        google_api_key=GEMINI_API_KEY,
        temperature=0.3,
        # Note: with thinking-enabled Gemini models this budget may also cover
        # reasoning tokens; if answers come back empty, try raising it
        max_output_tokens=1024
    )
    
    print("Gemini Pro LLM initialized with API key")
    print("Model: gemini-2.5-pro")
    print("Temperature: 0.3")
    print("Max output tokens: 1024")
    
except Exception as e:
    print(f"Error initializing Gemini Pro: {e}")
    print("Please ensure you have:")
    print("1. Valid Gemini API key")
    print("2. Correct API key format")
    print("3. Get your key from: https://makersuite.google.com/app/apikey")

Gemini Pro LLM initialized with API key
Model: gemini-2.5-pro
Temperature: 0.3
Max output tokens: 1024


## Create RAG Chain with Custom Prompt

In [15]:
# Define a custom prompt template for educational assistance
prompt_template = PromptTemplate(
    template="""
You are an AI Study Assistant for Software Engineering students. Your role is to help students understand concepts, solve problems, and prepare for exams using their course materials.

Instructions:
- Provide clear, detailed explanations suitable for students
- Include examples when helpful for understanding
- Reference the source materials when possible
- For previous year questions, provide step-by-step solutions
- If you need to make assumptions, state them clearly
- If the context doesn't contain enough information, say so and suggest what additional materials might help

Context from Study Materials:
{context}

Student Question: {question}

Study Assistant Response:""",
    input_variables=["context", "question"]
)

# Create the Study Assistant RAG chain
if 'llm' in locals() and 'retriever' in locals():
    rag_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt_template}
    )
    
    print("Software Engineering Study Assistant created successfully")
    print("Chain type: stuff (combines all retrieved study materials)")
    print("Returns source documents: Yes")
    print("Ready to help with your studies!")
else:
    print("Cannot create Study Assistant - missing LLM or retriever")

Software Engineering Study Assistant created successfully
Chain type: stuff (combines all retrieved study materials)
Returns source documents: Yes
Ready to help with your studies!
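
The `stuff` chain type places all retrieved chunks into a single prompt. A simplified sketch of that idea, assuming the chunks are joined with blank lines and substituted into the `{context}` slot; the shortened template and the function name `build_stuffed_prompt` are illustrative, and LangChain's internal implementation may differ in detail.

```python
from typing import List

# Shortened stand-in for the notebook's prompt template
TEMPLATE = (
    "Context from Study Materials:\n{context}\n\n"
    "Student Question: {question}\n\n"
    "Study Assistant Response:"
)


def build_stuffed_prompt(question: str, chunks: List[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts standing in for retrieved page_content values
prompt = build_stuffed_prompt("What is COCOMO?", ["chunk one", "chunk two"])
print(prompt)
```

Because every chunk ends up in one prompt, the values of `k` and `chunk_size` together determine how much context the model receives per question.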


## Test the RAG Pipeline

In [16]:
def ask_study_question(question: str):
    """
    Ask a study-related question using the RAG pipeline
    """
    if 'rag_chain' not in globals():  # rag_chain is created at notebook (module) level
        print("Study Assistant not available")
        return
    
    try:
        # Get response from RAG chain
        result = rag_chain.invoke({"query": question})  # invoke() replaces the deprecated direct call
        
        print(f"Student Question: {question}")
        print(f"\nStudy Assistant Answer:\n{result['result']}")
        
        # Show source materials referenced
        print(f"\nSource Materials Referenced ({len(result['source_documents'])})")
        print("-" * 50)
        for i, doc in enumerate(result['source_documents'], 1):
            source = doc.metadata.get('source', 'Unknown')
            chunk_id = doc.metadata.get('chunk_id', 'N/A')
            print(f"\nSource {i}: {source} (Chunk {chunk_id})")
            print(f"Content Preview: {doc.page_content[:150]}...")
            
    except Exception as e:
        print(f"Error during study query: {e}")

# Test the Study Assistant with a sample question
if 'rag_chain' in locals():
    # Test with a typical software engineering question
    ask_study_question("What are the key principles of software design?")
else:
    print("Study Assistant not ready for testing")



Student Question: What are the key principles of software design?

Study Assistant Answer:


Source Materials Referenced (5)
--------------------------------------------------

Source 1: Unknown (Chunk 33)
Content Preview: --- Lecture Page 7 ---

Halstead’s Approach
• Halstead’s Software Science is a theoretical approach to measuring software
complexity and predicting at...

Source 2: Unknown (Chunk 37)
Content Preview: --- Lecture Page 14 ---

Design Size
• Object-oriented designs add new abstraction mechanisms: objects, classes, 
interfaces, operations, methods, ass...

Source 3: Unknown (Chunk 26)
Content Preview: ●​ Medium-sized projects with moderate complexity.​
 
●​ Teams consist of both experienced and less-experienced members.​
 
●​ Requirements may be par...

Source 4: Unknown (Chunk 42)
Content Preview: --- Lecture Page 25 ---

Project Size - Metrics
Availability of Size Estimation Metrics:
Development Phase
Available 
Metrics
a
Requirements Gathering...

Source 5: Unknown (

## Interactive Q&A Session

In [18]:

# Ask your own study question here
your_question = "What is the COCOMO model and how to estimate the cost of a software?"  # Modify this

if 'rag_chain' in locals():
    print(f"\nAsking: {your_question}")
    print("=" * 60)
    ask_study_question(your_question)
else:
    print("\nStudy Assistant not ready. Please ensure all previous cells ran successfully.")


Asking: What is the COCOMO model and how to estimate the cost of a software?
Student Question: What is the COCOMO model and how to estimate the cost of a software?

Study Assistant Answer:


Source Materials Referenced (5)
--------------------------------------------------

Source 1: Unknown (Chunk 53)
Content Preview: --- Lecture Page 49 ---

Intermediate COCOMO Model
• Intermediate COCOMO model is an extension of the Basic COCOMO 
model which includes a set of cost...

Source 2: Sample.pdf (Chunk 57)
Content Preview: --- Lecture Page 55 ---

Detailed/Advanced COCOMO Model
• In the Detailed COCOMO Model, the cost of each subsystem is estimated 
separately. This appr...

Source 3: Unknown (Chunk 56)
Content Preview: --- Lecture Page 54 ---

Detailed/Advanced COCOMO Model
• The model accounts for the influence of the 
individual development phase (analysis, 
design...

Source 4: Unknown (Chunk 91)
Content Preview: --- Lecture Page 17 ---

 
To calculate the efficiency of the project, w

## Software Engineering Study Assistant Summary

This notebook implements a study assistant for software engineering students, built from:

1. **PyMuPDF** - Extracts text from lecture notes, textbooks, and previous year papers
2. **RecursiveCharacterTextSplitter** - Splits text into overlapping chunks (1,000 characters with a 200-character overlap) for retrieval
3. **Sentence Transformers** (all-MiniLM-L6-v2) - Semantic understanding of technical concepts
4. **ChromaDB** - Fast search across your entire study material collection
5. **LangChain Retriever** - Finds most relevant content for your questions
6. **Gemini Pro** - Provides detailed explanations with proper context and references

**Perfect for:**
- Understanding complex software engineering concepts
- Solving previous year exam questions step-by-step
- Getting quick explanations with proper source references
- Preparing for exams with comprehensive study assistance
- Clarifying doubts from multiple lecture sources

**To get started:**
1. Get your free Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Replace `your-gemini-api-key-here` with your actual API key
3. Add your lecture notes PDFs to the `pdf_paths` list in the PDF extraction cell
4. Run all cells to build your knowledge base
5. Start asking questions about your course materials!
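
For step 3, listing every file by hand becomes tedious as the collection grows. A small sketch that gathers all PDFs from a folder instead, assuming they sit directly in one directory; the helper `collect_pdf_paths` is hypothetical and not part of the notebook.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(directory: str) -> List[str]:
    """Return the sorted paths of all .pdf files directly inside directory."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # Missing folder: nothing to process
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# "./assets" is the folder used by the notebook's pdf_paths list
pdf_paths = collect_pdf_paths("./assets")
print(f"Found {len(pdf_paths)} PDF file(s)")
```

The resulting list can be passed to the extraction loop in place of the hand-written `pdf_paths`.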

**Pro Tips:**
- Add all your course PDFs for comprehensive coverage
- Ask specific questions for better answers
- Use the source references to dive deeper into topics
- Perfect for exam preparation and assignment help