# Software Engineering Study Assistant - RAG Pipeline

This notebook implements a Retrieval-Augmented Generation (RAG) chatbot designed to help software engineering students with:
- **Understanding complex topics** from lecture notes and textbooks
- **Solving previous year exam questions** with detailed explanations
- **Getting contextual answers** from course materials and PDFs
- **Study assistance** with proper references to source materials

**Technology Stack:**
- **PyMuPDF** for PDF lecture notes extraction
- **LangChain's RecursiveCharacterTextSplitter** for intelligent text chunking
- **Sentence Transformers** (all-MiniLM-L6-v2) for semantic embeddings
- **ChromaDB** for fast similarity search across study materials
- **LangChain's retriever** for relevant content retrieval
- **Gemini** (`gemini-1.5-flash`, via `langchain_google_genai`) for generating answers grounded in the retrieved context

## Study Assistant Pipeline Flow

```
Lecture Notes PDFs → PyMuPDF → Text Extraction → RecursiveCharacterTextSplitter → Knowledge Chunks
                                                           ↓
                                              Sentence Transformers → Semantic Embeddings → ChromaDB Knowledge Base
                                                           ↓
Student Question/Problem → Query Embedding → LangChain Retriever → Relevant Study Materials
                                                           ↓
                        Gemini Pro ← Context + Question → Detailed Answer with References
```

**Use Cases:**
- "Explain object-oriented programming concepts"
- "How do I solve this data structures problem?"
- "What are the key points about software testing methodologies?"
- "Help me understand this previous year question on algorithms"

## Installation

Install all required packages using the requirements.txt file:

In [1]:
# Install required packages for RAG pipeline using requirements.txt
!pip install -r requirements.txt



## Import Libraries

In [2]:
# Standard library
import os
from typing import List

# Third-party: only the names used in the cells below
import fitz  # PyMuPDF
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI

## Gemini API Key Setup

Get your free Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)

In [3]:
# Read the Gemini API key from the environment instead of hard-coding it in the notebook.
# A real key must never be committed or shared; revoke any key that has been exposed.
PLACEHOLDER_KEY = "your-gemini-api-key-here"
GEMINI_API_KEY = os.environ.get("GOOGLE_API_KEY", PLACEHOLDER_KEY)

# Verify API key is set
if GEMINI_API_KEY == PLACEHOLDER_KEY:
    print(f"Please replace '{PLACEHOLDER_KEY}' with your actual Gemini API key")
    print("Get your free API key from: https://makersuite.google.com/app/apikey")
else:
    print("Gemini API key configured")
    # Set environment variable for Google Generative AI
    os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY

Gemini API key configured


## PDF Text Extraction with PyMuPDF

This section extracts text directly from text-based PDFs, that is, PDFs created from digital documents (Word, LaTeX, Google Docs, etc.) that carry embedded text rather than scanned page images.

**Supported PDF Types:**
- Documents created from Word processors
- LaTeX-generated PDFs  
- Google Docs exports
- Any PDF with embedded text data

The pipeline uses PyMuPDF for fast and accurate text extraction from digital documents.

In [4]:
import fitz  # PyMuPDF
import os

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extract text from a text-based PDF using PyMuPDF.
    Use this for PDFs created from digital documents (Word, LaTeX, Google Docs, etc.).
    Returns an empty string if the file does not exist.
    """
    if not os.path.exists(pdf_path):
        print(f"Warning: PDF file not found: {pdf_path}")
        return ""

    print(f"Processing: {os.path.basename(pdf_path)}")
    print("  → Using direct text extraction")

    parts = []
    # The context manager closes the document even if extraction raises
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            page_text = page.get_text()

            if page_text.strip():  # Only add non-empty pages
                parts.append(f"\n\n--- Lecture Page {page_num} ---\n\n")
                parts.append(page_text)

    return "".join(parts)

# Add your text-based PDFs here (created from Word, LaTeX, Google Docs, etc.)
pdf_paths = [
    "./assets/metrics3.pdf", 
    "./assets/Lecture#7.pdf",
    "./assets/Sample.pdf",
    "./assets/GreedyAlgorithms.pdf"
]

extracted_texts = {}

print("=== Processing Text-based PDFs ===")
for pdf_path in pdf_paths:
    if os.path.exists(pdf_path):
        text = extract_text_from_pdf(pdf_path)
        extracted_texts[os.path.basename(pdf_path)] = text
        print(f"Extracted {len(text)} characters from {os.path.basename(pdf_path)}")
    else:
        print(f"PDF file not found: {pdf_path}")

# Optional summary of the extraction step (uncomment to run)
# if extracted_texts:
#     total_chars = sum(len(t) for t in extracted_texts.values())
#     first_text = next(iter(extracted_texts.values()))
#     print("\n=== EXTRACTION SUMMARY ===")
#     print(f"Total extracted content: {total_chars} characters")
#     print(f"PDFs processed: {len(extracted_texts)}")
#     print(f"First 500 characters:\n{first_text[:500]}...")
# else:
#     print("\nNo PDF files were processed.")
#     print("Please add your text-based PDFs to the pdf_paths list")

=== Processing Text-based PDFs ===
Processing: metrics3.pdf
  → Using direct text extraction
Extracted 17980 characters from metrics3.pdf
Processing: Lecture#7.pdf
  → Using direct text extraction
Extracted 19743 characters from Lecture#7.pdf
Processing: Sample.pdf
  → Using direct text extraction
Extracted 20737 characters from Sample.pdf
Processing: GreedyAlgorithms.pdf
  → Using direct text extraction
Extracted 19614 characters from GreedyAlgorithms.pdf

### Understanding Text-based PDF Processing

**Text-based PDFs**: Created from digital documents (Word, LaTeX, Google Docs, etc.) - contain actual text data that can be directly extracted.

**Processing Features:**
- Fast direct text extraction using PyMuPDF
- Returns plain text with the source line breaks; fonts, images, and visual layout are not carried over
- Works with all standard PDF formats containing embedded text
- Preserves page structure with clear page separators

**Best Results With:**
- Documents created from word processors (Word, Google Docs, etc.)
- LaTeX-generated academic papers and textbooks
- Exported PDFs from presentation software
- Any PDF with selectable/copyable text

**Note**: This pipeline is optimized for text-based PDFs. If you have scanned documents (images of text), you would need OCR functionality, which can be added later if needed.
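
Because `extract_text_from_pdf` writes a `--- Lecture Page N ---` separator only for pages that yield text, the separators can be used to recover per-page text and to spot gaps. The sketch below is one way to do that with the standard library; the helper names `split_pages` and `missing_pages` are hypothetical and not part of the notebook, and the sample string is made up for illustration.

```python
import re
from typing import Dict, List

# Matches the separator written by extract_text_from_pdf, e.g. "--- Lecture Page 4 ---"
PAGE_MARKER = re.compile(r"--- Lecture Page (\d+) ---")

def split_pages(extracted_text: str) -> Dict[int, str]:
    """Map page number -> page text, using the page separators as boundaries."""
    # With one capture group, re.split returns [prefix, num1, text1, num2, text2, ...]
    parts = PAGE_MARKER.split(extracted_text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages

def missing_pages(pages: Dict[int, str]) -> List[int]:
    """Page numbers up to the last seen page for which no text was extracted."""
    if not pages:
        return []
    return [n for n in range(1, max(pages) + 1) if n not in pages]

# Made-up sample in the same shape as the extractor's output; page 2 is absent
sample = (
    "\n\n--- Lecture Page 1 ---\n\nSoftware size metrics"
    "\n\n--- Lecture Page 3 ---\n\nFunction points"
)
pages = split_pages(sample)
print(pages)
print(missing_pages(pages))
```

A gap in the page numbers suggests a blank or image-only (scanned) page, which is where OCR would be needed. Pages missing after the last extracted page cannot be detected this way, since the text carries no total page count.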

## Text Chunking with LangChain's RecursiveCharacterTextSplitter

In [5]:
def create_overlapping_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Document]:
    """
    Split lecture notes into overlapping chunks for better retrieval
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    
    chunks = text_splitter.split_text(text)
    
    documents = []
    for i, chunk in enumerate(chunks):
        doc = Document(
            page_content=chunk,
            metadata={
                "chunk_id": i,
                "source": "Unknown",  # default, will be overwritten later
                "chunk_size": len(chunk),
                "content_type": "lecture_notes"
            }
        )
        documents.append(doc)
    
    return documents

documents = []

for source, text in extracted_texts.items():
    docs = create_overlapping_chunks(text)
    for doc in docs:
        doc.metadata["source"] = source
    documents.extend(docs)

unknown_sources = [doc for doc in documents if doc.metadata.get("source") == "Unknown"]
print(f"Chunks with Unknown source: {len(unknown_sources)}")

Chunks with Unknown source: 0
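
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter tries the separators in order (`"\n\n"`, `"\n"`, `" "`, `""`) so that chunks end at paragraph or line boundaries where possible, which is why real chunks are often shorter than `chunk_size`. The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Fixed-size character windows; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# Small numbers so the overlap is easy to see
text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk.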


## Initialize Sentence Transformers for Embeddings

In [6]:
# Initialize Sentence Transformers embeddings
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

print("Sentence Transformers (all-MiniLM-L6-v2) model loaded")
print(f"Embedding dimension: 384")  # all-MiniLM-L6-v2 produces 384-dimensional embeddings


Sentence Transformers (all-MiniLM-L6-v2) model loaded
Embedding dimension: 384
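
Semantic search compares embedding vectors, most commonly by cosine similarity. The sketch below shows that comparison in plain Python; the three-dimensional vectors are toy stand-ins for the 384-dimensional embeddings, and `cosine_similarity` is a hypothetical helper rather than a function from the libraries used in this notebook.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors (assumed for illustration only)
query = [1.0, 0.0, 1.0]
related = [2.0, 0.0, 2.0]      # same direction as the query, different length
unrelated = [0.0, 3.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, related))
print(cosine_similarity(query, unrelated))
```

The first score is close to 1 because cosine similarity ignores vector length and only compares direction; the second is 0 because the vectors share no components.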


## Create ChromaDB Vector Store

In [7]:
# Set up ChromaDB knowledge base for study materials
persist_directory = "./chroma_db"

# Create or load ChromaDB vector store for educational content
if 'documents' in locals() and documents:
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
        collection_name="software_engineering_knowledge_base"
    )

    # Persist the database (recent Chroma versions persist automatically
    # when persist_directory is set, so this call is deprecated there)
    vectorstore.persist()

    print(f"Software Engineering Knowledge Base created with {len(documents)} chunks")
    print(f"Database persisted to: {persist_directory}")
    print("Your study materials are now ready for questions!")

    # Sanity check, kept inside the branch so it never runs without a vectorstore
    print("Loaded vectorstore document metadata samples:")
    results = vectorstore.similarity_search("test", k=3)  # just a quick search to get some docs

    for i, doc in enumerate(results, 1):
        print(f"Document {i} metadata: {doc.metadata}")
        print(f"Content preview: {doc.page_content[:150]}...\n")
else:
    print("No study materials available for knowledge base creation")

Software Engineering Knowledge Base created with 124 chunks
Database persisted to: ./chroma_db
Your study materials are now ready for questions!
Loaded vectorstore document metadata samples:
Document 1 metadata: {'chunk_id': 11, 'content_type': 'lecture_notes', 'chunk_size': 900, 'source': 'Lecture#7.pdf'}
Content preview: Cost Estimation Process
Errors
Effort
Development Time
Size Table
Lines of Code
Number of Use Case
Function Point
Estimation Process
Number of Personn...

Document 2 metadata: {'chunk_size': 813, 'source': 'Lecture#7.pdf', 'chunk_id': 1, 'content_type': 'lecture_notes'}
Content preview: --- Lecture Page 4 ---

Properties of Valid Software Size Measurement
●Three properties for any valid measure of software size:
Nonnegativity: All sy...

Document 3 metadata: {'source': 'GreedyAlgorithms.pdf', 'chunk_size': 868, 'chunk_id': 9, 'content_type': 'lecture_notes'}
Content preview: --- Lecture Page 19 ---

1
2
3
4
5
6
7
8
9
10
11
0     1      2      3      4     5      6  


## Create LangChain Retriever

In [8]:
# Create a retriever for study materials
if 'vectorstore' in locals():
    retriever = vectorstore.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={
            "score_threshold": 0.4,  # minimum relevance score (0 to 1, higher is more similar)
            "k": 5  # upper bound on the number of chunks returned
        }
    )

    print("Study Materials Retriever created")
    print("Search type: similarity_score_threshold (minimum score 0.4)")
    print("Maximum chunks retrieved per query: 5")

    # Test the retriever with a typical student question
    test_query = "What is Function point?"
    retrieved_docs = retriever.invoke(test_query)  # replaces the deprecated get_relevant_documents
    print(f"\nTest retrieval for '{test_query}':")
    print(f"Retrieved {len(retrieved_docs)} relevant study materials")
    if retrieved_docs:
        print(f"First retrieved content preview:\n{retrieved_docs[0].page_content[:200]}...")
        print(f"Source: {retrieved_docs[0].metadata.get('source', 'Unknown')}")
else:
    print("Knowledge base not available for retriever creation")


Study Materials Retriever created
Search type: similarity_score_threshold (minimum score 0.4)
Maximum chunks retrieved per query: 5

Test retrieval for 'What is Function point?':
Retrieved 1 relevant study materials
First retrieved content preview:
--- Lecture Page 27 ---

Function Points Calculation
STEP 1: Measure size in terms of the amount of functionality in a system.
Function points are computed by first calculating an unadjusted function
...
Source: Lecture#7.pdf
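
The test query returned a single chunk even though `k` is 5. That is the expected effect of a score threshold: `k` only caps the result count, and chunks scoring below the threshold are dropped. The sketch below mimics that behaviour in plain Python; `filter_by_score` is a hypothetical helper, and the scores are invented for illustration, not values produced by the notebook.

```python
from typing import List, Tuple

def filter_by_score(
    scored_chunks: List[Tuple[str, float]],
    score_threshold: float = 0.4,
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Keep chunks whose relevance score meets the threshold, best first, at most k."""
    kept = [item for item in scored_chunks if item[1] >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]

# Invented (chunk title, relevance score) pairs; higher means more similar
scored = [
    ("Cost Estimation Process", 0.35),
    ("Function Points Calculation", 0.62),
    ("Greedy scheduling", 0.08),
]

print(filter_by_score(scored))
```

Lowering `score_threshold` returns more (but weaker) context, while raising it can leave the LLM with no context at all, so the value is worth tuning against a handful of real questions.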


## Initialize Gemini Pro with API Key

In [9]:
# Initialize Gemini Pro LLM with API key
try:
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        google_api_key=GEMINI_API_KEY,
        temperature=0.3,
        max_output_tokens=1024
    )
    
    print("Gemini Pro LLM initialized with API key")
    print(f"Model: gemini-1.5-flash")
    print(f"Temperature: 0.3")
    print(f"Max output tokens: 1024")
    
except Exception as e:
    print(f"Error initializing Gemini Pro: {e}")
    print("Please ensure you have:")
    print("1. Valid Gemini API key")
    print("2. Correct API key format")
    print("3. Get your key from: https://makersuite.google.com/app/apikey")

Gemini Pro LLM initialized with API key
Model: gemini-1.5-flash
Temperature: 0.3
Max output tokens: 1024


## Create RAG Chain with Custom Prompt

In [10]:
# Define a custom prompt template for educational assistance
prompt_template = PromptTemplate(
    template="""
You are an AI Study Assistant for Software Engineering students. Your role is to help students understand concepts, solve problems, and prepare for exams using their course materials.

Instructions:
- Provide clear, detailed explanations suitable for students
- Include examples when helpful for understanding
- Reference the source materials when possible
- For previous year questions, provide step-by-step solutions
- If you need to make assumptions, state them clearly
- If the context doesn't contain enough information, say so and suggest what additional materials might help

Context from Study Materials:
{context}

Student Question: {question}

Study Assistant Response:""",
    input_variables=["context", "question"]
)

# Create the Study Assistant RAG chain
if 'llm' in locals() and 'retriever' in locals():
    rag_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt_template}
    )
    
    print("Software Engineering Study Assistant created successfully")
    print("Chain type: stuff (combines all retrieved study materials)")
    print("Returns source documents: Yes")
    print("Ready to help with your studies!")
else:
    print("Cannot create Study Assistant - missing LLM or retriever")

Software Engineering Study Assistant created successfully
Chain type: stuff (combines all retrieved study materials)
Returns source documents: Yes
Ready to help with your studies!
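
The "stuff" chain type places all retrieved chunks into a single prompt. As a rough illustration of that idea only (not LangChain's actual implementation), the sketch below joins chunk texts and fills a shortened version of the template above. `build_stuffed_prompt`, the `separator` default, and the sample chunks are assumptions made for this example.

```python
from typing import List

# Shortened version of the notebook's prompt template
TEMPLATE = """Context from Study Materials:
{context}

Student Question: {question}

Study Assistant Response:"""

def build_stuffed_prompt(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks and insert them into one prompt."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)

# Made-up chunk texts standing in for retrieved documents
retrieved = [
    "Function points measure size in terms of functionality.",
    "The unadjusted function point count is computed first.",
]
prompt = build_stuffed_prompt(retrieved, "What are function points?")
print(prompt)
```

Since every chunk goes into one prompt, the total prompt length grows with `k` and `chunk_size`; that trade-off is the main reason to keep both modest.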


## Test the RAG Pipeline

In [11]:
def ask_study_question(question: str):
    """
    Ask a study-related question using the RAG pipeline
    """
    # rag_chain is created at notebook (global) scope, so check globals() only
    if 'rag_chain' not in globals():
        print("Study Assistant not available")
        return
    
    try:
        # Get response from RAG chain (invoke replaces the deprecated direct call)
        result = rag_chain.invoke({"query": question})
        
        print(f"Student Question: {question}")
        print(f"\nStudy Assistant Answer:\n{result['result']}")
        
        # Show source materials referenced
        print(f"\nSource Materials Referenced ({len(result['source_documents'])})")
        print("-" * 50)
        for i, doc in enumerate(result['source_documents'], 1):
            source = doc.metadata.get('source', 'Unknown')
            chunk_id = doc.metadata.get('chunk_id', 'N/A')
            print(f"\nSource {i}: {source} (Chunk {chunk_id})")
            print(f"Content Preview: {doc.page_content[:150]}...")
            
    except Exception as e:
        print(f"Error during study query: {e}")

# Test the Study Assistant with a sample question
if 'rag_chain' in locals():
    # Test with a typical software engineering question
    ask_study_question("What are function points?")
else:
    print("Study Assistant not ready for testing")


Student Question: What are function points?

Study Assistant Answer:
Function points are a unit of measurement used to estimate the size of a software system based on its functionality, rather than lines of code or other implementation-specific metrics.  This is a crucial difference because it allows for a more abstract and technology-independent assessment of project size.  As described on Lecture Page 27, the process begins by calculating an *unadjusted function point count (UFC)*.

The UFC is determined by counting instances within five key categories:

1. **External Inputs:** These are data items provided by the user that trigger specific actions within the system.  Think of them as distinct pieces of information the system receives from the user. Examples include file names entered by a user, menu selections, or parameters entered into a form.  It's important to note that it's the *distinct* data items that are counted, not individual fields within a form, for instance.  If a user

## Interactive Q&A Session

In [12]:

# Ask your own study question here
your_question = "What is the COCOMO model and how to estimate the cost of a software?"  # Modify this

if 'rag_chain' in locals():
    print(f"\nAsking: {your_question}")
    print("=" * 60)
    ask_study_question(your_question)
else:
    print("\nStudy Assistant not ready. Please ensure all previous cells ran successfully.")


Asking: What is the COCOMO model and how to estimate the cost of a software?
Student Question: What is the COCOMO model and how to estimate the cost of a software?

Study Assistant Answer:
The COCOMO (Constructive Cost Model) is a family of regression models used for estimating the effort and time required to develop a software project.  It's not a single model, but rather a set of models with increasing complexity: Basic, Intermediate, and Detailed (or Advanced).  The choice of model depends on the level of accuracy needed and the amount of information available.

Let's break down each model and how to use them for cost estimation:

**1. Basic COCOMO:**

This is the simplest model. It estimates effort based solely on the estimated size of the software (in thousands of lines of code, KLOC).  The formula is:

*Effort = a * KLOC<sup>b</sup>*

Where:

* **Effort:**  The estimated effort in person-months.
* **a and b:**  Coefficients that depend on the software development mode (Organic, 

## Software Engineering Study Assistant Summary

This notebook implements a study assistant for software engineering students, built from:

1. **PyMuPDF** - Extracts text from lecture notes, textbooks, and previous year papers
2. **RecursiveCharacterTextSplitter** - Creates intelligent chunks for better knowledge retrieval
3. **Sentence Transformers** (all-MiniLM-L6-v2) - Semantic understanding of technical concepts
4. **ChromaDB** - Fast search across your entire study material collection
5. **LangChain Retriever** - Finds most relevant content for your questions
6. **Gemini** (`gemini-1.5-flash`) - Generates explanations grounded in the retrieved context, with source references

**Perfect for:**
- Understanding complex software engineering concepts
- Solving previous year exam questions step-by-step
- Getting quick explanations with proper source references
- Preparing for exams with comprehensive study assistance
- Clarifying doubts from multiple lecture sources

**To get started:**
1. Get your free Gemini API key from [Google AI Studio](https://makersuite.google.com/app/apikey)
2. Replace `your-gemini-api-key-here` with your actual API key
3. Add your lecture notes PDFs to the `pdf_paths` list in the PDF extraction cell
4. Run all cells to build your knowledge base
5. Start asking questions about your course materials!

**Pro Tips:**
- Add all your course PDFs for comprehensive coverage
- Ask specific questions for better answers
- Use the source references to dive deeper into topics
- Perfect for exam preparation and assignment help