# Level 1: RAG System Pseudocode

This notebook presents the core functionality of a Retrieval-Augmented Generation (RAG) system for PDF documents in simple pseudocode. The purpose is to help beginners understand the logical flow without worrying about syntax details.

---

## What is RAG (Retrieval-Augmented Generation)?

RAG is an approach that pairs a large language model with a retrieval step: before the model answers, the system looks up relevant passages in a knowledge base and includes them in the prompt. In this notebook, the knowledge base consists of PDF documents.

## Core Components and Flow

## 1. PDF Processing

The first step is extracting and processing text from PDF documents.

```
FUNCTION ProcessPDF(pdf_path)
    // Extract text from PDF while keeping track of page numbers
    text_with_pages = []
    
    pdf_document = Open PDF file at pdf_path
    FOR EACH page IN pdf_document
        page_text = Extract text from page
        page_number = current page number
        
        IF page_text is not empty
            Add (page_text, page_number) to text_with_pages
        END IF
    END FOR
    
    // Split text into manageable chunks
    chunks = []
    
    FOR EACH (text, page_number) IN text_with_pages
        text_chunks = Split text into smaller pieces
        
        FOR EACH chunk IN text_chunks
            Create dictionary with:
                "content" = chunk
                "metadata" = {"page": page_number}
            
            Add dictionary to chunks
        END FOR
    END FOR
    
    RETURN chunks
END FUNCTION
```

This process turns each page into smaller, manageable chunks and records the page number on every chunk, so an answer can later cite the page it came from. Because splitting happens page by page, no chunk spans a page boundary.
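As a concrete illustration of the chunking half of `ProcessPDF`, here is a minimal Python sketch. It assumes the page text has already been extracted and arrives as `(text, page_number)` pairs. The names `split_text` and `chunk_pages`, the character-based window, and the `chunk_size` and `overlap` defaults are illustrative assumptions, not something the pseudocode prescribes.

```python
from typing import Dict, List, Tuple


def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            pieces.append(piece)
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return pieces


def chunk_pages(pages: List[Tuple[str, int]],
                chunk_size: int = 200,
                overlap: int = 50) -> List[Dict]:
    """Turn (page_text, page_number) pairs into chunk dictionaries."""
    chunks = []
    for text, page_number in pages:
        if not text.strip():
            continue  # skip empty pages, as the pseudocode does
        for piece in split_text(text, chunk_size, overlap):
            chunks.append({"content": piece, "metadata": {"page": page_number}})
    return chunks
```

The overlap is one possible design choice: it reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary, at the cost of storing some text twice.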

## 2. Embedding Generation

Next, we need to create vector representations (embeddings) for each chunk of text.

```
FUNCTION GenerateEmbeddings(chunks)
    // Extract just the text content from each chunk
    texts = []
    FOR EACH chunk IN chunks
        Add chunk["content"] to texts
    END FOR
    
    // Generate embeddings (vector representations)
    embeddings = []
    
    IF using_local_model
        embeddings = Generate embeddings using local model
    ELSE
        TRY
            embeddings = Generate embeddings using OpenAI API
        CATCH API error
            embeddings = Fall back to local embedding model
        END TRY
    END IF
    
    // Add embeddings to the original chunks
    FOR i = 0 TO length(chunks) - 1
        chunks[i]["embedding"] = embeddings[i]
    END FOR
    
    RETURN chunks
END FUNCTION
```

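Embeddings convert text into numerical vectors that capture semantic meaning, so texts with similar meaning end up close together and can be found with vector operations such as cosine similarity. Vectors produced by different embedding models are not comparable, so if the fallback path is taken for the chunks, the same model must also be used for the question at query time.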
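To make "vector operations" concrete, the sketch below compares two texts with cosine similarity. The `toy_embed` function is a deliberately crude, hypothetical stand-in for a real embedding model: it only counts hashed words, so it captures word overlap rather than meaning. It is here only so the example runs without any external model or API.

```python
import hashlib
import math
import re
from typing import List


def toy_embed(text: str, dims: int = 64) -> List[float]:
    """Toy embedding (assumption for illustration): hash each word into a bucket and count it."""
    vector = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dims
        vector[bucket] += 1.0
    return vector


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


chunk_vector = toy_embed("The Earth orbits the sun once every 365 days.")
question_vector = toy_embed("How long does it take Earth to orbit the sun?")
similarity = cosine_similarity(chunk_vector, question_vector)
```

With a real embedding model the vectors would have hundreds or thousands of dimensions, but the comparison step works the same way.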

## 3. Vector Storage

Once we have chunks with embeddings, we store them in a vector database for efficient retrieval.

```
FUNCTION StoreChunks(embedded_chunks, pdf_path, project_id)
    // Create a unique collection name based on the project ID
    collection_name = "project_" + project_id
    
    // Get or create the collection in the vector database
    IF collection named collection_name exists in database
        collection = Get existing collection named collection_name
    ELSE
        collection = Create new collection named collection_name
    END IF
    
    // Prepare data for insertion
    ids = []
    documents = []
    metadatas = []
    embeddings = []
    
    FOR i = 0 TO length(embedded_chunks) - 1
        chunk = embedded_chunks[i]
        
        Generate unique ID based on project_id, pdf_path and i
        Add ID to ids
        
        Add chunk["content"] to documents
        
        metadata = Copy of chunk["metadata"]  // Copy so the input chunk is not modified
        metadata["pdf_id"] = pdf_path  // Store source PDF info
        Add metadata to metadatas
        
        Add chunk["embedding"] to embeddings
    END FOR
    
    // Add all data to the collection
    collection.add(ids, documents, metadatas, embeddings)
END FUNCTION
```

Giving each project its own collection keeps searches scoped to that project's documents, and the stored metadata lets us trace every chunk back to its source PDF and page.
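The storage step can be sketched in Python with a plain dictionary standing in for the vector database. Everything here is a hypothetical illustration: `DATABASE`, `make_chunk_id`, and the hash-based ID scheme are assumptions, and a real vector database would supply its own get-or-create and insert calls.

```python
import hashlib
from typing import Dict, List

# Hypothetical in-memory stand-in for a vector database:
# collection name -> {chunk id -> stored record}
DATABASE: Dict[str, Dict[str, dict]] = {}


def make_chunk_id(project_id: str, pdf_path: str, index: int) -> str:
    """Derive a stable ID so re-indexing the same PDF overwrites instead of duplicating."""
    digest = hashlib.sha256(f"{project_id}:{pdf_path}".encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{index}"


def store_chunks(embedded_chunks: List[dict], pdf_path: str, project_id: str) -> int:
    """Store chunks in the project's collection and return the collection size."""
    # setdefault gives "get or create" behaviour for the collection
    collection = DATABASE.setdefault(f"project_{project_id}", {})
    for i, chunk in enumerate(embedded_chunks):
        metadata = dict(chunk["metadata"])  # copy so the caller's chunk is not mutated
        metadata["pdf_id"] = pdf_path       # remember which PDF the chunk came from
        collection[make_chunk_id(project_id, pdf_path, i)] = {
            "document": chunk["content"],
            "metadata": metadata,
            "embedding": chunk["embedding"],
        }
    return len(collection)
```

Deriving the ID from the project, the PDF path, and the chunk index means that storing the same PDF twice replaces the earlier records rather than adding duplicates.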

## 4. Query and Retrieval

When a question is asked, we find the most relevant chunks of text that might contain the answer.

```
FUNCTION RetrieveRelevantChunks(question, project_id)
    // Generate embedding for the question (use the same model that embedded the chunks)
    question_embedding = Generate embedding for question
    
    // Get the collection for this project
    collection_name = "project_" + project_id
    collection = Get collection named collection_name from database
    
    // Search for similar chunks
    results = collection.query(
        query_embeddings = [question_embedding],
        n_results = number of results to return (e.g., 5)
    )
    
    // Format results into a usable structure
    chunks = []
    FOR i = 0 TO length(results["ids"][0]) - 1
        chunk = {
            "content": results["documents"][0][i],
            "metadata": results["metadatas"][0][i]
        }
        Add chunk to chunks
    END FOR
    
    RETURN chunks
END FUNCTION
```

This semantic search finds text chunks that are conceptually related to the question, not just keyword matches.
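Under the hood, a similarity query amounts to scoring every stored chunk against the question vector and keeping the best few. The sketch below shows that idea with a brute-force scan. The record layout and the function names are assumptions for illustration; real vector databases typically use approximate indexes instead of scanning every record.

```python
import heapq
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def retrieve_relevant_chunks(question_embedding: Sequence[float],
                             records: List[Dict],
                             n_results: int = 3) -> List[Dict]:
    """Return the n_results records most similar to the question, best first."""
    scored = [
        (cosine_similarity(question_embedding, record["embedding"]), record)
        for record in records
    ]
    # Rank by score only; the key avoids comparing the record dictionaries
    top = heapq.nlargest(n_results, scored, key=lambda pair: pair[0])
    return [
        {"content": record["content"], "metadata": record["metadata"], "score": score}
        for score, record in top
    ]
```

Returning the score alongside each chunk is optional, but it makes it easy to drop weak matches before they reach the language model.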

## 5. Answer Generation

Finally, we use the retrieved chunks and the original question to generate a helpful answer.

```
FUNCTION GenerateAnswer(question, context_chunks)
    // Build context string with citation information
    context_texts = []
    
    FOR EACH chunk IN context_chunks
        content = chunk["content"]
        pdf_id = chunk["metadata"]["pdf_id"]
        page = chunk["metadata"]["page"]
        
        formatted_chunk = "[Document: " + pdf_id + ", Page " + ToString(page) + "]: " + content
        Add formatted_chunk to context_texts
    END FOR
    
    context = Join context_texts with newlines
    
    // Build prompt for the language model
    prompt = "Answer the following question based ONLY on the provided context:
    
    CONTEXT:
    " + context + "
    
    QUESTION: " + question + "
    
    INSTRUCTIONS:
    1. Answer the question using ONLY information from the provided context.
    2. If the context doesn't contain the information needed, respond with 'I cannot answer this question based on the provided documents.'
    3. Cite the specific documents and page numbers (e.g., [Document: doc.pdf, Page X]).
    4. Be concise and accurate.
    
    ANSWER:"
    
    // Call language model API to generate answer
    answer = Send prompt to language model and get response
    
    // Extract citations from the answer
    citations = Extract all citation references from answer using pattern matching
    
    RETURN (answer, citations)
END FUNCTION
```

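The prompt instructs the model to ground its answer in the retrieved content and to cite its sources. It cannot guarantee that the model complies, which is why the citations are extracted afterwards and returned alongside the answer, where they can be checked against the retrieved chunks.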
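The two purely mechanical parts of this step, building the prompt and extracting citations, can be sketched in Python without calling any model. The function names and the regular expression are illustrative assumptions. The pattern matches the `[Document: name, Page N]` format used in the prompt above and would need adjusting for file names that contain commas or brackets.

```python
import re
from typing import Dict, List, Tuple

# Matches citations such as "[Document: doc.pdf, Page 4]"
CITATION_PATTERN = re.compile(r"\[Document: ([^,\]]+), Page (\d+)\]")


def build_prompt(question: str, context_chunks: List[Dict]) -> str:
    """Assemble the grounded prompt from the question and the retrieved chunks."""
    context = "\n".join(
        f"[Document: {c['metadata']['pdf_id']}, Page {c['metadata']['page']}]: {c['content']}"
        for c in context_chunks
    )
    lines = [
        "Answer the following question based ONLY on the provided context:",
        "",
        "CONTEXT:",
        context,
        "",
        f"QUESTION: {question}",
        "",
        "ANSWER:",
    ]
    return "\n".join(lines)


def extract_citations(answer: str) -> List[Tuple[str, int]]:
    """Return unique (document, page) pairs in the order they first appear."""
    citations: List[Tuple[str, int]] = []
    for document, page in CITATION_PATTERN.findall(answer):
        citation = (document.strip(), int(page))
        if citation not in citations:
            citations.append(citation)
    return citations
```

For brevity the sketch leaves out the numbered instructions from the pseudocode prompt; they would be added as further lines before `ANSWER:`.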

## 6. Complete RAG Pipeline

This brings all the pieces together into a complete system.

```
FUNCTION RAGPipeline(pdf_path, question, project_id)
    // First, process and index the PDF if not already done
    IF pdf is not already indexed in project
        chunks = ProcessPDF(pdf_path)
        embedded_chunks = GenerateEmbeddings(chunks)
        StoreChunks(embedded_chunks, pdf_path, project_id)
    END IF
    
    // Now answer the question
    relevant_chunks = RetrieveRelevantChunks(question, project_id)
    (answer, citations) = GenerateAnswer(question, relevant_chunks)
    
    // Return the answer with its citations
    RETURN (answer, citations)
END FUNCTION
```
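The "index once, query many times" logic of the pipeline can be sketched as a small Python class. The three callables passed to the constructor are hypothetical placeholders for the steps described earlier, and the in-memory set of indexed PDFs is an assumption for illustration: it is lost when the program exits, so a real system would more likely ask the vector database whether the PDF is already stored.

```python
from typing import Callable, List, Set, Tuple


class RagPipeline:
    """Ties indexing, retrieval, and answer generation together."""

    def __init__(self,
                 index_pdf: Callable[[str, str], None],
                 retrieve: Callable[[str, str], List[dict]],
                 generate: Callable[[str, List[dict]], Tuple[str, list]]):
        self.index_pdf = index_pdf    # (pdf_path, project_id): process, embed, store
        self.retrieve = retrieve      # (question, project_id) -> relevant chunks
        self.generate = generate      # (question, chunks) -> (answer, citations)
        self._indexed: Set[Tuple[str, str]] = set()

    def ask(self, pdf_path: str, question: str, project_id: str) -> Tuple[str, list]:
        key = (project_id, pdf_path)
        if key not in self._indexed:
            # Indexing is the expensive part, so do it only once per PDF and project
            self.index_pdf(pdf_path, project_id)
            self._indexed.add(key)
        relevant_chunks = self.retrieve(question, project_id)
        return self.generate(question, relevant_chunks)
```

Passing the steps in as callables keeps the orchestration separate from the choice of PDF parser, embedding model, vector database, and language model.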

## Example Scenario

Let's walk through a simple example:

1. **Input**: 
   - PDF: "Solar_System.pdf"
   - Question: "How long does it take Earth to orbit the sun?"
   - Project ID: "astronomy"

2. **Processing**:
   - PDF is processed into chunks
   - Chunk #37 contains: "The Earth orbits the sun once every 365 days, which we call a year."
   - Each chunk gets an embedding vector
   - All chunks are stored in the "project_astronomy" collection

3. **Query**:
   - Question embedding is compared to all chunk embeddings
   - Chunk #37 is found to be highly relevant to the question
   - The chunk is retrieved with its metadata (page 4)

4. **Answer Generation**:
   - LLM is given the question and relevant chunk
   - Answer: "Earth takes 365 days to complete one orbit around the sun. [Document: Solar_System.pdf, Page 4]"

This demonstrates how the system can find and use specific information from documents to answer questions with proper citation.