## Create and run a local RAG pipeline

We will run this pipeline in Google Colab because it provides GPU runtimes for running the model.

RAG stands for retrieval-augmented generation.

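RAG can improve an LLM's answers by retrieving passages relevant to a query from a specific set of documents and supplying them to the model as context. The model itself is not retrained.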
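To make the "augmented" part concrete, here is a minimal sketch of how retrieved passages might be placed into a prompt ahead of the question. The function name `build_prompt`, the prompt wording, and the example passages are all hypothetical and only illustrate the idea; they are not taken from the rest of this notebook.

```python
def build_prompt(query: str, passages: list[str]) -> str:
    """Combine retrieved passages and the user query into one prompt string."""
    # Number each passage so the model (and the reader) can refer back to it.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical passages standing in for text retrieved from the textbook.
example_passages = [
    "A pointer is a variable that stores a memory address.",
    "The & operator yields the address of its operand.",
]
prompt = build_prompt("What is a pointer in C?", example_passages)
print(prompt)
```

The LLM only ever sees this final string, so the quality of the answer depends heavily on which passages were retrieved.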

This RAG pipeline will parse the 2008 textbook C Programming: A Modern Approach (2nd edition) by K. N. King.

Steps:

1. Open the PDF
2. Format the text of the PDF so it is ready for the embedding model
3. Embed all the chunks of text in the textbook, turning them into numerical representations (embeddings)
4. Build a retrieval system that uses vector search to find the relevant chunks of text for a query
5. Create a prompt that incorporates the retrieved pieces of text
6. Generate an answer to the query with an LLM, based on the retrieved passages
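Steps 3 and 4 can be sketched with a toy example. This is a sketch under stated assumptions: the `embed` function below uses plain word counts as a stand-in for a real embedding model, and the names `embed`, `cosine_similarity`, and `retrieve`, as well as the sample chunks, are hypothetical. A real pipeline would swap in dense vectors from a neural embedding model, but the ranking logic has the same shape.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks ranked by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical chunks standing in for passages from the textbook.
chunks = [
    "a pointer stores the address of a variable",
    "the for statement is used to write loops",
    "arrays hold a fixed number of elements",
]
top_chunks = retrieve("how do loops work", chunks, top_k=1)
print(top_chunks)
```

Word counts only match exact words, which is why real systems prefer learned embeddings that place related phrases close together even when they share no vocabulary.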



In [None]:
import os
import requests

# get the pdf from the path 
pdf_path = "ctextbook.pdf"

# download if not existing
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, attempting download...")

    #URL of PDF
    URL = "https://dn790000.ca.archive.org/0/items/c-programming-a-modern-approach-2nd-ed-c-89-c-99-king-by/C%20Programming%20-%20A%20Modern%20Approach%20-%202nd_Ed%28C89%2C%20c99%29%20-%20King%20by%20_text.pdf"
    # Download the file
    response = requests.get(URL)
    
    # Check if download was successful
    if response.status_code == 200:
        # Write content to file
        with open(pdf_path, "wb") as f:
            f.write(response.content)
        print(f"[INFO] Successfully downloaded {pdf_path}")
    else:
        print(f"[ERROR] Failed to download file. Status code: {response.status_code}")
else:
    print(f"[INFO] File {pdf_path} already exists.")

[INFO] File doesn't exist, attempting download...
[INFO] Successfully downloaded ctextbook.pdf
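A 200 status code does not by itself guarantee that the saved file is a PDF, since a server can return an HTML page with the same status. One optional sanity check, sketched below, is to look at the first bytes of the file: PDF files normally begin with the signature `%PDF-`. The helper name `looks_like_pdf` is hypothetical, and the temporary files exist only to make the demonstration self-contained.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the usual PDF signature '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate on two small temporary files instead of the real download.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.pdf")
    bad_path = os.path.join(tmp_dir, "bad.pdf")
    with open(good_path, "wb") as f:
        f.write(b"%PDF-1.4\n")
    with open(bad_path, "wb") as f:
        f.write(b"<html>Not found</html>")
    good_result = looks_like_pdf(good_path)
    bad_result = looks_like_pdf(bad_path)

print(good_result, bad_result)
```

In this notebook the same check could be run as `looks_like_pdf(pdf_path)` before opening the file.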


In [None]:
# fitz is the import name of the PyMuPDF package (pip install pymupdf)
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """
    Format text from PDF for processing while preserving important structure.
    Cleans text but maintains chapter titles and important formatting.
    """
    # Remove excessive whitespace and normalize line breaks
    cleaned_text = text.replace("\n", " ").strip()
    
    # Remove multiple spaces
    cleaned_text = " ".join(cleaned_text.split())
    
    # Preserve chapter markers and section headers
    # Look for patterns like "Chapter X" or "Section X.Y"
    import re
    
    # Add line breaks before chapter/section headers for better parsing
    cleaned_text = re.sub(r'(Chapter\s+\d+)', r'\n\1', cleaned_text)
    cleaned_text = re.sub(r'(Section\s+\d+\.?\d*)', r'\n\1', cleaned_text)
    
    # Clean up any double line breaks
    cleaned_text = re.sub(r'\n\s*\n', '\n', cleaned_text)
    
    return cleaned_text.strip()

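def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Open the PDF and return one dict per page with its index and cleaned text.
    """
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc), total=len(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Store the zero-based page index alongside the cleaned text
        pages_and_texts.append({"page_number": page_number, "text": text})
    return pages_and_texts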
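Once each page's text has been cleaned, it needs to be split into chunks before embedding. The following is a rough sketch of one common approach, grouping consecutive sentences into fixed-size chunks. The function names, the regex-based sentence splitter, and the chunk size are assumptions made for illustration; a dedicated sentence splitter would handle abbreviations and code listings more reliably.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


def chunk_sentences(sentences: list[str], chunk_size: int = 10) -> list[str]:
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


# Hypothetical page text standing in for one cleaned page of the textbook.
page_text = (
    "C is a general-purpose language. It was developed in the 1970s. "
    "Pointers are central to C. Arrays and pointers are closely related. "
    "Strings are arrays of characters."
)
sentences = split_into_sentences(page_text)
chunks = chunk_sentences(sentences, chunk_size=2)
for chunk in chunks:
    print(chunk)
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context, so the chunk size is worth tuning against real queries.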




