## Create and run a local RAG pipeline

We will use Google Colab to run this pipeline because it provides GPUs for running the model.

RAG stands for retrieval-augmented generation.

RAG can improve an LLM's answers by retrieving passages from a specific source and passing them to the model as context, rather than retraining the model on that source.

This RAG pipeline will parse the textbook C Programming: A Modern Approach (2nd edition, 2008) by K. N. King.

Steps:

1. Open the PDF.
2. Format the text of the PDF so it is ready for the embedding model.
3. Embed all the chunks of text in the textbook, turning them into numerical representations (embeddings).
4. Build a retrieval system that uses vector search to find the chunks of text relevant to a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to the query with an LLM, based on the retrieved passages.
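
As a rough illustration of steps 3 to 5, here is a minimal sketch that uses only the standard library. The `embed` function is a toy word-count stand-in for a real embedding model, and the names `embed`, `cosine_similarity`, `retrieve`, `build_prompt`, and the three sample chunks are all hypothetical; the actual pipeline would use a trained embedding model and an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(query_vec, embed(c)),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical chunks, written for illustration only
chunks = [
    "The scanf function reads input according to a format string.",
    "The strftime function formats a date and time as a string.",
    "Pointers store the address of a variable.",
]
query = "How does scanf read input?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The prompt printed at the end is what step 6 would hand to the LLM.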



In [4]:
import os
import requests

# Local path where the PDF is stored
pdf_path = "ctextbook.pdf"

# Download the PDF if it isn't already on disk
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, attempting download...")

    # URL of the PDF
    URL = "https://dn790000.ca.archive.org/0/items/c-programming-a-modern-approach-2nd-ed-c-89-c-99-king-by/C%20Programming%20-%20A%20Modern%20Approach%20-%202nd_Ed%28C89%2C%20c99%29%20-%20King%20by%20_text.pdf"
    # Download the file
    response = requests.get(URL)
    
    # Check if download was successful
    if response.status_code == 200:
        # Write content to file
        with open(pdf_path, "wb") as f:
            f.write(response.content)
        print(f"[INFO] Successfully downloaded {pdf_path}")
    else:
        print(f"[ERROR] Failed to download file. Status code: {response.status_code}")
else: 
    print(f"File {pdf_path} exists already.")

File ctextbook.pdf exists already.


In [5]:
# Import PyMuPDF (exposed as the `fitz` module) to read the PDF
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """
    Format text from PDF for processing while preserving important structure.
    Cleans text but maintains chapter titles and important formatting.
    """
    # Remove excessive whitespace and normalize line breaks
    cleaned_text = text.replace("\n", " ").strip()
    
    # Remove multiple spaces
    cleaned_text = " ".join(cleaned_text.split())
    
    # Preserve chapter markers and section headers
    # Look for patterns like "Chapter X" or "Section X.Y"
    import re
    
    # Add line breaks before chapter/section headers for better parsing
    cleaned_text = re.sub(r'(Chapter\s+\d+)', r'\n\1', cleaned_text)
    cleaned_text = re.sub(r'(Section\s+\d+\.?\d*)', r'\n\1', cleaned_text)
    
    # Clean up any double line breaks
    cleaned_text = re.sub(r'\n\s*\n', '\n', cleaned_text)
    
    return cleaned_text.strip()

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
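        # Subtract 25 so the front matter gets negative numbers and body pages
        # land close to the book's printed page numbers
        pages_and_texts.append({"page_number": page_number - 25,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text)/4,  # rough estimate: 1 token ~ 4 characters
                                "text": text})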
    return pages_and_texts
    
pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


[{'page_number': -25,
  'page_char_count': 65,
  'page_word_count': 11,
  'page_sentence_count_raw': 1,
  'page_token_count': 16.25,
  'text': 'K.N.KING Covers both C89 and C99 A Modern Approach SECOND EDITION'},
 {'page_number': -24,
  'page_char_count': 2019,
  'page_word_count': 317,
  'page_sentence_count_raw': 13,
  'page_token_count': 504.75,
  'text': 'K.N.KING The first, edition of C Pivgmmimm/: A Modem Approach was a hit with students and faculty alike because of its clarify arid comprehensiveness as well as its trademark Q&A sections. King’s spiral approach made the first edition accessible to n broad range of readers, from beginners to more advanced st udents. The first edition was used at over 225 colleges, making it. one of the leading C textbooks of the last ten years. FEATURES OF THE SECOND EDITION Complete coverage of both the CS9 standard and the C99 standard, with all C99 changes clearly marked Includes a quick reference to all C89 and G99 library functions • Expanded

In [6]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -25 | 65 | 11 | 1 | 16.25 | K.N.KING Covers both C89 and C99 A Modern Appr... |
| 1 | -24 | 2019 | 317 | 13 | 504.75 | K.N.KING The first, edition of C Pivgmmimm/: A... |
| 2 | -23 | 1777 | 311 | 13 | 444.25 | PREFACE In computing, turning the obvious into... |
| 3 | -22 | 3041 | 492 | 35 | 760.25 | Includes a quick reference to all C89 and C99 ... |
| 4 | -21 | 2860 | 475 | 36 | 715.0 | Preface xxiii I’ve also taken the opportunity ... |


In [7]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 |
| mean | 389.5 | 1948.82 | 334.8 | 17.04 | 487.2 |
| std | 239.74 | 559.3 | 98.17 | 8.48 | 139.82 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 182.25 | 1638.5 | 279.25 | 12.0 | 409.62 |
| 50% | 389.5 | 1990.0 | 341.5 | 17.0 | 497.5 |
| 75% | 596.75 | 2319.25 | 400.75 | 21.0 | 579.81 |
| max | 804.0 | 3434.0 | 637.0 | 55.0 | 858.5 |


Keep in mind that we cannot pass the whole textbook to a model at once. Embedding models and LLMs both accept only a limited number of tokens, and processing more text than needed is computationally wasteful. The embedding model we are using is trained to embed sequences of up to 384 tokens into numerical space, and the average page is about 487 tokens by the estimate above, so pages need to be split into smaller pieces.
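
To make the limit concrete, here is a small sketch that applies the same 4-characters-per-token estimate to three of the page character counts shown earlier and flags the pages that would not fit in 384 tokens. The names `MAX_TOKENS`, `estimate_tokens`, and `over_limit` are illustrative assumptions, and the estimate is only a rule of thumb; a real tokenizer would give different counts.

```python
MAX_TOKENS = 384  # sequence limit stated for the embedding model


def estimate_tokens(char_count: int, chars_per_token: int = 4) -> float:
    """Rough token estimate: about 4 characters per token."""
    return char_count / chars_per_token


# Character counts taken from the df.head() output above
pages = [
    {"page_number": -25, "page_char_count": 65},
    {"page_number": -24, "page_char_count": 2019},
    {"page_number": -22, "page_char_count": 3041},
]

for page in pages:
    page["page_token_count"] = estimate_tokens(page["page_char_count"])
    page["over_limit"] = page["page_token_count"] > MAX_TOKENS
    print(page)
```

Only the cover page fits. Many embedding models truncate input beyond their limit, so anything past the limit on the longer pages would likely go unrepresented if whole pages were embedded.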

In [8]:
from spacy.lang.en import English 

nlp = English()

# Add a sentencizer pipeline 

nlp.add_pipe("sentencizer")


<spacy.pipeline.sentencizer.Sentencizer at 0x17b22f690>

In [9]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (the default type is spacy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])



In [10]:
import random

random.sample(pages_and_texts, k=1)

[{'page_number': 696,
  'page_char_count': 1887,
  'page_word_count': 338,
  'page_sentence_count_raw': 6,
  'page_token_count': 471.75,
  'text': '26.3 The <time.h> Header: Date and Time 697 Table 26.2 Conversion Specifiers for the strf time Function Conversion Replacement %a Abbreviated weekday name (e.g.. Sun) %A Full weekday name (e.g.. Sunday) %b Abbreviated month name (e.g.. Jun) %B Full month name (e.g., June) %c Complete day and time (e.g.. Sun Jun 3 17:48:34 2007) %Cf Year divided by 100 and truncated to an integer (00-99) %d Day of month (01-31) %D* Equivalent to %m/%d/%y %e\' Day of month (1—31); a single digit is preceded by a space %F7 Equivalent to %Y-%m-%d %g‘ Last two digits of ISO 8601 week-based year (00-99) %G’ ISO 8601 week-based year %h: Equivalent to %b %H Hour on 24-hour clock (00-23) %I Hour on 12-hour clock (01-12) %j Day of year (001-366) %m Month (01-12) %M Minute (00-59) %n; New-line character %p AM/PM designator (AM or PM) %r\' 12-hour clock time (e.g., 05 

In [11]:
df = pd.DataFrame(pages_and_texts)

df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 |
| mean | 389.5 | 1948.82 | 334.8 | 17.04 | 487.2 | 17.15 |
| std | 239.74 | 559.3 | 98.17 | 8.48 | 139.82 | 8.13 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 182.25 | 1638.5 | 279.25 | 12.0 | 409.62 | 11.0 |
| 50% | 389.5 | 1990.0 | 341.5 | 17.0 | 497.5 | 17.0 |
| 75% | 596.75 | 2319.25 | 400.75 | 21.0 | 579.81 | 22.0 |
| max | 804.0 | 3434.0 | 637.0 | 55.0 | 858.5 | 50.0 |
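
The raw and spaCy sentence counts differ partly because `page_sentence_count_raw` splits only on `". "`, so sentences ending in `?` or `!` are not counted separately. The sketch below shows that effect on a made-up string. The regex is an illustrative stand-in for a rule-based splitter, not spaCy's sentencizer, and the sample text is hypothetical.

```python
import re

# Hypothetical text, written for illustration only
text = "Is sprintf portable? Yes! It is part of the C standard. itoa is not."

# Raw approach used for page_sentence_count_raw: split on ". " only
raw_pieces = text.split(". ")

# Illustrative rule: break after ., ! or ? when followed by whitespace
rule_pieces = re.split(r"(?<=[.!?])\s+", text)

print(len(raw_pieces), raw_pieces)
print(len(rule_pieces), rule_pieces)
```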


## Next: chunking our sentences

We will split each page's sentences into groups of 10. Frameworks such as LangChain can do this, but we will do it in plain Python.

In [12]:
# Define split size to turn groups of sentences into chunks

num_sentence_chunk_size = 10

# Create a function that splits a list into sublists of at most slice_size items
# e.g. 20 items -> chunks of [10, 10]; 25 items -> chunks of [10, 10, 5]

def split_list(input_list: list,
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [13]:
# Loop through the pages and split each page's sentences into chunks

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                          slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [15]:
random.sample(pages_and_texts, k=1)

[{'page_number': 700,
  'page_char_count': 3002,
  'page_word_count': 513,
  'page_sentence_count_raw': 23,
  'page_token_count': 750.5,
  'text': 'Q&A 701 A: Some C libraries supply functions with names like itoa that convert numbers to strings. Using these functions isn’t a great idea, though: they aren\'t part of the C standard and won\'t be portable. The best way to perform this kind of conversion is eprintf function >22.8 to call a function such as sprint f that writes formatted output into a siring: char str[20]; int i ; sprintf(str, "%d", i); /* writes i into the string str */ Not only is sprintf portable, but it also provides a great deal of control over the appearance of the number. *Q: The description of the str tod function says that C99 allows the string argu¬ ment to contain a hexadecimal floating-point number, infinity, or NaN. What is the format of these numbers? [p. 684] A: A hexadecimal floating-point number begins with Ox or OX. followed by one or more hexadecimal dig

In [16]:
df = pd.DataFrame(pages_and_texts)

df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 | 830.0 |
| mean | 389.5 | 1948.82 | 334.8 | 17.04 | 487.2 | 17.15 | 2.15 |
| std | 239.74 | 559.3 | 98.17 | 8.48 | 139.82 | 8.13 | 0.85 |
| min | -25.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 182.25 | 1638.5 | 279.25 | 12.0 | 409.62 | 11.0 | 2.0 |
| 50% | 389.5 | 1990.0 | 341.5 | 17.0 | 497.5 | 17.0 | 2.0 |
| 75% | 596.75 | 2319.25 | 400.75 | 21.0 | 579.81 | 22.0 | 3.0 |
| max | 804.0 | 3434.0 | 637.0 | 55.0 | 858.5 | 50.0 | 5.0 |
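
The `num_chunks` column follows directly from the sentence counts: a page with n sentences yields ceil(n / 10) chunks, and the last chunk may hold fewer than 10 sentences. The sketch below checks this for a few sentence counts, including the minimum (0), median (17), and maximum (50) from the table. It redefines a standalone copy of the splitting function so it runs on its own; the name `chunk_sizes` is an illustrative assumption.

```python
import math


def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split a list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size]
            for i in range(0, len(input_list), slice_size)]


# Map each sentence count to the sizes of the chunks it produces
chunk_sizes = {}
for n in (0, 7, 10, 17, 50):
    chunks = split_list(list(range(n)))
    chunk_sizes[n] = [len(chunk) for chunk in chunks]
    print(f"{n} sentences -> {len(chunks)} chunks of sizes {chunk_sizes[n]}")
```

A page with 17 sentences produces one chunk of 10 and one of 7, so chunk lengths vary and short trailing chunks are expected.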


## Split each chunk into its own item

We will embed each chunk of sentences into its own numerical representation. This gives finer granularity and lets us look up the specific text samples the model used.

In [21]:
import re

# Split each chunk into its own item

pages_and_chunks = []

for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences into one paragraph-like string and collapse double spaces
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()

        # Add a space after a period if the next letter is capitalized
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        # Add the joined sentence chunk to the dictionary
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ") if word.strip()])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~ 4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1785
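
The regex step in the cell above matters because the sentence strings are joined with no separator, so adjacent sentences end up glued together. The sketch below reproduces the join and the fix on three made-up sentences; the sample text is hypothetical.

```python
import re

# Hypothetical sentences with no trailing whitespace, as in the joined chunks
sentence_chunk = [
    "The format string is matched left to right.",
    "Ordinary characters must match exactly.",
    "scanf skips white space.",
]

# Joining with an empty string glues sentences together
joined = "".join(sentence_chunk)

# Add a space after a period only when the next character is a capital letter
fixed = re.sub(r'\.([A-Z])', r'. \1', joined)

print(joined)
print(fixed)
```

The third sentence starts with a lowercase letter, so it stays attached to the previous one. The same limitation is visible in the sampled chunk output, where `character.■` has no space after the period. Joining with `" ".join(...)` instead would be one way to avoid the problem, at the cost of changing the character counts.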

In [23]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 44,
  'sentence_chunk': '3.2 The scanf Function 45 Ordinary Characters in Format Strings The concept of pattern-matching can be taken one step further by writing format strings that contain ordinary characters in addition to conversion specifications. The action that scanf takes when it processes an ordinary character in a format string depends on whether or not it’s a white-space character.■ White-space characters. When it encounters one or more consecutive white- space characters in a format string, scanf repeatedly reads white-space char¬ acters from the input until it reaches a non-white-space character (which is "put back"). The number of white-space characters in the format string is irrelevant; one white-space character in the format string will match any num¬ ber of white-space characters in the input. (Incidentally, putting a white-space character in a format string doesn’t force the input to contain white-space characters. A white-space character in a format 