# Demistifying RAG
## And how I prove that writing "Demistifying" is way more difficult than RAG

In [None]:
import pandas as pd
import PyPDF2
import json

In [None]:
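import os

pdf_path = "./resources/DNDPlayersHandbook_Races.pdf"
output_folder = "./output/"
os.makedirs(output_folder, exist_ok=True)  # Make sure the output folder exists before anything is written to it
lines_per_chunk = 10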

## Reading and extracting text
For this step you can use any library or manual method you like. Your source might not even be a PDF: it could be raw text, JSON files, or a database. As long as you end up with raw text you can work with, you are good :)
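
If your source were a plain-text or JSON file instead of a PDF, a loader could look like the sketch below. The helper name `load_raw_text` and the assumption that the JSON file holds a list of strings are hypothetical, not part of this notebook.

```python
import json
from pathlib import Path

def load_raw_text(path):
    """Return the content of a .txt or .json file as one raw string (hypothetical helper)."""
    path = Path(path)
    if path.suffix == ".json":
        # Assumption: the JSON file contains a list of strings, one per record
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        return "\n".join(records)
    # Anything else is treated as plain text
    return path.read_text(encoding="utf-8")
```

Whatever the source, the rest of the notebook only needs the resulting string.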

In [None]:
# Function to read a PDF and extract its text
def pdf_to_text(pdf_path, start_page=0):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Loop from start_page to the end
        for page in reader.pages[start_page:]:
            # Fall back to "" if a page yields no text, so the concatenation never fails
            text += (page.extract_text() or "") + "\n"
    return text

Depending on your extraction method, you will get different results. A fancier program or library could produce cleaner output, which would make the raw data easier to work with. The next cell shows an example using the good old PyPDF2 reader which, as you will see, gives AMAZING results for a lightweight option.

In [None]:
raw_text = pdf_to_text(pdf_path=pdf_path)
print("Raw text extracted from PDF.\n{}\n".format(raw_text[:500]))  # Print first 500 characters
# This is very cool because, as you can see, the column layout is respected: this text only contains the first column, and nothing from "Racial Traits" appears

As you can see, we have data...but it is raw. We now need to divide it, but how do we do so?
### Things to consider:
- What is a line?  
- How do we define where it starts and ends?  
- What happens with sentences that -  
finish in different lines?
- Or sentences that start in one

page and then finish in another one?
- What punctuation do we use?  
- _Why is a sentence?_  
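
One way to approach the wrapped-sentence questions above is to normalize the text before chunking. This is a minimal sketch, assuming that a hyphen at the end of a line always marks a word split across lines (not always true, e.g. for compound words) and that a single newline is a soft wrap while a blank line is a paragraph break. The name `unwrap_lines` and the sample string are made up for illustration.

```python
import re

def unwrap_lines(text):
    """Join soft-wrapped lines so sentences are no longer cut by line breaks."""
    # Rejoin words hyphenated across a line break: "jum-\nps" -> "jumps"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single newlines with a space, keep blank lines as paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

sample = "The quick brown fox jum-\nps over\nthe lazy dog.\n\nNew paragraph."
print(unwrap_lines(sample))
```

Sentences that continue across a page break are harder, since the page boundary looks like any other newline once the pages are concatenated.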

In [None]:
# Function to split text into chunks of n lines using line breaks
def split_text_by_lines(text, lines_per_chunk=5):
    lines = text.splitlines()
    chunks = []
    for i in range(0, len(lines), lines_per_chunk):
        chunk = "\n".join(lines[i:i+lines_per_chunk])
        chunks.append({"chunk_id": i // lines_per_chunk + 1, "text": chunk})
    return chunks

In [None]:
chunks_by_line_jump = split_text_by_lines(raw_text, lines_per_chunk=1) # Using 1 line per chunk for clarity
df_line_jump = pd.DataFrame(chunks_by_line_jump)
df_line_jump.to_csv(output_folder + "lines_chunk.csv", index=False)
print(df_line_jump.head(10))  # Display the first 10 chunks

In [None]:
# Function to split text into chunks of n sentences (split on '.', '!' or '?'), handling line breaks
def split_text_by_sentences(text, sentences_per_chunk=5):
    import re
    # Replace line breaks with spaces to avoid breaking sentences
    clean_text = re.sub(r'\s*\n\s*', ' ', text)
    # Split by period, question mark, or exclamation mark followed by space or end of string
    sentences = re.split(r'(?<=[.!?])\s+', clean_text)
    # Remove empty sentences
    sentences = [s.strip() for s in sentences if s.strip()]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i+sentences_per_chunk])
        chunks.append({"chunk_id": i // sentences_per_chunk + 1, "text": chunk})
    return chunks

In [None]:
chunks_by_sentence = split_text_by_sentences(text=raw_text, sentences_per_chunk=1) # Using 1 sentence per chunk for clarity
df_sentences = pd.DataFrame(chunks_by_sentence)

# Show all rows and columns, and prevent text truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

print(df_sentences.head(10))  # Display the first 10 chunks by sentences
df_sentences.to_csv(output_folder + "sentences_chunks.csv", index=False)

# Return settings to default
pd.reset_option('display.max_rows')
pd.reset_option('display.max_colwidth')

It is still not perfect. For example, a name with an initial such as "A. Bonavides" is split into two sentences, because the regex treats any period followed by whitespace as a sentence end. You may also want to keep more structure, such as grouping text by chapter or section heading (e.g., "Ability Score Increase"), which requires a more advanced, semantic division of the PDF.  
But to get started, this approach is more than enough!
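
To see the initial problem concretely, along with one possible mitigation, here is a minimal sketch. It assumes that a single capital letter followed by a period is an initial rather than a sentence end. That heuristic and the name `split_sentences_keep_initials` are my assumptions; it still misses abbreviations like "e.g." or "Dr." and wrongly glues sentences that end in a lone capital letter.

```python
import re

def split_sentences_keep_initials(text):
    """Split on ., ! or ? followed by whitespace, except after a single-letter initial."""
    # (?<!\b[A-Z]\.) rejects a split point that directly follows an initial such as "A."
    pattern = r"(?<!\b[A-Z]\.)(?<=[.!?])\s+"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]

sample = "Written by A. Bonavides. It is a good book! Is it?"

# Naive split, same regex as split_text_by_sentences
print(re.split(r"(?<=[.!?])\s+", sample))
# ['Written by A.', 'Bonavides.', 'It is a good book!', 'Is it?']

print(split_sentences_keep_initials(sample))
# ['Written by A. Bonavides.', 'It is a good book!', 'Is it?']
```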

In [None]:
import re

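def split_text_by_headings(text):
    # Example: a heading is a single line of uppercase letters and spaces, at least 3 characters long
    # [ \t] instead of \s keeps the match on one line (\s would also match newlines)
    pattern = r'(?m)^(?P<heading>[A-Z][A-Z \t]{2,})$'
    splits = [m.start() for m in re.finditer(pattern, text)]
    # Keep any text that comes before the first heading as its own chunk
    if not splits or splits[0] != 0:
        splits.insert(0, 0)
    splits.append(len(text))
    chunks = []
    for i in range(len(splits)-1):
        chunk_text = text[splits[i]:splits[i+1]].strip()
        if chunk_text:
            # Number only the chunks that are kept, so ids stay contiguous
            chunks.append({"chunk_id": len(chunks) + 1, "text": chunk_text})
    return chunks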

In [None]:
print(raw_text[:500])  # Print first 500 characters for context
chunks_by_heading = split_text_by_headings(text=raw_text)
df_headings = pd.DataFrame(chunks_by_heading)
print(df_headings.head(10))  # Display the first 10 chunks by headings

## Splitting by chapters!

In [None]:
def split_text_by_chapters(text):
    # Regex to match lines like "Chapter 1: Title" or "Chapter 2 Title"
    pattern = r'(?im)^(Chapter\s+(\d+)[^\n]*)$'
    matches = list(re.finditer(pattern, text))
    chapters = []
    for i, match in enumerate(matches):
        chapter_line = match.group(1).strip()
        chapter_num = match.group(2)
        start = match.end()
        end = matches[i+1].start() if i+1 < len(matches) else len(text)
        content = text[start:end].strip()
        chapters.append({
            "chapter_number": chapter_num,
            "chapter_title": chapter_line,
            "content": content
        })
    return chapters

In [None]:
def split_text_by_chapters(text):
    import re
    # Regex: "chapter" (with optional spaces), number, optional spaces, colon (with optional spaces), then title
    # Allow optional non-word chars or digits before "chapter" (to handle cases like "20 10 1Ch apter  5: E q u ipm en t")
    pattern = r'(?im)^.*?(?:[Cc]\s*[Hh]\s*[Aa]\s*[Pp]\s*[Tt]\s*[Ee]\s*[Rr])\s*(\d+)\s*:\s*[^\n]*$'
    matches = list(re.finditer(pattern, text, re.MULTILINE))
    chapters = []
    expected_chapter = 1
    last_chapter = None

    # Handle introduction (everything before first chapter)
    if matches and matches[0].start() > 0:
        intro_content = text[:matches[0].start()].strip()
        if intro_content:
            chapters.append({
                "chapter_number": "introduction",
                "chapter_title": "Introduction",
                "content": intro_content
            })

    for i, match in enumerate(matches):
        # Find the full matched line for the chapter title
        line_start = text.rfind('\n', 0, match.start()) + 1
        line_end = text.find('\n', match.start())
        if line_end == -1:
            line_end = len(text)
        chapter_line = text[line_start:line_end].strip()
        chapter_num = int(match.group(1))
        # Check for sequential chapter numbers
        if chapter_num != expected_chapter:
            raise ValueError(
                f"Expected Chapter {expected_chapter} after Chapter {last_chapter}, but found Chapter {chapter_num}"
            )
        last_chapter = chapter_num
        expected_chapter += 1
        start = match.end()
        end = matches[i+1].start() if i+1 < len(matches) else len(text)
        content = text[start:end].strip()
        chapters.append({
            "chapter_number": str(chapter_num),
            "chapter_title": chapter_line,
            "content": content
        })
    return chapters

In [None]:
import re

pattern = r'(?im)^((?:[Cc]\s*)?(?:[Hh]\s*)?(?:[Aa]\s*)?(?:[Pp]\s*)?(?:[Tt]\s*)?(?:[Ee]\s*)?(?:[Rr]\s*)\s*(\d+)(?:\s|:)[^\n]*)$'
test_text = """
355,000 20 +6

Ch apter  2: R aces
A  VISIT TO ONE OF TH
"""

matches = list(re.finditer(pattern, test_text, re.MULTILINE))
for m in matches:
    print("MATCH:", m.group(1))

In [None]:
all_book_pdf_path = "./resources/DNDPlayersHandbook.pdf"
all_book_raw_text = pdf_to_text(pdf_path=all_book_pdf_path, start_page=2)

In [None]:
# Save the raw text to a file for reference (explicit UTF-8, matching the JSON export further down)
with open(output_folder + "all_book_raw_text.txt", "w", encoding="utf-8") as f:
    f.write(all_book_raw_text)

In [None]:
chapters = split_text_by_chapters(all_book_raw_text)
df_chapters = pd.DataFrame(chapters)
df_chapters.to_csv(output_folder + "chapters.csv", index=False)
print(df_chapters.head())

In [None]:
# Build the dictionary: chapter number as key, value is dict with title and content
chapters_dict = {
    chapter["chapter_number"]: {
        "title": chapter["chapter_title"],
        "content": chapter["content"]
    }
    for chapter in chapters
}

# Save to JSON
with open(output_folder + "chapters.json", "w", encoding="utf-8") as f:
    json.dump(chapters_dict, f, ensure_ascii=False, indent=2)

In [None]:
# Uncomment to print the first two chapters for verification
# print(json.dumps({k: chapters_dict[k] for k in list(chapters_dict)[:2]}, indent=2))

## The Two Main Parts of RAG

When working with Retrieval-Augmented Generation (RAG), there are two main components to consider:

1. **Structuring Your Information:**  
    The first step is to ensure your data is organized in a way that makes it easy to retrieve and use. This involves cleaning, chunking, and formatting your information so that it can be efficiently searched and referenced.

2. **Choosing What to Send:**  
    Once your data is well-structured, the next challenge is deciding which pieces of information to send to your model or downstream process. This selection step is crucial for maximizing relevance and performance.
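
To make the second part a bit more concrete, here is a minimal sketch of a selection step. It assumes chunks shaped like the dictionaries built earlier (`chunk_id`, `text`) and scores them by naive word overlap with the question; real systems usually rely on embeddings or BM25 instead. The function name `select_chunks` and the sample chunks are made up for illustration.

```python
import re

def select_chunks(question, chunks, top_k=2):
    """Return up to top_k chunks that share the most distinct words with the question."""
    def words(s):
        return set(re.findall(r"[a-z0-9']+", s.lower()))

    q = words(question)
    scored = [(len(q & words(c["text"])), c) for c in chunks]
    # Highest score first; ties keep their original order because the sort is stable
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks with no overlap at all
    return [c for score, c in scored[:top_k] if score > 0]

# Made-up sample chunks, only to show the shape of the data
sample_chunks = [
    {"chunk_id": 1, "text": "Dwarves get a bonus to Constitution."},
    {"chunk_id": 2, "text": "Elves have keen senses and darkvision."},
    {"chunk_id": 3, "text": "Halflings are small and lucky."},
]
print(select_chunks("Do elves have darkvision?", sample_chunks))
```

Only the selected chunks would then be passed to the model together with the question.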

---

In the next section, I'll focus on strategies for choosing what information to send.