Setup

In [1]:
%load_ext dotenv
%dotenv
import json
import os
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))

In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

Load the parsed contents of the book

I used an LLM to generate this JSON file, which describes the chapter boundaries of the book. That makes this ingest method specific to this one PDF, and the same goes for the page offset. We could build a pipeline that automatically detects the table of contents and derives each chapter's start page, end page and the page offset, but what happens when we upload a PDF with no table of contents? That's the magic of RAG: you can optimize it as much as you want for your specific use case, but it will never be perfect or universal.

In [3]:
json_file_path = "../data/processed/contents.json"

with open(json_file_path, "r") as f:
    chapters_json = json.load(f)
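
The ingest loop later in this notebook reads four keys from each entry: chapter_number, title, start_page and end_page (where end_page may be null for the last chapter). A minimal sketch of that shape plus a sanity check follows. The page numbers and the helper name validate_chapters are made up for illustration and are not taken from the real contents.json.

```python
import json

# Hypothetical sample in the shape the ingest loop expects.
# The page numbers are invented for illustration.
sample = json.loads("""
[
  {"chapter_number": "1", "title": "An Introduction to Data Mining",
   "start_page": 1, "end_page": 25},
  {"chapter_number": "2", "title": "Data Preparation",
   "start_page": 27, "end_page": null}
]
""")

REQUIRED_KEYS = {"chapter_number", "title", "start_page", "end_page"}


def validate_chapters(chapters):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for i, ch in enumerate(chapters):
        missing = REQUIRED_KEYS - ch.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue
        end = ch["end_page"]
        if end is not None and end < ch["start_page"]:
            problems.append(f"entry {i}: end_page before start_page")
    return problems


print(validate_chapters(sample))
```

Running a check like this right after loading catches a malformed LLM-generated file before it silently produces empty or misaligned chapters.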

Set the page offset, since "page 1" of the book is on the 27th page of the PDF.

In [4]:
PAGE_OFFSET = 26
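
As a quick check of the arithmetic, here is a small sketch that maps 1-based book page numbers to a 0-based slice over the loaded PDF pages. The helper name chapter_slice and the stand-in list pdf_pages are assumptions for illustration; the offset value is the one set above.

```python
PAGE_OFFSET = 26  # same value as in the notebook


def chapter_slice(start_page, end_page, offset=PAGE_OFFSET):
    """Map 1-based book page numbers to a 0-based slice over the PDF pages."""
    start_idx = start_page + offset - 1
    # end_page is inclusive, so the slice stop is one past its index;
    # None means "to the end of the PDF".
    stop_idx = None if end_page is None else end_page + offset
    return slice(start_idx, stop_idx)


# Stand-in for 100 PDF pages, labelled by their 1-based position in the PDF.
pdf_pages = list(range(1, 101))

# Book pages 1 to 3 should land on PDF pages 27 to 29.
print(pdf_pages[chapter_slice(1, 3)])
```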

We use the recursive character text splitter, which splits each chapter's text into smaller chunks. It tries to preserve each chunk's semantic meaning by splitting first on paragraph breaks, then on line breaks, and so on, until each chunk is smaller than the configured chunk size.

In [7]:
file_path_to_book = "../data/raw/Textbook.pdf"

In [8]:
loader = PyPDFLoader(file_path_to_book)
pages = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
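
To make the splitting order concrete, here is a simplified pure-Python sketch of the recursive idea. It is not LangChain's actual implementation: it ignores chunk_overlap, and the function name recursive_split is made up for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort: no separators left (or the empty one), cut at fixed widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "aaaa bbbb\n\ncccc dddd eeee"
chunks = recursive_split(text, chunk_size=10)
print(chunks)
```

The paragraph break is tried first, so the first paragraph stays whole; only the second paragraph, which is too long, falls through to the space separator.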



In [9]:
all_chunks = []

In [10]:
for chapter in chapters_json:
    start_page = chapter["start_page"]
    end_page = chapter["end_page"]
    
    start_idx = start_page + PAGE_OFFSET - 1
    
    if end_page is not None:
        end_idx = end_page + PAGE_OFFSET - 1
        chapter_pages = pages[start_idx : end_idx + 1] 
    else:
        chapter_pages = pages[start_idx : ]
        
    chapter_text = "\n".join([page.page_content for page in chapter_pages])
    
    if len(chapter_pages) > 0:
        chapter_metadata = chapter_pages[0].metadata.copy()
    else:
        chapter_metadata = {}
    
    # Drop PDF-level fields that carry no chapter-specific information.
    for key in ("page", "page_label", "subject", "producer", "creator",
                "creationdate", "author", "moddate", "title"):
        chapter_metadata.pop(key, None)

    chapter_metadata["chapter_number"] = chapter["chapter_number"]
    chapter_metadata["chapter_title"] = chapter["title"]
    
    chapter_doc = Document(page_content=chapter_text, metadata=chapter_metadata)
    
        
    chapter_chunks = text_splitter.split_documents([chapter_doc])
    
    all_chunks.extend(chapter_chunks)    

In [11]:
print(f"Total chunks created: {len(all_chunks)}")
if len(all_chunks) > 0:
    print("\nSample Metadata:")
    print(all_chunks[0].metadata)
    print("\nSample chunk content:")
    print(all_chunks[0].page_content)

Total chunks created: 1251

Sample Metadata:
{'keywords': '', 'source': '../data/raw/Textbook.pdf', 'total_pages': 746, 'chapter_number': '1', 'chapter_title': 'An Introduction to Data Mining'}

Sample chunk content:
Chapter 1
An Introduction to Data Mining
“Education is not the piling on of learning, information, data, facts, skills,
or abilities – that’s training or instruction – but is rather making visible
what is hidden as a seed.”—Thomas More
1.1 Introduction
Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful
insights from data. A wide variation exists in terms of the problem domains, applications,
formulations, and data representations that are encountered in real applications. Therefore,
“data mining” is a broad umbrella term that is used to describe these diﬀerent aspects of
data processing.
In the modern age, virtually all automated systems generate some form of data either
for diagnostic or analysis purposes. This has resulted in a de

Since the embedding model only embeds each document's page_content, our metadata doesn't contribute to the similarity search later. We could have our RAG agent filter chunks by metadata as well, but we are going to do something different: we will inject the chunk metadata into page_content so that it adds to the semantic meaning of each chunk.

In [24]:
for chunk in all_chunks:
    ch_num = chunk.metadata.get("chapter_number", "Unknown")
    ch_title = chunk.metadata.get("chapter_title", "Unknown Title")
    source_file = chunk.metadata.get("source", "Unknown Source")
    
    metadata_header = (
        f"Chapter {ch_num}: {ch_title}\n"
        f"Source: {source_file}\n"
        f"----------\n"
    )

    chunk.page_content = metadata_header + chunk.page_content
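
One thing to watch in a notebook: the cell above mutates the chunks in place, so running it twice prepends the header twice. A minimal sketch of an idempotent variant follows, working on plain strings and dicts. The helper name with_header and the sample metadata are assumptions for illustration.

```python
def with_header(page_content, metadata):
    """Prepend the metadata header unless it is already there."""
    header = (
        f"Chapter {metadata.get('chapter_number', 'Unknown')}: "
        f"{metadata.get('chapter_title', 'Unknown Title')}\n"
        f"Source: {metadata.get('source', 'Unknown Source')}\n"
        "----------\n"
    )
    # Re-running the cell should not stack a second header on top.
    if page_content.startswith(header):
        return page_content
    return header + page_content


# Hypothetical metadata in the same shape as the chunk metadata above.
meta = {"chapter_number": "1", "chapter_title": "Intro", "source": "book.pdf"}
once = with_header("body text", meta)
twice = with_header(once, meta)
print(once == twice)
```

In the loop above this would be used as chunk.page_content = with_header(chunk.page_content, chunk.metadata).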



In [25]:
print("--- PREVIEW OF CHUNK 50 ---")
print(all_chunks[50].page_content)

--- PREVIEW OF CHUNK 50 ---
Chapter 2: Data Preparation
Source: ../data/raw/data_mining_the_textbook.pdf
----------
ent segments of an image. More recently, the use ofvisual words has become more
2.2. FEATURE EXTRACTION AND PORTABILITY 29
popular. This is a semantically rich representation that is similar to document data.
One challenge in image processing is that the data are generally very high dimen-
sional. Thus, feature extraction can be performed at diﬀerent levels, depending on the
application at hand.
3. Web logs:Web logs are typically represented as text strings in a prespeciﬁed format.
Because the ﬁelds in these logs are clearly speciﬁed and separated, it is relatively easy
to convert Web access logs into a multidimensional representation of (the relevant)
categorical and numeric attributes.
4. Network traﬃc:In many intrusion-detection applications, the characteristics of the
network packets are used to analyze intrusions or other interesting activity. Depending
on the underly