📂 Option 1: Download to /tmp/ (local storage)
### 1. Download sample PDFs from GitHub

In [0]:
%sh
mkdir -p /tmp/pdfs
# Quote the URL so the shell does not treat "?" as a glob character
wget -O /tmp/pdfs/sample1.pdf "https://github.com/mozilla/pdf.js-sample-files/blob/master/tracemonkey.pdf?raw=true"
#wget -O /tmp/pdfs/sample2.pdf tracemonkey.pdf


### 2. Verify Files Exist

In [0]:
import os
os.listdir("/tmp/pdfs")

In [0]:
%sh
ls -lh /tmp/pdfs

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

In [0]:
%pip install --quiet PyPDF2 sentence-transformers faiss-cpu 

In [0]:
%restart_python

#### ⚡ Rule of Thumb:
If Spark can’t guess the schema → you must tell it explicitly.

### 1. Re-create the page-level DataFrame (your working patch)

In [0]:
import os
from pyspark.sql.types import StructType, StructField, StringType
import PyPDF2

pdf_path = "/tmp/pdfs/sample1.pdf"   # <-- change if needed
reader = PyPDF2.PdfReader(pdf_path)

# Collect page-level chunks
page_chunks = []
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    if text:
        page_chunks.append((os.path.basename(pdf_path), f"page_{i+1}", text))

# Schema for DataFrame
schema = StructType([
    StructField("filename", StringType(), True),
    StructField("page", StringType(), True),
    StructField("content", StringType(), True)
])

# Create Spark DataFrame
page_df = spark.createDataFrame(page_chunks, schema=schema)
display(page_df.limit(5))


### 2. Persist page-level DataFrame into a Delta (Bronze) table

In [0]:
# %python
# Create a small database to keep things organized (optional)
spark.sql("CREATE DATABASE IF NOT EXISTS demo_docs")

# Write bronze
page_df.write.format("delta").mode("overwrite").saveAsTable("demo_docs.pages_bronze")
print("Saved demo_docs.pages_bronze")


### 3. Chunk text strings into smaller pieces (sliding window / overlap)
We'll chunk by approximate word count. You can change chunk_size and overlap to tune retrieval granularity.

> [!NOTE] What is **RDD flatMap**?
> In Apache Spark, `flatMap` is a transformation that applies a function to each element of a Resilient Distributed Dataset (RDD) and then flattens the results into a single RDD, giving a one-to-many mapping. Unlike `map`, which returns exactly one output for each input, the function passed to `flatMap` can return zero, one, or more elements, which makes it a good fit for tasks such as splitting a line of text into individual words. (Source: DataFlair, "Apache Spark Map vs FlatMap Operation".)
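
The chunking step described in this section can be sketched as below. This is a minimal sketch under assumptions, not the notebook's own cell: `chunk_words` and `page_to_chunks` are hypothetical helper names, the defaults `chunk_size=200` and `overlap=50` are illustrative, and the chunk id format is made up for the example. The list comprehension at the end mimics what `flatMap` does (each page yields zero or more chunk rows, flattened into one list), so the logic can be checked without a cluster.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def page_to_chunks(row, chunk_size=200, overlap=50):
    """Turn one (filename, page, content) row into zero or more chunk rows."""
    filename, page, content = row
    pieces = chunk_words(content or "", chunk_size, overlap)
    return [(filename, page, f"{page}_chunk_{i + 1}", piece)
            for i, piece in enumerate(pieces)]


# Plain-Python stand-in for flatMap: one input row -> many output rows.
rows = [
    ("sample1.pdf", "page_1", "a b c d e f g"),
    ("sample1.pdf", "page_2", ""),  # an empty page contributes no chunks
]
flat = [c for row in rows for c in page_to_chunks(row, chunk_size=4, overlap=1)]

# On Spark (assumes the cluster exposes the RDD API; column names are
# hypothetical):
# chunk_df = page_df.rdd.flatMap(page_to_chunks).toDF(
#     ["filename", "page", "chunk_id", "chunk"])
```

A larger `overlap` repeats more context between neighbouring chunks at the cost of more rows to embed; a smaller `chunk_size` gives finer retrieval granularity.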

### 4. Create embeddings for each chunk (using Sentence-Transformers)

For Free Edition, we’ll use an in-session model (`all-MiniLM-L6-v2`) from sentence-transformers, which is small and fast. We will convert the chunk DataFrame to pandas (suitable only for small datasets, since all rows are collected to the driver) and compute embeddings in batches.
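
The embedding step could look roughly like the sketch below. It is a sketch under assumptions rather than the notebook's actual cell: `load_model` and `embed_texts` are hypothetical helper names, and `chunk_df` with a `chunk` column is an assumed shape for the chunk DataFrame. `SentenceTransformer.encode` does the batching itself through its `batch_size` argument, so no manual batching loop is needed.

```python
def load_model(name="all-MiniLM-L6-v2"):
    """Load the sentence-transformers model inside the current session."""
    # Imported lazily so embed_texts can be exercised without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def embed_texts(texts, model, batch_size=32):
    """Return one embedding (a list of floats) per input text."""
    if not texts:
        return []
    # encode() batches internally; batch_size bounds memory use per step.
    vectors = model.encode(texts, batch_size=batch_size,
                           show_progress_bar=False)
    return [[float(x) for x in vec] for vec in vectors]


# Usage in the notebook (hypothetical names; needs a live Spark session):
# chunks_pdf = chunk_df.toPandas()          # small datasets only
# model = load_model()
# chunks_pdf["embedding"] = embed_texts(chunks_pdf["chunk"].tolist(), model)
```

With `all-MiniLM-L6-v2`, each embedding has 384 dimensions. Converting the vectors to plain Python lists keeps them easy to store in a pandas column or write back to a Spark DataFrame.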