# Document Processing

## Techniques for Processing a Thousand-Page Document

When working with large documents such as [Computer Networks by Andrew S. Tanenbaum](https://csc-knu.github.io/sys-prog/books/Andrew%20S.%20Tanenbaum%20-%20Computer%20Networks.pdf), efficient processing is essential. Here are the main techniques we will use:

### 1. Chunking
- **Definition:** Splitting the document into smaller, manageable sections (chunks) such as paragraphs, pages, or chapters.
- **Purpose:** Enables parallel processing, easier indexing, and targeted analysis.
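
As a minimal sketch of the idea, assuming plain text whose paragraphs are separated by blank lines, the hypothetical helper `chunk_by_paragraph` below packs paragraphs greedily into chunks of at most `max_chars` characters. The function name and the size limit are illustrative and are not part of the pipeline used later in this document.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of <= max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits (or this is the first paragraph of a new chunk).
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for chunk in chunk_by_paragraph(sample, max_chars=40):
    print(repr(chunk))
```

Splitting on paragraph boundaries keeps each chunk readable on its own; a fixed character window would be simpler but can cut sentences in half.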

### 2. Multimodal Processing
- **Definition:** Combining text, images, tables, and diagrams for comprehensive understanding.
- **Purpose:** Extracts information from both textual and visual elements, improving accuracy and context.

### 3. Preprocessing Pipelines
- **Definition:** Sequential steps to clean and prepare data, including OCR, tokenization, normalization, and noise removal.
- **Purpose:** Ensures consistent and high-quality input for downstream tasks like summarization, search, or classification.
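
One way to picture such a pipeline is as an ordered list of small text-to-text functions. The sketch below is illustrative only: the step names (`normalize_unicode`, `join_hyphenated_line_breaks`, `collapse_whitespace`) and the `run_pipeline` helper are assumptions made for this example, and OCR and tokenization are left out.

```python
import re
import unicodedata
from typing import Callable

Step = Callable[[str], str]


def normalize_unicode(text: str) -> str:
    """Fold compatibility characters (e.g. ligatures) into plain equivalents."""
    return unicodedata.normalize("NFKC", text)


def join_hyphenated_line_breaks(text: str) -> str:
    """Rejoin words split across lines, e.g. 'net-' + newline + 'work' -> 'network'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace inside paragraphs but keep paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


def run_pipeline(text: str, steps: list[Step]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


PIPELINE: list[Step] = [normalize_unicode, join_hyphenated_line_breaks, collapse_whitespace]

raw = "The \ufb01rst net-\nwork   layer.\n\n\nNext   paragraph."
print(run_pipeline(raw, PIPELINE))
```

Keeping each step as a separate function makes it easy to reorder, drop, or test steps individually.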

---

By integrating these techniques, we can efficiently analyze, summarize, and extract insights from extensive technical documents.

## Semantic Chunking

```python
import requests
from io import BytesIO
from PyPDF2 import PdfReader
import re
import os
import time
from concurrent.futures import ThreadPoolExecutor
from dotenv import load_dotenv

# LangChain components
from langchain_experimental.text_splitter import SemanticChunker
import spacy

# === Load Environment Variables ===
load_dotenv()
DEEPINFRA_API_KEY = os.getenv("DEEPINFRA_API_KEY")
if not DEEPINFRA_API_KEY:
    raise ValueError("Please set DEEPINFRA_API_KEY in your .env file or environment.")

DEEPINFRA_API_KEY = DEEPINFRA_API_KEY.strip()

# === Config ===
pdf_filename = "sample.pdf"
pdf_url = "https://www.princexml.com/samples/textbook/somatosensory.pdf"
deepinfra_api_url = "https://api.deepinfra.com/v1/openai/embeddings"
model_name = "Qwen/Qwen3-Embedding-8B"
batch_size = 8
max_retries = 3
delay_between_batches = 1  # Seconds; adjust based on rate limits

# === Step 1: Download PDF if not present ===
if not os.path.exists(pdf_filename):
    print(f"{pdf_filename} not found. Downloading...")
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()
    with open(pdf_filename, "wb") as f:
        f.write(response.content)
else:
    print(f"{pdf_filename} found in current directory.")

with open(pdf_filename, "rb") as f:
    pdf_file = BytesIO(f.read())

# === Step 2: Extract text from PDF ===
reader = PdfReader(pdf_file)
text = ""
for page in reader.pages:
    page_text = page.extract_text()
    if page_text:
        text += page_text + "\n"

# === Step 3: Preprocess text ===
# Collapses every run of whitespace, including newlines, into a single space.
text = re.sub(r'\s+', ' ', text)
text = text.strip()
print(f"Total extracted text length: {len(text)} characters\n")

# === Step 4: DeepInfra Embedding Wrapper with Instruction Support ===
class DeepInfraEmbeddings:
    def __init__(self, api_key, model, batch_size=8, max_retries=3, delay=1,
                 api_url="https://api.deepinfra.com/v1/openai/embeddings"):
        self.api_key = api_key
        self.model = model
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.delay = delay
        self.api_url = api_url

    def _send_batch(self, texts):
        # Prepend a task instruction, as recommended for Qwen3 embedding models
        task = "Given a textbook passage, retrieve related concepts"
        instructed_texts = [
            f"Instruct: {task}\nQuery: {t}" for t in texts
        ]

        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }
        payload = {
            "input": instructed_texts,
            "model": self.model,
            "encoding_format": "float"
        }

        # Retry with exponential backoff (1s, 2s, 4s, ...)
        for attempt in range(self.max_retries):
            try:
                response = requests.post(self.api_url, json=payload, headers=headers, timeout=30)
                if response.status_code == 200:
                    data = response.json()["data"]
                    return [d["embedding"] for d in data]
                elif response.status_code == 429:
                    print(f"Rate limited. Retrying in {2 ** attempt} seconds...")
                    time.sleep(2 ** attempt)
                else:
                    print(f"Error {response.status_code}: {response.text}")
                    time.sleep(2 ** attempt)
            except requests.RequestException as e:
                print(f"Request failed (attempt {attempt + 1}): {e}")
                time.sleep(2 ** attempt)
        raise RuntimeError(f"Failed to get embeddings after {self.max_retries} retries.")

    def embed_documents(self, texts):
        all_embeddings = []
        # Ceiling division, so an exact multiple of batch_size is not over-counted
        total_batches = (len(texts) + self.batch_size - 1) // self.batch_size
        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            print(f"Embedding batch {i // self.batch_size + 1} / {total_batches} (size: {len(batch)})")
            batch_embeddings = self._send_batch(batch)
            all_embeddings.extend(batch_embeddings)
            # Pause between batches, but not after the last one
            if self.delay > 0 and i + self.batch_size < len(texts):
                time.sleep(self.delay)
        return all_embeddings

    def embed_query(self, text):
        return self.embed_documents([text])[0]

# === Initialize Embedder ===
embeddings = DeepInfraEmbeddings(
    api_key=DEEPINFRA_API_KEY,
    model=model_name,
    batch_size=batch_size,
    max_retries=max_retries,
    delay=delay_between_batches,
    api_url=deepinfra_api_url,
)

print(f"Using DeepInfra API with model: {model_name}")

# === Step 5: Initialize SemanticChunker ===
# Note: `nlp` is loaded here but is not passed to SemanticChunker below.
try:
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    print("Downloading spaCy model...")
    os.system("python -m spacy download en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

semantic_text_splitter = SemanticChunker(
    embeddings,
    buffer_size=3,
    breakpoint_threshold_type="percentile",  # Tune via breakpoint_threshold_amount (try 90-95)
    add_start_index=True,
)

# === Step 6: Split into Paragraphs for Parallel Processing ===
# Note: Step 3 already replaced all newlines with spaces, so this split
# currently yields a single paragraph and Step 7 runs as one task.
paragraphs = re.split(r'\n\s*\n', text)
paragraphs = [p.strip() for p in paragraphs if len(p.strip()) > 50]
print(f"Processing {len(paragraphs)} paragraphs...\n")

# === Step 7: Parallel Semantic Chunking per Paragraph ===
def chunk_paragraph(paragraph):
    try:
        docs = semantic_text_splitter.create_documents([paragraph])
        return [
            {
                "content": doc.page_content,
                "metadata": doc.metadata.copy()
            }
            for doc in docs
        ]
    except Exception as e:
        print(f"Error processing paragraph: {str(e)[:100]}...")
        return []

semantic_chunks = []
with ThreadPoolExecutor(max_workers=3) as executor:  # Conservative to avoid rate limits
    futures = [executor.submit(chunk_paragraph, p) for p in paragraphs]
    # Collect in submission order so chunks stay in document order
    for future in futures:
        try:
            result = future.result()
            semantic_chunks.extend(result)
        except Exception as e:
            print(f"Task failed: {e}")

# === Step 8: Final Results ===
print(f"\n✅ Generated {len(semantic_chunks)} semantic chunks.\n")
for i, chunk in enumerate(semantic_chunks[:3]):
    preview = chunk["content"][:300]
    start_idx = chunk["metadata"].get("start_index", "N/A")
    print(f"Chunk {i+1} (Start Index: {start_idx})")
    print(f"Content (preview): {preview}...\n")
```

```text
sample.pdf not found. Downloading...
Total extracted text length: 7537 characters

Using DeepInfra API with model: Qwen/Qwen3-Embedding-8B
Processing 1 paragraphs...

Embedding batch 1 / 7 (size: 8)
Embedding batch 2 / 7 (size: 8)
Embedding batch 3 / 7 (size: 8)
Embedding batch 4 / 7 (size: 8)
Embedding batch 5 / 7 (size: 8)
Embedding batch 6 / 7 (size: 8)
Embedding batch 7 / 7 (size: 4)

✅ Generated 4 semantic chunks.

Chunk 1 (Start Index: 0)
Content (preview): This is a sample document to showcase page-based formatting. It contains a chapter from a Wikibook called Sensory Systems . None of the content has been changed in this article, but some content has been remo ved.Anat omy of the Somat osensor y System FROM WIKIBOOKS1 Our somatosensor y system c onsi...

Chunk 2 (Start Index: 4164)
Content (preview): Wide content, lik e the table and Figure 3, intrude into the outside margins.only to intense mechanical stimuli, but also to heat and to no xious chemicals. These rec eptors respon
```
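
For intuition, percentile-based semantic chunking can be pictured as three steps: embed consecutive sentences, measure the cosine distance between neighbours, and start a new chunk wherever that distance exceeds a chosen percentile of all distances. The sketch below is a simplified illustration using hand-made toy vectors. It is not LangChain's implementation, and `split_at_breakpoints`, the interpolated `percentile` helper, and the vectors are all assumptions made for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_at_breakpoints(sentences: list[str], vectors: list[list[float]],
                         pct: float = 95.0) -> list[str]:
    """Start a new chunk wherever the neighbour distance exceeds the threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy 2-D "embeddings": the first two sentences point one way, the last two another.
sentences = ["TCP is reliable.", "TCP uses acknowledgements.",
             "Skin has receptors.", "Receptors sense touch."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for chunk in split_at_breakpoints(sentences, vectors):
    print(chunk)
```

With a high percentile only the largest topic shifts become breakpoints, so lowering `pct` produces more, smaller chunks.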

# Conclusion

- Running the same pipeline on Tanenbaum's *Computer Networks* is too slow for semantic chunking alone: the run took `22m 8.2s` in total and produced 1210 semantic chunks. The heaviest part appears to be step 6.

```text
Generated 1210 semantic chunks.

Chunk 1 (Start Index: 0)
Content (preview): This page intentionally left blank COMPUTER NETWORKS FIFTH EDITION This page intentionally left blank COMPUTER NETWORKS FIFTH EDITION ANDREW S. TANENBAUM Vrije Universiteit Amsterdam, The Netherlands DAVID J. WETHERALL University of Washington Seattle, WA PRENTICE HALL Boston Columbus Indianapolis N...

Chunk 2 (Start Index: 1187)
Content (preview): Dulles Interior Illustrations : Laserwords, Inc. Media Editor : Daniel Sandin Composition : Andrew S....

Chunk 3 (Start Index: 1288)
Content (preview): Tanenbaum Copyeditor : Rachel Head Proofreader : Joe Ruddick Printer/Binder : Courier/Westford Cover Printer: Lehigh-Phoenix Color/ Hagerstown Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on appropriate page within text. Many of the...
```

## Summary

- Semantic chunking is effective but computationally intensive and time-consuming for large documents. The run on Tanenbaum's "Computer Networks" produced 1210 semantic chunks and took more than 22 minutes. Practical large-scale document analysis will likely need further optimization or an alternative approach, such as more efficient batching, more parallelism, or a faster embedding model.
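
To make the batching cost concrete, the sketch below estimates how many embedding requests a run needs and the minimum time spent sleeping between batches. The helper names are hypothetical. The example numbers are inferred from the sample log earlier in this document: six batches of eight texts plus one batch of four gives 52 texts, with a batch size of 8 and a 1-second delay.

```python
import math


def num_batches(n_texts: int, batch_size: int) -> int:
    """Number of requests needed to embed n_texts in batches of batch_size."""
    return math.ceil(n_texts / batch_size)


def min_delay_seconds(n_texts: int, batch_size: int, delay: float) -> float:
    """Lower bound on wall time from fixed sleeps between consecutive batches."""
    return max(0, num_batches(n_texts, batch_size) - 1) * delay


print(num_batches(52, 8))           # 7 requests
print(min_delay_seconds(52, 8, 1))  # 6 seconds of sleeping
print(num_batches(52, 32))          # 2 requests with a larger batch
```

Under these assumptions, a larger batch size reduces both the request count and the time spent sleeping, provided the API accepts batches of that size.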


