PDF Wasm Analyzer — Hybrid Architecture

A browser-based PDF analyzer powered by C++ WebAssembly (custom metadata scanner) + Google PDFium (text extraction & page rendering), with a Django REST backend for logging processing jobs.

Architecture

┌─────────────────────────────────────────────────────┐
│  BROWSER (Client-Side)                              │
│                                                     │
│  ┌──────────────┐  ┌──────────────────────────────┐ │
│  │ pdf_processor │  │ @embedpdf/pdfium (Wasm)      │ │
│  │ .cpp → .wasm  │  │ Text extraction, rendering   │ │
│  │ (Custom C++)  │  │ (Pre-built PDFium binary)    │ │
│  └──────┬───────┘  └──────────┬───────────────────┘ │
│         │                     │                     │
│  ┌──────▼─────────────────────▼───────────────────┐ │
│  │  app.js — JavaScript Bridge                     │ │
│  │  • Calls both Wasm modules                      │ │
│  │  • Renders split-pane UI                        │ │
│  │  • POST results to Django via fetch()           │ │
│  └───────────────────┬─────────────────────────────┘ │
└──────────────────────┼──────────────────────────────┘
                       │  HTTP POST /api/save-pdf-job/
                       │  Authorization: Bearer <token>
┌──────────────────────▼──────────────────────────────┐
│  DJANGO SERVER (Backend)                            │
│                                                     │
│  ┌────────────┐  ┌─────────────┐  ┌──────────────┐ │
│  │ .env        │  │ views.py    │  │ models.py    │ │
│  │ (SECRETS)   │  │ (DRF API)   │  │ (SQLite/PG)  │ │
│  │ SECRET_KEY  │  │ Auth + Save  │  │ ProcessedPDF │ │
│  │ DB creds    │  │             │  │ Job          │ │
│  └────────────┘  └─────────────┘  └──────────────┘ │
└─────────────────────────────────────────────────────┘

Quick Start

Prerequisites

Emscripten SDK (for building the custom C++ analyzer)
Node.js ≥ 16 (for npm/PDFium package)
Python ≥ 3.10 (for Django backend)

1. Build the C++ Analyzer (one-time)

source /path/to/emsdk/emsdk_env.sh
mkdir -p build && cd build
emcmake cmake ..
emmake make
cp pdf_processor.js pdf_processor.wasm ..
cd ..

2. Install Frontend Dependencies

npm install                          # Installs @embedpdf/pdfium
cp node_modules/@embedpdf/pdfium/dist/pdfium.wasm .  # Copy Wasm binary

3. Setup Django Backend

cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env                 # Edit with your secrets
python manage.py migrate
python manage.py createsuperuser     # Optional — for admin panel

4. Run Both Servers

# Terminal 1: Frontend (from project root)
python3 -m http.server 8080

# Terminal 2: Backend (from backend/)
source venv/bin/activate
python manage.py runserver 8000

Open http://localhost:8080 in your browser.

Features

Frontend (Browser)

Multi-file upload with drag & drop
C++ Metadata Analysis — PDF version, magic bytes, page count (heuristic), linearization, encryption
PDF Toolkit (via PDFium):
- 🔀 Merge PDFs — Combine multiple PDFs into one
- ✂️ Split PDF — Extract specific page ranges into new files
- 🗑️ Delete Pages — Remove specific pages from a PDF
- 🔄 Rotate Pages — Rotate pages by 90°, 180°, 270°
- 🔲 Crop Pages — Set custom crop box dimensions
- 📝 Extract Text — Extract native text from PDF layers
- 🖼 Render Pages — Generate high-quality PNG thumbnails for all pages
OCR (Optical Character Recognition) — Powered by tesseract.js (extracts text from scanned images/pages via WebAssembly)
Sidebar — document list with selection, deletion, clear all
localStorage — analysis history persists across sessions

Backend (Django)

REST API at /api/save-pdf-job/ — logs processing transactions
Bearer token auth — simple for PoC, swap to JWT/OAuth2 for production
Admin panel at /admin/ — view all logged jobs
.env security — all secrets isolated on the server

Security: Why `.env` NEVER Reaches the Frontend

Browser sandbox — WebAssembly runs in a browser sandbox. It has zero access to server-side files. It cannot call fopen(), getenv(), or any OS syscall.
SECRET_KEY — if leaked, an attacker can forge Django session cookies and hijack any user session.
DATABASE_URL — if leaked, direct database access is possible (read/write/delete all data).
Architecture is intentional — secrets live ONLY on the Django server. The frontend sends processed PDF metadata TO the server over HTTP. The server decides what to accept and store.
The .env file is not served — python3 -m http.server serves from the project root, but .env lives in backend/, which is a separate directory. Even if it were accessible, .gitignore prevents it from being committed.

API Reference

POST `/api/save-pdf-job/`

Headers:

Content-Type: application/json
Authorization: Bearer <token>

Body:

{
  "filename": "report.pdf",
  "file_size": 245760,
  "pdf_version": "1.7",
  "page_count": 12,
  "operation_type": "full_analysis",
  "text_preview": "First 500 chars of extracted text...",
  "status": "completed"
}

Response (201 Created):

{
  "message": "Job saved successfully.",
  "job": {
    "id": 1,
    "user_identifier": "token-user:dev-toke...",
    "filename": "report.pdf",
    "file_size": 245760,
    "pdf_version": "1.7",
    "page_count": 12,
    "operation_type": "full_analysis",
    "text_preview": "...",
    "status": "completed",
    "created_at": "2026-05-12T21:00:00Z",
    "updated_at": "2026-05-12T21:00:00Z"
  }
}

PDFium Integration Approach

Instead of compiling PDFium from source with Emscripten (notoriously difficult and slow), this project uses @embedpdf/pdfium (v2.14.2) — an npm package that ships PDFium pre-compiled to WebAssembly with JavaScript bindings.

The pdfium_worker.js module:

Fetches pdfium.wasm (4.5MB) and initializes the module once (singleton)
Allocates memory via malloc, copies PDF bytes into Wasm heap
Calls PDFium C API functions (FPDF_LoadMemDocument, FPDFText_LoadPage, etc.)
Cleans up with FPDF_CloseDocument / free in finally blocks

Our custom pdf_processor.cpp coexists alongside PDFium — it handles fast metadata scanning, while PDFium handles the heavy operations (text extraction, page rendering).

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
backend		backend
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
Makefile		Makefile
README.md		README.md
app.js		app.js
index.html		index.html
ocr_worker.js		ocr_worker.js
package.json		package.json
pdf_processor.cpp		pdf_processor.cpp
pdf_processor.js		pdf_processor.js
pdf_processor.wasm		pdf_processor.wasm
pdf_tools.js		pdf_tools.js
pdfium_worker.js		pdfium_worker.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Wasm Analyzer — Hybrid Architecture

Architecture

Quick Start

Prerequisites

1. Build the C++ Analyzer (one-time)

2. Install Frontend Dependencies

3. Setup Django Backend

4. Run Both Servers

Features

Frontend (Browser)

Backend (Django)

Security: Why `.env` NEVER Reaches the Frontend

API Reference

POST `/api/save-pdf-job/`

PDFium Integration Approach

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PDF Wasm Analyzer — Hybrid Architecture

Architecture

Quick Start

Prerequisites

1. Build the C++ Analyzer (one-time)

2. Install Frontend Dependencies

3. Setup Django Backend

4. Run Both Servers

Features

Frontend (Browser)

Backend (Django)

Security: Why .env NEVER Reaches the Frontend

API Reference

POST /api/save-pdf-job/

PDFium Integration Approach

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Security: Why `.env` NEVER Reaches the Frontend

POST `/api/save-pdf-job/`

Packages