PDFOCR

A client-side web app that extracts text from PDF and DOCX files. Everything runs in the browser — no server, no uploads, no data leaves your machine.

Live: nili-l.github.io/PDFOCR

How it works

1. Drop a PDF or DOCX file onto the page (or click to browse).
2. Click "Start OCR Processing".
3. Pages appear in the results panel as they're extracted.
4. Copy the text or download it as a .txt file.

For PDFs, the app first tries to extract embedded text directly. If that text is garbled or missing (common with scanned documents), it falls back to OCR via Tesseract.js. Check "Force OCR" to skip the embedded text and run OCR anyway, which lets you compare OCR output against the original text; this is useful for validating Hebrew extraction accuracy. DOCX files are extracted directly via Mammoth.js and need no OCR.
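
The decision described above can be sketched as a small function. This is an illustration under stated assumptions, not the app's actual code: `chooseExtractionMethod`, its option names, and the placeholder `isReadable` check are all hypothetical.

```javascript
// Hypothetical sketch of the PDF extraction decision; names are illustrative
// and not taken from main.js.
function chooseExtractionMethod(embeddedText, options = {}) {
  const {
    forceOcr = false,
    // Assumed placeholder check: any non-whitespace text counts as usable.
    isReadable = (text) => text.trim().length > 0,
  } = options;

  if (forceOcr) return 'ocr';      // "Force OCR" skips embedded text
  if (!embeddedText) return 'ocr'; // nothing embedded (e.g. a scanned page)
  return isReadable(embeddedText) ? 'embedded' : 'ocr';
}

console.log(chooseExtractionMethod('Hello'));                     // embedded
console.log(chooseExtractionMethod('Hello', { forceOcr: true })); // ocr
console.log(chooseExtractionMethod(''));                          // ocr
```

Passing the readability check in as an option keeps the decision separate from whichever heuristic is used to judge the text.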

What you get

Incremental results — pages appear one at a time as they're processed, with live character/word/page counts
Extraction report — the downloaded .txt file includes the extraction method used and lists any pages that failed
PDF preview — thumbnail previews of the first 5 pages before processing
Crash recovery — if the tab closes mid-extraction, reopen the app to download whatever was saved
Paginated display — large documents show the first 20 pages with a "Load more" button for the rest
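
The paginated display can be pictured as slicing the page list in batches. The sketch below is hypothetical: `visiblePages` and `PAGES_PER_BATCH` are made-up names, and it assumes each "Load more" click reveals another 20 pages, which the description above does not specify.

```javascript
// Hypothetical sketch of "first 20 pages, then Load more".
const PAGES_PER_BATCH = 20; // figure from the feature list; the name is assumed

function visiblePages(pages, batchesLoaded = 1) {
  const end = Math.min(pages.length, batchesLoaded * PAGES_PER_BATCH);
  return {
    shown: pages.slice(0, end),
    hasMore: end < pages.length, // whether "Load more" should stay visible
  };
}

const pages = Array.from({ length: 45 }, (_, i) => `Page ${i + 1}`);
console.log(visiblePages(pages).shown.length);    // 20
console.log(visiblePages(pages, 2).shown.length); // 40
console.log(visiblePages(pages, 3).hasMore);      // false
```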

Privacy

All processing happens in your browser using Web Workers. No file data is sent anywhere. Safe for sensitive documents.

Supported formats

Format	Method	Languages
PDF (with embedded text)	Direct extraction via PDF.js	Any
PDF (scanned/image-only)	OCR via Tesseract.js	Hebrew, English
DOCX	Direct extraction via Mammoth.js	Any

There is no hard file size limit; the browser's available memory is the constraint. The app processes pages one at a time and releases each page's memory before moving on, so memory use stays roughly constant regardless of document length. Results are stored in IndexedDB, not in memory.
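
A minimal sketch of that page-at-a-time loop follows. It is not the app's code: `processPages`, `extractPage`, and `savePage` are hypothetical names, and the two callbacks stand in for the real extraction step and the IndexedDB write.

```javascript
// Hypothetical sketch: extractPage and savePage are stand-ins supplied by the
// caller, not functions from the app.
async function processPages(pageCount, extractPage, savePage) {
  const failedPages = [];
  for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) {
    try {
      // Only one page's text is held at a time; it goes out of scope at the
      // end of the iteration, after it has been persisted.
      const text = await extractPage(pageNumber);
      await savePage(pageNumber, text);
    } catch (err) {
      failedPages.push(pageNumber); // record the failure and keep going
    }
  }
  return failedPages;
}
```

Because each page is persisted before the next one starts, an interrupted run leaves the completed pages in storage, and the returned list identifies the pages that failed.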

Note: The app decides whether embedded PDF text is "garbled" by checking whether at least 50% of its characters are Hebrew, Latin, or numeric. PDFs in other scripts (Arabic, CJK, etc.) may trigger OCR unnecessarily even when the embedded text is fine.
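
The heuristic in the note can be approximated as below. This is a sketch, not the app's implementation: the name `looksReadable`, the exact character ranges, and the choice to ignore whitespace are assumptions.

```javascript
// Hypothetical approximation of the "garbled text" check.
function looksReadable(text, threshold = 0.5) {
  // Assumption: whitespace is ignored so layout does not skew the ratio.
  const chars = Array.from(text.replace(/\s+/g, ''));
  if (chars.length === 0) return false; // empty text is treated as missing

  // Hebrew block (U+0590 to U+05FF), basic Latin letters, and ASCII digits.
  const recognized = chars.filter((ch) => /[\u0590-\u05FFA-Za-z0-9]/.test(ch));
  return recognized.length / chars.length >= threshold;
}

console.log(looksReadable('Invoice 2024, total: 150')); // true
console.log(looksReadable('שלום עולם'));                 // true
console.log(looksReadable('مرحبا بالعالم'));             // false (Arabic)
console.log(looksReadable('你好世界'));                   // false (CJK)
```

The last two calls show why perfectly good Arabic or CJK text would be classified as garbled under this kind of check.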

Run locally

Requires Node.js 16+ and npm.

git clone https://github.com/Nili-L/PDFOCR.git
cd PDFOCR
npm install
npm run dev

Opens at http://localhost:5173.

Build for production

npm run build

Static files go to dist/. Serve them with any static file server, or push to main and GitHub Pages deploys automatically.

OCR tips

Higher-quality scans produce better results
Clear, high-contrast text works best
The app renders pages at 3x scale before OCR — this balances accuracy against speed
Heavily skewed or rotated text may reduce accuracy
Use "Force OCR" to compare OCR output against embedded text on the same file

Customization

Add OCR languages

In main.js, find the createWorker call and change the language codes:

const worker = await createWorker(['heb', 'eng']);

See the Tesseract language list for available codes.

Adjust OCR resolution

Change PDF_RENDER_SCALE in the CONFIG object at the top of main.js:

PDF_RENDER_SCALE: 3.0,  // higher = better quality, slower
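
To get a feel for what a given scale costs, the sketch below estimates the rendered size of a page. It assumes that a scale of 1.0 corresponds to 72 units per inch (PDF points), which is how PDF.js viewports are generally understood to work; `renderSize` is a hypothetical helper, not part of main.js.

```javascript
// Hypothetical helper for estimating the rendered canvas size at a given scale.
const POINTS_PER_INCH = 72; // assumed: scale 1.0 maps one PDF point to one pixel

function renderSize(widthPt, heightPt, scale) {
  const width = Math.floor(widthPt * scale);
  const height = Math.floor(heightPt * scale);
  return {
    width,
    height,
    dpi: POINTS_PER_INCH * scale,       // effective resolution fed to OCR
    megapixels: (width * height) / 1e6, // rough indicator of memory and time
  };
}

// An A4 page is roughly 595 x 842 points.
const a4 = renderSize(595, 842, 3.0);
console.log(a4.width, a4.height, a4.dpi); // 1785 2526 216
```

At the default of 3.0 an A4 page comes out at about 4.5 megapixels. Pixel count grows with the square of the scale, so raising the value to 4.0 means nearly twice as many pixels to render and recognize.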

Tech stack

Tesseract.js — browser-based OCR engine
PDF.js — Mozilla's PDF renderer
Mammoth.js — DOCX text extraction
Vite — build tool and dev server
IndexedDB — browser-native storage for incremental results and crash recovery

Documentation

Guide	Description
Install and Deploy	Self-hosting on GitHub Pages, nginx, Apache, Docker, Caddy
Developer Setup	Project structure, code organization, making changes, testing

Browser support

Works on modern browsers (Chrome, Firefox, Safari, Edge). Requires Web Workers, Canvas API, and IndexedDB support.

License

MIT
