PDF Reader

A Go library for reading PDF files, with actively maintained CJK text extraction support.

Requires Go 1.25+ (go.mod directive).

Forked from ledongthuc/pdf (upstream inactive since 2024). Original lineage: rsc/pdf.

Features

Plain text extraction with context/cancellation support
Styled text extraction (font name, size, position)
Text grouped by row
Document metadata API (/Info dict: title, author, dates, …)
Outline (table of contents) with resolved page numbers
CJK predefined CMap decoders:
- Japanese Shift-JIS (90ms-RKSJ-H/V, 90pv-RKSJ-H)
- CJK UCS-2 BE (UniGB-UCS2-H/V, UniCNS-UCS2-H/V, UniJIS-UCS2-H/V, UniKS-UCS2-H/V)
- Simplified Chinese GBK / GB-EUC / GBKp-EUC (GBK-EUC-H/V, GB-EUC-H/V, GBKp-EUC-H/V)
- Traditional Chinese Big5-ETen / ETenms (ETen-B5-H/V, ETenms-B5-H/V)
- Korean UHC / KSC-EUC / UHC-HW (KSCms-UHC-H/V, KSC-EUC-H/V, KSCms-UHC-HW-H/V)
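
To make the UCS-2 BE family above concrete, here is a minimal sketch of decoding big-endian 16-bit code units into a Go string. It is an illustration only and not the library's decoder: `decodeUCS2BE` is a hypothetical name, and the sketch uses `utf16.Decode` from the standard library, which also accepts surrogate pairs that strict UCS-2 does not define.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// decodeUCS2BE converts big-endian 16-bit code units to a UTF-8 string.
// Hypothetical helper for illustration; not part of the library's API.
func decodeUCS2BE(b []byte) string {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		// High byte first, then low byte (big-endian).
		units = append(units, uint16(b[i])<<8|uint16(b[i+1]))
	}
	// A trailing odd byte is ignored in this sketch.
	return string(utf16.Decode(units))
}

func main() {
	// 0x4E2D and 0x6587 are the code units for "中文".
	fmt.Println(decodeUCS2BE([]byte{0x4E, 0x2D, 0x65, 0x87}))
}
```

The other CMap families listed above are multi-byte legacy encodings and need lookup tables rather than a fixed-width loop like this one.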

Install

go get github.com/Detective-XH/pdf

Examples

See the examples/ folder for runnable programs.

Read plain text

package main

import (
	"bytes"
	"context"
	"fmt"

	"github.com/Detective-XH/pdf"
)

func main() {
	f, r, err := pdf.Open("./sample.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

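	b, err := r.GetPlainText(context.Background())
	if err != nil {
		panic(err)
	}

	// GetPlainText returns a reader; drain it into a buffer before printing.
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(b); err != nil {
		panic(err)
	}
	fmt.Println(buf.String())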
}

Read styled text

package main

import (
	"context"
	"fmt"

	"github.com/Detective-XH/pdf"
)

func main() {
	f, r, err := pdf.Open("./sample.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sentences, err := r.GetStyledTexts(context.Background())
	if err != nil {
		panic(err)
	}
	for _, s := range sentences {
		fmt.Printf("font=%s size=%.1f x=%.1f y=%.1f text=%s\n",
			s.Font, s.FontSize, s.X, s.Y, s.S)
	}
}

Read text by row

package main

import (
	"fmt"
	"os"

	"github.com/Detective-XH/pdf"
)

func main() {
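	// Require a file path so os.Args[1] cannot panic on a missing argument.
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "usage: %s <file.pdf>\n", os.Args[0])
		os.Exit(1)
	}
	f, r, err := pdf.Open(os.Args[1])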
	if err != nil {
		panic(err)
	}
	defer f.Close()

	for i := 1; i <= r.NumPage(); i++ {
		p := r.Page(i)
		if p.V.IsNull() {
			continue
		}
		rows, err := p.GetTextByRow()
		if err != nil {
			panic(err)
		}
		for _, row := range rows {
			fmt.Printf("row %d:", row.Position)
			for _, word := range row.Content {
				fmt.Printf(" %s", word.S)
			}
			fmt.Println()
		}
	}
}
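
Row output depends on how text fragments are ordered. The resolved-issues table notes that #16 was fixed by moving from `sort.Sort` to `sort.Stable`; the sketch below shows why stability matters, using a hypothetical `word` type and `sort.SliceStable` (a convenience variant of `sort.Stable`) rather than the library's own types.

```go
package main

import (
	"fmt"
	"sort"
)

// word is a hypothetical stand-in for an extracted text fragment.
type word struct {
	S string
	Y int
}

// sortByRow orders words top to bottom (larger Y first, as in PDF
// coordinates) while keeping the original order of words that share a row.
func sortByRow(ws []word) {
	sort.SliceStable(ws, func(i, j int) bool { return ws[i].Y > ws[j].Y })
}

func main() {
	ws := []word{{"world", 700}, {"Hello", 720}, {"again", 700}}
	sortByRow(ws)
	for _, w := range ws {
		fmt.Println(w.Y, w.S)
	}
}
```

An unstable sort is free to swap "world" and "again" because they compare equal on Y, which is one way text within a row can come out disordered.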

Fork status

Area	Status
Upstream sync	Merged through upstream@HEAD (2024)
Shift-JIS CMaps	Added
UCS-2 BE CMaps	Added
GBK / GB-EUC / GBKp-EUC CMaps	Added
Big5-ETen / ETenms CMaps	Added
UHC / KSC-EUC / UHC-HW CMaps	Added
Metadata API (`r.Info()`)	Added
Outline page numbers (`Outline.Page`)	Added
Context / cancellation	Added
Crash/CPU-spike on PDFs with inline images (upstream #57)	Fixed — `readHexString` EOF guard + `Interpret` inline-image skip
Upstream PRs incorporated	#37, #42, #45, #58, #61, #63, #64, #66

Resolved upstream issues

Issue	Title	How it was fixed	Status
#13	Load Reader from bytes instead of file path	`OpenBytes(src []byte)` added in `read.go`	Directly fixed
#16	GetTextByRow returns disordered text	`sort.Sort` → `sort.Stable` in `GetTextByRow`/`GetTextByColumn`	Directly fixed
#18	GetTextByRow X/Y always 0	`Td`/`TD`/`T*`/`TL` wired in `walkTextBlocks`; `BT` resets position; `currentTL` tracks leading	Directly fixed
#20	`%%EOF` search window too small; valid PDFs rejected	Expanded search window from 100 → 1024 bytes (with clamp for small files); added `findStartxrefFallback` reverse-scan for `%%EOF` placed further than 1024 bytes before end	Directly fixed
#21	unknown encoding UniGB-UCS2-H	Same fix as #55 — `ucs2BEEncoder` handles `UniGB-UCS2-H`	Directly fixed
#22	Handle space after header	Relaxed byte-8 check in `NewReaderEncrypted` to accept space/tab	Directly fixed
#27	GetTextByRow returns empty rows	`Td` in `walkTextBlocks` now updates `currentX`/`currentY` additively instead of emitting a spurious empty walker call; `TD` and `TL` wired; `T*` decrements Y by leading	Directly fixed
#30	crash when encountering some CJK text amongst English	`dictEncoder` rewrite; `maxObjectDepth` guard; `readArray` EOF fix	Directly fixed
#31	Expose page dimensions	`Page.MediaBox()` and `Page.CropBox()` added; both walk the page-tree inheritance chain; `CropBox` falls back to `MediaBox` when absent	Directly fixed
#44	Cannot read Chinese	GBK / Big5 / UniGB / UniCNS CMaps all wired in `getEncoder()`	Directly fixed
#48	`\n` added by recent version breaks old systems	Removed `showText("\n")` from `case "BT":` — BT is matrix-init, not line-break	Directly fixed
#55	GetPlainText do not support encoding "UniGB-UCS2-H"	`ucs2BEEncoder` wired for all 8 `Uni*-UCS2-H/V` CMap names	Directly fixed
#57	Crash when image is in there (malformed PNG)	`case "ID":` skip in `ps.go` `Interpret()`; `readHexString` EOF guard in `lex.go`	Directly fixed
#59	Streaming / range-over-func API for large PDFs	`(*Reader).Pages() iter.Seq2[int, Page]` and `(Page).Texts() iter.Seq[Text]` added; `Texts` merges same-style runs matching `GetStyledTexts` output	Directly fixed
#60	Parse PDF, some content appears garbled	Removed shared `fonts` map from `(Reader).GetPlainText`; each page now passes `nil` so `(Page).GetPlainText` builds a fresh per-page font map	Directly fixed
#67 / #26	Text in Form XObjects not extracted	`case "Do":` added to all three interpreter callbacks (`interpretPlain`, `interpretWalk`, `interpret`). Each handler looks up the named XObject in the current `/Resources/XObject` dict, confirms `Subtype /Form`, builds a child state with fonts re-resolved from the XObject's own `/Resources/Font`, and recursively calls `Interpret()`, merging the extracted text back into the parent output. Recursion depth is capped at 10 to guard against cycles.	Directly fixed
