eterps/pdf-struct

PDF::Extractor is a library that provides high-level access to the text objects of a PDF document.

It is a thin wrapper around the pdftohtml command and its '-xml' option, so pdftohtml must be in your PATH for the library to work.
The pdftohtml command was written by Mikhail Kruk (originally by Gueorgui Ovtcharov and Rainer Dorsch):

http://pdftohtml.sourceforge.net
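
As a rough sketch of what wrapping the '-xml' output involves, the Ruby below parses XML of the general shape pdftohtml emits (text nodes carrying top, left, width and height attributes) into simple objects. The TextElement struct, the flatten_text and parse_pdftohtml_xml helpers, and the sample XML are illustrative assumptions, not the library's actual internals.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the objects PDF::Extractor exposes.
TextElement = Struct.new(:page, :left, :top, :width, :height, :content)

# Concatenate every text node under `node`, so inline markup such as
# <b> or <i> inside a <text> element is flattened to plain text.
def flatten_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then flatten_text(child)
    else ''
    end
  end.join
end

# Turn pdftohtml-style XML into an array of TextElement.
def parse_pdftohtml_xml(xml)
  doc = REXML::Document.new(xml)
  elements = []
  doc.elements.each('pdf2xml/page') do |page|
    page_number = page.attributes['number'].to_i
    page.elements.each('text') do |text|
      elements << TextElement.new(
        page_number,
        text.attributes['left'].to_i,
        text.attributes['top'].to_i,
        text.attributes['width'].to_i,
        text.attributes['height'].to_i,
        flatten_text(text)
      )
    end
  end
  elements
end

# Hand-written sample so the sketch runs without the pdftohtml binary.
SAMPLE_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <pdf2xml>
    <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
      <text top="100" left="50" width="200" height="16" font="0">Hello <b>world</b></text>
      <text top="130" left="50" width="120" height="16" font="0">Second line</text>
    </page>
  </pdf2xml>
XML

parse_pdftohtml_xml(SAMPLE_XML).each do |element|
  puts "#{element.left}, #{element.top}\t'#{element.content}'"
end
```

In real use the XML string would come from running pdftohtml with '-xml' on a PDF file rather than from a literal.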

If you need direct (low-level) access to a PDF, use pdf-reader instead:

http://github.com/yob/pdf-reader/tree/master

Usage:

  document = PDF::Extractor.open('test.pdf')
  document.elements.each do |element|
    puts "#{element.left}, #{element.top}\t'#{element.content}'"
  end
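
Because each element carries its position, the elements can be regrouped into lines of text. The following is a minimal sketch, assuming left and top are numeric; the Element struct stands in for the objects the library yields, and lines_of_text and its tolerance parameter are hypothetical names introduced here for illustration.

```ruby
# Stand-in for the elements PDF::Extractor yields; only the three
# accessors used in the usage example above are assumed.
Element = Struct.new(:left, :top, :content)

# Group elements whose top coordinate lies within `tolerance` units of
# the first element of a line, then read each line left to right.
def lines_of_text(elements, tolerance: 2)
  lines = []
  elements.sort_by { |e| [e.top, e.left] }.each do |element|
    current = lines.last
    if current && (element.top - current.first.top).abs <= tolerance
      current << element
    else
      lines << [element]
    end
  end
  lines.map { |line| line.sort_by(&:left).map(&:content).join(' ') }
end

elements = [
  Element.new(200, 100, 'world'),
  Element.new(50, 131, 'Second'),
  Element.new(50, 101, 'Hello'),
  Element.new(120, 130, 'line')
]

puts lines_of_text(elements)
```

The tolerance absorbs small baseline differences between fragments on the same visual line; a suitable value depends on the document.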
