jhealy (author)
Tue Jan 01 04:07:47 -0800 2008
README
The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are currently supported. We welcome submission of PDF files that exhibit unsupported aspects of the spec to assist with improving our support.

= Installation

The recommended installation method is via Rubygems.

  gem install pdf-reader

= Usage

PDF::Reader is designed with a callback-style architecture. The basic concept is to build a receiver class and pass that into PDF::Reader along with the PDF to process.

As PDF::Reader walks the file and encounters various objects (pages, text, images, shapes, etc.) it will call methods on the receiver class. What those methods do is entirely up to you - save the text, extract images, count pages, read metadata, whatever.

For a full list of the supported callback methods and a description of when they will be called, refer to PDF::Reader::Content. See the code examples below for a way to print a list of all the callbacks generated by a file to STDOUT.

= Exceptions

There are two key exceptions that you will need to watch out for when processing a PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn't raise an exception, please forward a copy of the file to the maintainers and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.

= Maintainers

- Peter Jones <mailto:pjones@pmade.com>
- James Healy <mailto:jimmy@deefa.com>

= Examples

The easiest way to explain how this works in practice is to show some examples.

== Page Counter

A simple app to count the number of pages in a PDF file.

  require 'rubygems'
  require 'pdf/reader'

  class PageReceiver
    attr_accessor :page_count

    def initialize
      @page_count = 0
    end

    # Called when page parsing ends
    def end_page
      @page_count += 1
    end
  end

  receiver = PageReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  puts "#{receiver.page_count} pages"

== List all callbacks generated by a single PDF

WARNING: this will generate a *lot* of output, so you probably want to pipe it through less or to a text file.

  require 'rubygems'
  require 'pdf/reader'

  receiver = PDF::Reader::RegisterReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  receiver.callbacks.each do |cb|
    puts cb
  end

== Basic RSpec of a generated PDF

  require 'rubygems'
  require 'pdf/reader'
  require 'pdf/writer'
  require 'spec'

  class PageTextReceiver
    attr_accessor :content

    def initialize
      @content = []
    end

    # Called when page parsing starts
    def begin_page(arg = nil)
      @content << ""
    end

    def show_text(string, *params)
      @content.last << string.strip
    end

    # there are a few text callbacks, so make sure we process them all
    alias :super_show_text :show_text
    alias :move_to_next_line_and_show_text :show_text
    alias :set_spacing_next_line_show_text :show_text
  end

  context "My generated PDF" do
    specify "should have the correct text on 2 pages" do

      # generate our PDF
      pdf = PDF::Writer.new
      pdf.text "Chunky", :font_size => 32, :justification => :center
      pdf.start_new_page
      pdf.text "Bacon", :font_size => 32, :justification => :center
      pdf.save_as("chunkybacon.pdf")

      # process the PDF
      receiver = PageTextReceiver.new
      PDF::Reader.file("chunkybacon.pdf", receiver)

      # confirm the text appears on the correct pages
      receiver.content.size.should eql(2)
      receiver.content[0].should eql("Chunky")
      receiver.content[1].should eql("Bacon")
    end
  end

== Extract ISBNs

Parse all text in the requested PDF file and print out any valid book ISBNs. Requires the rbook-isbn gem.

  require 'rubygems'
  require 'pdf/reader'
  require 'rbook/isbn'

  class ISBNReceiver

    # there are a few text callbacks, so make sure we process them all
    def show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def super_show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def move_to_next_line_and_show_text(string)
      process_words(string.split(/\W+/))
    end

    def set_spacing_next_line_show_text(aw, ac, string)
      process_words(string.split(/\W+/))
    end

    private

    # check if any items in the supplied array are a valid ISBN, and print
    # any that are to the console
    def process_words(words)
      words.each do |word|
        word.strip!
        puts "#{RBook::ISBN.convert_to_isbn13(word)}" if RBook::ISBN.valid_isbn?(word)
      end
    end
  end

  receiver = ISBNReceiver.new
  PDF::Reader.file("somefile.pdf", receiver)

= Resources

- PDF::Reader Homepage: http://software.pmade.com/pdfreader
- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
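= Appendix: Callback Dispatch Sketch

The Usage section describes PDF::Reader calling methods on a receiver as it walks a file. The following is a minimal, self-contained sketch of that pattern and can be run without the pdf-reader gem. It is not PDF::Reader's implementation: ToyWalker, PageAndTextReceiver and the hard-coded event list are hypothetical names invented for illustration. It also assumes that callbacks the receiver does not define are simply skipped, which is what the PageReceiver example (defining only end_page) suggests.

```ruby
# Hypothetical stand-in for a parser. It replays a fixed list of events
# instead of reading a real PDF. Not part of PDF::Reader.
class ToyWalker
  def initialize(events)
    @events = events
  end

  # Send each event to the receiver, but only if the receiver defines a
  # method with that name. Unknown callbacks are silently skipped
  # (an assumption about how a callback-style parser might behave).
  def walk(receiver)
    @events.each do |name, *args|
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
    receiver
  end
end

# A receiver that cares about only two callbacks: end_page and show_text.
class PageAndTextReceiver
  attr_reader :page_count, :text

  def initialize
    @page_count = 0
    @text = []
  end

  def end_page
    @page_count += 1
  end

  def show_text(string, *params)
    @text << string.strip
  end
end

# Made-up event stream for a two-page document.
events = [
  [:begin_document],
  [:begin_page],
  [:show_text, "Chunky "],
  [:end_page],
  [:begin_page],
  [:show_text, " Bacon"],
  [:end_page],
  [:end_document]
]

receiver = ToyWalker.new(events).walk(PageAndTextReceiver.new)
puts "#{receiver.page_count} pages"
puts receiver.text.join(" ")
```

Running the sketch prints "2 pages" followed by "Chunky Bacon". The begin_document, begin_page and end_document events are ignored because the receiver does not define them, so a receiver only needs to implement the callbacks it is interested in.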








