Homepage: http://rubyforge.org/projects/pdf-reader
Clone URL: git://github.com/yob/pdf-reader.git
The PDF::Reader library implements a PDF parser conforming as much as possible
to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high
degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are
currently supported. We welcome submission of PDF files that exhibit
unsupported aspects of the spec to assist with improving our support.

= Installation

The recommended installation method is via Rubygems.

  gem install pdf-reader

= Usage

PDF::Reader is designed with a callback-style architecture. The basic concept
is to build a receiver class and pass that into PDF::Reader along with the PDF
to process. 

As PDF::Reader walks the file and encounters various objects (pages, text,
images, shapes, etc.) it calls the corresponding methods on the receiver. What
those methods do is entirely up to you: save the text, extract images, count
pages, read metadata, whatever.

For a full list of the supported callback methods and a description of when they
will be called, refer to PDF::Reader::Content. See the code examples below for a
way to print a list of all the callbacks generated by a file to STDOUT.
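
The callback flow described above can be sketched without a PDF at all. The
following is a minimal simulation, not the library's implementation: FakeWalker
and its fixed event list are hypothetical stand-ins for the parser, and
TextCollector is an illustrative receiver. Only the callback names begin_page,
show_text and end_page are taken from the examples in this README.

```ruby
# Hypothetical stand-in for the parser: replays a fixed list of events and
# forwards each one to the receiver, skipping callbacks it does not define.
class FakeWalker
  EVENTS = [
    [:begin_page, []],
    [:show_text,  ["Chunky"]],
    [:end_page,   []],
    [:begin_page, []],
    [:show_text,  ["Bacon"]],
    [:end_page,   []]
  ].freeze

  def walk(receiver)
    EVENTS.each do |name, args|
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
  end
end

# Illustrative receiver: collects the text of each page and ignores every
# other event (it has no end_page method, so that callback is skipped).
class TextCollector
  attr_reader :pages

  def initialize
    @pages = []
  end

  def begin_page(*)
    @pages << String.new
  end

  def show_text(string, *)
    @pages.last << string
  end
end

collector = TextCollector.new
FakeWalker.new.walk(collector)
puts collector.pages.inspect
```

The receiver only implements the callbacks it cares about, which is why a
receiver can stay as small as the page counter in the examples below.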

= Exceptions

There are two key exceptions that you will need to watch out for when processing a 
PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the 
file should be valid, or that a corrupt file didn't raise an exception, please 
forward a copy of the file to the maintainers and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently 
support. Again, we welcome submissions of PDF files that exhibit these features to help 
us with future code improvements.
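
One way to handle both exceptions is to wrap the processing call in a small
helper. This is a sketch under stated assumptions: process_pdf is a
hypothetical helper name, the exceptions are assumed to live in the PDF::Reader
namespace and to descend from RuntimeError, and the stand-in classes exist only
so that the sketch runs when the gem is not installed.

```ruby
begin
  require 'rubygems'
  require 'pdf/reader'
rescue LoadError
  # gem not installed; the stand-ins below are used instead
end

# Stand-ins (assumption): defined only when the real classes are missing.
unless defined?(PDF::Reader::MalformedPDFError)
  module PDF
    class Reader
      class MalformedPDFError < RuntimeError; end
      class UnsupportedFeatureError < RuntimeError; end
    end
  end
end

# Hypothetical helper: runs the block and maps each exception to a status.
def process_pdf(path)
  yield path
  :ok
rescue PDF::Reader::MalformedPDFError => e
  warn "#{path}: appears to be corrupt (#{e.message})"
  :malformed
rescue PDF::Reader::UnsupportedFeatureError => e
  warn "#{path}: uses an unsupported feature (#{e.message})"
  :unsupported
end

status = process_pdf("somefile.pdf") do |path|
  # In real use this would be: PDF::Reader.file(path, receiver)
  raise PDF::Reader::MalformedPDFError, "simulated failure"
end
puts status
```

Returning a status instead of re-raising lets a batch job keep going and set
aside the files worth forwarding to the maintainers.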

= Maintainers

- Peter Jones <mailto:pjones@pmade.com>
- James Healy <mailto:jimmy@deefa.com>

= Examples

The easiest way to explain how this works in practice is to show some examples.

== Page Counter

A simple app to count the number of pages in a PDF file.

  require 'rubygems'
  require 'pdf/reader'

  class PageReceiver
    attr_accessor :page_count

    def initialize
      @page_count = 0 
    end
    
    # Called when page parsing ends
    def end_page
      @page_count += 1
    end
  end

  receiver = PageReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  puts "#{receiver.page_count} pages"

== List all callbacks generated by a single PDF

WARNING: this will generate a *lot* of output, so you probably want to pipe
it through less or to a text file.
  
  require 'rubygems'
  require 'pdf/reader'

  receiver = PDF::Reader::RegisterReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  receiver.callbacks.each do |cb|
    puts cb
  end
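
Given how much output the list produces, a count per callback name is often
easier to read than the raw list. The sketch below assumes each recorded
callback is a hash with :name and :args keys; check that against the
RegisterReceiver in your installed version. RecordingReceiver is a hypothetical
stand-in that records every call, so the sketch runs without the gem or a PDF.

```ruby
# Hypothetical stand-in for a registering receiver: accepts any callback and
# stores its name and arguments (the hash layout is an assumption).
class RecordingReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end

  def method_missing(name, *args)
    @callbacks << { :name => name, :args => args }
    nil
  end
end

recorder = RecordingReceiver.new
recorder.begin_page
recorder.show_text("Chunky")
recorder.end_page

# Tally how often each callback was recorded.
counts = Hash.new(0)
recorder.callbacks.each { |cb| counts[cb[:name]] += 1 }
counts.each { |name, total| puts "#{name}: #{total}" }
```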

== Basic RSpec of a generated PDF 

  require 'rubygems'
  require 'pdf/reader'
  require 'pdf/writer'
  require 'spec'

  class PageTextReceiver
    attr_accessor :content

    def initialize
      @content = []
    end

    # Called when page parsing starts
    def begin_page(arg = nil)
      @content << ""
    end

    def show_text(string, *params)
      @content.last << string.strip
    end

    # there are a few text callbacks, so make sure we process them all
    alias :super_show_text :show_text
    alias :move_to_next_line_and_show_text :show_text
    alias :set_spacing_next_line_show_text :show_text

  end

  context "My generated PDF" do
    specify "should have the correct text on 2 pages" do

      # generate our PDF
      pdf = PDF::Writer.new
      pdf.text "Chunky", :font_size => 32, :justification => :center
      pdf.start_new_page
      pdf.text "Bacon", :font_size => 32, :justification => :center
      pdf.save_as("chunkybacon.pdf")

      # process the PDF
      receiver = PageTextReceiver.new
      PDF::Reader.file("chunkybacon.pdf", receiver)

      # confirm the text appears on the correct pages
      receiver.content.size.should eql(2)
      receiver.content[0].should eql("Chunky")
      receiver.content[1].should eql("Bacon")
    end
  end

== Extract ISBNs

Parse all text in the requested PDF file and print out any valid book ISBNs.
Requires the rbook-isbn gem.

  require 'rubygems'
  require 'pdf/reader'
  require 'rbook/isbn'

  class ISBNReceiver

    # there are a few text callbacks, so make sure we process them all
    def show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def super_show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def move_to_next_line_and_show_text(string)
      process_words(string.split(/\W+/))
    end

    def set_spacing_next_line_show_text(aw, ac, string)
      process_words(string.split(/\W+/))
    end

    private

    # check whether any of the supplied words is a valid ISBN, and print each
    # match to the console as an ISBN-13
    def process_words(words)
      words.each do |word|
        word.strip!
        puts RBook::ISBN.convert_to_isbn13(word) if RBook::ISBN.valid_isbn?(word)
      end
    end
  end

  receiver = ISBNReceiver.new
  PDF::Reader.file("somefile.pdf", receiver)


= Resources

- PDF::Reader Homepage: http://software.pmade.com/pdfreader
- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html