Using the code detailed in:
http://github.com/yob/pdf-reader/blob/master/examples/text.rb
This file:
http://dl.dropbox.com/u/175905/test1.pdf
Will hang when calling PDF::Reader.file. The file is unencrypted, has no password, and is PDF v1.6. Stack trace when aborting the hang with CTRL+C:
^C/usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `match': Interrupt
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `from_buffer'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/parser.rb:46:in `parse_token'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:357:in `content_stream'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:314:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:284:in `document'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:136:in `parse'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:76:in `file'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `open'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `file'
Thanks for the detailed report and sample file - I can confirm the issue is reproducible on my system. I'll have a crack at tracking down the trigger.
Ah. It turns out not to be a freeze, just horribly inefficient code. On my 2.4 GHz Core 2 laptop the file parses in 3 minutes and 40 seconds. I'll look into profiling and optimising the code somewhat.
If that's something you're familiar with, feel free to have a crack at it yourself. It's not an area I'm overly familiar with myself.
I'll have a look if I can - did you find which particular part was causing the performance issue?
I ran ruby-prof while extracting text from the test file and got https://gist.github.com/47430d6440c8e7fdcd53. Looks like Regexp#match is the primary issue. I suspect (but I'm not 100% sure) that Reference#from_buffer is to blame. It's called a lot from Parser#parse_token.
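For anyone who wants to poke at this without setting up ruby-prof, here is a rough way to time a suspect Regexp#match call in isolation using only the standard library. This is a sketch, not code from pdf-reader: `REF_PATTERN`, `time_matches`, and the sample buffer are all made up for illustration, and the real pattern at reference.rb:35 may look quite different.

```ruby
require 'benchmark'

# Hypothetical pattern for an indirect reference such as "12 0 R".
# Assumption: the real pattern in Reference.from_buffer may differ.
REF_PATTERN = /\A(\d+)\s+(\d+)\s+R/

# Runs `pattern` against `buffer` `iterations` times and returns
# [number_of_matches, elapsed_seconds].
def time_matches(pattern, buffer, iterations)
  hits = 0
  elapsed = Benchmark.realtime do
    iterations.times { hits += 1 if pattern.match(buffer) }
  end
  [hits, elapsed]
end

# Made-up content stream: lots of operands, no indirect references.
sample = "0.5 0 0 0.5 10 20 cm " * 10_000

hits, seconds = time_matches(REF_PATTERN, sample, 1_000)
puts "#{hits} matches in #{format('%.4f', seconds)}s"
```

Swapping in the actual pattern and a chunk of a real content stream should give a feel for how much a single call costs before multiplying by the number of tokens in the file.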
Update: I've got an experimental branch that completely re-implements the buffer/tokeniser. Parsing the text from your test file dropped from 3:40 to 9 seconds. I'm still fixing tokenising in a few corner cases, but keep an eye out for the changes being merged into master soon.
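To give a flavour of what a scanner-based tokeniser can look like, here is a minimal sketch built on StringScanner from the standard library. It is not the code in the experimental branch: the `tokenise` method, the token format, and the patterns are assumptions made for illustration, and it ignores most of the PDF syntax (strings, dictionaries, arrays, comments). The point is only that each `scan` call is anchored at the current position and advances it, so the input is walked once.

```ruby
require 'strscan'

# Hypothetical tokeniser sketch, not the pdf-reader implementation.
# Returns an array of [:ref, id, gen] and [:token, string] entries.
def tokenise(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)            # drop leading whitespace
    break if scanner.eos?
    if scanner.scan(/(\d+)\s+(\d+)\s+R\b/)
      # Indirect reference such as "12 0 R"
      tokens << [:ref, scanner[1].to_i, scanner[2].to_i]
    elsif (tok = scanner.scan(/\S+/))
      # Anything else: take the run of non-whitespace characters
      tokens << [:token, tok]
    end
  end
  tokens
end

p tokenise("12 0 R /F1 9 Tf")
```

A real tokeniser needs to handle delimiters such as `(`, `<<`, and `[` rather than splitting on whitespace alone, which is where most of the corner cases come from.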
This is now fixed in a series of commits released in the 0.8.2 gem.
Brilliant, thanks!