public
Description: The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
Homepage: http://rubyforge.org/projects/pdf-reader
Clone URL: git://github.com/yob/pdf-reader.git
pdf-reader / TODO
100644 48 lines (38 sloc) 2.308 kb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
v0.8
- optimise PDF::Reader::Reference#from_buffer
  - ruby-prof shows the match() call in this function is a real killer
- add extra callbacks
  - list implemented features
    - encrypted? tagged? bookmarks? annotated? optimised?
- Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
  - bookmarks?
  - outline?
  - articles?
  - viewer prefs?
- Don't remove comment when tokenising in the middle of a string
- Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
  poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
  from the Original encoding to Unicode.
- detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
- Improve interpretation of non content stream data (ie metadata). recognise dates, etc
- Support Cross Reference Streams (spec 3.4.7)
- Fix inheritance of page attributes. Resources has been done, but plenty of other attributes
  are inheritable. See table 3.2.7 in the spec
 
v0.9
- Add a way to extract raster images
  - see XObjects section of spec (section 4.7)
- Add a way to extract font data?
 
Sometime
- Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
  - Will require significantly improved handling of CMaps, including creating a bunch of predefined ones
 
- Work out why specs/data/zlib*.pdf isn't parsed correctly when all the major PDF viewers can display it correctly
 
- Ship some extra receivers in the standard package, particuarly ones that are useful for running
  rspec over generated PDF files
 
- When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
  sensible way to convert them to unicode
 
- Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
 
- Add support for additional encodings:
  - PDFDocEncoding
  - Identity-V(I *think* this relates to vertical text. Not sure how we'd support it sensibly)
 
- Investigate how R->L text is handled
 
- Add support for object streams (spec section 3.4.6)