v3.0.0 Encrypted document support, major text extraction updates
New Features
- Added support for encrypted PDFs
- New text grouping algorithm: text fragments whose vertical extents mostly overlap are treated as part of the same line. This fixes subscript and superscript extraction issues.
- Resolved several transformation matrix issues that caused incorrect text extraction and ordering
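
To illustrate the overlap-based grouping idea, here is a minimal sketch in Python. It is not the library's PHP implementation: the `Fragment` type, the `group_into_lines` function, and the reading of "majority" as "more than half of the shorter fragment's height" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment: its text, x position and vertical extent."""
    text: str
    x: float
    bottom: float
    top: float


def overlap_ratio(a: Fragment, b: Fragment) -> float:
    """Vertical overlap as a fraction of the shorter fragment's height."""
    overlap = min(a.top, b.top) - max(a.bottom, b.bottom)
    shorter = min(a.top - a.bottom, b.top - b.bottom)
    if shorter <= 0:
        return 0.0
    return max(overlap, 0.0) / shorter


def group_into_lines(fragments: list[Fragment], threshold: float = 0.5) -> list[str]:
    lines: list[list[Fragment]] = []
    # Visit fragments top to bottom (PDF user space: y grows upward).
    for frag in sorted(fragments, key=lambda f: -f.top):
        for line in lines:
            # Assumption: the first fragment of a line acts as its representative.
            if overlap_ratio(line[0], frag) > threshold:
                line.append(frag)
                break
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return ["".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


fragments = [
    Fragment("H", x=0, bottom=100, top=112),
    Fragment("2", x=8, bottom=97, top=104),   # subscript, shifted downward
    Fragment("O", x=13, bottom=100, top=112),
    Fragment("next", x=0, bottom=80, top=92),
]
print(group_into_lines(fragments))
```

In this sketch the subscript `2` overlaps the baseline text by more than half of its own height, so it stays on the same line as `H` and `O` instead of being split into a line of its own.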
💖 Sponsorship
If you depend on this package and want to support its maintenance, please consider sponsoring me. I'll continue maintaining and releasing updates regardless, but sponsorships help cover the time it takes to review changes and keep everything accurate.
Other changes
- Fix scale operand not accepting a trailing 0 by @szepeviktor in #303
- Dictionary entry for key Order can be of type ReferenceValueArray by @PrinsFrank in #316
- Value for dictionaryKey AP can be a single dictionary by @PrinsFrank in #319
- Automatically resolve values from subdictionaries when expected value type is not a dictionary by @PrinsFrank in #321
- Simplify type checks XObject by @PrinsFrank in #322
- Implement value retrieval from ancestor nodes in page tree for inheritable properties by @PrinsFrank in #323
- Automatically resolve references in dictionary entries when retrieving values by @PrinsFrank in #325
- feat(rectangle): add width and height helpers by @vitormattos in #317
- Fix invalid section reference for file encryption key calculation by @PrinsFrank in #327
- User password entry length should always be 32 regardless of security handler revision by @PrinsFrank in #328
- Add file encryption key to metadata for samples by @PrinsFrank in #329
- Enable support for encrypted documents by @PrinsFrank in #282
- Add sample with user/owner password by @PrinsFrank in #332
- Add information about debugging file encryption to CONTRIBUTING.md by @PrinsFrank in #333
- Add support for all escape sequences in literal strings by @PrinsFrank in #334
- Support octal escapes with one or two digits (in addition to three) in string literals by @PrinsFrank in #335
- Clean up decoding of string literals and hex strings in EncryptDictionary and use getText instead by @PrinsFrank in #336
- Fix improper handling of hex encoded binary strings in password entries by @PrinsFrank in #337
- Update minimum required PHP version to 8.2 by @PrinsFrank in #338
- Switch from readonly properties to readonly classes wherever possible by @PrinsFrank in #339
- Check file encryption key for samples by @PrinsFrank in #330
- Add upgrade guide for v3 by @PrinsFrank in #340
- Document argument for getValueForKey on dictionary is now required by @PrinsFrank in #341
- Recover userPassword from ownerPassword to also add support for ownerPasswords by @PrinsFrank in #343
- Cache calculated file encryption key on document by @PrinsFrank in #344
- Fix newly discovered PHPStan issue by @PrinsFrank in #346
- Update sponsorship section in README by @PrinsFrank in #345
- Properly parse hex strings by @PrinsFrank in #348
- Decrypt dictionary entries while parsing dictionaries in encrypted documents by @PrinsFrank in #347
- Decrypt content of compressed objects before parsing by @PrinsFrank in #349
- Replace escaped characters in encrypted strings before running decryption by @PrinsFrank in #350
- Check dictionary and page content for encrypted documents by @PrinsFrank in #342
- Add missing PNG predictor algorithms by @PrinsFrank in #351
- Flate decode columns should be multiplied by colors if present by @PrinsFrank in #352
- Ignore "endobj" markers inside streams and search for the marker only after the length given in the stream dictionary, allowing proper embedded PDF support by @PrinsFrank in #296
- The resource dictionary is now inherited by @PrinsFrank in #353
- Add sample with different font sizes by @PrinsFrank in #354
- Abstract line grouping strategy to make it replaceable by @PrinsFrank in #355
- Fix incorrect matrix multiplication in Move and MoveOffsetLeading operators causing scrambled text by @PrinsFrank in #356
- Apply transformation for NEXT_LINE Text positioning operator by @PrinsFrank in #358
- Add new overlap grouping strategy for text by @PrinsFrank in #357
- Fix initial text state not being set and appended/restored from stack resulting in lost textObjects by @PrinsFrank in #359
- Added sample file for #272 by @k00ni in #273
- Fix issues with operators that interact with both text state and transformation matrix by @PrinsFrank in #360
- Fix incorrect inverse matrix multiplication in graphicsStateOperator by @PrinsFrank in #361
- Handle text extraction with inverted Y-axis by @PrinsFrank in #362
- Use LineFeed as default page separator when extracting text for multiple pages by @PrinsFrank in #363
- Add sample from issue #290 by @PrinsFrank in #364
- Properly support encrypted documents in sample generation by @PrinsFrank in #365
- Move CONTRIBUTING.md to root of project by @PrinsFrank in #366
- FontReference can be any non-whitespace character by @PrinsFrank in #368
- Add benchmark comparison image to README.md by @PrinsFrank in #369
- Don't traverse loop nodes in page trees by @PrinsFrank in #370
- Support PAGE objects without CONTENTS by @PrinsFrank in #372
- Support NameValues in toUnicodeCMap dictionary entries for font objects by @PrinsFrank in #373
- Fix reference value array parsing when the number of items is divisible by 3 but the items are not references by @PrinsFrank in #374
- Gracefully handle empty streams by @PrinsFrank in #375
- Gracefully handle only newlines between stream markers by @PrinsFrank in #376
- Universal reference value support now that auto resolving of references is implemented by @PrinsFrank in #377
- Handle empty crossReference types by @PrinsFrank in #378
- Add support for ASCIIHexDecode by @PrinsFrank in #380
- Properly handle object content that is not surrounded by newlines by @PrinsFrank in #379
- Encoding can be name values by @PrinsFrank in #381
- Gracefully handle multiple end operators for text objects by @PrinsFrank in #382
- Extend auto resolving of reference value to nameValues and dictionaries by @PrinsFrank in #383
- Update description in composer.json by @PrinsFrank in #384
- Fix parsing of comments in content streams by @PrinsFrank in #385
- Parsing ReferenceArrayValues in ArrayValues should preserve outside brackets by @PrinsFrank in #386
- Fix issues in retrieval of object content for uncompressed objects with content on the same line as start of object marker by @PrinsFrank in #387
- Ignore comments before text objects by @PrinsFrank in #388
- Auto resolve Reference value arrays by @PrinsFrank in #389
- Add sample from issue #215 by @PrinsFrank in #390
- CIDFontWidths can be empty by @PrinsFrank in #391
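
Several of the entries above concern escape sequences in literal strings, including octal escapes of one to three digits. As a rough illustration of those rules as the PDF specification describes them, here is a short Python sketch. The function name `decode_literal_string` is hypothetical, the input is a `str` rather than raw bytes for simplicity, and none of this is taken from the library's PHP code.

```python
# Single-character escapes defined for PDF literal strings.
ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
    "(": "(", ")": ")", "\\": "\\",
}


def decode_literal_string(body: str) -> str:
    """Decode the body of a literal string (outer parentheses already removed)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # trailing backslash: nothing left to escape
        nxt = body[i]
        if nxt in "01234567":
            # Octal escape: consume one to three octal digits.
            j = i
            while j < len(body) and j - i < 3 and body[j] in "01234567":
                j += 1
            # Keep only the low byte; high-order overflow is ignored.
            out.append(chr(int(body[i:j], 8) & 0xFF))
            i = j
        elif nxt in "\r\n":
            # Backslash followed by an end-of-line: line continuation, emits nothing.
            i += 2 if body[i:i + 2] == "\r\n" else 1
        else:
            # Known escape, or an unknown one where the backslash is dropped.
            out.append(ESCAPES.get(nxt, nxt))
            i += 1
    return "".join(out)


print(decode_literal_string(r"\101\53\7"))
```

Because at most three digits are consumed, `\0053` decodes to the byte `0x05` followed by the character `3`, while `\53` decodes to a single `+`.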
New Contributors
- @vitormattos made their first contribution in #317
Full Changelog: v2.8.0...v3.0.0