TemplateProcessor with PclZip creates duplicate XML files in the docx archive, making MS Word unable to open it #1617

@ahiijny

I think this is the root cause behind #1242.

In short, TemplateProcessor with PclZip produces malformed docx files that open fine in LibreOffice but fail to open in Microsoft Word, which reports "The file cannot be opened because there are problems with the contents" and requires a repair before it will open the file. The cause appears to be duplicate XML files in the archive.

Here are two files produced in exactly the same way, except that one was built with PclZip and the other with ZipArchive. Note the size difference (37967 bytes with PclZip versus 32713 bytes with ZipArchive).

[Screenshot from 2019-04-18 12-45-46: the two output files and their sizes]

zipinfo shows that the PclZip version has two instances of word/document.xml, unlike the ZipArchive version, which only has one.

Archive:  env-PclZip-2019-04-18-124542.docx
Zip file size: 37967 bytes, number of entries: 14
-rw----     4.5 fat     1483 b- defS 80-Jan-01 00:00 [Content_Types].xml
-rw----     4.5 fat      735 b- defS 80-Jan-01 00:00 _rels/.rels
-rw----     4.5 fat      979 b- defS 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw----     4.5 fat    16390 b- defS 80-Jan-01 00:00 word/document.xml
-rw----     4.5 fat     6797 b- defS 80-Jan-01 00:00 word/theme/theme1.xml
-rw----     4.5 fat    15556 b- stor 80-Jan-01 00:00 docProps/thumbnail.jpeg
-rw----     4.5 fat     2803 b- defS 80-Jan-01 00:00 word/settings.xml
-rw----     4.5 fat     7681 b- defS 80-Jan-01 00:00 word/printerSettings/printerSettings1.bin
-rw----     4.5 fat      529 b- defS 80-Jan-01 00:00 word/webSettings.xml
-rw----     4.5 fat      751 b- defS 80-Jan-01 00:00 docProps/core.xml
-rw----     4.5 fat    29435 b- defS 80-Jan-01 00:00 word/styles.xml
-rw----     4.5 fat     1525 b- defS 80-Jan-01 00:00 word/fontTable.xml
-rw----     4.5 fat      966 b- defS 80-Jan-01 00:00 docProps/app.xml
-rw----     2.0 fat    14230 b- defN 19-Apr-18 12:45 word/document.xml
14 files, 99860 bytes uncompressed, 34489 bytes compressed:  65.5%

Archive:  env-ZipArchive-2019-04-18-124536.docx
Zip file size: 32713 bytes, number of entries: 13
-rw----     4.5 fat     1483 b- defN 80-Jan-01 00:00 [Content_Types].xml
-rw----     4.5 fat      735 b- defN 80-Jan-01 00:00 _rels/.rels
-rw----     4.5 fat      979 b- defN 80-Jan-01 00:00 word/_rels/document.xml.rels
-rw----     4.5 fat     6797 b- defN 80-Jan-01 00:00 word/theme/theme1.xml
-rw----     4.5 fat    15556 b- stor 80-Jan-01 00:00 docProps/thumbnail.jpeg
-rw----     4.5 fat     2803 b- defN 80-Jan-01 00:00 word/settings.xml
-rw----     4.5 fat     7681 b- defN 80-Jan-01 00:00 word/printerSettings/printerSettings1.bin
-rw----     4.5 fat      529 b- defN 80-Jan-01 00:00 word/webSettings.xml
-rw----     4.5 fat      751 b- defN 80-Jan-01 00:00 docProps/core.xml
-rw----     4.5 fat    29435 b- defN 80-Jan-01 00:00 word/styles.xml
-rw----     4.5 fat     1525 b- defN 80-Jan-01 00:00 word/fontTable.xml
-rw----     4.5 fat      966 b- defN 80-Jan-01 00:00 docProps/app.xml
-rw-rw-rw-  2.0 unx    14230 b- defN 19-Apr-18 12:45 word/document.xml
13 files, 83470 bytes uncompressed, 29345 bytes compressed:  64.8%
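
The same check can be scripted instead of reading zipinfo output by eye. The following is a minimal sketch, assuming PHP's zip extension is installed and that ZipArchive exposes every central-directory entry, including repeated names. `findDuplicateEntryNames` and `listEntryNames` are hypothetical helper names used only for illustration; they are not part of PHPWord or PclZip.

```php
<?php
declare(strict_types=1);

/**
 * Return the entry names that occur more than once, mapped to their counts.
 * Hypothetical helper for illustration; not part of PHPWord or PclZip.
 */
function findDuplicateEntryNames(array $names): array
{
    // array_count_values() tallies how often each name appears.
    $counts = array_count_values($names);

    // Keep only the names that appear at least twice.
    return array_filter($counts, static function (int $count): bool {
        return $count > 1;
    });
}

/**
 * Read all entry names from an archive (assumes ext-zip is available).
 */
function listEntryNames(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open archive: $path");
    }

    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false) {
            $names[] = $name;
        }
    }
    $zip->close();

    return $names;
}
```

Calling `findDuplicateEntryNames(listEntryNames($path))` on a generated file returns an empty array when every item name is unique. Given the names in the PclZip listing above, where word/document.xml appears twice, the first helper reports that name with a count of 2.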

How to Reproduce

use PhpOffice\PhpWord\TemplateProcessor;
use PhpOffice\PhpWord\Settings;

// Local path helpers, relative to this script's directory.
function storage_path($fileName) {
    return __DIR__ . "/../storage/" . $fileName;
}

function resource_path($fileName) {
    return __DIR__ . "/../resources/" . $fileName;
}

// Open the template and save it unchanged, using the given zip backend.
function build($zipClass, $template, $outdir)
{
    assert($zipClass == Settings::ZIPARCHIVE || $zipClass == Settings::PCLZIP);

    Settings::setZipClass($zipClass);
    $builder = new TemplateProcessor($template);

    // rtrim() avoids a double slash when $outdir already ends in "/".
    $path = rtrim($outdir, '/') . '/env-' . $zipClass . '-' . date('Y-m-d-His') . '.docx';
    $builder->saveAs($path);

    return $path;
}

echo build(Settings::PCLZIP, resource_path('views/docx/EnvelopeTemplate_narrow.docx'), storage_path('envelopes/')) . "\n";

Example file (from #1242): EnvelopeTemplate_narrow.docx

Swap out Settings::PCLZIP for Settings::ZIPARCHIVE on the last line as needed.

Details

To recap, PHPWord's TemplateProcessor uses a ZipArchive wrapper that delegates zip operations to either PclZip or PHP's native ZipArchive behind the scenes. PHPWord seems to use ZipArchive by default (as configured in Settings::getZipClass).

When TemplateProcessor does its replacements, it extracts various XML files from the zip file, modifies these XML files, and then adds them back to the archive. PclZip yields an archive with duplicate XML files. ZipArchive does not.

Duplicate XML files in the resulting docx file violate OOXML file format rule M3.3: "Package implementers shall create item names that are unique within a given archive". LibreOffice probably opens the file anyway because it is more lenient.

The key difference between PclZip and ZipArchive seems to be in the addFromString behaviour:

  • When adding to a zip archive, ZipArchive overwrites existing files:

    Note that this function overwrites existing files of the same name.

  • However, PclZip does not:

    If a file already exist in an archive it is added at the end of the archive, but not automatically replaced.
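
To make the difference concrete, here is a toy model of the two behaviours quoted above. It is only a sketch: the archive is modelled as a plain PHP array, and `addAppending` and `addOverwriting` are hypothetical names that do not call ZipArchive or PclZip.

```php
<?php
declare(strict_types=1);

// Toy model of a zip archive: an ordered list of [name, contents] pairs.
// It only illustrates the two documented behaviours; no real zip I/O happens.

/** Append-only add, as the PclZip documentation describes. */
function addAppending(array $entries, string $name, string $contents): array
{
    $entries[] = [$name, $contents];
    return $entries;
}

/** Overwriting add, as the ZipArchive::addFromString documentation describes. */
function addOverwriting(array $entries, string $name, string $contents): array
{
    // Drop any existing entry with the same name, then append the new one.
    $kept = array_values(array_filter(
        $entries,
        static function (array $entry) use ($name): bool {
            return $entry[0] !== $name;
        }
    ));
    $kept[] = [$name, $contents];
    return $kept;
}

// A template archive that already holds word/document.xml.
$template = [
    ['[Content_Types].xml', '<Types/>'],
    ['word/document.xml', '<original/>'],
];

$appended = addAppending($template, 'word/document.xml', '<modified/>');
$overwritten = addOverwriting($template, 'word/document.xml', '<modified/>');

echo count($appended), " entries when appending\n";      // 3
echo count($overwritten), " entries when overwriting\n";  // 2
```

In this model, removing the old entry before an appending add gives the same result as the overwriting add. That suggests one possible workaround for the PclZip path: delete the existing entry first (PclZip has a delete method that accepts PCLZIP_OPT_BY_NAME) and then add the modified file. This is an untested suggestion, not a confirmed fix.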

Context

  • PHP version: 7.2.15
  • PHPWord version: 0.16
