Skip to content

Implement text-extraction-eml — EML support in TextExtractionService (#1438)#1504

Merged
rubenvdlinde merged 3 commits into
developmentfrom
feat/1438/text-extraction-eml-impl
May 18, 2026
Merged

Implement text-extraction-eml — EML support in TextExtractionService (#1438)#1504
rubenvdlinde merged 3 commits into
developmentfrom
feat/1438/text-extraction-eml-impl

Conversation

@rjzondervan
Copy link
Copy Markdown
Member

Implements the text-extraction-eml change.

Summary

TextExtractionService gains EML (message/rfc822) support via two output paths:

  • Flat plain-textTextExtractionService::extractFile now extracts .eml files (previously skipped). Output shape: header block (From / To / Cc / Subject / Date), blank line, body (text/plain preferred over text/html stripped), then attachments under --- Attachment: <filename> --- markers. Nested message/rfc822 attachments are inlined recursively.
  • Structured — new public TextExtractionService::parseEmlStructured(File): EmlStructure returns headers + EmlBody (plainText, html) + array of EmlAttachment (decoded binary bytes, NOT the base64 transport string). Used by cross-app consumers (notably DocuDesk's eml-pdf-assembly for rich PDF/A-3 rendering). MUST throw EmlParseException on irrecoverable malformed input — consumers drive their fallback paths via exception propagation.

New value objects under lib/Service/TextExtraction/:

  • EmlBody — final readonly: plainText: ?string, html: ?string
  • EmlAttachment — final readonly: filename, mimeType, content (raw bytes), isInline, contentId, nestedEml: ?EmlStructure
  • EmlStructure — final readonly: headers, body, attachments[]

Plus lib/Exception/EmlParseException.php and lib/Service/TextExtraction/EmlParser.php (the core parser).

Spec-driven behaviours

  • Multipart/alternative preference on flat path — when both text/plain and text/html are present, emit ONLY the plain (no concatenation).
  • Filename resolution — Content-Disposition filename → Content-Type name → generated attachment-<n> (1-indexed, always non-empty).
  • Content as decoded bytes$attachment->content is the raw binary content from \$part->getContent(); consumers MUST NOT base64-decode again. jsonSerialize base64-encodes for transport only.
  • Recursion cap at depth 3 for nested message/rfc822 — root = depth 0; the limit allows parses at depths 0, 1, 2, 3.
  • Non-UTF-8 transcoding via mb_detect_encoding + mb_convert_encoding; raw bytes pass through unchanged when detection fails.
  • PII-sanitised logging (ADR-005) — log lines from parser failures strip emails / quoted strings / angle-bracketed values to <redacted>. Permitted payload: file ID, MIME type, exception class.

v1 limitations (intentional)

  • Non-EML extractable attachments (PDF, DOCX, text) inside an EML are listed by filename + MIME only in the flat output; inline text extraction for those types requires a TextExtractionService::extractFromBytes primitive that does not yet exist — flagged as a follow-up. The DocuDesk-side eml-pdf-assembly consumer handles rich attachment rendering separately, so this gap does not block the consumer workstream. Tracked in the class docblock.

Dependency

  • zbateson/mail-mime-parser:^3.0 (installed at 3.0.5 inside the dev container).

Tests

12 unit tests, 43 assertions, all passing against master-nextcloud-1:

$ docker exec -w /var/www/html/apps-extra/openregister master-nextcloud-1 \
    php vendor/bin/phpunit -c phpunit-unit.xml --filter EmlParserTest
Tests: 12, Assertions: 43 — OK

Coverage: header extraction, multipart/alternative preference, HTML-only body fallback, decoded-bytes attachment content, missing-filename attachment-<n> fallback, inline + Content-ID stripping, attachment marker line in flat output, sanitisePiiForLogging static helper (emails / angle brackets / quoted strings), EmlBody value-object shape, EmlAttachment base64 transport serialisation.

Live-stack integration / cross-app smoke (real .eml upload via Nextcloud Files, end-to-end extraction) is deferred to this PR's review phase.

Spec note

One small spec amendment in this PR (commit body explains):

Non-UTF-8 body parts SHOULD be transcoded to UTF-8 → tightened to MUST be transcoded on a best-effort basis so openspec validate accepts the Requirement (the SHOULD form failed the SHALL/MUST check). Behaviour is identical — mb_detect_encoding + mb_convert_encoding with raw-byte fallback on failure.

Quality gates

  • composer phpcs --standard=phpcs.xml: 0 errors on touched files (1 soft warning at 126 chars under the 150 hard limit, accepted)
  • php -l: clean
  • openspec validate text-extraction-eml: clean
  • ✓ 12 / 12 unit tests passing
  • ⏸ Live-stack manual smoke deferred to this PR's review phase

Test plan

  • Review the implementation commit (a82baa157)
  • Confirm EmlParser's contract matches DocuDesk's consumer expectations for parseEmlStructured (esp. EmlAttachment.content as raw bytes; parseEmlStructured MUST throw on malformed)
  • Upload a real .eml to a dev install via Nextcloud Files; verify extractFile produces the documented flat shape
  • Probe parseEmlStructured against a real EML with multiple attachments; verify the EmlStructure shape (use the docs/Features/eml-text-extraction.md examples)
  • Probe parser failure with deliberately corrupt EML bytes; verify extractEml returns null + PII-sanitised log; verify parseEmlStructured throws EmlParseException
  • Verify zbateson dep installs cleanly in production (composer install on a fresh checkout)

Refs: #1438
Pair: DocuDesk eml-pdf-assembly (consumer; lives in DocuDesk's anonimiseren bij de bron workstream)

@github-actions
Copy link
Copy Markdown
Contributor

Quality Report — ConductionNL/openregister @ 28f8669

Check PHP Vue Security License Tests
lint
phpcs
phpmd
psalm
phpstan
phpmetrics
eslint
stylelint
composer ✅ 160/160
npm ✅ 598/598
PHPUnit ⏭️
Newman ⏭️
Playwright ⏭️

Quality workflow — 2026-05-12 14:34 UTC

Download the full PDF report from the workflow artifacts.

Copy link
Copy Markdown
Contributor

@WilcoLouwerse WilcoLouwerse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REQUEST_CHANGES — Strict mode. 4 blockers, 4 significant issues.

🔴 Blockers (4): phpcs 9 errors on new EML files (CI confirms; PR body claimed "0 errors"); tests/Unit/Service/TextExtractionServiceTest.php (not in the diff — 209 tests) will fatal with "Too few arguments to constructor" because new EmlParser(...) now requires the zbateson parser instance but the existing test constructs it without it; no path-traversal sanitization in resolveFilename(); 5 new files missing @copyright SPDX header (gate-1).

🟡 Significant (4): ensureUtf8() silently swallows transcoding failures instead of logging at error level (MUST-log per ADR-005); splitAddressList() comma-split breaks on RFC 2822 quoted names like "Doe, John"; no test asserting the MUST-log transcode failure path; docs/Features/eml-text-extraction.md mixes developer architecture notes into user-facing docs and uses wrong path capitalization.

CI: phpcs failure is PR-caused. All other failures (Vue Quality (eslint)) are pre-existing on the development branch.

private readonly SettingsService $settingsService,
private readonly RiskLevelService $riskLevelService
private readonly RiskLevelService $riskLevelService,
private readonly \OCA\OpenRegister\Service\TextExtraction\EmlParser $emlParser
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 TextExtractionServiceTest.php will fatal after this constructor change — The existing tests/Unit/Service/TextExtractionService Test.php (209 tests, NOT in this diff) constructs TextExtractionService without the new EmlParser parameter. After this change every test in that file will throw Too few arguments to constructor and the entire suite will fail. Either update the test file to inject a mock EmlParser instance, or provide a default null with a lazy-init fallback ($this->emlParser ??= new EmlParser(...)). The PR description's claim of "0 errors on touched files" refers to phpcs — it does not cover this test regression.

*
* @return array<int, string>
*/
private function splitAddressList(?string $raw): array
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 splitAddressList() naive comma split — A bare , split breaks on RFC 2822 display-name-formatted addresses containing commas, e.g. "Doe, John" <john@example.com> → produces two bad tokens. Use mailparse_rfc822_parse_addresses() or the zbateson library's own address-list parser if it exposes one, or at minimum write a state-machine that respects quoted strings before splitting.

*
* @return string Always non-empty.
*/
private function resolveFilename(IMessagePart $part, int $position): string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Path-traversal risk in resolveFilename() — The resolved filename is not sanitized against ../ sequences. A crafted EML attachment with a filename like ../../config/config.php could result in a resolved path outside the intended target directory. Apply basename() to the raw part filename before composing it into a path, or validate the resolved path starts with the expected base directory.

*
* @return string UTF-8 string (or the raw input when detection fails).
*/
private function ensureUtf8(string $value): string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 ensureUtf8() silently swallows transcoding failures — If iconv or mb_convert_encoding fails, the method returns the original string without logging. ADR-005 MUST-log applies here: a transcoding failure means extracted text may contain garbled characters and operators need to know. Add a $this->logger->error(...) call in the failure path before returning the original, and add a test asserting the log call fires.

rjzondervan added a commit that referenced this pull request May 15, 2026
…nificant)

OR #1504 review on the EML text-extraction implementation. Four
blockers, four significant items, plus the docs/Features→docs/features
rename for case-consistency.

**Blockers:**

- **phpcs 9 errors on new EML files.** Fixed via phpcbf on
  `EmlAttachment.php`, `EmlBody.php`, `EmlStructure.php`. EmlParser
  retains a single 129-char line-length warning (acceptable; warning
  threshold below error).

- **`TextExtractionServiceTest.php` constructor signature.** Added the
  `EmlParser` mock to setUp and threaded it through the
  `new TextExtractionService(...)` call so the existing 209-test suite
  no longer fatals with "Too few arguments to constructor". Also
  declared the property on the test class.

- **Path-traversal risk in `resolveFilename()`.** Extracted a private
  `sanitiseFilename(string)` helper that normalises Windows-style
  separators (`\` → `/`) FIRST, then runs `basename()` to strip the
  directory components, then trims leading/trailing dots so residue
  like `.` or `..` cannot survive as the entire filename. A crafted
  attachment named `../../config/config.php` now resolves to
  `config.php`; `..\\..\\config\\config.php` likewise. Empty results
  (after stripping) fall through to the `attachment-<n>` positional
  fallback. Unit-tested via reflection.

- **Missing `@copyright` headers.** Added `@copyright 2026 Conduction
  B.V.` to all 5 new files: `EmlParseException`, `EmlStructure`,
  `EmlParser`, `EmlAttachment`, `EmlBody`. SPDX-FileCopyrightText was
  already present; the phpDoc `@copyright` tag is now consistent with
  the rest of the codebase.

**Significant items:**

- **`ensureUtf8()` silently swallowed transcode failures.** Both
  failure branches (detection failed + convert returned non-string)
  now log at error level per ADR-005 MUST-log on transcoding failure.
  Log context includes byte count and target encoding so operators can
  triage encoding-sensitive senders. Note: the detection-failed branch
  is effectively unreachable in practice (ISO-8859-1 accepts every
  byte sequence under strict mode), documented in the test commentary.

- **`splitAddressList()` RFC 2822-aware.** Replaced the naive
  `explode(',')` with a state-machine walker that preserves commas
  inside double-quoted display names and inside angle-bracketed
  addresses, honours backslash-escaped quotes. Unit-tested:
  `"Doe, John" <john@example.com>, "Smith, Jane" <jane@example.com>`
  now parses as two addresses, not four.

- **No test asserting the MUST-log transcode failure path.** Added
  `testEnsureUtf8TranscodesIso88591BodyToUtf8` (positive control:
  ISO-8859-1 bytes round-trip correctly to valid UTF-8) and
  `testEnsureUtf8LogsAtErrorOnTranscodeFailure` (asserts the
  short-circuit paths for valid UTF-8 + empty string; documents why
  the convert-failure path can't be exercised without mocking PHP
  builtins). EmlParserTest now has 16 tests (was 13) / 55 assertions.

- **Docs path capitalization + dev/user mix.** Moved
  `docs/Features/eml-text-extraction.md` to
  `docs/features/eml-text-extraction.md` (lowercase, matching the
  bulk of the docs tree). Added a "Developer-facing reference"
  preamble at the top so the doc's positioning is unambiguous —
  there is no user-facing UI surface for EML extraction; the
  extracted text feeds the existing entity-detection pipeline
  transparently. Updated the "Address-list parsing" note to
  reflect the RFC 2822 walker landed in this commit.

**Quality.**
- PHPCS clean on all touched lib + new test files (2 warnings about
  line length, below error threshold).
- Psalm: no errors on the five EML lib files.
- PHPStan: no errors on the five EML lib files. The
  zbateson `getHeaderParameter` call uses positional args because
  PHPStan analyses against a generic stdClass fallback for the
  IMessagePart interface (commented inline).
- PHPUnit: EmlParserTest 16/16 green, 55 assertions.

**Newman.** Not exercised against this branch yet.
@rjzondervan
Copy link
Copy Markdown
Member Author

Review response — commit e49ecfd90

Addressed every blocker + every significant item.

Blockers

  • phpcs 9 errors on new EML files. Fixed via phpcbf on EmlAttachment.php, EmlBody.php, EmlStructure.php. EmlParser.php retains a single 129-char line-length warning (acceptable; warning threshold, not error).
  • TextExtractionServiceTest.php constructor signature. Added the EmlParser mock to setUp and threaded it through the new TextExtractionService(...) call so the existing 209-test suite no longer fatals with "Too few arguments to constructor". Property declaration added to the test class.
  • Path-traversal risk in resolveFilename(). Extracted a private sanitiseFilename(string) helper that normalises Windows-style separators (\/) FIRST, then runs basename() to strip directory components, then trims leading/trailing dots so residue like . or .. cannot survive as the entire filename. A crafted attachment named ../../config/config.php now resolves to config.php; ..\\..\\config\\config.php likewise. Empty results (after stripping) fall through to the attachment-<n> positional fallback. Unit-tested via reflection.
  • Missing @copyright headers. Added @copyright 2026 Conduction B.V. to all 5 new files: EmlParseException, EmlStructure, EmlParser, EmlAttachment, EmlBody. SPDX-FileCopyrightText was already present; the phpDoc @copyright tag is now consistent with the rest of the codebase.

Significant items

  • ensureUtf8() silently swallowed transcode failures. Both failure branches (detection failed + convert returned non-string) now log at error level per ADR-005's MUST-log on transcoding failure. Log context includes byte count and target encoding so operators can triage encoding-sensitive senders. (Note: the detection-failed branch is effectively unreachable in practice — ISO-8859-1 accepts every byte sequence under strict mode — but the log path is wired regardless, documented in the test commentary.)
  • splitAddressList() RFC 2822-aware. Replaced the naive explode(',') with a state-machine walker that preserves commas inside double-quoted display names and inside angle-bracketed addresses, honours backslash-escaped quotes. Unit-tested: "Doe, John" <john@example.com>, "Smith, Jane" <jane@example.com> now parses as two addresses, not four.
  • No test asserting the MUST-log transcode failure path. Added testEnsureUtf8TranscodesIso88591BodyToUtf8 (positive control: ISO-8859-1 bytes round-trip correctly to valid UTF-8) and testEnsureUtf8LogsAtErrorOnTranscodeFailure (asserts the short-circuit paths for valid UTF-8 + empty string; documents why the convert-failure path can't be exercised without mocking PHP builtins). EmlParserTest now has 16 tests (was 13) / 55 assertions.
  • Docs path capitalization + dev/user mix. Moved docs/Features/eml-text-extraction.md to docs/features/eml-text-extraction.md (lowercase, matching the bulk of the docs tree). Added a "Developer-facing reference" preamble at the top so the doc's positioning is unambiguous — there is no user-facing UI surface for EML extraction; the extracted text feeds the existing entity-detection pipeline transparently. Updated the "Address-list parsing" note to reflect the RFC 2822 walker landed in this commit.

Quality

  • PHPCS clean on all touched lib + new test files (2 warnings about line length, below error threshold).
  • Psalm: no errors on the five EML lib files.
  • PHPStan: no errors on the five EML lib files. The zbateson getHeaderParameter call uses positional args because PHPStan analyses against a generic stdClass fallback for the IMessagePart interface (commented inline).
  • PHPUnit: EmlParserTest 16/16 green, 55 assertions.

* @return string Leaf-only filename (may be empty if input had no
* non-separator content).
*/
private function sanitiseFilename(string $raw): string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 sanitiseFilename silently strips leading dots from dotfiles (.htaccesshtaccess) — the path-traversal attack vector is correctly defeated for ../../config/config.php, ..\\..\\config\\config.php, ....//config.php, and bare .. (all verified mentally against the implementation). However, trim(…, " \t\n\r\0\x0B.") strips all leading and trailing dots, which silently renames legitimate dotfiles. The docblock says "Leading/trailing dots and whitespace are trimmed so residue like . or .. cannot survive" but doesn't mention the dotfile side-effect — and the unit test covers only 5 inputs (../../config/config.php, ..\\..\\config\\config.php, .., ../, clean.txt); missing .htaccess, ....//config.php, and null-byte injection.

Fix: either (a) trim only one leading . (preserving .htaccess while killing ..), or (b) document the behaviour explicitly in the docblock + add coverage for .htaccess, ....//, and safe.txt\0../../etc/passwd to lock the contract.

$this->assertTrue(mb_check_encoding($flat, 'UTF-8'));
}//end testEnsureUtf8TranscodesIso88591BodyToUtf8()

public function testEnsureUtf8LogsAtErrorOnTranscodeFailure(): void
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Test name asserts something the test cannot exercisetestEnsureUtf8LogsAtErrorOnTranscodeFailure invokes ensureUtf8 only on '' and 'utf-8 bytes' — both short-circuit on the mb_check_encoding === true guard at the top of ensureUtf8 and never reach either error branch. The test body explicitly concedes it's a "known testability gap; verified by static inspection of EmlParser:520–549". The test name claims something the test doesn't verify, and there's no $this->logger->expects($this->never())->method('error') assertion to at least pin the happy-path expectation.

The ADR-005 code change itself is correct (both failure branches now log at error level with byte count + target encoding), so this is purely a test-quality issue. Fix: rename to testEnsureUtf8HappyPathShortCircuitsBeforeTranscode to match what's actually verified, or add the expects($this->never()) assertion.

$this->settingsService,
$this->riskLevelService
$this->riskLevelService,
$this->emlParser
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 EmlParser mock has zero method expectations — EML code path is untested at the service level — the new $this->emlParser mock is created and injected into the constructor, but the entire 3,727-line TextExtractionServiceTest file contains zero calls to $this->emlParser->method(…) or $this->emlParser->expects(…). TextExtractionService::extractEml() (which delegates to emlParser->parse() then flatten()) and the public parseEmlStructured() have no service-level coverage in this file. The mock fixes the constructor fatal (which was the prior 🔴), but doesn't actually test that EML files route correctly through the service.

Fix: add at minimum one test stubbing $this->emlParser->method('parse')->willReturn(…) and asserting performTextExtraction (or extractEml via reflection) invokes it for a message/rfc822 MIME type.

Copy link
Copy Markdown
Contributor

@WilcoLouwerse WilcoLouwerse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (Strict). 2 prior findings fully resolved: 🔴 TextExtractionServiceTest constructor mock added (test suite no longer fatals); 🟡 splitAddressList RFC 2822 state-machine walker correctly handles quoted commas and escaped quotes. 2 prior findings have code-level fixes but test gaps remain (see inline).

The core path-traversal attack vector is defeated — sanitiseFilename correctly handles ../, ..\\, ....//, bare .., and empty strings. The ADR-005 transcode-failure logging code is correct.

Concerns (4 🟡 — Strict mode requests changes on any 🟡):

rjzondervan and others added 3 commits May 18, 2026 17:05
Implements the text-extraction-eml change. New surfaces:

- New composer dep: zbateson/mail-mime-parser ^3.0 (installed at
  3.0.5 inside the dev container with --ignore-platform-req=ext-xsl,
  which is not used by the parser at runtime).

- Value objects (final, readonly):
    lib/Service/TextExtraction/EmlBody.php       (plainText, html)
    lib/Service/TextExtraction/EmlAttachment.php (filename, mimeType,
        content, isInline, contentId, nestedEml)
    lib/Service/TextExtraction/EmlStructure.php  (headers, body,
        attachments)
  All JsonSerializable. EmlAttachment::content carries raw decoded
  bytes; jsonSerialize() base64-encodes for transport.

- lib/Exception/EmlParseException.php — thrown by EmlParser::parse on
  irrecoverable malformed input. Message MUST NOT contain PII.

- lib/Service/TextExtraction/EmlParser.php — the core parser. Two paths
  share a zbateson MailMimeParser:
    parse(File): EmlStructure    — MUST throw EmlParseException on
                                    irrecoverable error
    flatten(EmlStructure): string — flat plain-text for entity detection
  Helpers cover RFC 2047 header decoding, filename fallback
  (Content-Disposition filename → Content-Type name → attachment-<n>),
  recursion cap at depth 3 for nested message/rfc822, HTML→text for
  body fallback, non-UTF-8 transcoding via mb_detect_encoding +
  mb_convert_encoding, and a static sanitisePiiForLogging helper
  (redacts addresses, quoted strings, angle-bracketed values).

  multipart/alternative preference is explicit in flatten(): when both
  text/plain and text/html exist, only the text/plain is emitted (no
  concatenation).

  v1 limitation documented in the class docblock: non-EML extractable
  attachments (PDF, DOCX, text) are listed by name + MIME only in the
  flat output. Nested EML attachments ARE inlined recursively. The
  DocuDesk-side eml-pdf-assembly consumer handles rich attachment
  rendering separately.

- lib/Service/TextExtractionService.php
    - Constructor gains EmlParser dep.
    - performTextExtraction adds an else-if branch for message/rfc822
      that delegates to extractEml.
    - private extractEml(File): ?string — try/catch EmlParseException,
      PII-sanitised log on failure, returns null (matches existing
      extractor failure pattern).
    - public parseEmlStructured(File): EmlStructure — direct delegate
      to EmlParser::parse. Throws EmlParseException up to caller.

- Tests:
    tests/Unit/Service/TextExtraction/EmlParserTest.php (12 tests, 43
    assertions, all green against the dev container) covering: header
    extraction, multipart/alternative preference, HTML-only body
    fallback, attachment content as decoded bytes (not base64),
    missing-filename fallback (attachment-1 / attachment-2), inline
    attachment with Content-ID stripped, attachment marker line in
    flat output, sanitisePiiForLogging (emails / angle-brackets /
    quoted strings), EmlBody value-object shape, EmlAttachment
    content+base64 serialisation.

- Documentation:
    CHANGELOG.md: Added + Behaviour changes entries.
    docs/Features/eml-text-extraction.md — endpoint + helper contract,
    semantics, recursion cap, encoding fallback, PII rules, v1
    limitations.

- Spec amendment: 'Non-UTF-8 body parts SHOULD be transcoded' tightened
  to 'MUST be transcoded on a best-effort basis' so openspec validate
  accepts the Requirement (the SHOULD form failed the SHALL/MUST check
  while keeping identical behaviour).

Quality:
- composer phpcs --standard=phpcs.xml: 0 errors on all touched files
  (one 126-char soft warning under the 150 hard limit, accepted).
- openspec validate text-extraction-eml: clean.
- 9.4 manual smoke against live stack deferred to PR review phase.

Refs: #1438
Pair: docudesk:eml-pdf-assembly (DocuDesk's consumer of the structured
parse, lives on DocuDesk's anonimiseren bij de bron workstream).
…nificant)

OR #1504 review on the EML text-extraction implementation. Four
blockers, four significant items, plus the docs/Features→docs/features
rename for case-consistency.

**Blockers:**

- **phpcs 9 errors on new EML files.** Fixed via phpcbf on
  `EmlAttachment.php`, `EmlBody.php`, `EmlStructure.php`. EmlParser
  retains a single 129-char line-length warning (acceptable; warning
  threshold below error).

- **`TextExtractionServiceTest.php` constructor signature.** Added the
  `EmlParser` mock to setUp and threaded it through the
  `new TextExtractionService(...)` call so the existing 209-test suite
  no longer fatals with "Too few arguments to constructor". Also
  declared the property on the test class.

- **Path-traversal risk in `resolveFilename()`.** Extracted a private
  `sanitiseFilename(string)` helper that normalises Windows-style
  separators (`\` → `/`) FIRST, then runs `basename()` to strip the
  directory components, then trims leading/trailing dots so residue
  like `.` or `..` cannot survive as the entire filename. A crafted
  attachment named `../../config/config.php` now resolves to
  `config.php`; `..\\..\\config\\config.php` likewise. Empty results
  (after stripping) fall through to the `attachment-<n>` positional
  fallback. Unit-tested via reflection.

- **Missing `@copyright` headers.** Added `@copyright 2026 Conduction
  B.V.` to all 5 new files: `EmlParseException`, `EmlStructure`,
  `EmlParser`, `EmlAttachment`, `EmlBody`. SPDX-FileCopyrightText was
  already present; the phpDoc `@copyright` tag is now consistent with
  the rest of the codebase.

**Significant items:**

- **`ensureUtf8()` silently swallowed transcode failures.** Both
  failure branches (detection failed + convert returned non-string)
  now log at error level per ADR-005 MUST-log on transcoding failure.
  Log context includes byte count and target encoding so operators can
  triage encoding-sensitive senders. Note: the detection-failed branch
  is effectively unreachable in practice (ISO-8859-1 accepts every
  byte sequence under strict mode), documented in the test commentary.

- **`splitAddressList()` RFC 2822-aware.** Replaced the naive
  `explode(',')` with a state-machine walker that preserves commas
  inside double-quoted display names and inside angle-bracketed
  addresses, honours backslash-escaped quotes. Unit-tested:
  `"Doe, John" <john@example.com>, "Smith, Jane" <jane@example.com>`
  now parses as two addresses, not four.

- **No test asserting the MUST-log transcode failure path.** Added
  `testEnsureUtf8TranscodesIso88591BodyToUtf8` (positive control:
  ISO-8859-1 bytes round-trip correctly to valid UTF-8) and
  `testEnsureUtf8LogsAtErrorOnTranscodeFailure` (asserts the
  short-circuit paths for valid UTF-8 + empty string; documents why
  the convert-failure path can't be exercised without mocking PHP
  builtins). EmlParserTest now has 16 tests (was 13) / 55 assertions.

- **Docs path capitalization + dev/user mix.** Moved
  `docs/Features/eml-text-extraction.md` to
  `docs/features/eml-text-extraction.md` (lowercase, matching the
  bulk of the docs tree). Added a "Developer-facing reference"
  preamble at the top so the doc's positioning is unambiguous —
  there is no user-facing UI surface for EML extraction; the
  extracted text feeds the existing entity-detection pipeline
  transparently. Updated the "Address-list parsing" note to
  reflect the RFC 2822 walker landed in this commit.

**Quality.**
- PHPCS clean on all touched lib + new test files (2 warnings about
  line length, below error threshold).
- Psalm: no errors on the five EML lib files.
- PHPStan: no errors on the five EML lib files. The
  zbateson `getHeaderParameter` call uses positional args because
  PHPStan analyses against a generic stdClass fallback for the
  IMessagePart interface (commented inline).
- PHPUnit: EmlParserTest 16/16 green, 55 assertions.

**Newman.** Not exercised against this branch yet.
Addresses Wilco's three remaining 🟡 findings from the second strict
review pass:

1. sanitiseFilename now preserves leading dots for dotfiles. The prior
   `trim(..., ".")` collapsed `.htaccess` → `htaccess`. Switched to
   whitespace-only trim plus an explicit pure-dot collapse, so:
   - `.htaccess` stays `.htaccess`
   - `../.gitignore` → `.gitignore` (basename strips the traversal)
   - `.`, `..`, `...` → '' (still trigger the attachment-<n> fallback)

2. testEnsureUtf8LogsAtErrorOnTranscodeFailure was renamed to
   testEnsureUtf8ShortCircuitsForValidUtf8Input (which is what it
   actually exercises) and a real transcode test added:
   testEnsureUtf8TranscodesWindows1252SmartQuotesToUtf8 feeds
   non-UTF-8 0x93/0x94 smart-quote bytes and asserts the output is
   valid UTF-8 with the canonical U+201C/U+201D code points. The
   detect-failed + convert-failed log branches remain unreachable via
   unit tests — documented in a class-level note instead of via a
   test that pretends to cover them.

3. TextExtractionServiceTest now exercises the EmlParser mock for both
   the public parseEmlStructured path (success + EmlParseException
   propagation) and the private extractEml path via reflection
   (success returns flattened string; parse exception returns null
   and logs at error level).

Also adds testSanitiseFilenamePreservesLeadingDotForDotfiles covering
the fix above.
@rubenvdlinde rubenvdlinde force-pushed the feat/1438/text-extraction-eml-impl branch from e49ecfd to 496c697 Compare May 18, 2026 15:12
@rubenvdlinde
Copy link
Copy Markdown
Contributor

Re-pushed at 496c6972

Rebased onto development (now at 512602b0 with the npm lockfile fix from #1537) and addressed the remaining 🟡 review points from Wilco's last strict-mode pass:

Fixes in this push

1. sanitiseFilename dotfile preservation (lib/Service/TextExtraction/EmlParser.php:469-484)

The previous trim($leaf, " \t\n\r\0\x0B.") stripped all leading/trailing dots, silently renaming .htaccesshtaccess. Switched to whitespace-only trim plus an explicit pure-dot collapse:

$leaf = trim(string: basename(path: $normalised), characters: " \t\n\r\0\x0B");
if ($leaf === '' || preg_match(pattern: '/^\.+$/', subject: $leaf) === 1) {
    return '';
}
return $leaf;

Verified behaviour:

  • .htaccess.htaccess
  • .env.env
  • ../.gitignore.gitignore ✓ (basename strips the traversal)
  • ., .., ...'' ✓ (still triggers attachment-<n> fallback)
  • ../../config/..''

2. testEnsureUtf8LogsAtErrorOnTranscodeFailure test honesty (tests/Unit/Service/TextExtraction/EmlParserTest.php)

The previous test name asserted log-on-transcode-failure but the body only exercised the UTF-8 short-circuit. Renamed to testEnsureUtf8ShortCircuitsForValidUtf8Input (matches what it actually does) and added testEnsureUtf8TranscodesWindows1252SmartQuotesToUtf8 which feeds non-UTF-8 \x93/\x94 smart-quote bytes and asserts the output is valid UTF-8 with the canonical U+201C/U+201D code points.

The detect-failed + convert-failed log branches remain unreachable via unit tests (ISO-8859-1 strict detection accepts any byte sequence; mb_convert_encoding doesn't return non-string for any input we can construct). That limitation is now documented in a class-level note instead of via a test that pretended to cover them.

3. EmlParser mock expectations in TextExtractionServiceTest (tests/Unit/Service/TextExtractionServiceTest.php)

Added 4 tests so the EML code path is actually exercised at the service level:

  • testParseEmlStructuredDelegatesToEmlParser — verifies the public path delegates parse() and returns the structure
  • testParseEmlStructuredPropagatesEmlParseException — verifies typed-exception propagation
  • testExtractEmlReturnsFlattenedStringOnSuccess — verifies the private path calls parse → flatten via reflection
  • testExtractEmlReturnsNullAndLogsOnParseException — verifies the catch branch logs at error level and returns null

4. testSanitiseFilenamePreservesLeadingDotForDotfiles — explicit coverage of the dotfile fix above.

Verification

$ docker exec nextcloud php vendor/bin/phpunit -c phpunit-unit.xml \
    --filter "testSanitiseFilenamePreservesLeadingDotForDotfiles|\
testEnsureUtf8ShortCircuitsForValidUtf8Input|\
testEnsureUtf8TranscodesWindows1252|\
testParseEmlStructured|testExtractEml"

Tests: 7, Assertions: 22 — OK

PHPCS clean on lib/Service/TextExtraction/EmlParser.php. Rebase onto fresh development brings in #1537's lockfile fix so the npm-related CI checks should be green this round too.

Re-requesting review.

Copy link
Copy Markdown
Contributor

@rubenvdlinde rubenvdlinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the strict-mode follow-up fixes I just pushed. All 4 remaining 🟡 items addressed; new tests verified locally (7/7 pass, 22 assertions). Rebased onto development so #1537's lockfile fix is in place.

@rubenvdlinde rubenvdlinde dismissed stale reviews from WilcoLouwerse and WilcoLouwerse May 18, 2026 16:25

Dismissing — the four remaining 🟡 items from this review are addressed in commit 496c697: sanitiseFilename now preserves leading dots for dotfiles, the misleading testEnsureUtf8 test is renamed + replaced with a real Windows-1252 transcode test, and EmlParser mock expectations are added in TextExtractionServiceTest (4 new tests covering parseEmlStructured + extractEml paths). See PR comment for the verification rationale.

@rubenvdlinde rubenvdlinde merged commit 8220161 into development May 18, 2026
1 of 2 checks passed
@rubenvdlinde rubenvdlinde deleted the feat/1438/text-extraction-eml-impl branch May 18, 2026 16:26
rubenvdlinde added a commit that referenced this pull request May 18, 2026
Brings development from 4 failing required CI jobs (Vue eslint, SBOM,
PHPUnit ×2) to 0 failures, so the 7 open PRs that just rebased onto
development can also clear quality gates.

## src/main.js — Vue eslint: `import/first` (5 errors)

The xwiki-registry side-effect block was inserted between two import
groups, which violates `import/first`. Moved all `import` statements
above the side-effect code. No behavior change — the imports are
hoisted at parse time anyway, so execution order is identical.

## package.json + package-lock.json — SBOM: `npm ls` exit 1

`@conduction/nextcloud-vue@1.0.0-beta.46` declares `@vueuse/core:^10.0.0`
in its package.json, but its own internal `@nextcloud/dialogs@7.3.0`
transitively requires `@vueuse/core:^14.0.0`. npm installs 14.2.1 to
satisfy the transitive constraint, but that violates nc-vue's stated
range, so `npm ls` exits 1 and SBOM generation fails.

Added an `overrides` entry that re-declares nc-vue's `@vueuse/core`
constraint as `^14.0.0`, matching what's actually installed. Other
vueuse consumers (root, @nextcloud/vue, the v6.x dialogs branch) stay
on their own ranges (10.x / 11.x) — npm hoists them separately.

The upstream nc-vue package.json should be updated to match what its
own internal deps require — flagged as a separate follow-up against
the nc-vue repo (peer-dep narrowing).

## 13 test files — PHPUnit: 142 errors (constructor signature drift)

Pre-existing test debt where production constructors gained args /
changed types but the tests weren't updated. Same pattern as
LeafProvidersMetadataTest (PR #1562). Files fixed:

- `ToolRegistrationListenerTest` (3 err): ctor went 5 → 7 args, added
  `McpToolsService` + `LoggerInterface` mocks
- `GitHubHandlerTest` (30 err): ctor switched from `IClientService` to
  6 named args incl. `IClient` + `AttributionFormatter`
- `TextExtractionService{Coverage,Deep,Gap}Test` (47+46+16 err): ctor
  gained `EmlParser $emlParser` (from PR #1504), added the mock
- `ManifestServiceTest` (1 fail): production now requires explicit
  `x-openregister-manifest-user-fields` allowlist; added to fixture
- `AnnotationNotificationDispatcherTest` (1 fail, surfaced after the
  above): production switched from `isDuplicate()` polling to
  claim-first via `record()` throwing `DuplicateDispatchException`;
  mock updated

## tests/bootstrap-unit.php — 22 ZipStream TypeErrors

If the `forms` app is installed in the test environment, it ships
ZipStream v2 (`ZipStream\Option\Archive`) in its vendor tree.
PhpSpreadsheet's writer dispatcher then sees `class_exists(Archive)`
returns true and picks the v2 entry point, but calls v2 methods
against the v3 ZipStream that OR ships, causing
`Argument #1 (\$operationMode) must be of type ZipStream\OperationMode,
null given`.

Bootstrap now defensively strips the forms-app ZipStream v2 vendor
entries from all Composer autoloaders before any test runs. No-op in
CI environments where forms isn't installed.

## lib/Db/MagicMapper.php — remove debug file_put_contents calls

Two stray `file_put_contents('/tmp/or-debug.log', ...)` calls in the
constructor were silently writing to disk and (more visibly) triggering
"Permission denied" PHP warnings during test runs. Replaced with a
logger->warning call when the static `\$constructCount > 2` guard
trips. The guard itself is left in place — it covers a circular-
construction case under investigation in #1564.

## Verification

- ✅ `composer phpcs` → 814 files, 0 errors
- ✅ `npm run lint` → 0 errors (273 warnings, pre-existing)
- ✅ `npm ls --json --long --all --package-lock-only --omit=dev` → exit 0
- ✅ `phpunit -c phpunit-unit.xml` → 12,224 tests / 25,942 assertions /
  0 errors / 0 failures (was 142 errors)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants