chore: modernize temp-path handling, harden ProcessRunner + Tesseract output#2
Conversation
… output
OcrService
----------
Replace per-file Path.GetTempPath() + Guid composition with one
Directory.CreateTempSubdirectory("createpdf-ocr-") per OCR run.
Benefits:
- Atomic 0700-perms creation (mkstemp(3)-style); no name-collision
race window against other tmp scribblers in /tmp.
- One Directory.Delete(recursive: true) replaces N×TryDeleteFile
calls, so a crash mid-OCR leaves an obviously prefixed leak dir
instead of orphan .png/.txt scattered across /tmp.
- Extract OcrAsync(pdfPath, workDir, ...) so the stream and file
entry points share the pipeline without nesting two temp dirs.
TryDeleteFile is dropped (its only callers are gone); replaced by
TryDeleteDirectory with the same swallow-if-missing contract.
ProcessRunner
-------------
Process.Start(startInfo)! traded the null-forgiving operator for a
real null-check that throws InvalidOperationException with the
offending FileName. Per docs, Process.Start can return null when a
process resource is reused (UseShellExecute=true) — we never opt
into that today, but the `!` was a footgun waiting for the first
shell-execute caller.
TesseractOcrProvider
--------------------
ExtractTextFromImageAsync stripped the ".txt" suffix via the
hardcoded slice txtPath[..^4]. Replaced with
Path.GetDirectoryName + Path.GetFileNameWithoutExtension so any
extension length works (e.g. a future .text or .out) and
malformed inputs without an extension don't silently corrupt the
base path.
Tests
-----
- OcrServiceTests: TryDeleteFile_* → TryDeleteDirectory_*, with
the existing-directory case populated to verify recursive delete.
- RuntimeSystemEnvironmentTests + DocumentTests: drop the two
Path.GetTempFileName() callers (Microsoft-discouraged: predictable
8.3 name, TOCTOU surface, 65535-file-per-dir limit) and the
Path.Combine(Path.GetTempPath(), guid) site, all in favor of
CreateTempSubdirectory + Delete(recursive: true) cleanup.
Coverage held at 100% (229/229 lines, 245 tests, all green on net10.0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (6)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Comment |
There was a problem hiding this comment.
Pull request overview
This PR modernizes temporary-path handling across OCR code paths and tests, and adds defensive hardening around process startup and Tesseract output path derivation.
Changes:
- Use
Directory.CreateTempSubdirectory(...)to allocate a per-run working directory for OCR and clean it up with a single recursive delete. - Add a null-guard for
Process.Start(...)inProcessRunner. - Replace Tesseract output base-path derivation from string slicing to
Path-based APIs, and update tests to avoidPath.GetTempFileName().
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| CreatePdf.NET/Internal/TesseractOcrProvider.cs | Adjusts how the Tesseract output base path is derived from the requested txt output path. |
| CreatePdf.NET/Internal/ProcessRunner.cs | Replaces null-forgiving Process.Start(...)! with an explicit null-check + exception. |
| CreatePdf.NET/Internal/OcrService.cs | Creates one temp subdirectory per OCR run and deletes it recursively; introduces OcrAsync(...) helper. |
| CreatePdf.NET.Tests/RuntimeSystemEnvironmentTests.cs | Updates temp-file usage to a temp subdirectory + explicit probe file. |
| CreatePdf.NET.Tests/OcrServiceTests.cs | Updates cleanup helper tests from file deletion to recursive directory deletion. |
| CreatePdf.NET.Tests/DocumentTests.cs | Updates temp output location to a temp subdirectory and deletes it in finally. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Tesseract appends ".txt" itself, so strip the extension to give it the base path. | ||
| var outputBase = Path.Combine( | ||
| Path.GetDirectoryName(txtPath) ?? string.Empty, | ||
| Path.GetFileNameWithoutExtension(txtPath)); |
| if (Directory.Exists(path)) | ||
| Directory.Delete(path, recursive: true); |
Address Copilot review comments from PR #2
…omplete ISystemEnvironment abstraction + CI hygiene Bundles PRs #3, #5, #6, #7, #8 that all landed at the 3.0.3 version label. Contents are bug fixes and internal refactoring only — no public API changes, no breaking behavior: - #3 Address Copilot review comments from PR #2 - TesseractOcrProvider.ExtractTextFromImageAsync derives the actual .txt path Tesseract writes (not the caller's input path) - OcrService.TryDeleteDirectory is now truly best-effort (catch (IOException or UnauthorizedAccessException)) - #5 Capture Tesseract stderr in OCR-failure exception message - New internal ProcessResult record struct - IProcessRunner.RunAsync now returns Task<ProcessResult> - ProcessRunner reads stdout + stderr concurrently to avoid the full-pipe deadlock - #6 Finish ISystemEnvironment abstraction - ISystemEnvironment.ReadAllTextAsync added so the success branch is unit-testable without touching the real filesystem - #7 CI hygiene - nuget-publish.yml: install Pillow before the icon-optimisation Python heredoc (was failing with ModuleNotFoundError) - DocumentTests.FluentApi: Guid-suffixed filename so parallel TFM hosts do not race on output/fluent-test.pdf - #8 Clean up nuspec dependency groups (NU5128) - Drop SuppressDependenciesWhenPacking + dead NU5128 NoWarn - dotnet pack now emits canonical empty <group> per TFM Tests: 256/256 pass per TFM × net8.0/net9.0/net10.0 Coverage: 100% line, 97.42% branch Pack: 0 warnings, 0 errors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
OcrService— replace ad-hocPath.GetTempPath() + Guidfiles with a singleDirectory.CreateTempSubdirectory(\"createpdf-ocr-\")per OCR run. One atomic 0700-perms dir, oneDirectory.Delete(recursive: true)cleanup, no name-collision race window. Extracted a privateOcrAsync(pdfPath, workDir, …)helper so the stream + file paths share the pipeline without nesting two temp dirs. Dropped now-unusedTryDeleteFilein favor ofTryDeleteDirectory.ProcessRunner—Process.Start(startInfo)!was a footgun: the docs sayProcess.Startcan returnnullwhen a process resource is reused underUseShellExecute=true. Today we don't, but the!would silently NRE the first time anyone flips that. Replaced with a real null-check that throwsInvalidOperationExceptionnaming the offendingFileName.TesseractOcrProvider—txtPath[..^4]hardcoded a 4-char.txtsuffix. Swapped forPath.GetDirectoryName+Path.GetFileNameWithoutExtensionso the base path is correct for any extension and won't silently corrupt the path on malformed inputs.Path.GetTempFileName()callers (Microsoft now discourages it: predictable 8.3 name, TOCTOU surface, 65535-file-per-dir cap) and thePath.Combine(Path.GetTempPath(), guid)site withCreateTempSubdirectory+Delete(recursive: true). RewroteTryDeleteFile_*tests asTryDeleteDirectory_*with a populated-then-deleted recursive case.Coverage held at 100% — 229/229 lines, 245 tests, all green on
net10.0(multi-target net8/net9/net10 also build clean).Observations on the other internals you flagged
RuntimeSystemEnvironment.cs— clean wrapper, nothing to do.FileOperations.cs—FindProjectRootwalks up fromAppContext.BaseDirectorylooking for.csproj/.sln. That's a dev-mode heuristic — when consumers run a published binary in prod, there's no project file adjacent, so it falls back toDirectory.GetCurrentDirectory(). Works fine, but it's a design choice worth noting (not a bug). Left as-is.TesseractOcrProvider— beyond the[..^4]fix, there's a minor asymmetry: line 62 reads_systemEnvironment.FileExists(txtPath)but line 65 does concreteFile.ReadAllTextAsync(txtPath). The abstraction is half-applied. Out of scope for this PR.ProcessRunner— null-guard added, no other smells.Test plan
dotnet buildclean acrossnet8.0,net9.0,net10.0dotnet test --framework net10.0— 245/245 passdotcov report --exclude-generatedagainst the cobertura output — 100% (229/229) line coverage🤖 Generated with Claude Code