[Epic] Performance & Scale — lazy load, image dedup, large-file handling #27

@MHoroszowski

Problem

Presentation() parses every part eagerly. For a 200-slide media-heavy deck, every image is fully decoded, every chart is fully parsed, and the entire OOXML tree sits in memory before the user calls a single method. Upstream issues report "poor performance when creating a big presentation" (scanny/python-pptx#644, 3c), unclosed-file ResourceWarnings (#461), and Docker container errors (#796). For comparison, the recently launched office-oxide Rust extractor is ~46× faster than python-pptx for read-only text extraction. Matching that is not the goal, since this library also has to write, but lazy loading and image deduplication are real wins for the largest workloads.

Sub-features

  • Lazy part loading: Presentation() reads only the rels graph; individual slide parts are parsed on first access
  • Image-blob deduplication across packages: when copying slides between decks (overlaps with Slide CRUD epic), identical image blobs reuse a single image part (e.g. image1.png) instead of being duplicated
  • Image-blob deduplication within a session: when the same image is added twice via add_picture, only one part is created (already partially implemented; extend it to cross-package merging)
  • Streaming write: pres.save(stream) does not require the full document tree to be assembled in memory before writing; parts are written to the zip stream in chunks
  • Resource cleanup: ensure ZipFile objects are closed, eliminate ResourceWarning: unclosed file
  • Profiling instrumentation: optional Presentation(..., profile=True) mode that emits per-part parse/serialize timings to stderr
  • Benchmark suite: tests/bench/ with a reference 200-slide media-heavy fixture and threshold-asserting microbenchmarks
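
To make the lazy-loading and resource-cleanup items concrete, here is a minimal sketch that uses only the standard library. It is not python-pptx's OpcPackage; LazyPackage, part(), parsed_names and build_demo_package are hypothetical names invented for illustration. Listing members reads only the zip directory, each member is parsed on first access and cached, and the context manager closes the underlying ZipFile so no unclosed-file ResourceWarning is left behind.

```python
import io
import zipfile
from xml.etree import ElementTree as ET


class LazyPackage:
    """Hypothetical sketch: parse zip members only on first access."""

    def __init__(self, file):
        self._zip = zipfile.ZipFile(file)
        self._parsed = {}  # member name -> parsed XML root element

    def names(self):
        # Reads only the zip central directory; nothing is decompressed.
        return self._zip.namelist()

    def part(self, name):
        # Parse on first access, then serve the cached element.
        if name not in self._parsed:
            with self._zip.open(name) as f:
                self._parsed[name] = ET.parse(f).getroot()
        return self._parsed[name]

    @property
    def parsed_names(self):
        # Names of the members that have actually been parsed so far.
        return sorted(self._parsed)

    def close(self):
        self._zip.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


def build_demo_package(slide_count=3):
    """Build a tiny in-memory zip with fake slide parts (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(1, slide_count + 1):
            zf.writestr(f"ppt/slides/slide{i}.xml", f"<sld n='{i}'/>")
    buf.seek(0)
    return buf
```

Used as `with LazyPackage(build_demo_package()) as pkg:`, a call to `pkg.part("ppt/slides/slide2.xml")` parses that one member and leaves the others untouched. A real implementation would also need to resolve relationships and content types, which this sketch leaves out.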

Prior art

  • Open PRs: none directly addressing lazy load.
  • Forks:
  • User issues this would close: #327, #461, #478, #548, #644, #732, #796, #813.
  • POI parity: XSLF lazy-loads slide parts via XSLFSlide.getXmlObject() (proof that it's possible inside an OOXML library).
  • Code paths: src/pptx/opc/package.py, src/pptx/opc/serialized.py, src/pptx/parts/image.py.

Acceptance criteria

  • Opening a 200-slide media-heavy benchmark deck completes in ≤30% of the current wall time and uses ≤50% of the current peak RSS.
  • Image dedup test: copying 50 slides each containing the same logo produces exactly 1 image part.
  • No ResourceWarning raised in the test suite under python -W error::ResourceWarning.
  • All 2986 existing pytest tests continue to pass.
  • Behave scenarios cover the benchmark thresholds.
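
The image dedup criterion can be illustrated with a content-hash registry. This is a sketch under assumptions, not the library's implementation: ImageStore, add() and copy_slides() are hypothetical names, SHA-256 is an arbitrary choice of digest, and each "slide" is reduced to the raw bytes of the one image it carries.

```python
import hashlib


class ImageStore:
    """Hypothetical registry: one stored part per distinct image blob."""

    def __init__(self):
        self._by_digest = {}  # hex digest -> part name
        self.blobs = {}  # part name -> image bytes

    def add(self, blob):
        # Identical bytes hash to the same digest, so they share one part.
        digest = hashlib.sha256(blob).hexdigest()
        name = self._by_digest.get(digest)
        if name is None:
            name = f"image{len(self.blobs) + 1}.png"
            self._by_digest[digest] = name
            self.blobs[name] = blob
        return name


def copy_slides(slide_blobs, store):
    """Return the image part name each copied slide would reference."""
    return [store.add(blob) for blob in slide_blobs]


logo = b"\x89PNG fake logo bytes"  # placeholder bytes, not a real PNG
store = ImageStore()
refs = copy_slides([logo] * 50, store)
print(len(store.blobs))  # 1
```

All 50 entries in `refs` point at the same part name, which is the property the acceptance test would assert. Sharing one store between the source and target packages is what would extend this to cross-package merging.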

Effort: L

This is cross-cutting work that requires careful re-plumbing of OpcPackage parsing. Recommend benchmark-first delivery: ship the bench suite as Phase A so that improvements are measurable.
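
A threshold-asserting microbenchmark for Phase A could start from something like the sketch below. The names measure(), within_budget(), eager_load() and lazy_load() are hypothetical, and the two loaders are stand-ins rather than real deck parsing. Note also that tracemalloc reports peak Python-level allocations, which is only a proxy for the peak RSS named in the acceptance criteria.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def within_budget(baseline, candidate, ratio):
    """True if candidate is at most ratio * baseline."""
    return candidate <= ratio * baseline


def eager_load(n):
    # Stand-in for eager parsing: materialize every "part" up front.
    return [bytes(10_000) for _ in range(n)]


def lazy_load(n):
    # Stand-in for lazy loading: record names only, no payloads.
    return [f"slide{i}.xml" for i in range(n)]


_, eager_time, eager_peak = measure(eager_load, 200)
_, lazy_time, lazy_peak = measure(lazy_load, 200)

# Memory budget from the acceptance criteria: at most 50% of baseline.
assert within_budget(eager_peak, lazy_peak, 0.50)
```

Wall-time budgets would use the same `within_budget` check with a ratio of 0.30, ideally over several repetitions, because a single timing on a shared CI runner is noisy.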

Metadata

Labels

  • area:performance (Feature area: performance)
  • epic (Multi-feature roadmap epic)
  • prior-art:fork (Active community fork has shipped this)
  • priority:P1 (Important but not urgent)
