
Add batch upload for private cloud audio chunks (#5418 Phase 2)#5556

Closed
beastoin wants to merge 6 commits into main from fix/batch-upload-storage-5418

Conversation

@beastoin
Collaborator

@beastoin beastoin commented Mar 11, 2026

Summary

Adds upload_audio_chunks_batch() to backend/utils/other/storage.py for #5418 Phase 2 — reduces GCS Class A write ops by ~12x by streaming multiple audio chunks into a single GCS upload.

Changes:

  • New upload_audio_chunks_batch(chunks, uid, conversation_id, data_protection_level) function
  • PRIVATE_CLOUD_BATCH_ENABLED feature flag (env var, default false) for safe rollout
  • When enabled: streams chunks via blob.open('wb') into single .batch.bin or .batch.enc object
  • When disabled: falls back to existing per-chunk upload_audio_chunk() calls
  • Sorts chunks by timestamp before upload for consistent ordering
  • Single DB lookup per batch (vs per-chunk) when protection level not cached
  • Memory-efficient streaming writes (no in-memory concatenation)
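The naming and ordering rules above can be sketched as follows. This is a minimal illustration, not the actual implementation; the helper name and the `:.3f` timestamp formatting are assumptions based on the filename examples elsewhere in this PR:

```python
def batch_blob_name(chunks, encrypted=False):
    """Sort chunks by timestamp and derive the single batch object name,
    {first_ts}-{last_ts}.batch.bin (or .batch.enc when encrypted)."""
    sorted_chunks = sorted(chunks, key=lambda c: c['timestamp'])
    first_ts = sorted_chunks[0]['timestamp']
    last_ts = sorted_chunks[-1]['timestamp']
    ext = 'batch.enc' if encrypted else 'batch.bin'
    return sorted_chunks, f"{first_ts:.3f}-{last_ts:.3f}.{ext}"
```

Sorting before deriving the name guarantees the range in the filename matches the byte order of the streamed chunks.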

Batch extension fixes (kelvin audit):

  • list_audio_chunks() — batch-aware parsing: splits on .batch. prefix, handles timestamp ranges (first-last), returns is_batch field
  • delete_audio_chunks() — tries .batch.enc/.batch.bin extensions, scans range-named batch blobs by start timestamp
  • download_audio_chunks_and_merge() — resolves actual GCS paths via list_audio_chunks() instead of constructing from timestamps, deduplicates batch blob downloads, handles encrypted batch decryption
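The batch-aware parsing described for list_audio_chunks() might look roughly like this. The helper name and returned dict shape are illustrative assumptions; only the `.batch.` split and the is_batch field come from the description above:

```python
def parse_chunk_filename(name):
    """Split on the '.batch.' marker to detect batch objects; otherwise
    treat the stem as a single per-chunk timestamp."""
    if '.batch.' in name:
        stem = name.split('.batch.')[0]          # e.g. '1000.000-1010.000'
        first, last = (float(t) for t in stem.split('-'))
        return {'start': first, 'end': last, 'is_batch': True}
    stem = name.rsplit('.', 1)[0]                # e.g. '1000.000'
    return {'start': float(stem), 'end': float(stem), 'is_batch': False}
```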

Files changed:

  • backend/utils/other/storage.py — batch upload + batch-aware list/delete/download
  • backend/tests/unit/test_batch_upload_storage.py — 29 unit tests
  • backend/test.sh — registered new tests

Testing

29 unit tests covering all code paths:

  • Multi-chunk batch upload (standard + encrypted)
  • Single chunk batch + large batch (50 chunks) streaming
  • Identical timestamps filename stability + streaming API call args
  • Flag disabled fallback to per-chunk (standard + encrypted + None level)
  • Empty batch edge case + DB lookup count (exactly 1 per batch)
  • Batch-aware list: .batch.bin, .batch.enc, single-chunk batch, mixed files, meta.json exclusion, is_batch flag
  • Batch-aware delete: per-chunk by timestamp, batch by start timestamp, batch extension lookup
  • Batch-aware download: batch blob resolution, per-chunk fallback, deduplication, encrypted batch decryption
tests/unit/test_batch_upload_storage.py ............................. 29 passed
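The streaming-API assertions can be exercised against a mocked blob along these lines (a simplified sketch, not the actual test file):

```python
from unittest import mock

# Fake GCS blob whose open('wb') hands back a recording writer.
blob = mock.MagicMock()
writer = mock.MagicMock()
blob.open.return_value.__enter__.return_value = writer

chunks = [b'chunk-one', b'chunk-two', b'chunk-three']
with blob.open('wb') as w:          # single upload session per batch
    for data in chunks:
        w.write(data)               # one write per chunk, no concatenation

blob.open.assert_called_once_with('wb')
assert writer.write.call_count == len(chunks)
blob.upload_from_string.assert_not_called()
```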

kelvin re-verification: FULL PASS (4/4 steps, 80/80 unit tests)

Review cycle changes

  • Round 1: Switched from in-memory bytearray concatenation to streaming blob.open('wb') writes (reviewer feedback on memory spike)
  • Round 2: Added 4 tester-requested coverage tests (large batch, identical timestamps, API args, flag-disabled DB delegation)
  • Round 3: Fixed 3 CRITICAL batch extension bugs found by kelvin audit — list/delete/download now batch-aware, 14 new tests added

Risks / Edge Cases

  • Feature flag defaults to false — zero production impact until explicitly enabled
  • Larger failure blast radius per upload (60s vs 5s of audio per GCS op)
  • Opus encoding is not included (not a regression — per-chunk path also writes raw PCM)

Closes phase 2 scope of #5418.

by AI for @beastoin

Adds batch upload function for #5418 Phase 2 that concatenates
multiple audio chunks into a single GCS write operation (~12x
fewer Class A ops). Supports standard and encrypted paths.
Feature flag defaults to false for safe rollout.
11 tests covering: multi-chunk batch, single chunk, encrypted batch,
flag disabled fallback, empty batch, DB lookup count, ordering,
and timestamp-order preservation.
@greptile-apps
Contributor

greptile-apps Bot commented Mar 11, 2026

Greptile Summary

This PR introduces upload_audio_chunks_batch() in backend/utils/other/storage.py as Phase 2 of #5418, reducing GCS Class A write operations by concatenating multiple audio chunks into a single object per batch. A PRIVATE_CLOUD_BATCH_ENABLED env-var flag (default false) gates the new behavior, with a per-chunk fallback that preserves existing behavior.

Key findings:

  • Storage leak when batch mode is enabled: delete_audio_chunks constructs deletion paths as {timestamp}.bin / {timestamp}.enc for each individual timestamp. Batch objects use the format {first_ts}-{last_ts}.batch.bin, so no deletion path will ever match. Batch files will accumulate in GCS indefinitely with no cleanup path (other than full-conversation deletion via delete_conversation_audio_files).

  • list_audio_chunks silently discards batch files: The .batch.bin extension passes the .endswith('.bin') check, but the subsequent float() parse on 1000.000-1010.000.batch raises ValueError and is silently swallowed. Any downstream consumer of list_audio_chunks (playback, re-download, migration) will see zero chunks for batch-uploaded conversations. While the PR description acknowledges this as a known follow-up, combined with the deletion issue above there is currently no reliable path for batch objects to be either read or cleaned up.

  • Fallback path still incurs N Firestore reads when data_protection_level is None: In the flag-disabled path, upload_audio_chunk is called for each chunk without a pre-resolved protection level, defeating the "single DB lookup per batch" optimization claimed in the PR for this path.

  • The feature flag defaults to false, so there is no immediate production impact. The 11 unit tests cover the batch logic well but do not test the deletion incompatibility.

Confidence Score: 2/5

  • Safe to merge as long as PRIVATE_CLOUD_BATCH_ENABLED remains false; enabling the flag before the follow-up fixes land will cause irreversible GCS storage accumulation.
  • The flag-disabled default prevents immediate production impact, but the batch path has two unresolved data-management bugs (deletion and listing incompatibility) that will cause GCS objects to leak permanently if the flag is ever turned on. These are not edge-case scenarios — they affect every batch upload.
  • Pay close attention to backend/utils/other/storage.py — specifically delete_audio_chunks (line 435) and list_audio_chunks (line 449), both of which are incompatible with the new batch filename format.

Important Files Changed

Filename Overview
backend/utils/other/storage.py Adds upload_audio_chunks_batch() with feature flag. Two significant issues: delete_audio_chunks uses per-timestamp path construction that will never match batch filenames, causing permanent GCS storage leaks when the flag is on; list_audio_chunks silently skips batch files via a swallowed ValueError, breaking all downstream reads/deletes that rely on it.
backend/tests/unit/test_batch_upload_storage.py 11 well-structured unit tests covering happy paths, encryption, flag-disabled fallback, empty batches, DB-lookup deduplication, and sort ordering. Missing coverage for: delete_audio_chunks behavior against batch-format filenames, and the N-DB-reads-per-chunk issue in the flag-disabled fallback when data_protection_level=None.
backend/test.sh Adds the new test file to the test runner — correct and complete.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant upload_audio_chunks_batch
    participant upload_audio_chunk
    participant users_db
    participant GCS

    Caller->>upload_audio_chunks_batch: chunks, uid, conv_id, protection_level

    alt chunks is empty
        upload_audio_chunks_batch-->>Caller: []
    else PRIVATE_CLOUD_BATCH_ENABLED = false (fallback)
        upload_audio_chunks_batch->>upload_audio_chunks_batch: sort chunks by timestamp
        loop each sorted chunk
            upload_audio_chunks_batch->>upload_audio_chunk: chunk_data, uid, conv_id, ts, level
            alt protection_level is None
                upload_audio_chunk->>users_db: get_data_protection_level(uid) [once per chunk!]
                users_db-->>upload_audio_chunk: level
            end
            upload_audio_chunk->>GCS: upload {ts}.bin or {ts}.enc
            GCS-->>upload_audio_chunk: ok
            upload_audio_chunk-->>upload_audio_chunks_batch: path
        end
        upload_audio_chunks_batch-->>Caller: [path1, path2, ...]
    else PRIVATE_CLOUD_BATCH_ENABLED = true (batch)
        upload_audio_chunks_batch->>upload_audio_chunks_batch: sort chunks by timestamp
        alt protection_level is None
            upload_audio_chunks_batch->>users_db: get_data_protection_level(uid) [once per batch]
            users_db-->>upload_audio_chunks_batch: level
        end
        alt level == enhanced
            loop each chunk
                upload_audio_chunks_batch->>upload_audio_chunks_batch: encrypt_audio_chunk(data, uid)
            end
            upload_audio_chunks_batch->>GCS: upload {first_ts}-{last_ts}.batch.enc
        else standard
            upload_audio_chunks_batch->>upload_audio_chunks_batch: concatenate raw PCM
            upload_audio_chunks_batch->>GCS: upload {first_ts}-{last_ts}.batch.bin
        end
        GCS-->>upload_audio_chunks_batch: ok
        upload_audio_chunks_batch->>upload_audio_chunks_batch: del batch_data
        upload_audio_chunks_batch-->>Caller: [batch_path]
    end

Comments Outside Diff (2)

  1. backend/utils/other/storage.py, line 435-446 (link)

    Batch files are never deleted by delete_audio_chunks

    When PRIVATE_CLOUD_BATCH_ENABLED=True, chunks are stored as a single object named {first_ts}-{last_ts}.batch.bin (or .batch.enc). However, delete_audio_chunks constructs paths as {timestamp}.enc / {timestamp}.bin for each individual timestamp in the list — no such file exists in batch mode. Every call to delete_audio_chunks will silently find nothing to delete, and the batch objects will accumulate in GCS indefinitely.

    delete_conversation_audio_files (which does a prefix list_blobs) will still work for full-conversation cleanup, but any code path that calls delete_audio_chunks with specific timestamps after enabling batch mode will silently leak storage.

    At minimum, the function needs a complementary lookup for batch-format paths, or the PR description should explicitly note that delete_audio_chunks is incompatible with batch mode and callers must use delete_conversation_audio_files instead.
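The mismatch is easy to demonstrate with illustrative values (the `:.3f` formatting of candidate paths is an assumption):

```python
timestamps = [1000.0, 1005.0, 1010.0]
stored = {'1000.000-1010.000.batch.bin'}   # what batch mode actually wrote
# delete_audio_chunks builds per-timestamp candidates like these:
candidates = {f"{ts:.3f}.bin" for ts in timestamps} | {f"{ts:.3f}.enc" for ts in timestamps}
leaked = stored - candidates               # no candidate ever matches
assert leaked == stored                    # the batch object is never deleted
```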

  2. backend/utils/other/storage.py, line 462-477 (link)

    list_audio_chunks silently discards batch files

    Batch filenames like 1000.000-1010.000.batch.bin do pass the .endswith('.bin') check, but filename.rsplit('.', 1)[0] yields 1000.000-1010.000.batch, which fails float() and is silently swallowed by the except ValueError: continue. Callers that rely on list_audio_chunks to enumerate available audio (e.g., playback, re-download, migration tooling) will see an empty result set for any conversation that used batch upload.

    The PR description acknowledges a follow-up is needed for list_audio_chunks and download_audio_chunks_and_merge, but given that delete_audio_chunks also won't find these files (see sibling comment), the batch objects have no reliable cleanup path until that follow-up lands. Enabling the flag before those follow-ups are in place will leave orphaned GCS objects.
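The silent drop can be reproduced in isolation:

```python
filename = '1000.000-1010.000.batch.bin'
assert filename.endswith('.bin')            # passes the extension check
stem = filename.rsplit('.', 1)[0]           # '1000.000-1010.000.batch'
skipped = False
try:
    float(stem)                             # not a plain timestamp
except ValueError:
    skipped = True                          # except-and-continue drops the file
assert skipped
```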

Last reviewed commit: adefec2

Comment on lines +384 to +396
if not PRIVATE_CLOUD_BATCH_ENABLED:
    # Flag disabled — fall back to per-chunk uploads
    paths = []
    for chunk in sorted_chunks:
        path = upload_audio_chunk(
            chunk_data=chunk['data'],
            uid=uid,
            conversation_id=conversation_id,
            timestamp=chunk['timestamp'],
            data_protection_level=data_protection_level,
        )
        paths.append(path)
    return paths
Contributor


Fallback path still does one DB read per chunk when data_protection_level is None

When the flag is disabled, the fallback delegates directly to upload_audio_chunk, which calls users_db.get_data_protection_level(uid) for each chunk individually. The "Single DB lookup per batch" benefit described in the PR description only applies to the batch-enabled code path.

If the feature flag is expected to be toggled back off (e.g., for a gradual rollout), this means callers that omit data_protection_level will incur N Firestore reads again. Consider resolving the protection level once before the loop in the fallback path, the same way the batch path does on line 399–401:

if not PRIVATE_CLOUD_BATCH_ENABLED:
    protection_level = (
        data_protection_level if data_protection_level is not None
        else users_db.get_data_protection_level(uid)
    )
    paths = []
    for chunk in sorted_chunks:
        path = upload_audio_chunk(
            chunk_data=chunk['data'],
            uid=uid,
            conversation_id=conversation_id,
            timestamp=chunk['timestamp'],
            data_protection_level=protection_level,
        )
        paths.append(path)
    return paths

Reviewer feedback: blob.open('wb') streams chunk-by-chunk instead
of accumulating full bytearray + bytes() copy in memory.
Tests now verify blob.open() + write() calls instead of
upload_from_string for batch mode.
… args, flag-disabled DB delegation

4 new tests covering: 50-chunk large batch streaming, identical
timestamp filename stability, blob.open() call args + no
upload_from_string assertion, flag-disabled level=None DB delegation.
@beastoin
Collaborator Author

All checkpoints passed (CP0-CP8). 15 tests, streaming writes, flag defaults to false. Ready for merge when the manager approves.

known follow-up: list/download/delete paths need batch-aware updates before PRIVATE_CLOUD_BATCH_ENABLED can be flipped to true.

by AI for @beastoin

@beastoin
Collaborator Author

CP4 Codex analysis summary:

  1. Compose vs concatenated upload — use concatenated upload, not GCS compose. Compose still requires uploading each chunk as an object first then composing, so ops are same or worse. For enhanced, concatenate encrypt_audio_chunk() outputs (length-prefixed frames) — matches existing framing in encryption.py.

  2. Mixed standard/encrypted — treat one batch call as homogeneous protection level (single decision per batch, same as per-session caching in pusher.py). Do not infer encrypted-vs-plain from bytes.

  3. Risks of batching encrypted chunks — larger failure blast radius (60s vs 5s per failed upload), ordering bugs more damaging (must sort by timestamp), corruption affects larger span, memory spikes during concat/encrypt (addressed by streaming writes in review round 1).

  4. Test coverage beyond the 4 requested — empty batch, DB lookup count, ordering, encrypted payload integrity, flag-disabled fallback delegation count. All addressed in final 15 tests.

Approach chosen: streaming blob.open('wb') writes, timestamp-sorted, single DB lookup per batch, feature flag defaults false.
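The length-prefixed framing mentioned in point 1 could be sketched as below. The 4-byte big-endian length prefix is an assumption for illustration; the actual frame layout is defined in encryption.py:

```python
import struct

def frame_encrypted_chunks(encrypted_chunks):
    """Concatenate encrypted chunks as length-prefixed frames so the
    batch object can later be split back into individual ciphertexts."""
    out = bytearray()
    for payload in encrypted_chunks:
        out += struct.pack('>I', len(payload))   # 4-byte length prefix
        out += payload
    return bytes(out)

def iter_frames(batch):
    """Inverse: walk the batch and yield each ciphertext payload."""
    pos = 0
    while pos < len(batch):
        (length,) = struct.unpack_from('>I', batch, pos)
        pos += 4
        yield batch[pos:pos + length]
        pos += length
```

Because each frame carries its own length, decryption can proceed chunk-by-chunk without knowing chunk boundaries in advance.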

by AI for @beastoin

@beastoin
Collaborator Author

Independent Verification Result (kelvin)

Verdict: PASS

Tests

15/15 pass on combined branch verify/combined-5554-5555-5556. Zero cross-PR interference.

Mechanism Review

  • upload_audio_chunks_batch() streams sorted chunks to single GCS object
  • DB lookup once per batch (not per chunk) when level not provided
  • Falls back to per-chunk upload_audio_chunk() when PRIVATE_CLOUD_BATCH_ENABLED=false
  • Feature flag defaults false — zero behavior change

Codex Audit Findings

  • WARNING: No production caller — dead code at merge time (future-ready infrastructure).
  • WARNING: Batch function does not Opus-encode. Must be addressed when a caller is added.
  • WARNING: .batch.bin/.batch.enc not in PRIVATE_CLOUD_EXTENSIONS — list/download/delete won't find batch files. Author acknowledged follow-up.

Combined PR

#5558

@beastoin
Collaborator Author

Closing — code merged into PR #5555 (yuki's Phase 2 PR). Hiro's upload_audio_chunks_batch() is integrated and verified (46/46 tests + live test PASS). No longer needed as separate PR.


by AI for @beastoin

@beastoin beastoin closed this Mar 11, 2026
@github-actions
Contributor

Hey @beastoin 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

beastoin added a commit that referenced this pull request Mar 12, 2026
Handle .batch.bin/.batch.enc from upload_audio_chunks_batch (PR #5556):
- PRIVATE_CLOUD_EXTENSIONS: add batch extensions (longest-first order)
- _strip_extension(): handle batch filenames with timestamp ranges
- _get_extension_for_path(): recognize batch.bin/batch.enc
- list_audio_chunks(): parse range timestamps, add is_batch field
- delete_audio_chunks(): scan for range-named batch blobs
- download_audio_chunks_and_merge(): resolve actual GCS paths via
  list_audio_chunks() for batch-aware download with deduplication

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
beastoin added a commit that referenced this pull request Mar 13, 2026
Cherry-pick from hiro's PR #5556 (commit 44117f9).

- list_audio_chunks: parse .batch. prefix before stripping extension,
  handle timestamp ranges, add is_batch field
- delete_audio_chunks: try .batch.enc/.batch.bin extensions, scan
  range-named batch blobs by start timestamp
- download_audio_chunks_and_merge: resolve actual GCS paths via
  list_audio_chunks to handle batch blobs, deduplicate downloads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
beastoin added a commit that referenced this pull request Mar 13, 2026
Cherry-pick from hiro's PR #5556 (commit 88bf2df), conflict resolved
to remove flag-disabled tests (flag already removed in prior commit).

- TestListAudioChunksBatchAware: 7 tests for .batch.bin/.batch.enc parsing
- TestDeleteAudioChunksBatchAware: 3 tests for batch file deletion
- TestDownloadAudioChunksMergeBatchAware: 4 tests for batch blob resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
…dware#5418)

Handle .batch.bin/.batch.enc from upload_audio_chunks_batch (PR BasedHardware#5556):
- PRIVATE_CLOUD_EXTENSIONS: add batch extensions (longest-first order)
- _strip_extension(): handle batch filenames with timestamp ranges
- _get_extension_for_path(): recognize batch.bin/batch.enc
- list_audio_chunks(): parse range timestamps, add is_batch field
- delete_audio_chunks(): scan for range-named batch blobs
- download_audio_chunks_and_merge(): resolve actual GCS paths via
  list_audio_chunks() for batch-aware download with deduplication

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
Cherry-pick from hiro's PR BasedHardware#5556 (commit 44117f9).

- list_audio_chunks: parse .batch. prefix before stripping extension,
  handle timestamp ranges, add is_batch field
- delete_audio_chunks: try .batch.enc/.batch.bin extensions, scan
  range-named batch blobs by start timestamp
- download_audio_chunks_and_merge: resolve actual GCS paths via
  list_audio_chunks to handle batch blobs, deduplicate downloads

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
Cherry-pick from hiro's PR BasedHardware#5556 (commit 88bf2df), conflict resolved
to remove flag-disabled tests (flag already removed in prior commit).

- TestListAudioChunksBatchAware: 7 tests for .batch.bin/.batch.enc parsing
- TestDeleteAudioChunksBatchAware: 3 tests for batch file deletion
- TestDownloadAudioChunksMergeBatchAware: 4 tests for batch blob resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
