Skip to content

feat(preprocessing): phase 1 sprints 1-4, audio preprocessing pipeline#27

Merged
williaby merged 15 commits into
mainfrom
claude/audio-preprocessing-phase-1-Pc9qx
May 19, 2026
Merged

feat(preprocessing): phase 1 sprints 1-4, audio preprocessing pipeline#27
williaby merged 15 commits into
mainfrom
claude/audio-preprocessing-phase-1-Pc9qx

Conversation

@williaby
Copy link
Copy Markdown
Contributor

@williaby williaby commented May 16, 2026

Summary

Implements Phase 1 Sprints 1-4 of the audio preprocessing pipeline: pure-Python signal processing that runs before Deepgram transcription. No live API key is required for any code in this PR; all tests use mocks and synthetic numpy fixtures.

Modules Created

Module Purpose Public API
src/audio_processor/exceptions.py Top-level package exceptions AudioLoadError (extends ProjectBaseError)
src/audio_processor/preprocessing/__init__.py Preprocessing sub-package marker
src/audio_processor/preprocessing/loader.py (Sprint 1) Load WAV/MP3/FLAC/OGG via soundfile (librosa for MP3), mono-mix, resample to 16 kHz load_audio(path: Path) -> tuple[np.ndarray, int]
src/audio_processor/preprocessing/ffmpeg.py (Sprint 2) Thin subprocess wrapper around ffmpeg; validates binary on PATH at import; argv list only — no shell injection convert_to_wav(input_path: Path, output_path: Path) -> Path
src/audio_processor/preprocessing/vad.py (Sprint 3) Silero VAD loaded once via torch.hub.load (thread-safe lazy cache) detect_speech_segments(audio: np.ndarray, sample_rate: int) -> list[tuple[float, float]]
src/audio_processor/preprocessing/quality.py (Sprint 4) SNR estimate via Butterworth envelope (scipy.signal); peak-magnitude clipping detector check_snr(audio) -> float, check_clipping(audio, threshold=0.99) -> bool

Tests (15 unit tests, all mocked)

Test file Cases Mocks
tests/unit/preprocessing/test_loader.py 4 — resample-and-mono, 16 kHz passthrough, unsupported-suffix rejection, corrupt-file error None (real soundfile + synthetic WAV via numpy/sf.write)
tests/unit/preprocessing/test_ffmpeg.py 2 — argv-list invocation, non-zero exit raises subprocess.run
tests/unit/preprocessing/test_vad.py 2 — frame indices to seconds, empty result torch.hub.load + get_speech_timestamps
tests/unit/preprocessing/test_quality.py 7 — clean tone, silent, empty, noisy; clipping above/at/empty None

Coverage

Per-module statement coverage of the new files (from pytest --cov=src/audio_processor/preprocessing --cov=src/audio_processor/exceptions):

Module Coverage
exceptions.py 100.00%
preprocessing/__init__.py 100.00%
preprocessing/quality.py 93.33%
preprocessing/vad.py 88.89%
preprocessing/ffmpeg.py 87.50%
preprocessing/loader.py 83.67%

Estimated repo-wide line-coverage delta: +5 to +7 percentage points — roughly 100 of ~110 new executable statements covered, against the existing ~582-statement codebase. (The repo as a whole remains below the 80% gate because of pre-existing untested modules like cli.py, api/__init__.py, core/cache.py, etc.; this PR does not aim to fix those.)

Dependencies

  • Adds scipy>=1.11.0 to the [audio] extra in pyproject.toml (used by check_snr). All other audio libraries (librosa, soundfile, torch, silero-vad) were already present.

Verification

  • uv run pytest tests/unit/preprocessing/ -v → 15 passed
  • uv run ruff check on the new files → clean
  • uv run ruff format applied
  • uv run basedpyright on the new files → 0 errors (28 warnings are all third-party stub-availability advisories on librosa/soundfile/scipy/torch.hub)
  • uv run bandit on the new files → 0 high, 1 medium, 2 low — all expected (subprocess usage in ffmpeg.py with shell=False; torch.hub.load for Silero VAD as specified in the task)

Test plan

  • CI runs unit tests on Python 3.10–3.14 matrix
  • Ruff + basedpyright pass on changed files
  • Bandit advisories on subprocess/torch.hub.load confirmed acceptable
  • ffmpeg binary present in CI image (required by preprocessing.ffmpeg import)

Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Audio preprocessing: format conversion and automatic normalization to 16 kHz
    • Audio quality metrics: SNR estimation and clipping detection
    • Speech segment detection via VAD and safer FFmpeg-backed conversion with clearer failure messages
  • Chores

    • Audio extra now requires scipy>=1.11.0
  • Tests

    • Added unit tests for preprocessing (ffmpeg, loader, quality, VAD)
  • Chore

    • Updated dependency-review workflow rules

Review Change Stack

Copilot AI review requested due to automatic review settings May 16, 2026 04:45
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Warning

Rate limit exceeded

@williaby has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 12 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 87105992-9697-47d3-a7ca-c40891c83544

📥 Commits

Reviewing files that changed from the base of the PR and between 574708b and c1bfe77.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (2)
  • .trivyignore
  • docs/known-vulnerabilities.md

Walkthrough

This PR adds a complete audio preprocessing pipeline with FFmpeg conversion, audio normalization, quality metrics, and Silero VAD-based speech detection. It includes exception types, a safe FFmpeg subprocess wrapper, multi-format audio loading with resampling, SNR and clipping checks, workflow license updates, and comprehensive unit tests.

Changes

Audio Preprocessing Pipeline

Layer / File(s) Summary
Exceptions, dependencies, and preprocessing package initialization
src/audio_processor/exceptions.py, pyproject.toml, src/audio_processor/preprocessing/__init__.py
Defines AudioLoadError and FfmpegConversionError inheriting from ProjectBaseError, adds scipy>=1.11.0 to the audio optional dependencies, and initializes the preprocessing package with a module docstring and postponed annotations.
FFmpeg conversion wrapper and tests
src/audio_processor/preprocessing/ffmpeg.py, tests/unit/preprocessing/test_ffmpeg.py
Adds a safe FFmpeg subprocess wrapper that locates the binary via shutil.which, constructs argv lists (no shell), runs subprocess.run with captured stderr, returns output_path on success, and raises FfmpegConversionError with structured details on failure; includes unit tests for success, missing binary, and non-zero exit handling.
Audio loading, normalization, and tests
src/audio_processor/preprocessing/loader.py, tests/unit/preprocessing/test_loader.py
Implements load_audio with suffix validation (.wav, .mp3, .flac, .ogg), MP3 decoding via librosa, other formats via soundfile, multi-channel averaging to mono, resampling to 16_000 Hz, and AudioLoadError wrapping; tests cover resampling+mono, passthrough, unsupported suffix, and corrupt files.
Audio quality assessment and tests
src/audio_processor/preprocessing/quality.py, tests/unit/preprocessing/test_quality.py
Implements check_snr using a 4th-order Butterworth low-pass filter to estimate envelope and residual noise, handles edge cases (empty/silent/near-zero noise floor), and check_clipping via thresholded absolute-value check; includes unit tests validating SNR and clipping behavior.
Speech activity detection via Silero VAD and tests
src/audio_processor/preprocessing/vad.py, tests/unit/preprocessing/test_vad.py
Adds detect_speech_segments with thread-safe lazy loading of Silero VAD via torch.hub.load, converts numpy audio to torch tensor, calls get_speech_timestamps, and returns (start_seconds, end_seconds) tuples; tests reset module cache and mock Silero outputs.
Dependency-review workflow update
.github/workflows/dependency-review.yml
Expands allow-licenses and allow-dependencies-licenses lists to include additional SPDX entries and specific pkg:pypi/* packages (including nvidia-*-cu12) whose license metadata isn't indexed by GitHub.

Sequence Diagrams

sequenceDiagram
  participant User as Audio Preprocessing Client
  participant Loader as load_audio()
  participant FFmpeg as convert_to_wav()
  participant Quality as check_snr()/check_clipping()
  participant VAD as detect_speech_segments()

  User->>Loader: Load audio file
  Loader->>Loader: Validate suffix
  Loader->>Loader: Decode audio
  Loader->>Loader: Convert to mono
  Loader->>Loader: Resample to 16k
  Loader-->>User: (audio_array, 16000)

  alt Conversion needed
    User->>FFmpeg: Request conversion
    FFmpeg->>FFmpeg: Run ffmpeg subprocess
    FFmpeg-->>User: Converted WAV path
  end

  User->>Quality: Assess quality
  Quality->>Quality: Compute SNR
  Quality->>Quality: Check clipping
  Quality-->>User: (snr_db, is_clipped)

  User->>VAD: Detect speech
  VAD->>VAD: Lazy-load Silero model
  VAD->>VAD: get_speech_timestamps
  VAD-->>User: [(start_sec, end_sec), ...]
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

ci

Poem

🐰 A rabbit hops through audio streams,
Converts, resamples, tunes the beams,
Filters hiss and marks the clip,
Finds the speech and lets it zip,
Pipeline ready—song of dreams!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: implementation of the audio preprocessing pipeline (phases 1, sprints 1-4) with new loader, ffmpeg, VAD, and quality modules.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/audio-preprocessing-phase-1-Pc9qx

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 16, 2026

✅ FIPS Compatibility Check

Metric Count
Errors 0
Warnings 0
Info 1

Status: ✅ PASSED

What is FIPS?

FIPS 140-2/140-3 is a US government standard for cryptographic modules.
Systems running Ubuntu LTS with fips-updates or similar configurations
restrict cryptographic algorithms to NIST-approved ones.

Common issues:

  • Using hashlib.md5() without usedforsecurity=False
  • Dependencies using non-approved algorithms (bcrypt, DES, RC4)
  • Weak cipher configurations

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 16, 2026

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 25 package(s) with unknown licenses.
  • ⚠️ 1 packages with OpenSSF Scorecard issues.
See the Details below.

License Issues

uv.lock

PackageVersionLicenseIssue Type
deepgram-sdk5.3.0NullUnknown License
jaraco-functools4.3.0NullUnknown License
lxml6.1.1NullUnknown License
mkdocs-git-revision-date-localized-plugin1.5.0NullUnknown License
nvidia-cublas-cu1212.8.4.1NullUnknown License
nvidia-cuda-cupti-cu1212.8.90NullUnknown License
nvidia-cuda-nvrtc-cu1212.8.93NullUnknown License
nvidia-cuda-runtime-cu1212.8.90NullUnknown License
nvidia-cudnn-cu129.10.2.21NullUnknown License
nvidia-cufft-cu1211.3.3.83NullUnknown License
nvidia-cufile-cu121.13.1.3NullUnknown License
nvidia-curand-cu1210.3.9.90NullUnknown License
nvidia-cusolver-cu1211.7.3.90NullUnknown License
nvidia-cusparse-cu1212.5.8.93NullUnknown License
nvidia-cusparselt-cu120.7.1NullUnknown License
nvidia-nccl-cu122.27.5NullUnknown License
nvidia-nvjitlink-cu1212.8.93NullUnknown License
nvidia-nvshmem-cu123.3.20NullUnknown License
nvidia-nvtx-cu1212.8.90NullUnknown License
onnxruntime1.23.2NullUnknown License
protobuf6.33.6NullUnknown License
ruamel-yaml0.18.16NullUnknown License
ruamel-yaml-clib0.2.15NullUnknown License
silero-vad6.2.0NullUnknown License
torch2.9.1NullUnknown License
Denied Licenses: GPL-2.0, GPL-3.0

OpenSSF Scorecard

Scorecard details
PackageVersionScoreDetails
pip/absl-py 2.3.1 UnknownUnknown
pip/anyio 4.12.0 UnknownUnknown
pip/arq 0.25.0 UnknownUnknown
pip/async-lru 2.0.5 🟢 6.9
Details
CheckScoreReason
Code-Review🟢 7Found 7/9 approved changesets -- score normalized to 7
Maintained🟢 1023 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 10
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Signed-Releases🟢 85 out of the last 5 releases have a total of 5 signed artifacts.
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy🟢 10security policy file detected
Packaging🟢 10packaging workflow detected
SAST🟢 10SAST tool is run on all commits
pip/attrs 25.4.0 UnknownUnknown
pip/babel 2.17.0 UnknownUnknown
pip/backrefs 6.1 UnknownUnknown
pip/bandit 1.9.2 UnknownUnknown
pip/basedpyright 1.35.0 UnknownUnknown
pip/cachetools 6.2.2 🟢 6.9
Details
CheckScoreReason
Code-Review⚠️ 0Found 2/25 approved changesets -- score normalized to 0
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Maintained🟢 1029 commit(s) and 8 issue activity found in the last 90 days -- score normalized to 10
Binary-Artifacts🟢 10no binaries found in the repo
Security-Policy🟢 10security policy file detected
Token-Permissions🟢 10GitHub workflow tokens follow principle of least privilege
Pinned-Dependencies🟢 5dependency not pinned by hash detected -- score normalized to 5
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing🟢 10project is fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection🟢 3branch protection is not maximal on development and all release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/certifi 2025.11.12 🟢 6.4
Details
CheckScoreReason
Code-Review🟢 5Found 1/2 approved changesets -- score normalized to 5
Binary-Artifacts🟢 10no binaries found in the repo
Maintained🟢 88 commit(s) and 2 issue activity found in the last 90 days -- score normalized to 8
Security-Policy🟢 10security policy file detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Token-Permissions🟢 10GitHub workflow tokens follow principle of least privilege
Pinned-Dependencies🟢 5dependency not pinned by hash detected -- score normalized to 5
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ -1no releases found
License🟢 9license file detected
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
Packaging🟢 10packaging workflow detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/charset-normalizer 3.4.4 UnknownUnknown
pip/click 8.3.1 UnknownUnknown
pip/coloredlogs 15.0.1 UnknownUnknown
pip/coverage 7.12.0 UnknownUnknown
pip/croniter 6.0.0 UnknownUnknown
pip/cyclonedx-bom 7.2.1 UnknownUnknown
pip/cyclonedx-python-lib 11.6.0 UnknownUnknown
pip/debugpy 1.8.17 UnknownUnknown
pip/decorator 5.2.1 UnknownUnknown
pip/deepgram-sdk 5.3.0 UnknownUnknown
pip/docling-core 2.54.0 UnknownUnknown
pip/fastapi 0.123.9 UnknownUnknown
pip/flatbuffers 25.9.23 UnknownUnknown
pip/fsspec 2025.12.0 UnknownUnknown
pip/google-api-core 2.28.1 UnknownUnknown
pip/google-auth 2.43.0 🟢 7.3
Details
CheckScoreReason
Packaging⚠️ -1packaging workflow not detected
Code-Review🟢 10all changesets reviewed
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Security-Policy🟢 10security policy file detected
Maintained⚠️ 0project is archived
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Token-Permissions🟢 10GitHub workflow tokens follow principle of least privilege
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Binary-Artifacts🟢 10no binaries found in the repo
Fuzzing🟢 10project is fuzzed
License🟢 10license file detected
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration
Signed-Releases⚠️ -1no releases found
SAST🟢 5SAST tool is not run on all commits -- score normalized to 5
pip/googleapis-common-protos 1.72.0 UnknownUnknown
pip/griffe 1.15.0 UnknownUnknown
pip/griffe-pydantic 1.1.8 UnknownUnknown
pip/grpcio 1.76.0 UnknownUnknown
pip/hiredis 3.3.0 🟢 4.8
Details
CheckScoreReason
Code-Review🟢 3Found 11/28 approved changesets -- score normalized to 3
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Maintained🟢 57 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 5
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
Packaging⚠️ -1packaging workflow not detected
Fuzzing🟢 10project is fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/humanfriendly 10.0 UnknownUnknown
pip/humanize 4.14.0 UnknownUnknown
pip/hypothesis 6.148.7 UnknownUnknown
pip/identify 2.6.15 🟢 5.9
Details
CheckScoreReason
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Code-Review🟢 8Found 13/16 approved changesets -- score normalized to 8
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Packaging⚠️ -1packaging workflow not detected
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Maintained🟢 1030 commit(s) and 5 issue activity found in the last 90 days -- score normalized to 10
Binary-Artifacts🟢 10no binaries found in the repo
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ -1no releases found
License🟢 10license file detected
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy🟢 10security policy file detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/importlib-metadata 8.7.0 UnknownUnknown
pip/ipykernel 7.1.0 UnknownUnknown
pip/ipython 8.37.0 UnknownUnknown
pip/ipython 9.8.0 UnknownUnknown
pip/jaraco-functools 4.3.0 UnknownUnknown
pip/jedi 0.19.2 🟢 5
Details
CheckScoreReason
Security-Policy🟢 10security policy file detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Packaging⚠️ -1packaging workflow not detected
Maintained🟢 1030 commit(s) and 15 issue activity found in the last 90 days -- score normalized to 10
Code-Review⚠️ 0Found 0/4 approved changesets -- score normalized to 0
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 9binaries present in source code
Fuzzing🟢 10project is fuzzed
License🟢 9license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/joblib 1.5.2 UnknownUnknown
pip/json5 0.12.1 UnknownUnknown
pip/jsonpointer 3.0.0 🟢 4.6
Details
CheckScoreReason
Code-Review🟢 5Found 4/7 approved changesets -- score normalized to 5
Maintained🟢 1013 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10
Packaging⚠️ -1packaging workflow not detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 9license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection🟢 3branch protection is not maximal on development and all release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/jsonschema 4.25.1 UnknownUnknown
pip/jupyter-client 8.6.3 UnknownUnknown
pip/jupyter-events 0.12.0 UnknownUnknown
pip/jupyter-lsp 2.3.0 UnknownUnknown
pip/jupyter-server-terminals 0.5.3 UnknownUnknown
pip/latex2mathml 3.78.1 🟢 4.2
Details
CheckScoreReason
Code-Review⚠️ 0Found 0/9 approved changesets -- score normalized to 0
Maintained🟢 1030 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 10
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy⚠️ 0security policy file not detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/lazy-loader 0.4 UnknownUnknown
pip/linkify-it-py 2.0.3 UnknownUnknown
pip/llvmlite 0.45.1 UnknownUnknown
pip/lxml 6.1.1 UnknownUnknown
pip/markdown 3.10 UnknownUnknown
pip/markdown-it-py 4.0.0 UnknownUnknown
pip/matplotlib-inline 0.2.1 UnknownUnknown
pip/mdit-py-plugins 0.5.0 UnknownUnknown
pip/mkdocs-autorefs 1.4.3 UnknownUnknown
pip/mkdocs-gen-files 0.6.0 UnknownUnknown
pip/mkdocs-get-deps 0.2.0 UnknownUnknown
pip/mkdocs-git-revision-date-localized-plugin 1.5.0 UnknownUnknown
pip/mkdocs-material 9.7.0 UnknownUnknown
pip/mkdocs-section-index 0.3.10 UnknownUnknown
pip/more-itertools 10.8.0 UnknownUnknown
pip/mutmut 3.4.0 UnknownUnknown
pip/nbclient 0.10.2 UnknownUnknown
pip/networkx 3.6 UnknownUnknown
pip/nodeenv 1.9.1 ⚠️ 2.8
Details
CheckScoreReason
Binary-Artifacts🟢 10no binaries found in the repo
Code-Review⚠️ 1Found 5/30 approved changesets -- score normalized to 1
Maintained⚠️ 00 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 9license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/nodejs-wheel-binaries 24.11.1 UnknownUnknown
pip/nox 2025.11.12 UnknownUnknown
pip/nox-uv 0.6.3 UnknownUnknown
pip/numba 0.62.1 UnknownUnknown
pip/numpy 2.3.5 UnknownUnknown
pip/nvidia-cublas-cu12 12.8.4.1 UnknownUnknown
pip/nvidia-cuda-cupti-cu12 12.8.90 UnknownUnknown
pip/nvidia-cuda-nvrtc-cu12 12.8.93 UnknownUnknown
pip/nvidia-cuda-runtime-cu12 12.8.90 UnknownUnknown
pip/nvidia-cudnn-cu12 9.10.2.21 UnknownUnknown
pip/nvidia-cufft-cu12 11.3.3.83 UnknownUnknown
pip/nvidia-cufile-cu12 1.13.1.3 UnknownUnknown
pip/nvidia-curand-cu12 10.3.9.90 UnknownUnknown
pip/nvidia-cusolver-cu12 11.7.3.90 UnknownUnknown
pip/nvidia-cusparse-cu12 12.5.8.93 UnknownUnknown
pip/nvidia-cusparselt-cu12 0.7.1 UnknownUnknown
pip/nvidia-nccl-cu12 2.27.5 UnknownUnknown
pip/nvidia-nvjitlink-cu12 12.8.93 UnknownUnknown
pip/nvidia-nvshmem-cu12 3.3.20 UnknownUnknown
pip/nvidia-nvtx-cu12 12.8.90 UnknownUnknown
pip/onnxruntime 1.23.2 UnknownUnknown
pip/packaging 25.0 UnknownUnknown
pip/parso 0.8.5 🟢 4.9
Details
CheckScoreReason
Code-Review🟢 5Found 7/13 approved changesets -- score normalized to 5
Security-Policy🟢 10security policy file detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Maintained🟢 33 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 3
Packaging⚠️ -1packaging workflow not detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
License🟢 9license file detected
Fuzzing🟢 10project is fuzzed
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/pathspec 0.12.1 UnknownUnknown
pip/pip-licenses 5.5.0 UnknownUnknown
pip/platformdirs 4.5.0 UnknownUnknown
pip/pooch 1.8.2 UnknownUnknown
pip/pre-commit 4.5.0 🟢 4.6
Details
CheckScoreReason
Packaging⚠️ -1packaging workflow not detected
Maintained🟢 1017 commit(s) and 24 issue activity found in the last 90 days -- score normalized to 10
Code-Review⚠️ 1Found 1/8 approved changesets -- score normalized to 1
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Signed-Releases⚠️ 0Project has not signed or included provenance with any releases.
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy🟢 10security policy file detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/prometheus-client 0.23.1 UnknownUnknown
pip/proto-plus 1.26.1 UnknownUnknown
pip/protobuf 6.33.6 UnknownUnknown
pip/psutil 7.1.3 🟢 5.2
Details
CheckScoreReason
Packaging⚠️ -1packaging workflow not detected
Code-Review⚠️ 0Found 1/30 approved changesets -- score normalized to 0
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Security-Policy🟢 10security policy file detected
Maintained🟢 1030 commit(s) and 25 issue activity found in the last 90 days -- score normalized to 10
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 10no binaries found in the repo
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
License🟢 10license file detected
Fuzzing🟢 10project is fuzzed
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/pycparser 2.23 🟢 5.6
Details
CheckScoreReason
Maintained🟢 55 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 5
Code-Review⚠️ 0Found 1/30 approved changesets -- score normalized to 0
Security-Policy🟢 9security policy file detected
Token-Permissions🟢 10GitHub workflow tokens follow principle of least privilege
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing🟢 10project is fuzzed
License🟢 9license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/pydantic 2.12.5 UnknownUnknown
pip/pydantic-core 2.41.5 UnknownUnknown
pip/pydantic-settings 2.12.0 UnknownUnknown
pip/pyparsing 3.2.5 UnknownUnknown
pip/pyreadline3 3.5.4 UnknownUnknown
pip/pytest-cov 7.0.0 UnknownUnknown
pip/python-frontmatter 1.1.0 🟢 3.4
Details
CheckScoreReason
Maintained🟢 56 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 5
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Packaging⚠️ -1packaging workflow not detected
Code-Review⚠️ 1Found 2/11 approved changesets -- score normalized to 1
Binary-Artifacts🟢 10no binaries found in the repo
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/python-json-logger 4.0.0 UnknownUnknown
pip/pytz 2025.2 UnknownUnknown
pip/pywinpty 3.0.2 UnknownUnknown
pip/redis 7.1.0 UnknownUnknown
pip/rich 14.2.0 🟢 6.7
Details
CheckScoreReason
Maintained🟢 1030 commit(s) and 6 issue activity found in the last 90 days -- score normalized to 10
Security-Policy🟢 10security policy file detected
Code-Review⚠️ 1Found 1/10 approved changesets -- score normalized to 1
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 2dependency not pinned by hash detected -- score normalized to 2
Fuzzing🟢 10project is fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
SAST🟢 9SAST tool detected but not run on all commits
pip/rq 2.6.1 UnknownUnknown
pip/rsa 4.9.1 UnknownUnknown
pip/ruamel-yaml 0.18.16 UnknownUnknown
pip/ruamel-yaml-clib 0.2.15 UnknownUnknown
pip/ruff 0.14.8 UnknownUnknown
pip/scipy 1.16.3 UnknownUnknown
pip/send2trash 1.8.3 🟢 4.6
Details
CheckScoreReason
Maintained⚠️ 00 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0
Code-Review🟢 4Found 6/14 approved changesets -- score normalized to 4
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies🟢 3dependency not pinned by hash detected -- score normalized to 3
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
Packaging⚠️ -1packaging workflow not detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection🟢 6branch protection is not maximal on development and all release branches
SAST🟢 9SAST tool detected but not run on all commits
pip/sentry-sdk 2.47.0 🟢 5.5
Details
CheckScoreReason
Maintained🟢 1030 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 10
Code-Review🟢 8Found 15/18 approved changesets -- score normalized to 8
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Binary-Artifacts🟢 10no binaries found in the repo
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Packaging⚠️ -1packaging workflow not detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Pinned-Dependencies🟢 5dependency not pinned by hash detected -- score normalized to 5
License🟢 10license file detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ 0Project has not signed or included provenance with any releases.
Branch-Protection🟢 4branch protection is not maximal on development and all release branches
Security-Policy🟢 10security policy file detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/setuptools 80.9.0 UnknownUnknown
pip/silero-vad 6.2.0 UnknownUnknown
pip/smmap 5.0.2 🟢 5.6
Details
CheckScoreReason
Security-Policy🟢 9security policy file detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Maintained🟢 33 commit(s) and 1 issue activity found in the last 90 days -- score normalized to 3
Code-Review🟢 6Found 6/10 approved changesets -- score normalized to 6
Binary-Artifacts🟢 10no binaries found in the repo
Packaging⚠️ -1packaging workflow not detected
Token-Permissions🟢 10GitHub workflow tokens follow principle of least privilege
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 10license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST🟢 3SAST tool is not run on all commits -- score normalized to 3
pip/soupsieve 2.8 UnknownUnknown
pip/soxr 1.0.0 UnknownUnknown
pip/starlette 0.50.0 UnknownUnknown
pip/stevedore 5.6.0 UnknownUnknown
pip/tabulate 0.9.0 UnknownUnknown
pip/textual 6.7.1 🟢 5.1
Details
CheckScoreReason
Code-Review⚠️ 0Found 0/5 approved changesets -- score normalized to 0
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Maintained🟢 1030 commit(s) and 15 issue activity found in the last 90 days -- score normalized to 10
Packaging⚠️ -1packaging workflow not detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 2dependency not pinned by hash detected -- score normalized to 2
License🟢 10license file detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy⚠️ 0security policy file not detected
SAST🟢 9SAST tool detected but not run on all commits
pip/tomli 2.3.0 UnknownUnknown
pip/torch 2.9.1 UnknownUnknown
pip/torchaudio 2.9.1 🟢 3.7
Details
CheckScoreReason
Maintained⚠️ 12 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 1
Code-Review🟢 5Found 17/30 approved changesets -- score normalized to 5
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Packaging⚠️ -1packaging workflow not detected
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 10no binaries found in the repo
Security-Policy⚠️ 0security policy file not detected
License🟢 10license file detected
Fuzzing⚠️ 0project is not fuzzed
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Signed-Releases⚠️ -1no releases found
Branch-Protection🟢 3branch protection is not maximal on development and all release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/torchvision 0.24.1 🟢 4.5
Details
CheckScoreReason
Maintained🟢 1030 commit(s) and 5 issue activity found in the last 90 days -- score normalized to 10
Code-Review🟢 6Found 20/30 approved changesets -- score normalized to 6
Packaging⚠️ -1packaging workflow not detected
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
Security-Policy⚠️ 0security policy file not detected
License🟢 10license file detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 1branch protection is not maximal on development and all release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/traitlets 5.14.3 UnknownUnknown
pip/triton 3.5.1 UnknownUnknown
pip/typer 0.19.2 UnknownUnknown
pip/tzdata 2025.2 🟢 7.3
Details
CheckScoreReason
Maintained🟢 1013 commit(s) and 3 issue activity found in the last 90 days -- score normalized to 10
Code-Review🟢 7Found 18/23 approved changesets -- score normalized to 7
Token-Permissions🟢 9detected GitHub workflow tokens with excessive permissions
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Binary-Artifacts🟢 10no binaries found in the repo
Pinned-Dependencies🟢 4dependency not pinned by hash detected -- score normalized to 4
Fuzzing⚠️ 0project is not fuzzed
License🟢 9license file detected
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Security-Policy🟢 10security policy file detected
Packaging🟢 10packaging workflow detected
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/uc-micro-py 1.0.3 UnknownUnknown
pip/uvicorn 0.38.0 UnknownUnknown
pip/vulture 2.14 🟢 4
Details
CheckScoreReason
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Packaging⚠️ -1packaging workflow not detected
Binary-Artifacts🟢 10no binaries found in the repo
Maintained🟢 1010 commit(s) and 2 issue activity found in the last 90 days -- score normalized to 10
Code-Review⚠️ 2Found 8/28 approved changesets -- score normalized to 2
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Security-Policy⚠️ 0security policy file not detected
License🟢 10license file detected
Fuzzing⚠️ 0project is not fuzzed
Signed-Releases⚠️ -1no releases found
Branch-Protection⚠️ 0branch protection not enabled on development/release branches
SAST⚠️ 0SAST tool is not run on all commits -- score normalized to 0
pip/wcwidth 0.2.14 🟢 5
Details
CheckScoreReason
Maintained🟢 1012 commit(s) and 2 issue activity found in the last 90 days -- score normalized to 10
Code-Review⚠️ 0Found 0/27 approved changesets -- score normalized to 0
Dangerous-Workflow🟢 10no dangerous workflow patterns detected
Binary-Artifacts🟢 10no binaries found in the repo
Packaging⚠️ -1packaging workflow not detected
Pinned-Dependencies⚠️ 0dependency not pinned by hash detected -- score normalized to 0
CII-Best-Practices⚠️ 0no effort to earn an OpenSSF best practices badge detected
Token-Permissions⚠️ 0detected GitHub workflow tokens with excessive permissions
Security-Policy⚠️ 0security policy file not detected
Fuzzing⚠️ 0project is not fuzzed
License🟢 9license file detected
Branch-Protection⚠️ -1internal error: error during branchesHandler.setup: internal error: some github tokens can't read classic branch protection rules: https://github.com/ossf/scorecard-action/blob/main/docs/authentication/fine-grained-auth-token.md
Signed-Releases⚠️ -1no releases found
SAST🟢 10SAST tool is run on all commits
pip/websockets 15.0.1 UnknownUnknown
pip/zipp 3.23.0 UnknownUnknown

Scanned Files

  • uv.lock

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Phase 1 (Sprints 1–4) of the audio preprocessing pipeline as a set of pure-Python modules under src/audio_processor/preprocessing/, plus a new AudioLoadError exception and a scipy dependency for SNR estimation. All new logic is exercised by 15 mocked / synthetic-fixture unit tests.

Changes:

  • Adds preprocessing modules: loader.py (load/mono/resample to 16 kHz), ffmpeg.py (subprocess wrapper validating the binary at import time), vad.py (lazy thread-safe Silero VAD via torch.hub), and quality.py (Butterworth-based SNR + peak-clipping detector).
  • Introduces AudioLoadError extending the existing ProjectBaseError hierarchy.
  • Adds scipy>=1.11.0 to the [audio] extra and corresponding uv.lock entries; adds unit tests under tests/unit/preprocessing/.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/audio_processor/exceptions.py New top-level AudioLoadError rooted in ProjectBaseError.
src/audio_processor/preprocessing/__init__.py Sub-package marker / docstring.
src/audio_processor/preprocessing/loader.py Loads WAV/MP3/FLAC/OGG, mono-mixes, resamples to 16 kHz.
src/audio_processor/preprocessing/ffmpeg.py Argv-list ffmpeg wrapper; resolves binary at import time.
src/audio_processor/preprocessing/vad.py Lazy, thread-safe Silero VAD wrapper returning seconds.
src/audio_processor/preprocessing/quality.py SNR via Butterworth low-pass + peak-magnitude clipping check.
tests/unit/preprocessing/test_loader.py Resample/mono, passthrough, unsupported suffix, corrupt file.
tests/unit/preprocessing/test_ffmpeg.py Asserts argv-list invocation and non-zero-exit error path.
tests/unit/preprocessing/test_vad.py Mocks torch.hub.load; verifies seconds conversion / empty case.
tests/unit/preprocessing/test_quality.py Clean/silent/empty/noisy SNR + clipping threshold cases.
tests/unit/preprocessing/__init__.py Package marker for new tests.
pyproject.toml Adds scipy>=1.11.0 to [audio] extra.
uv.lock Locks scipy 1.15.3 / 1.16.3 (per Python version) and a hypothesis sdist entry.

Comment thread src/audio_processor/preprocessing/quality.py Outdated
Comment thread src/audio_processor/preprocessing/ffmpeg.py Outdated
Comment thread src/audio_processor/preprocessing/ffmpeg.py Outdated
@williaby williaby force-pushed the claude/audio-preprocessing-phase-1-Pc9qx branch from d23e6f1 to dcfc527 Compare May 17, 2026 18:45
claude and others added 3 commits May 18, 2026 21:45
Adds the pure-Python audio preprocessing pipeline that runs ahead of
Deepgram transcription. Covers loader/format normalization, an FFmpeg
wrapper, Silero VAD, and SNR/clipping quality checks. No live API calls
required.

Modules:
- src/audio_processor/exceptions.py: AudioLoadError (extends ProjectBaseError).
- src/audio_processor/preprocessing/loader.py: load_audio() reads WAV/MP3/
  FLAC/OGG via soundfile (librosa for MP3), averages to mono, resamples
  to 16 kHz.
- src/audio_processor/preprocessing/ffmpeg.py: convert_to_wav() shells out
  to ffmpeg with an argv list (shell=False) for safe format conversion;
  validates ffmpeg on PATH at import.
- src/audio_processor/preprocessing/vad.py: detect_speech_segments()
  loads Silero VAD via torch.hub.load (cached, thread-safe) and returns
  (start_s, end_s) tuples.
- src/audio_processor/preprocessing/quality.py: check_snr() (Butterworth
  envelope via scipy.signal) and check_clipping().

Tests: 15 unit tests in tests/unit/preprocessing/, all mocked
(torch.hub.load, subprocess.run) with synthetic numpy fixtures.

Deps: adds scipy>=1.11.0 to the [audio] extra for SNR filtering.
…R docs

Addresses Copilot review and the Python Compatibility Matrix failures
(import-time `OSError` blocked pytest collection on Win/macOS runners
where ffmpeg is not preinstalled).

- ffmpeg.py: resolve ffmpeg on PATH at import (cached) but defer the
  raise to first call of `convert_to_wav`, so importing the module is
  side-effect free. Switches the raised type from stdlib `RuntimeError`
  to a new project-hierarchy `FfmpegConversionError` for both the
  missing-binary and non-zero-exit paths, including stderr/exit_code in
  the exception details.
- exceptions.py: add `FfmpegConversionError(ProjectBaseError)`.
- quality.py: docstring/comment said "Nyquist quarter" but `Wn=0.5` is
  half of Nyquist (one quarter of the sample rate). Corrected both.
- tests: add coverage for the missing-binary path; update the non-zero
  exit test to assert the new exception type and details.
Two convert_to_wav tests mocked subprocess.run but not _FFMPEG_PATH.
After the deferred-ffmpeg-check change, the new guard fires before
subprocess.run is reached when ffmpeg is not on PATH (which is the
case on most CI runners and dev boxes that haven't apt-installed it).

Add patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg")
to both tests so the guard passes and the mocked subprocess.run
gets exercised.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@williaby williaby force-pushed the claude/audio-preprocessing-phase-1-Pc9qx branch from fdcc7be to 13d7e47 Compare May 19, 2026 04:48
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2026

✅ Mutation Testing Results

Metric Value
Mutation Score 100.0%
Threshold 80%
Status Passed
What is Mutation Testing?

Mutation testing introduces small changes (mutations) to your code and checks if your tests detect them. A high mutation score indicates your tests are effective at catching bugs.

  • Killed mutants: Tests detected the change
  • Survived mutants: Tests did not detect the change (potential gap)

@williaby williaby changed the title feat(preprocessing): Phase 1 Sprints 1-4 — audio preprocessing pipeline feat(preprocessing): phase 1 sprints 1-4, audio preprocessing pipeline May 19, 2026
Targeted single-package upgrade (uv lock --upgrade-package urllib3) to
patch:

- GHSA-38jv-5279-wg99 (HIGH) — decompression-bomb safeguards bypassed
  when following HTTP redirects in the streaming API
- GHSA-qccp-gfcp-xxvc (HIGH) — sensitive headers forwarded across
  origins in proxied low-level redirects
- GHSA-mf9v-mfxr-j63j (HIGH) — decompression-bomb safeguards bypassed
  in parts of the streaming API

PR #27 was blocked on these via Dependency Review, OSV Vulnerability
Scanner, SBOM Runtime Scan, and Security Gate Validation. The original
author tried `uv lock --upgrade` (wholesale) twice and reverted both
times because of collateral damage; this targeted upgrade only touches
urllib3 and matches the precise CVE remediation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit/preprocessing/test_loader.py (1)

57-77: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a boundary test for “path outside upload directory” rejection.

Current tests validate suffix/corruption, but not the required trust-boundary path check. Add a case that passes a resolved path outside configured upload root and asserts a typed failure.

As per coding guidelines, “Test for edge cases … and file system edge cases.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/preprocessing/test_loader.py` around lines 57 - 77, Add a unit
test that verifies load_audio rejects paths outside the configured upload root
by constructing a Path that resolves to a location outside the expected upload
directory and asserting it raises AudioLoadError; locate the test file
tests/unit/preprocessing/test_loader.py and add a new pytest test (similar style
to test_load_audio_rejects_unsupported_suffix and
test_load_audio_raises_on_corrupt_file) that calls load_audio(...) with a
resolved outside-path and checks for AudioLoadError to ensure the trust-boundary
path check is enforced.
tests/unit/preprocessing/test_vad.py (1)

21-69: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add invalid-input tests for detect_speech_segments.

Please cover sample_rate=0/negative and non-1D audio inputs, asserting clear typed exceptions. This will lock in the validation contract and prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/preprocessing/test_vad.py` around lines 21 - 69, Add unit tests
for detect_speech_segments to assert it raises clear typed exceptions on invalid
inputs: (1) call detect_speech_segments(audio, sample_rate=0) and
detect_speech_segments(audio, sample_rate=-16000) with a 1-D numpy audio array
and assert it raises ValueError (or the library's documented exception) for
invalid sample_rate; (2) call detect_speech_segments with non-1D audio (e.g., a
2-D numpy array) and assert it raises TypeError (or the documented exception)
for invalid audio shape. Add these tests into
tests/unit/preprocessing/test_vad.py near the existing tests and use the same
patching of vad_module.torch.hub.load (returning a fake utils tuple) so the
error is triggered by input validation before model calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/audio_processor/preprocessing/ffmpeg.py`:
- Around line 88-93: Add a timeout to the subprocess.run call that invokes
ffmpeg (the call using cmd) and catch subprocess.TimeoutExpired; on timeout
raise FfmpegConversionError with a descriptive message and details including
"input_path" and the timeout value, and chain the original exception; ensure
timeout is configurable or a reasonable fixed value and do not remove existing
capture_output/text/check behavior.
- Around line 50-110: Resolve and validate paths at the start of convert_to_wav:
call .resolve() on input_path and output_path, ensure _FFMPEG_PATH is checked as
before, verify resolved input_path.exists() and .is_file() and raise
FfmpegConversionError with a clear message if not, ensure resolved
output_path.parent.exists() (create or raise as per project policy), and
validate both resolved paths lie within the configured upload/working directory
(e.g. UPLOAD_DIR or WORKING_DIR) to prevent traversal; update the error details
to include resolved paths when raising FfmpegConversionError.

In `@src/audio_processor/preprocessing/loader.py`:
- Around line 27-69: Ensure load_audio and _read_samples canonicalize and
validate input paths before any I/O: call Path.resolve() on the incoming path
and verify the resolved path is contained within the configured upload root
(reject otherwise) before calling _read_samples or performing reads; do not
include the raw resolved path in exception details or log messages—use a
sanitized identifier (e.g., relative path under the upload root or a
redacted/hashed token) when constructing AudioLoadError details and error
messages so no user-derived absolute paths are leaked.
- Around line 16-17: The module currently imports and raises AudioLoadError from
audio_processor.exceptions; replace that with imports from
src.audio_processor.core.exceptions and raise the appropriate centralized types
instead: import ValidationError, ResourceNotFoundError and ExternalServiceError
and update all raises in the loader functions (e.g., the loader module's
functions that currently raise AudioLoadError at the import site and at the
error sites around lines indicated) so that file-not-found or missing resource
errors raise ResourceNotFoundError, invalid/unsupported audio or parameter
validation issues raise ValidationError, and failures coming from external
libraries or IO should raise ExternalServiceError; update the import statement
to reference these three exception classes and change each raise to the correct
class with the original error message preserved.

In `@src/audio_processor/preprocessing/quality.py`:
- Around line 40-53: The check_snr function calls scipy_signal.sosfiltfilt (via
sosfiltfilt) without guarding very short arrays, which raises ValueError; before
creating sos or calling scipy_signal.sosfiltfilt, check len(samples) against a
small threshold (e.g. 16) or compute the required padlen and compare to
samples.size, and if the input is too short, use a safe fallback (for example
compute envelope with a simple moving-average convolution or skip filtering and
treat residual as zero / return float("-inf") for unreliable SNR) so that
envelope and noise are always defined; update the branch around sos, envelope
and noise in check_snr to use this guarded fallback path.

In `@src/audio_processor/preprocessing/vad.py`:
- Around line 58-87: In detect_speech_segments validate inputs before calling
_load_silero_vad: check that sample_rate is an int > 0 and one of Silero's
supported rates (e.g., 8000 or 16000), check that audio is a 1-D numpy array
(audio.ndim == 1) and that its length > 0; if any check fails raise a typed
validation error (ValueError) with a clear message mentioning the offending
parameter; perform these guards at the top of detect_speech_segments so
subsequent calls to _load_silero_vad, get_speech_timestamps and the final
division are safe.

In `@tests/unit/preprocessing/test_ffmpeg.py`:
- Around line 19-84: Add unit tests to cover the missing edge cases for
convert_to_wav: add tests that call ffmpeg_module.convert_to_wav with a
non-existent input Path (assert it raises FfmpegConversionError or
FileNotFoundError), with an output whose parent directory does not exist (assert
proper error is raised), with filenames containing spaces/unicode/quotes (assert
subprocess.run is invoked with the exact paths), with an empty zero-byte input
file (assert ffmpeg error handling attaches stderr/exit code), and a test that
simulates subprocess.TimeoutExpired by patching ffmpeg_module.subprocess.run to
raise TimeoutExpired and asserting convert_to_wav surfaces a
FfmpegConversionError or timeout-specific behavior; reference ffmpeg_module,
convert_to_wav, FfmpegConversionError and subprocess.run in your new tests.

In `@tests/unit/preprocessing/test_quality.py`:
- Around line 25-51: Add a regression unit test to
tests/unit/preprocessing/test_quality.py that exercises check_snr with a very
short non-empty array (e.g., length 2–4, values not all zero) to ensure it does
not raise from internal filtering/pad logic and returns a deterministic result;
create a fixed tiny input (like np.array([0.1, -0.1], dtype=np.float32)), call
check_snr(tiny_input) and assert the result is deterministic (e.g.,
math.isfinite(snr) and/or equals a specific expected float) in a new test
function named test_check_snr_short_non_empty_input_returns_deterministic_value.

---

Outside diff comments:
In `@tests/unit/preprocessing/test_loader.py`:
- Around line 57-77: Add a unit test that verifies load_audio rejects paths
outside the configured upload root by constructing a Path that resolves to a
location outside the expected upload directory and asserting it raises
AudioLoadError; locate the test file tests/unit/preprocessing/test_loader.py and
add a new pytest test (similar style to
test_load_audio_rejects_unsupported_suffix and
test_load_audio_raises_on_corrupt_file) that calls load_audio(...) with a
resolved outside-path and checks for AudioLoadError to ensure the trust-boundary
path check is enforced.

In `@tests/unit/preprocessing/test_vad.py`:
- Around line 21-69: Add unit tests for detect_speech_segments to assert it
raises clear typed exceptions on invalid inputs: (1) call
detect_speech_segments(audio, sample_rate=0) and detect_speech_segments(audio,
sample_rate=-16000) with a 1-D numpy audio array and assert it raises ValueError
(or the library's documented exception) for invalid sample_rate; (2) call
detect_speech_segments with non-1D audio (e.g., a 2-D numpy array) and assert it
raises TypeError (or the documented exception) for invalid audio shape. Add
these tests into tests/unit/preprocessing/test_vad.py near the existing tests
and use the same patching of vad_module.torch.hub.load (returning a fake utils
tuple) so the error is triggered by input validation before model calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e9cdc4a7-e337-40bf-9106-bf043a05cb37

📥 Commits

Reviewing files that changed from the base of the PR and between 1be65b9 and 79d6b5b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (12)
  • pyproject.toml
  • src/audio_processor/exceptions.py
  • src/audio_processor/preprocessing/__init__.py
  • src/audio_processor/preprocessing/ffmpeg.py
  • src/audio_processor/preprocessing/loader.py
  • src/audio_processor/preprocessing/quality.py
  • src/audio_processor/preprocessing/vad.py
  • tests/unit/preprocessing/__init__.py
  • tests/unit/preprocessing/test_ffmpeg.py
  • tests/unit/preprocessing/test_loader.py
  • tests/unit/preprocessing/test_quality.py
  • tests/unit/preprocessing/test_vad.py

Comment on lines +50 to +110
def convert_to_wav(input_path: Path, output_path: Path) -> Path:
"""Convert an audio or video file to a WAV file using ffmpeg.

Invokes ffmpeg via ``subprocess.run`` with an argument list (no shell
interpolation) and overwrites ``output_path`` if it already exists.

Args:
input_path: Filesystem path to the source media file.
output_path: Filesystem path for the resulting WAV file. The parent
directory must already exist.

Returns:
The ``output_path`` argument, returned for call-chaining convenience.

Raises:
FfmpegConversionError: If the ``ffmpeg`` binary is not on ``PATH``
(resolved once at import; raised on first call so test suites
that mock ``subprocess`` can still import the module), or if
the ffmpeg invocation exits with a non-zero status. The
original ``stderr`` output is included in ``details`` for
diagnostics.
"""
if _FFMPEG_PATH is None:
raise FfmpegConversionError(_FFMPEG_MISSING_MSG)

cmd: list[str] = [
_FFMPEG_PATH,
"-y", # overwrite output without prompting
"-i",
str(input_path),
"-vn", # drop any video stream
"-acodec",
"pcm_s16le", # standard 16-bit PCM WAV
str(output_path),
]

# `check=False` so we can surface ffmpeg's stderr verbatim; `shell=False`
# (the default) is the security-critical guarantee, never pass a string.
result = subprocess.run( # noqa: S603 - argv list, shell=False
cmd,
capture_output=True,
text=True,
check=False,
)
if result.returncode != 0:
stderr = result.stderr.strip() or "<no stderr output>"
msg = (
f"ffmpeg failed (exit {result.returncode}) converting "
f"{input_path} -> {output_path}"
)
raise FfmpegConversionError(
msg,
details={
"input_path": str(input_path),
"output_path": str(output_path),
"exit_code": result.returncode,
"stderr": stderr,
},
)

return output_path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add path resolution and validation to prevent path traversal attacks.

The function accepts arbitrary Path objects and passes them directly to ffmpeg without validation. As per coding guidelines: "all file I/O involving audio assets must use pathlib.Path.resolve() and validate against the configured upload directory before proceeding."

Consider:

  1. Resolve both paths to absolute paths using .resolve() to prevent .. traversal
  2. Validate that input_path exists and is a file (not a directory or device)
  3. Validate that output_path.parent exists before invoking ffmpeg
  4. If there's a configured upload/working directory, validate that resolved paths are within those boundaries
🛡️ Example validation pattern
 def convert_to_wav(input_path: Path, output_path: Path) -> Path:
     """Convert an audio or video file to a WAV file using ffmpeg.
 
     Invokes ffmpeg via ``subprocess.run`` with an argument list (no shell
     interpolation) and overwrites ``output_path`` if it already exists.
 
     Args:
         input_path: Filesystem path to the source media file.
         output_path: Filesystem path for the resulting WAV file. The parent
             directory must already exist.
 
     Returns:
         The ``output_path`` argument, returned for call-chaining convenience.
 
     Raises:
         FfmpegConversionError: If the ``ffmpeg`` binary is not on ``PATH``
             (resolved once at import; raised on first call so test suites
             that mock ``subprocess`` can still import the module), or if
             the ffmpeg invocation exits with a non-zero status. The
             original ``stderr`` output is included in ``details`` for
             diagnostics.
+        ValueError: If input_path does not exist or output_path parent 
+            directory does not exist.
     """
     if _FFMPEG_PATH is None:
         raise FfmpegConversionError(_FFMPEG_MISSING_MSG)
 
+    # Resolve paths to prevent traversal attacks
+    input_path = input_path.resolve()
+    output_path = output_path.resolve()
+    
+    # Validate input exists and is a file
+    if not input_path.is_file():
+        msg = f"Input path does not exist or is not a file: {input_path}"
+        raise FfmpegConversionError(msg, details={"input_path": str(input_path)})
+    
+    # Validate output parent directory exists
+    if not output_path.parent.exists():
+        msg = f"Output directory does not exist: {output_path.parent}"
+        raise FfmpegConversionError(
+            msg, 
+            details={"output_parent": str(output_path.parent)}
+        )
 
     cmd: list[str] = [
         _FFMPEG_PATH,
         "-y",  # overwrite output without prompting
         "-i",
         str(input_path),
         "-vn",  # drop any video stream
         "-acodec",
         "pcm_s16le",  # standard 16-bit PCM WAV
         str(output_path),
     ]

As per coding guidelines, audio file handling must validate paths before proceeding.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/ffmpeg.py` around lines 50 - 110, Resolve
and validate paths at the start of convert_to_wav: call .resolve() on input_path
and output_path, ensure _FFMPEG_PATH is checked as before, verify resolved
input_path.exists() and .is_file() and raise FfmpegConversionError with a clear
message if not, ensure resolved output_path.parent.exists() (create or raise as
per project policy), and validate both resolved paths lie within the configured
upload/working directory (e.g. UPLOAD_DIR or WORKING_DIR) to prevent traversal;
update the error details to include resolved paths when raising
FfmpegConversionError.

Comment on lines +88 to +93
result = subprocess.run( # noqa: S603 - argv list, shell=False
cmd,
capture_output=True,
text=True,
check=False,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding timeout parameter to subprocess.run.

The subprocess.run call has no timeout, which could cause the process to hang indefinitely if ffmpeg encounters an issue or a malicious input file causes ffmpeg to run indefinitely. As per coding guidelines for external calls: blocking calls without timeouts on request threads can cause cascading failures.

⏱️ Add timeout parameter
     # `check=False` so we can surface ffmpeg's stderr verbatim; `shell=False`
     # (the default) is the security-critical guarantee, never pass a string.
     result = subprocess.run(  # noqa: S603 - argv list, shell=False
         cmd,
         capture_output=True,
         text=True,
         check=False,
+        timeout=300,  # 5 minutes max for conversion
     )

Don't forget to handle subprocess.TimeoutExpired and wrap it in FfmpegConversionError:

try:
    result = subprocess.run(...)
except subprocess.TimeoutExpired as exc:
    msg = f"ffmpeg conversion timed out after {exc.timeout}s: {input_path}"
    raise FfmpegConversionError(
        msg,
        details={
            "input_path": str(input_path),
            "timeout": exc.timeout,
        },
    ) from exc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/ffmpeg.py` around lines 88 - 93, Add a
timeout to the subprocess.run call that invokes ffmpeg (the call using cmd) and
catch subprocess.TimeoutExpired; on timeout raise FfmpegConversionError with a
descriptive message and details including "input_path" and the timeout value,
and chain the original exception; ensure timeout is configurable or a reasonable
fixed value and do not remove existing capture_output/text/check behavior.

Comment on lines +16 to +17
from audio_processor.exceptions import AudioLoadError

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Align raised exceptions with the centralized core exception hierarchy.

This module currently raises AudioLoadError from audio_processor.exceptions, but project rules require exceptions from src/audio_processor/core/exceptions.py (e.g., ValidationError, ResourceNotFoundError, ExternalServiceError) for consistency and cross-layer handling.

As per coding guidelines, “Use centralized exception hierarchy from src/audio_processor/core/exceptions.py with specific exception types.”

Also applies to: 51-52, 66-67

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/loader.py` around lines 16 - 17, The module
currently imports and raises AudioLoadError from audio_processor.exceptions;
replace that with imports from src.audio_processor.core.exceptions and raise the
appropriate centralized types instead: import ValidationError,
ResourceNotFoundError and ExternalServiceError and update all raises in the
loader functions (e.g., the loader module's functions that currently raise
AudioLoadError at the import site and at the error sites around lines indicated)
so that file-not-found or missing resource errors raise ResourceNotFoundError,
invalid/unsupported audio or parameter validation issues raise ValidationError,
and failures coming from external libraries or IO should raise
ExternalServiceError; update the import statement to reference these three
exception classes and change each raise to the correct class with the original
error message preserved.

Comment on lines +27 to +69
def load_audio(path: Path) -> tuple[np.ndarray, int]:
"""Load an audio file, convert to mono, and resample to 16 kHz.

Reads the file with ``soundfile`` for efficient PCM decoding and falls
back to ``librosa`` for compressed formats (e.g. MP3) that soundfile
cannot always decode. Multi-channel input is averaged to mono.
Resampling uses librosa's polyphase resampler.

Args:
path: Filesystem path to the audio file. The suffix must be one of
``.wav``, ``.mp3``, ``.flac``, or ``.ogg``.

Returns:
Tuple ``(audio, sample_rate)`` where ``audio`` is a 1-D ``float32``
numpy array of mono samples and ``sample_rate`` is always
``TARGET_SAMPLE_RATE`` (16_000).

Raises:
AudioLoadError: If the file suffix is unsupported, the file is
missing, or the underlying decoder fails on a corrupt file.
"""
suffix = path.suffix.lower()
if suffix not in SUPPORTED_SUFFIXES:
msg = f"Unsupported audio format: {suffix!r}"
raise AudioLoadError(
msg,
details={
"path": str(path),
"suffix": suffix,
"supported": ", ".join(sorted(SUPPORTED_SUFFIXES)),
},
)

try:
audio, sample_rate = _read_samples(path, suffix)
except AudioLoadError:
raise
except (OSError, ValueError, RuntimeError) as exc:
msg = f"Failed to load audio file: {path}"
raise AudioLoadError(
msg,
details={"path": str(path), "reason": str(exc)},
) from exc
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Enforce trusted-path resolution and boundary validation before any audio I/O.

load_audio and _read_samples read and report path directly. This misses the required trust-boundary check (Path.resolve() + upload-root containment) and leaks raw user-derived paths in exception context.

🔧 Suggested hardening sketch
 def load_audio(path: Path) -> tuple[np.ndarray, int]:
+    resolved = path.resolve(strict=False)
+    # Validate resolved is under configured upload root before any I/O.
+    # Example:
+    # upload_root = settings.upload_dir.resolve(strict=True)
+    # if upload_root not in resolved.parents and resolved != upload_root:
+    #     raise AudioLoadError("Audio path is outside allowed directory", details={...})
+
-    suffix = path.suffix.lower()
+    suffix = resolved.suffix.lower()
@@
-        audio, sample_rate = _read_samples(path, suffix)
+        audio, sample_rate = _read_samples(resolved, suffix)
@@
-        msg = f"Failed to load audio file: {path}"
+        msg = "Failed to load audio file"
         raise AudioLoadError(
             msg,
-            details={"path": str(path), "reason": str(exc)},
+            details={"reason": str(exc)},
         ) from exc

As per coding guidelines, “all file I/O involving audio assets must use pathlib.Path.resolve() and validate against the configured upload directory before proceeding” and “agents must never write or log raw audio file paths derived from user input.”

Also applies to: 92-100

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/loader.py` around lines 27 - 69, Ensure
load_audio and _read_samples canonicalize and validate input paths before any
I/O: call Path.resolve() on the incoming path and verify the resolved path is
contained within the configured upload root (reject otherwise) before calling
_read_samples or performing reads; do not include the raw resolved path in
exception details or log messages—use a sanitized identifier (e.g., relative
path under the upload root or a redacted/hashed token) when constructing
AudioLoadError details and error messages so no user-derived absolute paths are
leaked.

Comment on lines +40 to +53
if audio.size == 0:
return float("-inf")

samples = np.asarray(audio, dtype=np.float64)
signal_power = float(np.mean(samples**2))
if signal_power <= _NOISE_FLOOR:
return float("-inf")

# 4th-order Butterworth low-pass at Wn=0.5 (half of Nyquist, one
# quarter of the sample rate) approximates the slow-varying signal
# envelope; the residual captures broadband noise.
sos = scipy_signal.butter(N=4, Wn=0.5, btype="low", output="sos")
envelope = scipy_signal.sosfiltfilt(sos, samples)
noise = samples - envelope
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

# First, let's read the full context of the quality.py file
cat -n src/audio_processor/preprocessing/quality.py

Repository: ByronWilliamsCPA/audio-processor

Length of output: 3441


🏁 Script executed:

# Check the scipy.signal.sosfiltfilt documentation and test its behavior with short arrays
python << 'EOF'
import numpy as np
import scipy.signal as scipy_signal

# Test sosfiltfilt with different array lengths
sos = scipy_signal.butter(N=4, Wn=0.5, btype="low", output="sos")

test_lengths = [1, 2, 3, 4, 5, 10, 20, 100]
for length in test_lengths:
    samples = np.random.randn(length).astype(np.float64)
    try:
        result = scipy_signal.sosfiltfilt(sos, samples)
        print(f"Length {length}: SUCCESS")
    except ValueError as e:
        print(f"Length {length}: ValueError - {e}")
    except Exception as e:
        print(f"Length {length}: {type(e).__name__} - {e}")
EOF

Repository: ByronWilliamsCPA/audio-processor

Length of output: 194


🏁 Script executed:

# Check if there are any tests for the check_snr function with short arrays
find . -name "*test*.py" -type f | head -20 | xargs grep -l "check_snr\|quality" 2>/dev/null || echo "No matching test files found"

Repository: ByronWilliamsCPA/audio-processor

Length of output: 119


🏁 Script executed:

# Also check for any error handling around sosfiltfilt in the codebase
rg "sosfiltfilt" -B 5 -A 5

Repository: ByronWilliamsCPA/audio-processor

Length of output: 1030


🏁 Script executed:

# Read the test file to see if there are tests for short arrays
cat -n ./tests/unit/preprocessing/test_quality.py

Repository: ByronWilliamsCPA/audio-processor

Length of output: 2773


🌐 Web query:

scipy.signal.sosfiltfilt minimum length padlen requirements ValueError short arrays

💡 Result:

When using scipy.signal.sosfiltfilt, the function requires that the length of the input signal along the filtering axis be greater than the padlen value [1][2][3]. Specifically, the documentation states that padlen must be less than x.shape[axis] - 1 [1][2]. If this condition is not met, a ValueError is raised [3]. The default padlen is calculated based on the filter coefficients (the number of second-order sections) [1][2]. If you encounter this error with a signal that you believe is sufficiently long, it is often because the function is attempting to filter along the wrong axis [4][3]. By default, sosfiltfilt operates on the last axis (axis=-1) [1][2]. If your input array is multidimensional (e.g., shape (N, 1)), the function may be trying to filter along the dimension with length 1, which is almost certainly shorter than the default padlen [4][3]. To resolve this, you can: 1. Explicitly specify the correct axis (e.g., axis=0) in the function call [4][3]. 2. Ensure your input signal is a one-dimensional array (e.g., using np.ravel or indexing) [4][3]. 3. Manually set a smaller padlen if appropriate for your application, provided it remains consistent with the requirements of the filter [5]. Note that while the error message explicitly mentions padlen, the root cause is frequently a mismatch between the intended filtering dimension and the array's shape [4][3].

Citations:


🌐 Web query:

scipy.signal.butter 4th order second order sections default padlen sosfiltfilt

💡 Result:

scipy.signal.butter(..., output='sos') and scipy.signal.sosfiltfilt: what happens with a “default padlen” and 4th-order SOS 1) butter: “4th order” with SOS - scipy.signal.butter(N,..., output='sos') returns a second-order sections array shaped (n_sections, 6). [1] - For bandpass/bandstop, SciPy states the resulting order for the final SOS matrix is 2*N; for simple low/high-pass, you typically get Nth-order realized as N biquad/sections in the SOS output representation. [1] 2) sosfiltfilt: default padlen is computed from the SOS structure - scipy.signal.sosfiltfilt(sos, x,..., padtype='odd', padlen=None) defaults to padtype='odd'. If padtype is None, no padding is used. [2] - If padlen is None, SciPy computes a default padlen value as: 3 * (2 * len(sos) + 1 - min((sos[:, 2] == 0).sum, (sos[:, 5] == 0).sum)) [2] (i.e., it depends on the number of SOS sections and whether there are zeros/poles at the origin for the implemented SOS representation). [2] - Note: padlen=0 implies no padding. [2] 3) Key implication for the edge artifact you mentioned (flattening near the ends) - If you set padtype=None in sosfiltfilt, you disable padding even though sosfiltfilt is a forward-backward filter that otherwise uses padding/extension to reduce edge transients. With no padding, the backward/forward initialization and signal extension assumptions can lead to boundary artifacts (often observed as flattening or other end effects). This is consistent with the fact that sosfiltfilt’s API explicitly ties edge handling to padding via padtype/padlen. [2] 4) Practical fixes - Prefer the default edge handling: do NOT set padtype=None; let padtype default to 'odd' and keep padlen=None so SciPy chooses the computed default padlen. [2] - If artifacts persist, increase padlen (not padtype=None). sosfiltfilt requires padlen < x.shape[axis] - 1. [2] - If you want “no delay” + better numerical robustness for higher orders, sosfiltfilt is generally the right approach; SciPy’s filtfilt documentation notes that sosfiltfilt (and SOS form) should be preferred over filtfilt for most filtering tasks due to fewer numerical problems. [3] 5) Your specific code snippet vs defaults - Your call sosfiltfilt(sos, data, padtype=None) forces “no padding” (padtype=None means no padding is used). [2] - To use the default padlen behavior, use padtype='odd' (or omit it) and padlen=None (or omit it). Example: data_filtered = scipy.signal.sosfiltfilt(sos, data, padtype='odd', padlen=None) (Default padtype='odd' and padlen is computed when padlen=None.) [2]

Citations:


Guard sosfiltfilt for short inputs to prevent ValueError on arrays shorter than the default padlen.

check_snr handles empty and silent audio but does not validate array length before calling scipy_signal.sosfiltfilt. Very short but non-empty, non-silent arrays (fewer than ~16 samples, depending on default padlen for 4th-order Butterworth) will raise ValueError. Add a minimum-length check before filtering, or use an alternate SNR estimator for short clips.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/quality.py` around lines 40 - 53, The
check_snr function calls scipy_signal.sosfiltfilt (via sosfiltfilt) without
guarding very short arrays, which raises ValueError; before creating sos or
calling scipy_signal.sosfiltfilt, check len(samples) against a small threshold
(e.g. 16) or compute the required padlen and compare to samples.size, and if the
input is too short, use a safe fallback (for example compute envelope with a
simple moving-average convolution or skip filtering and treat residual as zero /
return float("-inf") for unreliable SNR) so that envelope and noise are always
defined; update the branch around sos, envelope and noise in check_snr to use
this guarded fallback path.

Comment on lines +58 to +87
def detect_speech_segments(
audio: np.ndarray,
sample_rate: int,
) -> list[tuple[float, float]]:
"""Detect speech segments in mono audio using Silero VAD.

Args:
audio: 1-D mono audio array of floating-point samples in the range
``[-1.0, 1.0]``. Silero VAD natively supports 8 kHz and 16 kHz
sample rates; 16 kHz is recommended.
sample_rate: Sample rate of ``audio`` in Hz.

Returns:
Ordered list of ``(start_seconds, end_seconds)`` tuples describing
the detected speech intervals. Returns an empty list if no speech
is detected.
"""
model, utils = _load_silero_vad()
# Silero's `utils` is a tuple; `get_speech_timestamps` is the first
# element by convention.
get_speech_timestamps = utils[0]

tensor = torch.as_tensor(audio, dtype=torch.float32)
timestamps: list[dict[str, int]] = get_speech_timestamps(
tensor,
model,
sampling_rate=sample_rate,
)

return [(ts["start"] / sample_rate, ts["end"] / sample_rate) for ts in timestamps]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate sample_rate and audio shape before model invocation.

sample_rate is used both in Silero call and final division without guards. For sample_rate <= 0 this can fail with unclear errors (or divide by zero). Add explicit checks (sample_rate > 0, supported rates, audio.ndim == 1) and raise a typed validation error.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/audio_processor/preprocessing/vad.py` around lines 58 - 87, In
detect_speech_segments validate inputs before calling _load_silero_vad: check
that sample_rate is an int > 0 and one of Silero's supported rates (e.g., 8000
or 16000), check that audio is a 1-D numpy array (audio.ndim == 1) and that its
length > 0; if any check fails raise a typed validation error (ValueError) with
a clear message mentioning the offending parameter; perform these guards at the
top of detect_speech_segments so subsequent calls to _load_silero_vad,
get_speech_timestamps and the final division are safe.

Comment on lines +19 to +84
@pytest.mark.unit
def test_convert_to_wav_invokes_ffmpeg_with_argv_list(tmp_path: Path) -> None:
"""Successful invocation passes an argv list (never a shell string)."""
src = tmp_path / "in.mp3"
src.write_bytes(b"\x00")
dst = tmp_path / "out.wav"

fake = MagicMock(spec=subprocess.CompletedProcess)
fake.returncode = 0
fake.stderr = ""

with (
patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
patch.object(ffmpeg_module.subprocess, "run", return_value=fake) as run_mock,
):
result = convert_to_wav(src, dst)

assert result == dst
run_mock.assert_called_once()
args, kwargs = run_mock.call_args
cmd = args[0]
assert isinstance(cmd, list), "command must be argv list to avoid shell injection"
assert cmd[0] == "/usr/bin/ffmpeg"
assert str(src) in cmd
assert str(dst) in cmd
assert kwargs.get("shell", False) is False


@pytest.mark.unit
def test_convert_to_wav_raises_when_ffmpeg_missing(tmp_path: Path) -> None:
"""Calling ``convert_to_wav`` without ffmpeg on PATH raises OSError."""
src = tmp_path / "in.mp3"
src.write_bytes(b"\x00")
dst = tmp_path / "out.wav"

with (
patch.object(ffmpeg_module, "_FFMPEG_PATH", None),
pytest.raises(FfmpegConversionError, match="ffmpeg binary not found"),
):
convert_to_wav(src, dst)


@pytest.mark.unit
def test_convert_to_wav_raises_on_nonzero_exit(tmp_path: Path) -> None:
"""A non-zero ffmpeg exit surfaces FfmpegConversionError with stderr context."""
src = tmp_path / "in.mp3"
src.write_bytes(b"\x00")
dst = tmp_path / "out.wav"

fake = MagicMock(spec=subprocess.CompletedProcess)
fake.returncode = 1
fake.stderr = "Invalid data found when processing input"

with (
patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
patch.object(ffmpeg_module.subprocess, "run", return_value=fake),
pytest.raises(FfmpegConversionError) as exc_info,
):
convert_to_wav(src, dst)

assert "exit 1" in str(exc_info.value)
assert (
exc_info.value.details.get("stderr")
== "Invalid data found when processing input"
)
assert exc_info.value.details.get("exit_code") == 1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add tests for edge cases and error conditions.

The current tests cover the happy path and basic error cases, but several edge cases are missing per the coding guidelines requirement to "test for edge cases including: special characters in paths, network timeouts, partial failures":

Consider adding tests for:

  1. Input file doesn't exist: Verify behavior when input_path points to non-existent file
  2. Output parent directory doesn't exist: The docstring states "parent directory must already exist" but this isn't validated or tested
  3. Special characters in paths: Unicode characters, spaces, quotes in filenames
  4. Empty input file: Zero-byte or minimal invalid audio file
  5. Subprocess timeout: If timeout is added to subprocess.run (per my other comment), test TimeoutExpired
  6. Path traversal attempts: Test with paths containing .. if validation is added
📋 Example additional test cases
`@pytest.mark.unit`
def test_convert_to_wav_with_special_chars_in_path(tmp_path: Path) -> None:
    """Handle filenames with spaces and Unicode correctly."""
    src = tmp_path / "audio with spaces 音频.mp3"
    src.write_bytes(b"\x00")
    dst = tmp_path / "output 输出.wav"

    fake = MagicMock(spec=subprocess.CompletedProcess)
    fake.returncode = 0
    fake.stderr = ""

    with (
        patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
        patch.object(ffmpeg_module.subprocess, "run", return_value=fake) as run_mock,
    ):
        result = convert_to_wav(src, dst)

    assert result == dst
    # Verify paths are properly passed to subprocess
    cmd = run_mock.call_args[0][0]
    assert str(src) in cmd
    assert str(dst) in cmd


`@pytest.mark.unit`
def test_convert_to_wav_missing_output_parent_dir(tmp_path: Path) -> None:
    """Fail gracefully when output parent directory doesn't exist."""
    src = tmp_path / "in.mp3"
    src.write_bytes(b"\x00")
    dst = tmp_path / "nonexistent" / "out.wav"  # parent doesn't exist

    with (
        patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
        # Depending on implementation, this might raise before subprocess
        pytest.raises((FfmpegConversionError, FileNotFoundError)),
    ):
        convert_to_wav(src, dst)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/preprocessing/test_ffmpeg.py` around lines 19 - 84, Add unit tests
to cover the missing edge cases for convert_to_wav: add tests that call
ffmpeg_module.convert_to_wav with a non-existent input Path (assert it raises
FfmpegConversionError or FileNotFoundError), with an output whose parent
directory does not exist (assert proper error is raised), with filenames
containing spaces/unicode/quotes (assert subprocess.run is invoked with the
exact paths), with an empty zero-byte input file (assert ffmpeg error handling
attaches stderr/exit code), and a test that simulates subprocess.TimeoutExpired
by patching ffmpeg_module.subprocess.run to raise TimeoutExpired and asserting
convert_to_wav surfaces a FfmpegConversionError or timeout-specific behavior;
reference ffmpeg_module, convert_to_wav, FfmpegConversionError and
subprocess.run in your new tests.

Comment on lines +25 to +51
@pytest.mark.unit
def test_check_snr_silent_input_returns_neg_inf() -> None:
"""All-zero input has no signal energy and returns -inf."""
audio = np.zeros(1_000, dtype=np.float32)
assert check_snr(audio) == float("-inf")


@pytest.mark.unit
def test_check_snr_empty_input_returns_neg_inf() -> None:
"""Empty input returns -inf rather than raising."""
assert check_snr(np.array([], dtype=np.float32)) == float("-inf")


@pytest.mark.unit
def test_check_snr_returns_finite_for_noisy_signal() -> None:
"""A sine tone plus broadband noise yields a finite, lower SNR."""
rng = np.random.default_rng(seed=0)
sr = 16_000
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
noise = rng.normal(scale=0.1, size=sr)
noisy = (tone + noise).astype(np.float32)

snr_db = check_snr(noisy)

assert math.isfinite(snr_db)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add a regression test for short non-empty inputs in check_snr.

Please add a tiny array case (e.g., length < filtfilt pad length) and assert deterministic behavior instead of exception.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/preprocessing/test_quality.py` around lines 25 - 51, Add a
regression unit test to tests/unit/preprocessing/test_quality.py that exercises
check_snr with a very short non-empty array (e.g., length 2–4, values not all
zero) to ensure it does not raise from internal filtering/pad logic and returns
a deterministic result; create a fixed tiny input (like np.array([0.1, -0.1],
dtype=np.float32)), call check_snr(tiny_input) and assert the result is
deterministic (e.g., math.isfinite(snr) and/or equals a specific expected float)
in a new test function named
test_check_snr_short_non_empty_input_returns_deterministic_value.

williaby and others added 9 commits May 19, 2026 06:39
…qm-q3c2

Targeted single-package upgrade to patch:

- GHSA-r6ph-v2qm-q3c2 (HIGH) — cryptography vulnerable to a subgroup
  attack due to missing subgroup validation for SECT curves

Surfaced after the urllib3 fix landed; this was the next blocker on
Dependency Review and OSV Vulnerability Scanner. Same targeted-upgrade
pattern as the urllib3 fix to avoid wholesale-upgrade collateral.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… GHSAs

Targeted single-package upgrade to patch:

- GHSA-x2qx-6953-8485 (HIGH) — unsafe option check before shlex.split
- GHSA-rpm5-65cw-6hj4 (HIGH) — command injection via Git options bypass
- GHSA-7545-fcxq-7j24 (HIGH) — path traversal in reference APIs
- GHSA-v87r-6q3f-2j67 (HIGH) — RCE via core.hooksPath newline injection
- GHSA-mv93-w799-cj2w (HIGH) — bypass of CVE-2026-42215 patch

Surfaced after cryptography fix; was the next Dependency Review blocker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GHSA-5789-5fc7-67v3 (HIGH) — path traversal via incorrect startswith() check
- GHSA-24qx-w28j-9m6p (HIGH) — CORS origin validation bypass via re.match()
- GHSA-5mrq-x3x5-8v8f (HIGH) — auth cookies remain valid after password reset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GHSA-rch3-82jr-f9w9 (HIGH) — auth token theft via CommandLinker XSS
- GHSA-37w4-hwhx-4rc4 (HIGH) — extension manager API/GUI policy bypass
- GHSA-mqcg-5x36-vfcg (HIGH) — command linker HTML attribute exec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…default config)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ITLE_RE)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cklog

Targeted multi-package upgrade via uv lock --upgrade-package (per package,
no wholesale upgrade) to clear the remaining HIGH-severity findings flagged
by Dependency Review, OSV Vulnerability Scanner, and pip-audit:

- nbconvert 7.16.6 -> 7.17.1
- notebook 7.5.0 -> 7.5.6
- pillow 12.0.0 -> 12.2.0
- pip 25.3 -> 26.1.1
- protobuf 6.33.1 -> 6.33.6
- pyasn1 0.6.1 -> 0.6.3
- pygments 2.19.2 -> 2.20.0
- pytest 9.0.1 -> 9.0.3
- python-dotenv 1.2.1 -> 1.2.2
- python-multipart 0.0.20 -> 0.0.29
- requests 2.32.5 -> 2.34.2
- tornado 6.5.2 -> 6.5.5
- virtualenv 20.35.4 -> 21.3.3
- werkzeug 3.1.4 -> 3.1.8
- filelock 3.20.0 -> 3.29.0
- jaraco-context 6.0.1 -> 6.1.2

Local verification:
- uv run pip-audit: only py 1.11.0 PYSEC-2022-42969 remains (DISPUTED ReDoS,
  no upstream fix, already filtered by OSV scanner)
- uv run pytest tests/unit/preprocessing/: 16 passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sing deps

Audio preprocessing introduces a transitive dep tree (librosa -> numba,
silero-vad -> torch -> nvidia-cu12-*) that surfaces licenses not in the
existing allow-list, plus packages whose license metadata is not indexed
by GitHub's dependency API.

Adds to allow-licenses:
  MPL-1.1, 0BSD, BSD-4-Clause, BSD-2-Clause-Views, Zlib,
  LicenseRef-scancode-python-cwi,
  LicenseRef-scancode-secret-labs-2011,
  LicenseRef-scancode-unicode

Adds allow-dependencies-licenses for packages with no metadata-known
license (torch, lxml, silero-vad, onnxruntime, deepgram-sdk, protobuf,
jaraco-functools, ruamel-yaml{,-clib}, mkdocs-git-revision-date-localized
plus 15 nvidia-cu12-* CUDA packages).

Follows the documented precedent in this file (2026-05-17 entries for
MIT-CMU / ZPL-2.1 / HPND-Markus-Kuhn / LicenseRef-scancode-protobuf).
Per-license rationale documented inline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surfaced by CI pip-audit after the batch upgrade landed; my local
pip-audit had picked up an idna 3.11 venv even though uv.lock already
had 3.15 (uv sync didn't auto-reinstall). CI uses --frozen so it
reads the lock directly and saw the actual issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai Bot removed the python label May 19, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/dependency-review.yml:
- Around line 63-99: The allow-dependencies-licenses block currently contains
versionless PURLs (e.g., pkg:pypi/torch) which act as wildcards; update each
pkg:pypi/* entry in the allow-dependencies-licenses list to include the exact
approved version (e.g., pkg:pypi/torch@2.0.1) that was reviewed, or remove
entries that exist only to silence "no license" metadata so the action can
handle undetected licenses normally; also revisit the allow-licenses list (the
allow-licenses symbol) to remove or narrow any broad copyleft entries if you
intend to restrict repo-wide license acceptance.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 33dcabc8-f193-4fde-9946-056b4c05d3ad

📥 Commits

Reviewing files that changed from the base of the PR and between 79d6b5b and 574708b.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (1)
  • .github/workflows/dependency-review.yml

Comment on lines +63 to +99
allow-licenses: MIT, MIT-CMU, Apache-2.0, BSD-2-Clause, BSD-2-Clause-Views, BSD-3-Clause, BSD-4-Clause, 0BSD, ISC, MPL-1.1, MPL-2.0, LGPL-2.1, LGPL-3.0, Python-2.0, Unlicense, CC0-1.0, GPL-3.0-or-later, ZPL-2.1, Zlib, HPND-Markus-Kuhn, LicenseRef-scancode-protobuf, LicenseRef-scancode-python-cwi, LicenseRef-scancode-secret-labs-2011, LicenseRef-scancode-unicode
# Packages whose license metadata is not indexed by GitHub's dependency
# API (action sees "no license"). These all have known permissive
# licenses that would be allowed under allow-licenses if metadata was
# present:
# - torch, lxml, silero-vad, onnxruntime, deepgram-sdk: BSD-3-Clause / MIT
# - protobuf: BSD-3-Clause (LicenseRef-scancode-protobuf upstream)
# - jaraco-functools, ruamel-yaml, ruamel-yaml-clib: MIT
# - mkdocs-git-revision-date-localized-plugin: MIT
# - nvidia-* CUDA stack: NVIDIA Software License Agreement (proprietary
# EULA, accepted as dependency of torch which is required for VAD)
allow-dependencies-licenses: >-
pkg:pypi/torch,
pkg:pypi/lxml,
pkg:pypi/silero-vad,
pkg:pypi/onnxruntime,
pkg:pypi/deepgram-sdk,
pkg:pypi/protobuf,
pkg:pypi/jaraco-functools,
pkg:pypi/ruamel-yaml,
pkg:pypi/ruamel-yaml-clib,
pkg:pypi/mkdocs-git-revision-date-localized-plugin,
pkg:pypi/nvidia-cublas-cu12,
pkg:pypi/nvidia-cuda-cupti-cu12,
pkg:pypi/nvidia-cuda-nvrtc-cu12,
pkg:pypi/nvidia-cuda-runtime-cu12,
pkg:pypi/nvidia-cudnn-cu12,
pkg:pypi/nvidia-cufft-cu12,
pkg:pypi/nvidia-cufile-cu12,
pkg:pypi/nvidia-curand-cu12,
pkg:pypi/nvidia-cusolver-cu12,
pkg:pypi/nvidia-cusparse-cu12,
pkg:pypi/nvidia-cusparselt-cu12,
pkg:pypi/nvidia-nccl-cu12,
pkg:pypi/nvidia-nvjitlink-cu12,
pkg:pypi/nvidia-nvshmem-cu12,
pkg:pypi/nvidia-nvtx-cu12
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/dependency-review.yml | sed -n '54,100p'

Repository: ByronWilliamsCPA/audio-processor

Length of output: 3284


🌐 Web query:

actions/dependency-review-action allow-dependencies-licenses PURL versioning behavior

💡 Result:

In the GitHub actions/dependency-review-action, the allow-dependencies-licenses configuration option accepts a list of packages in Package URL (PURL) format to exclude them from license checks [1][2][3]. Regarding versioning and matching behavior: 1. Version Specificity: When a specific version is included in the PURL (e.g., pkg:githubactions/owner/repo@2.0.0), the action matches that specific version [1]. 2. Wildcard/Omission Behavior: If no version is specified in the PURL, the action treats it as a wildcard, effectively allowing all versions of that package [3]. 3. PURL Parsing: The action uses a permissive, custom parser for PURLs rather than strictly adhering to the standard packageurl-js library [4]. This allows it to handle certain non-standard or partial PURLs that users commonly provide, though it generally expects valid PURL structures [4]. 4. Limitations: As of May 2026, the action does not natively support PURL qualifiers (such as?license=...) to filter by specific licenses for a given package, nor does it support native wildcard characters (like *) within the PURL string itself for name matching, although these have been requested as enhancements [5][6]. In summary, providing a PURL without a version acts as a blanket allow for all versions of that package, while providing a versioned PURL restricts the exclusion to that specific version [1][3].

Citations:


Pin PURL versions to prevent blanket license waivers for future upgrades.

The current configuration uses versionless PURLs in allow-dependencies-licenses (e.g., pkg:pypi/torch instead of pkg:pypi/torch@2.0.1), which act as wildcards matching all versions of each package indefinitely. This turns metadata gaps for the current PR into permanent exceptions for all future versions, even after dependency updates. Additionally, allow-licenses now admits copyleft licenses (GPL-3.0-or-later, LGPL-2.1, LGPL-3.0, MPL-1.1, MPL-2.0) repo-wide.

Scope approvals by adding explicit versions to the PURLs (e.g., pkg:pypi/torch@2.0.1) to limit exceptions to the exact packages reviewed in this PR. Alternatively, drop entries that only suppress missing metadata, and instead rely on the action's existing handling of undetected licenses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/dependency-review.yml around lines 63 - 99, The
allow-dependencies-licenses block currently contains versionless PURLs (e.g.,
pkg:pypi/torch) which act as wildcards; update each pkg:pypi/* entry in the
allow-dependencies-licenses list to include the exact approved version (e.g.,
pkg:pypi/torch@2.0.1) that was reviewed, or remove entries that exist only to
silence "no license" metadata so the action can handle undetected licenses
normally; also revisit the allow-licenses list (the allow-licenses symbol) to
remove or narrow any broad copyleft entries if you intend to restrict repo-wide
license acceptance.

williaby and others added 2 commits May 19, 2026 08:42
Trivy on PR #27 surfaced one new HIGH CVE in the python:3.12-slim base
image (already had 5 documented CVEs in the same libgnutls30t64 package):

- CVE-2026-42009 (HIGH, affected) — DoS via DTLS packet reordering

No upstream Debian fix available. Same compensating-control argument as
the other libgnutls entries: GnuTLS is transitive via curl/ffmpeg only,
Python TLS uses OpenSSL via cryptography/urllib3, no untrusted TLS
endpoints flow through GnuTLS in the application request path.

Adds one .trivyignore entry and one row to the existing libgnutls30t64
table in docs/known-vulnerabilities.md (Discovered 2026-05-19, reassess
60-day cap per CLAUDE.md unfixed-CVE policy).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…L escape

CI exposed a pre-existing docs build regression once PR #27 touched
docs/: pygments 2.20.0 tightened html.escape() to reject None, breaking
pymdown-extensions <= 10.17.2's code-highlight formatter which passed
filename=None unconditionally. Affected any code block in any page.

The bug was on main (verified by uv-syncing main's exact lock and
running mkdocs build --strict locally) but the docs build job is path-
filtered to docs-touching PRs, so it had stayed undetected.

Upgrade pymdown-extensions v10.17.2 -> v10.21.3 (which fixes the
filename handling) while keeping pygments 2.20.0 (which fixes CVE-2026-4539).

Local verification:
  - uv run mkdocs build --strict: built in 1.84s, no errors
  - uv run pip-audit: only py 1.11.0 PYSEC-2022-42969 (already documented)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

@williaby williaby merged commit 1b1a036 into main May 19, 2026
60 checks passed
@williaby williaby deleted the claude/audio-preprocessing-phase-1-Pc9qx branch May 19, 2026 16:38
williaby added a commit that referenced this pull request May 30, 2026
…API hardening, decomposition) (#53)

* refactor: remove dead preprocessing pipeline and orphaned exceptions

The preprocessing/ package (PR #27) was superseded by the services/
implementations (PR #43) and is imported nowhere in production code.
The two pipelines computed quality metrics with different algorithms
(Butterworth-filter SNR vs spectral-percentile SNR) and maintained a
second, independent Silero VAD model cache, creating a correctness
ambiguity and duplicate maintenance burden.

This removes:
- src/audio_processor/preprocessing/{loader,ffmpeg,vad,quality}.py
- src/audio_processor/exceptions.py (AudioLoadError, FfmpegConversionError
  were only used by the preprocessing modules; the canonical hierarchy
  lives in core/exceptions.py)
- tests/unit/preprocessing/

The wired-in services/ implementations remain the single source of truth.
Suite: 436 passing, coverage 89.98%.

* feat: unify job lifecycle behind a shared JobStore and enable ARQ enqueue

The API stored jobs in a process-local in-memory dict and never enqueued
them, while the worker read/wrote a separate Redis store. A submitted job
could never reach COMPLETED from the API's perspective, and /results and
/artifacts were structurally unreachable (issue #50).

Introduce core/job_store.py with a JobStore abstraction and two backends:
- InMemoryJobStore (dev/tests; also exposes a sync mapping interface for
  direct record injection)
- RedisJobStore (JSON-in-Redis keyed by job:{id})

Both backends share one key scheme and serialization, so the API and the
separate-process worker observe the same state.

- routes.py resolves the store via app.state (falling back to the in-memory
  store) and enqueues to ARQ when enqueue_enabled and a pool is attached.
- audio_tasks._update_job_status delegates to RedisJobStore (single source
  of truth; removes the duplicated get/merge/set + key literal).
- api/__init__.py gains a lifespan that opens an ARQ pool and attaches a
  RedisJobStore when job_store_backend == "redis". Defaults (memory backend,
  enqueue disabled) preserve existing behavior; arq is imported lazily so the
  API does not hard-depend on the 'jobs' extra.

New config: job_store_backend ("memory"|"redis"), enqueue_enabled (bool).

Tests: add test_job_store.py (both backends) and test_audio_tasks.py, which
gives process_audio_job its first coverage and verifies a job reaches
COMPLETED with result+artifacts in the shared store. 449 passing, 92.76%.

* feat(security): add API-key auth, rate limiting, and safe streaming uploads

Addresses two findings from the architecture review:

F3 (no authn/authz): every endpoint was anonymous, so unauthenticated
callers could drive FFmpeg work and paid Deepgram calls.
F4 (unbounded upload): the size guard trusted the client Content-Length
header and the body was read fully into memory before validation, enabling
a memory-exhaustion DoS; orphaned temp files were only cleaned on the
validation-error path.

Changes:
- api/security.py: require_api_key (constant-time X-API-Key check) and a
  per-client fixed-window rate_limit dependency. Both are gated by config and
  default OFF, so existing behavior is preserved.
  - require_api_key is attached at the /api/v1 router; rate_limit guards the
    expensive POST /process endpoint.
- routes.py: stream uploads to disk in 1 MiB chunks with a hard byte cap
  enforced on bytes actually read (no full-body buffering, header not
  trusted). A finally block guarantees the temp file is removed on every
  failure path; on success the worker owns and deletes it.
- config: auth_required, api_keys (+ api_key_set), rate_limit_enabled,
  rate_limit_requests, rate_limit_window_seconds.

No new third-party dependency is added (keeps the audited dependency set
stable); the rate limiter is process-local and documented as a safety net to
complement a gateway/Redis limiter in multi-process deployments.

Tests: test_security.py and test_routes_hardening.py cover auth (open/401/
wrong/ok/misconfig-500), rate limiting (under/over/per-client), the streaming
cap, and temp-file cleanup on failure. 462 passing, coverage 92.79%.

* docs: document job/security settings in .env.example; apply ruff format

- Add JOB_STORE_BACKEND, ENQUEUE_ENABLED, AUTH_REQUIRED, API_KEYS,
  RATE_LIMIT_* to .env.example so the new configuration is discoverable.
- Apply ruff format to api/__init__.py and test_security.py (whitespace only).

* fix(test): move UploadFile import into TYPE_CHECKING block

CodeQL flagged UploadFile as an unused runtime import; it is only used in a
cast() string annotation, so it belongs in the type-checking block.

* refactor: remove dead Python 3.10 UTC compatibility shims

requires-python is >=3.11, so datetime.UTC is always available and the
sys.version_info guard's else branch was unreachable dead code, duplicated
across four modules (models, routes, audio_tasks, worker). Replace each with
a direct `from datetime import UTC` and drop the now-unused sys/timezone
imports.

* refactor: remove template scaffolding from worker and utils

Cookiecutter residue that the audio pipeline never used:
- jobs/worker.py: drop the example/stub tasks (example_background_task,
  send_email_task, process_file_upload) and the no-op cleanup_old_data cron
  (it returned a hard-coded 0 yet was registered on the worker). Remove the
  two large triple-quoted "example" blocks (FastAPI integration + Celery
  alternative) and the now-unused asyncio/datetime/cron imports.
  WorkerSettings now registers only process_audio_job.
- utils/financial.py: delete the empty placeholder module (no references).
- Update test_worker.py to cover the real surface (process_audio_job
  registration, lifecycle hooks, enqueue_task) and CLAUDE.md project tree.

456 passing, coverage 92.96%.

* refactor: decompose process_audio_job into staged helpers

The 260-line task mixed orchestration, conversion, transcription, artifact
generation, cleanup, and six near-identical inlined progress dicts. Extract
focused, individually-readable helpers:
- _progress(): builds a progress payload (collapses the 6 repeated blocks).
- _convert_audio(): converter is_video/extract vs convert_for_asr branch.
- _transcribe(): lazy Deepgram import + ConfigurationError tolerance.
- _build_transcription_payload(): result serialization.
- _generate_artifacts(): lazy import + generation-failure tolerance.

process_audio_job is now a ~80-line orchestrator with the same external
behavior and status transitions. Add tests for the two previously-uncovered
error branches (generic-exception wrapping to AudioProcessorError + FAILED,
and artifact-generation failure still completing the job).

458 passing, coverage 93.06%.

* fix: address PR review (rate-limiter bound, config invariant, test coverage)

- security.py: bound _RATE_WINDOWS with opportunistic eviction of expired
  windows above a soft cap, so a flood of unique client identifiers cannot
  grow the per-process map without bound.
- config.py: add a model_validator enforcing the documented invariant that
  enqueue_enabled requires job_store_backend='redis', so the misconfiguration
  fails fast at startup instead of silently never processing jobs.
- test_job_store.py: cover the bytes-decode branch of RedisJobStore._decode.

* fix(test): resolve CodeQL unused-import on UploadFile

CodeQL flagged UploadFile as unused because its only reference was the string
in cast("UploadFile", ...), and ruff TC006 conversely requires the cast type to
be quoted -- a direct conflict. Drop the cast/import entirely and pass the
duck-typed _FakeUpload directly with a precise reportArgumentType ignore on the
call. Also rename the autouse fixture _clear_store -> clear_store to match the
suite convention and avoid reportUnusedFunction.

* fix(core): make RedisJobStore.update atomic with per-field HSET (#57)

Closes #54.

RedisJobStore stored each job as a single JSON string and updated it with a
read-modify-write (GET, merge in Python, SET). Because the API and the ARQ
worker write the same job from separate processes, two concurrent updates
could clobber each other (lost update): whichever SET landed last overwrote
every field the other writer had changed. This is the failure mode the F2
"shared job store" fix is meant to prevent, so the store must not reintroduce
it.

Store each job as a Redis hash (job:{id}) with individually JSON-encoded
field values:
- create: MULTI/EXEC of DELETE + HSET(mapping) + EXPIRE (replaces any prior
  record, clears stale fields, reapplies TTL).
- update: HSET only the provided non-None fields + EXPIRE. HSET is atomic, so
  writers touching different fields no longer clobber each other; same-field
  writes remain last-writer-wins (the expected contract).
- get: HGETALL + per-field decode, tolerant of bytes or str (clients created
  without decode_responses return bytes).

Consolidate the two duplicated get/set FakeRedis stand-ins into a single
hash-capable tests/unit/_fake_redis.py (adds hset/hgetall/expire/delete and a
buffered transactional pipeline), and add a regression test asserting two
disjoint concurrent field updates both survive.

Suite: 460 passing, coverage 93.03%. ruff + ruff format clean;
basedpyright src/: 0 errors.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR review — secrets, API hardening, error handling

Re-applied on top of the merged atomic store (#57), which already implemented
the per-field HSET RedisJobStore and the shared test fake. This commit adds the
remaining review items that #57 did not cover:

Security:
- api_keys is now SecretStr (was plain str) so keys never leak via repr /
  model_dump / logs; api_key_set reads via get_secret_value().
- require_api_key compares against all keys without short-circuiting (avoids a
  timing side channel); the rate limiter hashes the API key before using it as
  an in-memory map key.
- Fail-fast Settings validator: auth_required requires at least one api_key.

Memory:
- Rate-limiter window map is hard-capped: after evicting expired windows it
  evicts the oldest entries when still over the cap.

Error handling:
- _generate_artifacts catches any exception (DOM/json.dumps can raise
  TypeError/ValueError/KeyError) and records an artifacts_error field instead
  of failing a completed transcription.
- Status-route timestamp parsing is defensive (no opaque 500 on a bad record).
- _maybe_enqueue raises when enabled without a pool, and enqueue failures mark
  the job FAILED instead of stranding it QUEUED.

Tests: test_config.py (validators, SecretStr non-leak) and _maybe_enqueue seam
tests. 489 passing, coverage 92.69%.

* style: drop unused noqa(BLE001) on re-raising except in _maybe_enqueue guard

The except re-raises, so BLE001 does not apply and the suppression was unused
(ruff RUF100), failing the Code Quality CI checks.

* fix(security): use non-crypto hash for rate-limit bucket id

CodeQL flagged hashlib.sha256(api_key) as weak hashing of sensitive data
(it treats the API key as a password requiring a slow KDF). The hash here is
only an in-process rate-limit bucket id, not password storage. Use the builtin
non-cryptographic hash(), which keeps per-key buckets and still avoids holding
the raw secret as a map key, without tripping the crypto sink.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants