feat(preprocessing): phase 1 sprints 1-4, audio preprocessing pipeline#27
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (2)
WalkthroughThis PR adds a complete audio preprocessing pipeline with FFmpeg conversion, audio normalization, quality metrics, and Silero VAD-based speech detection. It includes exception types, a safe FFmpeg subprocess wrapper, multi-format audio loading with resampling, SNR and clipping checks, workflow license updates, and comprehensive unit tests. ChangesAudio Preprocessing Pipeline
Sequence DiagramssequenceDiagram
participant User as Audio Preprocessing Client
participant Loader as load_audio()
participant FFmpeg as convert_to_wav()
participant Quality as check_snr()/check_clipping()
participant VAD as detect_speech_segments()
User->>Loader: Load audio file
Loader->>Loader: Validate suffix
Loader->>Loader: Decode audio
Loader->>Loader: Convert to mono
Loader->>Loader: Resample to 16k
Loader-->>User: (audio_array, 16000)
alt Conversion needed
User->>FFmpeg: Request conversion
FFmpeg->>FFmpeg: Run ffmpeg subprocess
FFmpeg-->>User: Converted WAV path
end
User->>Quality: Assess quality
Quality->>Quality: Compute SNR
Quality->>Quality: Check clipping
Quality-->>User: (snr_db, is_clipped)
User->>VAD: Detect speech
VAD->>VAD: Lazy-load Silero model
VAD->>VAD: get_speech_timestamps
VAD-->>User: [(start_sec, end_sec), ...]
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ FIPS Compatibility Check
Status: ✅ PASSED What is FIPS?FIPS 140-2/140-3 is a US government standard for cryptographic modules. Common issues:
|
Dependency ReviewThe following issues were found:
License Issuesuv.lock
OpenSSF ScorecardScorecard details
Scanned Files
|
There was a problem hiding this comment.
Pull request overview
Implements Phase 1 (Sprints 1–4) of the audio preprocessing pipeline as a set of pure-Python modules under src/audio_processor/preprocessing/, plus a new AudioLoadError exception and a scipy dependency for SNR estimation. All new logic is exercised by 15 mocked / synthetic-fixture unit tests.
Changes:
- Adds preprocessing modules:
loader.py(load/mono/resample to 16 kHz),ffmpeg.py(subprocess wrapper validating the binary at import time),vad.py(lazy thread-safe Silero VAD viatorch.hub), andquality.py(Butterworth-based SNR + peak-clipping detector). - Introduces
AudioLoadErrorextending the existingProjectBaseErrorhierarchy. - Adds
scipy>=1.11.0to the[audio]extra and correspondinguv.lockentries; adds unit tests undertests/unit/preprocessing/.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/audio_processor/exceptions.py |
New top-level AudioLoadError rooted in ProjectBaseError. |
src/audio_processor/preprocessing/__init__.py |
Sub-package marker / docstring. |
src/audio_processor/preprocessing/loader.py |
Loads WAV/MP3/FLAC/OGG, mono-mixes, resamples to 16 kHz. |
src/audio_processor/preprocessing/ffmpeg.py |
Argv-list ffmpeg wrapper; resolves binary at import time. |
src/audio_processor/preprocessing/vad.py |
Lazy, thread-safe Silero VAD wrapper returning seconds. |
src/audio_processor/preprocessing/quality.py |
SNR via Butterworth low-pass + peak-magnitude clipping check. |
tests/unit/preprocessing/test_loader.py |
Resample/mono, passthrough, unsupported suffix, corrupt file. |
tests/unit/preprocessing/test_ffmpeg.py |
Asserts argv-list invocation and non-zero-exit error path. |
tests/unit/preprocessing/test_vad.py |
Mocks torch.hub.load; verifies seconds conversion / empty case. |
tests/unit/preprocessing/test_quality.py |
Clean/silent/empty/noisy SNR + clipping threshold cases. |
tests/unit/preprocessing/__init__.py |
Package marker for new tests. |
pyproject.toml |
Adds scipy>=1.11.0 to [audio] extra. |
uv.lock |
Locks scipy 1.15.3 / 1.16.3 (per Python version) and a hypothesis sdist entry. |
d23e6f1 to
dcfc527
Compare
Adds the pure-Python audio preprocessing pipeline that runs ahead of Deepgram transcription. Covers loader/format normalization, an FFmpeg wrapper, Silero VAD, and SNR/clipping quality checks. No live API calls required. Modules: - src/audio_processor/exceptions.py: AudioLoadError (extends ProjectBaseError). - src/audio_processor/preprocessing/loader.py: load_audio() reads WAV/MP3/ FLAC/OGG via soundfile (librosa for MP3), averages to mono, resamples to 16 kHz. - src/audio_processor/preprocessing/ffmpeg.py: convert_to_wav() shells out to ffmpeg with an argv list (shell=False) for safe format conversion; validates ffmpeg on PATH at import. - src/audio_processor/preprocessing/vad.py: detect_speech_segments() loads Silero VAD via torch.hub.load (cached, thread-safe) and returns (start_s, end_s) tuples. - src/audio_processor/preprocessing/quality.py: check_snr() (Butterworth envelope via scipy.signal) and check_clipping(). Tests: 15 unit tests in tests/unit/preprocessing/, all mocked (torch.hub.load, subprocess.run) with synthetic numpy fixtures. Deps: adds scipy>=1.11.0 to the [audio] extra for SNR filtering.
…R docs Addresses Copilot review and the Python Compatibility Matrix failures (import-time `OSError` blocked pytest collection on Win/macOS runners where ffmpeg is not preinstalled). - ffmpeg.py: resolve ffmpeg on PATH at import (cached) but defer the raise to first call of `convert_to_wav`, so importing the module is side-effect free. Switches the raised type from stdlib `RuntimeError` to a new project-hierarchy `FfmpegConversionError` for both the missing-binary and non-zero-exit paths, including stderr/exit_code in the exception details. - exceptions.py: add `FfmpegConversionError(ProjectBaseError)`. - quality.py: docstring/comment said "Nyquist quarter" but `Wn=0.5` is half of Nyquist (one quarter of the sample rate). Corrected both. - tests: add coverage for the missing-binary path; update the non-zero exit test to assert the new exception type and details.
Two convert_to_wav tests mocked subprocess.run but not _FFMPEG_PATH. After the deferred-ffmpeg-check change, the new guard fires before subprocess.run is reached when ffmpeg is not on PATH (which is the case on most CI runners and dev boxes that haven't apt-installed it). Add patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg") to both tests so the guard passes and the mocked subprocess.run gets exercised. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fdcc7be to
13d7e47
Compare
✅ Mutation Testing Results
What is Mutation Testing?Mutation testing introduces small changes (mutations) to your code and checks if your tests detect them. A high mutation score indicates your tests are effective at catching bugs.
|
Targeted single-package upgrade (uv lock --upgrade-package urllib3) to patch: - GHSA-38jv-5279-wg99 (HIGH) — decompression-bomb safeguards bypassed when following HTTP redirects in the streaming API - GHSA-qccp-gfcp-xxvc (HIGH) — sensitive headers forwarded across origins in proxied low-level redirects - GHSA-mf9v-mfxr-j63j (HIGH) — decompression-bomb safeguards bypassed in parts of the streaming API PR #27 was blocked on these via Dependency Review, OSV Vulnerability Scanner, SBOM Runtime Scan, and Security Gate Validation. The original author tried `uv lock --upgrade` (wholesale) twice and reverted both times because of collateral damage; this targeted upgrade only touches urllib3 and matches the precise CVE remediation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/unit/preprocessing/test_loader.py (1)
57-77: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd a boundary test for “path outside upload directory” rejection.
Current tests validate suffix/corruption, but not the required trust-boundary path check. Add a case that passes a resolved path outside configured upload root and asserts a typed failure.
As per coding guidelines, “Test for edge cases … and file system edge cases.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/preprocessing/test_loader.py` around lines 57 - 77, Add a unit test that verifies load_audio rejects paths outside the configured upload root by constructing a Path that resolves to a location outside the expected upload directory and asserting it raises AudioLoadError; locate the test file tests/unit/preprocessing/test_loader.py and add a new pytest test (similar style to test_load_audio_rejects_unsupported_suffix and test_load_audio_raises_on_corrupt_file) that calls load_audio(...) with a resolved outside-path and checks for AudioLoadError to ensure the trust-boundary path check is enforced.tests/unit/preprocessing/test_vad.py (1)
21-69: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd invalid-input tests for
detect_speech_segments.Please cover
sample_rate=0/negative and non-1D audio inputs, asserting clear typed exceptions. This will lock in the validation contract and prevent regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/preprocessing/test_vad.py` around lines 21 - 69, Add unit tests for detect_speech_segments to assert it raises clear typed exceptions on invalid inputs: (1) call detect_speech_segments(audio, sample_rate=0) and detect_speech_segments(audio, sample_rate=-16000) with a 1-D numpy audio array and assert it raises ValueError (or the library's documented exception) for invalid sample_rate; (2) call detect_speech_segments with non-1D audio (e.g., a 2-D numpy array) and assert it raises TypeError (or the documented exception) for invalid audio shape. Add these tests into tests/unit/preprocessing/test_vad.py near the existing tests and use the same patching of vad_module.torch.hub.load (returning a fake utils tuple) so the error is triggered by input validation before model calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/audio_processor/preprocessing/ffmpeg.py`:
- Around line 88-93: Add a timeout to the subprocess.run call that invokes
ffmpeg (the call using cmd) and catch subprocess.TimeoutExpired; on timeout
raise FfmpegConversionError with a descriptive message and details including
"input_path" and the timeout value, and chain the original exception; ensure
timeout is configurable or a reasonable fixed value and do not remove existing
capture_output/text/check behavior.
- Around line 50-110: Resolve and validate paths at the start of convert_to_wav:
call .resolve() on input_path and output_path, ensure _FFMPEG_PATH is checked as
before, verify resolved input_path.exists() and .is_file() and raise
FfmpegConversionError with a clear message if not, ensure resolved
output_path.parent.exists() (create or raise as per project policy), and
validate both resolved paths lie within the configured upload/working directory
(e.g. UPLOAD_DIR or WORKING_DIR) to prevent traversal; update the error details
to include resolved paths when raising FfmpegConversionError.
In `@src/audio_processor/preprocessing/loader.py`:
- Around line 27-69: Ensure load_audio and _read_samples canonicalize and
validate input paths before any I/O: call Path.resolve() on the incoming path
and verify the resolved path is contained within the configured upload root
(reject otherwise) before calling _read_samples or performing reads; do not
include the raw resolved path in exception details or log messages—use a
sanitized identifier (e.g., relative path under the upload root or a
redacted/hashed token) when constructing AudioLoadError details and error
messages so no user-derived absolute paths are leaked.
- Around line 16-17: The module currently imports and raises AudioLoadError from
audio_processor.exceptions; replace that with imports from
src.audio_processor.core.exceptions and raise the appropriate centralized types
instead: import ValidationError, ResourceNotFoundError and ExternalServiceError
and update all raises in the loader functions (e.g., the loader module's
functions that currently raise AudioLoadError at the import site and at the
error sites around lines indicated) so that file-not-found or missing resource
errors raise ResourceNotFoundError, invalid/unsupported audio or parameter
validation issues raise ValidationError, and failures coming from external
libraries or IO should raise ExternalServiceError; update the import statement
to reference these three exception classes and change each raise to the correct
class with the original error message preserved.
In `@src/audio_processor/preprocessing/quality.py`:
- Around line 40-53: The check_snr function calls scipy_signal.sosfiltfilt (via
sosfiltfilt) without guarding very short arrays, which raises ValueError; before
creating sos or calling scipy_signal.sosfiltfilt, check len(samples) against a
small threshold (e.g. 16) or compute the required padlen and compare to
samples.size, and if the input is too short, use a safe fallback (for example
compute envelope with a simple moving-average convolution or skip filtering and
treat residual as zero / return float("-inf") for unreliable SNR) so that
envelope and noise are always defined; update the branch around sos, envelope
and noise in check_snr to use this guarded fallback path.
In `@src/audio_processor/preprocessing/vad.py`:
- Around line 58-87: In detect_speech_segments validate inputs before calling
_load_silero_vad: check that sample_rate is an int > 0 and one of Silero's
supported rates (e.g., 8000 or 16000), check that audio is a 1-D numpy array
(audio.ndim == 1) and that its length > 0; if any check fails raise a typed
validation error (ValueError) with a clear message mentioning the offending
parameter; perform these guards at the top of detect_speech_segments so
subsequent calls to _load_silero_vad, get_speech_timestamps and the final
division are safe.
In `@tests/unit/preprocessing/test_ffmpeg.py`:
- Around line 19-84: Add unit tests to cover the missing edge cases for
convert_to_wav: add tests that call ffmpeg_module.convert_to_wav with a
non-existent input Path (assert it raises FfmpegConversionError or
FileNotFoundError), with an output whose parent directory does not exist (assert
proper error is raised), with filenames containing spaces/unicode/quotes (assert
subprocess.run is invoked with the exact paths), with an empty zero-byte input
file (assert ffmpeg error handling attaches stderr/exit code), and a test that
simulates subprocess.TimeoutExpired by patching ffmpeg_module.subprocess.run to
raise TimeoutExpired and asserting convert_to_wav surfaces a
FfmpegConversionError or timeout-specific behavior; reference ffmpeg_module,
convert_to_wav, FfmpegConversionError and subprocess.run in your new tests.
In `@tests/unit/preprocessing/test_quality.py`:
- Around line 25-51: Add a regression unit test to
tests/unit/preprocessing/test_quality.py that exercises check_snr with a very
short non-empty array (e.g., length 2–4, values not all zero) to ensure it does
not raise from internal filtering/pad logic and returns a deterministic result;
create a fixed tiny input (like np.array([0.1, -0.1], dtype=np.float32)), call
check_snr(tiny_input) and assert the result is deterministic (e.g.,
math.isfinite(snr) and/or equals a specific expected float) in a new test
function named test_check_snr_short_non_empty_input_returns_deterministic_value.
---
Outside diff comments:
In `@tests/unit/preprocessing/test_loader.py`:
- Around line 57-77: Add a unit test that verifies load_audio rejects paths
outside the configured upload root by constructing a Path that resolves to a
location outside the expected upload directory and asserting it raises
AudioLoadError; locate the test file tests/unit/preprocessing/test_loader.py and
add a new pytest test (similar style to
test_load_audio_rejects_unsupported_suffix and
test_load_audio_raises_on_corrupt_file) that calls load_audio(...) with a
resolved outside-path and checks for AudioLoadError to ensure the trust-boundary
path check is enforced.
In `@tests/unit/preprocessing/test_vad.py`:
- Around line 21-69: Add unit tests for detect_speech_segments to assert it
raises clear typed exceptions on invalid inputs: (1) call
detect_speech_segments(audio, sample_rate=0) and detect_speech_segments(audio,
sample_rate=-16000) with a 1-D numpy audio array and assert it raises ValueError
(or the library's documented exception) for invalid sample_rate; (2) call
detect_speech_segments with non-1D audio (e.g., a 2-D numpy array) and assert it
raises TypeError (or the documented exception) for invalid audio shape. Add
these tests into tests/unit/preprocessing/test_vad.py near the existing tests
and use the same patching of vad_module.torch.hub.load (returning a fake utils
tuple) so the error is triggered by input validation before model calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e9cdc4a7-e337-40bf-9106-bf043a05cb37
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (12)
pyproject.tomlsrc/audio_processor/exceptions.pysrc/audio_processor/preprocessing/__init__.pysrc/audio_processor/preprocessing/ffmpeg.pysrc/audio_processor/preprocessing/loader.pysrc/audio_processor/preprocessing/quality.pysrc/audio_processor/preprocessing/vad.pytests/unit/preprocessing/__init__.pytests/unit/preprocessing/test_ffmpeg.pytests/unit/preprocessing/test_loader.pytests/unit/preprocessing/test_quality.pytests/unit/preprocessing/test_vad.py
| def convert_to_wav(input_path: Path, output_path: Path) -> Path: | ||
| """Convert an audio or video file to a WAV file using ffmpeg. | ||
|
|
||
| Invokes ffmpeg via ``subprocess.run`` with an argument list (no shell | ||
| interpolation) and overwrites ``output_path`` if it already exists. | ||
|
|
||
| Args: | ||
| input_path: Filesystem path to the source media file. | ||
| output_path: Filesystem path for the resulting WAV file. The parent | ||
| directory must already exist. | ||
|
|
||
| Returns: | ||
| The ``output_path`` argument, returned for call-chaining convenience. | ||
|
|
||
| Raises: | ||
| FfmpegConversionError: If the ``ffmpeg`` binary is not on ``PATH`` | ||
| (resolved once at import; raised on first call so test suites | ||
| that mock ``subprocess`` can still import the module), or if | ||
| the ffmpeg invocation exits with a non-zero status. The | ||
| original ``stderr`` output is included in ``details`` for | ||
| diagnostics. | ||
| """ | ||
| if _FFMPEG_PATH is None: | ||
| raise FfmpegConversionError(_FFMPEG_MISSING_MSG) | ||
|
|
||
| cmd: list[str] = [ | ||
| _FFMPEG_PATH, | ||
| "-y", # overwrite output without prompting | ||
| "-i", | ||
| str(input_path), | ||
| "-vn", # drop any video stream | ||
| "-acodec", | ||
| "pcm_s16le", # standard 16-bit PCM WAV | ||
| str(output_path), | ||
| ] | ||
|
|
||
| # `check=False` so we can surface ffmpeg's stderr verbatim; `shell=False` | ||
| # (the default) is the security-critical guarantee, never pass a string. | ||
| result = subprocess.run( # noqa: S603 - argv list, shell=False | ||
| cmd, | ||
| capture_output=True, | ||
| text=True, | ||
| check=False, | ||
| ) | ||
| if result.returncode != 0: | ||
| stderr = result.stderr.strip() or "<no stderr output>" | ||
| msg = ( | ||
| f"ffmpeg failed (exit {result.returncode}) converting " | ||
| f"{input_path} -> {output_path}" | ||
| ) | ||
| raise FfmpegConversionError( | ||
| msg, | ||
| details={ | ||
| "input_path": str(input_path), | ||
| "output_path": str(output_path), | ||
| "exit_code": result.returncode, | ||
| "stderr": stderr, | ||
| }, | ||
| ) | ||
|
|
||
| return output_path |
There was a problem hiding this comment.
Add path resolution and validation to prevent path traversal attacks.
The function accepts arbitrary Path objects and passes them directly to ffmpeg without validation. As per coding guidelines: "all file I/O involving audio assets must use pathlib.Path.resolve() and validate against the configured upload directory before proceeding."
Consider:
- Resolve both paths to absolute paths using
.resolve()to prevent..traversal - Validate that
input_pathexists and is a file (not a directory or device) - Validate that
output_path.parentexists before invoking ffmpeg - If there's a configured upload/working directory, validate that resolved paths are within those boundaries
🛡️ Example validation pattern
def convert_to_wav(input_path: Path, output_path: Path) -> Path:
"""Convert an audio or video file to a WAV file using ffmpeg.
Invokes ffmpeg via ``subprocess.run`` with an argument list (no shell
interpolation) and overwrites ``output_path`` if it already exists.
Args:
input_path: Filesystem path to the source media file.
output_path: Filesystem path for the resulting WAV file. The parent
directory must already exist.
Returns:
The ``output_path`` argument, returned for call-chaining convenience.
Raises:
FfmpegConversionError: If the ``ffmpeg`` binary is not on ``PATH``
(resolved once at import; raised on first call so test suites
that mock ``subprocess`` can still import the module), or if
the ffmpeg invocation exits with a non-zero status. The
original ``stderr`` output is included in ``details`` for
diagnostics.
+ ValueError: If input_path does not exist or output_path parent
+ directory does not exist.
"""
if _FFMPEG_PATH is None:
raise FfmpegConversionError(_FFMPEG_MISSING_MSG)
+ # Resolve paths to prevent traversal attacks
+ input_path = input_path.resolve()
+ output_path = output_path.resolve()
+
+ # Validate input exists and is a file
+ if not input_path.is_file():
+ msg = f"Input path does not exist or is not a file: {input_path}"
+ raise FfmpegConversionError(msg, details={"input_path": str(input_path)})
+
+ # Validate output parent directory exists
+ if not output_path.parent.exists():
+ msg = f"Output directory does not exist: {output_path.parent}"
+ raise FfmpegConversionError(
+ msg,
+ details={"output_parent": str(output_path.parent)}
+ )
cmd: list[str] = [
_FFMPEG_PATH,
"-y", # overwrite output without prompting
"-i",
str(input_path),
"-vn", # drop any video stream
"-acodec",
"pcm_s16le", # standard 16-bit PCM WAV
str(output_path),
]As per coding guidelines, audio file handling must validate paths before proceeding.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/ffmpeg.py` around lines 50 - 110, Resolve
and validate paths at the start of convert_to_wav: call .resolve() on input_path
and output_path, ensure _FFMPEG_PATH is checked as before, verify resolved
input_path.exists() and .is_file() and raise FfmpegConversionError with a clear
message if not, ensure resolved output_path.parent.exists() (create or raise as
per project policy), and validate both resolved paths lie within the configured
upload/working directory (e.g. UPLOAD_DIR or WORKING_DIR) to prevent traversal;
update the error details to include resolved paths when raising
FfmpegConversionError.
| result = subprocess.run( # noqa: S603 - argv list, shell=False | ||
| cmd, | ||
| capture_output=True, | ||
| text=True, | ||
| check=False, | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Consider adding timeout parameter to subprocess.run.
The subprocess.run call has no timeout, which could cause the process to hang indefinitely if ffmpeg encounters an issue or a malicious input file causes ffmpeg to run indefinitely. As per coding guidelines for external calls: blocking calls without timeouts on request threads can cause cascading failures.
⏱️ Add timeout parameter
# `check=False` so we can surface ffmpeg's stderr verbatim; `shell=False`
# (the default) is the security-critical guarantee, never pass a string.
result = subprocess.run( # noqa: S603 - argv list, shell=False
cmd,
capture_output=True,
text=True,
check=False,
+ timeout=300, # 5 minutes max for conversion
)Don't forget to handle subprocess.TimeoutExpired and wrap it in FfmpegConversionError:
try:
result = subprocess.run(...)
except subprocess.TimeoutExpired as exc:
msg = f"ffmpeg conversion timed out after {exc.timeout}s: {input_path}"
raise FfmpegConversionError(
msg,
details={
"input_path": str(input_path),
"timeout": exc.timeout,
},
) from exc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/ffmpeg.py` around lines 88 - 93, Add a
timeout to the subprocess.run call that invokes ffmpeg (the call using cmd) and
catch subprocess.TimeoutExpired; on timeout raise FfmpegConversionError with a
descriptive message and details including "input_path" and the timeout value,
and chain the original exception; ensure timeout is configurable or a reasonable
fixed value and do not remove existing capture_output/text/check behavior.
| from audio_processor.exceptions import AudioLoadError | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Align raised exceptions with the centralized core exception hierarchy.
This module currently raises AudioLoadError from audio_processor.exceptions, but project rules require exceptions from src/audio_processor/core/exceptions.py (e.g., ValidationError, ResourceNotFoundError, ExternalServiceError) for consistency and cross-layer handling.
As per coding guidelines, “Use centralized exception hierarchy from src/audio_processor/core/exceptions.py with specific exception types.”
Also applies to: 51-52, 66-67
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/loader.py` around lines 16 - 17, The module
currently imports and raises AudioLoadError from audio_processor.exceptions;
replace that with imports from src.audio_processor.core.exceptions and raise the
appropriate centralized types instead: import ValidationError,
ResourceNotFoundError and ExternalServiceError and update all raises in the
loader functions (e.g., the loader module's functions that currently raise
AudioLoadError at the import site and at the error sites around lines indicated)
so that file-not-found or missing resource errors raise ResourceNotFoundError,
invalid/unsupported audio or parameter validation issues raise ValidationError,
and failures coming from external libraries or IO should raise
ExternalServiceError; update the import statement to reference these three
exception classes and change each raise to the correct class with the original
error message preserved.
| def load_audio(path: Path) -> tuple[np.ndarray, int]: | ||
| """Load an audio file, convert to mono, and resample to 16 kHz. | ||
|
|
||
| Reads the file with ``soundfile`` for efficient PCM decoding and falls | ||
| back to ``librosa`` for compressed formats (e.g. MP3) that soundfile | ||
| cannot always decode. Multi-channel input is averaged to mono. | ||
| Resampling uses librosa's polyphase resampler. | ||
|
|
||
| Args: | ||
| path: Filesystem path to the audio file. The suffix must be one of | ||
| ``.wav``, ``.mp3``, ``.flac``, or ``.ogg``. | ||
|
|
||
| Returns: | ||
| Tuple ``(audio, sample_rate)`` where ``audio`` is a 1-D ``float32`` | ||
| numpy array of mono samples and ``sample_rate`` is always | ||
| ``TARGET_SAMPLE_RATE`` (16_000). | ||
|
|
||
| Raises: | ||
| AudioLoadError: If the file suffix is unsupported, the file is | ||
| missing, or the underlying decoder fails on a corrupt file. | ||
| """ | ||
| suffix = path.suffix.lower() | ||
| if suffix not in SUPPORTED_SUFFIXES: | ||
| msg = f"Unsupported audio format: {suffix!r}" | ||
| raise AudioLoadError( | ||
| msg, | ||
| details={ | ||
| "path": str(path), | ||
| "suffix": suffix, | ||
| "supported": ", ".join(sorted(SUPPORTED_SUFFIXES)), | ||
| }, | ||
| ) | ||
|
|
||
| try: | ||
| audio, sample_rate = _read_samples(path, suffix) | ||
| except AudioLoadError: | ||
| raise | ||
| except (OSError, ValueError, RuntimeError) as exc: | ||
| msg = f"Failed to load audio file: {path}" | ||
| raise AudioLoadError( | ||
| msg, | ||
| details={"path": str(path), "reason": str(exc)}, | ||
| ) from exc |
There was a problem hiding this comment.
Enforce trusted-path resolution and boundary validation before any audio I/O.
load_audio and _read_samples read and report path directly. This misses the required trust-boundary check (Path.resolve() + upload-root containment) and leaks raw user-derived paths in exception context.
🔧 Suggested hardening sketch
def load_audio(path: Path) -> tuple[np.ndarray, int]:
+ resolved = path.resolve(strict=False)
+ # Validate resolved is under configured upload root before any I/O.
+ # Example:
+ # upload_root = settings.upload_dir.resolve(strict=True)
+ # if upload_root not in resolved.parents and resolved != upload_root:
+ # raise AudioLoadError("Audio path is outside allowed directory", details={...})
+
- suffix = path.suffix.lower()
+ suffix = resolved.suffix.lower()
@@
- audio, sample_rate = _read_samples(path, suffix)
+ audio, sample_rate = _read_samples(resolved, suffix)
@@
- msg = f"Failed to load audio file: {path}"
+ msg = "Failed to load audio file"
raise AudioLoadError(
msg,
- details={"path": str(path), "reason": str(exc)},
+ details={"reason": str(exc)},
) from excAs per coding guidelines, “all file I/O involving audio assets must use pathlib.Path.resolve() and validate against the configured upload directory before proceeding” and “agents must never write or log raw audio file paths derived from user input.”
Also applies to: 92-100
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/loader.py` around lines 27 - 69, Ensure
load_audio and _read_samples canonicalize and validate input paths before any
I/O: call Path.resolve() on the incoming path and verify the resolved path is
contained within the configured upload root (reject otherwise) before calling
_read_samples or performing reads; do not include the raw resolved path in
exception details or log messages—use a sanitized identifier (e.g., relative
path under the upload root or a redacted/hashed token) when constructing
AudioLoadError details and error messages so no user-derived absolute paths are
leaked.
| if audio.size == 0: | ||
| return float("-inf") | ||
|
|
||
| samples = np.asarray(audio, dtype=np.float64) | ||
| signal_power = float(np.mean(samples**2)) | ||
| if signal_power <= _NOISE_FLOOR: | ||
| return float("-inf") | ||
|
|
||
| # 4th-order Butterworth low-pass at Wn=0.5 (half of Nyquist, one | ||
| # quarter of the sample rate) approximates the slow-varying signal | ||
| # envelope; the residual captures broadband noise. | ||
| sos = scipy_signal.butter(N=4, Wn=0.5, btype="low", output="sos") | ||
| envelope = scipy_signal.sosfiltfilt(sos, samples) | ||
| noise = samples - envelope |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's read the full context of the quality.py file
cat -n src/audio_processor/preprocessing/quality.pyRepository: ByronWilliamsCPA/audio-processor
Length of output: 3441
🏁 Script executed:
# Check the scipy.signal.sosfiltfilt documentation and test its behavior with short arrays
python << 'EOF'
import numpy as np
import scipy.signal as scipy_signal
# Test sosfiltfilt with different array lengths
sos = scipy_signal.butter(N=4, Wn=0.5, btype="low", output="sos")
test_lengths = [1, 2, 3, 4, 5, 10, 20, 100]
for length in test_lengths:
samples = np.random.randn(length).astype(np.float64)
try:
result = scipy_signal.sosfiltfilt(sos, samples)
print(f"Length {length}: SUCCESS")
except ValueError as e:
print(f"Length {length}: ValueError - {e}")
except Exception as e:
print(f"Length {length}: {type(e).__name__} - {e}")
EOFRepository: ByronWilliamsCPA/audio-processor
Length of output: 194
🏁 Script executed:
# Check if there are any tests for the check_snr function with short arrays
find . -name "*test*.py" -type f | head -20 | xargs grep -l "check_snr\|quality" 2>/dev/null || echo "No matching test files found"Repository: ByronWilliamsCPA/audio-processor
Length of output: 119
🏁 Script executed:
# Also check for any error handling around sosfiltfilt in the codebase
rg "sosfiltfilt" -B 5 -A 5Repository: ByronWilliamsCPA/audio-processor
Length of output: 1030
🏁 Script executed:
# Read the test file to see if there are tests for short arrays
cat -n ./tests/unit/preprocessing/test_quality.pyRepository: ByronWilliamsCPA/audio-processor
Length of output: 2773
🌐 Web query:
scipy.signal.sosfiltfilt minimum length padlen requirements ValueError short arrays
💡 Result:
When using scipy.signal.sosfiltfilt, the function requires that the length of the input signal along the filtering axis be greater than the padlen value [1][2][3]. Specifically, the documentation states that padlen must be less than x.shape[axis] - 1 [1][2]. If this condition is not met, a ValueError is raised [3]. The default padlen is calculated based on the filter coefficients (the number of second-order sections) [1][2]. If you encounter this error with a signal that you believe is sufficiently long, it is often because the function is attempting to filter along the wrong axis [4][3]. By default, sosfiltfilt operates on the last axis (axis=-1) [1][2]. If your input array is multidimensional (e.g., shape (N, 1)), the function may be trying to filter along the dimension with length 1, which is almost certainly shorter than the default padlen [4][3]. To resolve this, you can: 1. Explicitly specify the correct axis (e.g., axis=0) in the function call [4][3]. 2. Ensure your input signal is a one-dimensional array (e.g., using np.ravel or indexing) [4][3]. 3. Manually set a smaller padlen if appropriate for your application, provided it remains consistent with the requirements of the filter [5]. Note that while the error message explicitly mentions padlen, the root cause is frequently a mismatch between the intended filtering dimension and the array's shape [4][3].
Citations:
- 1: https://docs.scipy.org/doc/scipy-1.5.3/reference/generated/scipy.signal.sosfiltfilt.html
- 2: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.sosfiltfilt.html
- 3: https://stackoverflow.com/questions/78316329/python-sosfiltfilt-vector-x-must-be-greater-than-padlen-in-reality-it-is-gre
- 4: https://stackoverflow.com/questions/46849290/error-with-padlen-in-signal-filtfilt-in-python
- 5: https://stackoverflow.com/questions/58374492/python-valueerror-the-length-of-the-input-vector-x-must-be-greater-than-padle
🌐 Web query:
scipy.signal.butter 4th order second order sections default padlen sosfiltfilt
💡 Result:
scipy.signal.butter(..., output='sos') and scipy.signal.sosfiltfilt: what happens with a “default padlen” and 4th-order SOS 1) butter: “4th order” with SOS - scipy.signal.butter(N,..., output='sos') returns a second-order sections array shaped (n_sections, 6). [1] - For bandpass/bandstop, SciPy states the resulting order for the final SOS matrix is 2*N; for simple low/high-pass, you typically get Nth-order realized as N biquad/sections in the SOS output representation. [1] 2) sosfiltfilt: default padlen is computed from the SOS structure - scipy.signal.sosfiltfilt(sos, x,..., padtype='odd', padlen=None) defaults to padtype='odd'. If padtype is None, no padding is used. [2] - If padlen is None, SciPy computes a default padlen value as: 3 * (2 * len(sos) + 1 - min((sos[:, 2] == 0).sum, (sos[:, 5] == 0).sum)) [2] (i.e., it depends on the number of SOS sections and whether there are zeros/poles at the origin for the implemented SOS representation). [2] - Note: padlen=0 implies no padding. [2] 3) Key implication for the edge artifact you mentioned (flattening near the ends) - If you set padtype=None in sosfiltfilt, you disable padding even though sosfiltfilt is a forward-backward filter that otherwise uses padding/extension to reduce edge transients. With no padding, the backward/forward initialization and signal extension assumptions can lead to boundary artifacts (often observed as flattening or other end effects). This is consistent with the fact that sosfiltfilt’s API explicitly ties edge handling to padding via padtype/padlen. [2] 4) Practical fixes - Prefer the default edge handling: do NOT set padtype=None; let padtype default to 'odd' and keep padlen=None so SciPy chooses the computed default padlen. [2] - If artifacts persist, increase padlen (not padtype=None). sosfiltfilt requires padlen < x.shape[axis] - 1. [2] - If you want “no delay” + better numerical robustness for higher orders, sosfiltfilt is generally the right approach; SciPy’s filtfilt documentation notes that sosfiltfilt (and SOS form) should be preferred over filtfilt for most filtering tasks due to fewer numerical problems. [3] 5) Your specific code snippet vs defaults - Your call sosfiltfilt(sos, data, padtype=None) forces “no padding” (padtype=None means no padding is used). [2] - To use the default padlen behavior, use padtype='odd' (or omit it) and padlen=None (or omit it). Example: data_filtered = scipy.signal.sosfiltfilt(sos, data, padtype='odd', padlen=None) (Default padtype='odd' and padlen is computed when padlen=None.) [2]
Citations:
- 1: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html
- 2: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.sosfiltfilt.html
- 3: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.filtfilt.html
Guard sosfiltfilt for short inputs to prevent ValueError on arrays shorter than the default padlen.
check_snr handles empty and silent audio but does not validate array length before calling scipy_signal.sosfiltfilt. Very short but non-empty, non-silent arrays (fewer than ~16 samples, depending on default padlen for 4th-order Butterworth) will raise ValueError. Add a minimum-length check before filtering, or use an alternate SNR estimator for short clips.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/quality.py` around lines 40 - 53, The
check_snr function calls scipy_signal.sosfiltfilt (via sosfiltfilt) without
guarding very short arrays, which raises ValueError; before creating sos or
calling scipy_signal.sosfiltfilt, check len(samples) against a small threshold
(e.g. 16) or compute the required padlen and compare to samples.size, and if the
input is too short, use a safe fallback (for example compute envelope with a
simple moving-average convolution or skip filtering and treat residual as zero /
return float("-inf") for unreliable SNR) so that envelope and noise are always
defined; update the branch around sos, envelope and noise in check_snr to use
this guarded fallback path.
| def detect_speech_segments( | ||
| audio: np.ndarray, | ||
| sample_rate: int, | ||
| ) -> list[tuple[float, float]]: | ||
| """Detect speech segments in mono audio using Silero VAD. | ||
|
|
||
| Args: | ||
| audio: 1-D mono audio array of floating-point samples in the range | ||
| ``[-1.0, 1.0]``. Silero VAD natively supports 8 kHz and 16 kHz | ||
| sample rates; 16 kHz is recommended. | ||
| sample_rate: Sample rate of ``audio`` in Hz. | ||
|
|
||
| Returns: | ||
| Ordered list of ``(start_seconds, end_seconds)`` tuples describing | ||
| the detected speech intervals. Returns an empty list if no speech | ||
| is detected. | ||
| """ | ||
| model, utils = _load_silero_vad() | ||
| # Silero's `utils` is a tuple; `get_speech_timestamps` is the first | ||
| # element by convention. | ||
| get_speech_timestamps = utils[0] | ||
|
|
||
| tensor = torch.as_tensor(audio, dtype=torch.float32) | ||
| timestamps: list[dict[str, int]] = get_speech_timestamps( | ||
| tensor, | ||
| model, | ||
| sampling_rate=sample_rate, | ||
| ) | ||
|
|
||
| return [(ts["start"] / sample_rate, ts["end"] / sample_rate) for ts in timestamps] |
There was a problem hiding this comment.
Validate sample_rate and audio shape before model invocation.
sample_rate is used both in Silero call and final division without guards. For sample_rate <= 0 this can fail with unclear errors (or divide by zero). Add explicit checks (sample_rate > 0, supported rates, audio.ndim == 1) and raise a typed validation error.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/audio_processor/preprocessing/vad.py` around lines 58 - 87, In
detect_speech_segments validate inputs before calling _load_silero_vad: check
that sample_rate is an int > 0 and one of Silero's supported rates (e.g., 8000
or 16000), check that audio is a 1-D numpy array (audio.ndim == 1) and that its
length > 0; if any check fails raise a typed validation error (ValueError) with
a clear message mentioning the offending parameter; perform these guards at the
top of detect_speech_segments so subsequent calls to _load_silero_vad,
get_speech_timestamps and the final division are safe.
| @pytest.mark.unit | ||
| def test_convert_to_wav_invokes_ffmpeg_with_argv_list(tmp_path: Path) -> None: | ||
| """Successful invocation passes an argv list (never a shell string).""" | ||
| src = tmp_path / "in.mp3" | ||
| src.write_bytes(b"\x00") | ||
| dst = tmp_path / "out.wav" | ||
|
|
||
| fake = MagicMock(spec=subprocess.CompletedProcess) | ||
| fake.returncode = 0 | ||
| fake.stderr = "" | ||
|
|
||
| with ( | ||
| patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"), | ||
| patch.object(ffmpeg_module.subprocess, "run", return_value=fake) as run_mock, | ||
| ): | ||
| result = convert_to_wav(src, dst) | ||
|
|
||
| assert result == dst | ||
| run_mock.assert_called_once() | ||
| args, kwargs = run_mock.call_args | ||
| cmd = args[0] | ||
| assert isinstance(cmd, list), "command must be argv list to avoid shell injection" | ||
| assert cmd[0] == "/usr/bin/ffmpeg" | ||
| assert str(src) in cmd | ||
| assert str(dst) in cmd | ||
| assert kwargs.get("shell", False) is False | ||
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_convert_to_wav_raises_when_ffmpeg_missing(tmp_path: Path) -> None: | ||
| """Calling ``convert_to_wav`` without ffmpeg on PATH raises OSError.""" | ||
| src = tmp_path / "in.mp3" | ||
| src.write_bytes(b"\x00") | ||
| dst = tmp_path / "out.wav" | ||
|
|
||
| with ( | ||
| patch.object(ffmpeg_module, "_FFMPEG_PATH", None), | ||
| pytest.raises(FfmpegConversionError, match="ffmpeg binary not found"), | ||
| ): | ||
| convert_to_wav(src, dst) | ||
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_convert_to_wav_raises_on_nonzero_exit(tmp_path: Path) -> None: | ||
| """A non-zero ffmpeg exit surfaces FfmpegConversionError with stderr context.""" | ||
| src = tmp_path / "in.mp3" | ||
| src.write_bytes(b"\x00") | ||
| dst = tmp_path / "out.wav" | ||
|
|
||
| fake = MagicMock(spec=subprocess.CompletedProcess) | ||
| fake.returncode = 1 | ||
| fake.stderr = "Invalid data found when processing input" | ||
|
|
||
| with ( | ||
| patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"), | ||
| patch.object(ffmpeg_module.subprocess, "run", return_value=fake), | ||
| pytest.raises(FfmpegConversionError) as exc_info, | ||
| ): | ||
| convert_to_wav(src, dst) | ||
|
|
||
| assert "exit 1" in str(exc_info.value) | ||
| assert ( | ||
| exc_info.value.details.get("stderr") | ||
| == "Invalid data found when processing input" | ||
| ) | ||
| assert exc_info.value.details.get("exit_code") == 1 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add tests for edge cases and error conditions.
The current tests cover the happy path and basic error cases, but several edge cases are missing per the coding guidelines requirement to "test for edge cases including: special characters in paths, network timeouts, partial failures":
Consider adding tests for:
- Input file doesn't exist: Verify behavior when
input_pathpoints to non-existent file - Output parent directory doesn't exist: The docstring states "parent directory must already exist" but this isn't validated or tested
- Special characters in paths: Unicode characters, spaces, quotes in filenames
- Empty input file: Zero-byte or minimal invalid audio file
- Subprocess timeout: If timeout is added to
subprocess.run(per my other comment), testTimeoutExpired - Path traversal attempts: Test with paths containing
..if validation is added
📋 Example additional test cases
`@pytest.mark.unit`
def test_convert_to_wav_with_special_chars_in_path(tmp_path: Path) -> None:
"""Handle filenames with spaces and Unicode correctly."""
src = tmp_path / "audio with spaces 音频.mp3"
src.write_bytes(b"\x00")
dst = tmp_path / "output 输出.wav"
fake = MagicMock(spec=subprocess.CompletedProcess)
fake.returncode = 0
fake.stderr = ""
with (
patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
patch.object(ffmpeg_module.subprocess, "run", return_value=fake) as run_mock,
):
result = convert_to_wav(src, dst)
assert result == dst
# Verify paths are properly passed to subprocess
cmd = run_mock.call_args[0][0]
assert str(src) in cmd
assert str(dst) in cmd
`@pytest.mark.unit`
def test_convert_to_wav_missing_output_parent_dir(tmp_path: Path) -> None:
"""Fail gracefully when output parent directory doesn't exist."""
src = tmp_path / "in.mp3"
src.write_bytes(b"\x00")
dst = tmp_path / "nonexistent" / "out.wav" # parent doesn't exist
with (
patch.object(ffmpeg_module, "_FFMPEG_PATH", "/usr/bin/ffmpeg"),
# Depending on implementation, this might raise before subprocess
pytest.raises((FfmpegConversionError, FileNotFoundError)),
):
convert_to_wav(src, dst)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/preprocessing/test_ffmpeg.py` around lines 19 - 84, Add unit tests
to cover the missing edge cases for convert_to_wav: add tests that call
ffmpeg_module.convert_to_wav with a non-existent input Path (assert it raises
FfmpegConversionError or FileNotFoundError), with an output whose parent
directory does not exist (assert proper error is raised), with filenames
containing spaces/unicode/quotes (assert subprocess.run is invoked with the
exact paths), with an empty zero-byte input file (assert ffmpeg error handling
attaches stderr/exit code), and a test that simulates subprocess.TimeoutExpired
by patching ffmpeg_module.subprocess.run to raise TimeoutExpired and asserting
convert_to_wav surfaces a FfmpegConversionError or timeout-specific behavior;
reference ffmpeg_module, convert_to_wav, FfmpegConversionError and
subprocess.run in your new tests.
| @pytest.mark.unit | ||
| def test_check_snr_silent_input_returns_neg_inf() -> None: | ||
| """All-zero input has no signal energy and returns -inf.""" | ||
| audio = np.zeros(1_000, dtype=np.float32) | ||
| assert check_snr(audio) == float("-inf") | ||
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_check_snr_empty_input_returns_neg_inf() -> None: | ||
| """Empty input returns -inf rather than raising.""" | ||
| assert check_snr(np.array([], dtype=np.float32)) == float("-inf") | ||
|
|
||
|
|
||
| @pytest.mark.unit | ||
| def test_check_snr_returns_finite_for_noisy_signal() -> None: | ||
| """A sine tone plus broadband noise yields a finite, lower SNR.""" | ||
| rng = np.random.default_rng(seed=0) | ||
| sr = 16_000 | ||
| t = np.linspace(0, 1.0, sr, endpoint=False) | ||
| tone = 0.5 * np.sin(2 * np.pi * 440.0 * t) | ||
| noise = rng.normal(scale=0.1, size=sr) | ||
| noisy = (tone + noise).astype(np.float32) | ||
|
|
||
| snr_db = check_snr(noisy) | ||
|
|
||
| assert math.isfinite(snr_db) | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Add a regression test for short non-empty inputs in check_snr.
Please add a tiny array case (e.g., length < filtfilt pad length) and assert deterministic behavior instead of exception.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/preprocessing/test_quality.py` around lines 25 - 51, Add a
regression unit test to tests/unit/preprocessing/test_quality.py that exercises
check_snr with a very short non-empty array (e.g., length 2–4, values not all
zero) to ensure it does not raise from internal filtering/pad logic and returns
a deterministic result; create a fixed tiny input (like np.array([0.1, -0.1],
dtype=np.float32)), call check_snr(tiny_input) and assert the result is
deterministic (e.g., math.isfinite(snr) and/or equals a specific expected float)
in a new test function named
test_check_snr_short_non_empty_input_returns_deterministic_value.
…qm-q3c2 Targeted single-package upgrade to patch: - GHSA-r6ph-v2qm-q3c2 (HIGH) — cryptography vulnerable to a subgroup attack due to missing subgroup validation for SECT curves Surfaced after the urllib3 fix landed; this was the next blocker on Dependency Review and OSV Vulnerability Scanner. Same targeted-upgrade pattern as the urllib3 fix to avoid wholesale-upgrade collateral. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… GHSAs Targeted single-package upgrade to patch: - GHSA-x2qx-6953-8485 (HIGH) — unsafe option check before shlex.split - GHSA-rpm5-65cw-6hj4 (HIGH) — command injection via Git options bypass - GHSA-7545-fcxq-7j24 (HIGH) — path traversal in reference APIs - GHSA-v87r-6q3f-2j67 (HIGH) — RCE via core.hooksPath newline injection - GHSA-mv93-w799-cj2w (HIGH) — bypass of CVE-2026-42215 patch Surfaced after cryptography fix; was the next Dependency Review blocker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GHSA-5789-5fc7-67v3 (HIGH) — path traversal via incorrect startswith() check - GHSA-24qx-w28j-9m6p (HIGH) — CORS origin validation bypass via re.match() - GHSA-5mrq-x3x5-8v8f (HIGH) — auth cookies remain valid after password reset Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GHSA-rch3-82jr-f9w9 (HIGH) — auth token theft via CommandLinker XSS - GHSA-37w4-hwhx-4rc4 (HIGH) — extension manager API/GUI policy bypass - GHSA-mqcg-5x36-vfcg (HIGH) — command linker HTML attribute exec Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…default config) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ITLE_RE) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cklog Targeted multi-package upgrade via uv lock --upgrade-package (per package, no wholesale upgrade) to clear the remaining HIGH-severity findings flagged by Dependency Review, OSV Vulnerability Scanner, and pip-audit: - nbconvert 7.16.6 -> 7.17.1 - notebook 7.5.0 -> 7.5.6 - pillow 12.0.0 -> 12.2.0 - pip 25.3 -> 26.1.1 - protobuf 6.33.1 -> 6.33.6 - pyasn1 0.6.1 -> 0.6.3 - pygments 2.19.2 -> 2.20.0 - pytest 9.0.1 -> 9.0.3 - python-dotenv 1.2.1 -> 1.2.2 - python-multipart 0.0.20 -> 0.0.29 - requests 2.32.5 -> 2.34.2 - tornado 6.5.2 -> 6.5.5 - virtualenv 20.35.4 -> 21.3.3 - werkzeug 3.1.4 -> 3.1.8 - filelock 3.20.0 -> 3.29.0 - jaraco-context 6.0.1 -> 6.1.2 Local verification: - uv run pip-audit: only py 1.11.0 PYSEC-2022-42969 remains (DISPUTED ReDoS, no upstream fix, already filtered by OSV scanner) - uv run pytest tests/unit/preprocessing/: 16 passed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sing deps
Audio preprocessing introduces a transitive dep tree (librosa -> numba,
silero-vad -> torch -> nvidia-cu12-*) that surfaces licenses not in the
existing allow-list, plus packages whose license metadata is not indexed
by GitHub's dependency API.
Adds to allow-licenses:
MPL-1.1, 0BSD, BSD-4-Clause, BSD-2-Clause-Views, Zlib,
LicenseRef-scancode-python-cwi,
LicenseRef-scancode-secret-labs-2011,
LicenseRef-scancode-unicode
Adds allow-dependencies-licenses for packages with no metadata-known
license (torch, lxml, silero-vad, onnxruntime, deepgram-sdk, protobuf,
jaraco-functools, ruamel-yaml{,-clib}, mkdocs-git-revision-date-localized
plus 15 nvidia-cu12-* CUDA packages).
Follows the documented precedent in this file (2026-05-17 entries for
MIT-CMU / ZPL-2.1 / HPND-Markus-Kuhn / LicenseRef-scancode-protobuf).
Per-license rationale documented inline.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surfaced by CI pip-audit after the batch upgrade landed; my local pip-audit had picked up an idna 3.11 venv even though uv.lock already had 3.15 (uv sync didn't auto-reinstall). CI uses --frozen so it reads the lock directly and saw the actual issue. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/dependency-review.yml:
- Around line 63-99: The allow-dependencies-licenses block currently contains
versionless PURLs (e.g., pkg:pypi/torch) which act as wildcards; update each
pkg:pypi/* entry in the allow-dependencies-licenses list to include the exact
approved version (e.g., pkg:pypi/torch@2.0.1) that was reviewed, or remove
entries that exist only to silence "no license" metadata so the action can
handle undetected licenses normally; also revisit the allow-licenses list (the
allow-licenses symbol) to remove or narrow any broad copyleft entries if you
intend to restrict repo-wide license acceptance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 33dcabc8-f193-4fde-9946-056b4c05d3ad
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (1)
.github/workflows/dependency-review.yml
| allow-licenses: MIT, MIT-CMU, Apache-2.0, BSD-2-Clause, BSD-2-Clause-Views, BSD-3-Clause, BSD-4-Clause, 0BSD, ISC, MPL-1.1, MPL-2.0, LGPL-2.1, LGPL-3.0, Python-2.0, Unlicense, CC0-1.0, GPL-3.0-or-later, ZPL-2.1, Zlib, HPND-Markus-Kuhn, LicenseRef-scancode-protobuf, LicenseRef-scancode-python-cwi, LicenseRef-scancode-secret-labs-2011, LicenseRef-scancode-unicode | ||
| # Packages whose license metadata is not indexed by GitHub's dependency | ||
| # API (action sees "no license"). These all have known permissive | ||
| # licenses that would be allowed under allow-licenses if metadata was | ||
| # present: | ||
| # - torch, lxml, silero-vad, onnxruntime, deepgram-sdk: BSD-3-Clause / MIT | ||
| # - protobuf: BSD-3-Clause (LicenseRef-scancode-protobuf upstream) | ||
| # - jaraco-functools, ruamel-yaml, ruamel-yaml-clib: MIT | ||
| # - mkdocs-git-revision-date-localized-plugin: MIT | ||
| # - nvidia-* CUDA stack: NVIDIA Software License Agreement (proprietary | ||
| # EULA, accepted as dependency of torch which is required for VAD) | ||
| allow-dependencies-licenses: >- | ||
| pkg:pypi/torch, | ||
| pkg:pypi/lxml, | ||
| pkg:pypi/silero-vad, | ||
| pkg:pypi/onnxruntime, | ||
| pkg:pypi/deepgram-sdk, | ||
| pkg:pypi/protobuf, | ||
| pkg:pypi/jaraco-functools, | ||
| pkg:pypi/ruamel-yaml, | ||
| pkg:pypi/ruamel-yaml-clib, | ||
| pkg:pypi/mkdocs-git-revision-date-localized-plugin, | ||
| pkg:pypi/nvidia-cublas-cu12, | ||
| pkg:pypi/nvidia-cuda-cupti-cu12, | ||
| pkg:pypi/nvidia-cuda-nvrtc-cu12, | ||
| pkg:pypi/nvidia-cuda-runtime-cu12, | ||
| pkg:pypi/nvidia-cudnn-cu12, | ||
| pkg:pypi/nvidia-cufft-cu12, | ||
| pkg:pypi/nvidia-cufile-cu12, | ||
| pkg:pypi/nvidia-curand-cu12, | ||
| pkg:pypi/nvidia-cusolver-cu12, | ||
| pkg:pypi/nvidia-cusparse-cu12, | ||
| pkg:pypi/nvidia-cusparselt-cu12, | ||
| pkg:pypi/nvidia-nccl-cu12, | ||
| pkg:pypi/nvidia-nvjitlink-cu12, | ||
| pkg:pypi/nvidia-nvshmem-cu12, | ||
| pkg:pypi/nvidia-nvtx-cu12 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/dependency-review.yml | sed -n '54,100p'Repository: ByronWilliamsCPA/audio-processor
Length of output: 3284
🌐 Web query:
actions/dependency-review-action allow-dependencies-licenses PURL versioning behavior
💡 Result:
In the GitHub actions/dependency-review-action, the allow-dependencies-licenses configuration option accepts a list of packages in Package URL (PURL) format to exclude them from license checks [1][2][3]. Regarding versioning and matching behavior: 1. Version Specificity: When a specific version is included in the PURL (e.g., pkg:githubactions/owner/repo@2.0.0), the action matches that specific version [1]. 2. Wildcard/Omission Behavior: If no version is specified in the PURL, the action treats it as a wildcard, effectively allowing all versions of that package [3]. 3. PURL Parsing: The action uses a permissive, custom parser for PURLs rather than strictly adhering to the standard packageurl-js library [4]. This allows it to handle certain non-standard or partial PURLs that users commonly provide, though it generally expects valid PURL structures [4]. 4. Limitations: As of May 2026, the action does not natively support PURL qualifiers (such as?license=...) to filter by specific licenses for a given package, nor does it support native wildcard characters (like *) within the PURL string itself for name matching, although these have been requested as enhancements [5][6]. In summary, providing a PURL without a version acts as a blanket allow for all versions of that package, while providing a versioned PURL restricts the exclusion to that specific version [1][3].
Citations:
- 1: https://github.com/actions/dependency-review-action
- 2: https://github.com/actions/dependency-review-action/blob/main/action.yml
- 3: https://github.com/actions/dependency-review-action/blob/main/docs/examples.md
- 4: https://github.com/actions/dependency-review-action/pull/753](https:/togithub.com/actions/dependency-review-action/pull/753)
- 5: allow-dependencies-licenses: allow-list specific license actions/dependency-review-action#1046
- 6: Support name wildcard on allow-dependencies-licenses purls actions/dependency-review-action#959
Pin PURL versions to prevent blanket license waivers for future upgrades.
The current configuration uses versionless PURLs in allow-dependencies-licenses (e.g., pkg:pypi/torch instead of pkg:pypi/torch@2.0.1), which act as wildcards matching all versions of each package indefinitely. This turns metadata gaps for the current PR into permanent exceptions for all future versions, even after dependency updates. Additionally, allow-licenses now admits copyleft licenses (GPL-3.0-or-later, LGPL-2.1, LGPL-3.0, MPL-1.1, MPL-2.0) repo-wide.
Scope approvals by adding explicit versions to the PURLs (e.g., pkg:pypi/torch@2.0.1) to limit exceptions to the exact packages reviewed in this PR. Alternatively, drop entries that only suppress missing metadata, and instead rely on the action's existing handling of undetected licenses.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/dependency-review.yml around lines 63 - 99, The
allow-dependencies-licenses block currently contains versionless PURLs (e.g.,
pkg:pypi/torch) which act as wildcards; update each pkg:pypi/* entry in the
allow-dependencies-licenses list to include the exact approved version (e.g.,
pkg:pypi/torch@2.0.1) that was reviewed, or remove entries that exist only to
silence "no license" metadata so the action can handle undetected licenses
normally; also revisit the allow-licenses list (the allow-licenses symbol) to
remove or narrow any broad copyleft entries if you intend to restrict repo-wide
license acceptance.
Trivy on PR #27 surfaced one new HIGH CVE in the python:3.12-slim base image (already had 5 documented CVEs in the same libgnutls30t64 package): - CVE-2026-42009 (HIGH, affected) — DoS via DTLS packet reordering No upstream Debian fix available. Same compensating-control argument as the other libgnutls entries: GnuTLS is transitive via curl/ffmpeg only, Python TLS uses OpenSSL via cryptography/urllib3, no untrusted TLS endpoints flow through GnuTLS in the application request path. Adds one .trivyignore entry and one row to the existing libgnutls30t64 table in docs/known-vulnerabilities.md (Discovered 2026-05-19, reassess 60-day cap per CLAUDE.md unfixed-CVE policy). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…L escape CI exposed a pre-existing docs build regression once PR #27 touched docs/: pygments 2.20.0 tightened html.escape() to reject None, breaking pymdown-extensions <= 10.17.2's code-highlight formatter which passed filename=None unconditionally. Affected any code block in any page. The bug was on main (verified by uv-syncing main's exact lock and running mkdocs build --strict locally) but the docs build job is path- filtered to docs-touching PRs, so it had stayed undetected. Upgrade pymdown-extensions v10.17.2 -> v10.21.3 (which fixes the filename handling) while keeping pygments 2.20.0 (which fixes CVE-2026-4539). Local verification: - uv run mkdocs build --strict: built in 1.84s, no errors - uv run pip-audit: only py 1.11.0 PYSEC-2022-42969 (already documented) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
…API hardening, decomposition) (#53) * refactor: remove dead preprocessing pipeline and orphaned exceptions The preprocessing/ package (PR #27) was superseded by the services/ implementations (PR #43) and is imported nowhere in production code. The two pipelines computed quality metrics with different algorithms (Butterworth-filter SNR vs spectral-percentile SNR) and maintained a second, independent Silero VAD model cache, creating a correctness ambiguity and duplicate maintenance burden. This removes: - src/audio_processor/preprocessing/{loader,ffmpeg,vad,quality}.py - src/audio_processor/exceptions.py (AudioLoadError, FfmpegConversionError were only used by the preprocessing modules; the canonical hierarchy lives in core/exceptions.py) - tests/unit/preprocessing/ The wired-in services/ implementations remain the single source of truth. Suite: 436 passing, coverage 89.98%. * feat: unify job lifecycle behind a shared JobStore and enable ARQ enqueue The API stored jobs in a process-local in-memory dict and never enqueued them, while the worker read/wrote a separate Redis store. A submitted job could never reach COMPLETED from the API's perspective, and /results and /artifacts were structurally unreachable (issue #50). Introduce core/job_store.py with a JobStore abstraction and two backends: - InMemoryJobStore (dev/tests; also exposes a sync mapping interface for direct record injection) - RedisJobStore (JSON-in-Redis keyed by job:{id}) Both backends share one key scheme and serialization, so the API and the separate-process worker observe the same state. - routes.py resolves the store via app.state (falling back to the in-memory store) and enqueues to ARQ when enqueue_enabled and a pool is attached. - audio_tasks._update_job_status delegates to RedisJobStore (single source of truth; removes the duplicated get/merge/set + key literal). - api/__init__.py gains a lifespan that opens an ARQ pool and attaches a RedisJobStore when job_store_backend == "redis". Defaults (memory backend, enqueue disabled) preserve existing behavior; arq is imported lazily so the API does not hard-depend on the 'jobs' extra. New config: job_store_backend ("memory"|"redis"), enqueue_enabled (bool). Tests: add test_job_store.py (both backends) and test_audio_tasks.py, which gives process_audio_job its first coverage and verifies a job reaches COMPLETED with result+artifacts in the shared store. 449 passing, 92.76%. * feat(security): add API-key auth, rate limiting, and safe streaming uploads Addresses two findings from the architecture review: F3 (no authn/authz): every endpoint was anonymous, so unauthenticated callers could drive FFmpeg work and paid Deepgram calls. F4 (unbounded upload): the size guard trusted the client Content-Length header and the body was read fully into memory before validation, enabling a memory-exhaustion DoS; orphaned temp files were only cleaned on the validation-error path. Changes: - api/security.py: require_api_key (constant-time X-API-Key check) and a per-client fixed-window rate_limit dependency. Both are gated by config and default OFF, so existing behavior is preserved. - require_api_key is attached at the /api/v1 router; rate_limit guards the expensive POST /process endpoint. - routes.py: stream uploads to disk in 1 MiB chunks with a hard byte cap enforced on bytes actually read (no full-body buffering, header not trusted). A finally block guarantees the temp file is removed on every failure path; on success the worker owns and deletes it. - config: auth_required, api_keys (+ api_key_set), rate_limit_enabled, rate_limit_requests, rate_limit_window_seconds. No new third-party dependency is added (keeps the audited dependency set stable); the rate limiter is process-local and documented as a safety net to complement a gateway/Redis limiter in multi-process deployments. Tests: test_security.py and test_routes_hardening.py cover auth (open/401/ wrong/ok/misconfig-500), rate limiting (under/over/per-client), the streaming cap, and temp-file cleanup on failure. 462 passing, coverage 92.79%. * docs: document job/security settings in .env.example; apply ruff format - Add JOB_STORE_BACKEND, ENQUEUE_ENABLED, AUTH_REQUIRED, API_KEYS, RATE_LIMIT_* to .env.example so the new configuration is discoverable. - Apply ruff format to api/__init__.py and test_security.py (whitespace only). * fix(test): move UploadFile import into TYPE_CHECKING block CodeQL flagged UploadFile as an unused runtime import; it is only used in a cast() string annotation, so it belongs in the type-checking block. * refactor: remove dead Python 3.10 UTC compatibility shims requires-python is >=3.11, so datetime.UTC is always available and the sys.version_info guard's else branch was unreachable dead code, duplicated across four modules (models, routes, audio_tasks, worker). Replace each with a direct `from datetime import UTC` and drop the now-unused sys/timezone imports. * refactor: remove template scaffolding from worker and utils Cookiecutter residue that the audio pipeline never used: - jobs/worker.py: drop the example/stub tasks (example_background_task, send_email_task, process_file_upload) and the no-op cleanup_old_data cron (it returned a hard-coded 0 yet was registered on the worker). Remove the two large triple-quoted "example" blocks (FastAPI integration + Celery alternative) and the now-unused asyncio/datetime/cron imports. WorkerSettings now registers only process_audio_job. - utils/financial.py: delete the empty placeholder module (no references). - Update test_worker.py to cover the real surface (process_audio_job registration, lifecycle hooks, enqueue_task) and CLAUDE.md project tree. 456 passing, coverage 92.96%. * refactor: decompose process_audio_job into staged helpers The 260-line task mixed orchestration, conversion, transcription, artifact generation, cleanup, and six near-identical inlined progress dicts. Extract focused, individually-readable helpers: - _progress(): builds a progress payload (collapses the 6 repeated blocks). - _convert_audio(): converter is_video/extract vs convert_for_asr branch. - _transcribe(): lazy Deepgram import + ConfigurationError tolerance. - _build_transcription_payload(): result serialization. - _generate_artifacts(): lazy import + generation-failure tolerance. process_audio_job is now a ~80-line orchestrator with the same external behavior and status transitions. Add tests for the two previously-uncovered error branches (generic-exception wrapping to AudioProcessorError + FAILED, and artifact-generation failure still completing the job). 458 passing, coverage 93.06%. * fix: address PR review (rate-limiter bound, config invariant, test coverage) - security.py: bound _RATE_WINDOWS with opportunistic eviction of expired windows above a soft cap, so a flood of unique client identifiers cannot grow the per-process map without bound. - config.py: add a model_validator enforcing the documented invariant that enqueue_enabled requires job_store_backend='redis', so the misconfiguration fails fast at startup instead of silently never processing jobs. - test_job_store.py: cover the bytes-decode branch of RedisJobStore._decode. * fix(test): resolve CodeQL unused-import on UploadFile CodeQL flagged UploadFile as unused because its only reference was the string in cast("UploadFile", ...), and ruff TC006 conversely requires the cast type to be quoted -- a direct conflict. Drop the cast/import entirely and pass the duck-typed _FakeUpload directly with a precise reportArgumentType ignore on the call. Also rename the autouse fixture _clear_store -> clear_store to match the suite convention and avoid reportUnusedFunction. * fix(core): make RedisJobStore.update atomic with per-field HSET (#57) Closes #54. RedisJobStore stored each job as a single JSON string and updated it with a read-modify-write (GET, merge in Python, SET). Because the API and the ARQ worker write the same job from separate processes, two concurrent updates could clobber each other (lost update): whichever SET landed last overwrote every field the other writer had changed. This is the failure mode the F2 "shared job store" fix is meant to prevent, so the store must not reintroduce it. Store each job as a Redis hash (job:{id}) with individually JSON-encoded field values: - create: MULTI/EXEC of DELETE + HSET(mapping) + EXPIRE (replaces any prior record, clears stale fields, reapplies TTL). - update: HSET only the provided non-None fields + EXPIRE. HSET is atomic, so writers touching different fields no longer clobber each other; same-field writes remain last-writer-wins (the expected contract). - get: HGETALL + per-field decode, tolerant of bytes or str (clients created without decode_responses return bytes). Consolidate the two duplicated get/set FakeRedis stand-ins into a single hash-capable tests/unit/_fake_redis.py (adds hset/hgetall/expire/delete and a buffered transactional pipeline), and add a regression test asserting two disjoint concurrent field updates both survive. Suite: 460 passing, coverage 93.03%. ruff + ruff format clean; basedpyright src/: 0 errors. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address PR review — secrets, API hardening, error handling Re-applied on top of the merged atomic store (#57), which already implemented the per-field HSET RedisJobStore and the shared test fake. This commit adds the remaining review items that #57 did not cover: Security: - api_keys is now SecretStr (was plain str) so keys never leak via repr / model_dump / logs; api_key_set reads via get_secret_value(). - require_api_key compares against all keys without short-circuiting (avoids a timing side channel); the rate limiter hashes the API key before using it as an in-memory map key. - Fail-fast Settings validator: auth_required requires at least one api_key. Memory: - Rate-limiter window map is hard-capped: after evicting expired windows it evicts the oldest entries when still over the cap. Error handling: - _generate_artifacts catches any exception (DOM/json.dumps can raise TypeError/ValueError/KeyError) and records an artifacts_error field instead of failing a completed transcription. - Status-route timestamp parsing is defensive (no opaque 500 on a bad record). - _maybe_enqueue raises when enabled without a pool, and enqueue failures mark the job FAILED instead of stranding it QUEUED. Tests: test_config.py (validators, SecretStr non-leak) and _maybe_enqueue seam tests. 489 passing, coverage 92.69%. * style: drop unused noqa(BLE001) on re-raising except in _maybe_enqueue guard The except re-raises, so BLE001 does not apply and the suppression was unused (ruff RUF100), failing the Code Quality CI checks. * fix(security): use non-crypto hash for rate-limit bucket id CodeQL flagged hashlib.sha256(api_key) as weak hashing of sensitive data (it treats the API key as a password requiring a slow KDF). The hash here is only an in-process rate-limit bucket id, not password storage. Use the builtin non-cryptographic hash(), which keeps per-key buckets and still avoids holding the raw secret as a map key, without tripping the crypto sink. --------- Co-authored-by: Claude <noreply@anthropic.com>



Summary
Implements Phase 1 Sprints 1-4 of the audio preprocessing pipeline: pure-Python signal processing that runs before Deepgram transcription. No live API key is required for any code in this PR; all tests use mocks and synthetic numpy fixtures.
Modules Created
src/audio_processor/exceptions.pyAudioLoadError(extendsProjectBaseError)src/audio_processor/preprocessing/__init__.pysrc/audio_processor/preprocessing/loader.py(Sprint 1)soundfile(librosa for MP3), mono-mix, resample to 16 kHzload_audio(path: Path) -> tuple[np.ndarray, int]src/audio_processor/preprocessing/ffmpeg.py(Sprint 2)ffmpeg; validates binary onPATHat import; argv list only — no shell injectionconvert_to_wav(input_path: Path, output_path: Path) -> Pathsrc/audio_processor/preprocessing/vad.py(Sprint 3)torch.hub.load(thread-safe lazy cache)detect_speech_segments(audio: np.ndarray, sample_rate: int) -> list[tuple[float, float]]src/audio_processor/preprocessing/quality.py(Sprint 4)scipy.signal); peak-magnitude clipping detectorcheck_snr(audio) -> float,check_clipping(audio, threshold=0.99) -> boolTests (15 unit tests, all mocked)
tests/unit/preprocessing/test_loader.pytests/unit/preprocessing/test_ffmpeg.pysubprocess.runtests/unit/preprocessing/test_vad.pytorch.hub.load+get_speech_timestampstests/unit/preprocessing/test_quality.pyCoverage
Per-module statement coverage of the new files (from
pytest --cov=src/audio_processor/preprocessing --cov=src/audio_processor/exceptions):exceptions.pypreprocessing/__init__.pypreprocessing/quality.pypreprocessing/vad.pypreprocessing/ffmpeg.pypreprocessing/loader.pyEstimated repo-wide line-coverage delta: +5 to +7 percentage points — roughly 100 of ~110 new executable statements covered, against the existing ~582-statement codebase. (The repo as a whole remains below the 80% gate because of pre-existing untested modules like
cli.py,api/__init__.py,core/cache.py, etc.; this PR does not aim to fix those.)Dependencies
scipy>=1.11.0to the[audio]extra inpyproject.toml(used bycheck_snr). All other audio libraries (librosa,soundfile,torch,silero-vad) were already present.Verification
uv run pytest tests/unit/preprocessing/ -v→ 15 passeduv run ruff checkon the new files → cleanuv run ruff formatapplieduv run basedpyrighton the new files → 0 errors (28 warnings are all third-party stub-availability advisories onlibrosa/soundfile/scipy/torch.hub)uv run banditon the new files → 0 high, 1 medium, 2 low — all expected (subprocess usage inffmpeg.pywithshell=False;torch.hub.loadfor Silero VAD as specified in the task)Test plan
subprocess/torch.hub.loadconfirmed acceptableffmpegbinary present in CI image (required bypreprocessing.ffmpegimport)Generated by Claude Code
Summary by CodeRabbit
New Features
Chores
Tests
Chore