pvx

pvx is a Python toolkit for high-quality time and pitch processing using a phase-vocoder/short-time Fourier transform (STFT) core.

It is designed for users who need musically usable results under both normal and extreme processing conditions, including long time stretching, formant-aware pitch movement, transient-sensitive material, and stereo/multichannel coherence preservation.

Primary project goals and differentiators:

  • audio quality first (phase coherence, transient integrity, formant stability, stereo coherence)
  • speed second (throughput/runtime tuning only after quality targets are met)
  • multichannel-native audio processing

At a glance, pvx provides:

  • a unified command-line interface (CLI) (pvx) plus installed tool entry points (pvxvoc, pvxfreeze, and others)
  • focused tools (voc, freeze, harmonize, retune, morph, and more) with shared argument conventions
  • deterministic central processing unit (CPU) paths for reproducible runs, plus optional graphics processing unit (GPU)/Compute Unified Device Architecture (CUDA) acceleration where available
  • native Apple Silicon support in the CPU path
  • comma-separated values (CSV)-driven automation workflows for segment-wise and trajectory-driven processing
  • microtonal support (ratio, cents, and scale-constrained retune workflows)
  • shared mastering/output controls (target loudness units relative to full scale (LUFS), limiting, clipping, dithering, and output policy options)
  • comprehensive generated documentation (Markdown, HyperText Markup Language (HTML), and Portable Document Format (PDF))

Value Proposition

pvx is designed for users who care more about artifact control than raw throughput. In practical terms:

  • higher control density than typical one-knob stretch tools (phase, transient, stereo coherence, formant, and mastering layers in one command surface)
  • explicit reproducibility for research and production (deterministic central processing unit (CPU) mode, manifests, benchmark gates, and stable command syntax)
  • first-class automation for time-varying control-rate signals (comma-separated values (CSV)/JavaScript Object Notation (JSON) interpolation, routing, follow, and chain)
  • one unified command-line interface (CLI) (pvx) with backwards-compatible direct entry points (pvxvoc, pvxfreeze, etc.)

Install

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e .
pvx --help

Or with uv:

uv venv .venv
source .venv/bin/activate
uv pip install -e .
uv run pvx --help

Persist pvx on your shell path (zsh):

printf 'export PATH="%s/.venv/bin:$PATH"\n' "$(pwd)" >> ~/.zshrc
source ~/.zshrc
pvx --help

Optional CUDA:

python3 -m pip install cupy-cuda12x

uv equivalent:

uv pip install cupy-cuda12x

Man pages are generated during make install / make install-dev:

python3 scripts/scripts_install_man_pages.py
MANPATH="$(pwd)/man:$MANPATH" man pvx

Installation and Runtime Matrix

| Platform / Runtime | CPU mode | GPU/CUDA mode | Notes |
| --- | --- | --- | --- |
| Linux x86_64 | Supported | Supported (CUDA + CuPy) | Best choice for NVIDIA CUDA acceleration. |
| Windows x86_64 | Supported | Supported (CUDA + CuPy) | Match CuPy package to installed CUDA runtime. |
| macOS Intel | Supported | Not CUDA | Use CPU mode; Metal acceleration is not a CUDA path. |
| macOS Apple Silicon (M1/M2/M3/M4) | Supported (native arm64) | Not CUDA | Native Apple Silicon support in CPU path; prefer quality-focused profiles first. |

Primary command:

pvx voc input.wav --stretch 1.2 --output output.wav

Fallback without PATH updates:

.venv/bin/pvx voc input.wav --stretch 1.2 --output output.wav
# or
python3 -m pvx.cli.pvx voc input.wav --stretch 1.2 --output output.wav

Fallback with uv:

uv run pvx voc input.wav --stretch 1.2 --output output.wav

Legacy wrappers remain available for backward compatibility.

Start Here (No Prior DSP Knowledge Needed)

If this is your first phase-vocoder workflow, think of pvx as:

  • a way to make audio longer or shorter without changing its pitch
  • a way to change pitch without changing duration
  • a way to do both while protecting attacks, timbre, and spatial image

You do not need to understand the math first. Start with copy-paste commands, listen, then adjust one parameter at a time. No ceremonial DSP robes required.

60-Second First Render

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e .
pvx voc input.wav --stretch 1.20 --output output.wav

Same flow with uv:

uv venv .venv
source .venv/bin/activate
uv pip install -e .
uv run pvx voc input.wav --stretch 1.20 --output output.wav

If pvx is not found after install:

printf 'export PATH="%s/.venv/bin:$PATH"\n' "$(pwd)" >> ~/.zshrc
source ~/.zshrc
pvx --help

If that does not work, it is usually a PATH issue, which is both common and mildly annoying.

No PATH fallback:

.venv/bin/pvx voc input.wav --stretch 1.20 --output output.wav
# or
python3 -m pvx.cli.pvx voc input.wav --stretch 1.20 --output output.wav

uv fallback (no PATH changes):

uv run pvx voc input.wav --stretch 1.20 --output output.wav

What you should hear:

  • same pitch
  • about 20% longer duration
  • minor artifact risk on sharp percussive attacks
  • a small sense of relief that it worked first go

Stretch vs Pitch Shift (Plain Language)

| Operation | What changes | What should stay the same |
| --- | --- | --- |
| Time stretch (--stretch) | Duration/tempo | Pitch/key |
| Pitch shift (--pitch, --cents, --ratio) | Pitch/key | Duration/tempo |
| Combined stretch + pitch | Both | Clarity, transients, stereo image (as much as possible) |

Concrete examples:

  • --stretch 2.0: a 5-second file becomes about 10 seconds.
  • --pitch 12: one octave up.
  • --pitch -12: one octave down.
  • --ratio 3/2: perfect fifth up (just ratio).
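
The arithmetic behind these examples can be checked with a short sketch. The function names here are illustrative helpers, not part of pvx; the only assumptions are a constant stretch factor and 12-tone equal temperament for the semitone conversion.

```python
import math

def stretched_duration(duration_sec: float, stretch: float) -> float:
    """Output duration for a constant stretch factor."""
    return duration_sec * stretch

def ratio_to_semitones(ratio: float) -> float:
    """Express a frequency ratio in equal-tempered semitones."""
    return 12.0 * math.log2(ratio)

print(stretched_duration(5.0, 2.0))            # 10.0
print(ratio_to_semitones(2.0))                 # 12.0
print(round(ratio_to_semitones(3 / 2), 3))     # 7.02
```

The last line shows why --ratio 3/2 is described as a just fifth: it sits about 2 cents above the equal-tempered fifth of 7 semitones.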

Beginner Command Pack (Copy/Paste)

# Slower speech review
pvx voc speech.wav --preset vocal_studio --stretch 1.30 --output speech_slow.wav

# Faster speech review
pvx voc speech.wav --preset vocal_studio --stretch 0.85 --output speech_fast.wav

# Pitch up without changing speed
pvx voc vocal.wav --stretch 1.0 --pitch 3 --output vocal_up3.wav

# Pitch down with formant protection
pvx voc vocal.wav --stretch 1.0 --pitch -4 --pitch-mode formant-preserving --output vocal_down4_formant.wav

# Drum-safe stretch
pvx voc drums.wav --preset drums_safe --stretch 1.25 --output drums_safe.wav

# Stereo coherence lock
pvx voc mix.wav --stretch 1.2 --stereo-mode mid_side_lock --coherence-strength 0.9 --output mix_lock.wav

# Freeze one moment into a pad
pvx freeze hit.wav --freeze-time 0.25 --duration 12 --output hit_freeze.wav

# Morph two sounds
pvx morph a.wav b.wav --alpha 0.4 --output morph.wav

# Cross-synthesis: keep A timing/phase but imprint B timbre envelope
pvx morph a.wav b.wav --blend-mode carrier_a_envelope_b --alpha 0.75 --envelope-lifter 32 --output morph_env.wav

# True A->B trajectory morph over time (single command)
pvx morph A.wav B.wav --alpha controls/alpha_curve.csv --interp linear --output morph_traj.wav

# Mono source flying through a captured 4-channel space (A->B trajectory)
pvx trajectory-reverb source.wav --ir room_4ch.wav --coord-system cartesian --start -1,0,1 --end 1,0,1 --output flythrough.wav

# Retune to a major scale
pvx retune vocal.wav --root C --scale major --strength 0.85 --output vocal_retuned.wav

# Retune with alternate concert pitch (A4 = 432 Hz)
pvx retune vocal.wav --root A --scale minor --a4-reference-hz 432 --output vocal_a432.wav

# Retune with an explicit root fundamental (C4 ~= 261.6256 Hz)
pvx retune vocal.wav --root-hz 261.6256 --scale major --output vocal_c4_root.wav

# Ask pvx to recommend and use a root fundamental from the file
pvx retune vocal.wav --recommend-root --scale minor --output vocal_auto_root.wav

# Denoise then dereverb in one pipe
pvx denoise noisy.wav --reduction-db 8 --stdout | pvx deverb - --strength 0.3 --output cleaned.wav

More runnable recipes (72): docs/EXAMPLES.md

Wild experimentation pack (100 ideas): docs/CRAZY_100.md

If you run these and everything sounds exactly the same, either the command failed quietly or your source was already suspiciously perfect.

Time-Varying Control Signals (CSV/JSON)

When you want parameters to change over time, pass a comma-separated values (CSV) or JavaScript Object Notation (JSON) file directly to the flag:

pvx voc input.wav --stretch controls/stretch.csv --interp linear --output output.wav
pvx voc input.wav --pitch-shift-ratio controls/pitch.json --interp polynomial --order 3 --output output.wav
pvx voc input.wav --n-fft controls/nfft.csv --hop-size controls/hop.csv --output output.wav

Interpolation choices:

  • --interp none (stairstep / sample-and-hold)
  • --interp linear (default)
  • --interp nearest
  • --interp cubic
  • --interp exponential (piecewise exponential easing)
  • --interp s_curve (piecewise smoothstep S-curve easing)
  • --interp smootherstep (piecewise quintic S-curve easing)
  • --interp polynomial --order N (any integer N >= 1, default N=3; effective degree is capped to min(N, control_points-1))

Polynomial order examples:

  • --interp polynomial --order 1 (global straight-line fit)
  • --interp polynomial --order 2 (quadratic curve)
  • --interp polynomial --order 3 (cubic curve)
  • --interp polynomial --order 5 (higher-order fit; can overshoot)
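
To make the piecewise modes above concrete, here is a minimal sketch of how a value between two control points can be blended. It is not pvx's implementation: the function names, the tie-breaking rule for nearest, and the clamping outside the first and last points are assumptions made for illustration. The smoothstep and smootherstep polynomials are the standard cubic and quintic forms.

```python
from bisect import bisect_right

def ease(u: float, mode: str) -> float:
    """Map normalized segment position u in [0, 1] to a blend weight."""
    if mode == "none":
        return 0.0                       # sample-and-hold: keep the left value
    if mode == "nearest":
        return 0.0 if u < 0.5 else 1.0   # snap to the closer control point
    if mode == "linear":
        return u
    if mode == "s_curve":
        return u * u * (3.0 - 2.0 * u)   # cubic smoothstep
    if mode == "smootherstep":
        return u * u * u * (u * (6.0 * u - 15.0) + 10.0)  # quintic smootherstep
    raise ValueError(f"unsupported mode: {mode}")

def interpolate(points, t: float, mode: str = "linear") -> float:
    """Evaluate a piecewise curve through time-sorted (time_sec, value) points."""
    times = [p[0] for p in points]
    if t <= times[0]:
        return points[0][1]              # clamp before the first point
    if t >= times[-1]:
        return points[-1][1]             # clamp after the last point
    i = bisect_right(times, t) - 1       # index of the segment containing t
    (t0, v0), (t1, v1) = points[i], points[i + 1]
    weight = ease((t - t0) / (t1 - t0), mode)
    return v0 + weight * (v1 - v0)

points = [(0.0, 1.0), (1.0, 1.5), (2.0, 2.0)]
for mode in ("none", "nearest", "linear", "s_curve", "smootherstep"):
    print(mode, interpolate(points, 0.25, mode))
```

All five modes pass through the control points; they differ only in how they move between them, which is why switching modes changes the feel of a glide without changing its endpoints.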

Point-style CSV:

time_sec,value
0.0,1.0
1.0,1.5
2.0,2.0

Segment-style CSV:

start_sec,end_sec,value
0.0,0.5,1.0
0.5,1.0,1.25
1.0,2.0,1.6
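
If you generate or inspect these files from Python, the two CSV layouts can be told apart by their header row. The sketch below is a hypothetical reader, not pvx's parser: the function names, the half-open segment convention, and the default value outside any segment are assumptions.

```python
import csv
import io

def load_control_csv(text: str):
    """Return ("points", [(t, v), ...]) or ("segments", [(start, end, v), ...])."""
    reader = csv.DictReader(io.StringIO(text))
    fields = set(reader.fieldnames or [])
    if {"start_sec", "end_sec", "value"} <= fields:
        rows = [(float(r["start_sec"]), float(r["end_sec"]), float(r["value"]))
                for r in reader]
        return "segments", sorted(rows)
    if {"time_sec", "value"} <= fields:
        rows = [(float(r["time_sec"]), float(r["value"])) for r in reader]
        return "points", sorted(rows)
    raise ValueError(f"unrecognized control columns: {sorted(fields)}")

def segment_value(segments, t: float, default: float = 1.0) -> float:
    """Value of the segment whose half-open span [start, end) contains t."""
    for start, end, value in segments:
        if start <= t < end:
            return value
    return default  # assumed fallback when t lies outside every segment

kind, segments = load_control_csv(
    "start_sec,end_sec,value\n0.0,0.5,1.0\n0.5,1.0,1.25\n1.0,2.0,1.6\n"
)
print(kind, segment_value(segments, 0.75))   # segments 1.25
```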

Point-style JSON:

{
  "interpolation": "linear",
  "order": 3,
  "points": [
    {"time_sec": 0.0, "value": 1.0},
    {"time_sec": 1.0, "value": 1.5},
    {"time_sec": 2.0, "value": 2.0}
  ]
}

Multi-parameter JSON:

{
  "parameters": {
    "time_stretch": {
      "points": [
        {"time_sec": 0.0, "value": 1.0},
        {"time_sec": 3.0, "value": 2.0}
      ]
    },
    "n_fft": {
      "points": [
        {"time_sec": 0.0, "value": 1024},
        {"time_sec": 3.0, "value": 4096}
      ]
    }
  }
}

Important compatibility notes:

  • per-parameter dynamic controls (--stretch some.csv) cannot be combined with legacy --pitch-map / --pitch-map-stdin in the same run
  • dynamic --time-stretch cannot be combined with --target-duration

Interpolation graph examples (same control points, different interpolation mode/order):

[Figures: example curves for none (stairstep), nearest, linear, cubic, exponential, s_curve (smoothstep), smootherstep, and polynomial orders 1, 2, 3, and 5]

Core function graph gallery:

[Figures: pitch ratio vs semitones, pitch ratio vs cents, dynamics transfer curves, soft clip transfer functions, morph blend magnitude curves, mask exponent response, phase mix curve]

What Is a Phase Vocoder? (No Math Version)

A phase vocoder is a way to process sound in very short overlapping slices.

For each slice:

  1. It measures "how much of each frequency is present" and "where its phase is".
  2. It modifies timing and/or pitch in that spectral representation.
  3. It rebuilds audio from overlapping slices.

In short, you are taking audio apart, tidying it up, and putting it back together without pretending time is optional.

Why "phase" matters:

  • If magnitudes are changed without consistent phase evolution, output can sound smeared, chorus-like, metallic, or unstable.
  • Good phase handling keeps tones continuous across frames and improves naturalness.

In practical terms, pvx gives you controls for this quality layer:

  • phase locking
  • transient protection/hybrid modes
  • stereo coherence modes
  • formant-aware pitch workflows

Analog Tape Methods for Pitch/Time Shifting

Before digital signal processing (DSP), classical tape varispeed linked time and pitch: speed tape up and pitch rises; slow it down and pitch falls. The central engineering problem was decoupling those two controls.

Anton Springer before ELTRO (pre-commercial phase)

Before ELTRO-branded units became widely known, Anton Springer had already developed and demonstrated the core time/pitch regulation idea in the 1950s.

Practical timeline (pre-ELTRO):

  • around 1950: early patent-era work focused on changing information rate while preserving intelligibility
  • 1953: public demonstration of an acoustic speed/pitch regulator at the International Congress on Acoustics (Delft)
  • 1950s to early 1960s: continued engineering/publication work (including German technical press such as Funkschau) on segmented rotating-head tape replay methods

Core mechanism:

  • a rotating multi-head replay path processes short tape segments
  • segment repeat/skip behavior controls duration and pitch more independently than fixed-head varispeed
  • transition handling between segments determines perceived smoothness and artifact level

ELTRO as commercialization phase

The ELTRO information rate changer can be viewed as the commercialization and operational packaging of that earlier Springer regulator lineage. Historically, this analog segmented-playback family is a direct ancestor of modern time-stretch/pitch-shift systems.

Two practical historical points still matter for pvx:

  • Segment operations can create discontinuity artifacts if transitions are not handled carefully.
  • Perceptual quality depends heavily on continuity constraints (in analog systems: segment stitching; in phase-vocoder systems: phase evolution, transient handling, and crossfades).

Wendy Carlos’ account of the Eltro Mark II also documents production use in film post work, including 2001: A Space Odyssey voice treatment workflows where time and pitch manipulation were applied in controlled passes. That production history is a useful reminder that high-quality results usually come from methodical multi-stage processing, not one aggressive control move.

Mental Model (1 Minute)

Input waveform -> short overlapping frames -> frequency-domain edit -> overlap-add resynthesis -> output waveform

Useful intuition:

  • window size (--n-fft / --win-length) trades time detail vs frequency detail
  • hop size (--hop-size) controls frame overlap density
  • larger windows often help low-frequency tonal stability
  • transient handling is important for drums/plosives/onsets
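
The window/hop trade-off can be put into numbers with a few lines of arithmetic. This is a sketch under the simplifying assumption that the window length equals the FFT size; the function name is illustrative and not a pvx API.

```python
def stft_tradeoffs(sample_rate: int, n_fft: int, hop_size: int) -> dict:
    """Summarize the time/frequency trade-off of one STFT configuration."""
    return {
        "bin_spacing_hz": sample_rate / n_fft,            # frequency detail
        "window_ms": 1000.0 * n_fft / sample_rate,        # time detail
        "overlap_percent": 100.0 * (1.0 - hop_size / n_fft),
    }

for n_fft in (1024, 2048, 8192):
    print(n_fft, stft_tradeoffs(48000, n_fft, n_fft // 4))
```

At 48 kHz, moving from 2048 to 8192 points narrows the bin spacing from roughly 23.4 Hz to roughly 5.9 Hz, but each frame then spans about 171 ms instead of about 43 ms. That is the reason large windows favor low-frequency tonal material and small windows favor transients.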

30-Second Quick Start

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -e .
pvx voc input.wav --stretch 1.20 --output output.wav

Same quick start with uv:

uv venv .venv
source .venv/bin/activate
uv pip install -e .
uv run pvx voc input.wav --stretch 1.20 --output output.wav

If pvx is not found after install, add the virtualenv binaries to your shell path environment variable (PATH):

printf 'export PATH="%s/.venv/bin:$PATH"\n' "$(pwd)" >> ~/.zshrc
source ~/.zshrc
pvx --help

If you do not want to modify the path environment variable (PATH), run the same command through the repository wrapper:

.venv/bin/pvx voc input.wav --stretch 1.20 --output output.wav
# or
python3 -m pvx.cli.pvx voc input.wav --stretch 1.20 --output output.wav

Equivalent with uv:

uv run pvx voc input.wav --stretch 1.20 --output output.wav

What this does:

  • reads input.wav
  • stretches duration by 20%
  • writes output.wav

Prefer installed commands (pvx, pvxvoc, pvxfreeze) for stable entry points.

With uv, run wrappers the same way:

uv run pvx voc input.wav --stretch 1.2 --output output.wav
uv run pvx freeze input.wav --freeze-time 0.25 --duration 8 --output freeze.wav

Unified CLI (Primary Entry Point)

pvx is now the recommended command surface for first-time users.

pvx list
pvx help voc
pvx examples basic
pvx guided
pvx follow --example all
pvx chain --example
pvx stream --example
pvx stretch-budget --help

You can also use a convenience shortcut for the default vocoder path:

pvx input.wav --stretch 1.20 --output output.wav

This is equivalent to:

pvx voc input.wav --stretch 1.20 --output output.wav

5-Minute Tutorial (Single-File Workflow)

Use one file (voice.wav) and run three common operations.

  1. Inspect available presets/examples:
pvx voc --example all
  2. Time-stretch only:
pvx voc voice.wav --stretch 1.30 --output voice_stretch.wav
  3. Pitch-shift only (duration unchanged):
pvx voc voice.wav --stretch 1.0 --pitch -3 --output voice_down3st.wav
  4. Pitch-shift with formant preservation:
pvx voc voice.wav --stretch 1.0 --pitch -3 --pitch-mode formant-preserving --output voice_down3st_formant.wav
  5. Quick A/B check:
  • voice_down3st.wav should sound darker, with lower formants (a “larger” vocal tract impression).
  • voice_down3st_formant.wav should keep vowel identity more stable.

Conceptual Overview: What Is a Phase Vocoder?

A phase vocoder uses the short-time Fourier transform (STFT) to repeatedly answer this question: "What frequencies are present in this tiny time slice, and how do their phases evolve from one slice to the next?"

The core workflow is:

  1. Split audio into overlapping frames.
  2. Apply a window function to each frame.
  3. Transform each frame into spectral bins (magnitude + phase).
  4. Modify timing/pitch by controlling phase progression and synthesis hop.
  5. Reconstruct audio by overlap-adding all processed frames.

If you are new to this, the key idea is that phase continuity between frames is what separates high-quality output from "phasiness" artifacts.

1) Analysis STFT

pvx analyzes each frame with:

$$ X_t[k] = \sum_{n=0}^{N-1} x[n+tH_a]w[n]e^{-j2\pi kn/N} $$

where:

  • $x[n]$ represents the input signal sample at index $n$
  • $t$ represents frame index
  • $k$ represents frequency-bin index
  • $N$ represents frame size (--n-fft)
  • $H_a$ represents analysis hop
  • $w[n]$ represents the selected window (--window)

Plain-English meaning:

  • each frame is windowed, then transformed
  • output bin $X_t[k]$ is complex-valued (magnitude and phase)
  • this gives the per-frame spectral state used by downstream processing
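
As a literal reading of the analysis formula, the following sketch evaluates the sum directly for one frame. It is deliberately slow (a real implementation would use an FFT) and uses only the standard library; the function names and the choice of a periodic Hann window are assumptions for illustration, not pvx internals.

```python
import cmath
import math

def hann(n_fft: int):
    """Periodic Hann window w[n] of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]

def stft_frame(x, t: int, n_fft: int, hop_a: int, window):
    """Return X_t[k] for k = 0..n_fft-1 by direct evaluation of the sum."""
    start = t * hop_a                                   # frame offset t * H_a
    frame = [x[start + n] * window[n] for n in range(n_fft)]
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
        for k in range(n_fft)
    ]

# A cosine that sits exactly on bin 2 of an 8-point transform.
x = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
spectrum = stft_frame(x, t=1, n_fft=8, hop_a=4, window=hann(8))
print([round(abs(value), 3) for value in spectrum])
```

Each entry of the result is a complex number: its absolute value is the bin magnitude and its argument is the bin phase. With the Hann window, the energy of the on-bin cosine spreads into the two neighboring bins, which is the expected main-lobe behavior.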

2) Phase-Vocoder Propagation

Time stretching is controlled by phase evolution:

$$ \Delta\phi_t[k] = \mathrm{princarg}(\phi_t[k]-\phi_{t-1}[k]-\omega_kH_a) $$

$$ \hat\phi_t[k] = \hat\phi_{t-1}[k] + \omega_kH_s + \frac{H_s}{H_a}\Delta\phi_t[k] $$

where:

  • $\phi_t[k]$ is observed phase at frame $t$, bin $k$
  • $\hat\phi_t[k]$ is synthesized/output phase
  • $\omega_k$ is nominal bin center frequency in radians/sample
  • $H_s$ is synthesis hop (effective time-stretch control)
  • $\mathrm{princarg}(\cdot)$ wraps phase to $(-\pi, \pi]$

Plain-English meaning:

  • first estimate the true per-bin phase advance
  • then re-accumulate phase using a new synthesis hop
  • this lets duration change while preserving spectral continuity
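
A single-bin sketch of this update follows. It scales the measured phase deviation by the hop ratio $H_s/H_a$, which is how the update is usually written when the analysis and synthesis hops differ. The function names are illustrative, the bin center frequency is assumed to be $\omega_k = 2\pi k/N$, and this is not pvx's internal code.

```python
import math

def princarg(phase: float) -> float:
    """Wrap a phase in radians to the interval (-pi, pi]."""
    return phase - 2.0 * math.pi * math.ceil((phase - math.pi) / (2.0 * math.pi))

def propagate_phase(phi_prev, phi_curr, phi_hat_prev, k, n_fft, hop_a, hop_s):
    """Advance the synthesis phase of bin k by one frame."""
    omega_k = 2.0 * math.pi * k / n_fft                  # bin center, radians/sample
    deviation = princarg(phi_curr - phi_prev - omega_k * hop_a)
    true_freq = omega_k + deviation / hop_a              # estimated instantaneous frequency
    return phi_hat_prev + true_freq * hop_s              # re-accumulate with the synthesis hop

# A sinusoid slightly above the center of bin 10, stretched 2x (hop 16 -> 32).
omega = 2.0 * math.pi * 10.3 / 64
phi_prev = 0.2
phi_curr = princarg(phi_prev + omega * 16)               # observed phases arrive wrapped
phi_hat = propagate_phase(phi_prev, phi_curr, 0.2, k=10, n_fft=64, hop_a=16, hop_s=32)
print(abs(phi_hat - (0.2 + omega * 32)) < 1e-9)          # True
```

The check at the end confirms the point of the method: the output phase advances at the sinusoid's actual frequency over the longer synthesis hop, not at the nominal bin frequency.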

3) Pitch Mapping

Pitch controls map musical intervals to ratio:

$$ r = 2^{\Delta s/12} = 2^{\Delta c/1200} $$

where:

  • $r$ is pitch ratio
  • $\Delta s$ is semitone shift (--pitch)
  • $\Delta c$ is cents shift (--cents)

Practical interpretation:

  • $r > 1$: pitch up
  • $r < 1$: pitch down
  • formant options control whether vocal timbre shifts with pitch or is preserved
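
The mapping is small enough to verify by hand or in a few lines. The helper names below are illustrative and not pvx functions; they only restate the formula above.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Pitch ratio for a shift given in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def cents_to_ratio(cents: float) -> float:
    """Pitch ratio for a shift given in cents (100 cents per semitone)."""
    return 2.0 ** (cents / 1200.0)

print(semitones_to_ratio(12))              # 2.0
print(semitones_to_ratio(-12))             # 0.5
print(round(cents_to_ratio(700), 6))       # 1.498307
```

Note that the two forms agree only when the cents value is 100 times the semitone value, so --pitch 3 and --cents 300 describe the same ratio.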

When To Use Which Tool (Decision Tree)

Start
 |
 +-- Need general time/pitch processing on one file or batch?
 |    -> pvx voc
 |
 +-- Need sustained spectral drone from one instant?
 |    -> pvx freeze
 |
 +-- Need stacked harmony voices from one source?
 |    -> pvx harmonize
 |
 +-- Need timeline-constrained pitch/time map from CSV?
 |    -> pvx conform / pvx warp
 |
 +-- Need morphing between two sources?
 |    -> pvx morph
 |
 +-- Need monophonic retune to scale/root?
 |    -> pvx retune
 |
 +-- Need denoise or dereverb cleanup?
      -> pvx denoise / pvx deverb

Common Workflows

| Goal | Tool | Minimal command |
| --- | --- | --- |
| Vocal retune / timing correction | pvx voc | pvx voc vocal.wav --preset vocal --stretch 1.05 --pitch -1 --output vocal_fix.wav |
| Sound-design freeze pad | pvx freeze | pvx freeze hit.wav --freeze-time 0.12 --duration 10 --output-dir out |
| Tempo stretch with transient care | pvx voc | pvx voc drums.wav --stretch 1.2 --transient-preserve --phase-locking identity --output drums_120.wav |
| Harmonic layering | pvx harmonize | pvx harmonize lead.wav --intervals 0,4,7 --gains 1,0.8,0.7 --output-dir out |
| Cross-source morphing / cross-synthesis | pvx morph | pvx morph a.wav b.wav --blend-mode carrier_a_envelope_b --alpha 0.7 --output morph.wav |
| Phase-consistent denoising | pvx denoise / pvx noisefilter | pvx denoise speech.wav --noise-seconds 0.4 --reduction-db 5 --smooth 9 --output speech_clean.wav |
| AI dataset augmentation (deterministic) | pvx augment | pvx augment data/*.wav --output-dir aug_out --variants-per-input 4 --intent asr_robust --seed 1337 |
| Build control envelope map | pvx envelope / pvx lfo | pvx envelope --mode adsr --duration 8 --rate 20 --key stretch --output stretch_env.csv |
| Build periodic LFO map (sine/triangle/square/saw) | pvx lfo | pvx lfo --wave triangle --duration 8 --frequency-hz 0.5 --center 1.0 --amplitude 0.2 --key stretch --output stretch_lfo.csv |
| Reshape control map | pvx reshape | pvx reshape stretch_env.csv --key stretch --operation resample --rate 50 --interp polynomial --order 5 --output stretch_dense.csv |

More complete examples and use-case playbooks (72+ runnable recipes): docs/EXAMPLES.md

Wild experimentation pack (100 ideas): docs/CRAZY_100.md

Supported File Types

| Category | Supported types |
| --- | --- |
| Audio file input/output | All formats provided by the active soundfile/libsndfile build |
| Stream output (--stdout) | wav, flac, aiff/aif, ogg/oga, caf |
| Control maps | csv, json |
| Run manifests | json |
| Generated docs | html, pdf |

Full table of all currently supported audio container types: docs/FILE_TYPES.md

Performance and GPU (Quality-First)

pvx is not tuned as a "fastest possible at any cost" engine. Duh: it's written in Python. Start from quality-safe defaults, validate artifact levels, then reduce runtime where acceptable.

CPU path

  • default path is robust and portable
  • use power-of-two FFT sizes first (1024, 2048, 4096, 8192) for stable transform behavior and good throughput

CUDA path

pvx voc input.wav --device cuda --stretch 1.1 --output out_cuda.wav

Short aliases:

  • --gpu means --device cuda
  • --cpu means --device cpu

Quality-First tuning checklist

  • start with quality controls first: --phase-locking identity, transient protection, stereo coherence mode
  • choose larger --n-fft when low-frequency clarity matters; only reduce --n-fft when quality remains acceptable
  • use --multires-fusion when it audibly improves content; disable only if quality is unchanged
  • after artifact checks, optimize runtime via --auto-segment-seconds + --checkpoint-dir + --resume

CLI Discoverability and UX

pvx now provides a single command surface for discovery (pvx list, pvx help <tool>, pvx examples, pvx guided), while pvxvoc retains advanced controls for detailed phase-vocoder workflows.

Additional helper workflows:

  • pvx chain: managed multi-stage chains without manually wiring per-stage --stdout / - plumbing
  • pvx stream: stateful chunk engine for long-form streaming workflows (--mode stateful default, --mode wrapper compatibility fallback)
  • pvx stretch-budget: estimate max safe stretch from an input file and disk budget before launching extreme renders
  • pvx augment: deterministic augmentation dataset generation for machine-learning pipelines with JSONL/CSV manifests
  • pvx augment-manifest: validate/merge/stats utilities for augmentation manifests
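
To see why a disk budget matters for extreme stretches, the back-of-envelope arithmetic can be sketched as follows. This is not how pvx stretch-budget computes its estimate; the function names are hypothetical, and the sketch counts only raw PCM payload (no container headers, temporary files, or checkpoints).

```python
def output_bytes(duration_sec, stretch, sample_rate, channels, bytes_per_sample):
    """Approximate PCM payload size of a stretched render."""
    frames = int(round(duration_sec * stretch * sample_rate))
    return frames * channels * bytes_per_sample

def max_stretch_for_budget(duration_sec, sample_rate, channels, bytes_per_sample,
                           budget_bytes):
    """Largest constant stretch whose PCM payload fits within budget_bytes."""
    bytes_at_unity = duration_sec * sample_rate * channels * bytes_per_sample
    return budget_bytes / bytes_at_unity

# One minute of 48 kHz stereo at 24-bit (3 bytes per sample).
print(output_bytes(60.0, 2.0, 48000, 2, 3))                      # 34560000
print(round(max_stretch_for_budget(60.0, 48000, 2, 3, 2**30), 2))
```

Under these assumptions a 1 GiB budget allows a stretch factor of about 62 for that one-minute source, which shows how quickly very long stretches consume disk.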

pvx voc includes beginner UX features:

  • Intent presets:
    • Legacy: --preset none|vocal|ambient|extreme
    • New: --preset default|vocal_studio|drums_safe|extreme_ambient|stereo_coherent
  • Example mode: --example basic (or --example all)
  • Guided mode: --guided (interactive prompts)
  • Grouped help sections for discoverability:
    • I/O, Performance, Quality/Phase, Time/Pitch, Transients, Stereo, Output/Mastering, Debug
  • Beginner aliases:
    • --stretch -> --time-stretch
    • --pitch / --semitones -> --pitch-shift-semitones
    • --cents -> --pitch-shift-cents
    • --ratio -> --pitch-shift-ratio
    • --out -> --output
    • --gpu / --cpu -> device shortcut
  • Common output consistency:
    • shared tools now accept explicit single-file output via --output / --out in addition to --output-dir + --suffix
  • Script-local examples:
    • every major tool now prints copy-paste examples in --help (not only in the README)

Plan/debug aids:

  • --auto-profile
  • --auto-transform
  • --explain-plan
  • --manifest-json

New quality controls:

  • Hybrid transient engine:
    • --transient-mode off|reset|hybrid|wsola
    • --transient-sensitivity
    • --transient-protect-ms
    • --transient-crossfade-ms
  • Stereo/multichannel coherence:
    • --stereo-mode independent|mid_side_lock|ref_channel_lock
    • --ref-channel
    • --coherence-strength

Runtime metrics visibility:

  • Unless --silent is used, pvx tools now print an ASCII metrics table for input/output audio
    • sample rate, channels, duration, peak/RMS/crest, DC offset, ZCR, clipping %, spectral centroid, 95% bandwidth
    • plus an input-vs-output comparison table with input, output, and delta(out-in) columns: SNR, SI-SDR, LSD, modulation distance, spectral convergence, envelope correlation, transient smear, loudness/true-peak, and stereo drift metrics
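
For readers unfamiliar with the per-file metrics, the simplest ones can be computed as in the sketch below, using textbook definitions on a mono list of float samples. The function name and the exact zero-crossing convention are assumptions; pvx's own table may define or scale these values differently.

```python
import math

def basic_metrics(samples):
    """Peak, RMS, crest factor (dB), DC offset, and zero-crossing rate."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Count sign changes between neighboring samples.
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0.0) != (b < 0.0))
    return {
        "peak": peak,
        "rms": rms,
        "crest_db": 20.0 * math.log10(peak / rms) if rms > 0.0 else 0.0,
        "dc_offset": sum(samples) / n,
        "zcr": crossings / (n - 1),
    }

sine = [math.sin(2.0 * math.pi * n / 64) for n in range(64)]
print(basic_metrics(sine))
```

For a full-scale sine the crest factor comes out near 3 dB, a useful sanity check when reading the metrics table.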

Output policy controls (shared across audio-output tools):

  • --bit-depth {inherit,16,24,32f}
  • --dither {none,tpdf} and --dither-seed
  • --true-peak-max-dbtp
  • --metadata-policy {none,sidecar,copy}
  • --subtype remains available as explicit low-level override
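
As background for the --dither tpdf option, here is a generic sketch of triangular-PDF dither applied before rounding to a fixed bit depth. It illustrates the technique only: the function name, the seeding scheme, and the rounding and clamping details are assumptions and may not match pvx's output stage.

```python
import math
import random

def quantize_tpdf(samples, bits: int = 16, seed: int = 0):
    """Quantize floats in [-1, 1) to signed integers with TPDF dither."""
    rng = random.Random(seed)                 # a fixed seed makes the noise repeatable
    scale = 2 ** (bits - 1)
    lo, hi = -scale, scale - 1
    out = []
    for s in samples:
        # The difference of two uniform variates has a triangular PDF over (-1, 1) LSB.
        dither = rng.random() - rng.random()
        q = int(math.floor(s * scale + dither + 0.5))
        out.append(max(lo, min(hi, q)))       # clamp to the integer range
    return out

print(quantize_tpdf([0.0, 0.5, -0.5, 0.999], bits=16, seed=1))
```

Seeding the generator is what makes dithered output reproducible between runs, which is the role a dither seed option typically plays.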

Announcement Readiness: Top 5 Complaints (and Fixes)

If you announce pvx, these are the most likely first-wave complaints and the built-in responses.

| Likely complaint | Code-level response | Documentation response |
| --- | --- | --- |
| “I installed it, but pvx is not found.” | pvx doctor checks virtual environment activity, PATH, and dependency status with concrete fix commands. | Install and troubleshooting sections now point to pvx doctor as first triage. |
| “There are too many flags; I don’t know where to start.” | pvx quickstart prints a minimal five-step launch sequence. | Quick-start sections and examples now include the same sequence. |
| “My first render sounds phasey/choppy.” | pvx safe wraps pvx voc with conservative quality-first defaults (identity phase locking, hybrid transient mode, stereo coherence controls). | Quality guidance and cookbook examples now include pvx safe first-pass usage. |
| “Do I have to use FFT/STFT only?” | pvx transforms prints transform options (fft, dft, czt, dct, dst, hartley) plus availability and recommendation guidance. | Transform guidance is now explicit and linked from command help and examples. |
| “How do I know this build is healthy before a demo?” | pvx smoke runs a fast synthetic end-to-end render and verifies output creation/readback. | Launch checklist now includes pvx smoke before public demos/releases. |

Launch checklist (copy/paste):

pvx doctor
pvx quickstart input.wav --output output.wav
pvx safe input.wav --material mix --output output_safe.wav
pvx transforms
pvx smoke --output smoke_out.wav

AI Research and Data Augmentation

pvx now includes deterministic augmentation workflows for machine-learning and research datasets via pvx augment.

Key properties:

  • deterministic generation with --seed
  • repeatable train/validation/test assignment with --split
  • split-leakage control via grouping:
    • --grouping stem-prefix (default)
    • --group-separator "__"
  • optional balanced split assignment from label metadata:
    • --split-mode label_balanced
    • --split-mode speaker_balanced
    • --labels-csv labels.csv
  • intent profiles tuned for common tasks:
    • asr_robust (automatic speech recognition robustness)
    • mir_music (music information retrieval)
    • ssl_contrastive (self-supervised contrastive augmentation)
  • paired contrastive view mode: --pair-mode contrastive2
  • label perturbation policy: --label-policy preserve|allow_alter
  • policy files for reproducible settings: --policy augment_policy.json
  • deterministic parallel rendering with resume support:
    • --workers N
    • --resume
    • --append-manifest
  • provenance fields in manifest: source_sha256, output_sha256
  • optional per-output audit metrics in manifest: peak/rms/clip/zero-crossing
  • manifest outputs for reproducibility/audit:
    • JSON Lines (JSONL): default augment_manifest.jsonl
    • comma-separated values (CSV): default augment_manifest.csv
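
The idea behind seed-driven, group-aware split assignment can be sketched with a hash. This is a hypothetical illustration of the general technique, not pvx augment's algorithm: the function names, the hashing scheme, and the split names are assumptions, and the ratios simply reuse the 0.7,0.2,0.1 example.

```python
import hashlib
from pathlib import PurePosixPath

def group_key(filename: str, separator: str = "__") -> str:
    """Use the stem prefix before the separator as the leakage-control group."""
    return PurePosixPath(filename).stem.split(separator, 1)[0]

def assign_split(filename: str, seed: int, ratios=(0.7, 0.2, 0.1),
                 names=("train", "val", "test")) -> str:
    """Deterministically map a file's group to one split."""
    digest = hashlib.sha256(f"{seed}:{group_key(filename)}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64        # uniform in [0, 1)
    cumulative = 0.0
    for name, ratio in zip(names, ratios):
        cumulative += ratio
        if u < cumulative:
            return name
    return names[-1]

for path in ("data/spk01__take1.wav", "data/spk01__take2.wav", "data/spk02__take1.wav"):
    print(path, assign_split(path, seed=1337))
```

Because the hash input is the group key and not the full filename, every take that shares a prefix lands in the same split, which is what prevents leakage between training and evaluation sets.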

Command examples:

# Speech-focused augmentation set
pvx augment data/speech/*.wav --output-dir aug/speech --variants-per-input 6 --intent asr_robust --seed 1337

# Music-focused augmentation set with custom split ratios
pvx augment data/music/*.wav --output-dir aug/music --variants-per-input 4 --intent mir_music --split 0.7,0.2,0.1 --seed 2026

# Dry-run planning only (no audio renders, manifest only)
pvx augment data/*.wav --output-dir aug/plan --variants-per-input 3 --intent ssl_contrastive --dry-run --seed 42

# Balanced split assignment by speaker metadata
pvx augment data/*.wav --output-dir aug/speaker_bal --variants-per-input 3 --intent asr_robust --split-mode speaker_balanced --labels-csv labels.csv --seed 42

# Resume interrupted runs and append to existing manifest
pvx augment data/*.wav --output-dir aug/resume --variants-per-input 4 --intent mir_music --resume --append-manifest --seed 2026

# Manifest validation and merge tools
pvx augment-manifest validate aug/resume/augment_manifest.jsonl --strict
pvx augment-manifest merge aug/run_a/augment_manifest.jsonl aug/run_b/augment_manifest.jsonl --output-jsonl aug/merged_manifest.jsonl --output-csv aug/merged_manifest.csv

# Augmentation profile benchmark suite and gates (speech/music/noisy/stereo)
python benchmarks/run_augment_profile_suite.py --quick --gate --out-dir benchmarks/out_augment_profiles

# Refresh per-profile baselines after intentional benchmark changes
python benchmarks/run_augment_profile_suite.py --quick --refresh-baselines --out-dir benchmarks/out_augment_profiles_refresh

See full guide: docs/AI_AUGMENTATION.md

Benchmarking (pvx vs Rubber Band vs librosa)

Run a tiny benchmark (cycle-consistency metrics):

python3 benchmarks/run_bench.py --quick --out-dir benchmarks/out

With uv:

uv run python3 benchmarks/run_bench.py --quick --out-dir benchmarks/out

This uses the tuned deterministic profile by default (--pvx-bench-profile tuned). Use --pvx-bench-profile legacy to compare against the prior pvx benchmark settings.

Stage 2 reproducibility controls:

  • corpus manifest + hash validation: --dataset-manifest, --strict-corpus, --refresh-manifest
  • deterministic CPU checks: --deterministic-cpu, --determinism-runs
  • stronger gates: --gate-row-level, --gate-signatures
  • automatic quality diagnostics are emitted in benchmarks/out/report.md and report.json

Interpret benchmark priorities:

  • quality metrics are primary acceptance criteria
  • runtime is tracked as a secondary engineering metric

Reported metrics now include:

  • LSD, modulation spectrum distance, transient smear, stereo coherence drift
  • SNR, SI-SDR, spectral convergence, envelope correlation
  • RMS delta, crest-factor delta, bandwidth(95%) delta, ZCR delta, DC delta, clipping-ratio delta
  • Perceptual/intelligibility: PESQ, STOI, ESTOI, ViSQOL MOS-LQO, POLQA MOS-LQO, PEAQ ODG
  • Loudness/mastering: integrated LUFS delta, short-term LUFS delta, LRA delta, true-peak delta
  • Pitch/harmonic: F0 RMSE (cents), voicing F1, HNR drift
  • Transient timing: onset precision/recall/F1, attack-time error
  • Spatial/stereo: ILD drift, ITD drift, inter-channel phase deviation (low/mid/high/mean)
  • Artifact-focused: phasiness index, musical-noise index, pre-echo score
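
As a point of reference for one metric in the list above, SI-SDR has a widely used textbook definition that fits in a few lines of NumPy. This is an illustrative sketch, not the code in benchmarks/run_bench.py; the function name si_sdr_db and the synthetic test signal are assumptions made for the example.

```python
import numpy as np

def si_sdr_db(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (textbook form)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Project the estimate onto the reference to get the optimally scaled target.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(residual, residual))

# Synthetic stand-in signals (hypothetical, not a pvx corpus item).
rng = np.random.default_rng(0)
t = np.arange(4800) / 48000.0
reference = np.sin(2.0 * np.pi * 440.0 * t)
estimate = reference + 0.01 * rng.standard_normal(t.size)

score = si_sdr_db(reference, estimate)
rescaled = si_sdr_db(reference, 0.5 * estimate)  # a gain change should not matter
print(f"SI-SDR: {score:.2f} dB, after 0.5x gain: {rescaled:.2f} dB")
```

The two printed values should agree, which is the property that separates SI-SDR from plain SNR: a level change alone does not count as distortion.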

Notes:

  • Some perceptual standards require external/proprietary tools. When unavailable, pvx reports deterministic proxy estimates and includes a Proxy Fraction in benchmark markdown.
  • External hooks are supported via environment variables:
    • VISQOL_BIN
    • POLQA_BIN
    • PEAQ_BIN
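
Before a run, it can help to check which external hooks are actually usable. The sketch below only inspects the three environment variables named above; the helper names hook_status and proxy_share are hypothetical, and the share it prints is an illustration, not necessarily how pvx computes its Proxy Fraction.

```python
import os

HOOKS = ("VISQOL_BIN", "POLQA_BIN", "PEAQ_BIN")

def hook_status(environ=None):
    """Map each hook variable to True when it points at an executable file."""
    environ = os.environ if environ is None else environ
    status = {}
    for name in HOOKS:
        path = environ.get(name, "")
        status[name] = bool(path) and os.path.isfile(path) and os.access(path, os.X_OK)
    return status

def proxy_share(status):
    """Share of hooks that are unavailable (illustrative definition only)."""
    return sum(1 for ok in status.values() if not ok) / len(status)

status = hook_status()
for name, ok in status.items():
    print(f"{name}: {'external tool found' if ok else 'not available'}")
print(f"unavailable hooks: {proxy_share(status):.2f}")
```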

Run with regression gate against committed baseline:

python3 benchmarks/run_bench.py --quick --out-dir benchmarks/out --strict-corpus --determinism-runs 2 --baseline benchmarks/baseline_small.json --gate --gate-row-level --gate-signatures

uv equivalent:

uv run python3 benchmarks/run_bench.py --quick --out-dir benchmarks/out --strict-corpus --determinism-runs 2 --baseline benchmarks/baseline_small.json --gate --gate-row-level --gate-signatures

PVC-style parity benchmark for phase 3-7 operators:

python3 benchmarks/run_pvc_parity.py --quick --out-dir benchmarks/out_pvc_parity --baseline benchmarks/baseline_pvc_parity.json --gate --gate-tolerance 0.20

uv equivalent:

uv run python3 benchmarks/run_pvc_parity.py --quick --out-dir benchmarks/out_pvc_parity --baseline benchmarks/baseline_pvc_parity.json --gate --gate-tolerance 0.20

Visual Documentation

See docs/DIAGRAMS.md for:

  • expanded architecture and DSP atlas (Mermaid + ASCII)
  • quality-first tuning and metrics-flow diagrams
  • STFT analysis/resynthesis timelines
  • phase propagation and phase-locking diagrams
  • hybrid transient/WSOLA/stitching diagrams
  • stereo coherence mode diagrams
  • map/segment and checkpoint/resume diagrams
  • benchmark and CI gate flow diagrams
  • mastering chain and troubleshooting decision trees

Troubleshooting

“No readable input files matched…”

  • verify path and extension
  • quote globs in shells if needed
  • run pvx guided (or pvx voc --guided)

Output sounds “phasier” or “smear-y”

  • enable --phase-locking identity
  • enable --transient-preserve
  • reduce stretch ratio, or use --stretch-mode multistage

Freeze output flutters instead of sounding stationary

  • use pvx freeze ... --phase-mode instantaneous (default; best for stable holds)
  • if you want explicit bin-center stepping, use --phase-mode bin
  • use --phase-mode hold only for deliberately static/experimental phase behavior

Speech sounds robotic after pitch shift

  • use --pitch-mode formant-preserving
  • reduce semitone magnitude
  • increase overlap by using a smaller --hop-size relative to --win-length
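
The overlap bullet above comes down to simple arithmetic: the overlap fraction is one minus the hop-to-window ratio. A minimal sketch, with a hypothetical helper name and example sizes chosen only for illustration:

```python
def overlap_fraction(win_length, hop_size):
    """Fraction of each analysis window shared with the next one."""
    if not 0 < hop_size <= win_length:
        raise ValueError("hop_size must be in (0, win_length]")
    return 1.0 - hop_size / win_length

# Halving the hop at a fixed window length raises the overlap.
for hop in (512, 256, 128):
    print(f"win=2048 hop={hop}: overlap={overlap_fraction(2048, hop):.4f}")
```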

Denoise sounds chirpy or watery

  • reduce --reduction-db (for example from 8 to 5)
  • increase --smooth by 2 to 4 frames
  • increase --floor slightly (for example 0.1 to 0.2)
  • if possible, use --noise-file with clean room tone instead of inferred leading-noise estimation

CUDA requested but falls back

  • ensure CuPy install matches your CUDA runtime
  • test with --device cuda to force explicit failure if unavailable

Long extreme render interrupted

  • rerun with --checkpoint-dir ... --resume
  • consider --auto-segment-seconds 0.25 to reduce recompute scope

I need a massive stretch ratio (for example 1,000,000x)

  • run a budget estimate first so you do not launch an impossible render:
    • pvx stretch-budget input.wav --disk-budget 20GB --bit-depth 16 --requested-stretch 1000000
  • for script/CI gating, fail early if request exceeds budget:
    • pvx stretch-budget input.wav --disk-budget 20GB --requested-stretch 1000000 --fail-if-exceeds --json
  • if you proceed, prefer:
    • --target-duration over arbitrary giant ratios
    • --stretch-mode multistage
    • --auto-segment-seconds + --checkpoint-dir + --resume

FAQ

Can pvx time-stretch and time-compress?

Yes. A --stretch value above 1 lengthens the audio; a value below 1 shortens it.

Can I shift pitch without changing duration?

Yes. Use pitch flags with --stretch 1.0, e.g. --pitch, --cents, or --ratio.

Can I use the phase-vocoder path for denoising?

Yes. Use pvx denoise for phase-consistent short-time Fourier transform (STFT) spectral subtraction, or pvx noisefilter with a reusable response profile.

# Speech-safe denoise (conservative)
pvx denoise speech.wav --noise-seconds 0.4 --reduction-db 5 --floor 0.2 --smooth 9 --output speech_clean.wav

# Music-safe denoise (retain ambience/harmonics)
pvx denoise mix.wav --noise-seconds 0.3 --reduction-db 4 --floor 0.25 --smooth 7 --output mix_clean.wav

# Denoise then stretch in one pipe
pvx denoise noisy.wav --reduction-db 6 --stdout | pvx voc - --stretch 2.0 --output clean_stretch.wav

I installed pvx but get zsh: command not found: pvx. What now?

Use one of these immediately:

.venv/bin/pvx --help
# or
python3 -m pvx.cli.pvx --help

Then either keep using one of those forms or add .venv/bin to your path environment variable (PATH):

printf 'export PATH="%s/.venv/bin:$PATH"\n' "$(pwd)" >> ~/.zshrc
source ~/.zshrc

What is an .npz file in pvx?

.npz is NumPy's compressed container format. In pvx it stores reusable analysis/response artifacts:

  • .pvxan.npz: short-time Fourier transform (STFT) analysis payloads
  • .pvxrf.npz: derived frequency-response payloads

These files are binary, compact, and intended for machine reuse in repeatable pipelines.
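
Because .npz is a standard NumPy container, any such file can be inspected with numpy.load. The sketch below builds a small stand-in archive in memory so that it runs without pvx; the array names magnitude and sample_rate are hypothetical and do not describe the actual .pvxan.npz or .pvxrf.npz layout.

```python
import io
import numpy as np

# Stand-in archive (hypothetical keys); a real file path works the same way.
buffer = io.BytesIO()
np.savez_compressed(buffer, magnitude=np.ones((4, 3)), sample_rate=np.array(48000))
buffer.seek(0)

# List every stored array with its shape and dtype.
with np.load(buffer) as archive:
    summary = {name: (archive[name].shape, str(archive[name].dtype)) for name in archive.files}

for name, (shape, dtype) in summary.items():
    print(f"{name}: shape={shape} dtype={dtype}")
```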

Can I chain tools in one shell line?

Yes. Use --stdout on the upstream tool and - as the input on downstream tools.

For shorter one-liners without manual pipe wiring, use managed chain mode:

pvx chain input.wav --pipeline "voc --stretch 1.2 | formant --mode preserve" --output output_chain.wav

For chunked long renders through the default stateful stream engine:

pvx stream input.wav --output output_stream.wav --chunk-seconds 0.2 --time-stretch 3.0

Compatibility fallback (legacy segmented-wrapper behavior):

pvx stream input.wav --mode wrapper --output output_stream.wav --chunk-seconds 0.2 --time-stretch 3.0

Can I generate many random variants in one command?

Yes. Use --lucky N on processing workflows:

pvx voc input.wav --output-dir out --lucky 12
pvx freeze input.wav --output-dir out --lucky 8 --lucky-seed 42
pvx chain input.wav --pipeline "voc --stretch 1.5 | deverb --strength 0.4" --output out/chain.wav --lucky 5

Can I reverberate a mono file with a 4-channel impulse response and move source position from A to B?

Yes. Use pvx trajectory-reverb with cartesian or spherical coordinates:

# Cartesian trajectory
pvx trajectory-reverb source.wav --ir room_4ch.wav \
  --coord-system cartesian --start -1,0,1 --end 1,0,1 \
  --trajectory-shape ease-in-out --output flythrough_cart.wav

# Spherical trajectory (azimuth,elevation,radius)
pvx trajectory-reverb source.wav --ir room_4ch.wav \
  --coord-system spherical --start -90,0,1.2 --end 90,0,1.2 \
  --output flythrough_sph.wav
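
To make the geometry of these two examples concrete, here is a sketch of an eased straight-line move between two points, plus a spherical-to-cartesian conversion. Everything in it is an assumption made for illustration: the smoothstep easing curve, the axis convention (azimuth 0 straight ahead on +y, positive azimuth toward +x), and the helper names. pvx trajectory-reverb may define ease-in-out and its coordinate axes differently.

```python
import math

def ease_in_out(u):
    """Smoothstep easing: 0 -> 0, 1 -> 1, with zero slope at both ends."""
    u = min(1.0, max(0.0, u))
    return u * u * (3.0 - 2.0 * u)

def spherical_to_cartesian(azimuth_deg, elevation_deg, radius):
    """Assumed convention: azimuth 0 is +y, positive azimuth turns toward +x."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        radius * math.cos(el) * math.sin(az),
        radius * math.cos(el) * math.cos(az),
        radius * math.sin(el),
    )

def position_at(u, start, end):
    """Interpolate each coordinate from start to end at eased progress u."""
    w = ease_in_out(u)
    return tuple(a + (b - a) * w for a, b in zip(start, end))

# Cartesian endpoints taken from the first example above.
for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    x, y, z = position_at(u, (-1.0, 0.0, 1.0), (1.0, 0.0, 1.0))
    print(f"u={u:.2f} -> x={x:+.4f} y={y:+.4f} z={z:+.4f}")

# Spherical end point from the second example, under the assumed convention.
print(spherical_to_cartesian(90.0, 0.0, 1.2))
```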

Optional explicit speaker layout (azimuth,elevation per channel):

pvx trajectory-reverb source.wav --ir room_4ch.wav \
  --speaker-angles "-45,0;45,0;135,0;-135,0" \
  --start -1,0,1 --end 1,0,1 --output flythrough_layout.wav

Manual pipe chaining with --stdout and - input also works across several tools:

pvx voc input.wav --stretch 1.1 --stdout \
  | pvx denoise - --reduction-db 10 --stdout \
  | pvx deverb - --strength 0.4 --output cleaned.wav

Can I route control maps in pipes without awk?

Yes. The shortest path is the one-command helper:

pvx follow A.wav B.wav --output B_follow.wav --emit pitch_to_stretch --pitch-conf-min 0.75

Under the hood, this runs pitch tracking on A.wav, emits a control map, and feeds it to pvx voc on B.wav.

Manual pipe form is still available for explicit control-bus routing:

Pitch-to-stretch sidechain:

pvx pitch-track A.wav --emit pitch_to_stretch --output - \
  | pvx voc B.wav --control-stdin --output B_follow.wav

Explicit route example (map pitch_ratio -> stretch, force pitch_ratio to unity):

pvx pitch-track A.wav --output - \
  | pvx voc B.wav --control-stdin --route stretch=pitch_ratio --route pitch_ratio=const(1.0) --output B_time_follow.wav

Can I generate LFOs from the command line (sine/triangle/ramp/square/saw)?

Yes. Use pvx lfo (alias for pvx envelope) and select --wave:

# Triangle LFO using frequency in Hz
pvx lfo --wave triangle --duration 8 --frequency-hz 0.5 --center 1.0 --amplitude 0.2 --key stretch --output stretch_tri.csv

# Sine LFO using cycle count over clip duration
pvx lfo --wave sine --duration 12 --cycles 6 --center 1.0 --amplitude 0.25 --key stretch --output stretch_sine.csv

# Square LFO with duty cycle
pvx lfo --wave square --duration 8 --frequency-hz 2.0 --center 1.0 --amplitude 0.3 --duty-cycle 0.35 --key pitch_ratio --output pitch_square.csv

# Ramp envelope (non-periodic)
pvx lfo --wave ramp --duration 6 --start 1.0 --end 0.5 --key stretch --output stretch_ramp.csv

Then apply with control-bus routing:

pvx voc input.wav --stretch stretch_tri.csv --interp linear --output out.wav
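
For readers who want to see what such a control file might contain, the sketch below generates a comparable triangle LFO in plain Python. It is not the pvx lfo implementation: the column names (time_sec plus the key), the 4 Hz control rate, and the phase convention (starting at the center value and rising) are all assumptions chosen for illustration.

```python
import csv
import io
import math

def unit_wave(wave, phase, duty_cycle=0.5):
    """Value in [-1, 1] for a phase in [0, 1)."""
    if wave == "sine":
        return math.sin(2.0 * math.pi * phase)
    if wave == "triangle":
        # Starts at 0, peaks at phase 0.25, troughs at phase 0.75.
        return 4.0 * abs((phase - 0.25) % 1.0 - 0.5) - 1.0
    if wave == "square":
        return 1.0 if phase < duty_cycle else -1.0
    raise ValueError(f"unsupported wave: {wave}")

def lfo_rows(wave, duration, frequency_hz, center, amplitude, rate_hz=4.0, duty_cycle=0.5):
    """Sample center + amplitude * wave(t) at a fixed control rate."""
    count = int(round(duration * rate_hz)) + 1
    rows = []
    for i in range(count):
        t = i / rate_hz
        phase = (t * frequency_hz) % 1.0
        value = center + amplitude * unit_wave(wave, phase, duty_cycle)
        rows.append((round(t, 6), round(value, 6)))
    return rows

def to_csv(rows, key):
    """Serialize rows with an assumed two-column header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time_sec", key])  # hypothetical header names
    writer.writerows(rows)
    return buffer.getvalue()

rows = lfo_rows("triangle", duration=8.0, frequency_hz=0.5, center=1.0, amplitude=0.2)
print(to_csv(rows[:5], "stretch"))
```

With center 1.0 and amplitude 0.2, every value stays between 0.8 and 1.2, so the stretch never drops below 0.8x or rises above 1.2x.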

How large can stretch get before it becomes impractical?

Use:

pvx stretch-budget input.wav --disk-budget 20GB --bit-depth 16 --json

This estimates the maximum safe stretch from:

  • input frames/channels/sample rate
  • assumed output format + subtype/bit depth
  • available budget (--disk-budget or free space at --budget-path)
  • safety headroom (--safety-margin, default 0.90)

You can also ask whether a requested ratio fits:

pvx stretch-budget input.wav --disk-budget 20GB --requested-stretch 1000000 --fail-if-exceeds
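
The arithmetic behind such an estimate can be sketched from the inputs listed above. This is a back-of-the-envelope model, not the pvx stretch-budget code: it assumes uncompressed PCM output, ignores container headers, treats 20GB as 20 × 10^9 bytes, and uses a hypothetical function name.

```python
def max_safe_stretch(frames, channels, bit_depth, budget_bytes, safety_margin=0.90):
    """Largest stretch ratio whose raw PCM output fits inside the usable budget."""
    bytes_per_frame = channels * (bit_depth // 8)
    input_bytes = frames * bytes_per_frame
    usable_bytes = budget_bytes * safety_margin
    return usable_bytes / input_bytes

# Hypothetical input: 10 seconds of 48 kHz stereo, written as 16-bit PCM.
frames = 10 * 48_000
ratio = max_safe_stretch(frames, channels=2, bit_depth=16, budget_bytes=20 * 10**9)
print(f"max safe stretch: about {ratio:,.0f}x")
print("1,000,000x fits:", 1_000_000 <= ratio)
```

Under these assumptions a 10-second stereo clip tops out at a few thousand times its length on a 20 GB budget, which is why a 1,000,000x request is worth checking before launching a render.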

pvx pitch-track can now emit a broad feature vector for control-map routing, including:

  • pitch and voicing: f0_hz, pitch_ratio, confidence, voicing_prob, pitch_stability, note_boundary
  • loudness/dynamics: rms, rms_db, short_lufs_db, crest_factor_db, clip_ratio, transientness
  • spectral shape: spectral_centroid_hz, spectral_spread_hz, spectral_flatness, spectral_flux, rolloff_hz
  • timbre/descriptors: mfcc_01..mfcc_N, formant_f1_hz..formant_f3_hz, harmonic_ratio, inharmonicity
  • rhythm: tempo_bpm, beat_phase, downbeat_phase, onset_strength, transient_mask
  • stereo/noise/artifact proxies: ild_db, itd_ms, hum_50_ratio, hum_60_ratio, hiss_ratio
  • MPEG-7-style descriptors: mpeg7_* columns including centroid/spread/flatness/flux/rolloff/attack-time/temporal-centroid and coarse audio spectrum envelope bands.

Feature-routing examples:

# MFCC-driven pitch modulation on B
pvx pitch-track A.wav --feature-set all --mfcc-count 13 --output - \
  | pvx voc B.wav --control-stdin --route pitch_ratio=affine(mfcc_01,0.002,1.0) --route pitch_ratio=clip(pitch_ratio,0.5,2.0) --output B_mfcc_pitch.wav

# MPEG-7 spectral flux drives stretch with clipping
pvx pitch-track A.wav --feature-set all --output - \
  | pvx voc B.wav --control-stdin --route stretch=affine(mpeg7_spectral_flux,0.05,1.0) --route stretch=clip(stretch,0.8,1.6) --route pitch_ratio=const(1.0) --output B_flux_stretch.wav
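
The second route chain can be read as a small per-frame function. The sketch below assumes that affine(source, a, b) means a * source + b and that clip(source, low, high) bounds the value; the argument order is inferred from the examples rather than confirmed, and the flux values are made up for illustration.

```python
def affine(x, scale, offset):
    """Assumed meaning of affine(source, scale, offset): scale * x + offset."""
    return scale * x + offset

def clip(x, low, high):
    """Bound x to the closed interval [low, high]."""
    return max(low, min(high, x))

def apply_routes(frame):
    """Mirror the flux example: flux -> stretch, clipped, pitch_ratio forced to 1."""
    stretch = affine(frame["mpeg7_spectral_flux"], 0.05, 1.0)
    stretch = clip(stretch, 0.8, 1.6)
    return {"stretch": stretch, "pitch_ratio": 1.0}

# Hypothetical flux values, including two that hit the clip limits.
for flux in (-10.0, 0.0, 4.0, 20.0):
    print(flux, apply_routes({"mpeg7_spectral_flux": flux}))
```

Placing the clip after the affine map keeps an unusually large feature value from producing an extreme stretch.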

Expanded cookbook with many more single-feature, multi-feature, feature-vector, and multi-guide recipes:

Built-in pvx follow example printer:

pvx follow --example
pvx follow --example all
pvx follow --example mfcc_flux

Does pvx support microtonal workflows?

Yes. Use ratio/cents/semitone controls and CSV map modes, plus pvx retune --scale-cents, --a4-reference-hz, --root-hz, or --recommend-root.
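
The relationship between these units is standard: 1200 cents or 12 semitones span one octave, which is a frequency ratio of 2. The sketch below shows the conversions and a simple nearest-degree snap to a scale given in cents. The helper names and the snapping rule are illustrative assumptions, not the pvx retune algorithm.

```python
import math

def cents_to_ratio(cents):
    return 2.0 ** (cents / 1200.0)

def semitones_to_ratio(semitones):
    return 2.0 ** (semitones / 12.0)

def ratio_to_cents(ratio):
    return 1200.0 * math.log2(ratio)

def snap_to_scale(freq_hz, root_hz, scale_cents):
    """Move freq_hz to the nearest scale degree, measured in cents above root_hz."""
    octave, within = divmod(ratio_to_cents(freq_hz / root_hz), 1200.0)
    candidates = list(scale_cents) + [1200.0]  # allow rounding up to the next octave
    nearest = min(candidates, key=lambda degree: abs(degree - within))
    return root_hz * cents_to_ratio(octave * 1200.0 + nearest)

print(f"50 cents -> ratio {cents_to_ratio(50.0):.5f}")
print(f"ratio 3/2 -> {ratio_to_cents(1.5):.3f} cents")
major = [0.0, 200.0, 400.0, 500.0, 700.0, 900.0, 1100.0]  # 12-EDO major scale
print(f"450 Hz snapped to a 440 Hz-rooted major scale: {snap_to_scale(450.0, 440.0, major):.2f} Hz")
```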

Is every algorithm phase-vocoder-based?

No. The repo includes non-phase-vocoder modules too (analysis, denoise, dereverb, decomposition, etc.).

Why This Matters

How pvx differs from librosa and Rubber Band

  • librosa is a broad analysis library; pvx is an operational CLI toolkit with research- and production-oriented pipelines, shared mastering chain, map-based automation, and extensive command-line workflows.
  • Rubber Band is a strong dedicated stretcher; pvx emphasizes inspectable Python implementations, explicit transform/window control, CSV-driven control maps, integrated multi-tool workflows, and a quality-first tuning philosophy.

Why phase coherence matters

Unconstrained phase across bins/frames causes audible blur, chorus-like instability, and transient damage. Phase locking and transient-aware logic reduce these failures.
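
For context, the textbook phase-propagation step that these safeguards build on can be sketched as follows. This is the classic per-bin update, written with hypothetical names; it is not pvx's implementation and omits phase locking and transient handling entirely.

```python
import numpy as np

def propagate_phase(prev_in, curr_in, prev_out, hop_in, hop_out, n_fft):
    """Advance output phases by each bin's estimated instantaneous frequency."""
    bins = np.arange(prev_in.shape[0])
    omega = 2.0 * np.pi * bins / n_fft              # bin-center frequency, rad/sample
    deviation = curr_in - prev_in - omega * hop_in  # measured minus expected advance
    deviation = (deviation + np.pi) % (2.0 * np.pi) - np.pi  # wrap to [-pi, pi)
    inst_freq = omega + deviation / hop_in
    return prev_out + inst_freq * hop_out

# A steady sinusoid sitting between bins 10 and 11, stretched 2x (hop 256 -> 512).
n_fft, hop_in, hop_out = 1024, 256, 512
true_omega = 2.0 * np.pi * 10.3 / n_fft
n_bins = n_fft // 2 + 1
prev_in = np.zeros(n_bins)
curr_in = np.full(n_bins, true_omega * hop_in)
next_out = propagate_phase(prev_in, curr_in, np.zeros(n_bins), hop_in, hop_out, n_fft)

estimated_bin = next_out[10] / hop_out * n_fft / (2.0 * np.pi)
print(f"bin 10 tracks a partial at bin position {estimated_bin:.3f}")
```

Each bin is updated independently here, so neighboring bins that share one partial can drift apart in phase over many frames. That drift is the blur described above, and it is what phase locking constrains.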

When transient preservation matters

It matters most for drums, consonants, plosives, and percussive attacks, and is less critical for smooth pads and static drones.

When NOT to use a phase vocoder

  • strongly transient-critical material with very large ratio changes is often better served by waveform/granular strategies
  • extremely low-latency live paths are often better served by simpler time-domain methods
  • if your target is artifact-heavy texture, stochastic engines may be preferable to strict phase coherence

Lessons from Paul Koonce's PVC Package

Two useful historical references:

What translates well into modern pvx:

| PVC idea | Why it still matters | How pvx uses or extends it |
| --- | --- | --- |
| Tool-per-task command design (plainpv, twarp, harmonizer, etc.) | Keeps workflows composable and scriptable | pvx subcommands (voc, freeze, harmonize, conform, retune, morph, analysis, response, ...) plus pvx chain |
| Command help as a first-class UX surface | Beginners discover flags faster from terminal help than docs | pvx --help, grouped flag sections, --example, --guided, and script-level example blocks |
| Dynamic parameter control from external data files | Real workflows need time-varying control, not static knobs | Per-parameter CSV/JSON control-rate signals with interpolation (none, linear, nearest, cubic, exponential, s_curve, smootherstep, polynomial) |
| Shell-script driven reproducibility | Repeatable runs matter for research and production | Copy-paste recipes, pvx examples, benchmark scripts, JSON manifests, and deterministic CPU mode |
| Explicit defaults shown in help | Makes behavior predictable and debuggable | Shared defaults + output policy + ASCII metric tables for every non-silent run |
| Analysis/synthesis experimentation mindset | Quality work needs inspectable internals and comparisons | Transform selection (fft, dft, czt, dct, dst, hartley) and benchmark gates vs baselines |

Practical next steps inspired by PVC tradition:

  • keep every new tool runnable from one command without hidden state
  • keep dynamic-control file formats simple and text-editable
  • prefer transparent defaults and explicit artifact tradeoffs over black-box presets
  • keep docs and --help synchronized so terminal users are not forced into source code

Sorted roadmap for additional top-level pvx commands (highest implementation return on investment first):

| Phase | Priority | Proposed commands | Why this phase comes first |
| --- | --- | --- | --- |
| Phase 1 | Highest | doctor, inspect, validate, schema, preset, config | Removes onboarding/debug friction and makes workflows self-describing before adding more DSP complexity. |
| Phase 2 | High | render, graph, queue, watch, cache | Makes long runs and iterative workflows practical for real projects. |
| Phase 3 | Medium-High | mod, derive, route, quantize, smooth | Expands control-rate signal design so dynamic parameter workflows become concise and repeatable. |
| Phase 4 | Medium | bench, compare, regress, abx, report | Turns quality claims into measurable quality gates and release criteria. |
| Phase 5 | Medium-Low | align, match, stem, spatialize, live, serve | Adds advanced production and deployment workflows once foundations are stable. |

Roadmap sort key:

  • primary: user impact on quality-first workflows
  • secondary: implementation complexity and dependency risk
  • tertiary: leverage across multiple existing tools (voc, retune, morph, freeze, and wrappers)

Progressive Documentation Map

Complete Markdown documentation list (all .md documentation files):

Complete HyperText Markup Language (HTML) documentation list (all .html documentation files):

Portable Document Format (PDF) bundle:

  • docs/pvx_documentation.pdf

Community and Governance

License

MIT

Attribution

Copyright (c) 2026 Colby Leider and contributors. See ATTRIBUTION.md.
