Constant Q transforms do not allow STFT operations without zero padding for chunked audio processing. #1788

MCMcCallum · 2023-12-20T00:01:53Z

Is your feature request related to a problem? Please describe.
Currently, all constant-Q operations call the stft operation with center=True, which forces some form of padding (e.g., zeros or a reflection). This prevents computing the equivalent of a very large CQT in chunks, because the padding is inserted at every chunk boundary. The center parameter should be configurable through the user interface so that the user can perform "gapless" chunking, providing each segment of audio as a continuation of the previous one. This would also require a simple interface to get the equivalent maximum CQT window size, so that a user can chunk the audio and align CQT windows appropriately.

Describe the solution you'd like

Add center parameter to cqt, vqt and pseudo_cqt operations.
Add a function (e.g., cqt_window_size(sample_rate, fmin, num_octaves, bins_per_octave, hop_length)) that returns the maximum CQT window size, so that it is easily accessible to the user.
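As a rough illustration of what such a helper might compute, here is a minimal sketch based on the textbook constant-Q definition (Q = 1 / (2^(1/bins_per_octave) - 1), window length = Q * sample_rate / frequency). The function below is hypothetical: it is not part of librosa, its signature is simplified relative to the one proposed above, and librosa's own filter lengths may differ depending on version, window, and the recursive downsampling used in cqt and vqt. The filter_scale argument is an assumed scaling factor on Q.

```python
import math


def cqt_window_size(sample_rate, fmin, bins_per_octave=12, filter_scale=1.0):
    """Hypothetical helper: longest constant-Q window length, in samples.

    Uses the textbook definition only. The lowest bin (fmin) has the
    longest window, so it bounds the context needed around a chunk.
    """
    # Quality factor: ratio of center frequency to bandwidth.
    q = filter_scale / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Window length in samples for the lowest-frequency bin.
    return int(math.ceil(q * sample_rate / fmin))


# Example: 12 bins per octave starting near C1 at 22050 Hz.
longest = cqt_window_size(22050, 32.70)
print(longest)
```

Under these assumptions, half of this length on each side of a chunk would be the minimum context needed for a pseudo-CQT-style transform, which has no recursive downsampling.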

Describe alternatives you've considered
Currently to do this, one must modify private functions of the constantq module such as __cqt_response, which is bad practice.

Additional context
N/A


bmcfee · 2023-12-20T01:00:10Z

This is quite a bit more complex than it is in, say, stft.

The core VQT function has a recursive downsampling step for computing the response at each octave (wavelet pyramid). Most of the resamplers we support do not provide a streaming / stateful block-processing API, so there isn't an obvious way to thread a stream generator through the function.

One could imagine just working with blocks and chopping off the padding cleverly like we do for STFT. However, this will generally not provide consistent results compared to a full signal analysis because the resamplers we use typically have zero phase (which is good for our purposes), and this tends to imply at least some forward lookahead if not a fully bidirectional pass. (For context, the original matlab code used a bidirectional butterworth filter for this, so each downsample depended on the entire signal.)

MCMcCallum · 2023-12-20T01:30:13Z

Ah yes, I saw the recursive downsampling in there for the VQT. The effective window size would then depend on the resampling method also... Although the above suggestions could be applied to the pseudo_cqt function.

In any case, I think some lookahead is acceptable. My current use case is not near-real-time stream processing, but chunked processing of very large audio files (e.g., several hours). As such, a lookahead of several seconds is acceptable, and is how I currently handle resampling elsewhere. I'll handle this for now by passing in additional lookback / lookforward samples that extend beyond the practical IR length of any IIR / FIR filter and line up with hop boundaries, and then "cleverly" chopping off the extraneous frames.

It would be a nice extra to have an interface that helped with this, provided lookback / lookforward lengths, either for the user to manage, or had a mode where it would store overlapping state internally. But I appreciate that is a much bigger change.
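The lookback / lookforward approach described above can be sketched as follows. This is a minimal illustration, not librosa code: centered_frames is a toy stand-in for a centered, zero-padded transform, and chunked_transform and all parameter names are hypothetical. The result is exact here only because each frame depends on a finite window no longer than twice the context; for cqt and vqt, where the resamplers look further ahead, the same scheme would presumably be approximate.

```python
import numpy as np


def centered_frames(y, n_fft, hop_length):
    """Toy centered transform: zero-pad by n_fft // 2, then |rfft| per frame."""
    y_pad = np.pad(y, n_fft // 2)
    n_frames = 1 + len(y) // hop_length
    window = np.hanning(n_fft)
    out = np.empty((n_fft // 2 + 1, n_frames))
    for t in range(n_frames):
        seg = y_pad[t * hop_length : t * hop_length + n_fft]
        out[:, t] = np.abs(np.fft.rfft(window * seg))
    return out


def chunked_transform(y, transform, hop_length, context_frames, frames_per_chunk):
    """Apply `transform` chunk by chunk, with extra context on both sides.

    The context is a whole number of hops, so frames inside each block stay
    aligned with the frames of a full-signal analysis. Frames that fall in
    the context region are discarded.
    """
    context = context_frames * hop_length
    n_frames = 1 + len(y) // hop_length
    pieces = []
    for f0 in range(0, n_frames, frames_per_chunk):
        f1 = min(f0 + frames_per_chunk, n_frames)
        # Lookback and lookforward, clipped to the signal boundaries.
        start = max(0, f0 * hop_length - context)
        stop = min(len(y), (f1 - 1) * hop_length + context)
        block = transform(y[start:stop])
        # Index of global frame f0 within this block.
        offset = f0 - start // hop_length
        pieces.append(block[:, offset : offset + (f1 - f0)])
    return np.concatenate(pieces, axis=1)


rng = np.random.default_rng(0)
y = rng.standard_normal(5000)
n_fft, hop_length = 64, 16

full = centered_frames(y, n_fft, hop_length)
chunked = chunked_transform(
    y,
    lambda block: centered_frames(block, n_fft, hop_length),
    hop_length=hop_length,
    context_frames=2,  # 2 * 16 = 32 samples = n_fft // 2
    frames_per_chunk=50,
)
print(np.allclose(full, chunked))
```

The key design choice is making the context a multiple of the hop length, so that chopping off the extraneous frames is a simple column slice.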

bmcfee · 2023-12-20T13:32:00Z

Yeah, it could definitely be implemented in pseudo-cqt, since that's just an stft + basis projection anyway.

I agree that some lookahead is fine here, but the issue is going to be maintaining state through the api. When we have things like iir filters that propagate state across blocks (pcen, preemphasis), we manage this by having an additional return variable to initialize the next call. This isn't my favorite api choice, but it is done for consistency with the scipy style (see lfilter).
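For reference, the scipy-style pattern mentioned above looks roughly like this with scipy.signal.lfilter, which returns the final filter state alongside the output whenever an initial state zi is passed in. The pre-emphasis coefficients and the block count are illustrative assumptions; this is a sketch of the pattern, not of librosa's internals.

```python
import numpy as np
from scipy.signal import lfilter

# Simple pre-emphasis filter: out[n] = y[n] - 0.97 * y[n - 1]
b, a = [1.0, -0.97], [1.0]

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)

# Reference: filter the whole signal in one call.
full = lfilter(b, a, y)

# Block-wise: carry the filter state from one call to the next.
zi = np.zeros(max(len(a), len(b)) - 1)  # start from rest
outputs = []
for block in np.array_split(y, 7):
    out, zi = lfilter(b, a, block, zi=zi)  # extra return value is the new state
    outputs.append(out)
blockwise = np.concatenate(outputs)

print(np.allclose(full, blockwise))
```

The caller is responsible for threading zi through every call, which is the API expansion referred to above.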

I think really the best way to go about this kind of thing would be to use generators instead of functions, which would allow internal state to be preserved without expanding the API. I think this kind of thing is doable with the soxr backend, but I haven't looked into it carefully. A generator interface has been on my mind for a while now, though, and it might make it into a plan for librosa 2.0, provided it doesn't conflict with our other plans (array API mainly).
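To make the generator idea concrete, here is a minimal sketch of a stateful block processor written as a generator. The name preemphasis_stream and its signature are hypothetical and not part of librosa. The filter state (the last sample of the previous block) lives inside the generator, so the caller never handles it.

```python
import numpy as np


def preemphasis_stream(blocks, coef=0.97):
    """Hypothetical generator: yields pre-emphasized blocks, keeping state internally."""
    prev = 0.0  # last input sample of the previous block
    for block in blocks:
        block = np.asarray(block, dtype=float)
        if block.size == 0:
            continue  # nothing to emit, and the state is unchanged
        # out[n] = y[n] - coef * y[n - 1], with y[-1] taken from the previous block
        shifted = np.concatenate(([prev], block[:-1]))
        yield block - coef * shifted
        prev = block[-1]


rng = np.random.default_rng(0)
y = rng.standard_normal(1000)

# Streamed result, assembled from independently supplied blocks.
streamed = np.concatenate(list(preemphasis_stream(np.array_split(y, 7))))

# Reference: the same filter applied to the whole signal at once.
full = y - 0.97 * np.concatenate(([0.0], y[:-1]))

print(np.allclose(full, streamed))
```

Compared with returning the state explicitly, the function signature stays the same size no matter how much internal state the processing needs.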

bmcfee added the enhancement and question labels on Dec 20, 2023
