Constant Q transforms do not allow STFT operations without zero padding for chunked audio processing. #1788

Open
MCMcCallum opened this issue Dec 20, 2023 · 3 comments
Labels
enhancement Does this improve existing functionality? question Issues asking for help doing something

Comments

@MCMcCallum

Is your feature request related to a problem? Please describe.
Currently, all constant-Q operations call the stft operation with center=True, which forces some form of padding (e.g., zeros or a reflection). This prevents computing the equivalent of a very large CQT in chunks, because the padding is inserted at every chunk boundary. The center parameter should be configurable through the user interface so that "gapless" chunking is possible, where the user provides each segment of audio as a continuation of the previous one. This would also require a simple interface for getting the equivalent maximum CQT window size, so that a user can chunk the audio and align CQT windows appropriately.

Describe the solution you'd like

  • Add center parameter to cqt, vqt and pseudo_cqt operations.
  • Add a function (e.g., cqt_window_size(sample_rate, fmin, num_octaves, bins_per_octave, hop_length)) that returns the maximum CQT window size, so that it is easily accessible to the user.
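
As a rough illustration of what such a helper might compute, here is a minimal sketch. It assumes the textbook constant-Q relation Q = filter_scale / (2^(1/bins_per_octave) - 1), with the longest filter belonging to the lowest bin at fmin. The function name comes from the request above, but the body, the reduced argument list, and the filter_scale argument are assumptions; librosa's own filter-length computation may differ in detail, and the recursive downsampling in the VQT adds further context that is not counted here.

```python
import math


def cqt_window_size(sample_rate, fmin, bins_per_octave=12, filter_scale=1.0):
    """Approximate length, in samples, of the longest CQT filter.

    Hypothetical helper: uses the classical constant-Q formula, which is
    an assumption and not necessarily what librosa computes internally.
    """
    # Quality factor: ratio of center frequency to bandwidth.
    q = filter_scale / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # The lowest frequency bin has the longest filter.
    return math.ceil(q * sample_rate / fmin)
```

With sample_rate=22050 and fmin near 32.7 Hz (C1), this gives a window on the order of eleven thousand samples, i.e. roughly half a second of context on the lowest bin.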

Describe alternatives you've considered
Currently to do this, one must modify private functions of the constantq module such as __cqt_response, which is bad practice.

Additional context
N/A

@bmcfee bmcfee added enhancement Does this improve existing functionality? question Issues asking for help doing something labels Dec 20, 2023
@bmcfee
Member

bmcfee commented Dec 20, 2023

This is quite a bit more complex than it is in, say, stft.

The core VQT function has a recursive downsampling step for computing the response at each octave (wavelet pyramid). Most of the resamplers we support do not provide a streaming / stateful block-processing API, so there isn't an obvious way to thread a stream generator through the function.

One could imagine just working with blocks and chopping off the padding cleverly, like we do for STFT. However, this will generally not give results consistent with a full-signal analysis, because the resamplers we use typically have zero phase (which is good for our purposes), and this tends to imply at least some forward lookahead, if not a fully bidirectional pass. (For context, the original MATLAB code used a bidirectional Butterworth filter for this, so each downsample depended on the entire signal.)

@MCMcCallum
Author

Ah yes, I saw the recursive downsampling in there for the VQT. The effective window size would then depend on the resampling method as well... although the above suggestions could still be applied to the pseudo_cqt function.

In any case, I think some lookahead is acceptable. My current use case is not near-real-time stream processing, but chunked processing of very large audio files (e.g., several hours). As such, a lookahead of several seconds is acceptable, and is how I currently handle resampling elsewhere. I'll handle this for now by passing in additional lookback / lookforward samples that are beyond the practical IR length of any IIR / FIR filter and line up with hop boundaries then "cleverly" chop off the extraneous frames.
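
The lookback / lookforward bookkeeping described above can be sketched as a small index planner. This is only an illustration under stated assumptions: frames are centered on multiples of hop_length (the center=True convention), and plan_chunks, ChunkPlan, and all argument names are hypothetical, not part of librosa.

```python
import math
from typing import Iterator, NamedTuple


class ChunkPlan(NamedTuple):
    sample_start: int  # first sample to read, including lookback
    sample_stop: int   # one past the last sample, including lookforward
    first_frame: int   # local index of the first frame to keep
    num_frames: int    # number of frames to keep from this chunk


def plan_chunks(n_samples, hop_length, frames_per_chunk, context_samples):
    """Yield read ranges and frame trims for chunked, hop-aligned analysis."""
    # Round the context up to whole hops so chunk starts stay hop-aligned.
    ctx = math.ceil(context_samples / hop_length)
    # Frame count of a centered transform over the whole signal.
    n_frames = 1 + n_samples // hop_length
    for f0 in range(0, n_frames, frames_per_chunk):
        f1 = min(f0 + frames_per_chunk, n_frames)
        start_frame = max(0, f0 - ctx)
        s0 = start_frame * hop_length
        s1 = min(n_samples, (f1 - 1 + ctx) * hop_length)
        yield ChunkPlan(s0, s1, f0 - start_frame, f1 - f0)
```

For each plan p, one would transform y[p.sample_start:p.sample_stop] and keep columns p.first_frame through p.first_frame + p.num_frames of the result. Frames near the trimmed edges are only as accurate as the chosen context allows, which is the consistency caveat raised earlier in the thread.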

It would be a nice extra to have an interface that helped with this, provided lookback / lookforward lengths, either for the user to manage, or had a mode where it would store overlapping state internally. But I appreciate that is a much bigger change.

@bmcfee
Member

bmcfee commented Dec 20, 2023

Yeah, it could definitely be implemented in pseudo-cqt, since that's just an stft + basis projection anyway.

I agree that some lookahead is fine here, but the issue is going to be maintaining state through the API. When we have things like IIR filters that propagate state across blocks (pcen, preemphasis), we manage this by having an additional return variable to initialize the next call. This isn't my favorite API choice, but it is done for consistency with the scipy style (see lfilter).
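
To make the state-return pattern concrete, here is a minimal pure-Python sketch of a first-order pre-emphasis filter that hands back its final state for the next call. The function name and the choice of "last input sample" as the state are simplifying assumptions for illustration; the actual zi conventions in scipy and librosa differ in detail.

```python
def preemphasis_block(x, coef=0.97, zi=0.0):
    """Filter one block with y[n] = x[n] - coef * x[n-1].

    Hypothetical helper: `zi` is the last input sample of the previous
    block, and the returned second value seeds the next call.
    """
    out = []
    prev = zi
    for sample in x:
        out.append(sample - coef * prev)
        prev = sample
    return out, prev


# Block-wise use: thread the returned state into the next call.
signal = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
first, state = preemphasis_block(signal[:3])
second, state = preemphasis_block(signal[3:], zi=state)
```

The caller has to carry `state` around between calls, which is the API expansion being discussed.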

I think really the best way to go about this kind of thing would be to use generators instead of functions, which would allow internal state to be preserved without expanding the API. I think this kind of thing is doable with the soxr backend, but I haven't looked into it carefully. A generator interface has been on my mind for a while now though, and it might make it into a plan for librosa 2.0, provided it doesn't conflict with our other plans (array API mainly).
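
As a toy illustration of the generator idea (not a proposed librosa interface), the sketch below keeps filter state in a local variable between blocks, so the caller never sees it. The function name and the simple pre-emphasis filter are assumptions chosen only to keep the example self-contained.

```python
def preemphasis_stream(blocks, coef=0.97):
    """Yield filtered blocks, with y[n] = x[n] - coef * x[n-1].

    Hypothetical sketch: the previous input sample lives inside the
    generator, so no state is passed through the caller.
    """
    prev = 0.0
    for block in blocks:
        out = []
        for sample in block:
            out.append(sample - coef * prev)
            prev = sample
        yield out


# Usage: feed any iterable of blocks and consume the results lazily.
stream_signal = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]
stream_blocks = [stream_signal[:2], stream_signal[2:5], stream_signal[5:]]
filtered = [s for block in preemphasis_stream(stream_blocks) for s in block]
```

The output matches filtering the whole signal in one pass, regardless of where the block boundaries fall.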


2 participants