Overlap-save convolution #286
Conversation
Force-pushed from b63cb65 to f17f6d0
I think this still has bugs in it. Let me fix it up before people spend too much time on it.
src/dspbase.jl (Outdated)

```julia
    out_start,
    out_stop,
    u_start
)
```
Could this be made a closure to avoid passing a gazillion parameters? (Of course, not doing dynamic dispatch here and just inlining that function would be even better, but I guess the type instability would be pretty bad for performance and so would be having these_blocks be a vector?)
Yeah, the problem is Val{n_edges}(), which is obviously not type stable. Using simple integers instead of Val types causes the code generated by subsets to be less than great, and more importantly the output of subsets, edge_dims, would no longer be a Tuple, which leads to type instability. I could play around with it a bit more to see if I can find a better solution. I hadn't thought of using a closure. In my previous experience with closures, it ends up forcing the compiler to do a lot of boxing and unboxing of captured variables. I'll try that.
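A minimal sketch of the inference issue (toy code, not the PR's subsets function): constructing Val from a runtime integer gives each call a different type, so downstream calls become dynamically dispatched, while a value that is already in the type domain stays inferable.

```julia
# With a runtime `n`, `Val(n)` has a different type per call, so the call
# below it is dynamically dispatched and the return type is not inferred.
unstable(n) = sum(ntuple(identity, Val(n)))

# Taking the value in the type domain keeps the whole call inferable.
stable(::Val{N}) where {N} = sum(ntuple(identity, Val(N)))

unstable(3)     # 6, but via dynamic dispatch when n is a runtime value
stable(Val(3))  # 6, fully inferred
```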
You're also right with these_blocks, it's a big performance win for it to be a Tuple, because then the output of the subsequent ProductIterator is type stable.
Usually variables that are only assigned to once should be safe to capture in a closure.
I tried a closure, with the variables being modified passed in as arguments, but it still substantially changed the output of @code_warntype, and had more type instability for some reason.
Sometimes, if the compiler loses track of closed-over variables, you can help it out by wrapping the closure in a let block, so you're limiting the scope of when that variable could be reassigned. Alternatively, there are sometimes places where some type annotations can help. There's more info and an example in this comment. Between these two workarounds it's generally pretty doable to make the closures fast.
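A toy illustration of the let-block trick (not code from this PR): a captured variable that is reassigned gets boxed, while computing the value first and rebinding it in a let before building the closure keeps the capture constant.

```julia
# The closure assigns to `total`, so the compiler boxes the capture.
function boxed_capture(xs)
    total = 0
    foreach(x -> (total += x), xs)
    total
end

# Compute first, then rebind in a `let` so the closure captures a constant.
function let_capture(xs)
    total = sum(xs)
    f = let total = total
        () -> total + 1
    end
    f()
end
```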
I guess I also don't like closures with a bunch of closed-over variables from a stylistic perspective. For me, it's hard to spot the closed-over variables, hard to separate out the closure and figure out what's going on with it, or to use tools like @code_warntype on closures. Basically, I'd rather have a huge parameter list than a huge set of captured variables.
Force-pushed from a9c5a7d to 0b06ca8
What is N?

N is the number of dimensions (vector, matrix, 3-d array, 4-d array).

Ah, thanks. These plots are super useful! Thanks for making them. Would it be a pain to generate the same plot but color each grid space red or blue based on the heuristic in

It's not perfect, but it does a pretty good job.

Looks pretty good to me. That's awesome, thanks for the visualization. Is the PR ready for review?

I think so? The style is a little sloppy, but I don't think it has bugs.
ssfrr
left a comment
Very nice (and I learned some things about ND-FFTs and FFTW along the way). Most of my comments are nitpicky things; some are things that caused a little confusion while reviewing. I think the main thing I'd like to see is more tests.
As a general note - the process of chopping up into blocks, handling the edges, iterating through the dimensions, etc. is all pretty complicated - do you think you could add some more docstrings and comments to the various internal functions that describe what's going on throughout the process, what each function is responsible for, and what assumptions they make about the input? It doesn't need to go into a huge amount of detail, but some signposts would be helpful.
Force-pushed from 1516f2b to 5c8096c
Whoops, found a bug.
Force-pushed from cb664e9 to 710cde3
@ssfrr I had to rebase, and in the process I made some of your comments "outdated" when I didn't really address them. It seems like the two outstanding issues are possibly adding more tests, and possibly making the edge function a closure instead of a standalone function. This is also what @martinholters wanted. If you can think of any more test cases, I'd be happy to add them. As for the closure, I think that replacing function parameters with captured variables just obscures the actual complexity, and it makes it harder for the compiler to generate the same quality of code. I played around a little with let blocks, but couldn't make it work. To be honest, I didn't try that hard to make the code work as a closure, because of my personal preference for being explicit about what is being shared between the two functions, so it may be possible with some more effort. Would you guys be more satisfied with a standalone function if I did a better job of annotating the parameters? Happy to discuss if you guys think there is a reason to use a closure.
I can go through my comment threads and see which still apply. I already marked a few as resolved. AFAICT the testing seems to be in pretty good shape at this point, but I want to do a little playing around with the PR to see if I turn anything up.

I'm wondering if the closure issue is a sign that we should be looking for a better way to refactor that logic. Maybe it's just me, but I find the edge finding and nested iteration and set manipulation pretty hard to follow. Also it means that a lot of the work of figuring out how to apply the block convolution is happening in the outer function, which contributes to the large number of parameters that need to be passed to the inner function. Also, the block convolution process is somewhat duplicated for the edge blocks and center blocks. Do you think it would work to simplify the outer loop to be something like:

```julia
function _conv_kern_os!(out,
                        u::AbstractArray{<:Any, N},
                        v,
                        su,
                        sv,
                        sout,
                        nffts) where N
    # Pre-allocation
    tdbuff, fdbuff, fftplan, ifftplan = os_prepare_conv(u, nffts)
    # Transform the smaller filter
    _zeropad!(tdbuff, v)
    filter_fd = os_filter_transform!(tdbuff, fftplan)
    filter_fd .*= 1 / prod(nffts) # Normalize once for brfft
    # nffts gives us the size of the FFT we'll perform, but only part of it
    # gets saved to the output array
    save_blocksize = min.(sout, nffts .- sv .+ 1)
    nblocks = cld.(sout, save_blocksize)
    for block_idx in CartesianIndices(nblocks)
        _conv_kern_os_block!(out, u, filter_fd, tdbuff, fdbuff,
                             fftplan, ifftplan,
                             Tuple(block_idx), save_blocksize, nffts)
    end
    out
end
```

and then have

I have a start on what
Yeah, the logic is pretty gnarled. I was playing around with it a bit the other day to try to simplify it, and I think it should be possible to do so. One thing that could be made more compact is the calculation of

Another reason for the large set of parameters is to re-use as many things as possible, for performance reasons. As you can no doubt tell, when I was writing the

The block convolution logic is definitely duplicated between the edge and center blocks. The inner loop of the edge method could be used on center blocks as is, in fact, because it is strictly more general. However, I wanted to separate the two because the logic in the center blocks is dead simple, and requires no zero padding. As far as I can tell, one of the big benefits of using overlap-save vs overlap-add is that you don't need to zero-pad for the majority of the blocks. The way that zero padding works at the moment means that there are no performance savings when you call
It seems like we could combine them but keep the performance gains by having something like:

```julia
out_region = ... # figure out where to copy the data to
if axes(out_region) != axes(out)
    fill!(out, zero(eltype(out)))
end
# copy the data
```

So it would only do the zero-fill if some parts of the output weren't getting filled with data. (Using
Maybe... I'll think about how to do that. It seems like at the end of the day it's going to be a performance question. I would like to measure how much performance is lost by doing all the extra logic and putting a branch in the inner loop. We should also keep a few things separate: 1) closure vs standalone function, 2) the number of inputs to that function, either as sneaky closed-over variables or explicit parameters, and 3) using a separate, simplified loop for the blocks that are provably simpler to deal with, or one loop capable of handling everything. If we decide to use one loop, then there's no reason to have it be in a separate function. As it is, most of the parameters being passed are constants (see my most recent commit for some more annotation). I might have to put this on the back burner for a while; I don't think I have time to do a significant overhaul at the moment.
Gentle bump.
Bump.
ssfrr
left a comment
Sorry for the extended delay. The added comments are really helpful in tracing through what's going on, thanks for adding that.
As a performance anecdote, for a pretty common audio use case (convolving a long 1D array with a much shorter one) this PR takes 25% of the time and allocates 25% as much memory. Nice!
Over the past week or so I've been getting some time in tracing through this PR. As far as I can tell the basic approach can be summarized in pseudocode as:

```julia
for block_idx in edge_blocks
    conv_edge_block!(out, u, v, block_idx)
end
for block_idx in center_blocks
    conv_center_block!(out, u, v, block_idx)
end
```

where conv_center_block! can be simpler and faster because it doesn't need any padding. The majority of the complexity here is figuring out the set of blocks in edge_blocks.
I think the code would be simpler and easier to follow as:

```julia
for block_idx in all_blocks
    if block_needs_padding(out, u, v, block_idx)
        conv_edge_block!(out, u, v, block_idx)
    else
        conv_center_block!(out, u, v, block_idx)
    end
end
```

which cuts out all the dimension juggling to iterate through the edge blocks. In your comment here you mention that for small convolution kernels there are orders of magnitude more center blocks than edge blocks, so it might be worth avoiding calling block_needs_padding on each block. I'd counter, though, that block_needs_padding should just be a few comparisons, which seems like it should be negligible compared to convolving the block. In your example of conv(rand(1000,1000), rand(3,3)), each center block does an 8x8 FFT, 40 complex multiplications, and an 8x8 IFFT, and copies 36 elements to the output.
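For illustration, here is a hypothetical sketch of what such a predicate could look like, with a simplified signature (block index and last-full-block counts per dimension, rather than the arrays themselves), under the assumption that the only blocks needing padding are the first along each dimension and those past the last fully covered block:

```julia
# Hypothetical helper, not part of this PR: a block needs zero-padding when it
# touches the start of a dimension (there is no full history of `u` to read)
# or extends past the last block fully covered by `u`.
function block_needs_padding(block_idx::NTuple{N,Int},
                             last_full_blocks::NTuple{N,Int}) where N
    any(ntuple(i -> block_idx[i] == 1 || block_idx[i] > last_full_blocks[i], N))
end

block_needs_padding((1, 2), (4, 4))  # true: first block along dimension 1
block_needs_padding((2, 3), (4, 4))  # false: interior (center) block
```

As the comment above argues, this is just a handful of integer comparisons per block, so it should be cheap relative to the per-block FFT work.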
If you're convinced this is the best way forward I don't want to block this PR. It's a good improvement and I haven't found any bugs in my review and a little testing. I'm just a little worried about maintenance and overall complexity.
```julia
ideal_save_blocksize = nffts .- sv .+ 1
# Number of samples that are "missing" if the valid portion of the
# convolution is smaller than the output
sout_deficit = max.(0, ideal_save_blocksize .- sout)
```
is the comment correct here? sout_deficit will only be nonzero when the block is larger than the output, not smaller. So it seems like this is the number of samples extra (past the output) that are present in the convolution block. Presumably this would only happen if there's only a single block in at least one dimension.
Good catch, it should read "if the output is smaller than the valid portion..."
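A small worked example of how these quantities relate (illustrative one-dimensional sizes, not values from the PR): the deficit is nonzero only when the valid portion of a block is larger than the output.

```julia
nffts = (16,)                            # FFT size per block
sv    = (5,)                             # filter size
sout  = (8,)                             # output size, smaller than one block's valid region
ideal_save_blocksize = nffts .- sv .+ 1  # (12,) valid samples per block
sout_deficit = max.(0, ideal_save_blocksize .- sout)  # (4,) samples past the output
```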
I think this comment still needs fixing, right?
Done
src/dspbase.jl (Outdated)

```julia
last_full_blocks = fld.(su, save_blocksize)
# block indices for center blocks, which need no padding
center_block_ranges = UnitRange.(2, last_full_blocks)
```
Are we guaranteed that all(save_blocksize .>= sv)? You can think of save_blocksize as being the hop size of the block processing, and if it hops less than sv then there might be multiple blocks in the beginning that need to be zero-padded when copying from u.
It might be safer to have something like

```julia
first_center_blocks = fld.((sv .- 1), save_blocksize) .+ 2
center_block_ranges = UnitRange.(first_center_blocks, last_full_blocks)
```

(I think I have that condition right, but it could use a double-check)
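A quick numeric check of the suggested formula, with values chosen so the hop is smaller than the filter (these numbers are an illustration, not DSP.jl code):

```julia
sv = (5,)                      # filter size
save_blocksize = (3,)          # hop size, smaller than sv
su = (20,)                     # signal size
first_center_blocks = fld.(sv .- 1, save_blocksize) .+ 2  # (3,): blocks 1-2 need padding
last_full_blocks = fld.(su, save_blocksize)               # (6,)
center_block_ranges = UnitRange.(first_center_blocks, last_full_blocks)  # (3:6,)
```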
is this resolved?
src/dspbase.jl (Outdated)

```julia
    u::AbstractArray{<:Any, N},
    data_dest::Tuple = ntuple(::Integer -> 1, N),
    data_region = CartesianIndices(u)) where N
    # Copy the data to the beginning of the padded array
```
This comment is no longer accurate - the data isn't necessarily copied to the beginning
Thanks for taking another look at this, and sorry that I've been so insistent. I'll test out the simplified logic that you've been suggesting. You're definitely right that this PR is pretty complicated, and I can see why it's hard to follow the logic that calculates the edge set.
No need to apologize. This squeaky wheel deserved the grease. 🙂
Quick vote of support for this PR. I was wondering what you folks thought the best way to apply this to a streaming application would be? Specifically, one where you only need to prepare the buffers, FFT plans, etc. once, and then input blocks continuously without a priori knowledge of how many blocks will be received.
Yeah, that would be awesome. A lot of the necessary DSP infrastructure is here, but it would take some more design around how to represent the incoming/outgoing streams, as well as some engineering around handling the start/end conditions. I've got a lot of thoughts, but it's a bit off-topic for this PR - feel free to open an issue referencing this if you want to discuss.
Hey @galenlynch, I wanted to check in on this PR. Have you had a chance to see if it can be simplified? I don't want this to languish too long and be difficult to merge in, or block other things from merging, so if you don't think you'll be able to revamp it within a week or two, maybe it's best to merge it as-is (with whatever rebasing is necessary to get it on current master) and fix it up later? Functionally I think it's a strict improvement over what we have now, and I don't want this great work to go to waste. I think the comments you added and the discussion in this thread should hopefully be enough to pick it up again. Thoughts from other maintainers?
I haven't had a chance, sadly, even though I agree with you that it should be possible to simplify it. The next week or two would definitely be impossible. If you're OK merging as-is and then maybe simplifying it later, then I agree that, given the circumstances, that would be best. I'll rebase it.
@ssfrr sorry for the delay. I tried to squeeze this in before my defense, but the merge became non-trivial so I gave up. Now that I defended, I should finally have the time to get this merged!
No problem. 🎉 🛡️ ❗
I have added an overlap-save algorithm to convolution, which is substantially faster (~2x) for large convolution problems. It is also substantially faster than other packages like ImageFiltering.jl. For smaller convolutions, it is about the same or a little bit slower than standard fft convolution. The cutoff between overlap-save and simple fft convolution is pretty arbitrary at the moment, so it would be nice to add some benchmarking to allow DSP.jl to choose between the two algorithms depending on the size of the input. It would also be nice to add spatial/time domain filtering for small convolutions, as it would almost certainly be faster than fft convolutions. This is relevant to JuliaDSP#224.
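As a conceptual illustration of the overlap-save idea this PR implements, here is a minimal 1-D sketch in plain Julia. To keep it self-contained, the per-block circular convolution is done directly in the time domain; the PR instead performs that same step as FFT, pointwise multiply, inverse FFT, which is where the speedup comes from. The names and structure here are illustrative, not the PR's internals.

```julia
# Direct circular convolution of two equal-length vectors (stands in for
# fft -> pointwise multiply -> ifft in this sketch).
function circconv(a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    [sum(a[mod1(i - j + 1, n)] * b[j] for j in 1:n) for i in 1:n]
end

# Overlap-save: each block reuses nv-1 samples of history from the previous
# block, and the first nv-1 samples of each circular convolution are aliased
# and discarded, so interior blocks need no zero-padding.
function conv_os(u::Vector{Float64}, v::Vector{Float64}, nfft::Int)
    nv = length(v)
    L = nfft - nv + 1                       # valid samples produced per block
    nout = length(u) + nv - 1
    out = zeros(nout)
    vpad = [v; zeros(nfft - nv)]
    upad = [zeros(nv - 1); u; zeros(nfft)]  # nv-1 samples of zero "history"
    pos = 1
    while pos <= nout
        y = circconv(upad[pos:pos + nfft - 1], vpad)
        nsave = min(L, nout - pos + 1)
        out[pos:pos + nsave - 1] = y[nv:nv + nsave - 1]  # drop aliased samples
        pos += L
    end
    out
end

conv_os([1.0, 2.0, 3.0], [1.0, 1.0], 4)  # [1.0, 3.0, 5.0, 3.0], the full linear convolution
```

The result is independent of the block size nfft (as long as nfft >= length(v)), which is what lets the real implementation tune the FFT size purely for speed.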
Force-pushed from 817f44d to c3723ff
I finally got around to rebasing this. All tests are passing, but it's been a minute since I last thought about this stuff in depth, so I think I want to think it over before going ahead with the merge. The rebase required a bit of tweaking of the zeropad functions to allow arbitrary indexing while also being able to specify the source and destination regions for the padded data. I fully intend to clean up this PR by possibly eliminating the separate functions for the edges and the center of the convolution, which would also get rid of the huge number of parameters that are passed between these functions. Both @martinholters and @ssfrr have raised concerns about the readability of the code as it stands, so I think this clean-up would be quite helpful. However, as Spencer said, this PR is a big performance improvement, so it might be nice to just have these changes in the code base in their current functioning, if inelegant, form, and I can work on cleaning it up in a separate PR.
Awesome, glad you're back on the case. Regarding the arbitrary indexing stuff - I wonder if there's an approach where we use regular Arrays internally and then re-apply the offset at the end of the process? It seems like maintaining all the offset stuff creates a lot of complexity but isn't actually relevant for the convolution itself. Just a thought.
That's a possibility, but I think it would just move the complexity around. Right now, only
OK, I think it's ready to merge if you're happy with it.
src/dspbase.jl (Outdated)

```julia
nfft = pow2 > prev_pow2 ? 2 ^ prev_pow2 : 2 ^ (pow2 - 1)

# L is the number of usable samples produced by each block
L = nfft - nb + 1
```
Doesn't this condition just come down to nfft > nfull? Since L doesn't seem to be used anywhere else, you could save a small computation and just write if nfft > nfull for the next line.
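The algebra behind this equivalence, assuming L = nfft - nb + 1 and the full convolution length nfull = nx + nb - 1 as in the surrounding code: L > nx rearranges to nfft > nx + nb - 1 = nfull. A quick numeric check with illustrative sizes:

```julia
# Check that (L > nx) and (nfft > nfull) always agree.
nx, nb = 100, 17         # illustrative signal and filter lengths
nfull = nx + nb - 1      # full convolution length
for nfft in (64, 128, 256)
    L = nfft - nb + 1    # usable samples per block
    @assert (L > nx) == (nfft > nfull)
end
```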
You're right! Nice catch. I made a commit that implements your suggestion.
This commit simplifies the correction of using a larger fft than necessary while doing convolution, as per @HDictus' suggestion. This can occur in `optimalfftfiltlength`, which searches `nfft` over powers of two to minimize the total complexity of the convolution. However, when `nfft` is larger than the full convolution, it makes sense to consider other fast powers. This condition was previously detected based on the output of an fft convolution being larger than the input. However, in this same condition, `nfft` is also larger than the full convolution size, `nfull`. Because `nfft` and `nfull` are already computed, it is faster to not calculate the output size to detect this condition.
@ssfrr good to merge?
ssfrr
left a comment
Other than some of the unresolved issues from last time around I think this is good to go.
src/dspbase.jl
Outdated
```julia
    u::AbstractVector,
    data_dest::Tuple = (1,),
    data_region = CartesianIndices(u),
    ::Tuple{Base.OneTo{Int}})
```
I see this argument was added to the docstring, but it doesn't seem to be actually present in the methods.
Force-pushed from 8505260 to f0aec0c
Some old comments by @ssfrr had not been addressed, mostly relating to documentation matching the code. One more substantial comment was making the overlap-save algorithm work in situations where `nffts < 2 * sv - 1`, in which case more than the first "block" will require padding. I have addressed these comments with this commit. Interestingly, this will also eliminate situations where no blocks require padding (e.g. if `nv = 1` in a dimension). In addition, I have changed some variable names in `optimalfftfiltlength` for clarity, and also changed some tests which were using the old interface, which used to have `nb` and `nx` reversed.
Force-pushed from f0aec0c to 8933ec3
🎉 Thanks for putting up with my sluggish response times and sticking with this!
Thanks for putting in all the time to read my confusing code! The end result is much better thanks to your comments and holding my feet to the fire.




I have added an overlap-save algorithm to convolution, which is substantially faster (~2x) for large convolution problems. It is also substantially faster than other packages like ImageFiltering.jl. For smaller convolutions, it is about the same or a little bit slower than standard fft convolution. The cutoff between overlap-save and simple fft convolution is pretty arbitrary at the moment, so it would be nice to add some benchmarking to allow DSP.jl to choose between the two algorithms depending on the size of the input. It would also be nice to add spatial/time domain filtering for small convolutions, as it would almost certainly be faster than fft convolutions. This is relevant to #224.