Conversation

@galenlynch
Member

I have added an overlap-save algorithm to convolution, which is substantially faster (~2x) for large convolution problems. It is also substantially faster than other packages like ImageFiltering.jl. For smaller convolutions, it is about the same or a little bit slower than standard fft convolution. The cutoff between overlap-save and simple fft convolution is pretty arbitrary at the moment, so it would be nice to add some benchmarking to allow DSP.jl to choose between the two algorithms depending on the size of the input. It would also be nice to add spatial/time domain filtering for small convolutions, as it would almost certainly be faster than fft convolutions. This is relevant to #224.
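For anyone skimming this thread later, the core of overlap-save can be sketched in 1-D. This is purely illustrative (the PR's implementation is N-dimensional, uses planned FFTs, and reuses buffers); `circconv` below is a naive stand-in for the FFT → pointwise multiply → IFFT step, and all names are hypothetical:

```julia
# 1-D overlap-save sketch (illustrative only — not the PR's implementation).
# `circconv` stands in for the FFT -> pointwise multiply -> IFFT step.
function circconv(x::Vector{Float64}, h::Vector{Float64})
    n = length(x)
    y = zeros(n)
    for i in 1:n, j in eachindex(h)
        y[i] += h[j] * x[mod1(i - j + 1, n)]
    end
    y
end

function overlap_save(u::Vector{Float64}, v::Vector{Float64}, nfft::Int)
    nv = length(v)
    L = nfft - nv + 1                  # usable samples produced per block
    nout = length(u) + nv - 1
    out = zeros(nout)
    # prepend nv - 1 zeros of overlap, pad the tail so every block is full
    padded = [zeros(nv - 1); u; zeros(nfft)]
    pos = 1
    while pos <= nout
        block = padded[pos:pos + nfft - 1]
        y = circconv(block, v)         # first nv - 1 samples are wrapped junk
        nsave = min(L, nout - pos + 1)
        out[pos:pos + nsave - 1] = y[nv:nv + nsave - 1]
        pos += L
    end
    out
end
```

Each block computes an `nfft`-point circular convolution, but only the last `L = nfft - nv + 1` samples are correct linear-convolution output; discarding the wrapped samples is what lets the interior blocks skip zero-padding entirely.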

@galenlynch galenlynch marked this pull request as ready for review April 17, 2019 12:44
@galenlynch galenlynch requested review from martinholters and ssfrr and removed request for martinholters and ssfrr April 17, 2019 12:45
@galenlynch
Member Author

I think this still has bugs in it. Let me fix it up before people spend too much time on it.

src/dspbase.jl Outdated
out_start,
out_stop,
u_start
)
Member

Could this be made a closure to avoid passing a gazillion parameters? (Of course, not doing dynamic dispatch here and just inlining that function would be even better, but I guess the type instability would be pretty bad for performance and so would be having these_blocks be a vector?)

Member Author

@galenlynch galenlynch Apr 18, 2019

Yeah, the problem is Val{n_edges}(), which is obviously not type stable. Using simple integers instead of Val types causes the code generated by subsets to be less than great, and more importantly the output of subsets, edge_dims, would no longer be a Tuple, which leads to type instability. I could play around with it a bit more to see if I can find a better solution. I hadn't thought of using a closure. In my previous experience with closures, they end up forcing the compiler to do a lot of boxing and unboxing of captured variables. I'll try that.

Member Author

You're also right about these_blocks: it's a big performance win for it to be a Tuple, because then the output of the subsequent ProductIterator is type stable.

Member

Usually variables that are only assigned to once should be safe to capture in a closure.

Member Author

I tried a closure, passing the variables that get modified as arguments, but it still substantially changed the output of @code_warntype, and had more type instability for some reason.

Contributor

sometimes if the compiler loses track of closed-over variables you can help it out by wrapping the closure in a let block so you're limiting the scope of when that variable could be reassigned. Alternatively there are sometimes places where some type-annotations can help. There's more info and an example in this comment. Between these two workarounds it's generally pretty doable to make the closures fast.
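For later readers, a minimal illustration of the `let` trick (hypothetical code, not from this PR):

```julia
# `total` is assigned more than once, so a closure capturing it directly gets
# a Box (the compiler can't assume its type stays fixed after capture).
# Rebinding it to a single-assignment local inside `let` avoids the Box.
function shifted_sum(xs)
    total = 0.0
    for x in xs
        total += x
    end
    f = let t = total        # `t` is assigned exactly once
        x -> x + t
    end
    sum(f, xs)
end
```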

Member Author

I guess I also don't like closures with a bunch of closed-over variables from a stylistic perspective. For me, it's hard to spot the closed-over variables, hard to separate out the closure to figure out what's going on with it, and hard to use things like @code_warntype on closures. Basically, I'd rather have a huge parameter list than a huge set of captured variables.

@galenlynch
Member Author

[plot: OS_performance]

@galenlynch
Member Author

[plot: os_grid_rt]

@galenlynch
Member Author

[plot: os_grid_mem]

@galenlynch
Member Author

galenlynch commented Apr 19, 2019

Most of the cases where overlap-save performance is worse than FFT performance would be correctly handled by the heuristic in _conv_fft!.

@ssfrr
Contributor

ssfrr commented Apr 19, 2019

What is N in those plots?

@galenlynch
Member Author

N is the number of dimensions (vector, matrix, 3-d array, 4-d array)

@ssfrr
Contributor

ssfrr commented Apr 19, 2019

Ah, thanks. These plots are super useful! Thanks for making them. Would it be a pain to generate the same plot but color each grid space red or blue based on the heuristic in _conv_fft!, so we can see how the heuristic choice relates to the memory/runtime performance?

@galenlynch
Member Author

galenlynch commented Apr 19, 2019

[plot: os_grid_heuristic]

Blue is when the heuristic would pick overlap-save, orange is when it would pick fft convolution.

@galenlynch
Member Author

It's not perfect, but it does a pretty good job.

@ssfrr
Contributor

ssfrr commented Apr 19, 2019

Looks pretty good to me. That's awesome, thanks for the visualization. Is the PR ready for review?

@galenlynch
Member Author

I think so? The style is a little sloppy, but it doesn't have any bugs.

Contributor

@ssfrr ssfrr left a comment

Very nice (and I learned some things about ND-FFTs and FFTW along the way). Most of my comments are nitpicky things; some are things that caused a little confusion while reviewing. I think the main thing I'd like to see is more tests.

As a general note - the process of chopping up into blocks, handling the edges, iterating through the dimensions, etc. is all pretty complicated - do you think you could add some more docstrings and comments to the various internal functions that describe what's going on throughout the process, what each function is responsible for, and what assumptions they make about the input? It doesn't need to go into a huge amount of detail, but some signposts would be helpful.

@galenlynch
Member Author

Whoops, found a bug

@galenlynch galenlynch force-pushed the overlap_save_conv branch from cb664e9 to 710cde3 Compare May 3, 2019 13:10
@galenlynch
Member Author

@ssfrr I had to rebase, and in the process I made some of your comments "outdated" when I didn't really address them. It seems like the two outstanding issues are possibly adding more tests, and possibly making the edge function a closure instead of a standalone function. This is also what @martinholters wanted.

If you can think of any more test cases, I'd be happy to add them.

As for the closure, I think that replacing function parameters with captured variables just obscures the actual complexity, and it makes it harder for the compiler to generate code of the same quality. I played around a little with let blocks, but couldn't make it work. To be honest I didn't try that hard to make the code work as a closure, because of my personal preference for being explicit about what is being shared between the two functions, so it may be possible with some more effort. Would you guys be more satisfied with a standalone function if I did a better job of annotating the parameters? Happy to discuss if you guys think there is a reason to use a closure.

@ssfrr
Contributor

ssfrr commented May 3, 2019

I can go through my comment threads and see which still apply. I already marked a few as resolved. AFAICT the testing seems to be in pretty good shape at this point, but I want to do a little playing around with the PR to see if I can turn anything up.

I'm wondering if the closure issue is a sign that we should be looking for a better way to refactor that logic. Maybe it's just me but I find the edge finding and nested iteration and set manipulation pretty hard to follow. Also it means that a lot of the work of figuring out how to apply the block convolution is happening in the outer function, which contributes to the large number of parameters that need to be passed to the inner function. Also the block convolution process is somewhat duplicated for the edge blocks and center blocks.

Do you think it would work to simplify the outer loop to be something like:

function _conv_kern_os!(out,
                        u::AbstractArray{<:Any, N},
                        v,
                        su,
                        sv,
                        sout,
                        nffts) where N

    # Pre-allocation
    tdbuff, fdbuff, fftplan, ifftplan = os_prepare_conv(u, nffts)

    # Transform the smaller filter
    _zeropad!(tdbuff, v)
    filter_fd = os_filter_transform!(tdbuff, fftplan)
    filter_fd .*= 1 / prod(nffts) # Normalize once for brfft

    # nffts gives us the size of the FFT we'll perform, but only part of it
    # gets saved to the output array
    save_blocksize = min.(sout, nffts .- sv .+ 1)
    nblocks = cld.(sout, save_blocksize)
    for block_idx in CartesianIndices(nblocks)
        _conv_kern_os_block!(out, u, filter_fd, tdbuff, fdbuff,
                             fftplan, ifftplan,
                             Tuple(block_idx), save_blocksize, nffts)
    end

    out
end

and then have _conv_kern_os_block! be responsible for figuring out whether it needs to zero-pad or copy? I haven't totally thought through that part but it seems like it could first zero-out the block, then find the corners of the valid data region and use copyto! to copy the right data in. That would avoid needing to use Val{N} anywhere but still be type-stable.

I have a start on what _conv_kern_os_block! could look like but it'll take some more fiddling to get it right. Does this seem like a worthwhile direction to you?

@galenlynch
Member Author

Yeah the logic is pretty gnarled. I was playing around with it a bit the other day to try to simplify it, and I think it should be possible to do so. One thing that could be made more compact is the calculation of edge_range, which is the main product of all the set iteration and manipulation, and also suffers from a tuple -> mutable array -> tuple design at the moment. I tried creating the tuple directly using ntuple but the closure really messed up type stability, and a let block didn't help. At the very least I could be doing more signposting than I am to clarify the logic.

Another reason for the large set of parameters is to re-use as many things as possible, for performance reasons. As you can no doubt tell, when I was writing the _conv_kern_os! function the edge function was just the loop body.

The block convolution logic is definitely duplicated between the edge and center blocks. The inner loop of the edge method could be used on center blocks as is, in fact, because it is strictly more general. However I wanted to separate the two because the logic in the center blocks is dead simple, and requires no zero padding. As far as I can tell, one of the big benefits of using overlap-save vs overlap-add is that you don't need to zero-pad for the majority of the blocks. The way that zero padding works at the moment means that there is no performance savings when you call _zeropad! but don't end up with any padding. I think it's a big performance win to use the simpler logic and zero padding on the center blocks, and would prefer to keep the reduced set of logic.

@ssfrr
Contributor

ssfrr commented May 3, 2019

it seems like we could combine them but keep the performance gains by having something like:

out_region = ... # figure out where to copy the data to
if axes(out_region) != axes(out)
    fill!(out, zero(eltype(out)))
end

# copy the data

So it would only do the zero-fill if some parts of the output weren't getting filled with data. (Using axes here means it's only checking that the ranges are equal, which should be super cheap. Comparing the CartesianIndices directly currently seems to fallback to the generic AbstractArray implementation, which checks all the elements.)
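Spelled out, that pattern might look something like this sketch (names hypothetical; `dest.indices` and `axes(buf)` are both tuples of ranges, so the `!=` is just a few integer comparisons):

```julia
# Zero the buffer only when the incoming data won't cover all of it, then
# copy the data into its destination region. Illustrative sketch only.
function load_block!(buf::Array, data::AbstractArray, dest::CartesianIndices)
    if dest.indices != axes(buf)   # cheap range comparison, no element loop
        fill!(buf, zero(eltype(buf)))
    end
    copyto!(buf, dest, data, CartesianIndices(data))
    buf
end
```

When the block is fully covered by data (`dest` spans all of `buf`), the `fill!` is skipped entirely, preserving the center-block fast path.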

@galenlynch
Member Author

Maybe... I'll think about how to do that. It seems like at the end of the day it's going to be a performance question. I would like to measure how much performance is lost by doing all the extra logic and putting a branch in the inner loop.

We should also keep a few things separate: 1) closure vs standalone function; 2) the number of inputs to that function, whether as sneaky closed-over variables or as explicit parameters; 3) using a separate, simplified loop for the blocks that are provably simpler to deal with, or one loop capable of handling everything. If we decide to use one loop, then there's no reason to have it be in a separate function. As it is, most of the parameters being passed are constants (see my most recent commit for some more annotation).

I might have to put this on the back burner for a while; I don't think I have time to do a significant overhaul at the moment.

@galenlynch
Member Author

Gentle bump.

@galenlynch
Member Author

Bump.

Contributor

@ssfrr ssfrr left a comment

Sorry for the extended delay. The added comments are really helpful in tracing through what's going on, thanks for adding that.

As a performance anecdote, for a pretty common audio use case (convolving a long 1D array with a much shorter one) this PR takes 25% of the time and allocates 25% as much memory. Nice!

Over the past week or so I've been getting some time in tracing through this PR. As far as I can tell the basic approach can be summarized in pseudocode as:

for block_idx in edge_blocks
    conv_edge_block!(out, u, v, block_idx)
end

for block_idx in center_blocks
    conv_center_block!(out, u, v, block_idx)
end

where conv_center_block can be simpler and faster because it doesn't need any padding. The majority of the complexity here is figuring out the set of blocks in edge_blocks.

I think the code would be simpler and easier to follow as:

for block_idx in all_blocks
    if block_needs_padding(out, u, v, block_idx)
        conv_edge_block!(out, u, v, block_idx)
    else
        conv_center_block!(out, u, v, block_idx)
    end
end

Which cuts out all the dimension juggling to iterate through the edge blocks. In your comment here you mention that for small convolution kernels there are orders of magnitude more center blocks than edge blocks, so it might be worth avoiding calling block_needs_padding on each block. I'd counter though that block_needs_padding should just be a few comparisons, which seems like it should be negligible compared to convolving the block. In your example of conv(rand(1000,1000), rand(3,3)) each center block does an 8x8 FFT, 40 complex multiplications, and an 8x8 IFFT, and copies 36 elements to the output.
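For concreteness, such a check might look like this (the index math and names are my own sketch, not the PR's internals):

```julia
# A block needs padding iff the nfft-sized input window it reads extends
# outside the data in `u`. Everything here is a handful of integer ops.
function block_needs_padding(su::NTuple{N,Int}, sv::NTuple{N,Int},
                             nffts::NTuple{N,Int},
                             block_idx::NTuple{N,Int}) where N
    save_blocksize = nffts .- sv .+ 1
    # first input sample the block reads, in u's coordinates
    in_start = (block_idx .- 1) .* save_blocksize .- sv .+ 2
    in_stop = in_start .+ nffts .- 1
    any(in_start .< 1) || any(in_stop .> su)
end
```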

If you're convinced this is the best way forward I don't want to block this PR. It's a good improvement and I haven't found any bugs in my review and a little testing. I'm just a little worried about maintenance and overall complexity.

ideal_save_blocksize = nffts .- sv .+ 1
# Number of samples that are "missing" if the valid portion of the
# convolution is smaller than the output
sout_deficit = max.(0, ideal_save_blocksize .- sout)
Contributor

Is the comment correct here? sout_deficit will only be nonzero when the block is larger than the output, not smaller. So it seems like this is the number of extra samples (past the output) that are present in the convolution block. Presumably this would only happen if there's only a single block in at least one dimension.

Member Author

Good catch, it should read "if the output is smaller than the valid portion..."

Contributor

I think this comment still needs fixing, right?

Member Author

Done

src/dspbase.jl Outdated

last_full_blocks = fld.(su, save_blocksize)
# block indices for center blocks, which need no padding
center_block_ranges = UnitRange.(2, last_full_blocks)
Contributor

Are we guaranteed that all(save_blocksize .>= sv)? You can think of save_blocksize as being the hop size of the block processing, and if it hops less than sv then there might be multiple blocks in the beginning that need to be zero-padded when copying from u.

It might be safer to have something like

first_center_blocks = fld.((sv .- 1), save_blocksize) .+ 2
center_block_ranges = UnitRange.(first_center_blocks, last_full_blocks)

(I think I have that condition right, but it could use a double-check)
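A quick numeric sanity check of that condition (helper name hypothetical):

```julia
# With hop save_blocksize >= sv only block 1 overlaps the leading zero
# padding, so center blocks start at 2; with a smaller hop, several leading
# blocks need padding and the first center block moves back accordingly.
first_center_block(sv::Int, save_blocksize::Int) = fld(sv - 1, save_blocksize) + 2
```

`first_center_block(3, 8)` gives 2 (the usual case), while `first_center_block(9, 4)` gives 4, i.e. blocks 1–3 would all need padding.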

Contributor

is this resolved?

src/dspbase.jl Outdated
u::AbstractArray{<:Any, N},
data_dest::Tuple = ntuple(::Integer -> 1, N),
data_region = CartesianIndices(u)) where N
# Copy the data to the beginning of the padded array
Contributor

This comment is no longer accurate - the data isn't necessarily copied to the beginning

@galenlynch
Member Author

Thanks for taking another look at this, and sorry that I've been so insistent. I'll test out the simplified logic that you've been suggesting. You're definitely right that this PR is pretty complicated, and I can see why it's hard to follow the logic that calculates the edge set.

@ssfrr
Contributor

ssfrr commented Jun 11, 2019

sorry that I've been so insistent

no need to apologize. This squeaky wheel deserved the grease. 🙂

@btmit

btmit commented Jun 25, 2019

Quick vote of support for this PR.

I was wondering what you folks thought the best way to apply this to a streaming application would be?

Specifically, one where you only need to prepare the buffers, FFT plans, etc. once, and then input blocks continuously without a priori knowledge of how many blocks will be received.

@ssfrr
Contributor

ssfrr commented Jun 25, 2019

I was wondering what you folks thought the best way to apply this to a streaming application would be?

Yeah, that would be awesome. A lot of the necessary DSP infrastructure is here, but it would take some more design around how to represent the incoming/outgoing streams, as well as some engineering around handling the start/end conditions. I've got a lot of thoughts but it's a bit off-topic for this PR - feel free to open an issue referencing this if you want to discuss.

@ssfrr
Contributor

ssfrr commented Aug 15, 2019

Hey @galenlynch, I wanted to check in on this PR. Have you had a chance to see if can be simplified?

I don't want this to languish too long and be difficult to merge in, or block other things from merging, so if you don't think you'll be able to revamp it within a week or two maybe it's best to merge it as-is (with whatever rebasing is necessary to get it on current master) and fix it up later? Functionally I think it's a strict improvement over what we have now and I don't want this great work to go to waste. I think the comments you added and the discussion in this thread should hopefully be enough to pick it up again.

Thoughts from other maintainers?

@galenlynch
Member Author

I haven't had a chance, sadly, even though I agree with you that it should be possible to simplify it. The next week or two would definitely be impossible. If you're OK merging as-is and maybe simplifying it later, then I agree that, given the circumstances, that would be best. I'll rebase it.

@galenlynch
Member Author

@ssfrr sorry for the delay. I tried to squeeze this in before my defense, but the merge became non-trivial so I gave up. Now that I defended, I should finally have the time to get this merged!

@ssfrr
Contributor

ssfrr commented Oct 1, 2019

No problem. 🎉 🛡️ ❗

@galenlynch
Member Author

galenlynch commented Nov 15, 2019

I finally got around to rebasing this. All tests are passing, but it's been a minute since I last thought about this stuff in depth, so I think I want to think it over before going ahead with the merge.

The rebase required a bit of tweaking of the zeropad functions to allow arbitrary indexing while also being able to specify the source and destination regions for the padded data.

I fully intend to clean up this PR by possibly eliminating the separate functions for the edges and the center of the convolution, which would also get rid of the huge number of parameters that are passed between these functions. Both @martinholters and @ssfrr have raised concerns about the readability of the code as it stands, so I think this clean-up would be quite helpful. However, as Spencer said, this PR is a big performance improvement, so it might be nice to just have these changes in the code base in their current functioning, if inelegant, form, and I can work on cleaning it up in a separate PR.

@ssfrr
Contributor

ssfrr commented Nov 15, 2019

Awesome, glad you're back on the case. Regarding the arbitrary indexing stuff - I wonder if there's an approach where we use regular Arrays internally and then re-apply the offset at the end of the process? It seems like maintaining all the offset stuff creates a lot of complexity but isn't actually relevant for the convolution itself. Just a thought.

@galenlynch
Member Author

That's a possibility, but I think it would just move the complexity around. Right now, only _zeropad methods have to be aware of arbitrary indexing, and they do so with a single method for both 1-indexed arrays and arrays with arbitrary offsets. It just looks at the axes of every array to determine the first index, instead of assuming it's one. The conv methods are otherwise pretty agnostic about indexing. We could add another method that copies the data, but that would be more memory intensive and possibly slower. Also, having zeropad and similar functions handle arbitrary indexing is the approach that we took in #279, so we would have to change those functions as well if we were to approach the problem by copying results into the desired array format at the end.

@galenlynch
Member Author

OK I think it's ready to merge if you're happy with it.

src/dspbase.jl Outdated
nfft = pow2 > prev_pow2 ? 2 ^ prev_pow2 : 2 ^ (pow2 - 1)

# L is the number of usable samples produced by each block
L = nfft - nb + 1
Contributor

Doesn't this condition just come down to nfft > nfull? Since L doesn't seem to be used anywhere else, you could save a small computation and just write if nfft > nfull for the next line.

Member Author

You're right! Nice catch. I made a commit that implements your suggestion.

This commit simplifies the correction of using a larger fft than necessary while
doing convolution, as per @HDictus' suggestion. This can occur in
`optimalfftfiltlength`, which searches `nfft` over powers of two to minimize
the total complexity of the convolution. However, when `nfft` is larger than the
full convolution, it makes sense to consider other fast powers. This condition
was previously detected based on the output of an FFT convolution being larger
than the input. However, in this same condition, `nfft` is also larger than the
full convolution size, `nfull`. Because `nfft` and `nfull` are already computed,
it is faster to not calculate the output size to detect this condition.
@galenlynch
Member Author

@ssfrr good to merge?

Contributor

@ssfrr ssfrr left a comment

Other than some of the unresolved issues from last time around I think this is good to go.

ideal_save_blocksize = nffts .- sv .+ 1
# Number of samples that are "missing" if the valid portion of the
# convolution is smaller than the output
sout_deficit = max.(0, ideal_save_blocksize .- sout)
Contributor

I think this comment still needs fixing, right?

src/dspbase.jl Outdated

last_full_blocks = fld.(su, save_blocksize)
# block indices for center blocks, which need no padding
center_block_ranges = UnitRange.(2, last_full_blocks)
Contributor

is this resolved?

src/dspbase.jl Outdated
u::AbstractVector,
data_dest::Tuple = (1,),
data_region = CartesianIndices(u),
::Tuple{Base.OneTo{Int}})
Contributor

I see this argument was added to the docstring, but it doesn't seem to be actually present in the methods.

Some old comments by @ssfrr had not been addressed, mostly relating to
documentation matching the code. One more substantial comment was making the
overlap-save algorithm work in situations where `nffts < 2 * sv - 1`, in which
case more than the first "block" will require padding. I have addressed these
comments with this commit. Interestingly, this will also eliminate situations
where no blocks require padding (e.g. if `nv = 1` in a dimension).

In addition, I have changed some variable names in `optimalfftfiltlength` for
clarity, and also changed some tests which were using the old interface, which
used to have `nb` and `nx` reversed.
@galenlynch galenlynch merged commit f6c051d into JuliaDSP:master Jan 3, 2020
@ssfrr
Contributor

ssfrr commented Jan 3, 2020

🎉 Thanks for putting up with my sluggish response times and sticking with this!

@galenlynch
Member Author

Thanks for putting in all the time to read my confusing code! The end result is much better thanks to your comments and holding my feet to the fire.
