Implementation of conv() Uses Overlap and Save While Direct Method Is Faster #355
Comments
See this similar plot (but for longer filters) that @galenlynch made, which compares regular FFT processing to overlap-save. I think the thing to do is to add another check for short convolutions that uses the direct method for those. @RoyiAvital, do you have the code for that plot? I'm curious about the weird diagonal lines where the overlap-save is faster. Did you use [...]?
Much of this is a repeat of what I said on Discourse, but I think the tradeoff between direct and FFT or overlap-save convolution is not a question of whether everything is in memory or not, but rather of the kernel size. FFT convolution has an algorithmic advantage over direct convolution (N log N vs. N^2), so you should only choose direct convolution when the overhead outweighs the complexity savings. Those savings only increase with overlap-save compared to plain FFT convolution. In my testing, the break-even point between FFT convolution and direct convolution happens when the kernel is length 58. You are welcome to compare for yourself, since direct convolution is already implemented in DSP.jl as part of the FIR filtering code. As I mentioned on Discourse, it might be easy to reuse that direct convolution code for short kernels. Could you please compare your times in Matlab to the DSP.jl times?
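(For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of one way to do it in Julia. It times DSP.jl's `filt`, whose FIR path uses a direct loop, against `conv`; note that `filt` returns `length(x)` samples rather than the full `length(x) + length(h) - 1`, so the comparison is only indicative, and the specific lengths below are arbitrary.)

```julia
using DSP, BenchmarkTools

# Rough sketch: time DSP.jl's direct FIR path (filt) against conv for a
# range of kernel lengths, to locate the break-even point on this machine.
x = rand(Float64, 10_000)
for m in (8, 16, 32, 64, 128, 256)
    h = rand(Float64, m)
    t_direct = @belapsed filt($h, [1.0], $x)   # direct FIR loop, length(x) outputs
    t_fourier = @belapsed conv($x, $h)         # FFT / overlap-save, full convolution
    println("kernel length $m: filt = $(round(t_direct * 1e3, digits=3)) ms, ",
            "conv = $(round(t_fourier * 1e3, digits=3)) ms")
end
```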
I think you misinterpret my writing. I will try to summarize, but I don't think there is a point in keeping the debate going, as the image I pasted above shows real-world results.
I showed my code to @ssfrr. I also ran my analysis on bigger signals to get a broader perspective. I also learned from a talk with @ssfrr that your implementation of frequency-domain convolution pads to the nearest product of small primes before the FFT. Again, I think you should look at the results I posted; I think they speak for themselves. The code is also available and you can use it for further analysis. Anyhow, just my 2 cents.
I understand and agree with your point that direct convolution will be faster than Fourier methods for kernel sizes between 0 and some number. I don't know where that trade-off point is, and I think it's worth keeping in mind how much run-time will actually be saved. That being said, I'm all for a direct convolution branch in conv.

In those plots, you're comparing the runtime of Matlab's compiled conv against your own Matlab functions. To generate your test times, I ran your code on my computer. However, as you already pointed out, 2,020 is a small number of datapoints for many signals, though it's maybe a relevant size for images. Two seconds of audio data (sampled at 44.1 kS/s) is around 100,000 samples long, so let's see what happens in this regime. Here I'm showing results generated the same way, but for a 100,000-length signal and kernels with lengths between 4 and 2,020. You can see that for almost all of these kernels, the DSP.jl conv is faster.

With all of these tests, we can see that implementation does matter, and that the overlap-save algorithm seems to fare well, in agreement with many sources. Certainly, in some cases a direct convolution algorithm will be faster, though I think that calculating when the increased efficiency of overlap-save will not make up for the extra overhead will be difficult, since it depends on both the kernel size and the signal size. I am very hesitant to draw any conclusions about cache-friendliness or thread-friendliness of the underlying algorithms from your Matlab scripts, or indeed from DSP.jl's implementation.

I don't know why you think it's better not to pad to the closest power of 2, 3, or 5 prior to Fourier transformation, which is in contrast to FFTW's recommendations, but if you have some actual benchmarks using Julia code then a PR would be welcome. Once again, if you or others are motivated to implement a direct convolution method, that would be great. However, I haven't been motivated to chase those extra microseconds, given how hard I think it will be to make a direct convolution method that generalizes to arbitrary dimensions and array indexing.
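(A quick way to see the effect of padding to a product of small primes before the FFT, using only Base's `nextprod` and FFTW. The lengths here are arbitrary and this is only a sketch, not DSP.jl's code.)

```julia
using FFTW, BenchmarkTools

# Compare an FFT at an awkward (prime) length with one padded up to a
# nearby length that factors into small primes, as FFTW recommends.
n = 10_007                          # prime length, a slow case for FFTW
nfast = nextprod([2, 3, 5, 7], n)   # smallest length >= n with only small prime factors
x = rand(Float64, n)
xpad = vcat(x, zeros(nfast - n))

@btime fft($x)      # transform at the prime length
@btime fft($xpad)   # transform at the padded, "fast" length
```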
@RoyiAvital I'm confused by something on your plots - I'd expect your first plot (up to length 2000) to be the top-left corner of your next plot (up to length 18000), but they're showing different results. The first shows that direct is faster up to kernel lengths of about 500, but the second shows direct being faster up to about 2000 - was there something else different about those two runs?

On Complexity Analysis

Say you have a length-N signal and a length-M kernel. Direct convolution is O(N*M), FFT convolution is O((N+M) log(N+M)), and overlap-save is roughly O(N log M), since it works on blocks whose size is set by the kernel length.

For a fixed kernel length M, both direct convolution and overlap-save scale linearly with N, while plain FFT convolution picks up a log(N+M) factor, so the choice between direct and overlap-save comes down to the constant factors and to whether M is large enough for the per-block FFTs to beat the O(N*M) inner loop.
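(To make that concrete, here is a back-of-the-envelope cost model in Julia. The constant in front of the FFT-based costs is a guess and the block-size rule is assumed, so the numbers only show the shape of the trade-off; they are not a substitute for real benchmarks.)

```julia
# Rough, unit-less operation-count models for the three approaches.
direct_cost(n, m) = float(n) * m
fft_cost(n, m) = (L = n + m - 1; 3 * L * log2(L))
ovs_cost(n, m; c = 8) = (B = c * m; nblocks = n / (B - m + 1); nblocks * 3 * B * log2(B))

n = 100_000
for m in (8, 32, 128, 512, 2048)
    println("m = $m: direct ≈ $(round(direct_cost(n, m), sigdigits=3)), ",
            "fft ≈ $(round(fft_cost(n, m), sigdigits=3)), ",
            "overlap-save ≈ $(round(ovs_cost(n, m), sigdigits=3))")
end
```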
@ssfrr, I think everyone present knows the theory of the asymptotic performance of the 3 methods. Regarding the cut-off point, I wouldn't do deep research on that. @galenlynch, regarding your comment that 2 [Sec] of audio is 100,000 samples [...]. Anyhow, I think you have all you need.
EDIT: I agree with what @ssfrr said about complexity. So I think it's a question of judging when the overhead of either overlap-save or FFT convolution is not worth it, and choosing the algorithm appropriately. This is already being done to choose between overlap-save and FFT convolution, and the benchmarks guiding that decision point came from two Julia functions. While we don't have a direct convolution branch yet, my testing with the pretty fast direct convolution that's part of the FIR filtering codebase (and has performance comparable to Matlab's) put the break-even point with FFT convolution at a kernel length of around 58, as I mentioned above.

I think I wasn't quite clear in my post with the benchmarks of DSP.jl and the Matlab comparisons made by @RoyiAvital: I don't buy the idea that your Matlab benchmarks tell us which algorithm to choose in which situation. First of all, your benchmarks are comparing a compiled function that is shipped with Matlab against two of your own functions written in the Matlab language, which is unfair. The direct convolution code in your benchmark is Matlab's compiled routine itself.

From what I can tell, direct convolution would indeed be faster for small filters, but only in the cases outlined above, and for marginal performance savings. In case you think it's a problem with our implementation: multirate stuff is also implemented in DSP.jl... let's just choose the fastest algorithm for each case. We already have implementations of all three algorithms; I don't know why you think we should copy your code.
@galenlynch, I have no idea why you wrote that I suggested you copy my code or anything like that. All I wrote is:
My MATLAB implementation of OVS is non-allocating and uses only the FFT (implemented by FFTW) and indexing, two operations MATLAB is quite efficient at (not as efficient as Julia, but as efficient as MATLAB can be). Hence it gives us the intuition. My conclusion from my MATLAB implementation, that the gains of OVS are negligible, might change with a much more efficient OVS implementation in Julia. But I'd still bet that the cases where OVS is superior to both the direct and DFT-based methods are the minority.
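(For readers who haven't seen overlap-save written out, here is a bare-bones, allocating Julia sketch of the FFT-plus-indexing structure being described. It is not DSP.jl's implementation and is unoptimized; the block-size choice of eight times the kernel length is arbitrary.)

```julia
using FFTW

# Minimal overlap-save linear convolution: process the padded signal in
# overlapping blocks of length B and keep only the uncorrupted samples of
# each block's circular convolution with the kernel.
function ovs_conv(x::Vector{Float64}, h::Vector{Float64}; B::Int = 8 * length(h))
    m = length(h)
    step = B - m + 1                        # new output samples per block
    H = rfft(vcat(h, zeros(B - m)))         # kernel spectrum, computed once
    nout = length(x) + m - 1                # full convolution length
    y = zeros(nout)
    xpad = vcat(zeros(m - 1), x, zeros(B))  # leading history + tail padding
    pos = 1
    while pos <= nout
        block = xpad[pos:pos + B - 1]
        yblk = irfft(rfft(block) .* H, B)   # circular convolution of one block
        nkeep = min(step, nout - pos + 1)
        y[pos:pos + nkeep - 1] .= yblk[m:m + nkeep - 1]
        pos += step
    end
    return y
end
```

Its output can be checked against `conv(x, h)` from DSP.jl, which should agree to floating-point round-off.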
This thread is getting a little unfocused. As I understand it, the central claim is that there's a regime where direct convolution is faster than FFT or overlap-save (OVS). I think we can all agree that's true, but someone needs to run the benchmarks to figure out where that threshold is. Improving the FFT and OVS implementations and API changes can be a separate issue (and I agree that there can be API improvements). I'm not sure if there's much more to discuss until we see those benchmarks, using the Julia implementations.
I agree with the central claim, and also that the API could be improved. Sorry if I got a bit aggressive in my response.
To add another data point to this discussion: I'm not across the technical details you are discussing here, but I was surprised that in practice the direct windowing in DynamicGrids.jl runs 3x3 kernels noticeably faster than conv (about 3x in the benchmark below). It's hard to benchmark precisely because of differences in initial allocations and setup - if DSP.jl had a pre-allocated option the comparison would be more direct:

```julia
using DynamicGrids,
LinearAlgebra,
StaticArrays,
BenchmarkTools,
DSP
s = 3; x = 1000
A = rand(Float32, x, x)
kernel = rand(Float32, s, s)
sa_kernel = SMatrix{s,s}(kernel)
o = ResultOutput(A; tspan=1:101) # 100 frames, the first tstep is init
rule = Convolution(DynamicGrids.Kernel(sa_kernel))
# DynamicGrids with StaticArray kernel
@btime sim!($o, $rule)
# 457.183 ms (1595 allocations: 7.81 MiB)
# DSP with regular Array kernel
@btime for i = 1:100 conv($A, $kernel); end
# 1.488 s (10700 allocations: 383.84 MiB)
# DSP with StaticArray kernel
@btime for i = 1:100 conv($A, $sa_kernel); end
# 1.632 s (11100 allocations: 383.86 MiB)
# Running the whole simulation setup for every frame - which
# allocates a lot and isn't type stable (it's not designed to run like
# this), DynamicGrids.jl is still 1.5x faster:
o = ResultOutput(A; tspan=1:2)
@btime for i in 1:100 sim!($o, $rule) end
# 1.080 s (11000 allocations: 768.13 MiB)
# Single step with preallocated padded arrays - so less allocation,
# but some simulation setup time added (assumes `simdata` was built earlier,
# e.g. reused from a previous sim! call):
@btime for i in 1:100 sim!($o, $rule; simdata=$simdata) end
# 826.144 ms (2300 allocations: 142.19 KiB)
```

You can probably optimise this further if you know you will only be using it for dot multiplication.
Yeah, I have a PR that makes conv in DSP.jl much faster for many use cases by adding direct convolution and also using multithreading. It's a bit stalled at the moment due to my own lack of time and my uncertainty over the design of the system that selects which convolution algorithm to use, since the number of algorithms is something like 10.
The current implementation uses, in many cases, the Overlap and Save method for linear convolution.
While this method is optimized for streamed data, it is not optimized for cases where all of the data is already in memory.
See https://discourse.julialang.org/t/convolution-conv-with-same-size-output/38260/11:
I think it is better to implement a direct loop with @simd and @inbounds. It will probably be faster than the current method for most cases. For longer kernels / signals it should use the frequency domain as it is implemented now.
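(For illustration, a minimal 1-D sketch of the kind of loop being suggested, using @inbounds and @simd. It assumes 1-based indexing and real-valued inputs, and does none of the dimension-generic handling or algorithm selection that DSP.jl's conv would need.)

```julia
# Direct "full" convolution: y has length(x) + length(h) - 1 samples.
# The inner loop over the signal is the SIMD-friendly one.
function direct_conv(x::AbstractVector{T}, h::AbstractVector{T}) where T<:Number
    y = zeros(T, length(x) + length(h) - 1)
    for j in eachindex(h)
        hj = h[j]
        @inbounds @simd for i in eachindex(x)
            y[i + j - 1] += hj * x[i]
        end
    end
    return y
end
```

On short kernels this kind of loop avoids all of the FFT planning and padding overhead, which is the regime the plots above are arguing about.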