Implement time-domain convolution and use it for integers #545

martinholters · 2024-02-23T07:59:41Z

This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.

In particular, for "small" u and v, time-domain convolution would be more efficient and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue:

julia> conv(zeros(0), zeros(0))
ERROR: ArgumentError: invalid Array dimensions

julia> conv(zeros(0), zeros(1))
ERROR: InexactError: trunc(Int64, -Inf)

julia> conv(zeros(1), zeros(0))
ERROR: InexactError: trunc(Int64, -Inf)

The time-domain convolution form this PR does the right thing:

julia> conv(zeros(Int, 0), zeros(Int, 0))
Int64[]

julia> conv(zeros(Int, 0), zeros(Int, 1))
Int64[]

julia> conv(zeros(Int, 1), zeros(Int, 0))
Int64[]

Another point worth considering is whether to use time-domain convolution for more input types. Consider

julia> conv(zeros(BigFloat, 1), zeros(BigFloat, 1))
ERROR: type BigFloat not supported

julia> conv(zeros(Rational{Int}, 1), zeros(Rational{Int}, 1))
1-element Vector{Float64}:
 0.0

For BigFloat, this would change from throwing an error to working, but potentially slow, which I'd consider an improvement. For Rational, I wonder whether doing the computation on Rationals (and also returning Rationals) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for Float32 and Float64 input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.

Closes #411, fixes #410.

codecov · 2024-02-23T08:03:21Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.79%. Comparing base (67d62c9) to head (61a8b8d).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
+ Coverage   97.74%   97.79%   +0.04%     
==========================================
  Files          18       18              
  Lines        3197     3218      +21     
==========================================
+ Hits         3125     3147      +22     
+ Misses         72       71       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wheeheee · 2024-02-23T08:51:41Z

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

wheeheee

maybe Val(N) would help with type stability

src/dspbase.jl

martinholters · 2024-02-23T09:01:37Z

maybe Val(N) would help with type stability

It does for N>10, which may not be too relevant, but then again ... why not?

martinholters · 2024-02-23T09:05:16Z

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Hm. Export tdconv and fdconv (names subject to bike-shedding) and have conv do some best-effort choosing among them? Or add kwarg to conv to force a choice instead of the additional exports?

wheeheee · 2024-02-24T06:22:15Z

I kind of like the names tdconv and fdconv, but maybe adding a kwarg is better, since people might also want to choose the overlap-save algorithm and that's one more function.

src/dspbase.jl

martinholters · 2024-02-26T10:12:36Z

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

The algorithm set to :auto means "select based on problem size", which is the default for types known to be reasonably FFT-able (Float32, Float64, and their Complex versions). For other types the default is :direct. This lets user opt into :auto for other types at their own risk, but provides a sane default.

wheeheee · 2024-02-26T14:10:02Z

I took reference to FastConv.jl (licensed under the "MIT License (Expat)") and wrote a version with an explicit loop over CartesianIndices for 1-based arrays (unsure about offset indices). Here it is:

function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i)+size(k, i)-1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end

Benchmarks:

julia> x = rand(100); y = rand(100); out = rand(199);

julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.000 μs …  38.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.216 μs ± 602.871 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █    █                                            ▃     ▂
  ▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁█ █
  9 μs         Histogram: log(frequency) by time      10.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.510 μs …   5.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.583 μs ± 101.580 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █                                                   ▂
  ▄▁▁█▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▄▁▁█▁▁▂ ▂
  1.51 μs         Histogram: frequency by time        1.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

martinholters · 2024-02-27T07:58:49Z

Yep. I used the generator-based version because it allowed to derive the output type based on the actual values. With the latest change from _conv_td to _conv_td!, a simple loop like you have provided is probably better. Once we're sure about the API, I'll try to optimize, probably by switching to such a loop.

wheeheee · 2024-02-27T09:14:29Z

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

martinholters · 2024-02-27T13:11:24Z

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

I was primarily referring to the keyword-based API. Once that's settled, the organization of the conv methods can be streamlined. And the heuristic when to choose time vs frequency domain is mostly a placeholder for now, that needs adjustment either way, but that's somewhat orthogonal.

So let's discuss API. Some random thoughts I have gathered while working on this:

We should allow the user to explicitly select time or frequency domain. I'm not entirely convinced explicit distinction between "simple FFT" and "overlap-save" is needed. Would there be a use case where a caller needs to decide this?
We should have conv(u,v) do something reasonable by default. To me, that means at least to do small convolutions for non-FFT-able datatypes in time domain and large convolutions of FFT-able datatype in frequency domain. For large integer convolutions, it is not clear whether the long runtime of time domain or the rounding issues of frequency domain are the lesser evil, so an automatic mode that only considers problem size but ignores input types might also be desirable.
For the poly-algorithm conv, the result type should be independent of the algorithm, and I guess the only thing take really makes sense is eltype(conv(u,v)) == promote_type(eltype(u), eltype(v)).
If we decide to implement conv as first allocating the output and then calling a conv! function, we should consider making the latter public.
The implementation in the first commit here uses the collect machinery to derive the output eltype based on the actual values. That may have applications in rare situations, but I wouldn't want to make the API more complicated just to allow this.
We might want to be a bit careful with swapping u and v in the light of non-commutative number types.

I think the current kwargs approach addresses the first three items, although choosing between :auto and :direct based on the eltype might come a bit surprising. I'm not sure I like that. The alternative would be two different auto-like keywords, I guess.

wheeheee · 2024-02-27T16:03:03Z

My opinion, for what it's worth:

I initially thought it possible in rare cases that the slower algorithm is chosen, or that it could be platform-dependent, so that it could be nice to give users a choice by exposing os vs fft. As this doesn't seem to be the case, confusing users with these options seems pointless and without other benefits, I take back that suggestion.
I think many would find conv! welcome, totally in favour of that.
WRT swapping arguments, since AFAIK there isn't a trait or a general way to determine commutativity, we should probably define another union of types that are commutative and where swapping is legal, for correctness? Along with documentation, @warn for non-explicitly-allowed types could be warranted since that can really slow things down, though I'm not sure if @warn does bad things even on the slow path...
On integer convolutions: I'm on the side again of :direct by default, so path 2 in Integer convolutions have rounding errors #410, with another warning if the inputs aren't reasonably small. This would obviate the need for another keyword argument, if I understand you correctly, and another benefit, as was mentioned in that issue, is conforming with SciPy's behaviour.
Additionally, not sure if when conv! is in use, and the output array is of a wider type (although again, not too sure how to determine that), we should promote the elements from u and v first before multiplying and adding.

martinholters · 2024-02-28T13:31:23Z

I'm no fan of @warnings, unless there is also an option to say "I know what I'm doing, no need to flood me with warnings". And including that option would make the API somewhat clunky, in my opinion.

Otherwise, this is slowly getting shape -- partially in code, partially in my head for now. One thing I haven't made up mind about is how conv! should deal with offset axes. A short recap for conv:

If all input axes are OneTos, so will be the output axes, and if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $1,\ldots,M+N-1$ as a OneTo (for every axis).
If there is at least one non-OneTo axis, and if the input indices are $0,\ldots,M-1$ and $0,\ldots,N-1$, the output indices will be $0,\ldots,M+N-2$ (for every axis). Shifts in the input are reflected in the output, so if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $2,\ldots,M+N$. This directly matches the $y[n]=\sum v[m]\cdot u[n-m]$ definition.

The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to conv! matches what conv returns, all is well, but what if not. If the output has a OneTo axis, just storing the results starting at 1 seems ok, I'd say. If the output and inputs have offsets, but they don't "match", if all required indices are present in the output, storing the result there and zeroing the rest sounds reasonable. If the result does not fit into the output, a BoundsError seems warranted. A silent truncation might also be considered. But what if all inputs are OneTos but the output is not? Put the results starting at 2, starting at 1, or error?

Of course, erroring for unclear cases would open the possibility to decide later on, so that might be the safest route for now.

wheeheee · 2024-02-28T15:55:13Z

Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider * potentially non-commutative, I suspect we could still retain the performance benefits of swapping u and v by simply changing the iteration order of the CartesianIndices.

However, as it is, there is the potential for type instability if the axes of each CartesianIndices are not of the same type, although union splitting may automatically take care of that. If this is implemented, I think extracting the kernel out and using a function barrier would at least look nicer, if it isn't strictly necessary. Scratch that, I see that they are all UnitRanges now...

we should promote the elements from u and v first before multiplying and adding.

I also just realized that muladd automatically promotes first.

galenlynch · 2024-02-29T14:26:58Z

These all seem like great changes!

martinholters · 2024-02-29T14:39:43Z

This turns out to be a but more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the working of the algorithm kwarg also may need some more consideration. Current status:

:direct chooses time-domain direct convolution.
:fft chooses an FFT-based algorithm; whether simple or overlap-save is chosen automatically. Should we allow the caller to choose?
:auto picks the faster (well, presently modulo the shortcomings of the super-simple heuristic) of those.

The default for algorithm is chosen based on the (output) eltype: :auto if it is Float32, Float64 or a Complex of those, :direct otherwise. Another option would be to have another keyword, say :fastest, to do what :auto does now, and have :auto be what the default does now (i.e. being equal to :direct or :fastest depending on eltype). Would the latter be better?

Another alternative would be to replace the algorithm keyword with something like allow_fft_based::Boolean where false corresponds to current :direct and true to current :auto, with the default again decided by eltype. Would that be preferable?

galenlynch · 2024-02-29T14:51:27Z

I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution to this and figure out which alg to choose, I found it very hard to come up with heuristics since dimensionality, size of the problem, element type, number of threads available, and architecture (eg avx instructions available) all come into play. Doing a best effort selection and letting users who really care override seems better imo.

martinholters · 2024-02-29T14:54:23Z

That would then include the option the choose between simple FFT and overlap-save, right? (I had this in an earlier iteration, can revive.)

mbaz · 2024-02-29T17:47:24Z

I agree with giving users the ability to choose specific algorithms. Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

wheeheee · 2024-03-01T04:16:43Z

I think the second option with :auto replacing the current default behaviour seems safer.
Perhaps a simplified general-user-facing conv(u, v; mode=:auto(/:unsafe_fast)) would be friendlier, with :unsafe_fast enabling conversion to floats and FFT for numbers that aren't IEEE floats / complex.
And, for those who need it, put the additional, more involved options in conv!(out, u, v; alg), also defining an allocate_output for convenience, where one can choose a specific algorithm, :fft, or the options in conv. The different keyword arguments would be confusing though.

martinholters · 2024-03-01T07:23:43Z

Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

Do you mean fft behave like the present auto, i.e. also allowing time-domain? That would be a bit surprising, so probably not. So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

@wheeheee I don't think allowing different kwargs in conv and conv! for similar purposes would help simplify the interface for users in any way.

mbaz · 2024-03-01T14:00:03Z

So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

Yes, that's what I meant.

test/dsp.jl

martinholters · 2024-03-11T14:36:37Z

Ok, as there were no complaints about the current API, I've added docs and more tests. If someone could check the docs for comprehensibility, that would be welcome.
Remaining to-do is to devise a heuristic for deciding between time and frequency domain.

mbaz

A couple of potential improvements to the docstrings.

src/dspbase.jl

wheeheee

typos

src/dspbase.jl

martinholters · 2024-03-28T12:34:25Z

Sorry this has stalled. The only thing left to do from my perspective is to come up with a heuristic to choose between time-domain and frequency-domain. I meant to base this on some benchmarking, but that "some" turned out to be quite a lot, especially when not restricting to 1-d. For the time-domain algorithm, my benchmarking confirmed that C*length(u)*length(v) is a reasonable estimate for the runtime with C depending on dimensionality and element type. (Interestingly, dimensionality and element type seem to be non-orthogonal here, though.) But some preliminary benchmarking I did for the FFT convolution gave surprisingly inconclusive results.

If anyone wants to pick this up or has a good idea for a benchmarking strategy - any help is appreciated!

wheeheee · 2024-03-31T17:25:15Z

Not sure what the detailed benchmarks say, but would it be acceptable to start with the conservative heuristic here since it's IIUC a definite improvement for small sizes?
Also saw this and tried it out, with muladd too; I think it fixes the problem with OffsetArrays not SIMDing as well as normal arrays. Hopefully @simd is safe for arbitrary eltypes.
Another small problem: the padding heuristic using nextfastfft isn't always optimal, causing performance to have weird jumps sometimes.

julia> x = rand(ComplexF64, 245); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  45.600 μs (13 allocations: 31.91 KiB)

julia> x = rand(ComplexF64, 246); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  27.500 μs (13 allocations: 32.28 KiB)

* add `:algorithm` kwarg * add `conv!`

Co-authored-by: wheeheee <104880306+wheeheee@users.noreply.github.com>

wheeheee reviewed Feb 23, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

martinholters mentioned this pull request Feb 24, 2024

conv is slower than necessary #167

Open

martinholters commented Feb 24, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

martinholters force-pushed the mh/tdconv branch from 68c20b4 to 3bb56ca Compare February 26, 2024 07:33

martinholters force-pushed the mh/tdconv branch from addb53b to fce6cf9 Compare February 29, 2024 14:19

martinholters force-pushed the mh/tdconv branch from fce6cf9 to 969a8a1 Compare March 1, 2024 08:41

wheeheee reviewed Mar 6, 2024

View reviewed changes

test/dsp.jl Outdated Show resolved Hide resolved

martinholters force-pushed the mh/tdconv branch from af5889d to 4dd76fa Compare March 6, 2024 07:46

wheeheee mentioned this pull request Mar 8, 2024

Make conv return floats when given integer arrays #411

Closed

martinholters force-pushed the mh/tdconv branch 2 times, most recently from bcfbb77 to 7c603ec Compare March 11, 2024 13:05

mbaz reviewed Mar 11, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

wheeheee reviewed Mar 12, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

src/dspbase.jl Outdated Show resolved Hide resolved

wheeheee reviewed Mar 13, 2024

View reviewed changes

src/dspbase.jl Outdated Show resolved Hide resolved

martinholters and others added 8 commits May 1, 2024 00:02

Implement time-domain convolution and use it for integers

8699efe

Rework conv interface and rework its internals

ce8ac71

* add `:algorithm` kwarg * add `conv!`

minor conv optimization

3da38a3

Update/add docs for conv/conv!

41dfa6a

more conv testing

4a0f559

Fix docs typos

7cb1df5

Co-authored-by: wheeheee <104880306+wheeheee@users.noreply.github.com>

More doc improvements

275509f

FIx typos in docstring

61a8b8d

Co-authored-by: wheeheee <104880306+wheeheee@users.noreply.github.com>

wheeheee force-pushed the mh/tdconv branch from 3b84ab6 to 61a8b8d Compare April 30, 2024 22:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement time-domain convolution and use it for integers #545

Implement time-domain convolution and use it for integers #545

martinholters commented Feb 23, 2024

codecov bot commented Feb 23, 2024 •

edited

wheeheee commented Feb 23, 2024

wheeheee left a comment

martinholters commented Feb 23, 2024 •

edited

martinholters commented Feb 23, 2024

wheeheee commented Feb 24, 2024

martinholters commented Feb 26, 2024

wheeheee commented Feb 26, 2024 •

edited

martinholters commented Feb 27, 2024

wheeheee commented Feb 27, 2024

martinholters commented Feb 27, 2024

wheeheee commented Feb 27, 2024 •

edited

martinholters commented Feb 28, 2024

wheeheee commented Feb 28, 2024 •

edited

galenlynch commented Feb 29, 2024

martinholters commented Feb 29, 2024

galenlynch commented Feb 29, 2024

martinholters commented Feb 29, 2024

mbaz commented Feb 29, 2024

wheeheee commented Mar 1, 2024

martinholters commented Mar 1, 2024

mbaz commented Mar 1, 2024

martinholters commented Mar 11, 2024

mbaz left a comment

wheeheee left a comment

martinholters commented Mar 28, 2024

wheeheee commented Mar 31, 2024

Implement time-domain convolution and use it for integers #545

Are you sure you want to change the base?

Implement time-domain convolution and use it for integers #545

Conversation

martinholters commented Feb 23, 2024

codecov bot commented Feb 23, 2024 • edited

Codecov Report

wheeheee commented Feb 23, 2024

wheeheee left a comment

Choose a reason for hiding this comment

martinholters commented Feb 23, 2024 • edited

martinholters commented Feb 23, 2024

wheeheee commented Feb 24, 2024

martinholters commented Feb 26, 2024

wheeheee commented Feb 26, 2024 • edited

martinholters commented Feb 27, 2024

wheeheee commented Feb 27, 2024

martinholters commented Feb 27, 2024

wheeheee commented Feb 27, 2024 • edited

martinholters commented Feb 28, 2024

wheeheee commented Feb 28, 2024 • edited

galenlynch commented Feb 29, 2024

martinholters commented Feb 29, 2024

galenlynch commented Feb 29, 2024

martinholters commented Feb 29, 2024

mbaz commented Feb 29, 2024

wheeheee commented Mar 1, 2024

martinholters commented Mar 1, 2024

mbaz commented Mar 1, 2024

martinholters commented Mar 11, 2024

mbaz left a comment

Choose a reason for hiding this comment

wheeheee left a comment

Choose a reason for hiding this comment

martinholters commented Mar 28, 2024

wheeheee commented Mar 31, 2024

codecov bot commented Feb 23, 2024 •

edited

martinholters commented Feb 23, 2024 •

edited

wheeheee commented Feb 26, 2024 •

edited

wheeheee commented Feb 27, 2024 •

edited

wheeheee commented Feb 28, 2024 •

edited