
Add Base.parent method #14

Merged
Jutho merged 1 commit into Jutho:master from mcabbott:patch-1 on Sep 4, 2020
Conversation

mcabbott (Contributor) commented Sep 4, 2020

I think this solves the following error (on TensorOperations v3, but not on v2):

using TensorOperations, Strided

b0 = rand(3,2);
f0 = rand(2,3,4);

bison2 = @strided permutedims(b0, (2,1))
ferret2 = @strided permutedims(f0, (1, 2, 3))

@tensor finch[k] := ferret2[i, j, k] * bison2[i, j]

# ERROR: StackOverflowError:
# Stacktrace:
#  [1] memsize(::StridedView{Float64,2,Array{Float64,1},typeof(identity)}) at /Users/me/.julia/packages/TensorOperations/1TnL0/src/implementation/stridedarray.jl:5 (repeats 79984 times)

TensorOperations.memsize(bison2) # same
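For reference, the failure pattern is consistent with TensorOperations recursing through parent(...) until it reaches a plain Array: Base's generic fallback parent(A::AbstractArray) = A returns the StridedView itself, so the recursion never terminates. A minimal sketch of the kind of one-line definition this PR adds, assuming StridedView stores its backing array in a field named parent:

# Sketch only: expose the backing array of a StridedView, so that generic
# code walking parent(...) chains bottoms out at the underlying Array.
Base.parent(a::StridedView) = a.parent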

Jutho (Owner) commented Sep 4, 2020

Thanks; I will merge this as soon as the lights turn green. I guess I had not considered using StridedView objects directly in @tensor, even though the latter is implemented using the former.

I had the idea at some point to add an @tensor-like macro as an interface over Strided.jl, to allow things beyond the pure Einstein summation convention (permute-permute-gemm): applying arbitrary functions, reducing over one, three, or more indices at the same time, and so on.

However, as I didn't need this urgently it got postponed, and with LoopVectorization and Tullio it seems this has all been covered. How does the multithreading strategy you have in Tullio compare to the one in Strided.jl?
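(For instance, a hedged sketch of the sort of beyond-Einstein operation meant above, written in Tullio's current syntax; the arrays here are made up:)

using Tullio

A = rand(4, 5, 6);

# Apply an arbitrary function and sum over two indices at once, which
# pure permute-permute-gemm Einstein summation cannot express directly:
@tullio s[i] := abs2(A[i, j, k])

# A non-sum reduction, using Tullio's (op) syntax:
@tullio (max) m[i] := A[i, j, k]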

It would be great to have one go-to killer package for all of those things; I am always open to collaboration if you're interested.

mcabbott (Contributor, Author) commented Sep 4, 2020

Thanks. This was a bit of a hack I had lying around somewhere, which seemed to work before.

It looks like _threaded_blas_mul! uses exactly the same divide-the-biggest-dimension recursive strategy that Tullio.threader does. (Except that I added a case for nthreads divisible by 3, since I have a 6-core CPU.)

Tullio also does some kind of recursive tiling, and when I last checked @tensor was quicker at doing permutedims, but I never got around to understanding why. Is this something that might be easy to borrow or share?
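(A toy sketch of that shared divide-the-biggest-dimension strategy, for readers following along; this is simplified and not the actual Strided.jl or Tullio code:)

using Base.Threads: @spawn

# Recursively halve the longest axis of the output ranges across tasks,
# then run the serial kernel on each leaf block. Spawned blocks write to
# disjoint slices of C, so the tasks are independent.
function divide!(kernel!, C, ranges::Tuple, ntasks::Int)
    ntasks <= 1 && return kernel!(C, ranges)
    d = argmax(map(length, ranges))   # pick the biggest dimension
    r = ranges[d]
    mid = first(r) + length(r) ÷ 2 - 1
    left  = ntuple(i -> i == d ? (first(r):mid) : ranges[i], length(ranges))
    right = ntuple(i -> i == d ? ((mid + 1):last(r)) : ranges[i], length(ranges))
    task = @spawn divide!(kernel!, C, left, ntasks ÷ 2)
    divide!(kernel!, C, right, ntasks - ntasks ÷ 2)
    wait(task)
    return C
end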

mcabbott (Contributor, Author) commented Sep 4, 2020

I had somehow not seen _mul! before, but it is very quick. And all from a generic mapreducedim; perhaps I should see whether @tullio can use exactly this for many operations. It seems like this ought to be quicker than permute-gemm-permute for many of the things @tensor does, too?

julia> using LinearAlgebra, Strided, Tullio, LoopVectorization, BenchmarkTools

julia> BLAS.vendor()
:mkl

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j];

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2))
           @btime mul!(C, $A, $B) setup=(C=zeros($N, $N+2))
           @btime @strided(Strided._mul!(C, $A, $B, 1.0, 0.0)) setup=(C=zeros($N, $N+2));
       end
N = 2
43.715 ns (0 allocations: 0 bytes)
197.185 ns (0 allocations: 0 bytes)
345.700 ns (7 allocations: 256 bytes)
N = 10
113.348 ns (0 allocations: 0 bytes)
307.412 ns (0 allocations: 0 bytes)
439.919 ns (7 allocations: 256 bytes)
N = 50
5.394 μs (0 allocations: 0 bytes)
4.940 μs (0 allocations: 0 bytes)
5.162 μs (7 allocations: 256 bytes)
N = 100
23.970 μs (47 allocations: 2.98 KiB)
25.233 μs (0 allocations: 0 bytes)
25.425 μs (7 allocations: 256 bytes)
N = 200
175.963 μs (49 allocations: 3.05 KiB)
185.961 μs (0 allocations: 0 bytes)
186.014 μs (7 allocations: 256 bytes)
N = 500
3.956 ms (50 allocations: 3.08 KiB)
2.999 ms (0 allocations: 0 bytes)
2.916 ms (7 allocations: 256 bytes)
N = 1000
38.370 ms (51 allocations: 3.11 KiB)
28.367 ms (0 allocations: 0 bytes)
23.967 ms (7 allocations: 256 bytes)
N = 2000
292.129 ms (50 allocations: 3.08 KiB)
200.771 ms (0 allocations: 0 bytes)
195.420 ms (7 allocations: 256 bytes)
N = 5000
4.597 s (52 allocations: 3.14 KiB)
3.798 s (0 allocations: 0 bytes)
3.942 s (7 allocations: 256 bytes)
N = 10000
52.223 s (52 allocations: 3.14 KiB)
41.940 s (0 allocations: 0 bytes)
30.871 s (7 allocations: 256 bytes)

Jutho (Owner) commented Sep 4, 2020

Note that _mul! just calls BLAS gemm! on standard arrays. The only thing it adds is its own multithreading implementation when JULIA_NUM_THREADS > 1. It will actually also work with thread counts that are not powers of 2, though probably not optimally. This implementation was an attempt to replace the multithreading of BLAS with one provided by Julia, so that you could run with BLAS.set_num_threads(1) and have Julia take care of the threads. (The problem is that, while this performs quite well for mul!, you also lose all multithreading in the LAPACK functions, which you would then have to replace with Julia versions.)

I am not sure how you are timing this, but you should not really compare these within one Julia session unless you change BLAS.set_num_threads() in between. If you have JULIA_NUM_THREADS > 1 and also more than one BLAS thread (e.g. the native/default setting), you are oversubscribing your cores, and I don't think there is anything to gain from that.
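(Concretely, the benchmarking hygiene being suggested is just this, pinning BLAS to a single thread so that Julia-level threading is the only source of parallelism:)

using LinearAlgebra

# With JULIA_NUM_THREADS > 1, limit BLAS to one thread so the two thread
# pools don't oversubscribe the cores during the comparison.
BLAS.set_num_threads(1)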

So the threading in _mul! is very distinct from the threading in the generic mapreducedim. The latter will not perform well for matrix multiplication, not really due to the threading but due to the kernel. The threading mechanism there uses a cost function which I generalised from the HPTT library: https://github.com/springer13/hptt

Jutho merged commit d473d69 into Jutho:master on Sep 4, 2020
mcabbott (Contributor, Author) commented Sep 4, 2020

Oh right, I misread that and confused __mul! with _mul!; "too good to be true" should have been a clue. This all makes more sense now. (I guess my timings were oversubscribing threads, but the penalty for using twice as many as I should seems not very large.)

Thanks for the link; I'll see what I can learn from HPTT.

mcabbott deleted the patch-1 branch on September 4, 2020, 21:48