
Add Base.parent method #14

Merged
Jutho merged 1 commit into Jutho:master from mcabbott:patch-1 on Sep 4, 2020
Conversation

mcabbott (Contributor) commented Sep 4, 2020

I think this solves the following error (on TensorOperations v3, but not on v2):

using TensorOperations, Strided

b0 = rand(3,2);
f0 = rand(2,3,4);

bison2 = @strided permutedims(b0, (2,1))
ferret2 = @strided permutedims(f0, (1, 2, 3))

@tensor finch[k] := ferret2[i, j, k] * bison2[i, j]

# ERROR: StackOverflowError:
# Stacktrace:
#  [1] memsize(::StridedView{Float64,2,Array{Float64,1},typeof(identity)}) at /Users/me/.julia/packages/TensorOperations/1TnL0/src/implementation/stridedarray.jl:5 (repeats 79984 times)

TensorOperations.memsize(bison2) # same
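For reference, the failure pattern is consistent with TensorOperations recursing through parent(...) until it reaches a plain Array: Base's generic fallback parent(A::AbstractArray) = A returns the StridedView itself, so the recursion never terminates. A minimal sketch of the kind of one-line definition this PR adds, assuming StridedView stores its backing array in a field named parent:

# Sketch only: expose the backing array of a StridedView, so that generic
# code walking parent(...) chains bottoms out at the underlying Array.
Base.parent(a::StridedView) = a.parent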

Jutho (Owner) commented Sep 4, 2020

Thanks; I will merge this as soon as the lights turn green. I guess I had not considered using StridedView objects directly in @tensor, even though the latter is implemented using the former.

I had the idea at some point to add an @tensor-like macro as an interface over Strided.jl, to allow things beyond the pure Einstein summation convention (permute-permute-gemm): applying arbitrary functions, reducing over one, three, or more indices at the same time, and so on.

However, as I didn't need this urgently it got postponed, and with LoopVectorization and Tullio it seems this has all been covered. How does the multithreading strategy you have in Tullio compare to the one in Strided.jl?
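(For instance, a hedged sketch of the sort of beyond-Einstein operation meant above, written in Tullio's current syntax; the arrays here are made up:)

using Tullio

A = rand(4, 5, 6);

# Apply an arbitrary function and sum over two indices at once, which
# pure permute-permute-gemm Einstein summation cannot express directly:
@tullio s[i] := abs2(A[i, j, k])

# A non-sum reduction, using Tullio's (op) syntax:
@tullio (max) m[i] := A[i, j, k]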

It would be great to have one go-to killer package for all of those things; I am always open to collaboration if you're interested.

mcabbott (Contributor, Author) commented Sep 4, 2020

Thanks. This was a bit of a hack I had lying around somewhere, which seemed to work before.

It looks like _threaded_blas_mul! uses exactly the same divide-the-biggest-dimension recursive strategy that Tullio.threader does. (Except that I added a case for nthreads divisible by 3, since I have a 6-core CPU.)

Tullio also does some kind of recursive tiling, and when I last checked @tensor was quicker at doing permutedims, but I never got around to understanding why. Is this something that might be easy to borrow or share?
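(A toy sketch of that shared divide-the-biggest-dimension strategy, for readers following along; this is simplified and not the actual Strided.jl or Tullio code:)

using Base.Threads: @spawn

# Recursively halve the longest axis of the output ranges across tasks,
# then run the serial kernel on each leaf block. Spawned blocks write to
# disjoint slices of C, so the tasks are independent.
function divide!(kernel!, C, ranges::Tuple, ntasks::Int)
    ntasks <= 1 && return kernel!(C, ranges)
    d = argmax(map(length, ranges))   # pick the biggest dimension
    r = ranges[d]
    mid = first(r) + length(r) ÷ 2 - 1
    left  = ntuple(i -> i == d ? (first(r):mid) : ranges[i], length(ranges))
    right = ntuple(i -> i == d ? ((mid + 1):last(r)) : ranges[i], length(ranges))
    task = @spawn divide!(kernel!, C, left, ntasks ÷ 2)
    divide!(kernel!, C, right, ntasks - ntasks ÷ 2)
    wait(task)
    return C
end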

mcabbott (Contributor, Author) commented Sep 4, 2020

I had somehow not seen _mul! before, but it is very quick. And all from a generic mapreducedim; perhaps I should see whether @tullio can use exactly this for many operations. It seems like this ought to be quicker than permute-gemm-permute for many of the things @tensor does, too?

julia> using LinearAlgebra, Strided, Tullio, LoopVectorization, BenchmarkTools

julia> BLAS.vendor()
:mkl

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j];

julia> foreach((2, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10_000)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2))
           @btime mul!(C, $A, $B) setup=(C=zeros($N, $N+2))
           @btime @strided(Strided._mul!(C, $A, $B, 1.0, 0.0)) setup=(C=zeros($N, $N+2));
       end
N = 2
43.715 ns (0 allocations: 0 bytes)
197.185 ns (0 allocations: 0 bytes)
345.700 ns (7 allocations: 256 bytes)
N = 10
113.348 ns (0 allocations: 0 bytes)
307.412 ns (0 allocations: 0 bytes)
439.919 ns (7 allocations: 256 bytes)
N = 50
5.394 μs (0 allocations: 0 bytes)
4.940 μs (0 allocations: 0 bytes)
5.162 μs (7 allocations: 256 bytes)
N = 100
23.970 μs (47 allocations: 2.98 KiB)
25.233 μs (0 allocations: 0 bytes)
25.425 μs (7 allocations: 256 bytes)
N = 200
175.963 μs (49 allocations: 3.05 KiB)
185.961 μs (0 allocations: 0 bytes)
186.014 μs (7 allocations: 256 bytes)
N = 500
3.956 ms (50 allocations: 3.08 KiB)
2.999 ms (0 allocations: 0 bytes)
2.916 ms (7 allocations: 256 bytes)
N = 1000
38.370 ms (51 allocations: 3.11 KiB)
28.367 ms (0 allocations: 0 bytes)
23.967 ms (7 allocations: 256 bytes)
N = 2000
292.129 ms (50 allocations: 3.08 KiB)
200.771 ms (0 allocations: 0 bytes)
195.420 ms (7 allocations: 256 bytes)
N = 5000
4.597 s (52 allocations: 3.14 KiB)
3.798 s (0 allocations: 0 bytes)
3.942 s (7 allocations: 256 bytes)
N = 10000
52.223 s (52 allocations: 3.14 KiB)
41.940 s (0 allocations: 0 bytes)
30.871 s (7 allocations: 256 bytes)

Jutho (Owner) commented Sep 4, 2020

Note that _mul! just calls BLAS gemm! on standard arrays. The only thing it adds is its own multithreading implementation when JULIA_NUM_THREADS > 1. It will actually also work with thread counts that are not powers of 2, though probably not optimally. This implementation was an attempt to replace the multithreading of BLAS with one provided by Julia, so that you could run with BLAS.set_num_threads(1) and have Julia take care of the threads. (The problem is that, while this performs quite well for mul!, you also lose all multithreading in the LAPACK functions, which you would then have to replace with Julia versions.)

I am not sure how you are timing this, but you should not really compare these within one Julia session unless you change BLAS.set_num_threads() in between. If you have JULIA_NUM_THREADS > 1 and also more than one BLAS thread (e.g. the native/default setting), you are oversubscribing your cores, and I don't think there is anything to gain from that.
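(Concretely, the benchmarking hygiene being suggested is just this, pinning BLAS to a single thread so that Julia-level threading is the only source of parallelism:)

using LinearAlgebra

# With JULIA_NUM_THREADS > 1, limit BLAS to one thread so the two thread
# pools don't oversubscribe the cores during the comparison.
BLAS.set_num_threads(1)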

So the threading in _mul! is very distinct from the threading in the generic mapreducedim. The latter will not perform well for matrix multiplication, not really due to the threading but due to the kernel. The threading mechanism there uses a cost function which I generalised from the HPTT library: https://github.com/springer13/hptt

Jutho merged commit d473d69 into Jutho:master on Sep 4, 2020
mcabbott (Contributor, Author) commented Sep 4, 2020

Oh right, I misread that and confused __mul! with _mul!; "too good to be true" should have been a clue. This all makes more sense now. (I guess my timings were oversubscribing threads, but the penalty for using twice as many as I should seems not very large.)

Thanks for the link; I'll see what I can learn from HPTT.

mcabbott deleted the patch-1 branch on September 4, 2020, 21:48