Supporting Tropical numbers #201

Closed
GiggleLiu opened this issue Feb 17, 2021 · 16 comments

@GiggleLiu
Contributor

GiggleLiu commented Feb 17, 2021

Nice project! I tried Octavian, the BLAS package built on top of LoopVectorization; the performance is amazing.

Now I wish to make a BLAS package for Tropical algebra (a simple semiring algebra obtained by replacing + with max and * with +) and for more complicated algebras (see the CountingTropical type below). I am wondering how difficult it may be.
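
For concreteness, the whole semiring fits in a few lines (a toy sketch for illustration, not TropicalNumbers.jl's actual definitions):

# Toy tropical semiring: (max, +) plays the role of (+, *).
struct Trop{T} <: Number
    n::T
end
Base.:+(a::Trop, b::Trop) = Trop(max(a.n, b.n))           # tropical "addition" is max
Base.:*(a::Trop, b::Trop) = Trop(a.n + b.n)               # tropical "multiplication" is +
Base.zero(::Type{Trop{T}}) where {T} = Trop(typemin(T))   # identity of max (-Inf for Float64)
Base.one(::Type{Trop{T}}) where {T} = Trop(zero(T))       # identity of +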

Now if I try using the Tropical type, the error message shows

julia> using TropicalNumbers, Octavian
julia> A, B = randn(1000, 1000), randn(1000, 1000);
julia> A2, B2 = Tropical.(A), Tropical.(B);
julia> Octavian.matmul(A2, B2);
ERROR: MethodError: no method matching stridedpointer(::Matrix{TropicalF64})
Closest candidates are:
  stridedpointer(::Ptr{T}, ::StaticInt{C}, ::StaticInt{B}, ::Val{R}, ::X, ::O) where {T<:Union{Bool, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, VectorizationBase.Bit}, C, B, R, N, X<:Tuple{Vararg{Any, N}}, O<:Tuple{Vararg{Any, N}}} at /home/leo/.julia/packages/VectorizationBase/bYkVf/src/strided_pointers/stridedpointers.jl:73
  stridedpointer(::BitVector) at /home/leo/.julia/packages/VectorizationBase/bYkVf/src/strided_pointers/stridedpointers.jl:206
  stridedpointer(::BitArray{N}) where N at /home/leo/.julia/packages/VectorizationBase/bYkVf/src/strided_pointers/stridedpointers.jl:207
  ...
Stacktrace:
 [1] zstridedpointer
   @ ~/.julia/packages/VectorizationBase/bYkVf/src/strided_pointers/stridedpointers.jl:95 [inlined]
 [2] _matmul!
   @ ~/.julia/packages/Octavian/KUsJ4/src/matmul.jl:232 [inlined]
 [3] matmul(A::Matrix{TropicalF64}, B::Matrix{TropicalF64})
   @ Octavian ~/.julia/packages/Octavian/KUsJ4/src/matmul.jl:191
 [4] top-level scope
   @ ./timing.jl:206 [inlined]
 [5] top-level scope
   @ ./REPL[19]:0

References

  • The functions defining tropical numbers (L38 and L48 of base.jl):

https://github.com/TensorBFS/TropicalNumbers.jl/blob/547edebf9abd3247299975c712e59e08eff2df75/src/base.jl#L38

@chriselrod

@chriselrod
Member

The easiest way to get this to work would be to swap regular numbers with tropical numbers, and leave LoopVectorization oblivious. This should be fine, because operations like + and * are similar in cost to max and +.

LoopVectorization works internally with SIMD types. So you'll have to make sure all the basic operations work correctly.

julia> using TropicalNumbers, VectorizationBase

julia> vx = Vec(ntuple(_ -> randn(), VectorizationBase.pick_vector_width(Float64))...)
Vec{8, Float64}<-0.4542360436646172, 0.808469891292917, 0.3402188425112263, -0.9854732349028669, -0.5457866173882676, 0.5384574165868918, -0.4639713295159042, 0.7095047672139283>

julia> vy = Vec(ntuple(_ -> randn(), VectorizationBase.pick_vector_width(Float64))...)
Vec{8, Float64}<-0.9020073240658397, 0.2134128364652452, -0.2290577287345638, -0.7468730791131707, 0.5319348585990041, -0.32909273153665797, -0.7241752504699737, -0.4699665265194333>

julia> Tropical(vx) + Tropical(vy)
Tropical(Vec{8, Float64}<-0.4542360436646172, 0.808469891292917, 0.3402188425112263, -0.7468730791131707, 0.5319348585990041, 0.5384574165868918, -0.4639713295159042, 0.7095047672139283>)

julia> Tropical(vx) * Tropical(vy)
Tropical(Vec{8, Float64}<-1.3562433677304568, 1.0218827277581621, 0.11116111377666252, -1.7323463140160376, -0.01385175878926348, 0.20936468505023387, -1.188146579985878, 0.23953824069449497>)

Thankfully, it looks like this is already handled correctly.

So the trickier part will be getting loads/stores from memory to work. I'd recommend defining a TropicalStridedPointer type, e.g. something like:

struct TropicalStridedPointer{T,N,C,B,R,X,O,SP <: VectorizationBase.AbstractStridedPointer{T,N,C,B,R,X,O} }  <: VectorizationBase.AbstractStridedPointer{T,N,C,B,R,X,O}
    sp::SP
end
function VectorizationBase.stridedpointer(A::AbstractArray{Tropical{T}}) where {T}
    TropicalStridedPointer(stridedpointer(reinterpret(T, A)))
end
@inline function VectorizationBase.stridedpointer(
    ptr::Ptr{Tropical{T}}, ::StaticInt{C}, ::StaticInt{B}, ::Val{R}, strd::X, offsets::O
) where {T<:NativeTypesExceptBit,C,B,R,N,X<:Tuple{Vararg{Integer,N}},O<:Tuple{Vararg{Integer,N}}}
    TropicalStridedPointer(StridedPointer{T,N,C,B,R,X,O}(Base.unsafe_convert(Ptr{T}, ptr), strd, offsets))
end

and then working on overloads for vload and vstore! on this type, so that it wraps/strips the Tropical.
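
The wrapping/stripping might look roughly like this (a sketch only; the actual vload/vstore! methods take more arguments and vary between VectorizationBase versions):

# Forward loads/stores to the inner pointer, wrapping/stripping the Tropical.
@inline function VectorizationBase.vload(p::TropicalStridedPointer, i::Tuple)
    Tropical(VectorizationBase.vload(p.sp, i))      # load raw values, wrap them
end
@inline function VectorizationBase.vstore!(p::TropicalStridedPointer, v::Tropical, i::Tuple)
    VectorizationBase.vstore!(p.sp, content(v), i)  # strip the wrapper, store raw values
end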
Here are the unit tests, if you want something to run as a test.
You should be able to replace A with

A = reshape(collect(Tropical(Float64(0)):Tropical(Float64(prod(dims)-1))), dims);

and still have the following load/store tests pass, and of course make sure the loaded vectors are Tropical!

Note that it was stridedpointer that errored when trying to use Octavian.

LoopVectorization will also wrap these tropical vectors with various types, like OffsetPreCalc, and will start using GroupedStridedPointer soon. These should hopefully work as-is.

Once all this works, you'll also need to define this method:

LoopVectorization.check_type(::Type{Tropical{T}}) where {T} = LoopVectorization.check_type(T)

If this returns false, LoopVectorization will silently use a slow fallback just applying (@inbounds @fastmath) to the loops.

That's all that immediately comes to mind. From there, you can try simple problems first, like a dot product:

function mydot(a,b)
    s = zero(promote_type(eltype(a),eltype(b)))
    @avx for i in eachindex(a,b)
        s += a[i]*b[i]
    end
    s
end
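
A quick correctness check against a naive reduce (a sketch, assuming zero and isapprox are defined for Tropical as in TropicalNumbers):

a, b = Tropical.(randn(100)), Tropical.(randn(100));
reference = reduce(+, a .* b)   # naive semiring dot product: max over lane-wise sums
@assert mydot(a, b) ≈ reference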

Once this works, you can try matrix multiply examples, e.g. simple loops, and then finally Octavian.

Note that Octavian currently acts really badly if it gets an error in multithreaded code. You'll probably have to restart Julia if you do, and might even need to use a task manager to kill all the threads (which could keep going after you exit Julia).
I've made sure it doesn't/can't throw errors with the standard data types, but while trying to get something new to work, mistakes happen. So I'd try matmul_serial first, and once that works switch to matmul.

I hope that helps, please let me know if you have any questions/run into any issues.

@GiggleLiu
Contributor Author

GiggleLiu commented Feb 18, 2021

Hi @chriselrod, thank you very much for your help. Here is a gist following your instructions. It was easier than I expected, and the results of mydot look correct. Neither matmul nor matmul_serial breaks my Julia REPL, and they give results consistent with each other.

https://gist.github.com/GiggleLiu/a6d2bed21731fa344f4d7c1660f35952

However, the results are those of regular matmul rather than tropical matmul. It seems the program does not recognize the element type. Do you know how to fix this?

(EDIT: I noticed that I forgot to overload vload and vstore!; I'll come back to you later.)

@chriselrod
Member

chriselrod commented Feb 18, 2021

I would add an

@test LoopVectorization.check_args(a, b)

to make sure it returns true.

A couple examples to try:

vload(stridedpointer(a), (MM{4}(1),1)) # load a[1:4,1]
vload(stridedpointer(a), (1,MM{4}(1))) # load a[1,1:4]
vload(stridedpointer(a), Unroll{2,1,3,1,4}((1,1))) # load a[1:4,1:3]

If a is a Matrix{Float64}:

julia> a = rand(4,4)
4×4 Matrix{Float64}:
 0.199901   0.737377  0.535227  0.423361
 0.0957741  0.398867  0.574244  0.898327
 0.139651   0.878405  0.316808  0.31598
 0.283851   0.267324  0.975793  0.326679

julia> vload(stridedpointer(a), (MM{4}(1),1)) # load a[1:4,1]
Vec{4, Float64}<0.199900943073283, 0.0957741305262807, 0.13965061919067523, 0.2838507750142387>

julia> vload(stridedpointer(a), (1,MM{4}(1))) # load a[1,1:4]
Vec{4, Float64}<0.199900943073283, 0.7373773483950572, 0.5352265912879297, 0.42336121283806105>

julia> vload(stridedpointer(a), Unroll{2,1,3,1,4}((1,1))) # load a[1:4,1:3]
3 x Vec{4, Float64}
Vec{4, Float64}<0.199900943073283, 0.0957741305262807, 0.13965061919067523, 0.2838507750142387>
Vec{4, Float64}<0.7373773483950572, 0.39886661759802355, 0.8784048837196166, 0.2673242656533237>
Vec{4, Float64}<0.5352265912879297, 0.5742437648414183, 0.31680796319268056, 0.975792905256238>

With your Matrix{Tropical{Float64}}, make sure these loaded values are all Tropical (and then also that storing them works).
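
For the store side, a simple round-trip check (a sketch, assuming the vstore! overloads mentioned above):

a0 = copy(a)                     # a is your Matrix{Tropical{Float64}}
sp = stridedpointer(a)
v = vload(sp, (MM{4}(1), 1))     # should come back Tropical-wrapped
vstore!(sp, v, (MM{4}(1), 1))    # storing it straight back should be a no-op
@assert a == a0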

@GiggleLiu
Contributor Author

GiggleLiu commented Feb 24, 2021

@chriselrod Thanks for your instructions. I feel I am now very close to implementing it correctly. The most exciting thing is: the tests now pass! Here is the gist:

https://gist.github.com/GiggleLiu/a6d2bed21731fa344f4d7c1660f35952

But it requires several changes:

  1. In file VectorizationBase/src/strided_pointers/stridedpointers.jl, the pointer type restriction should be removed:
-abstract type AbstractStridedPointer{T<:NativeTypes,N,C,B,R,X<:Tuple{Vararg{Any,N}},O<:Tuple{Vararg{Any,N}}} end
+abstract type AbstractStridedPointer{T,N,C,B,R,X<:Tuple{Vararg{Any,N}},O<:Tuple{Vararg{Any,N}}} end
  2. You will see a TODO in the script; this is a very ugly patch that I used to make it work:
# TODO: FIX!!!!!!
function Base.promote(a::Int, b::Tropical{Vec{4,Float64}})
    elem = a == 0 ? -Inf : 0.0
    @show a
    Tropical(Vec(elem, elem, elem, elem)), b
end

function Base.promote(a::Int, b::Tropical{Vec{4,Float64}}, c::Tropical{Vec{4,Float64}})
    elem = a == 0 ? -Inf : 0.0
    @show a
    Tropical(Vec(elem, elem, elem, elem)), b, c
end

There is no promotion relation between Int and Tropical numbers. I guess we should use the more standard zero(T) and one(T) to generate the zero and one elements? However, I cannot locate where this promotion happens. I hope the error message below helps.
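
What I have in mind is something like this (a hypothetical sketch; vbroadcast fills all lanes with one value):

# Hypothetical zero/one for vectors of Tropical numbers:
Base.zero(::Type{Tropical{Vec{N,T}}}) where {N,T} =
    Tropical(vbroadcast(Val(N), typemin(T)))   # -Inf for Float64: identity of max
Base.one(::Type{Tropical{Vec{N,T}}}) where {N,T} =
    Tropical(vbroadcast(Val(N), zero(T)))      # 0.0: identity of +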

The error message is

Error During Test at /home/leo/jcode/lab/tropicalblas.jl:88
  Test threw exception
  Expression: Octavian.matmul_serial(a, b) ≈ a * b
  promotion of types Int64 and Tropical{Vec{4, Float64}} failed to change any arguments
  Stacktrace:
    [1] error(::String, ::String, ::String)
      @ Base ./error.jl:42
    [2] sametype_error(input::Tuple{Int64, Tropical{Vec{4, Float64}}})
      @ Base ./promotion.jl:316
    [3] not_sametype(x::Tuple{Int64, Tropical{Vec{4, Float64}}}, y::Tuple{Int64, Tropical{Vec{4, Float64}}})
      @ Base ./promotion.jl:310
    [4] promote
      @ ./promotion.jl:293 [inlined]
    [5] mul_fast(::Int64, ::Tropical{Vec{4, Float64}})
      @ Base.FastMath ./fastmath.jl:267
    [6] macro expansion
      @ ~/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:526 [inlined]
    [7] _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###zero###9###"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000023, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000031, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000231, 0x0000000000000003, 0x0000000000000000, 0x0000000000020301, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000004, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x06), :LoopVectorization, :mul_fast, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000708, LoopVectorization.compute, 0x00, 0x07), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000060509, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000003, 0x0000000000000000, 0x000000000000000a, LoopVectorization.memstore, 0x03, 0x08))}, #unused#::Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000203, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B, Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000301, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C, Symbol("##vptr##_C")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000))}, #unused#::Val{(0, (), (6, 7), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, #unused#::Val{(:n, :m, :k)}, _vargs::Tuple{Tuple{LoopVectorization.CloseOpen{StaticInt{0}, Int64}, LoopVectorization.CloseOpen{StaticInt{0}, Int64}, LoopVectorization.CloseOpen{StaticInt{0}, Int64}}, Tuple{VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tuple{StaticInt{8}, Int64}, Tuple{StaticInt{0}, StaticInt{0}}}, StaticInt{1}, StaticInt{0}}})
      @ LoopVectorization ~/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:526
    [8] loopmul!
      @ ~/.julia/dev/Octavian/src/macrokernels.jl:83 [inlined]
    [9] _matmul_serial!
      @ ~/.julia/dev/Octavian/src/matmul.jl:162 [inlined]
   [10] matmul_serial(A::Matrix{TropicalF64}, B::Matrix{TropicalF64})
      @ Octavian ~/.julia/dev/Octavian/src/matmul.jl:103
   [11] top-level scope
      @ ~/jcode/lab/tropicalblas.jl:88
   [12] eval
      @ ./boot.jl:360 [inlined]
   [13] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
      @ Base ./loading.jl:1090
   [14] invokelatest(::Any, ::Any, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ Base ./essentials.jl:707
   [15] invokelatest(::Any, ::Any, ::Vararg{Any, N} where N)
      @ Base ./essentials.jl:706
   [16] inlineeval(m::Module, code::String, code_line::Int64, code_column::Int64, file::String; softscope::Bool)
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:192
   [17] (::VSCodeServer.var"#63#67"{Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:150
   [18] withpath(f::VSCodeServer.var"#63#67"{Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams}, path::String)
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/repl.jl:135
   [19] (::VSCodeServer.var"#62#66"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:148
   [20] hideprompt(f::VSCodeServer.var"#62#66"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/repl.jl:36
   [21] (::VSCodeServer.var"#61#65"{Bool, Bool, Module, String, Int64, Int64, String, VSCodeServer.ReplRunCodeRequestParams})()
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:116
   [22] with_logstate(f::Function, logstate::Any)
      @ Base.CoreLogging ./logging.jl:491
   [23] with_logger
      @ ./logging.jl:603 [inlined]
   [24] (::VSCodeServer.var"#60#64"{VSCodeServer.ReplRunCodeRequestParams})()
      @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:182
   [25] #invokelatest#2
      @ ./essentials.jl:707 [inlined]
   [26] invokelatest(::Any)
      @ Base ./essentials.jl:706
   [27] macro expansion
      @ ~/.vscode/extensions/julialang.language-julia-1.1.27/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
   [28] (::VSCodeServer.var"#58#59")()
      @ VSCodeServer ./task.jl:406
ERROR: LoadError: There was an error during testing
in expression starting at /home/leo/jcode/lab/tropicalblas.jl:88

Thanks again!

@GiggleLiu
Contributor Author

GiggleLiu commented Feb 24, 2021

I did some benchmarks and found some more issues.

The output sometimes contains NaNs, but only for large matrices. When it does not contain NaN, the result is correct. It does not behave consistently when I run the function multiple times: sometimes wrong and sometimes correct. This is so weird; what could the possible causes be?

julia> Octavian.matmul_serial(a, b)
20×20 Matrix{TropicalF64}:
                Tropical(NaN)   Tropical(4.024269734839327)    Tropical(4.26327341398722)     Tropical(4.204939545709253)  Tropical(NaN)   Tropical(4.776889810790789)
                Tropical(NaN)   Tropical(2.194497613870535)   Tropical(2.058543360062661)     Tropical(3.5946191821421216)  Tropical(NaN)  Tropical(3.0040889036978973)
 Tropical(2.9011941950744964)  Tropical(2.1627096262173797)  Tropical(2.8782146191689657)     Tropical(3.8420567150697646)  Tropical(NaN)   Tropical(2.152284655932344)
                Tropical(NaN)  Tropical(1.7681295634003424)   Tropical(2.921440672411346)     Tropical(2.2011623659906308)  Tropical(NaN)  Tropical(1.9797996146204884)
                Tropical(NaN)    Tropical(2.38970559790015)   Tropical(3.544039189063668)      Tropical(2.267337659239235)  Tropical(NaN)   Tropical(2.818109225827046)
 Tropical(2.9752555409299384)   Tropical(4.297949125446652)  Tropical(3.6685467944558394)    Tropical(4.1397369630940934)  Tropical(NaN)   Tropical(3.352553282694595)
  Tropical(2.397845997221319)  Tropical(1.8569225628500599)  Tropical(1.2275202318592475)     Tropical(2.8172671791289803)  Tropical(NaN)  Tropical(1.4759796716922038)
 Tropical(2.1965028088271117)   Tropical(2.140797817614342)  Tropical(3.1723901113113544)      Tropical(2.434120168918318)  Tropical(NaN)   Tropical(2.230749053520497)
                                                                                                                                         
                Tropical(NaN)   Tropical(2.730009619936266)   Tropical(2.553235863336632)      Tropical(3.876715789552953)  Tropical(NaN)   Tropical(3.153954560543511)
                Tropical(NaN)  Tropical(2.6111525663946153)   Tropical(2.448263484633223)     Tropical(2.2721453419157123)  Tropical(NaN)  Tropical(2.8627405231265053)
                Tropical(NaN)   Tropical(2.616236581455606)   Tropical(2.887718641691456)     Tropical(2.833083099894104)  Tropical(NaN)   Tropical(2.308935514119695)
 Tropical(2.3711691743062926)    Tropical(1.39657464107459)  Tropical(2.0919814326461355)                    Tropical(NaN)  Tropical(NaN)   Tropical(1.713597248742208)
 Tropical(1.1941119905234814)  Tropical(2.1598929755724314)  Tropical(2.8373250789251263)                    Tropical(NaN)  Tropical(NaN)  Tropical(1.8250873513607369)
 Tropical(1.8801722402476346)  Tropical(2.0994643359840515)  Tropical(1.6684035006197018)                    Tropical(NaN)  Tropical(NaN)                 Tropical(NaN)
  Tropical(1.289189787032358)  Tropical(2.9596107706634043)  Tropical(2.1714404638502094)                    Tropical(NaN)  Tropical(NaN)                 Tropical(NaN)

BTW: I see a speedup of ~15x for a 1000 x 1000 matrix, even with the above patch.

julia> a = Tropical.(randn(1000, 1000));

julia> @time Octavian.matmul_serial(a, a);
  0.358573 seconds (2 allocations: 7.629 MiB)

julia> @time a * a;
  5.164423 seconds (8 allocations: 7.630 MiB, 0.55% gc time)

This is a speed that I never imagined; LoopVectorization is a great package!

@chriselrod
Member

chriselrod commented Feb 24, 2021

In file VectorizationBase/src/strided_pointers/stridedpointers.jl, the pointer type restriction should be removed.

Feel free to make a PR.

Taking a real quick look, instead of:

using Base.Cartesian: @nexprs
@generated function VectorizationBase.fma(x::Tropical{Vec{N,T}}, y::Tropical{Vec{N,T}}, z::Tropical{Vec{N,T}}) where {N,T}
    Expr(:call, :Tropical, Expr(:call, :Vec, [:(max(content(z).data[$i].value, content(x).data[$i].value+content(y).data[$i].value)) for i=1:N]...))
end

do

@inline VectorizationBase.fma(x::Tropical{V}, y::Tropical{V}, z::Tropical{V}) where {V<:AbstractSIMD} = Tropical(max(content(z), content(x) + content(y)))

Also:

function VectorizationBase._vzero(in1::StaticInt{W}, ::Type{T}, in3::StaticInt{RS}) where {W,T<:Tropical{FT},RS} where FT
    Tropical(VectorizationBase._vzero(in1, FT, in3))
end

This should probably be

function VectorizationBase._vzero(in1::StaticInt{W}, ::Type{T}, ::StaticInt{RS}) where {W,T<:Tropical{FT},RS} where FT
    Tropical(VectorizationBase._vbroadcast(StaticInt{W}(), FT(-Inf), StaticInt{RS}()))
end

I'll hopefully look into the zeros and NaNs fairly soon. There's a chance it was because of that fma definition, since I've seen code like that cause problems before, but it's more often something wrong with memory loads/stores (though I would've hoped a problem like that would've gotten caught by Octavian's tests).
If you want to check if it's related to reading junk data outside of the array bounds, a trick I often use is to

  1. Make large arrays, filled with something obviously wrong. Maybe fill(1e6, 1000, 1000), or reshape(collect(range(1000.0, step=100,length=1000^2)), (1000,1000)).
  2. Define the arrays you're using as views of these arrays.
  3. Fill the views with sensible values, e.g.
Abig = reshape(collect(range(1e4, step=1e3,length=100^2)), (100,100)); # 100 x 100 parent
A = view(Abig, 41:60, 41:60); # A is 20 x 20
A .= rand.();

Bbig = reshape(collect(range(1e16, step=1e15,length=100^2)), (100,100)); # 100 x 100 parent, much bigger scale
B = view(Bbig, 41:60, 41:60); # B is 20 x 20
B .= rand.();

Cbig = fill(-123456.789, 100, 100);
C = view(Cbig, 41:60, 41:60);

Octavian.matmul_serial!(C, A, B)
@test C ≈ A * B # make sure answers are correct
C .= -123456.789;
@test all(==(-123456.789), Cbig) #make sure everything outside of `C` is still `-123456.789`

Now A and B have values between 0 and 1, and if I accidentally read out of bounds, I'll get something extreme (and the numbers will look different depending on A vs B). I can also make sure no writes into C were out of bounds, by confirming that Cbig still has whatever unique-looking numbers I filled it with outside of the 41:60,41:60 range of the view.

Also, as a heads up, LoopVectorization will be moving to using VecUnrolls in many places soon.
Just defining things in terms of Vecs should still work, since functions like fma(::VecUnroll,::VecUnroll,::VecUnroll) are defined in terms of repeated calls on Vecs, but just a heads up.
It's more likely to impact indexing, since it'll be using Unrolls.

If you want to document what it takes to add support for a custom type, and add some integration tests to show it, that could be a great contribution!
With the disclaimer that those types would have to map 1-1 to native numbers for it to work well, because LoopVectorization assumes that is the case in a way critical to its ability to optimize code. So this works for Tropical numbers, but it wouldn't work for (for example) Complex or ForwardDiff.Dual numbers, quaternions, or RGB colors.
I'll probably get around to making things like these work eventually using the AbstractInterpreter interface, but the "todo" list before I get there is still quite long.

Finally, I think it should be a lot faster than 0.35 seconds for a 1000x1000 matrix.
What speed do you get for regular Float64 arrays?
It varies by hardware, but Tropical{Float64} should either be about the same speed as Float64, or 2x slower. But not more than that. So I'd expect it to take around 0.04 seconds on this computer, for example:

julia> using Octavian

julia> A = rand(1000,1000);

julia> @time Octavian.matmul_serial(A,A); # time to first matmul is bad
 21.730979 seconds (22.93 M allocations: 1.511 GiB, 3.31% gc time)

julia> @time Octavian.matmul_serial(A,A);
  0.020565 seconds (2 allocations: 7.629 MiB)

julia> @time Octavian.matmul_serial(A,A);
  0.020673 seconds (2 allocations: 7.629 MiB)

julia> @time Octavian.matmul_serial(A,A);
  0.020467 seconds (2 allocations: 7.629 MiB)

If performance for Tropical{Float64} is ever much worse than 2x Float64 for something like this, that is a bug.

@GiggleLiu
Contributor Author

GiggleLiu commented Feb 24, 2021

A quick update: the resulting matrix does not contain NaN anymore after using the new definition of fma:

@inline function VectorizationBase.fma(x::Tropical{V}, y::Tropical{V}, z::Tropical{V}) where {V<:VectorizationBase.AbstractSIMD}
    Tropical(max(content(z), content(x) + content(y)))
end

But it would still be interesting to know the cause, in order to make it work in more general cases.

It is also faster with the above fixes.

julia> A = rand(1000,1000);

julia> @btime A * A;  # mkl
  34.047 ms (2 allocations: 7.63 MiB)

julia> @btime A * A;  # openblas
  155.493 ms (2 allocations: 7.63 MiB)

julia> @btime Octavian.matmul_serial(A,A);
  39.028 ms (2 allocations: 7.63 MiB)

julia> a = Tropical.(randn(1000, 1000));

julia> @btime Octavian.matmul_serial(a, a);
  138.467 ms (2 allocations: 7.63 MiB)

I think it should be faster, because the naive (reordered) for loop is only 3.3x slower than Octavian in the Tropical number case, but 4.8x slower in the Float64 case.
I am using an Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz, which is a not-so-bad desktop CPU.
Why does your CPU run so much faster?

julia> function naivemm!(o::Matrix, a::Matrix, b::Matrix)
    @assert size(a, 2) == size(b, 1) && size(o) == (size(a, 1), size(b, 2))
    for j=1:size(b, 2)
        for k=1:size(a, 2)
            for i=1:size(a, 1)
                @inbounds o[i,j] += a[i,k] * b[k,j]
            end
        end
    end
    return o
end

julia> @btime naivemm!(zero(A), A, A);
  188.712 ms (2 allocations: 7.63 MiB)

julia> @btime naivemm!(zero(a), a, a);
  461.117 ms (2 allocations: 7.63 MiB)

I did a profile

julia> Profile.print(mincount=10; format=:flat)
 Count  Overhead File                                                            Line Function
 =====  ======== ====                                                            ==== ========
   469         0 @Base/Base.jl                                                     39 eval
    15        15 @Base/bool.jl                                                     37 |
   469         0 @Base/boot.jl                                                    360 eval
   469         0 @Base/essentials.jl                                              707 #invokelatest#2
   469         0 @Base/essentials.jl                                              706 invokelatest(::Any)
   469         0 @Base/logging.jl                                                 603 with_logger
   469         0 @Base/logging.jl                                                 491 with_logstate(f::Function, logstate::Any)
    15         0 @Base/operators.jl                                               328 <=
   469         0 @Base/task.jl                                                    406 (::VSCodeServer.var"#58#59")()
   465         0 @LoopVectorization/src/reconstruct_loopset.jl                    526 _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###z...
   469         0 @LoopVectorization/src/reconstruct_loopset.jl                    526 macro expansion
   458         0 @Octavian/src/macrokernels.jl                                     83 loopmul!
   458         0 @Octavian/src/macrokernels.jl                                    248 packaloopmul!
   469         0 @Octavian/src/matmul.jl                                          165 _matmul_serial!
   469         0 @Octavian/src/matmul.jl                                          120 matmul_serial!(C::Matrix{TropicalF64}, A::Matrix{TropicalF64}, B::Matrix{TropicalF64})
   469         0 @Octavian/src/matmul.jl                                          133 matmul_serial!
    57         0 @Octavian/src/matmul.jl                                           58 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tup...
   408         0 @Octavian/src/matmul.jl                                           60 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{TropicalF64, 2, 1, 0, (1, 2), Tup...
   469         0 @Octavian/src/matmul.jl                                          178 matmul_st_pack_dispatcher!
   123         0 @VectorizationBase/src/base_defs.jl                               82 +
   209         0 @VectorizationBase/src/base_defs.jl                               82 max
   123       123 @VectorizationBase/src/llvm_intrin/binary_ops.jl                  62 macro expansion
   123         0 @VectorizationBase/src/llvm_intrin/binary_ops.jl                  62 vadd
   330         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl               269 _vfmadd_fast
   209       209 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl               192 macro expansion
   330         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl               252 vfma_fast
   330         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl               271 vfmadd_fast
   209         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl               192 vmax
    10        10 @VectorizationBase/src/llvm_intrin/memory_addr.jl                530 macro expansion
    11         0 @VectorizationBase/src/llvm_intrin/memory_addr.jl                530 vload
   102         0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl                  33 _vbroadcast
   102       102 @VectorizationBase/src/llvm_intrin/vbroadcast.jl                  65 macro expansion
   102         0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl                  69 vbroadcast
    11         0 @VectorizationBase/src/strided_pointers/stridedpointers.jl       139 vload
   469         0 @VSCodeServer/src/eval.jl                                         34 macro expansion
   469         0 @VSCodeServer/src/repl.jl                                         93 (::VSCodeServer.var"#44#46"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
   469         0 @VSCodeServer/src/repl.jl                                         92 (::VSCodeServer.var"#45#47"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
   469         0 @VSCodeServer/src/repl.jl                                        117 repleval(m::Module, code::Expr, #unused#::String)
   330         0 /home/leo/jcode/lab/tropicalblas.jl                               41 fma
   102         0 /home/leo/jcode/lab/tropicalblas.jl                               31 vbroadcast
    11         0 /home/leo/jcode/lab/tropicalblas.jl                               27 vload
Total snapshots: 938

Comparing with Float64

julia> Profile.print(mincount=10; format=:flat)
 Count  Overhead File                                                        Line Function
 =====  ======== ====                                                        ==== ========
   155         0 @Base/Base.jl                                                 39 eval
    14        14 @Base/bool.jl                                                 37 |
   155         0 @Base/boot.jl                                                360 eval
   155         0 @Base/essentials.jl                                          707 #invokelatest#2
   155         0 @Base/essentials.jl                                          706 invokelatest(::Any)
   155         0 @Base/logging.jl                                             603 with_logger
   155         0 @Base/logging.jl                                             491 with_logstate(f::Function, logstate::Any)
    14         0 @Base/operators.jl                                           328 <=
   155         0 @Base/task.jl                                                406 (::VSCodeServer.var"#58#59")()
   154         0 @LoopVectorization/src/reconstruct_loopset.jl                526 _avx_!(#unused#::Val{(0, 0, 0, 4, 32, 15, 64)}, #unused#::Val{(:numericconstant, Symbol("###z...
   155         0 @LoopVectorization/src/reconstruct_loopset.jl                526 macro expansion
   153         0 @Octavian/src/macrokernels.jl                                 83 loopmul!
   153         0 @Octavian/src/macrokernels.jl                                248 packaloopmul!
   155         0 @Octavian/src/matmul.jl                                      165 _matmul_serial!
   155         0 @Octavian/src/matmul.jl                                      120 matmul_serial!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
   155         0 @Octavian/src/matmul.jl                                      133 matmul_serial!
    36         0 @Octavian/src/matmul.jl                                       58 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{S...
   118         0 @Octavian/src/matmul.jl                                       60 matmul_st_pack_A_and_B!(C::VectorizationBase.StridedPointer{Float64, 2, 1, 0, (1, 2), Tuple{S...
   155         0 @Octavian/src/matmul.jl                                      178 matmul_st_pack_dispatcher!
    10         0 @VectorizationBase/src/base_defs.jl                           82 add_fast
    10         0 @VectorizationBase/src/base_defs.jl                          236 vfma_fast
    10        10 @VectorizationBase/src/llvm_intrin/binary_ops.jl              63 macro expansion
    10         0 @VectorizationBase/src/llvm_intrin/binary_ops.jl              63 vadd_fast
    82         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl           269 _vfmadd_fast
    72        72 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl           237 macro expansion
    72         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl           237 vfma_fast
    82         0 @VectorizationBase/src/llvm_intrin/intrin_funcs.jl           271 vfmadd_fast
    11        11 @VectorizationBase/src/llvm_intrin/memory_addr.jl            530 macro expansion
    11         0 @VectorizationBase/src/llvm_intrin/memory_addr.jl            530 vload
    28         0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl              33 _vbroadcast
    28        28 @VectorizationBase/src/llvm_intrin/vbroadcast.jl              65 macro expansion
    28         0 @VectorizationBase/src/llvm_intrin/vbroadcast.jl              69 vbroadcast
    11         0 @VectorizationBase/src/strided_pointers/stridedpointers.jl   139 vload
   155         0 @VSCodeServer/src/eval.jl                                     34 macro expansion
   155         0 @VSCodeServer/src/repl.jl                                     93 (::VSCodeServer.var"#44#46"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
   155         0 @VSCodeServer/src/repl.jl                                     92 (::VSCodeServer.var"#45#47"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
   155         0 @VSCodeServer/src/repl.jl                                    117 repleval(m::Module, code::Expr, #unused#::String)
Total snapshots: 310

It looks like the Tropical fma is much slower than the Float64 version.

EDIT: the latest benchmark of tropical GEMM has been halved. See:
TensorBFS/TropicalGEMM.jl#1 (review)

@GiggleLiu
Contributor Author

GiggleLiu commented Feb 24, 2021

If you want to document what it takes to add support for a custom type, and add some integration tests to show it, that could be a great contribution!
With the disclaimer that those types would have to map 1-1 to native numbers for it to work well, because LoopVectorization assumes that is the case in a way critical to its ability to optimize code. So this works for Tropical numbers, but it wouldn't work for (for example) Complex or ForwardDiff.Dual numbers, quaternions, or RGB colors.
I'll probably get around to making things like these work eventually using the AbstractInterpreter interface, but the "todo" list before I get there is still quite long.

Will do; I am also planning to write a blog post about it. Looking forward to the AbstractInterpreter interface: I have some more types eager to get a speedup!

@GiggleLiu
Contributor Author

Hi @chriselrod, to make the Tropical BLAS more accessible to the public, I created a repo here and have invited you as a collaborator.

https://github.com/TensorBFS/TropicalGEMM.jl

@GiggleLiu
Contributor Author

GiggleLiu commented Mar 18, 2021

Weird: in the profile, the vbroadcast for TropicalF64 takes much more time than that for Float64. It looks like vbroadcastsd does not call into AVX?

julia> @code_native vbroadcast(Val(4), Tropical(2.0))
	.text
; ┌ @ gemm.jl:28 within `vbroadcast'
	movq	%rdi, %rax
; │ @ gemm.jl:29 within `vbroadcast' @ vbroadcast.jl:77
; │┌ @ vbroadcast.jl:41 within `_vbroadcast'
; ││┌ @ vbroadcast.jl:73 within `macro expansion'
	vbroadcastsd	(%rsi), %ymm0
; │└└
; │ @ gemm.jl:29 within `vbroadcast'
	vmovaps	%ymm0, (%rdi)
	vzeroupper
	retq
; └

julia> @code_native vbroadcast(Val(4), 2.0)
	.text
; ┌ @ vbroadcast.jl:77 within `vbroadcast'
	movq	%rdi, %rax
; │┌ @ vbroadcast.jl:41 within `_vbroadcast'
; ││┌ @ vbroadcast.jl:73 within `macro expansion'
	vbroadcastsd	%xmm0, %ymm0
; │└└
	vmovaps	%ymm0, (%rdi)
	vzeroupper
	retq
; └

I checked it many times; this behavior is consistent.

@chriselrod
Member

I'll take a look and profile soon now that LoopVectorization 0.12 is out, and Octavian has been updated to use it.

Weird: in the profile, the vbroadcast for TropicalF64 takes much more time than that for Float64.

Note that it should be exactly the same in the actual function, but will be a little different when called from the REPL in this way.
That's because a Tropical(2.0) is a struct and passed to the function call in memory, so @code_native vbroadcast(Val(4), Tropical(2.0)) will load it from memory:

vbroadcastsd	(%rsi), %ymm0

while 2.0 is passed in a register, and broadcast from there:

vbroadcastsd	%xmm0, %ymm0

So it also wouldn't really be accurate to benchmark vbroadcast in the REPL this way.

That also means when called inside a function, this line:

vmovaps	%ymm0, (%rdi)

won't be there. Because Vecs are structs, it returns the value in memory.
(Note that for this reason, when LoopVectorization does pass data through function calls, it actually unwraps them to pass them in registers.)

If called inside a function, like Octavian's gemm, vbroadcast will be inlined and therefore should produce exactly the same code: broadcast from the memory of B into a register.

It looks like vbroadcastsd does not call into AVX?

Not sure what you mean here. vbroadcastsd is an assembly instruction that basically copies a value to fill the entirety of a register.

@GiggleLiu
Contributor Author

GiggleLiu commented Mar 20, 2021

Just to summarize this discussion a bit before closing it in favor of TropicalGEMM: the final performance on my host is close to the theoretical optimum (half the benchmark above: #201 (comment)).

julia> using TropicalNumbers, Octavian, TropicalGEMM, BenchmarkTools

julia> a = Tropical.(randn(1000, 1000));

julia> @benchmark Octavian.matmul_serial!($(zero(a)), $a, $a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     64.938 ms (0.00% GC)
  median time:      65.147 ms (0.00% GC)
  mean time:        65.371 ms (0.00% GC)
  maximum time:     67.139 ms (0.00% GC)
  --------------
  samples:          77
  evals/sample:     1

The theoretical optimum is 58.1 ms on my host; for more information about how @chriselrod achieved it, please see the discussion at TensorBFS/TropicalGEMM.jl#1

(I will summarize this project later in a blog post, or maybe in documentation.)

Cheers!

@chriselrod
Member

chriselrod commented Mar 20, 2021

I am using an Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz, which is a not-so-bad desktop CPU.
Why does your CPU run so much faster?

julia> using Octavian, TropicalGEMM

julia> a = Tropical.(randn(1000, 1000)); c = similar(a);

julia> @benchmark Octavian.matmul_serial!($c, $a, $a) # single thread
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     31.516 ms (0.00% GC)
  median time:      31.618 ms (0.00% GC)
  mean time:        31.619 ms (0.00% GC)
  maximum time:     31.972 ms (0.00% GC)
  --------------
  samples:          159
  evals/sample:     1

julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.888 ms (0.00% GC)
  median time:      1.965 ms (0.00% GC)
  mean time:        1.982 ms (0.00% GC)
  maximum time:     2.280 ms (0.00% GC)
  --------------
  samples:          2522
  evals/sample:     1

julia> versioninfo()
Julia Version 1.7.0-DEV.707
Commit d1d0646390* (2021-03-14 18:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

I have a few different CPUs I test on; those numbers were from the i9-10980XE, which has full-speed AVX512 support.
That means peak flops (operations/nanosecond) are:
clock speed (clocks/nanosecond) * arithmetic instructions/clock * operations/instruction
This computer runs AVX512 at 4.1 GHz (4.1 cycles/nanosecond) regardless of the number of active cores (set in the BIOS).
It does 2 arithmetic floating point instructions/clock.

For Float64, each of these instructions can do 16 operations: 8 multiplications and 8 additions.
For Tropical{Float64}, it can do 8 operations: either 8 additions or 8 max.

Hence Float64 is about twice as fast as Tropical{Float64}.
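
Plugging in those numbers gives the per-core peaks (a back-of-the-envelope check, not a measurement):

ghz = 4.1           # AVX512 clock, cycles per nanosecond
instr_per_clk = 2   # arithmetic instructions per cycle (ports 0 and 5)
ops_float64 = 16    # one AVX512 fma on Float64: 8 multiplies + 8 additions
ops_tropical = 8    # one AVX512 add or max on Tropical{Float64}: 8 operations

ghz * instr_per_clk * ops_float64   # 131.2 peak Float64 operations/ns per core
ghz * instr_per_clk * ops_tropical  # 65.6 peak Tropical operations/ns per core
2e9 / 65.6e9                        # ≈ 0.030 s for a 1000^3 tropical matmul (2e9 ops), close to the 31.5 ms measured above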

The primary difference between this CPU and the i5-10400 is that the i5-10400 can do:

For Float64, each of these instructions can do 8 operations: 4 multiplications and 4 additions.
For Tropical{Float64}, it can do 4 operations: either 4 additions or 4 maxes.

Clock speeds are probably similar, i.e. if you didn't change any settings in the BIOS, your CPU will boost much higher than 2.9 GHz. Thus this 16/8 vs 8/4 operations per instruction is why there's roughly a 2x difference in per-core performance.

Also, while the new Rocket Lake CPUs have AVX512, it'll be half-rate (meaning instructions/cycle = 1, not 2) for these core arithmetic instructions, just like it is on my laptop with a Tiger Lake CPU:

julia> using Octavian, TropicalGEMM

julia> a = Tropical.(randn(1000, 1000)); c = similar(a);

julia> @benchmark Octavian.matmul_serial!($c, $a, $a) # single thread
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     61.200 ms (0.00% GC)
  median time:      62.718 ms (0.00% GC)
  mean time:        62.722 ms (0.00% GC)
  maximum time:     63.368 ms (0.00% GC)
  --------------
  samples:          80
  evals/sample:     1

julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.217 ms (0.00% GC)
  median time:      28.867 ms (0.00% GC)
  mean time:        28.826 ms (0.00% GC)
  maximum time:     57.157 ms (0.00% GC)
  --------------
  samples:          174
  evals/sample:     1

julia> versioninfo()
Julia Version 1.7.0-DEV.713
Commit ae26fc6d5f* (2021-03-16 06:50 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

This CPU does the same number of operations/instruction as my 10980XE, but half the instructions/cycle.
So the formula would be clock speed * 1 * ops/instruction instead of ...* 2 *... like on the i5 10400 or i9 10980XE.

While I haven't tested an M1 Mac, I believe they will have 4 instructions/cycle, but with only 2 operations/instruction, meaning the same speed per clock cycle as the i5-10400 or the i7-1165G7, but half the 10980XE (for these sorts of linear algebra operations).

@GiggleLiu
Contributor Author

GiggleLiu commented Mar 20, 2021

julia> @benchmark Octavian.matmul!($c, $a, $a) # multiple threads
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.888 ms (0.00% GC)
  median time:      1.965 ms (0.00% GC)
  mean time:        1.982 ms (0.00% GC)
  maximum time:     2.280 ms (0.00% GC)
  --------------
  samples:          2522
  evals/sample:     1

Well, you win over the Titan V GPU:

julia> using TropicalNumbers, CUDA, LinearAlgebra, BenchmarkTools

julia> a = Tropical.(CUDA.randn(1000, 1000));

julia> @benchmark CUDA.@sync LinearAlgebra.mul!($(zero(a)), $a, $a)
BenchmarkTools.Trial: 
  memory estimate:  2.19 KiB
  allocs estimate:  68
  --------------
  minimum time:     2.045 ms (0.00% GC)
  median time:      2.200 ms (0.00% GC)
  mean time:        2.385 ms (0.00% GC)
  maximum time:     6.760 ms (0.00% GC)
  --------------
  samples:          2095
  evals/sample:     1

Thank you for your very detailed explanation; I do not envy your machines at all. Actually, I have two questions:

  1. In your calculation, do you use the turbo frequency or the base frequency shown in the CPU model name? E.g. for my i5 CPU, the turbo frequency is 4.3 GHz, but the frequency in the model name is 2.9 GHz.

  2. I googled around, but cannot find any information about

It does 2 arithmetic floating point instructions/clock.

@chriselrod
Member

chriselrod commented Mar 20, 2021

You could buy three 10980XEs for the price of one Titan V. ;)

Although the Titan V will probably beat a single 10980XE for large enough arrays (e.g. 10_000 x 10_000), because it has 7.45 peak theoretical TFLOPS for Float64, while my 10980XE has around 2.3 TFLOPS Float64.
The 10980XE's TFLOPS for Tropical{Float64} is half that, and I assume the Titan V's TFLOPS for Tropical{Float64} are also half, i.e. around 3.725, but I'm not sure.

In your calculation, do you use the turbo frequency or the base frequency shown in the CPU model name? E.g. for my i5 CPU, the turbo frequency is 4.3 GHz, but the frequency in the model name is 2.9 GHz.

Ideally, I'd use whatever it actually runs at.

julia> using LinuxPerf  # needs the master branch: ] add LinuxPerf#master

julia> foreachf(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf (generic function with 1 method)

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf(Octavian.matmul_serial!, 30, c, a, a)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.95e+09   49.9%  #  4.1 cycles per ns
┌ instructions             1.02e+10   75.0%  #  2.6 insns per cycle
│ branch-instructions      1.41e+08   75.0%  #  1.4% of instructions
└ branch-misses            2.00e+05   75.0%  #  0.1% of branch instructions
┌ task-clock               9.71e+08  100.0%  # 970.5 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.95e+08   25.0%  # 28.3% of dcache loads
│ L1-dcache-loads          2.10e+09   25.0%
└ L1-icache-load-misses    7.57e+04   25.0%
┌ dTLB-load-misses         1.15e+04   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.10e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

foreachf lets me run a function many times. Above, I ran it 30 times to get enough samples and let the CPU ramp up.

Note that it says 4.1 cycles per ns; cycles per ns is the same as billions of cycles per second, i.e. 4.1 GHz.

For comparison, this is my laptop:

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
        foreachf!(Octavian.matmul_serial!, 30, c, a, a)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               7.68e+09   50.0%  #  4.2 cycles per ns
┌ instructions             1.02e+10   75.1%  #  1.3 insns per cycle
│ branch-instructions      1.41e+08   75.1%  #  1.4% of instructions
└ branch-misses            1.86e+05   75.1%  #  0.1% of branch instructions
┌ task-clock               1.81e+09  100.0%  #  1.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.89e+08   75.1%  # 28.1% of dcache loads
│ L1-dcache-loads          2.10e+09   75.1%
└ L1-icache-load-misses    3.92e+04   75.1%
┌ dTLB-load-misses         4.61e+05   24.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.10e+09   24.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

It averaged at 4.2 GHz, and performed the same number of total instructions (1.02e10), but at half the rate: just 1.3 instructions/clock instead of 2.6.

I googled around, but cannot find any information about

uops.info is a great resource.
For example, for vmaxpd zmm zmm zmm it says:

Ice Lake

    Measurements
        Latencies
            Latency operand 2 → 1: 4
            Latency operand 3 → 1: 4
        Throughput
            Computed from the port usage: 1.00
            Measured (loop): 1.00
            Measured (unrolled): 1.00
        Number of μops
            Executed: 1
            Retire slots: 1
            Decoded (MITE): 1
            Microcode Sequencer (MS): 0
        Port usage: 1*p0
    Documentation
        Latency: 4.0
        Throughput: 1.0

Cascade Lake

    Measurements
        Latencies
            Latency operand 2 → 1: 4
            Latency operand 3 → 1: 4
        Throughput
            Computed from the port usage: 0.50
            Measured (loop): 0.50
            Measured (unrolled): 0.50
        Number of μops
            Executed: 1
            Retire slots: 1
            Decoded (MITE): 1
            Microcode Sequencer (MS): 0
        Port usage: 1*p05

"Throughput" actually means reciprocal throughput, i.e. if you're doing a lot of them, how many clock cycles does it take on average per instruction. So "0.5" means that every 10 clock cycles, it does 10/0.5 = 20 of them. You can think of it as an average time per instruction.
"Latencies" is how long it takes from start to finish. 4 means that if it starts on clock cycle 1, the answer won't be usable for a dependent operation until clock cycle 5. Latencies are really important for reductions, like a sum or dot product. If you do them the naive way, s += a[i], then you have a long chain of dependent operations, where each one has to wait the full latency before starting. Unrolling, so you break up the chain, solves the problem (LLVM and LoopVectorization both do this for you).

But for matrix multiplication, it's just the throughput or average time per instruction that matters.
And there, the Cascadelake (like the 10980XE) is twice as good as Ice Lake: 0.5 vs 1.

Tiger Lake (my laptop) and Rocket Lake are both architecturally very similar to Ice Lake, and should have essentially all the same numbers (main differences are that Rocket Lake is backported to 14 nm, while Tiger Lake features upgraded caches).

The reason why Cascadelake is twice as fast is given by the "Port usage" line:

Ice Lake
        Port usage: 1*p0

Cascade Lake
        Port usage: 1*p05

The 1* means that it requires 1, while p0 means it requires port 0, and p05 means it requires either port 0 or port 5.
So basically, Cascadelake is twice as fast in terms of throughput because it has twice as many ports it can use. It's the same in terms of latency because the ports aren't faster; its cores are just more parallel.

You can look up other instructions like vaddpd zmm zmm zmm or vfmadd231pd zmm zmm zmm, and they look the same.

For your i5 10400, I'd look up ymm versions for comparison. Its architecture is Skylake (not Skylake-X, which is almost identical to Cascadelake).

@GiggleLiu
Contributor Author

GiggleLiu commented Mar 21, 2021

WOW, I totally underestimated how hard it is to measure a CPU's computing power; I learned a lot from you. I also notice your answer is much clearer than those on Stack Overflow. In case you are interested in reposting your answer there, here is the link to the question:
https://stackoverflow.com/questions/6289745/how-to-compute-the-theoretical-peak-performance-of-cpu
