Languages without any tensor libraries #1
Comments
Ah, one thing that I missed: is the data always contiguous? If not, do you have a maximum number of dimensions? As shown by the research on TRIOT (Template Recursive Iterations Over Tensors), having a maximum number of dimensions allows compile-time unrolling of iterations. More pragmatically, in Arraymancer I chose to support tensors up to rank 7; having a limit lets me carry tensor metadata on the stack instead of stressing the GC.
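The shape/strides/offset layout behind that question can be sketched in a few lines. This is an illustrative Python sketch (not Arraymancer's actual code): with a fixed maximum rank, the metadata is a small fixed-size record that can live on the stack, and the same index formula handles non-contiguous data.

```python
MAX_RANK = 7  # a fixed upper bound, as Arraymancer uses for tensor rank

def flat_index(shape, strides, offset, idx):
    # Map a multi-dimensional index to a position in flat storage.
    # Non-contiguous strides work with the same formula, which is why
    # strided tensors need no data copy for views or broadcasts.
    assert len(shape) <= MAX_RANK
    pos = offset
    for d, i in enumerate(idx):
        assert 0 <= i < shape[d]
        pos += i * strides[d]
    return pos

# shape [1, 2, 3] with row-major strides [6, 3, 1]:
# element (0, 1, 2) lives at flat position 0 + 0*6 + 1*3 + 2*1 = 5
flat_index([1, 2, 3], [6, 3, 1], 0, (0, 1, 2))
```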
Thanks for your submission :)
No! Just the core functionality of the broadcast itself! So if that tensor library already offers something broadcast-like, you shouldn't use that for the core loop ;) You should also show that you can construct the lazy representation of the broadcast as a normal user type of the language you target, and that it is fully transparent to the implementor of
@SimonDanisch I feel like there must be a better way to phrase the "No Libraries" rule. The point was to stop anyone from just having the whole implementation of broadcast in the library; maybe removing that rule entirely would lead to less confusion. @mratsim I'm excited that you're submitting. Julia should specialize on Integer multiplication.
I did remove it :)
@oxinabox you're welcome to ping me in JuliaLang. The secret sauce is basically recreating a generic BLAS that can be used for integers (or floats if a BLAS library is missing) instead of doing naive triple for-loops.
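The main idea separating a generic BLAS-like kernel from naive triple for-loops is cache blocking: operating on small tiles so the working set stays in cache. A minimal Python sketch of the technique (real kernels add packing, register blocking, and SIMD on top):

```python
def matmul_blocked(A, B, tile=32):
    # Cache-blocked matrix multiplication over tile x tile panels.
    # Works for integers too, since it never touches a float BLAS.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for pp in range(0, k, tile):
            for jj in range(0, m, tile):
                # Accumulate the partial product of one tile triple.
                for i in range(ii, min(ii + tile, n)):
                    for p in range(pp, min(pp + tile, k)):
                        aip = A[i][p]
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += aip * B[p][j]
    return C
```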
Julia can use SIMD acceleration for custom structs, if it figures out how to do it ;) So of course not for all types, but I found it to work well for the cases I care about! So to truly impress, user-defined types should get SIMD acceleration in your implementation! This allows Julia developers to create packages like ForwardDiff, which creates a custom Dual number type that still profits from SIMD acceleration :)
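To illustrate why a Dual number type can still vectorize: all its operations reduce to plain arithmetic on a pair of scalars, so a compiler can treat an array of Duals like an array of scalar pairs. A minimal forward-mode sketch in Python (not ForwardDiff's actual implementation):

```python
class Dual:
    # Forward-mode dual number: a value and its derivative carried together.
    # Every operation is plain arithmetic on two floats, which is what lets
    # compilers SIMD-vectorize loops over arrays of such structs.
    __slots__ = ("val", "eps")

    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, other):
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

# d/dx (x*x + x) at x = 3: seed the derivative slot with 1.0
x = Dual(3.0, 1.0)
y = x * x + x  # y.val = 12.0, y.eps = 7.0
```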
Let me know if you have any questions :) Btw, the challenge was born when somebody told me that they could easily implement Julia's broadcast in some other language, and I realized that I needed a minimal, self-contained implementation to actually set a reference and be able to judge whether that statement was true! So if you can boil down the current broadcast implementation that you already have in a couple of hours, and end up with a self-contained, performant toy version that fulfills the requirements, then you're pretty close to winning the challenge :)
A completely different challenge for the Julia implementation would be to get it as fast as your current Nim implementation, disregarding implementation complexity and time to solution ;) Would you mind providing a minimal Nim broadcast example with included benchmarking code that I can run, maybe with a link to some basic install + setup steps? :)
I've added a Nim solution in #3 which allows the following syntax:

```nim
when isMainModule:
  import math

  let x = randomTensor([1, 2, 3], 10)
  let y = randomTensor([5, 2], 10)
  echo x # (shape: [1, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1, 10, 5, 5, 7, 3]))
  echo y # (shape: [5, 2], strides: [2, 1], offset: 0, storage: (data: @[8, 3, 7, 9, 3, 8, 5, 3, 7, 1]))

  block: # Simple assignation
    let a = broadcast(x, y):
      x * y
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[8, 80, 40, 15, 21, 9, 7, 70, 35, 45, 63, 27, 3, 30, 15, 40, 56, 24, 5, 50, 25, 15, 21, 9, 7, 70, 35, 5, 7, 3]))

  block: # Complex multi-statement with type conversion
    let a = broadcast(x, y):
      let c = cos x.float64
      let s = sin y.float64
      sqrt(c.pow(2) + s.pow(2))
    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1.12727828058919, 1.297255090978019, 1.029220081237957, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.5003057878335346, 0.859191628789455, 1.072346394223034, 0.5584276483137685, 0.8508559734652587, 0.3168265963213802, 1.029220081237957, 1.243864280886628, 1.399612404734566, 1.100664502137075, 1.274196529364651, 1.0, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.8879964266455946, 1.129797339073468, 1.299291561428286]))

  block: # Variadic number of types with proc declaration inside
    var u, v, w, x, y, z = randomTensor([3, 3], 10)
    let c = 2
    let a = broadcast(u, v, w, x, y, z):
      # ((u * v * w) div c) mod (if not zero (x - y + z) else 42)
      proc ifNotZero(val, default: int): int =
        if val == 0: default
        else: val
      let uvw_divc = u * v * w div c
      let xmypz = x - y + z
      uvw_divc mod ifNotZero(xmypz, 42)
    echo a # (shape: [3, 3], strides: [3, 1], offset: 0, storage: (data: @[0, 0, 0, 7, 4, 0, 0, 2, 0]))
```

Instead of reusing Arraymancer I just reused what is in the Nim standard library. The main differences with Arraymancer's map/broadcast/iteration:
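The shapes in the example above follow the broadcast shape rule: dimensions align from the left (Julia-style), and a missing or size-1 dimension stretches to match the other operand, which is how `[1, 2, 3]` and `[5, 2]` combine into `[5, 2, 3]`. A hedged Python sketch of that rule (illustrative, not the submission's code):

```python
def broadcast_shape(*shapes):
    # Julia-style broadcasting: align dimensions from the left; a missing
    # or size-1 dimension stretches to the other operand's size.
    ndim = max(len(s) for s in shapes)
    out = []
    for d in range(ndim):
        sizes = {s[d] for s in shapes if d < len(s) and s[d] != 1}
        if len(sizes) > 1:
            raise ValueError(f"incompatible sizes in dimension {d}: {sizes}")
        out.append(sizes.pop() if sizes else 1)
    return tuple(out)

# the shapes from the example: [1, 2, 3] broadcast with [5, 2] -> [5, 2, 3]
broadcast_shape((1, 2, 3), (5, 2))
```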
Awesome!! Thanks a lot :) Excited to try this out!

```julia
using BenchmarkTools
a = rand(1000, 1000); # 1000x1000 Float64 matrix
b = rand(1000);       # 1000-element Float64 vector
c = 1.0               # Float64 scalar
out = similar(a);     # uninitialized array with the same shape and type as a
br = @broadcast a + b - sin(c) # construct the lazy representation of our call tree
@btime materialize!($out, $br) # materialize the broadcast and benchmark it with some statistical rigour
```
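The pattern in the snippet above, building a lazy call tree first and materializing it element-wise later, can be sketched in Python (hypothetical `Broadcasted`/`materialize_into` names; this is a toy model of the idea, not Julia's implementation):

```python
import math
import operator

class Broadcasted:
    """Lazy node: holds a function and its (possibly nested) arguments."""
    def __init__(self, f, *args):
        self.f, self.args = f, args

    def eval_at(self, i):
        # Recursively evaluate the call tree at flat index i;
        # lists are indexed, plain numbers broadcast as scalars.
        vals = [a.eval_at(i) if isinstance(a, Broadcasted)
                else (a[i] if isinstance(a, list) else a)
                for a in self.args]
        return self.f(*vals)

def materialize_into(out, br):
    # Fill a preallocated output by evaluating the tree element-wise:
    # one fused loop, no intermediate arrays allocated.
    for i in range(len(out)):
        out[i] = br.eval_at(i)
    return out

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = 1.0
# a + b - sin(c), built lazily and then materialized in a single pass
br = Broadcasted(operator.sub, Broadcasted(operator.add, a, b), math.sin(c))
out = materialize_into([0.0] * 3, br)
```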
I think this can be closed
Hello Simon,
I'm the author of Arraymancer, a tensor library written from scratch in the Nim programming language, with a Numpy + PyTorch-like syntax and a deep learning focus.
How would your challenge work for languages without built-in N-dimensional array libraries? Do we need to rebuild a tensor library from scratch, or can we map onto the basic sequence/vector type?
On my from-scratch tensor lib
I can fuse
A * X + B
into a single GEMM call. Basically the only thing missing is support for a variadic number of arguments (I only hardcoded 2 and 3).
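The fusion works because GEMM computes C ← αAB + βC: seeding C with B and using α = β = 1 evaluates the whole expression A * X + B in one pass. A minimal Python sketch, with naive loops standing in for a real BLAS kernel:

```python
def gemm(alpha, A, B, beta, C):
    # Naive GEMM kernel: C <- alpha*A@B + beta*C. A real BLAS
    # implementation would block, pack, and vectorize this.
    n, k, m = len(A), len(A[0]), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

# A*X + B as a single GEMM call: seed the output with B, alpha = beta = 1
A = [[1, 2], [3, 4]]
X = [[1, 0], [0, 1]]  # identity, so A@X == A
B = [[5, 5], [5, 5]]
C = [row[:] for row in B]  # copy B so the input is not overwritten
gemm(1, A, X, 1, C)
```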
Alternatively on basic seq types
I'm also the author of the loop-fusion library, which allows the following with a variadic number of arguments:
Loopfusion is a very short lib (300 lines, and that can be brought down a lot).
I could easily rebuild a shorter one with OpenMP parallel support, just for map (the current one supports reduce and filter as well), in a couple dozen lines.
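The core of a variadic fused map fits in a few lines; a Python sketch of the idea (hypothetical `fused_map` name, standing in for the Nim macro):

```python
def fused_map(f, *seqs):
    # Apply f element-wise across any number of equal-length sequences
    # in a single loop, allocating no intermediate sequences between ops.
    n = len(seqs[0])
    assert all(len(s) == n for s in seqs), "all inputs must share a length"
    return [f(*vals) for vals in zip(*seqs)]

# e.g. three inputs combined in one fused pass
fused_map(lambda u, v, w: u * v + w, [1, 2], [3, 4], [10, 20])
```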
Types
Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.
Furthermore, Julia doesn't seem to optimize for integer types; for example, here are benchmarks I did on integer matrix multiplication.
Archlinux, E3-1230v5 (Skylake quad-core 3.4 GHz, turbo 3.8)
MacOS + i5-5257U (Broadwell dual-core mobile 2.7GHz, turbo 3.1)
As you can see, Julia is 10x slower than Nim on my quad-core machine (and the gap grows with matrix size, as the work scales by n^3).