Languages without any tensor libraries #1

Closed
mratsim opened this issue Sep 13, 2018 · 11 comments

@mratsim
Contributor

mratsim commented Sep 13, 2018

Hello Simon,

I'm the author of Arraymancer, a tensor library with a deep learning focus, written from scratch in the Nim programming language with a Numpy + PyTorch-like syntax.

How would your challenge work for languages without built-in N-dimensional array libraries? Do we need to rebuild a tensor library from scratch? Or can we map onto the basic sequence/vector type?


On my from-scratch tensor lib

Basically the only thing missing is support for a variadic number of arguments (I only hardcoded 2 and 3 inputs).

Alternatively on basic seq types

I'm also the author of the loop-fusion library, which allows the following with a variadic number of arguments:

import loopfusion

block: # Simple
  let a = @[1, 2, 3]
  let b = @[11, 12, 13]
  let c = @[10, 10, 10]

  forZip x in a, y in b, z in c:
    echo (x + y) * z

  # 120
  # 140
  # 160

block: # With index
  let a = @[1, 2, 3]
  let b = @[11, 12, 13]
  let c = @[10, 10, 10]
  var d: seq[int] = @[]

  forZip i, x in a, y in b, z in c:
    d.add i + x + y + z

  doAssert d == @[22, 25, 28]

block: # With mutation
  var a = @[1, 2, 3]
  let b = @[11, 12, 13]
  let c = @[10, 10, 10]

  forZip x in var a, y in b, z in c:
    x += y * z

  doAssert a == @[111, 122, 133]

block: # With mutation, index and multiple statements
  var a = @[1, 2, 3]
  let b = @[11, 12, 13]
  let c = @[10, 10, 10]

  forZip i, x in var a, y in b, z in c:
    let tmp = i * (y - z)
    x += tmp

  doAssert a == @[1, 4, 9]

block: # With iteration on seq of different types
  let a = @[1, 2, 3]
  let b = @[false, true, true]

  forZip integer in a, boolean in b:
    if boolean:
      echo integer

block: # With an expression
  let a = @[1, 2, 3]
  let b = @[4, 5, 6]


  let c = forZip(x in a, y in b):
    x + y

  doAssert c == @[5, 7, 9]


block: # With arrays + seq, mutation, index and multiple statements
  var a = [1, 2, 3]
  let b = [11, 12, 13]
  let c = @[10, 10, 10]

  forZip i, x in var a, y in b, z in c:
    let tmp = i * (y - z)
    x += tmp

  doAssert a == [1, 4, 9]

Loopfusion is a very short lib (300 lines, and that could be brought down a lot).

I could easily rebuild a shorter one, with OpenMP parallel support just for map, in a couple dozen lines (the current lib supports reduce and filter as well).
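For illustration, a minimal sketch of what such an OpenMP-backed elementwise map could look like (the proc name and compile flags are mine, not loopfusion's):

proc addZip(a, b: seq[int]): seq[int] =
  ## Elementwise a[i] + b[i]. The index range uses Nim's OpenMP `||` iterator,
  ## so the loop becomes an `omp parallel for` when compiled with e.g.
  ## `--passC:-fopenmp --passL:-fopenmp`; otherwise it is a plain serial loop.
  assert a.len == b.len
  result = newSeq[int](a.len)
  for i in 0 || (a.len - 1):
    result[i] = a[i] + b[i]

when isMainModule:
  doAssert addZip(@[1, 2, 3], @[11, 12, 13]) == @[12, 14, 16]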

Types

Can you confirm the types that are used in the challenge? Julia is dynamic, but integer vs float32 vs float64 vs Complex has huge implications for SIMD.

Furthermore, Julia doesn't seem to optimize for integer types; for example, here are benchmarks I did on integer matrix multiplication.

Archlinux, E3-1230v5 (Skylake quad-core 3.4 GHz, turbo 3.8 GHz)

Input: 1500x1500 random large int64 matrix
Arraymancer 0.2.90 (master branch 2017-10-10)

Language                                         Speed   Memory
Nim 0.17.3 (devel) + OpenMP                      0.36s   55.5 MB
Julia v0.6.0                                     3.11s   207.6 MB
Python 3.6.2 + Numpy 1.12 compiled from source   8.03s   58.9 MB

macOS, i5-5257U (Broadwell dual-core mobile 2.7 GHz, turbo 3.1 GHz)

Input: 1500x1500 random large int64 matrix
Arraymancer 0.2.90 (master branch 2017-10-31)

Without OpenMP: nim c -d:native -d:release --out:build/integer_matmul --nimcache:./nimcache benchmarks/integer_matmul.nim
With OpenMP:    nim c -d:openmp --cc:gcc --gcc.exe:"/usr/local/bin/gcc-6" --gcc.linkerexe:"/usr/local/bin/gcc-6" -d:native -d:release --out:build/integer_matmul --nimcache:./nimcache benchmarks/integer_matmul.nim

Language                                         Speed   Memory
Nim 0.18.0 (devel) - GCC 6 + OpenMP              0.95s   71.9 MB
Nim 0.18.0 (devel) - Apple Clang 9 - no OpenMP   1.73s   71.7 MB
Julia v0.6.0                                     4.49s   185.2 MB
Python 3.5.2 + Numpy 1.12                        9.49s   55.8 MB

As you can see, Julia is about 10x slower than Nim on my quad-core machine (and runtime grows as n^3 with matrix size).

@mratsim
Contributor Author

mratsim commented Sep 13, 2018

Ah one thing that I missed:

Is the data always contiguous?

If not, do you have a maximum number of dimensions?

As shown by the research on TRIOT (Template Recursive Iterations over Tensors), having a maximum number of dimensions allows compile-time unrolling of the iteration.

More pragmatically, in Arraymancer I chose to support tensors up to rank 7; having a limit lets me carry tensor metadata on the stack instead of stressing the GC.
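For concreteness, a minimal sketch of that idea (illustrative only, not Arraymancer's actual definitions): with a compile-time maximum rank, shape and strides fit in fixed-size arrays, so the metadata lives on the stack rather than the GC heap.

const MAXRANK = 7

type
  Metadata = object
    rank: int
    shape: array[MAXRANK, int]
    strides: array[MAXRANK, int]

proc rowMajorStrides(shape: openArray[int]): Metadata =
  ## Build contiguous (C-order) strides for a given shape.
  assert shape.len <= MAXRANK
  result.rank = shape.len
  var acc = 1
  for i in countdown(shape.len - 1, 0):
    result.shape[i] = shape[i]
    result.strides[i] = acc
    acc *= shape[i]

when isMainModule:
  let m = rowMajorStrides([5, 2, 3])
  doAssert m.strides[0 ..< m.rank] == @[6, 3, 1]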

@SimonDanisch
Owner

Thanks for your submission :)
I'll need to take a better look at this, but let me quickly answer the simple questions:

Do we need to rebuild a tensor library from scratch?

No! Just the core functionality of the broadcast itself! So if that tensor library already offers something broadcast-like, you shouldn't use it for the core loop ;) You should also show that you can construct the lazy representation of the broadcast as a normal user type of the language you target, and that it is fully transparent to the implementor of materialize!.
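As a rough illustration of "a normal user type plus materialize" (a minimal sketch in Nim with made-up names, not the challenge's required API, and without actual shape broadcasting):

type LazyAdd[T] = object
  ## A tiny call tree with two leaves, representing `a + b` without evaluating it.
  a, b: seq[T]

proc lazyAdd[T](a, b: seq[T]): LazyAdd[T] =
  LazyAdd[T](a: a, b: b)

proc materialize[T](dest: var seq[T], e: LazyAdd[T]) =
  ## The core loop: evaluate the lazy expression elementwise into `dest`.
  for i in 0 ..< dest.len:
    dest[i] = e.a[i] + e.b[i]

when isMainModule:
  var output = newSeq[float64](3)
  let br = lazyAdd(@[1.0, 2.0, 3.0], @[10.0, 20.0, 30.0])
  materialize(output, br)
  doAssert output == @[11.0, 22.0, 33.0]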

@oxinabox

oxinabox commented Sep 13, 2018

@SimonDanisch I feel like there must be a better way to phrase the "No Libraries" rule, since it keeps confusing people. Possibly just remove it.

The point was to stop anyone from just having the whole implementation of broadcast in a library, right? So as to prevent anyone from submitting: from GiantUglyLibrary import broadcast.

I feel like just removing that rule entirely might lead to less confusion. And if anyone makes a PR that does that, then you can just comment on the PR.


@mratsim I am excited that you are submitting. It sounds like you have just the expertise to show how this is done.

Julia should specialize on integer multiplication (well, in the compilation sense at least, since Julia specializes on everything). It may be that we are missing a good dispatch opportunity there. Do you mind if someone pings you in the JuliaLang repo?

@SimonDanisch
Owner

I feel like maybe just removing that rule entirely, might lead to less confusion.

I did remove it :)

@mratsim
Contributor Author

mratsim commented Sep 13, 2018

@oxinabox you're welcome to ping me in JuliaLang

The secret sauce is basically recreating a generic BLAS that can be used for integers (or for floats if a BLAS library is missing), instead of doing naive triple for-loops.
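For reference, the kind of naive triple for-loop being replaced looks like this (a minimal sketch with row-major n x n matrices flattened into seqs; not Arraymancer's code):

proc naiveMatmul(a, b: seq[int], n: int): seq[int] =
  ## C[i, j] = sum over k of A[i, k] * B[k, j], for n x n matrices stored row-major.
  result = newSeq[int](n * n)
  for i in 0 ..< n:
    for j in 0 ..< n:
      var acc = 0
      for k in 0 ..< n:
        acc += a[i*n + k] * b[k*n + j]
      result[i*n + j] = acc

A BLAS-style kernel additionally packs panels of the inputs into cache-friendly buffers, tiles for registers, and threads the outer loops, which is what "generic BLAS" refers to here.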

@SimonDanisch
Owner

Can you confirm the types that are used in the challenge? Julia is dynamic but integer vs float32 vs float64 vs Complex has a huge implication in SIMD.

Julia can use SIMD acceleration for custom structs, if it figures out how to do it ;) So of course not for all types, but I found it to work well for the cases I care about! So to truly impress, user-defined types should get SIMD acceleration in your implementation! This allows Julia developers to create packages like ForwardDiff, which creates a custom Dual number type that still profits from SIMD acceleration :)
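On the Nim side, the analogous check is whether a plain user-defined struct still auto-vectorizes through the backend C compiler. A minimal sketch with a hypothetical dual-number type (illustrative; not ForwardDiff's Dual and not Arraymancer code):

type Dual = object
  ## A hypothetical dual number: value plus derivative.
  val, der: float64

func `+`(a, b: Dual): Dual {.inline.} =
  Dual(val: a.val + b.val, der: a.der + b.der)

proc addInPlace(dst: var openArray[Dual], src: openArray[Dual]) =
  ## Elementwise dst[i] = dst[i] + src[i]; plain float math that the C compiler can vectorize.
  assert dst.len == src.len
  for i in 0 ..< dst.len:
    dst[i] = dst[i] + src[i]

when isMainModule:
  var xs = @[Dual(val: 1.0, der: 1.0), Dual(val: 2.0, der: 0.0)]
  let ys = @[Dual(val: 10.0, der: 0.5), Dual(val: 20.0, der: 0.5)]
  addInPlace(xs, ys)
  doAssert xs[0].val == 11.0 and xs[0].der == 1.5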

@SimonDanisch
Owner

Let me know if you have any questions :)

Btw, the challenge was born when somebody told me that they could easily implement Julia's broadcast in some other language, and I realized that I needed a minimal, self-contained implementation to actually set a reference and be able to judge whether that statement was true!

So if you can boil down the broadcast implementation that you already have in a couple of hours, and end up with a self-contained, performant toy version that fulfills the requirements, then you're pretty close to winning the challenge :)

@SimonDanisch
Owner

A completely different challenge for the Julia implementation would be to get it as fast as your current Nim implementation, disregarding implementation complexity and time to solution ;)

Would you mind providing a minimal Nim broadcast example with benchmarking code included that I can run, maybe with a link to some basic install + setup steps? :)

@mratsim
Contributor Author

mratsim commented Sep 14, 2018

I've added a Nim solution in #3, which allows the following syntax:

when isMainModule:
  import math

  let x = randomTensor([1, 2, 3], 10)
  let y = randomTensor([5, 2], 10)

  echo x # (shape: [1, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1, 10, 5, 5, 7, 3]))
  echo y # (shape: [5, 2], strides: [2, 1], offset: 0, storage: (data: @[8, 3, 7, 9, 3, 8, 5, 3, 7, 1]))

  block: # Simple assignment
    let a = broadcast(x, y):
      x * y

    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[8, 80, 40, 15, 21, 9, 7, 70, 35, 45, 63, 27, 3, 30, 15, 40, 56, 24, 5, 50, 25, 15, 21, 9, 7, 70, 35, 5, 7, 3]))

  block: # Complex multi statement with type conversion
    let a = broadcast(x, y):
      let c = cos x.float64
      let s = sin y.float64

      sqrt(c.pow(2) + s.pow(2))

    echo a # (shape: [5, 2, 3], strides: [6, 3, 1], offset: 0, storage: (data: @[1.12727828058919, 1.297255090978019, 1.029220081237957, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.5003057878335346, 0.859191628789455, 1.072346394223034, 0.5584276483137685, 0.8508559734652587, 0.3168265963213802, 1.029220081237957, 1.243864280886628, 1.399612404734566, 1.100664502137075, 1.274196529364651, 1.0, 0.3168265963213802, 0.7669963922853442, 0.9999999999999999, 0.8506221091780486, 1.065679324094626, 0.7156085706291233, 0.8879964266455946, 1.129797339073468, 1.299291561428286]))

  block: # Variadic number of types with proc declaration inside
    var u, v, w, x, y, z = randomTensor([3, 3], 10)

    let c = 2

    let a = broadcast(u, v, w, x, y, z):
      # ((u * v * w) div c) mod (if not zero (x - y + z) else 42)

      proc ifNotZero(val, default: int): int =
        if val == 0: default
        else: val

      let uvw_divc = u * v * w div c
      let xmypz = x - y + z

      uvw_divc mod ifNotZero(xmypz, 42)

    echo a # (shape: [3, 3], strides: [3, 1], offset: 0, storage: (data: @[0, 0, 0, 7, 4, 0, 0, 2, 0]))

Instead of reusing Arraymancer, I just reused what is in the Nim standard library.


The main differences with Arraymancer's map/broadcast/iteration:

  • Arraymancer unfortunately doesn't support variadic broadcast yet, only up to 3 inputs. It's planned, but the iteration scheme is different, and a Nim bug on extracting container subtypes within macros prevents me from combining variadic broadcast with my iteration scheme and clean code.

  • Iteration scheme:

    • Arraymancer uses the following strategy per tensor and then gets the corresponding coordinates or value (or both). This avoids recomputing getIndex at every step, at the cost of an inner if/else branch and one loop per tensor. It also allows a simple for loop over contiguous tensors.

      template strided_iteration[T](t: Tensor[T], strider: IterKind): untyped =
        ## Iterate over a Tensor, displaying data as in C order, whatever the strides.
      
        ## Iterator init
        var coord = newSeq[int](t.rank) # Coordinates in the n-dimensional space
        var backstrides: seq[int] = @[] # Offset between end of dimension and beginning
        for i,j in zip(t.strides,t.shape):
          backstrides.add(i*(j-1))
      
        var iter_pos = t.offset
      
        ## Iterator loop
        for i in 0 ..< t.shape.product:
      
          ## Templating the return value
          when strider == IterKind.Values: yield t.data[iter_pos]
          elif strider == IterKind.Coord_Values: yield (coord, t.data[iter_pos])
          elif strider == IterKind.MemOffset: yield iter_pos
          elif strider == IterKind.MemOffset_Values: yield (iter_pos, t.data[iter_pos])
      
          ## Computing the next position
          for k in countdown(t.rank - 1,0):
            if coord[k] < t.shape[k]-1:
              coord[k] += 1
              iter_pos += t.strides[k]
              break
            else:
              coord[k] = 0
              iter_pos -= backstrides[k]
    • The broadcast implementation I just submitted iterates over all possible coordinate values and then feeds them to getIndex, like the Julia code does (I think); see the sketch after this list. The scheme is much simpler, is easier to use with a variable number of inputs, and produces much less code; however, moving from one element to the next is more costly due to getIndex versus a simple increment.

  • If you're studying iteration optimization, I suggest you look into TRIOT and the resources I've gathered.
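For clarity, a minimal sketch of that second scheme (illustrative, not the exact code from #3): walk every coordinate of the output shape with an odometer-style increment, and convert each coordinate to a memory offset via the strides, i.e. one getIndex per element.

proc getIndex(coord, strides: seq[int]; offset: int): int =
  ## Convert an n-dimensional coordinate into a flat memory offset.
  result = offset
  for k in 0 ..< coord.len:
    result += coord[k] * strides[k]

iterator coords(shape: seq[int]): seq[int] =
  ## Yield every coordinate of `shape` in C (row-major) order.
  var coord = newSeq[int](shape.len)
  var count = 1
  for s in shape: count *= s
  for _ in 0 ..< count:
    yield coord
    for k in countdown(shape.len - 1, 0):  # odometer-style increment
      if coord[k] < shape[k] - 1:
        coord[k] += 1
        break
      else:
        coord[k] = 0

when isMainModule:
  var flat = newSeq[int]()
  for c in coords(@[2, 3]):
    flat.add getIndex(c, @[3, 1], 0)
  doAssert flat == @[0, 1, 2, 3, 4, 5]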

@SimonDanisch
Owner

Awesome!! Thanks a lot :) Excited to try this out!
Could you help me get started by supplying just the equivalent of the first benchmarking challenge?

using BenchmarkTools
a = rand(1000, 1000); # 1000x1000 Float64 matrix
b = rand(1000); # 1000 Float64 vector
c = 1.0 # Float64 scalar
out = similar(a); # uninitialized copy of a
br = @broadcast a + b - sin(c) # construct the lazy representation of our call tree
@btime materialize!($out, $br) # materialize the broadcast && benchmark it with some statistical rigour

@mratsim
Contributor Author

mratsim commented Sep 22, 2018

I think this can be closed

mratsim closed this as completed Sep 22, 2018