
SIMD types #2299

Closed
ViralBShah opened this issue Feb 14, 2013 · 48 comments · Fixed by #4042
Labels: performance (Must go faster)

Comments

@ViralBShah
Member

I have been floating the idea to expose and use SIMD instructions in a generic way. @JeffBezanson has suggested that this is quite feasible. @StefanKarpinski suggested VecInt and VecFloat bits types in this thread:

https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/h8EGtQvdq9U

To refine this further, it may be that we need VecInt32, VecInt64, VecFloat32, and VecFloat64 bits types. Arrays of these Vec types could correspond to simply reinterpreting arrays of Int32, Int64, Float32, and Float64.
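
As a rough sketch of the intended usage (VecFloat64 here is the hypothetical bits type, and the 4-wide width is just an assumption for AVX):

a  = rand(Float64, 1024)
va = reinterpret(VecFloat64, a)   # same buffer, viewed as packed 4-wide SIMD values
# length(va) would be div(length(a), 4) for a 4-element VecFloat64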

We already do 16-byte alignment for memory allocation, although we may need 32-byte alignment for AVX instructions. This could be wasteful for small arrays, but there could be real performance gains for 2x2 and 3x3 matrix operations. As suggested in the thread above by @StefanKarpinski, we may need to do operations on the non-aligned portions in the head and tail separately.

Once we have this, we would probably want to implement all the vectorized base library operations using SIMD, which will give a nice performance boost until we have some kind of threading or fork/shmem based parallelism for Array objects.

Then there are also operations such as trigonometric and other math library functions, whose SIMD versions are unlikely to be as accurate as openlibm, but a whole lot faster. Perhaps these should not be used by default, but be available as vecsin etc.

@ViralBShah
Member Author

@stevengj You have a fair bit of experience with this kind of stuff in FFTW, and it would be great to hear your thoughts on the topic.

@lindahua
Contributor

I have been expecting the use of SIMD in Julia for quite a while -- this is definitely worth the effort.

Here are some things that might be useful (later) for implementing math functions with SIMD.

Agner maintains a lot of good stuff related to SIMD-based optimization: http://www.agner.org/optimize/#asmlib
He has also developed a SIMD computation library (http://www.agner.org/optimize/vectorclass.pdf) which might be adapted for use in Julia (though the license is GPL3).

Another place worth looking at is NT2 -- the authors of this library have developed a huge collection of SIMD routines (covering nearly every math function people would ever use). The library itself has a heavy dependency on Boost for expression templates, but I think the core SIMD routines could be ported easily. They seem to have tested the accuracy and performance extensively, but you may want to talk to them to confirm.

Intel has a Short Vector Math Library (SVML) that comes with the Intel compiler, which provides highly optimized SIMD routines for most C99 math functions (in both SSE and AVX versions). If you purchase their compiler, this library is redistributable. The last time I looked at the *.so files that come with MATLAB (Linux), I found that they use this library. The routines are extremely highly optimized by Intel engineers, and the accuracy is strictly controlled (below 4 ULP, or in other words the two least significant bits).

AMD LibM implements some SIMD math functions -- far fewer than Intel SVML, with no AVX support, and the performance is not as good. It is free of charge, though not open source, so you might have to pay attention to the license.

@JeffBezanson
Sponsor Member

If we just want SIMD vector kernels, it may be easiest to find or write a library of kernels in C/ASM. It may be that the best interface to SIMD is just vectorized operations, not special SIMD types.

If we have a VecFloat64 type whose purpose is that you reinterpret arrays to it when you want better performance, but you have to be careful about alignment, that's not a great programming model. I guess my question is: how do we want to use this other than inside A+B? Do the SIMD types behave like little vectors? We could have pack and unpack functions, so e.g. pack(a,b,c,d)+pack(e,f,g,h) does 4 additions at once. Though that seems like something an autovectorizer could figure out.
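
For concreteness, that hypothetical pack/unpack usage would look roughly like this (neither function exists yet):

v1 = pack(a, b, c, d)          # four Float64 values in one SIMD register
v2 = pack(e, f, g, h)
s  = v1 + v2                   # one vector add instead of four scalar adds
w, x, y, z = unpack(s)         # back to scalars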

Our current bitstypes are llvm integers, and bitcasting them to vector types may not be optimal. Maybe a slightly insane idea is to use tuples, and say that the representation of (Float64,Float64,Float64,Float64) is a 4xFloat64 SIMD vector. Or we could define

immutable Vector4{T}
  w::T
  x::T
  y::T
  z::T
end

and say that that will be treated as a SIMD vector. I'm not sure how that will interact with treating these as C structs (@vtjnash ?), but that situation could hardly be worse so it may not matter. Tuples seem to have all the right properties though, including the virtue that they have no defined native representation yet.

@kk49
Contributor

kk49 commented Feb 14, 2013

My vote is for using vectorized operations to access SIMD, multiple threads, and hopefully the same infrastructure to offer GPU processing. Telling the JIT that you want to add two vectors would seem to give it a better chance of optimizing the code than having it figure out what you are doing in a loop.

@kk49
Contributor

kk49 commented Feb 14, 2013

This is what I posted on the list last night about accessing LLVM IR from within Julia. I think the issues with focusing on direct SIMD instruction access may be similar.

So the more I look at this the less appealing it seems to be. At some point you end up having to rewrite the Julia Codegen code and LLVM C++ convenience classes in Julia. Then you would have to track them in a parallel development path. Not to mention what would have to be done when Julia statically compiles...

Maybe what I am looking for is a modularization of Julia's C++ JIT codegen code. The GPU (and "dangerous" CPU code with bounds checking moved to before the for loops, etc.) would then have direct access to parts of the stock Julia codegen (or access to the same LLVM library) and memory management.

The codegen modules could register requirements they have for memory allocation, and then all memory allocations could meet their specific criteria (alignment, etc.). They would also be able to tag objects in memory; one use of this would be to lock arrays while they are being processed on the GPU. I think the intermediate GPU code could be kept in resources that the CPU JIT compiler would deal with, so they could be exported to libraries.

@lindahua
Contributor

@JeffBezanson I do believe what we generally need is more than a set of vectorized functions implemented using SIMD.

It is easy to implement or port a C/C++ library that provides a set of functions like

void vec_add(int n, const double *a, const double *b, double *r);
......

People who have Intel VML may just ccall such functions.
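
For example, assuming such a vec_add lives in a shared library called libvml (the library and function names here are just placeholders), the call would look roughly like:

r = similar(a)
# placeholder library/function names; only the ccall pattern matters
ccall((:vec_add, "libvml"), Void, (Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
      length(a), a, b, r)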

The problem can be a little more complicated than that. For example, if you want to do a + b .* c, then calling vec_mul and then vec_add is definitely not the optimal way to go -- it traverses memory twice and probably requires a temporary array to store the result of b .* c.
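
In other words, with only whole-array kernels the evaluation is forced into something like the following (vec_mul! and vec_add! are hypothetical whole-array kernels, shown only to illustrate the extra pass and the temporary):

t = similar(b)
vec_mul!(t, b, c)    # first pass over memory: t = b .* c
vec_add!(r, a, t)    # second pass: r = a + t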

So consider the expression r = (a + b) .* (a - b); the ideal code (in C++ terms) would look like the following:

for (int i = 0; i < n; i += 4)
{
    __m256d va = _mm256_load_pd(a + i);   // load 4 doubles from a
    __m256d vb = _mm256_load_pd(b + i);   // load 4 doubles from b
    __m256d t1 = _mm256_add_pd(va, vb);   // a + b
    __m256d t2 = _mm256_sub_pd(va, vb);   // a - b
    __m256d vr = _mm256_mul_pd(t1, t2);   // (a + b) .* (a - b)
    _mm256_store_pd(r + i, vr);           // store 4 results
}

We have to pack the computations into the loop body and do them in one pass, instead of calling several vectorized functions. Also, all the _mm256_* calls are CPU intrinsics; they should be translated directly into CPU instructions (or LLVM instructions). While one could use ccall to invoke such intrinsics (if they were wrapped in C functions), that would introduce overwhelming overhead.

Also, SIMD is not just for element-wise calculations. A lot of things can benefit from it, for example small-matrix algebra. The use of 2x2, 3x3, and even 4x4 matrices to transform geometric points is everywhere in graphics and image processing. With direct exposure of SIMD instructions, it would be possible to develop a library for this kind of thing without compromising performance. It is worth noting that the performance of BLAS is very poor for matrices smaller than 16x16 (sometimes an order of magnitude slower than naive C loops).

With VecFloat64 (or things of this sort), I can write optimized loops in Julia, like the following

width = get_vec_width(VecFloat64)
m = div(n, width)   # number of full SIMD blocks
for i = 0 : m-1
    idx = i * width + 1
    va = vecload(a, idx)
    vb = vecload(b, idx)
    t1 = va + vb  # overloaded for VecFloat64
    t2 = va - vb
    vr = t1 .* t2 
    vecstore(vr, r, idx)
end

So we need a specific type to represent va, vb, t1, t2, vr above -- which can be the types that Stefan proposed. Of course, vecload, vecstore etc. should be translated to intrinsics instead of function calls.

The code above does not deal with trailing elements, but that is much easier to add once we settle on how to do the main loop.
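
For example, the remainder could simply fall back to a scalar loop (continuing the sketch above):

# scalar tail for the elements that do not fill a whole SIMD register
for i = m * width + 1 : n
    r[i] = (a[i] + b[i]) * (a[i] - b[i])
end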

In terms of memory alignment: it is true that working with 32-byte (for AVX) or 16-byte (for SSE) aligned memory is faster than working with unaligned memory, but it is not a must. It is easy to guarantee that all instances of Array are aligned (since we control the memory allocation). However, there is no such guarantee for SubArray etc.; in such cases you can simply use unaligned loads for the parts that are not aligned -- the performance difference (in my benchmarks and others I have seen) is below 20%, which is definitely better than first copying things to a temporary.

Anyway, for any of these things to happen, the first step is to expose SIMD instructions.

@kk49
Contributor

kk49 commented Feb 14, 2013

I am not really opposed to adding SIMD data structures and intrinsics; it would give Julia a speed boost in these situations. However, my concern is that if you start using SIMD instructions in loops at too high a level, you start to hide the intent of the computation, so that by the time it gets to LLVM or a GPU backend it cannot safely do the optimizations that could otherwise be done (at least not without Polly, http://polly.llvm.org/, and I am not certain what the overhead of Polly's analysis is).

@StefanKarpinski
Sponsor Member

@kk49: I'm not at all clear about what you're advocating for or against here. Part of the issue is that we don't really have clear or consistent terminology for the various approaches that people are discussing here.

@kk49
Contributor

kk49 commented Feb 14, 2013

@StefanKarpinski I think adding SIMD operators and data types would be useful, but I don't think it is the final solution for implementing MAP and REDUCE operations for SIMD/multithreaded CPUs or GPUs. A solution will have to deal with memory management, kernel generation, and kernel scheduling. What @lindahua proposes makes sense but where should this transformation be done? As part of a Julia library, a replacement for array.jl, in codegen.cpp, or LLVM JIT/Static compilation?

@StefanKarpinski
Sponsor Member

Honestly, I think that worrying about GPU and SIMD at the same time is too much – it's paralyzing. Let's tackle one massively hard problem at a time. The reason I like the approach I've laid out for SIMD (based on Viral's ideas) is that it's simple, it's very flexible and fairly general, and it's clear what's going on – and it maps very cleanly onto LLVM code gen. There's no magical translation going on – it's just regular Julia code that implements vectorized computations by recasting bits types to corresponding SIMD bits types. Jacking things into code gen and memory management or "modularizing Julia's C++ JIT codegen" seems crazy to me.

@StefanKarpinski
Sponsor Member

I think that with the right interface, writing SIMD versions of vectorized operations generically could be quite clean. Let me try to give an example.

function +(a::Array{Float64}, b::Array{Float64})
  # check that a and b have the same shape
  c = similar(a)
  ahead, abody, atail = simd_split(a)
  bhead, bbody, btail = simd_split(b)
  chead, cbody, ctail = simd_split(c)
  for i = 1:length(chead)
    chead[i] = ahead[i] + bhead[i]
  end
  for i = 1:length(cbody)
    cbody[i] = abody[i] + bbody[i]
  end
  for i = 1:length(ctail)
    ctail[i] = atail[i] + btail[i]
  end
  return c
end

The simd_split function gives three linear views into its dense array argument: head and tail are of the same element type as the original array while body is a vectorized view into the aligned bulk of the array. The code isn't super pretty, but it's simple enough and maps clearly down onto what a compiler needs to emit in terms of vectorized code: take care of the leading unaligned stuff, then do the bulk of the work using SIMD operations, then take care of the trailing stuff.
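
Concretely, for a hypothetical 4-wide VecFloat64 the three pieces might look like this (simd_split and VecFloat64 are proposed names, not existing API):

a = rand(Float64, 1003)
head, body, tail = simd_split(a)
# head :: a few leading Float64s, up to the first 32-byte boundary
# body :: the aligned bulk, reinterpreted as packed 4-wide VecFloat64 values
# tail :: the leftover Float64s that do not fill a full SIMD register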

@lindahua
Contributor

I agree with Stefan that introducing several SIMD types and a set of operations/functions on them is a clean and easy way to go, which is pretty much what I was thinking.

@kk49 I do appreciate your long-term perspective -- but what you suggest might require a fundamental restructuring of the Julia code base, which is probably quite difficult in the near term.

You definitely have much more experience with GPU computation than I have. Here is just my two cents. GPU computation is fundamentally different from what we typically do on CPUs. You may have to plan ahead what should be put on device, what on host, and how they communicate (so as to reduce the traffic between host and devices).

For supporting GPUs in Julia, what about the approach taken by Jacket? It is built on top of MATLAB -- which is worse than Julia in terms of language design -- and still achieves amazing performance. I am pretty sure they did not mess with MATLAB's internal code-gen -- all they do is build their own VM and interface with it through MEX. Julia provides a more convenient way to interface with C functions, so I believe this is achievable without modifying Julia's internal code-gen.

@lindahua
Contributor

@StefanKarpinski The code example is nice.

Just one point. Why would you ever want a head part? I think the head part should always have a length of zero.

Just consider an expression like a + b:

If both a and b are aligned -- then no head part is needed.

If a is aligned and b is not -- then your simd_split function may yield an empty head for a, and a non-empty head for b (say, of length 1). Then how many iterations should the head loop run? 0 or 1?

In my opinion, for non-aligned arrays you can simply use unaligned loads. The performance difference is no more than 20% in various benchmarks.

Introducing a head part might lead to inconsistent splitting between operands in an expression.

@StefanKarpinski
Sponsor Member

I was assuming that it was possible that our arrays were not aligned to the requirements of SIMD instructions, but if we can skip that part all the better. But the example shows that even handling arbitrarily aligned arrays would be feasible.

@ViralBShah
Member Author

SubArrays could be unaligned. The choice of aligned versus unaligned access will be up to the programmer anyway. I agree the first order of business is to get something working.

@StefanKarpinski
Sponsor Member

I get Dahua's point now – the head thing only helps if the two arrays are unaligned in the same way, which obviously they generally wouldn't be. Just doing unaligned fetches seems like the best way to go.

@kk49
Contributor

kk49 commented Feb 14, 2013

@lindahua my DeMat library implements the core Jacket functionality of vector (MAP) kernel creation & launching and data movement. (maybe more, I'm not sure what they do for "devectorization"). Getting access to "complex" math operations like sin and cos is a problem right now, since the CUDA compiler generates the assembly for them. The LLVM PTX people are working on that...

@lindahua
Contributor

Yes, that's what I mean. If two arrays are unaligned by different offsets, introducing the head part does not help -- this was a lesson I learned when implementing a C++ library.

I also read the disassembly generated by the Intel compiler with vectorization turned on. If I don't tell it the arrays are aligned, the compiler simply emits unaligned loads/stores -- which are nearly as fast as a hand-coded loop using aligned loads.

@kk49
Contributor

kk49 commented Feb 14, 2013

@StefanKarpinski I definitely understand that it is crazy :) What about multithreading?

@StefanKarpinski
Sponsor Member

Good god, you're a madman, @kk49! Are you trying to give me a heart attack?

@GunnarFarneback
Contributor

I'm hoping for SIMD support that facilitates heavy optimization of spatial convolution from within Julia. That's more involved than just pointwise operations and the best implementation choice depends on the shape and size of the convolution kernel (and the exact set of available SIMD operations). For example a horizontal and a vertical 1D kernel would require very different approaches for optimal performance.

I guess reinterpreting arrays would be a functional way to access the data but base library operations won't cut it. I'd rather get my hands dirty at the abstraction level of C intrinsics.

@StefanKarpinski
Sponsor Member

I'm finding that a bit hard to interpret. Which way of doing this are you saying will or won't cut it for your application?

@GunnarFarneback
Contributor

The main problem is probably that I'm not really sure myself how far the original proposal would suffice. Maybe it can do more than I was thinking. I'll try to prepare some more specific examples of what I'd like to accomplish.

@GunnarFarneback
Contributor

Here are two examples of spatial convolutions with small kernels. Let's first assume we filter an image with a horizontal 1x3 kernel. In Julia this would be, with one column of padding on each side of the image:

out = similar(in)
for j = 2:size(in,2)-1
    for i = 1:size(in,1)
        sum = 0.0f0
        for k = 1:3
            sum += in[i,j+(k-2)] * kernel[k]
        end
        out[i,j] = sum
    end
end

In mixed Julia and C intrinsics syntax this could relatively straightforwardly be implemented with SSE2 operations as

c = zeros(__m128, 3)
for k = 1:3
    c[k] = _mm_load1_ps(kernel[k])
end
out = similar(in)
for j = 2:size(in,2)-1
    for i = 1:4:size(in,1)
        sum = _mm_set1_ps(0.0f0)
        for k = 1:3
            x = _mm_load_ps(in[i,j+(k-2)])
            sum = _mm_add_ps(sum, _mm_mul_ps(x, c[k]))
        end
        _mm_storeu_ps(out[i,j], sum)
    end
end

Obviously this assumes that the height is a multiple of 4 so that alignment comes out right in all columns and there are no tails to handle. The image is assumed to be single precision, and _mm_load_ps(a[i,j]) is supposed to load a[i:i+3,j] into a SIMD register.

If we instead look at a vertical 3x1 filter, the Julia code is nearly the same:

out = similar(in)
for j = 1:size(in,2)
    for i = 2:size(in,1)-1
        sum = 0.0f0
        for k = 1:3
            sum += in[i+(k-2),j] * kernel[k]
        end
        out[i,j] = sum
    end
end

For a SIMD implementation one option is to transpose the image and use the code above but the transposition is costly, especially for such a small kernel as this one. With SSSE3 an in-place option is to do something like this (ignoring heads and tails and assuming alignment comes out right):

c1 = _mm_load1_ps(kernel[1])
c2 = _mm_load1_ps(kernel[2])
c3 = _mm_load1_ps(kernel[3])
out = similar(in)
for j = 1:size(in,2)
    for i = 5:size(in,1)-4
        x1 = _mm_castps_si128(_mm_load_ps(in[i-4,j]))
        y2 = _mm_load_ps(in[i,j])
        x2 = _mm_castps_si128(y2)
        x3 = _mm_castps_si128(_mm_load_ps(in[i+4,j]))
        sum = _mm_mul_ps(y2, c2)
        y1 = _mm_castsi128_ps(_mm_alignr_epi8(x2,x1,12))
        sum = _mm_add_ps(sum, _mm_mul_ps(y1, c1))
        y3 = _mm_castsi128_ps(_mm_alignr_epi8(x3,x2,4))
        sum = _mm_add_ps(sum, _mm_mul_ps(y3, c3))
        _mm_storeu_ps(out[i,j], sum)
    end
end

After thinking more about it I can imagine the first case could fit with the original proposal but I'm more doubtful about the second case. Anyway I wouldn't mind having the option to program SIMD in Julia at this level of abstraction. It's not pretty, so there certainly should be more convenient alternatives to get SIMD vectorization, but Julia's code generation facilities would come in very handy to avoid having to do this entirely by hand.

@lindahua
Contributor

@GunnarFarneback I guess your problem could be much more easily addressed by faster implementations of conv and conv2 (probably in C with SIMD).

If the code is really performance critical, then I think the fastest way to do that without GPU is to do a ccall to the highly optimized convolution function ippsConv_32f in Intel IPP.

For now, it may be better to focus on making @StefanKarpinski's plan work.

@StefanKarpinski
Sponsor Member

I should point out that this was originally @ViralBShah's plan; we then fleshed it out together and I just happen to be the one who put it up here.

@GunnarFarneback
Contributor

This was only a simplified example of the kinds of things I want to do. Of course it can be done in C (or in practice more likely C++) but that is exactly what I'm trying to avoid.

@ViralBShah
Member Author

Here's an old stackoverflow thread that is relevant. Seems like we are not the only ones who want such a thing:
http://stackoverflow.com/questions/1417681/simd-programming-languages

There is also the Intel SPMD program compiler that is BSD licensed:
http://ispc.github.com
https://github.com/ispc/ispc/

@lindahua
Contributor

The Intel SPMD is quite interesting.

We probably can't use it directly (it is a compiler dedicated to a specific C-extension language), but it might be good to borrow some of the concepts there.

Still, I think the easiest thing to do first is to provide SIMD types (either as opaque bits types or as immutable composites -- I am fine with either), and expose intrinsic operations on them.

Everything else can be built on top later through a higher-level interface.

@ViralBShah
Member Author

I did some experimental work to add a vecFloat64 type, published in the branch vs/simd. The major part of the work, as I understand it, is adding support for boxing, unboxing, and the various intrinsics in intrinsics.cpp. To start with, I would like to get all the SIMD intrinsics available in Julia. Eventually, we can even try to fold them seamlessly into the language like ispc does.

@pauljurczak
Contributor

I'm very interested in having SIMD available in Julia. Any news on this topic?

I agree with many of the comments above that providing inlined SIMD instructions is a sensible first step. I'm not familiar with the LLVM/JIT infrastructure Julia is built on, so I must ask: is it technically possible to guarantee that a SIMD instruction in a Julia program will be translated to an equivalent single SIMD instruction in the executable binary?

If this is of any help, .NET and Mono, which are similarly JITed environments, have a SIMD API: http://tirania.org/blog/archive/2008/Nov-03.html and https://github.com/mono/Mono.Simd.Math.

For higher-level features like SIMDified matrix operations, I suggest requiring proper memory alignment and 4n or 8n sizes for arguments. In most cases where performance is that critical, it is not a big problem to specify a suitable alignment and size. Misaligned or oddly sized data could initially use only non-SIMD functions.
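
A minimal sketch of that policy, with hypothetical simd_add! and scalar_add! kernels standing in for whatever the real implementations would be:

function vadd!(r, a, b)
    if length(a) % 8 == 0       # hand only 8n-sized (and suitably aligned) data to the SIMD path
        simd_add!(r, a, b)
    else
        scalar_add!(r, a, b)    # everything else takes the plain scalar path
    end
    return r
end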

@ViralBShah
Member Author

Some of this work is likely to be similar to the recent Float16 work and the refactoring should help here as well.

Cc: @loladiro

@lindahua
Contributor

lindahua commented Jul 5, 2013

I have been looking forward to this for quite a while.

Once SIMD intrinsics are in place, I can start experimenting with things in NumericExtensions.jl.

@ArchRobison
Contributor

My take is that having two levels of SIMD support is good. The intrinsic level seems to work best for aggressive coders. Ideally, it should permit code to be written in a vector-width agnostic way (first there was SSE, then AVX, and AVX-512 has been announced). Overloading should be able to cover that need.

For higher level use, I'm a fan of the OpenMP 4.0 "simd" loops (a descendant of Intel's "pragma simd"). It lets the programmer tell the compiler there are no "simd gotchas" (notably regarding guarantees on iteration progress) and the compiler takes care of the details. Maybe it could be expressed as a @simd macro preceding a Julia for loop? Reduction functionality has to be considered. Markup similar to reduction clauses is one way. Pattern-matching "var op=" is another way. I used to think the pattern-matching approach was awful, but I've seen systems do this and I now appreciate that it has a certain simplicity, particularly if trying to extend to prefix-scan patterns.
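
In Julia syntax that might look roughly like the following, where the @simd macro is still hypothetical and the s += pattern is what reduction recognition would have to pick up:

function simd_dot(a, b)
    s = 0.0
    @simd for i = 1:length(a)   # programmer asserts there are no cross-iteration dependencies
        s += a[i] * b[i]        # "s +=" is the reduction pattern to be recognized
    end
    return s
end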

@ViralBShah
Member Author

@ArchRobison Do take a look at the approach in #4202.

@ArchRobison
Contributor

@ViralBShah Did you mean how CoinMP.jl invokes the Coin library or something else? I'm not clear on the relation to SIMD programming.

Is your vs/simd branch still around? I was having trouble finding it, but I'm still new at using GitHub.

@Keno
Member

Keno commented Nov 1, 2013

@ViralBShah mistyped. He meant #4042, I believe.

@ViralBShah
Member Author

That's right. My simd branch is no longer around, and the approach that Keno is taking in the above pull request is the right way forward.

Keno reopened this Dec 7, 2013
@StefanKarpinski
Sponsor Member

So the tuple branch does not satisfy having SIMD types?

@jakebolewski
Member

It would be nice to enforce a common naming scheme for typealiases of SIMD tuples in Base.

@Keno
Member

Keno commented Dec 7, 2013

We have SIMD types now in the form of tuples, but since I separated out llvmcall we can't actually do anything with them yet (LLVM will probably still be able to better optimize and emit appropriate SIMD operations).

@ViralBShah
Member Author

The approach we have taken is to exploit SIMD through compiler optimizations rather than exposing SIMD types. Should we close this?

@StefanKarpinski
Sponsor Member

Sure. We can always reopen if we change our minds.

@SimonDanisch
Contributor

Is there a wrap-up anywhere? I can't really infer what this means for FixedSizeArrays and operations that work on homogeneous composite types. I guess if FixedSizeArrays are implemented via NTuple, SIMD acceleration should be available, right? What about constructs like this:

immutable Point{N, T}
    vec::NTuple{N, T}
end

Construction of Point((1,2,3)) generates much more code than (1,2,3).
Will Point((1,2,3)) emit <3 x i64> at some point?

@JeffBezanson
Sponsor Member

In limited cases (local variables) we already map tuples to simd types. With the tuple redesign this will apply much more broadly. You'll be able to have a flat array of NTuple{4,Float32}, at which point we'll basically have full simd types.
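
That is, something along these lines (just a sketch; the flat array-of-NTuple layout only arrives with the tuple redesign):

a = Array(NTuple{4,Float32}, 1000)   # contiguous buffer of 4-wide Float32 groups
a[1] = (1f0, 2f0, 3f0, 4f0)          # each element maps to one <4 x float> value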
