Float16 type #3467

Closed
rwgardner opened this Issue Jun 20, 2013 · 28 comments


This is a request for support for half-precision floating point numbers (Float16s).

(I would expect there has been prior discussion about adding support for these, but I did not find any.)

Although the precision is low, Float16s are still useful when you have a very large quantity of floating point numbers (which is what we have) and want to reduce memory footprint, cache impact, or disk storage. (Currently, we manually convert our half precision floats with bit manipulations and reinterpretation, but the code would be cleaner if Julia supported them natively.)
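The kind of manual bit-manipulation conversion described here can be sketched as follows (Python used purely for illustration; this is not the poster's actual code, and `half_to_float` is a made-up name):

```python
import struct

def half_to_float(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision bit pattern into a Python float."""
    sign = (bits >> 15) & 0x1
    exp  = (bits >> 10) & 0x1F
    sig  = bits & 0x3FF
    if exp == 0:
        # Zero or subnormal: value = (-1)^sign * sig * 2^-24 (no implicit leading 1)
        val = sig * 2.0 ** -24
    elif exp == 0x1F:
        # All-ones exponent encodes infinity (sig == 0) or NaN (sig != 0)
        val = float("inf") if sig == 0 else float("nan")
    else:
        # Normal number: implicit leading 1, exponent bias 15
        val = (1 + sig * 2.0 ** -10) * 2.0 ** (exp - 15)
    return -val if sign else val
```

The result can be cross-checked against the `struct` module's native half-precision format (`'e'`), which implements the same encoding.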

Thanks.

Member

nolta commented Jun 20, 2013

LLVM 3.1 added support for half floats, so this should be doable. Marking as 'up for grabs'.

Owner

StefanKarpinski commented Jun 20, 2013

Since this is strictly a storage type, very few operations are needed – mostly conversion to and from larger float types.
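A storage-only type of this kind can be sketched like so (Python for illustration; the `Half` class and its method names are invented here, not Julia's API). Values live as 16-bit patterns, and the only essential operation is widening to a larger float before doing anything else:

```python
import struct

class Half:
    """Storage-only half float: holds a 16-bit pattern, widens before arithmetic."""
    __slots__ = ("bits",)

    def __init__(self, value: float):
        # Round to nearest representable half via struct's IEEE 754 'e' format
        self.bits = struct.unpack("<H", struct.pack("<e", value))[0]

    def widen(self) -> float:
        """The one operation a storage type really needs: convert up."""
        return struct.unpack("<e", struct.pack("<H", self.bits))[0]

    def __add__(self, other: "Half") -> "Half":
        # Arithmetic = widen, compute at full precision, narrow back to storage
        return Half(self.widen() + other.widen())
```

Anything beyond conversion (printing, arithmetic, comparison) falls out of widening first, which matches the "mostly conversion to and from larger float types" point above.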

Owner

timholy commented Jun 20, 2013

@rwgardner, my guess is that this will happen sooner if you submit a pull request. ("Up for grabs" is a good choice here, and it basically means "waiting for someone to do it." Since you want the feature...) It's good that you first submitted it as an issue, however, in case there were strong objections; since that doesn't seem to be the case, it looks like the way is clear for you to add this feature.

Some time in the not-too-distant past, support for Int128 was added. Perhaps a good start would be browsing the commit history (with git log and git show) to find out exactly how that was done; it might be a great model for this case.

Owner

StefanKarpinski commented Jun 21, 2013

Float16 should be substantially easier than Int128. Up for grabs is more like "waiting for someone to do it and pretty nicely isolated and doable by a determined newcomer."

Owner

ViralBShah commented Jun 21, 2013

The cool thing about Int128 was that it was done fully in Julia. I believe that to get a fast Float16 implementation, one may need to leverage LLVM's Float16 capabilities in intrinsics.cpp and codegen.cpp.

I believe a first-cut implementation can be done with bit shifts and the like, the way @rwgardner has already done, and it would be nice to receive that as a pull request as a starting point.

Sounds good. I'm not "grabbing" this yet, but I will if I really want it done. (Unfortunately, I don't get paid to work on Julia for the most part, which means I need to do this in my free time. That's something I'd love to do, but in short, a new first baby due any day has been, and will be, dominating that free time for a while.)

Owner

ViralBShah commented Jun 21, 2013

Is it possible for you to isolate the code that you have already written for Float16 and submit that?

Owner

StefanKarpinski commented Jun 22, 2013

Outline of what needs to be done:

  • Add intrinsics for floating point truncation and extension to and from 16-bit floats
    • can be done either by adding specific intrinsics or generalizing the existing ones
  • Add convert methods to/from Float16 and other numeric types
  • Add promotion rules for Float16 and other numeric types

@JeffBezanson, any thoughts on whether it's better to add new specific intrinsics (fptrunc16 and fpext32) or generalize the existing ones? I was leaning towards generalizing the existing ones and renaming fptrunc32 => fptrunc and fpext64 => fpext.

@ghost

ghost commented Jun 22, 2013

If rwgardner is alright with it, I can try implementing Float16. I've wanted to find a way to get my hands dirty in Julia.

Owner

ViralBShah commented Jun 22, 2013

@mattgallivan Please jump in; the more the merrier. @StefanKarpinski's outline is basically what needs to be done, and one can follow the Float32 implementation in src and base.

Owner

StefanKarpinski commented Jun 22, 2013

Just to expand on what I mean by "generalizing the existing ones", this means turning the fptrunc and fpext intrinsics into versions that aren't specific to bit sizes but use type info to figure out the appropriate sizes and call the corresponding LLVM instructions. We've gradually been moving from specific versions with bit sizes in their names to more generic ones.

Owner

ViralBShah commented Jun 22, 2013

The Int stuff already does that, and it would be nice to do the same for FloatingPoint too. I wonder if we should take this opportunity to also add Float128 at the same time, assuming LLVM supports it.

Owner

Keno commented Jun 22, 2013

Since there is no hardware support for quad-precision arithmetic, adding Float128 is quite a bit more complicated.

Owner

StefanKarpinski commented Jun 22, 2013

Yeah, that's a whole different can of worms. You actually want to compute with Float128, or it's completely useless. For Float16, it's fine to just be able to store them.

@mattgallivan all sounds good. I would love to contribute and would have a lot of fun doing it, but my life is about as insane as it's ever been right now. Hopefully I can contribute in other ways in the future.

You may not want this (I'm sure it could be written more efficiently, etc., and you may want to do it in Fortran or C), but here's what I have. It hasn't been heavily validated yet either, but you might use it for validation by comparing it against your own code. I haven't done any conversion back to Float16.

bitstype 16 MyFloat16

function convert(::Type{Float32}, val::MyFloat16)
    val = uint32(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    local ret::Uint32

    if exp == 0
        if sig == 0
            # signed zero
            sign = sign << 31
            ret = sign | exp | sig
        else
            # subnormal: find the leading significand bit and renormalize
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 31
            exp = (-14 - n_bit + 127) << 23
            sig = ((sig & (~bit)) << n_bit) << (23 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            # signed infinity
            if sign == 0
                ret = 0x7f800000
            else
                ret = 0xff800000
            end
        else
            # NaN
            ret = 0xffffffff
        end
    else
        # normal number: rebias the exponent and widen the significand
        sign = sign << 31
        exp  = (exp - 15 + 127) << 23
        sig  = sig << (23 - 10)
        ret = sign | exp | sig
    end
    return reinterpret(Float32, ret)
end

function convert(::Type{Float64}, val::MyFloat16)
    val = uint64(reinterpret(Uint16, val))
    sign = (val & 0x8000) >> 15
    exp  = (val & 0x7c00) >> 10
    sig  = (val & 0x3ff) >> 0
    local ret::Uint64

    if exp == 0
        if sig == 0
            # signed zero
            sign = sign << 63
            ret = sign | exp | sig
        else
            # subnormal: find the leading significand bit and renormalize
            n_bit = 1
            bit = 0x0200
            while (bit & sig) == 0
                n_bit = n_bit + 1
                bit = bit >> 1
            end
            sign = sign << 63
            exp = (-14 - n_bit + 1023) << 52
            sig = ((sig & (~bit)) << n_bit) << (52 - 10)
            ret = sign | exp | sig
        end
    elseif exp == 0x1f
        if sig == 0
            # signed infinity
            if sign == 0
                ret = 0x7ff0000000000000
            else
                ret = 0xfff0000000000000
            end
        else
            # NaN
            ret = 0xffffffffffffffff
        end
    else
        # normal number: rebias the exponent and widen the significand
        sign = sign << 63
        exp  = (exp - 15 + 1023) << 52
        sig  = sig << (52 - 10)
        ret = sign | exp | sig
    end

    return reinterpret(Float64, ret)
end

We could convert to only Float32 or Float64 and then use existing code to convert between those. It seems more efficient to convert to/from both directly in most cases, but it may not be on some architectures, partly depending on whether there is hardware support for converting between Float32 and Float64. (I don't know if that's something floating point units typically support or not.)

Owner

ViralBShah commented Jun 27, 2013

@StefanKarpinski Would it be good to start off with this as a pure julia implementation and get it in base to begin with?

Owner

ViralBShah commented Jul 16, 2013

Until the LLVM bug is sorted out, it may be worthwhile to put @rwgardner's Julia implementation in Base. That way, at least the storage format can be used, and the conversions could potentially get faster once the LLVM issue is fixed.

@loladiro Does LLVM 3.3 fix the Float16 bugs?

Owner

StefanKarpinski commented Jul 16, 2013

Even using @rwgardner's conversions, the following patch unfortunately still causes LLVM failures:

https://gist.github.com/StefanKarpinski/9092d04bc24c44493d08

julia> float16(1.5)
LLVM ERROR: Cannot select: 0x104151b10: ch = store 0x102070910, 0x10421df10, 0x104231d10, 0x10434d410<ST2[%14]> [ORD=77165] [ID=35]
  0x10421df10: f16,ch = load 0x10434dc10, 0x102070010, 0x10434d410<LD2[FixedStack0]> [ORD=77156] [ID=27]
    0x102070010: i64 = FrameIndex<0> [ORD=77155] [ID=4]
    0x10434d410: i64 = undef [ORD=77150] [ID=2]
  0x104231d10: i64 = add 0x104233910, 0x1041a7810 [ORD=77163] [ID=33]
    0x104233910: i64,ch,glue = CopyFromReg 0x104087a10, 0x104088010, 0x104087a10:1 [ORD=77157] [ID=32]
      0x104088010: i64 = Register %RAX [ORD=77157] [ID=10]
      0x104087a10: ch,glue = callseq_end 0x10434da10, 0x104264310, 0x104264310, 0x10434da10:1 [ORD=77157] [ID=31]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x104264310: i64 = TargetConstant<0> [ORD=77155] [ID=5]
        0x10434da10: ch,glue = X86ISD::CALL 0x104279410, 0x104232910, 0x104085410, 0x10417a710, 0x104279410:1 [ORD=77157] [ID=30]
          0x104232910: i64 = X86ISD::Wrapper 0x104085310 [ID=16]
Owner

Keno commented Jul 16, 2013

You'll still want to leave the disable in the compiler in place, otherwise LLVM will generate bad code. LLVM 3.3 does not fix this.

Owner

JeffBezanson commented Jul 16, 2013

Yes, with this implementation no compiler changes are needed; it's just a 16-bit bitstype.

Owner

StefanKarpinski commented Jul 16, 2013

Ok, if someone wants to finish this, I'm away for the day.

Owner

ViralBShah commented Aug 2, 2013

Bump.

Owner

Keno commented Aug 2, 2013

@StefanKarpinski do you just want to apply your patch?

Owner

StefanKarpinski commented Aug 3, 2013

I don't think just applying the patch works; there were a bunch of changes needed to make it work.

@Keno Keno closed this in ea1f3b2 Aug 7, 2013

Owner

ViralBShah commented Aug 14, 2013

It would be nice to have a nicer show() method for float16. Asking here in case the current behavior is by design.

julia> float16(100.25)
Float16(0x5644)
Owner

StefanKarpinski commented Aug 14, 2013

Printing 16-bit floats correctly and minimally is quite non-trivial. Our 32-bit and 64-bit float printing are handled by the double-conversion library which does not support 16-bit floats. It might be possible to figure out a hack that approximates correct minimal Float16 printing using the printing routines for Float32, but it's not obvious how.
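One possible shape for such a hack: print with the fewest significant digits that still round-trip to the same 16-bit pattern, leaning on an existing wider-float formatter for the digits themselves. A sketch in Python (`shortest_half` is an invented helper, not anything in Julia or double-conversion):

```python
import struct

def shortest_half(bits16: int) -> str:
    """Shortest %g string that parses back to exactly the same half bit pattern."""
    value = struct.unpack("<e", struct.pack("<H", bits16))[0]
    for digits in range(1, 18):
        s = f"{value:.{digits}g}"
        # Accept the first candidate that survives a string -> half round trip
        if struct.unpack("<H", struct.pack("<e", float(s)))[0] == bits16:
            return s
    return repr(value)
```

Since Float16 has at most 11 significand bits, a handful of decimal digits always suffices, so the loop terminates early in practice; whether this is "correct and minimal" in the double-conversion sense would still need a careful argument.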

Owner

ViralBShah commented Aug 14, 2013

I wonder what is going on here:

julia> a = float16(rand(5,5))
5x5 Float16 Array:
 0.445801  0.154785  0.431641   0.384521  0.188354 
 0.4646    0.281006  0.766602   0.563965  0.0402222
 0.685059  0.92627   0.921875   0.933594  0.468994 
 0.841797  0.582031  0.0185242  0.481934  0.151367 
 0.348877  0.952637  0.672852   0.864746  0.166138 
Owner

JeffBezanson commented Aug 14, 2013

Float16 printing has several problems right now, e.g.

julia> print_shortest(STDOUT,NaN16)
NaN32

(plus NaN16 does not work properly)
I'm about to commit some fixes.

showcompact has a fallback definition that is printing the Float16s in that array by converting them to Float64. The question is whether we should print the f0 suffix. For now I'll say that is specific to Float32, and leave it off.
