Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a method to read the first bytes of a float (and assume the rest are 0) #70

Closed
wants to merge 7 commits into from

Conversation

Projects
None yet
6 participants
@SamWhited
Copy link
Contributor

SamWhited commented Mar 1, 2017

I'm not sure if this is broadly useful enough to be worth adding to the library, but I've found myself using a similar function to this several times in a recent project so I thought I'd submit a PR in case you wanted it.

The idea is to read the first n bytes of an f64 as a uint and then assume that the rest of the bytes are 0. This is very useful if you're parsing lots of compressed floats in little-endian format where the low bits (which are often zero) can be dropped (eg. due to a form of run length encoding, or other compression that drops sequences of zeros).

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 1, 2017

See #71

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 1, 2017

@BurntSushi I don't think this is related, is it (other than the fact that this method would also have a similar problem)? This PR is for a general float method that reads less than 8 bytes (and assumes the rest are 0), but it's likely that I'm just misunderstanding what you meant.

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 1, 2017

@SamWhited It's related in that it would be adding more methods with the same problem as the existing read_f32/read_f64 methods.

FWIW, we do already have read_uint/read_int, which seem like analogous methods, right? So I think I'm at least in principle in favor of these.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 1, 2017

Ah yes, if you consider reading signaling NaN's a problem (although that sounds like it's up in the air? I'm inclined to say that it is a problem personally EDIT: Yup, they cause undefined behavior in LLVM.) this would be adding another method with the same problem. So maybe it's something for the 2.0 release (or maybe it doesn't really cause any issues, since there are already methods with the problem so having another doesn't really change anything)?

FWIW, we do already have read_uint/read_int, which seem like analogous methods, right? So I think I'm at least in principle in favor of these.

Yes, it's more or less analogous

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 28, 2017

One interesting idea brought up in the other thread was that we could mask out the signaling NaN bit before doing the transmute to a float. I don't really like that, but we either do that, or we need to change the API to return a Result for floating point conversions.

cc @est31 @retep998 @nagisa @Amanieu @petrochenkov @dwrensha @valarauca @alexcrichton

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 28, 2017

we could mask out the signaling NaN bit

I don't love the idea of losing that information; if we are decoding something to a signaling NaN we either want to know in case it was intentional, or we want to know because it's a bug (or maybe it actually means something) in the encoded stream and we need to display some error, or take some action.

Having float operations return a result feels poor to me too because there's the overhead of unwrapping a result on what appears at first glance to be a simple operation that should "just work", but since it's not actually that simple under the hood I think this is the lesser of two evils personally.

I haven't really thought through any use cases but my own (decoding Go's Gob format), so I'll be curious to see what others say.

@valarauca

This comment has been minimized.

Copy link

valarauca commented Mar 28, 2017

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 28, 2017

As long as an alternative function exists that may return a signal NaN.

I didn't think about that, having a second unsafe version of the function that can return a signaling NaN also makes sense to me, but also bloats the API: I'm not sure if the tradeoff there for something people may never (or rarely) use is worth it.

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 28, 2017

At the moment, I am somewhat inclined to clarify the contract of read_f32 and read_f64 to state that the return value is always defined. At an implementation level, we can mask out the signaling NaN bit and be done with this.

Whether we should add more (unsafe) methods to the API to support getting the f32/f64 without any masking isn't clear, because I don't actually know what the use cases are. And if there aren't any use cases (or if they are very rare/niche), then it seems reasonable to expect callers to drop into transmute.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

The discussion on rust-lang/rust#39271 also seems to have gone that way; I pushed a new commit that I think should turn sNaN's into qNaN's by flipping the most significant fraction bit. Review by someone who's more comfortable with floating point math than I am would be appreciated.

I'm also not 100% sure that the test I wrote for it doesn't introduce undefined behavior itself.

EDIT: On second glance, I think I messed up my masks for the qNaN and the sNaN… double checking which one has the MSB of the mantissa set and which doesn't… please hold.

Yup, I did; fixed.

@est31

This comment has been minimized.

Copy link

est31 commented Mar 29, 2017

Hmm maybe I should do it like you and just flip that single bit.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

@est31 I couldn't think of any cases where it would matter one way or the other, but I can't claim to have any real domain knowledge here. I'm not really sure who to go to for advice either; maybe some other project has done something similar and we could copy it or ask them?

@est31

This comment has been minimized.

Copy link

est31 commented Mar 29, 2017

@SamWhited for my use cases it wouldn't be really useful either, but apparently you can use the lower part of the fraction field for a payload. I think I'll just do something like:

if v.is_nan() { v |= 1 << pos_of_highest_bit_of_fraction_field }
@est31

This comment has been minimized.

Copy link

est31 commented Mar 29, 2017

Hmm just realized that is_nan is not defined on u64/u32. I'll keep the old version.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

@est31 If you have a u64 you should be able to do something like:

// The exponent is 1's  &&  the mantissa has at least one bit set 
   (n & 0x7FF == 0x7FF) && (n & 0x000FFFFFFFFFFFFF != 0)

to check for NaN

EDIT: oops, probably need to shift the exponent left 52 places.

src/lib.rs Outdated
#[test]
fn uint_bigger_buffer() {
use {ByteOrder, LittleEndian};
let n = LittleEndian::read_uint(&[1, 2, 3, 4, 5, 6, 7, 8], 5);
assert_eq!(n, 0x0504030201);
}

// TODO: How is transmute implemented? Does it count as an operation on the sNaN?
// Is this test undefined behavior?

This comment has been minimized.

@BurntSushi

BurntSushi Mar 29, 2017

Owner

By the time you call transmute, I think you've already convert the sNaN to a qNaN, so I think you're safe.

(Even so, since it's only a test, I'm fine with being in murky territory.)

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Rebased and removed TODO comments based on @BurntSushi's feedback. I think the last unopened question is what variants of this method do we want and what should they be named? Eg. do we want both an f32/f64 version fo this method? Do we want write methods (I haven't even thought about those), etc.

read_f{32,64} sounds closest to read_uint, but it's already taken. read_nf32? read_float<T: Float>?

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 29, 2017

@SamWhited Thanks! The read_uint and read_int methods always yield a u64 and i64, respectively. So to be congruous with those, I think we just want a single read_float that returns a f64.

Also, since you're adding the sNaN masking to read_float, would you be willing to also add it to read_f64/read_f32 as well?

And yes, since we have write_int/write_uint, we should also have write_float.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Also, since you're adding the sNaN masking to read_float, would you be willing to also add it to read_f64/read_f32 as well?

Sure thing

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Done ⤴

To reiterate, I'm reasonably sure this is correct and that all my lengths for the various parts of IEEE floats are correct (thanks to the lovely diagrams on Wikipedia), but review by someone who knows floats would be appreciated.

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Mar 29, 2017

@SamWhited I'm not a floating point expert either unfortunately, but I will do some reading and try to do an independent review before merging. It's important that this is right. (It might take me a little bit to get to though.)

src/lib.rs Outdated
let mut u = Self::read_u32(buf);
// The exponent is 1's && the mantissa has at least one bit set (aka. is_nan):
if (u & 0xFF<<23 == 0xFF<<23) && (u & 0x3FFFFF != 0) {
u |= 1;

This comment has been minimized.

@nagisa

nagisa Mar 29, 2017

Citing ieee754-2008:

All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit
string should be encoded with the first bit (d₁) of the trailing significand field T being 1. A signaling NaN
bit string should be encoded with the first bit of the trailing significand field being 0. If the first bit of the
trailing significand field is 0, some other bit of the trailing significand field must be non-zero to distinguish
the NaN from infinity. In the preferred encoding just described, a signaling NaN shall be quieted by setting
d₁ to 1, leaving the remaining bits of T unchanged.

I believe the d₁ here is the first bit after the exponent, not the least significant bit of the encoding, as per:

Figure 3.1 from ieee754-2008

making this code not actually mask out the signalling bit.

This comment has been minimized.

@SamWhited

SamWhited Mar 29, 2017

Author Contributor

facepalm You're right, it should be the MSB. Thanks!

This comment has been minimized.

@SamWhited

SamWhited Mar 29, 2017

Author Contributor

Fixed for the read_f{32,64} methods (I think, please double check my bit widths :) Thanks!).

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Note to self: Outstanding bug: the read_float implmentations don't actually detect nan's properly (they need to check every byte in the mantisa and see if any of them contain a set bit).

EDIT: Also fixed; it should be accounting for the entire mantissa now

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Another open question: I just noticed that read_f32 and read_f64 don't actually bother with endianness, which is fair since the floating point spec doesn't mention it. Should I do the same for read_float? Right now there are two different implementations, and the Linux kernel at least appears to consider endianness when reading floating point numbers, but I could see this going either way.

@nagisa

This comment has been minimized.

Copy link

nagisa commented Mar 29, 2017

Endianness is handled by read_u{32,64}. By the time these functions return, the data is in host endianness already.

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Oh right, nevermind. *dissapears

@SamWhited

This comment has been minimized.

Copy link
Contributor Author

SamWhited commented May 8, 2017

@BurntSushi ping; just wanted to make sure this didn't fall off the radar. No rush though.

@est31

This comment has been minimized.

Copy link

est31 commented Jun 17, 2017

@BurntSushi friendly ping :)

You said that you wanted to review the sNaN masking in this PR before we stabilize the transmute functions: rust-lang/rust#39271 (comment)

Would be nice to have stable float<-> int transmute in Rust 1.20.

@le-jzr

This comment has been minimized.

Copy link

le-jzr commented Jun 28, 2017

I'm no expert, but according to wikipedia, some implementations of IEEE 754 use opposite meaning for the signalling/quiet bit.
IMO the conversion should set the bit to a known-quiet reference, instead of always setting to one, otherwise it's still unsafe on some architectures.

@est31

This comment has been minimized.

Copy link

est31 commented Jul 3, 2017

@le-jzr great link! Setting the value to a known quiet NAN without any masking might indeed be the way to go forward.

@le-jzr

This comment has been minimized.

Copy link

le-jzr commented Jul 3, 2017

@BurntSushi

This comment has been minimized.

Copy link
Owner

BurntSushi commented Jul 8, 2017

@SamWhited @est31 In the interest of moving things forward, I've just merged the part of this PR that makes reading floats safe. In particular, I updated it to use @est31's implementation that is now in std.

@SamWhited I tried to salvage the read_float stuff, but I found the test coverage insufficient. I'm pretty sure there is something wrong with the implementation in this PR. In particular, I would expect the implementations to be built on top of {LittleEndian, BigEndian}::read_uint, but when I tried to do that, existing doc tests failed. I think it would be best to approach that in a separate PR, and in particular, add read_float/write_float together. I think there should be at least as many tests for read_float/write_float as there are for read_uint/write_uint.

@BurntSushi BurntSushi closed this Jul 8, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.