Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a method to read the first bytes of a float (and assume the rest are 0) #70

Closed
wants to merge 7 commits into from
Closed

Conversation

SamWhited
Copy link
Contributor

@SamWhited SamWhited commented Mar 1, 2017

I'm not sure if this is broadly useful enough to be worth adding to the library, but I've found myself using a similar function to this several times in a recent project so I thought I'd submit a PR in case you wanted it.

The idea is to read the first n bytes of an f64 as a uint and then assume that the rest of the bytes are 0. This is very useful if you're parsing lots of compressed floats in little-endian format where the low bits (which are often zero) can be dropped (eg. due to a form of run length encoding, or other compression that drops sequences of zeros).

@BurntSushi
Copy link
Owner

See #71

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 1, 2017

@BurntSushi I don't think this is related, is it (other than the fact that this method would also have a similar problem)? This PR is for a general float method that reads less than 8 bytes (and assumes the rest are 0), but it's likely that I'm just misunderstanding what you meant.

@BurntSushi
Copy link
Owner

@SamWhited It's related in that it would be adding more methods with the same problem as the existing read_f32/read_f64 methods.

FWIW, we do already have read_uint/read_int, which seem like analogous methods, right? So I think I'm at least in principle in favor of these.

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 1, 2017

Ah yes, if you consider reading signaling NaN's a problem (although that sounds like it's up in the air? I'm inclined to say that it is a problem personally EDIT: Yup, they cause undefined behavior in LLVM.) this would be adding another method with the same problem. So maybe it's something for the 2.0 release (or maybe it doesn't really cause any issues, since there are already methods with the problem so having another doesn't really change anything)?

FWIW, we do already have read_uint/read_int, which seem like analogous methods, right? So I think I'm at least in principle in favor of these.

Yes, it's more or less analogous

@BurntSushi
Copy link
Owner

One interesting idea brought up in the other thread was that we could mask out the signaling NaN bit before doing the transmute to a float. I don't really like that, but we either do that, or we need to change the API to return a Result for floating point conversions.

cc @est31 @retep998 @nagisa @Amanieu @petrochenkov @dwrensha @valarauca @alexcrichton

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 28, 2017

we could mask out the signaling NaN bit

I don't love the idea of losing that information; if we are decoding something to a signaling NaN we either want to know in case it was intentional, or we want to know because it's a bug (or maybe it actually means something) in the encoded stream and we need to display some error, or take some action.

Having float operations return a result feels poor to me too because there's the overhead of unwrapping a result on what appears at first glance to be a simple operation that should "just work", but since it's not actually that simple under the hood I think this is the lesser of two evils personally.

I haven't really thought through any use cases but my own (decoding Go's Gob format), so I'll be curious to see what others say.

@valarauca
Copy link

valarauca commented Mar 28, 2017 via email

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 28, 2017

As long as an alternative function exists that may return a signal NaN.

I didn't think about that, having a second unsafe version of the function that can return a signaling NaN also makes sense to me, but also bloats the API: I'm not sure if the tradeoff there for something people may never (or rarely) use is worth it.

@BurntSushi
Copy link
Owner

At the moment, I am somewhat inclined to clarify the contract of read_f32 and read_f64 to state that the return value is always defined. At an implementation level, we can mask out the signaling NaN bit and be done with this.

Whether we should add more (unsafe) methods to the API to support getting the f32/f64 without any masking isn't clear, because I don't actually know what the use cases are. And if there aren't any use cases (or if they are very rare/niche), then it seems reasonable to expect callers to drop into transmute.

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

The discussion on rust-lang/rust#39271 also seems to have gone that way; I pushed a new commit that I think should turn sNaN's into qNaN's by flipping the most significant fraction bit. Review by someone who's more comfortable with floating point math than I am would be appreciated.

I'm also not 100% sure that the test I wrote for it doesn't introduce undefined behavior itself.

EDIT: On second glance, I think I messed up my masks for the qNaN and the sNaN… double checking which one has the MSB of the mantissa set and which doesn't… please hold.

Yup, I did; fixed.

@est31
Copy link

est31 commented Mar 29, 2017

Hmm maybe I should do it like you and just flip that single bit.

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

@est31 I couldn't think of any cases where it would matter one way or the other, but I can't claim to have any real domain knowledge here. I'm not really sure who to go to for advice either; maybe some other project has done something similar and we could copy it or ask them?

@est31
Copy link

est31 commented Mar 29, 2017

@SamWhited for my use cases it wouldn't be really useful either, but apparently you can use the lower part of the fraction field for a payload. I think I'll just do something like:

if v.is_nan() { v |= 1 << pos_of_highest_bit_of_fraction_field }

@est31
Copy link

est31 commented Mar 29, 2017

Hmm just realized that is_nan is not defined on u64/u32. I'll keep the old version.

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

@est31 If you have a u64 you should be able to do something like:

// The exponent is 1's  &&  the mantissa has at least one bit set 
   (n & 0x7FF == 0x7FF) && (n & 0x000FFFFFFFFFFFFF != 0)

to check for NaN

EDIT: oops, probably need to shift the exponent left 52 places.

src/lib.rs Outdated
#[test]
fn uint_bigger_buffer() {
use {ByteOrder, LittleEndian};
let n = LittleEndian::read_uint(&[1, 2, 3, 4, 5, 6, 7, 8], 5);
assert_eq!(n, 0x0504030201);
}

// TODO: How is transmute implemented? Does it count as an operation on the sNaN?
// Is this test undefined behavior?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time you call transmute, I think you've already convert the sNaN to a qNaN, so I think you're safe.

(Even so, since it's only a test, I'm fine with being in murky territory.)

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Rebased and removed TODO comments based on @BurntSushi's feedback. I think the last unopened question is what variants of this method do we want and what should they be named? Eg. do we want both an f32/f64 version fo this method? Do we want write methods (I haven't even thought about those), etc.

read_f{32,64} sounds closest to read_uint, but it's already taken. read_nf32? read_float<T: Float>?

@BurntSushi
Copy link
Owner

@SamWhited Thanks! The read_uint and read_int methods always yield a u64 and i64, respectively. So to be congruous with those, I think we just want a single read_float that returns a f64.

Also, since you're adding the sNaN masking to read_float, would you be willing to also add it to read_f64/read_f32 as well?

And yes, since we have write_int/write_uint, we should also have write_float.

@SamWhited
Copy link
Contributor Author

Also, since you're adding the sNaN masking to read_float, would you be willing to also add it to read_f64/read_f32 as well?

Sure thing

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Done ⤴

To reiterate, I'm reasonably sure this is correct and that all my lengths for the various parts of IEEE floats are correct (thanks to the lovely diagrams on Wikipedia), but review by someone who knows floats would be appreciated.

@BurntSushi
Copy link
Owner

@SamWhited I'm not a floating point expert either unfortunately, but I will do some reading and try to do an independent review before merging. It's important that this is right. (It might take me a little bit to get to though.)

src/lib.rs Outdated
let mut u = Self::read_u32(buf);
// The exponent is 1's && the mantissa has at least one bit set (aka. is_nan):
if (u & 0xFF<<23 == 0xFF<<23) && (u & 0x3FFFFF != 0) {
u |= 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citing ieee754-2008:

All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit
string should be encoded with the first bit (d₁) of the trailing significand field T being 1. A signaling NaN
bit string should be encoded with the first bit of the trailing significand field being 0. If the first bit of the
trailing significand field is 0, some other bit of the trailing significand field must be non-zero to distinguish
the NaN from infinity. In the preferred encoding just described, a signaling NaN shall be quieted by setting
d₁ to 1, leaving the remaining bits of T unchanged.

I believe the d₁ here is the first bit after the exponent, not the least significant bit of the encoding, as per:

Figure 3.1 from ieee754-2008

making this code not actually mask out the signalling bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

facepalm You're right, it should be the MSB. Thanks!

Copy link
Contributor Author

@SamWhited SamWhited Mar 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed for the read_f{32,64} methods (I think, please double check my bit widths :) Thanks!).

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Note to self: Outstanding bug: the read_float implmentations don't actually detect nan's properly (they need to check every byte in the mantisa and see if any of them contain a set bit).

EDIT: Also fixed; it should be accounting for the entire mantissa now

@SamWhited
Copy link
Contributor Author

Another open question: I just noticed that read_f32 and read_f64 don't actually bother with endianness, which is fair since the floating point spec doesn't mention it. Should I do the same for read_float? Right now there are two different implementations, and the Linux kernel at least appears to consider endianness when reading floating point numbers, but I could see this going either way.

@nagisa
Copy link

nagisa commented Mar 29, 2017

Endianness is handled by read_u{32,64}. By the time these functions return, the data is in host endianness already.

@SamWhited
Copy link
Contributor Author

SamWhited commented Mar 29, 2017

Oh right, nevermind. *dissapears

@SamWhited
Copy link
Contributor Author

@BurntSushi ping; just wanted to make sure this didn't fall off the radar. No rush though.

@est31
Copy link

est31 commented Jun 17, 2017

@BurntSushi friendly ping :)

You said that you wanted to review the sNaN masking in this PR before we stabilize the transmute functions: rust-lang/rust#39271 (comment)

Would be nice to have stable float<-> int transmute in Rust 1.20.

@le-jzr
Copy link

le-jzr commented Jun 28, 2017

I'm no expert, but according to wikipedia, some implementations of IEEE 754 use opposite meaning for the signalling/quiet bit.
IMO the conversion should set the bit to a known-quiet reference, instead of always setting to one, otherwise it's still unsafe on some architectures.

@est31
Copy link

est31 commented Jul 3, 2017

@le-jzr great link! Setting the value to a known quiet NAN without any masking might indeed be the way to go forward.

@le-jzr
Copy link

le-jzr commented Jul 3, 2017

@BurntSushi
Copy link
Owner

@SamWhited @est31 In the interest of moving things forward, I've just merged the part of this PR that makes reading floats safe. In particular, I updated it to use @est31's implementation that is now in std.

@SamWhited I tried to salvage the read_float stuff, but I found the test coverage insufficient. I'm pretty sure there is something wrong with the implementation in this PR. In particular, I would expect the implementations to be built on top of {LittleEndian, BigEndian}::read_uint, but when I tried to do that, existing doc tests failed. I think it would be best to approach that in a separate PR, and in particular, add read_float/write_float together. I think there should be at least as many tests for read_float/write_float as there are for read_uint/write_uint.

@BurntSushi BurntSushi closed this Jul 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants