Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"infer_deserialize" inference causes loss of precision #212

Closed
bchallenor opened this issue Oct 18, 2020 · 5 comments
Closed

"infer_deserialize" inference causes loss of precision #212

bchallenor opened this issue Oct 18, 2020 · 5 comments

Comments

@bchallenor
Copy link

What version of the csv crate are you using?

1.1.3

Briefly describe the question, bug or feature request.

rust-csv's implementation of deserialize_any will try to guess the type of the CSV field based on its contents, so if the field looks like a float, it will end up creating an f64 and passing it to visit_f64. This is a problem for types whose serialization looks like that of a float, but in fact is more precise than a float. For these types, it is important that they are deserialized from the original str, without an intermediate conversion to float.

Include a complete program demonstrating a problem.

use serde::{Deserialize, Deserializer};
use std::fmt;

#[derive(Debug, PartialEq)]
struct BigNum(String);

impl BigNum {
    pub fn new(s: String) -> Option<BigNum> {
        for b in s.bytes() {
            if b < b'0' || b > b'9' {
                return None;
            }
        }
        Some(BigNum(s))
    }
}

impl<'de> Deserialize<'de> for BigNum {
    fn deserialize<D>(deserializer: D) -> Result<Self, <D as Deserializer<'de>>::Error>
    where
        D: Deserializer<'de>,
    {
        struct Visitor;

        impl<'de> serde::de::Visitor<'de> for Visitor {
            type Value = BigNum;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                write!(formatter, "a decimal value")
            }

            // Could also implement u8, u16, u32 and u64, but they won't be needed here
            fn visit_u128<E>(self, v: u128) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                Ok(BigNum(v.to_string()))
            }

            fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                BigNum::new(v.into()).ok_or_else(|| E::custom("not a decimal value"))
            }
        }

        deserializer.deserialize_any(Visitor)
    }
}

#[derive(Deserialize, Debug, PartialEq)]
struct Row {
    num: BigNum,
}

fn main() {
    // this is u128::max_value() + 1
    let csv = "num\r\n340282366920938463463374607431768211456\r\n";
    let mut reader = csv::ReaderBuilder::new().from_reader(csv.as_bytes());
    let row = reader.deserialize::<Row>().next().unwrap().unwrap();
    assert_eq!(
        Row {
            num: BigNum("340282366920938463463374607431768211456".into())
        },
        row
    );
}

What is the observed behavior of the code above?

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Deserialize { pos: Some(Position { byte: 4, line: 1, record: 1 }), err: DeserializeError { field: None, kind: Message("invalid type: floating point `340282366920938500000000000000000000000`, expected a decimal value") } })', src/main.rs:61:59

What is the expected or desired behavior of the code above?

Assertion succeeds.

Proposed change

I tried patching rust-csv to disable the type inference in infer_deserialize:

    fn infer_deserialize<'de, V: Visitor<'de>>(
        &mut self,
        visitor: V,
    ) -> Result<V::Value, DeserializeError> {
        let x = self.next_field()?;
        visitor.visit_str(x)
    }

This causes the assertion to succeed. Could this be configurable?

An alternative that does not involve patching rust-csv would be to change deserializer.deserialize_any(Visitor) to deserializer.deserialize_str(Visitor) in BigNum. While it would fix this case, it would prevent BigNum from being deserialized via another (non-CSV) serde backend that happens to produce integers.

infer_deserialize also causes similar problems where deserialize_any is invoked by Serde itself, rather than by user code. I think this corresponds to cases where Serde needs to do backtracking over the same input. #151 is one such example, and I've run into another with #[serde(untagged)] (I can open another issue for that if you like). The proposed change would fix those too.

Serde says "Untagged and internally tagged enums are only supported in self-describing formats" but I think a legitimate view of CSV is that it is a self-describing format where every type is a string.

@qm3ster
Copy link

qm3ster commented Dec 2, 2020

Seems like a lot of problems stem from this type inference. Which cases does it actually improve?

@BurntSushi
Copy link
Owner

Which cases does it actually improve?

For when you need to call deserialize_any.

I don't really understand this bug report. deserialize_any is being called, which is basically akin to saying, "I don't know what type to expect." The docs even specifically to say to avoid using it in a Deserialize implementation:

Require the Deserializer to figure out how to drive the visitor based on what data type is in the input.

When implementing Deserialize, you should avoid relying on Deserializer::deserialize_any unless you need to be told by the Deserializer what type is in the input. Know that relying on Deserializer::deserialize_any means your data type will be able to deserialize from self-describing formats only, ruling out Bincode and many others.

@bchallenor
Copy link
Author

I think my mental model of the serde Deserialize API was incorrect when I opened this issue. I was assuming that serde's Deserialize instance for (e.g.) i32 would know how to deserialize an i32 not only from an i32, but also from a str. If this were the case, rust-csv could just infer every type as a string, and rely on the specific Deserialize instance to parse the string. But now I'm pretty sure that the string parsing has to happen in rust-csv itself, so the change I suggested to infer_deserialize would not really work.

You might be right that I shouldn't be calling deserialize_any from inside Deserialize above. If I change the call from deserialize_any to deserialize_str then the assertion passes.

There are certain serde features that rely on deserialize_any though, e.g. untagged enums, and in those cases the type inference is still problematic for precision. Specifically, it is a problem when a numeric type is inferred that is less precise than the value represented by the string. But that is already covered by another issue so I'm happy to close this one.

@BurntSushi
Copy link
Owner

Yeah, I think there are two problems working against you here. Firstly, Serde is quite complex and it supports some use cases a lot better than others. Secondly, CSV is a really weak format. It doesn't have any data types. So if the csv crate wants to support the deserialize_any API (and I think it does), it has to do something. And if it has to give up some weird corner cases where f64 isn't enough, then that's probably the right choice.

If there's a way to have our cake and eat it too, then great.

But to be honest, it's hard for me to make progress on these sorts of problems. Serde and its interaction with the csv crate is so complex that I have to re-contextualize in order to come up with an answer, and re-contextualizing is not cheap. :-/

@qm3ster
Copy link

qm3ster commented Dec 4, 2020

@bchallenor I think it would be better for your type to do a kind of its own deserialize_any by doing deserialize_string first, and then if that fails trying other types instead of bailing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants