"infer_deserialize" inference causes loss of precision #212

bchallenor · 2020-10-18T23:27:15Z

What version of the `csv` crate are you using?

1.1.3

Briefly describe the question, bug or feature request.

rust-csv's implementation of deserialize_any will try to guess the type of the CSV field based on its contents, so if the field looks like a float, it will end up creating an f64 and passing it to visit_f64. This is a problem for types whose serialization looks like that of a float, but in fact is more precise than a float. For these types, it is important that they are deserialized from the original str, without an intermediate conversion to float.

Include a complete program demonstrating a problem.

use serde::{Deserialize, Deserializer};
use std::fmt;

#[derive(Debug, PartialEq)]
struct BigNum(String);

impl BigNum {
    pub fn new(s: String) -> Option<BigNum> {
        for b in s.bytes() {
            if b < b'0' || b > b'9' {
                return None;
            }
        }
        Some(BigNum(s))
    }
}

impl<'de> Deserialize<'de> for BigNum {
    fn deserialize<D>(deserializer: D) -> Result<Self, <D as Deserializer<'de>>::Error>
    where
        D: Deserializer<'de>,
    {
        struct Visitor;

        impl<'de> serde::de::Visitor<'de> for Visitor {
            type Value = BigNum;

            fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
                write!(formatter, "a decimal value")
            }

            // Could also implement u8, u16, u32 and u64, but they won't be needed here
            fn visit_u128<E>(self, v: u128) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                Ok(BigNum(v.to_string()))
            }

            fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
            where
                E: serde::de::Error,
            {
                BigNum::new(v.into()).ok_or_else(|| E::custom("not a decimal value"))
            }
        }

        deserializer.deserialize_any(Visitor)
    }
}

#[derive(Deserialize, Debug, PartialEq)]
struct Row {
    num: BigNum,
}

fn main() {
    // this is u128::max_value() + 1
    let csv = "num\r\n340282366920938463463374607431768211456\r\n";
    let mut reader = csv::ReaderBuilder::new().from_reader(csv.as_bytes());
    let row = reader.deserialize::<Row>().next().unwrap().unwrap();
    assert_eq!(
        Row {
            num: BigNum("340282366920938463463374607431768211456".into())
        },
        row
    );
}

What is the observed behavior of the code above?

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Deserialize { pos: Some(Position { byte: 4, line: 1, record: 1 }), err: DeserializeError { field: None, kind: Message("invalid type: floating point `340282366920938500000000000000000000000`, expected a decimal value") } })', src/main.rs:61:59

What is the expected or desired behavior of the code above?

Assertion succeeds.

Proposed change

I tried patching rust-csv to disable the type inference in infer_deserialize:

    fn infer_deserialize<'de, V: Visitor<'de>>(
        &mut self,
        visitor: V,
    ) -> Result<V::Value, DeserializeError> {
        let x = self.next_field()?;
        visitor.visit_str(x)
    }

This causes the assertion to succeed. Could this be configurable?

An alternative that does not involve patching rust-csv would be to change deserializer.deserialize_any(Visitor) to deserializer.deserialize_str(Visitor) in BigNum. While it would fix this case, it would prevent BigNum from being deserialized via another (non-CSV) serde backend that happens to produce integers.

infer_deserialize also causes similar problems where deserialize_any is invoked by Serde itself, rather than by user code. I think this corresponds to cases where Serde needs to do backtracking over the same input. #151 is one such example, and I've run into another with #[serde(untagged)] (I can open another issue for that if you like). The proposed change would fix those too.

Serde says "Untagged and internally tagged enums are only supported in self-describing formats" but I think a legitimate view of CSV is that it is a self-describing format where every type is a string.

The text was updated successfully, but these errors were encountered:

qm3ster · 2020-12-02T18:05:31Z

Seems like a lot of problems stem from this type inference. Which cases does it actually improve?

BurntSushi · 2020-12-02T18:14:47Z

Which cases does it actually improve?

For when you need to call deserialize_any.

I don't really understand this bug report. deserialize_any is being called, which is basically akin to saying, "I don't know what type to expect." The docs even specifically to say to avoid using it in a Deserialize implementation:

Require the Deserializer to figure out how to drive the visitor based on what data type is in the input.

When implementing Deserialize, you should avoid relying on Deserializer::deserialize_any unless you need to be told by the Deserializer what type is in the input. Know that relying on Deserializer::deserialize_any means your data type will be able to deserialize from self-describing formats only, ruling out Bincode and many others.

bchallenor · 2020-12-02T23:14:46Z

I think my mental model of the serde Deserialize API was incorrect when I opened this issue. I was assuming that serde's Deserialize instance for (e.g.) i32 would know how to deserialize an i32 not only from an i32, but also from a str. If this were the case, rust-csv could just infer every type as a string, and rely on the specific Deserialize instance to parse the string. But now I'm pretty sure that the string parsing has to happen in rust-csv itself, so the change I suggested to infer_deserialize would not really work.

You might be right that I shouldn't be calling deserialize_any from inside Deserialize above. If I change the call from deserialize_any to deserialize_str then the assertion passes.

There are certain serde features that rely on deserialize_any though, e.g. untagged enums, and in those cases the type inference is still problematic for precision. Specifically, it is a problem when a numeric type is inferred that is less precise than the value represented by the string. But that is already covered by another issue so I'm happy to close this one.

BurntSushi · 2020-12-03T02:04:16Z

Yeah, I think there are two problems working against you here. Firstly, Serde is quite complex and it supports some use cases a lot better than others. Secondly, CSV is a really weak format. It doesn't have any data types. So if the csv crate wants to support the deserialize_any API (and I think it does), it has to do something. And if it has to give up some weird corner cases where f64 isn't enough, then that's probably the right choice.

If there's a way to have our cake and eat it too, then great.

But to be honest, it's hard for me to make progress on these sorts of problems. Serde and its interaction with the csv crate is so complex that I have to re-contextualize in order to come up with an answer, and re-contextualizing is not cheap. :-/

qm3ster · 2020-12-04T12:19:47Z

@bchallenor I think it would be better for your type to do a kind of its own deserialize_any by doing deserialize_string first, and then if that fails trying other types instead of bailing.

bchallenor closed this as completed Dec 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"infer_deserialize" inference causes loss of precision #212

"infer_deserialize" inference causes loss of precision #212

bchallenor commented Oct 18, 2020

qm3ster commented Dec 2, 2020

BurntSushi commented Dec 2, 2020

bchallenor commented Dec 2, 2020

BurntSushi commented Dec 3, 2020

qm3ster commented Dec 4, 2020

"infer_deserialize" inference causes loss of precision #212

"infer_deserialize" inference causes loss of precision #212

Comments

bchallenor commented Oct 18, 2020

What version of the csv crate are you using?

Briefly describe the question, bug or feature request.

Include a complete program demonstrating a problem.

What is the observed behavior of the code above?

What is the expected or desired behavior of the code above?

Proposed change

qm3ster commented Dec 2, 2020

BurntSushi commented Dec 2, 2020

bchallenor commented Dec 2, 2020

BurntSushi commented Dec 3, 2020

qm3ster commented Dec 4, 2020

What version of the `csv` crate are you using?