Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help deserialize mixed tags and string in body $value (html text formatting) #147

Open
Rudo2204 opened this issue Feb 5, 2021 · 9 comments

Comments

@Rudo2204
Copy link

Rudo2204 commented Feb 5, 2021

I'm trying to deserialize some dictionary defitnitions and came across this one which contains mixed multiple tags with normal string (html text formatting).

<div style="margin-left:2em"><b>1</b> 〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《<i>pl</i>. quizzes》.</div>

I looked around in serde-xml-rs tests and tried this solution which seems to be close but it doesn't quite work

#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$value")]
    String,
    i(String),
}

The error I'm getting is:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom("unknown variant `〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《`, expected one of `b`, `$value`, `i`")'

I can make it work for now by not using MyEnum and just use definition: Vec<String>, but then I wouldn't know which text is bold and which is italic.
How can I properly deserialize this?

@brassy-endomorph
Copy link

Has this been addressed? And if so, can it be documented somewhere? Being able to quickly process something like say XHTML would be very helpful.

@jac-cbi
Copy link

jac-cbi commented Mar 2, 2023

I think I might have encountered a similar, but different bug. I'm struggling to create a concise test case replicating the error.

thread 'requests::tests::basic_request_list' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { field: "unknown variant `DataBContent`, expected one of `RequestA`, `RequestB`, /* Other enums elided */" }', src/requests.rs:327:49

It seems to be a bad interaction between enum and Vec<>. Within my request types, one of the struct fields may be repeated. My first version of the code and test cases only permitted one instance of said field.

At a high level:

#[derive(Debug, Serialize, Deserialize, PartialEq)]
enum Requests {
    RequestA(RequestA),
    RequestB(RequestB),
    /* ... */
    RequestN(RequestN),
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct RequestA {
    ID: u64,
    DataAContent: DataAContent,
}

/* elided other struct RequestXX definitions */

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct DataAContent {
    Field1: String,
    Field2: u64,
    Field3: Email,
    Line: Line,       // <--- This is the line that may occur multiple times in the XML
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
Struct Line {
    Name: String,
    Data: String,
}

and my change:

     Field1: String,
     Field2: u64,
     Field3: Email,
-    Line: Line,
+    #[serde(rename = "$value")]
+    lines: Vec<Line>,
 }
 
 Struct Line {

Without the change, my unit tests pass successfully. With the change, I get the error above. Am I doing something stupid?

Note that I also update the assert_eq!() to index into the Vec

Here's some XML that should match the above:

<RequestA ID=1>
  <DataAContent>
    <Field1>Some text</Field1>
    <Field2>1000000</Field2>
    <Field3>me@example.com</Field3>
    <Line>
        <Name>Adjust</Name>
        <Data>This ia a quite long string of text</Data>
    </Line>
  </DataAContent>
</RequestA>
<RequestB ID=2>
  <DataBContent>
  <!------- elided ------>
  </DataBContent>
</RequestB>
<!------ elided -------->
<RequestN ID=72>

If I had to guess, It seems like the cursor isn't getting set properly when it encounters only a single item Line for the Vec. What I can't figure out is if I'm doing something wrong, or if I exercised well used code (serde-xml-rs, serge-derive, serde) in such a weird way that I've found a bug no one else has encountered?

Another Note: I've added test cases within serde-xml-rs to exercise Vec<Line> with XML containing only a single instance of Line. It passes with flying colors. Sigh. Any thoughts?

@jac-cbi
Copy link

jac-cbi commented Mar 2, 2023

Ok, I was able to reproduce the error with a unit test inside of serde-xml-rs

https://github.com/jac-cbi/serde-xml-rs/tree/fix/enum_vec_single_element

@punkstarman pls take a look

@jac-cbi
Copy link

jac-cbi commented Mar 2, 2023

If you put the enum fields in alpha order, no error (B before C), but if B is after C, it blows up.

@jac-cbi
Copy link

jac-cbi commented Mar 6, 2023

Ok, making progress on understanding how this works. My current theory is that data structures containing Vec of elements within a Vec of elements blows up. Here's why:

It seems that the child node inherits state from the parent node src/de/mod.rs:147

impl<'de, R: Read, B: BufferedXmlReader<R>> Deserializer<R, B> {
    fn child<'a>(&'a mut self) -> Deserializer<R, ChildXmlBuffer<'a, R>> {
        let Deserializer {
            buffered_reader,
            depth,
            is_map_value,
            non_contiguous_seq_elements,
            ..
        } = self;

        Deserializer {
            buffered_reader: buffered_reader.child_buffer(),
            depth: *depth,
            is_map_value: *is_map_value,   // <------ here
            non_contiguous_seq_elements: *non_contiguous_seq_elements,
            marker: PhantomData,
        }
    }

But then descending into the next child node consumes the state src/de/mod:194

    /// If `self.is_map_value`: Performs the read operations specified by `f` on the inner content of an XML element.
    /// `f` is expected to consume the entire inner contents of the element. The cursor will be moved to the end of the
    /// element.
    /// If `!self.is_map_value`: `f` will be performed without additional checks/advances for an outer XML element.
    fn read_inner_value<V: de::Visitor<'de>, T, F: FnOnce(&mut Self) -> Result<T>>(
        &mut self,
        f: F,
    ) -> Result<T> {
        if self.unset_map_value() {     // <-------------- Here
            debug_expect!(self.next(), Ok(XmlEvent::StartElement { name, .. }) => {
                let result = f(self)?;
                self.expect_end_element(name)?;
                Ok(result)
            })
        } else {
            f(self)
        }
    }

Here's .unset_map_value()

    pub fn unset_map_value(&mut self) -> bool {
        ::std::mem::replace(&mut self.is_map_value, false)
    }

@punkstarman , am I on the right track? Shouldn't the state have a parent_map_value so that we can track this more accurately? It seems to be written on the assumption that there will only ever be one Vec of elements in a given XML data structure. I'm not sure I'm understanding this correctly yet

@jac-cbi
Copy link

jac-cbi commented Mar 6, 2023

I've added some more unit tests attempting to isolate the bug. It seems that Vec<SomeEnum> where the enum elements contain Vec elements themselves is the core use case. enum is required. Without the enum, things work as expected.

@jac-cbi
Copy link

jac-cbi commented Mar 8, 2023

I reworked my code to not use Vec<enum> at all. Looks good so far.

The core problem is that nested Vec just can't handle enum types in either the outer or the inner Vec. Sigh. This is the first thing I've found that Rust can't do well.

For now, I'm just going to work around it. If I encounter it again, I'll look into a more proper fix.

@punkstarman
Copy link
Collaborator

Thanks @jac-cbi for spending so much time and sharing your thoughts and learning. I've been busy with prep for a conference. I'll try to look into this more this weekend.

@jac-cbi
Copy link

jac-cbi commented Mar 9, 2023

So I'm not good at GitHub. But it appears this has diverged significantly from the original issue. I'd like to keep this convo intact, but it should really be a separate issue. If you know how to fix that, pls do so and add me to the new issue.

As for the bug itself, I'm def still interested in a fix, as my outer Vec of many types has a counter in common that I need to sort on. It's much cleaner to do that if the outer is Vec<enum>.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants