Encoding gets messed up for CDATA #108

Closed
WhyNotHugo opened this issue Oct 16, 2023 · 4 comments

Comments

@WhyNotHugo
Contributor

This applies to master and not to 0.18.1.

Copy this into src/lib.rs and run cargo test:

#[test]
fn test_multi_get_parse_encoding_another() {
    let calendar_data = ExpandedName::from(("urn:ietf:params:xml:ns:caldav", "calendar-data"));

    let body = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<C:calendar-data xmlns:C=\"urn:ietf:params:xml:ns:caldav\"><![CDATA[BEGIN(de baño)VCALENDAR\r\n]]></C:calendar-data>\n";
    assert!(body.contains("baño"));

    let body = std::str::from_utf8(body.as_ref()).unwrap();
    assert!(body.contains("baño"));

    let doc = Document::parse(body).unwrap();

    let raw_data = doc
        .descendants()
        .find(|node| node.tag_name() == calendar_data)
        .unwrap()
        .text()
        .unwrap();
    std::dbg!(&raw_data);
    assert!(raw_data.contains("baño"));
}

Basically, this parses a string into a roxmltree document and then reads the text from the CDATA section. baño turns into baÃ±o (that's why the std::dbg! is there).

@WhyNotHugo
Contributor Author

This regresses in 4f3566b.

@WhyNotHugo
Contributor Author

WhyNotHugo commented Oct 16, 2023

The issue is in StringExt. The way it pushes u8 into String is not correct. Consider:

fn main() {
    let string = String::from("baño");
    assert_eq!(5, string.len());
    assert_eq!(5, string.as_bytes().len());
    assert_eq!(4, string.chars().count());
    assert_eq!("baño", string);

    let mut string2 = String::new();
    for byte in string.bytes() {
        string2.push(byte as char);
    }
    assert_eq!(7, string2.len());
    assert_eq!(7, string2.as_bytes().len());
    assert_eq!(5, string2.chars().count());
    assert_eq!("baÃ±o", string2);
}
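
One way to avoid the problem, as a minimal sketch: decode the whole byte slice as UTF-8 and append the resulting string, instead of casting each byte to char. The helper names append_utf8 and append_bytewise are hypothetical; they are not roxmltree's StringExt API and this is not necessarily how the library was fixed.

```rust
use std::str::Utf8Error;

/// Hypothetical helper (not part of roxmltree): appends `bytes` to `out`,
/// decoding the slice as UTF-8 as a whole rather than one byte at a time.
/// Returns an error if `bytes` is not valid UTF-8.
fn append_utf8(out: &mut String, bytes: &[u8]) -> Result<(), Utf8Error> {
    let text = std::str::from_utf8(bytes)?;
    out.push_str(text);
    Ok(())
}

/// The faulty byte-by-byte approach, kept only for comparison.
/// Each byte becomes its own char, so multi-byte sequences are split up.
fn append_bytewise(out: &mut String, bytes: &[u8]) {
    for &byte in bytes {
        out.push(byte as char);
    }
}
```

Calling append_utf8 with "baño".as_bytes() leaves the string at 5 bytes, while append_bytewise grows it to 7 bytes because the two bytes of ñ are each re-encoded as a separate two-byte character.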

@WhyNotHugo
Contributor Author

Converting individual bytes to characters and pushing those individually is not the same as pushing a single multi-byte character.

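If I understand correctly, converting u8 to char maps each byte to the Unicode code point with the same value (U+0000 to U+00FF), which is equivalent to decoding ISO-8859-1: https://doc.rust-lang.org/std/primitive.char.html#impl-From%3Cu8%3E-for-char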
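
To make the byte-level effect concrete, here is a small sketch showing what ends up in a String after a single byte is converted with char::from. The function name bytes_after_cast is made up for illustration.

```rust
/// Hypothetical illustration: returns the UTF-8 bytes stored in a `String`
/// after one input byte is converted to `char` and pushed.
fn bytes_after_cast(byte: u8) -> Vec<u8> {
    let mut s = String::new();
    // `char::from(u8)` yields the code point with the same numeric value.
    s.push(char::from(byte));
    s.into_bytes()
}
```

ASCII bytes survive unchanged (b'a' stays a single byte 0x61), but the two bytes of ñ (0xC3, 0xB1) become U+00C3 and U+00B1, which are stored as 0xC3 0x83 and 0xC2 0xB1 respectively.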

@RazrFalcon
Owner

Ugh... my bad. Didn't think about that.
