Special characters like ampersand (&) have their encoding changed, even when they should be considered valid HTML #856

Closed
@CaptainStack

I noticed this issue while using PreMailer.Net and was able to trace the cause to AngleSharp. I describe the problem in more detail in an issue on PreMailer.Net's repository.

When the HTML I parse contains a special character, like an ampersand, AngleSharp escapes the character when the document is serialized. For example:

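using System;
using System.Threading.Tasks;
using AngleSharp;

static async Task FirstExample()
{
    // Parse a document whose <p> contains a bare ampersand.
    var config = Configuration.Default;
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(req => req.Content("<html><head></head><body><p>&</p></body></html>"));
    // Serialize the DOM back to markup and print it.
    Console.WriteLine(document.DocumentElement.OuterHtml);
}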

The code above takes the input:

<html><head></head><body><p>&</p></body></html>

And will output:

<html><head></head><body><p>&amp;</p></body></html>
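
To narrow down where the change happens, a small sketch like the one below compares the parsed text of the paragraph with its serialized markup. The class and method names (EscapeCheck, InspectAsync) are hypothetical and only for illustration. It assumes the escaping is applied when the DOM is written back out as a string rather than when the input is parsed, which is what comparing TextContent with InnerHtml should reveal.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;

static class EscapeCheck
{
    // Hypothetical helper: returns the parsed text and the serialized
    // markup of the first <p> element found in the given input.
    public static async Task<(string Text, string Html)> InspectAsync(string markup)
    {
        var context = BrowsingContext.New(Configuration.Default);
        var document = await context.OpenAsync(req => req.Content(markup));
        var paragraph = document.QuerySelector("p");

        // TextContent is the character data held in the DOM.
        // InnerHtml is that same data serialized back to markup.
        return (paragraph.TextContent, paragraph.InnerHtml);
    }

    public static async Task Main()
    {
        var (text, html) = await InspectAsync("<html><head></head><body><p>&</p></body></html>");
        Console.WriteLine($"TextContent: {text}");
        Console.WriteLine($"InnerHtml:   {html}");
    }
}
```

If the two values differ, the character stored in the document is untouched and only the string produced by OuterHtml or InnerHtml carries the escaped form.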

While researching this problem, which many users have reported and been affected by, I found the following statement from @FlorianRappl in this closed issue:

This is by specification, see the string escaping that needs to be applied on attribute values.

However, as my example above demonstrates, this is not just escaping attribute values. It is escaping the element's inner HTML content, and unless I am mistaken, a bare ampersand there is valid HTML. I am not aware of any standard that says HTML content may only contain escaped strings.
