Spec-conformant HTML pages with properly-encoded multi-parameter URLs hand the agent a broken URL #170

Description

@Shyam-723

Bug

_extract_links in web/fetcher.py reads href attribute values directly from raw HTML without decoding HTML entities. Because valid HTML requires & inside attribute values to be written as &amp;amp;, a page containing something like:

<a href="https://example.com/search?q=foo&amp;lang=en">Search</a>

causes the fetcher to surface https://example.com/search?q=foo&amp;lang=en as a link. When the web agent later calls fetch_url with that string, the request is sent with a literal &amp; in the query string, which many servers either reject or silently misparse, so the agent ends up fetching the wrong page or getting a 400.
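A minimal sketch of the failure and one possible fix, assuming the hrefs are pulled out of the raw markup with something like a regex. `HREF_RE`, `extract_links_raw`, and `extract_links_decoded` are hypothetical stand-ins for illustration; the actual `_extract_links` in web/fetcher.py may be structured differently. The only real API relied on here is `html.unescape` from the Python standard library.

```python
import html
import re

# Hypothetical stand-in for raw-HTML href extraction; not the real
# _extract_links implementation.
HREF_RE = re.compile(r'href="([^"]*)"')


def extract_links_raw(page: str) -> list[str]:
    """Return href values exactly as written in the markup (entities intact)."""
    return HREF_RE.findall(page)


def extract_links_decoded(page: str) -> list[str]:
    """Return href values with HTML entities decoded, e.g. &amp; becomes &."""
    return [html.unescape(href) for href in HREF_RE.findall(page)]


page = '<a href="https://example.com/search?q=foo&amp;lang=en">Search</a>'

print(extract_links_raw(page))
# ['https://example.com/search?q=foo&amp;lang=en']
print(extract_links_decoded(page))
# ['https://example.com/search?q=foo&lang=en']
```

Decoding once, at extraction time, would keep every later consumer (including fetch_url) working with the URL the page author intended. An HTML parser such as the standard library's `html.parser.HTMLParser` also decodes attribute values by default, which may be a sturdier option than post-processing regex matches.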

Repro
Any real-world page whose query-string links are correctly HTML-encoded (which is the spec-required form) triggers this. For example, Google Search results, GitHub search pages, and most CMS-generated pages encode & as &amp;amp; in href attributes.
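The effect can also be checked offline, without hitting a live site. The sketch below parses the query string of the undecoded URL the way a typical server-side parser that splits on `&` would. `query_params` is a hypothetical helper written for this illustration, and real servers may behave differently.

```python
import html
from urllib.parse import parse_qs, urlsplit


def query_params(url: str) -> dict[str, list[str]]:
    """Parse a URL's query string, splitting parameters on '&' only."""
    return parse_qs(urlsplit(url).query, separator="&")


broken = "https://example.com/search?q=foo&amp;lang=en"

print(query_params(broken))
# {'q': ['foo'], 'amp;lang': ['en']}
print(query_params(html.unescape(broken)))
# {'q': ['foo'], 'lang': ['en']}
```

With the undecoded URL, the `lang` parameter never reaches the server under its own name; it arrives as `amp;lang`, which is consistent with the wrong-page and 400 outcomes described above.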
