Bug
_extract_links in web/fetcher.py reads href attribute values directly from raw HTML without decoding HTML entities. Because valid HTML requires & inside attribute values to be written as &amp;, a page containing something like:
<a href="https://example.com/search?q=foo&amp;lang=en">Search</a>
causes the fetcher to surface https://example.com/search?q=foo&amp;lang=en as a link. When the web agent later calls fetch_url with that string, the request is sent with a literal &amp; in the query string, which many servers either reject or silently misparse, so the agent ends up fetching the wrong page or getting a 400.
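To illustrate the misparse, here is a small sketch that uses Python's urllib.parse as a stand-in for a server-side query parser. That stand-in is an assumption: real servers differ in how they split query strings. It also assumes a recent Python (3.10 or later), where only & separates parameters by default.

```python
from urllib.parse import parse_qs, urlsplit

# The link as the fetcher would surface it when the entity is left undecoded.
raw_link = "https://example.com/search?q=foo&amp;lang=en"
# The link the page author intended.
intended_link = "https://example.com/search?q=foo&lang=en"

# parse_qs stands in for whatever parser the receiving server uses.
raw_params = parse_qs(urlsplit(raw_link).query)
intended_params = parse_qs(urlsplit(intended_link).query)

print(raw_params)
print(intended_params)
```

Under these assumptions the undecoded link yields a parameter named amp;lang instead of lang, so a server looking for lang would not find it.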
Repro
Any real-world page whose query-string links are correctly HTML-encoded (the spec-required form) triggers this. For example, Google Search results, GitHub search pages, and most CMS-generated pages encode & as &amp; in href attributes.
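A minimal repro and one possible fix can be sketched as follows. The body of _extract_links is not shown in this report, so naive_extract_links is a hypothetical stand-in that reads href values from raw markup with a regex, and extract_links_decoded is an illustrative name for a variant that applies html.unescape from the standard library to each value.

```python
import html
import re

# Hypothetical stand-in for _extract_links: pulls href values straight
# out of the raw markup without decoding entities.
HREF_RE = re.compile(r'href="([^"]*)"')


def naive_extract_links(page: str) -> list[str]:
    """Return href values exactly as they appear in the markup."""
    return HREF_RE.findall(page)


def extract_links_decoded(page: str) -> list[str]:
    """Return href values with character references such as &amp; decoded."""
    return [html.unescape(href) for href in HREF_RE.findall(page)]


page = '<a href="https://example.com/search?q=foo&amp;lang=en">Search</a>'
print(naive_extract_links(page))
print(extract_links_decoded(page))
```

The first call reproduces the reported behaviour (the link still contains &amp;). The second returns a URL that can be passed to fetch_url as-is. Whether the real fix belongs inside _extract_links or at the parsing layer is a design choice this sketch does not settle.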