spaces at the beginning are being trimmed now (change with python 3.10.12) #706

MeggyCal · 2023-06-29T11:25:40Z

gh-102153: urllib.parse.urlsplit() now strips leading C0 control and space characters following the specification for URLs defined by WHATWG in response to CVE-2023-24329. Patch by Illia Volochii.

therefore some tests fail:

[   15s] =================================== FAILURES ===================================
[   15s] _________________ test_urlparse[\t   :foo.com   \n-expected8] __________________
[   15s] 
[   15s] uri = '\t   :foo.com   \n'
[   15s] expected = ParseResult(scheme='', netloc='', path='   :foo.com   ', params='', query='', fragment='')
[   15s] 
[   15s]     @pytest.mark.parametrize(
[   15s]         "uri, expected",
[   15s]         [
[   15s]             ("", ParseResult()),
[   15s]             ("http://example\t.\norg", ParseResult(scheme="http", netloc="example.org")),
[   15s]             (
[   15s]                 "http://user:pass@foo:21/bar;par?b#c",
[   15s]                 ParseResult(
[   15s]                     scheme="http",
[   15s]                     netloc="user:pass@foo:21",
[   15s]                     path="/bar",
[   15s]                     params="par",
[   15s]                     query="b",
[   15s]                     fragment="c",
[   15s]                 ),
[   15s]             ),
[   15s]             ("https://test:@test", ParseResult(scheme="https", netloc="test:@test")),
[   15s]             ("https://:@test", ParseResult(scheme="https", netloc=":@test")),
[   15s]             (
[   15s]                 "non-special://test:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc="test:@test", path="/x"),
[   15s]             ),
[   15s]             (
[   15s]                 "non-special://:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc=":@test", path="/x"),
[   15s]             ),
[   15s]             ("http:foo.com", ParseResult(scheme="http", path="foo.com")),
[   15s]             # NOTE(willkg): The wpt tests set the scheme to http becaue that's what
[   15s]             # the base url is. Since our parser is not using a baseurl, it sets the
[   15s]             # scheme to "". Further, our parser includes spaces at the beginning,
[   15s]             # but I don't see that as being problematic.
[   15s]             ("\t   :foo.com   \n", ParseResult(path="   :foo.com   ")),
[   15s]             # NOTE(willkg): The wpt tests set the path to "/foo/foo.com" because
[   15s]             # the base url is at "/foo"
[   15s]             (" foo.com  ", ParseResult(path=" foo.com  ")),
[   15s]             ("a:\t foo.com", ParseResult(scheme="a", path=" foo.com")),
[   15s]             (
[   15s]                 "http://f:21/ b ? d # e ",
[   15s]                 ParseResult(
[   15s]                     scheme="http", netloc="f:21", path="/ b ", query=" d ", fragment=" e "
[   15s]                 ),
[   15s]             ),
[   15s]             (
[   15s]                 "lolscheme:x x#x x",
[   15s]                 ParseResult(scheme="lolscheme", path="x x", fragment="x x"),
[   15s]             ),
[   15s]             ("http://f:/c", ParseResult(scheme="http", netloc="f:", path="/c")),
[   15s]             ("http://f:0/c", ParseResult(scheme="http", netloc="f:0", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests normalize the 0000000000000 to 0 so the
[   15s]             # netloc should be "f:0".
[   15s]             (
[   15s]                 "http://f:00000000000000/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000", path="/c"),
[   15s]             ),
[   15s]             # NOTE(willkg): The wpt tests drop the 0000000000000000000 altogether
[   15s]             # so the netloc should be "f".
[   15s]             (
[   15s]                 "http://f:00000000000000000000080/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000000000080", path="/c"),
[   15s]             ),
[   15s]             # This is an invalid ipv6 url
[   15s]             ("http://2001::1]", ValueError),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:b/c", ParseResult(scheme="http", netloc="f:b", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f: /c", ParseResult(scheme="http", netloc="f: ", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:999999/c", ParseResult(scheme="http", netloc="f:999999", path="/c")),
[   15s]         ],
[   15s]     )
[   15s]     def test_urlparse(uri, expected):
[   15s]     
[   15s]         if inspect.isclass(expected) and issubclass(expected, BaseException):
[   15s]             with pytest.raises(expected):
[   15s]                 urlparse(uri)
[   15s]     
[   15s]         else:
[   15s]             parsed = urlparse(uri)
[   15s]             print(parsed)
[   15s]             assert parsed.scheme == expected.scheme
[   15s]             assert parsed.netloc == expected.netloc
[   15s] >           assert parsed.path == expected.path
[   15s] E           AssertionError: assert ':foo.com   ' == '   :foo.com   '
[   15s] E             -    :foo.com   
[   15s] E             ? ---
[   15s] E             + :foo.com
[   15s] 
[   15s] tests/test_parse_shim.py:108: AssertionError
[   15s] ----------------------------- Captured stdout call -----------------------------
[   15s] ParseResult(scheme='', netloc='', path=':foo.com   ', params='', query='', fragment='')
[   15s] _____________________ test_urlparse[ foo.com  -expected9] ______________________
[   15s] 
[   15s] uri = ' foo.com  '
[   15s] expected = ParseResult(scheme='', netloc='', path=' foo.com  ', params='', query='', fragment='')
[   15s] 
[   15s]     @pytest.mark.parametrize(
[   15s]         "uri, expected",
[   15s]         [
[   15s]             ("", ParseResult()),
[   15s]             ("http://example\t.\norg", ParseResult(scheme="http", netloc="example.org")),
[   15s]             (
[   15s]                 "http://user:pass@foo:21/bar;par?b#c",
[   15s]                 ParseResult(
[   15s]                     scheme="http",
[   15s]                     netloc="user:pass@foo:21",
[   15s]                     path="/bar",
[   15s]                     params="par",
[   15s]                     query="b",
[   15s]                     fragment="c",
[   15s]                 ),
[   15s]             ),
[   15s]             ("https://test:@test", ParseResult(scheme="https", netloc="test:@test")),
[   15s]             ("https://:@test", ParseResult(scheme="https", netloc=":@test")),
[   15s]             (
[   15s]                 "non-special://test:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc="test:@test", path="/x"),
[   15s]             ),
[   15s]             (
[   15s]                 "non-special://:@test/x",
[   15s]                 ParseResult(scheme="non-special", netloc=":@test", path="/x"),
[   15s]             ),
[   15s]             ("http:foo.com", ParseResult(scheme="http", path="foo.com")),
[   15s]             # NOTE(willkg): The wpt tests set the scheme to http becaue that's what
[   15s]             # the base url is. Since our parser is not using a baseurl, it sets the
[   15s]             # scheme to "". Further, our parser includes spaces at the beginning,
[   15s]             # but I don't see that as being problematic.
[   15s]             ("\t   :foo.com   \n", ParseResult(path="   :foo.com   ")),
[   15s]             # NOTE(willkg): The wpt tests set the path to "/foo/foo.com" because
[   15s]             # the base url is at "/foo"
[   15s]             (" foo.com  ", ParseResult(path=" foo.com  ")),
[   15s]             ("a:\t foo.com", ParseResult(scheme="a", path=" foo.com")),
[   15s]             (
[   15s]                 "http://f:21/ b ? d # e ",
[   15s]                 ParseResult(
[   15s]                     scheme="http", netloc="f:21", path="/ b ", query=" d ", fragment=" e "
[   15s]                 ),
[   15s]             ),
[   15s]             (
[   15s]                 "lolscheme:x x#x x",
[   15s]                 ParseResult(scheme="lolscheme", path="x x", fragment="x x"),
[   15s]             ),
[   15s]             ("http://f:/c", ParseResult(scheme="http", netloc="f:", path="/c")),
[   15s]             ("http://f:0/c", ParseResult(scheme="http", netloc="f:0", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests normalize the 0000000000000 to 0 so the
[   15s]             # netloc should be "f:0".
[   15s]             (
[   15s]                 "http://f:00000000000000/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000", path="/c"),
[   15s]             ),
[   15s]             # NOTE(willkg): The wpt tests drop the 0000000000000000000 altogether
[   15s]             # so the netloc should be "f".
[   15s]             (
[   15s]                 "http://f:00000000000000000000080/c",
[   15s]                 ParseResult(scheme="http", netloc="f:00000000000000000000080", path="/c"),
[   15s]             ),
[   15s]             # This is an invalid ipv6 url
[   15s]             ("http://2001::1]", ValueError),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:b/c", ParseResult(scheme="http", netloc="f:b", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f: /c", ParseResult(scheme="http", netloc="f: ", path="/c")),
[   15s]             # NOTE(willkg): The wpt tests show this as a parse error, but our
[   15s]             # parser "parses" it.
[   15s]             ("http://f:999999/c", ParseResult(scheme="http", netloc="f:999999", path="/c")),
[   15s]         ],
[   15s]     )
[   15s]     def test_urlparse(uri, expected):
[   15s]     
[   15s]         if inspect.isclass(expected) and issubclass(expected, BaseException):
[   15s]             with pytest.raises(expected):
[   15s]                 urlparse(uri)
[   15s]     
[   15s]         else:
[   15s]             parsed = urlparse(uri)
[   15s]             print(parsed)
[   15s]             assert parsed.scheme == expected.scheme
[   15s]             assert parsed.netloc == expected.netloc
[   15s] >           assert parsed.path == expected.path
[   15s] E           AssertionError: assert 'foo.com  ' == ' foo.com  '
[   15s] E             -  foo.com  
[   15s] E             ? -
[   15s] E             + foo.com
[   15s] 
[   15s] tests/test_parse_shim.py:108: AssertionError
[   15s] ----------------------------- Captured stdout call -----------------------------
[   15s] ParseResult(scheme='', netloc='', path='foo.com  ', params='', query='', fragment='')

If I change them like in this PR, they succeed again. But they can be changed in many ways, so please choose the way which looks best to you.

(python/cpython#102153)

willkg · 2023-06-29T12:19:51Z

It'd be more helpful to write this up as an issue. Then we can discuss it there. Thank you!

MeggyCal · 2023-06-29T20:44:25Z

OK, I was just trying to provide possible solution.

spaces at the beginning are being trimmed now

749c169

(python/cpython#102153)

willkg closed this Jun 29, 2023

MeggyCal mentioned this pull request Jun 29, 2023

bug: using OpenSUSE and Fedora packages which change the Bleach code, parse_shim tests fail with Python 3.10.12 #707

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spaces at the beginning are being trimmed now (change with python 3.10.12) #706

spaces at the beginning are being trimmed now (change with python 3.10.12) #706

MeggyCal commented Jun 29, 2023 •

edited

willkg commented Jun 29, 2023

MeggyCal commented Jun 29, 2023

spaces at the beginning are being trimmed now (change with python 3.10.12) #706

spaces at the beginning are being trimmed now (change with python 3.10.12) #706

Conversation

MeggyCal commented Jun 29, 2023 • edited

willkg commented Jun 29, 2023

MeggyCal commented Jun 29, 2023

MeggyCal commented Jun 29, 2023 •

edited