
requests.utils._parse_content_type_header gives the wrong value when multiple headers with the same name are present. #6248

Open
iulian-birlica opened this issue Sep 28, 2022 · 2 comments

@iulian-birlica

The HTTP RFC (specifically section 4.2) allows multiple headers with the same name to be present, provided that they can be collapsed into a single header with comma-separated values.

When it encounters a response containing these headers:

Content-Type: text/html; charset=UTF-8
Content-Type: text/javascript

requests parses them into a single dictionary value:

{'Content-Type': 'text/html; charset=UTF-8, text/javascript', ...}

but requests.utils._parse_content_type_header then parses that combined value incorrectly, reporting the charset as UTF-8, text/javascript instead of UTF-8.

I expected requests to give the correct encoding, or at least one of the encodings from the two headers, preferably the first, UTF-8.

Instead, the encoding for the response is set to UTF-8, text/javascript.

Reproduction Steps

import requests

# Placeholder URL: any endpoint that sends two Content-Type header lines
res = requests.get("url-that-responds-with-two-content-headers")
print(res.encoding)  # prints "UTF-8, text/javascript" instead of "UTF-8"
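
For reference, the same mis-parse can be shown without a network call by feeding the combined header value to requests.utils directly. This sketch relies on the private helper _parse_content_type_header as shipped in 2.28.1 (it may move or change in other versions); the expected outputs are shown as comments:

import requests.utils

# The single value requests builds from the two Content-Type header lines
combined = "text/html; charset=UTF-8, text/javascript"

print(requests.utils._parse_content_type_header(combined))
# ('text/html', {'charset': 'UTF-8, text/javascript'})

print(requests.utils.get_encoding_from_headers({"content-type": combined}))
# 'UTF-8, text/javascript' -- not a usable codec name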

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "3.3"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.2"
  },
  "platform": {
    "release": "4.9.0-16-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101000cf"
  },
  "urllib3": {
    "version": "1.26.12"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}

Since converting the headers dict to hold a list of values would be much more invasive, my proposed fix is to ignore everything after the first comma: requests.utils._parse_content_type_header would start with tokens = header.split(',')[0].split(";").
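
As an illustration only, here is what that change could look like against the helper as it exists in 2.28.x (the surrounding code is paraphrased from that implementation, not copied verbatim; the only intended change is the added split on the comma):

def _parse_content_type_header(header):
    """Return (content_type, params_dict) from a Content-Type value.

    Sketch of the proposed fix: keep only the portion before the first
    comma, i.e. the first of the folded header values.
    """
    tokens = header.split(",")[0].split(";")  # proposed change
    content_type, params = tokens[0].strip(), tokens[1:]
    params_dict = {}
    items_to_strip = "\"' "
    for param in params:
        param = param.strip()
        if param:
            key, value = param, True
            index_of_equals = param.find("=")
            if index_of_equals != -1:
                key = param[:index_of_equals].strip(items_to_strip)
                value = param[index_of_equals + 1:].strip(items_to_strip)
            params_dict[key.lower()] = value
    return content_type, params_dict

With the example above, this returns ('text/html', {'charset': 'UTF-8'}), so res.encoding would be UTF-8.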

I can open a PR if this is confirmed as a bug and that is the agreed solution.

@sigmavirus24 (Contributor) commented Sep 29, 2022

A few things:

  1. I don't believe that requests is responsible for collapsing the headers here. I think one of our dependencies needs to be more careful with folding.

  2. Two Content-Type headers are not foldable, if my memory serves correctly, nor are the semantics of two Content-Type headers defined. Your assumption that UTF-8 is the correct choice here relies on undefined behavior, to the best of my knowledge. There's no clarity on a case like this:

Content-Type: text/html; charset=cp-1252
Content-Type: text/plain; charset=utf-8

Which is the correct Content-Type header? Those charsets are not entirely compatible. The server is sending mixed signals and shouldn't be sending two headers.

If we attempt to solve this, how do we handle N charsets from M Content-Type headers? Which one should be picked?

I agree the current behavior isn't great, but I don't think we should define behavior arbitrarily, because that is going to break someone else. I would prefer that we raise an exception if we can detect this case, but that's slightly backwards incompatible, as this might "just work" today if the server reversed the order in which it sends the headers.

@goelbenj

Perhaps a warning could be raised instead? That would be backwards compatible, and the warning message could describe the sunsetting of this behaviour, with the exception-raising behaviour introduced in a future version.
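
A minimal sketch of what such a warning could look like; the helper name, the naive comma check, and the warning category are assumptions for illustration, not an agreed design:

import warnings

def _warn_on_folded_content_type(header_value):
    # Naive check: a comma in the combined Content-Type value usually means
    # multiple Content-Type header lines were folded together (a comma can
    # also appear inside a quoted parameter value, so this is not bulletproof).
    if "," in header_value:
        warnings.warn(
            "Multiple Content-Type values were folded into %r; the resulting "
            "charset is ambiguous and a future version may raise an exception "
            "here." % header_value,
            UserWarning,
        )

_warn_on_folded_content_type("text/html; charset=UTF-8, text/javascript")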
