
requests.utils._parse_content_type_header gives the wrong value when multiple headers with the same name are present. #6248

Open
iulian-birlica opened this issue Sep 28, 2022 · 2 comments

@iulian-birlica

The HTTP RFC (specifically section 4.2) allows multiple headers with the same name to be present, provided that they can be collapsed into a single header with comma-separated values.

When it encounters a response containing these headers:

Content-Type: text/html; charset=UTF-8
Content-Type: text/javascript

requests parses them into a single dictionary value:

{'Content-Type': 'text/html; charset=UTF-8, text/javascript', ...}

but requests.utils._parse_content_type_header then parses that combined value incorrectly, reporting the charset as UTF-8, text/javascript instead of UTF-8.

I expected requests to give the correct encoding, or at least one of the encodings from the two headers, preferably the first, UTF-8.

Instead, the encoding for the response is set to UTF-8, text/javascript.

Reproduction Steps

import requests

# Placeholder URL: any endpoint that sends two Content-Type header lines
res = requests.get("url-that-responds-with-two-content-headers")
print(res.encoding)  # prints "UTF-8, text/javascript" instead of "UTF-8"
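
For reference, the same mis-parse can be shown without a network call by feeding the combined header value to requests.utils directly. This sketch relies on the private helper _parse_content_type_header as shipped in 2.28.1 (it may move or change in other versions); the expected outputs are shown as comments:

import requests.utils

# The single value requests builds from the two Content-Type header lines
combined = "text/html; charset=UTF-8, text/javascript"

print(requests.utils._parse_content_type_header(combined))
# ('text/html', {'charset': 'UTF-8, text/javascript'})

print(requests.utils.get_encoding_from_headers({"content-type": combined}))
# 'UTF-8, text/javascript' -- not a usable codec name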

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "3.3"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.2"
  },
  "platform": {
    "release": "4.9.0-16-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101000cf"
  },
  "urllib3": {
    "version": "1.26.12"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}

Since converting the headers dict to hold a list of values would be much more invasive, my proposed fix is to ignore everything after the first comma: requests.utils._parse_content_type_header would start with tokens = header.split(',')[0].split(";").
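
As an illustration only, here is what that change could look like against the helper as it exists in 2.28.x (the surrounding code is paraphrased from that implementation, not copied verbatim; the only intended change is the added split on the comma):

def _parse_content_type_header(header):
    """Return (content_type, params_dict) from a Content-Type value.

    Sketch of the proposed fix: keep only the portion before the first
    comma, i.e. the first of the folded header values.
    """
    tokens = header.split(",")[0].split(";")  # proposed change
    content_type, params = tokens[0].strip(), tokens[1:]
    params_dict = {}
    items_to_strip = "\"' "
    for param in params:
        param = param.strip()
        if param:
            key, value = param, True
            index_of_equals = param.find("=")
            if index_of_equals != -1:
                key = param[:index_of_equals].strip(items_to_strip)
                value = param[index_of_equals + 1:].strip(items_to_strip)
            params_dict[key.lower()] = value
    return content_type, params_dict

With the example above, this returns ('text/html', {'charset': 'UTF-8'}), so res.encoding would be UTF-8.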

I can open a PR if this is confirmed as a bug and that is the agreed solution.

@sigmavirus24 (Contributor) commented Sep 29, 2022

A few things:

  1. I don't believe that requests is responsible for collapsing the headers here. I think one of our dependencies needs to be more careful with folding.

  2. Two Content-Type headers are not foldable, if my memory serves correctly, nor are the semantics of two Content-Type headers defined. Your assumption that UTF-8 is the correct choice here relies on undefined behavior, to the best of my knowledge. There's no clarity on a case like this:

Content-Type: text/html; charset=cp-1252
Content-Type: text/plain; charset=utf-8

Which is the correct Content-Type header? Those charsets are not entirely compatible. The server is sending mixed signals and shouldn't be sending two headers.

If we attempt to solve this, how do we handle N charsets from M Content-Type headers? Which one should be picked?

I agree the current behavior isn't great, but I don't think we should define behavior arbitrarily, because that is going to break someone else. I would prefer that we raise an exception if we can detect this case, but that's slightly backwards incompatible, as this might "just work" today if the server reversed the order in which it sends the headers.

@goelbenj

Perhaps a warning could be raised instead? That would be backwards compatible, and the warning message could describe the sunsetting of this behaviour, with the exception-raising behaviour introduced in a future version.
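
A minimal sketch of what such a warning could look like; the helper name, the naive comma check, and the warning category are assumptions for illustration, not an agreed design:

import warnings

def _warn_on_folded_content_type(header_value):
    # Naive check: a comma in the combined Content-Type value usually means
    # multiple Content-Type header lines were folded together (a comma can
    # also appear inside a quoted parameter value, so this is not bulletproof).
    if "," in header_value:
        warnings.warn(
            "Multiple Content-Type values were folded into %r; the resulting "
            "charset is ambiguous and a future version may raise an exception "
            "here." % header_value,
            UserWarning,
        )

_warn_on_folded_content_type("text/html; charset=UTF-8, text/javascript")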
