Mu-L · pull · May 23, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,26 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## 0.23.0 (23rd May, 2022)
+
+### Changed
+
+* Drop support for Python 3.6. (#2097)
+* Use `utf-8` as the default character set, instead of falling back to `charset-normalizer` for auto-detection. To enable automatic character set detection, see [the documentation](https://www.python-httpx.org/advanced/#character-set-encodings-and-auto-detection). (#2165)
+
+### Fixed
+
+* Fix `URL.copy_with` for some oddly formed URL cases. (#2185)
+* Digest authentication should use case-insensitive comparison for determining which algorithm is being used. (#2204)
+* Fix console markup escaping in command line client. (#1866)
+* When files are used in multipart upload, ensure we always seek to the start of the file. (#2065)
+* Ensure that `iter_bytes` never yields zero-length chunks. (#2068)
+* Preserve `Authorization` header for redirects that are to the same origin, but are an `http`-to-`https` upgrade. (#2074)
+* When responses have binary output, don't print the output to the console in the command line client. Use output like `<16086 bytes of binary data>` instead. (#2076)
+* Fix display of `--proxies` argument in the command line client help. (#2125)
+* Close responses when task cancellations occur during stream reading. (#2156)
+* Fix type error on accessing `.request` on `HTTPError` exceptions. (#2158)
+
 ## 0.22.0 (26th January, 2022)
 
 ### Added

diff --git a/README.md b/README.md
@@ -128,7 +128,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
   * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
   * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

diff --git a/README_chinese.md b/README_chinese.md
@@ -129,7 +129,6 @@ HTTPX项目依赖于这些优秀的库:
   * `h11` - HTTP/1.1 support.
   * `h2` - HTTP/2 support. *(Optional, with `httpx[http2]`)*
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
   * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

diff --git a/docs/advanced.md b/docs/advanced.md
@@ -145,6 +145,88 @@ URL('http://httpbin.org/headers')
 
 For a list of all available client parameters, see the [`Client`](api.md#client) API reference.
 
+---
+
+## Character set encodings and auto-detection
+
+When accessing `response.text`, we need to decode the response bytes into a unicode text representation.
+
+By default `httpx` will use `"charset"` information included in the response `Content-Type` header to determine how the response bytes should be decoded into text.
+
+In cases where no charset information is included on the response, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.
+
+### Using the default encoding
+
+To understand this better let's start by looking at the default behaviour for text decoding...
+
+```python
+import httpx
+# Instantiate a client with the default configuration.
+client = httpx.Client()
+# Using the client...
+response = client.get(...)
+print(response.encoding)  # Prints the charset given in the Content-Type
+                          # header, or else "utf-8".
+print(response.text)  # The text is decoded using the Content-Type
+                      # charset, or else "utf-8".
+```
+
+This is normally fine. Most servers respond with a properly formatted Content-Type header that includes a charset. Where no charset is included, UTF-8 is the most likely encoding, since it is so widely adopted.
+
+### Using an explicit encoding
+
+In some cases we might be making requests to a site where no character set information is being set explicitly by the server, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.
+
+```python
+import httpx
+# Instantiate a client with a Japanese character set as the default encoding.
+client = httpx.Client(default_encoding="shift-jis")
+# Using the client...
+response = client.get(...)
+print(response.encoding)  # Prints the charset given in the Content-Type
+                          # header, or else "shift-jis".
+print(response.text)  # The text is decoded using the Content-Type
+                      # charset, or else "shift-jis".
+```
+
+### Using character set auto-detection
+
+In cases where the server is not reliably including character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.
+
+To use auto-detection you need to set the `default_encoding` argument to a callable instead of a string. This callable should be a function which takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.
+
+There are two widely used Python packages which both handle this functionality:
+
+* [`chardet`](https://chardet.readthedocs.io/) - This is a well established package, and is a port of [the auto-detection code in Mozilla](https://www-archive.mozilla.org/projects/intl/chardet.html).
+* [`charset-normalizer`](https://charset-normalizer.readthedocs.io/) - A newer package, motivated by `chardet`, with a different approach.
+
+Let's take a look at enabling auto-detection using one of these packages...
+
+```shell
+$ pip install httpx
+$ pip install chardet
+```
+
+Once `chardet` is installed, we can configure a client to use character-set autodetection.
+
+```python
+import httpx
+import chardet
+
+def autodetect(content):
+    return chardet.detect(content).get("encoding")
+
+# Using a client with character-set autodetection enabled.
+client = httpx.Client(default_encoding=autodetect)
+response = client.get(...)
+print(response.encoding)  # Prints the charset given in the Content-Type
+                          # header, or else the auto-detected
+                          # character set.
+print(response.text)
+```
+
+---
+
 ## Calling into Python Web Apps
 
 You can configure an `httpx` client to call directly into a Python web application using the WSGI protocol.
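Review note on the character set section added in this hunk: the docs say `default_encoding` may be a callable that takes the response bytes and returns the name of the encoding to use. The sketch below shows the shape of such a callable without any third-party detector. It is only an illustration: `guess_encoding` and its candidate list are hypothetical names chosen here, and trial decoding is a much cruder heuristic than what `chardet` or `charset-normalizer` do.

```python
from typing import Sequence


def guess_encoding(
    content: bytes,
    candidates: Sequence[str] = ("utf-8", "shift_jis", "cp1252"),
) -> str:
    """Return the first candidate encoding that decodes `content` cleanly.

    The candidate order is an assumption for illustration; put the
    strictest encodings first, because permissive single-byte encodings
    accept almost any input.
    """
    for candidate in candidates:
        try:
            content.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    # Nothing matched: fall back to the client's documented default.
    return "utf-8"


# Intended use, per the `default_encoding` parameter added in this PR:
#   client = httpx.Client(default_encoding=guess_encoding)
```

Because the callable only runs when the `Content-Type` header carries no usable charset, a server that declares its encoding is unaffected by this heuristic.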

diff --git a/docs/async.md b/docs/async.md
@@ -170,27 +170,6 @@ trio.run(main)
     The `trio` package must be installed to use the Trio backend.
 
 
-### [Curio](https://github.com/dabeaz/curio)
-
-Curio is a [coroutine-based library](https://curio.readthedocs.io/en/latest/tutorial.html)
-for concurrent Python systems programming.
-
-```python
-import httpx
-import curio
-
-async def main():
-    async with httpx.AsyncClient() as client:
-        response = await client.get('https://www.example.com/')
-        print(response)
-
-curio.run(main)
-```
-
-!!! important
-    The `curio` package must be installed to use the Curio backend.
-
-
 ### [AnyIO](https://github.com/agronholm/anyio)
 
 AnyIO is an [asynchronous networking and concurrency library](https://anyio.readthedocs.io/) that works on top of either `asyncio` or `trio`. It blends in with native libraries of your chosen backend (defaults to `asyncio`).

diff --git a/docs/index.md b/docs/index.md
@@ -109,7 +109,6 @@ The HTTPX project relies on these excellent libraries:
 * `httpcore` - The underlying transport implementation for `httpx`.
   * `h11` - HTTP/1.1 support.
 * `certifi` - SSL certificates.
-* `charset_normalizer` - Charset auto-detection.
 * `rfc3986` - URL parsing & normalization.
   * `idna` - Internationalized domain name support.
 * `sniffio` - Async library autodetection.

diff --git a/httpx/__version__.py b/httpx/__version__.py
@@ -1,3 +1,3 @@
 __title__ = "httpx"
 __description__ = "A next generation HTTP client, for Python 3."
-__version__ = "0.22.0"
+__version__ = "0.23.0"
diff --git a/httpx/_client.py b/httpx/_client.py
@@ -168,6 +168,7 @@ def __init__(
         ] = None,
         base_url: URLTypes = "",
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         event_hooks = {} if event_hooks is None else event_hooks
 
@@ -185,6 +186,7 @@ def __init__(
             "response": list(event_hooks.get("response", [])),
         }
         self._trust_env = trust_env
+        self._default_encoding = default_encoding
         self._netrc = NetRCInfo()
         self._state = ClientState.UNOPENED
 
@@ -611,6 +613,9 @@ class Client(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """
 
     def __init__(
@@ -637,6 +642,7 @@ def __init__(
         transport: typing.Optional[BaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -649,6 +655,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )
 
         if http2:
@@ -1002,6 +1009,7 @@ def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding
 
         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"
@@ -1326,6 +1334,9 @@ class AsyncClient(BaseClient):
     rather than sending actual network requests.
     * **trust_env** - *(optional)* Enables or disables usage of environment
     variables for configuration.
+    * **default_encoding** - *(optional)* The default encoding to use for decoding
+    response text, if no charset information is included in a response Content-Type
+    header. Set to a callable for automatic character set detection. Default: "utf-8".
     """
 
     def __init__(
@@ -1352,6 +1363,7 @@ def __init__(
         transport: typing.Optional[AsyncBaseTransport] = None,
         app: typing.Optional[typing.Callable] = None,
         trust_env: bool = True,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         super().__init__(
             auth=auth,
@@ -1364,6 +1376,7 @@ def __init__(
             event_hooks=event_hooks,
             base_url=base_url,
             trust_env=trust_env,
+            default_encoding=default_encoding,
         )
 
         if http2:
@@ -1708,6 +1721,7 @@ async def _send_single_request(self, request: Request) -> Response:
             response.stream, response=response, timer=timer
         )
         self.cookies.extract_cookies(response)
+        response.default_encoding = self._default_encoding
 
         status = f"{response.status_code} {response.reason_phrase}"
         response_line = f"{response.http_version} {status}"

diff --git a/httpx/_models.py b/httpx/_models.py
@@ -7,8 +7,6 @@
 from collections.abc import MutableMapping
 from http.cookiejar import Cookie, CookieJar
 
-import charset_normalizer
-
 from ._content import ByteStream, UnattachedStream, encode_request, encode_response
 from ._decoders import (
     SUPPORTED_DECODERS,
@@ -445,6 +443,7 @@ def __init__(
         request: typing.Optional[Request] = None,
         extensions: typing.Optional[dict] = None,
         history: typing.Optional[typing.List["Response"]] = None,
+        default_encoding: typing.Union[str, typing.Callable[[bytes], str]] = "utf-8",
     ):
         self.status_code = status_code
         self.headers = Headers(headers)
@@ -461,6 +460,8 @@ def __init__(
         self.is_closed = False
         self.is_stream_consumed = False
 
+        self.default_encoding = default_encoding
+
         if stream is None:
             headers, stream = encode_response(content, text, html, json)
             self._prepare(headers)
@@ -569,14 +570,18 @@ def encoding(self) -> typing.Optional[str]:
 
         * `.encoding = <>` has been set explicitly.
         * The encoding as specified by the charset parameter in the Content-Type header.
-        * The encoding as determined by `charset_normalizer`.
-        * UTF-8.
+        * The encoding as determined by `default_encoding`, which may either be
+          a string like "utf-8" indicating the encoding to use, or may be a callable
+          which enables charset autodetection.
         """
         if not hasattr(self, "_encoding"):
             encoding = self.charset_encoding
             if encoding is None or not is_known_encoding(encoding):
-                encoding = self.apparent_encoding
-            self._encoding = encoding
+                if isinstance(self.default_encoding, str):
+                    encoding = self.default_encoding
+                elif hasattr(self, "_content"):
+                    encoding = self.default_encoding(self._content)
+            self._encoding = encoding or "utf-8"
         return self._encoding
 
     @encoding.setter
@@ -598,19 +603,6 @@ def charset_encoding(self) -> typing.Optional[str]:
 
         return params["charset"].strip("'\"")
 
-    @property
-    def apparent_encoding(self) -> typing.Optional[str]:
-        """
-        Return the encoding, as determined by `charset_normalizer`.
-        """
-        content = getattr(self, "_content", b"")
-        if len(content) < 32:
-            # charset_normalizer will issue warnings if we run it with
-            # fewer bytes than this cutoff.
-            return None
-        match = charset_normalizer.from_bytes(self.content).best()
-        return None if match is None else match.encoding
-
     def _get_content_decoder(self) -> ContentDecoder:
         """
         Returns a decoder instance which can be used to decode the raw byte
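Review note on the `Response.encoding` change in this hunk: the lookup order is now the `Content-Type` charset, then `default_encoding` (a string or a callable), then `"utf-8"`. The stdlib-only sketch below models that order as a free function so it can be read in isolation. It is a simplified model rather than the library's code: `resolve_encoding` and `charset_from_content_type` are hypothetical helpers, header parsing uses `email.message.Message` as a stand-in, and the "content not yet read" case handled by `hasattr(self, "_content")` in the diff is left out.

```python
import codecs
from email.message import Message
from typing import Callable, Optional, Union

DefaultEncoding = Union[str, Callable[[bytes], Optional[str]]]


def charset_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    if content_type is None:
        return None
    message = Message()
    message["content-type"] = content_type
    charset = message.get_param("charset")
    return charset if isinstance(charset, str) else None


def is_known_encoding(name: str) -> bool:
    """True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def resolve_encoding(
    content_type: Optional[str],
    content: bytes,
    default_encoding: DefaultEncoding = "utf-8",
) -> str:
    encoding = charset_from_content_type(content_type)
    if encoding is None or not is_known_encoding(encoding):
        if isinstance(default_encoding, str):
            encoding = default_encoding
        else:
            # A callable performs auto-detection and may return None.
            encoding = default_encoding(content)
    return encoding or "utf-8"
```

The final `or "utf-8"` matters for detectors such as the `autodetect` example in the docs, which can return `None` when no encoding is recognised.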

diff --git a/requirements.txt b/requirements.txt
@@ -4,7 +4,10 @@
 # Reference: https://github.com/encode/httpx/pull/1721#discussion_r661241588
 -e .[brotli,cli,http2,socks]
 
-charset-normalizer==2.0.6
+# Optional charset auto-detection
+# Used in our test cases
+chardet==4.0.0
+types-chardet==4.0.4
 
 # Documentation
 mkdocs==1.3.0

diff --git a/setup.py b/setup.py
@@ -57,7 +57,6 @@ def get_packages(package):
     zip_safe=False,
     install_requires=[
         "certifi",
-        "charset_normalizer",
         "sniffio",
         "rfc3986[idna2008]>=1.3,<2",
         "httpcore>=0.15.0,<0.16.0",