Response can contain invalid UTF-8 sequences #2976

Closed
faucct opened this issue Aug 28, 2018 · 2 comments
Labels
st-wontfix Known issue, no plans to fix it currently

Comments

faucct commented Aug 28, 2018

curl http://:8123 --data "SELECT '\xF8' FROM system.one" -v | hexdump
* Rebuilt URL to: http://:8123/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to  (::1) port 8123 (#0)
> POST / HTTP/1.1
> Host: :8123
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 19
> Content-Type: application/x-www-form-urlencoded
> 
} [19 bytes data]
* upload completely sent off: 19 out of 19 bytes
< HTTP/1.1 200 OK
< Date: Tue, 28 Aug 2018 07:33:43 GMT
< Connection: Keep-Alive
< Content-Type: text/tab-separated-values; charset=UTF-8
< X-ClickHouse-Server-Display-Name: f5cef54f26d7
< Transfer-Encoding: chunked
< Keep-Alive: timeout=3
< 
{ [7 bytes data]
100    21    0     2  100    19    206   1961 --:--:-- --:--:-- --:--:--  2111
* Connection #0 to host  left intact
0000000 f8 0a                                          
0000002
curl http://:8123 --data "SELECT '\xF8' FROM system.one" | iconv -f UTF-8
iconv: (stdin):1:0: incomplete character or shift sequence

Am I right in understanding that Content-Type: text/tab-separated-values; charset=UTF-8 means that the response should contain C3 B8 instead of F8?
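
To make the byte-level question concrete, here is a minimal Python sketch (not part of the original report; the variable names are illustrative). It checks that the lone byte F8 is not valid UTF-8, which is consistent with the iconv error above, and shows that C3 B8 is what U+00F8 looks like once encoded as UTF-8.

```python
# The single byte the server returned for SELECT '\xF8' (before the trailing newline).
raw = b"\xf8"

# 0xF8 cannot start any valid UTF-8 sequence, so strict decoding fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# C3 B8 is the UTF-8 encoding of the code point U+00F8.
utf8_for_u00f8 = "\u00f8".encode("utf-8")

print(is_valid_utf8)         # False
print(utf8_for_u00f8.hex())  # c3b8
```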

faucct commented Aug 28, 2018

I think the only way to make the header consistent with the response encoding is to change the header, since converting the response body to UTF-8 would most likely double-encode it.

alexey-milovidov commented

Yes, charset=UTF-8 is technically wrong.
Strings in ClickHouse can contain arbitrary binary data,
which is typically assumed to be UTF-8 for text strings,
but the data type and the server itself are charset-agnostic.

We definitely should not recode this binary data in either direction. If we wrote C3 B8 instead of F8, we would be assuming that our data is in latin1 or some similar encoding, which is even more wrong. (This is a very widespread antipattern.)
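
A small Python sketch of why recoding is risky, under the assumption that the stored string is already valid UTF-8 (the variable names and the "treat bytes as latin1" step are hypothetical, for illustration only). Re-encoding such a value turns every byte of the original sequence into its own multi-byte sequence, which is the double encoding mentioned earlier in the thread.

```python
# A value that is already stored as valid UTF-8: U+00F8 is C3 B8.
stored = "\u00f8".encode("utf-8")

# Hypothetical "fix": treat every byte as latin1 and re-encode it as UTF-8.
recoded = stored.decode("latin-1").encode("utf-8")

print(stored.hex())             # c3b8
print(recoded.hex())            # c383c2b8
print(recoded.decode("utf-8"))  # two characters (U+00C3, U+00B8) instead of one
```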

But if we simply remove charset=UTF-8, browsers will not display our data correctly when our strings actually are UTF-8. And assuming UTF-8 is a reasonable default.
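
Since the header stays as it is, a client that may receive arbitrary bytes could decode leniently instead of strictly. The following is only a sketch of one possible client-side approach, not something proposed in this thread; `decode_tsv_body` is a hypothetical helper name. It uses Python's `surrogateescape` error handler so that undecodable bytes survive a round trip unchanged.

```python
def decode_tsv_body(body: bytes) -> str:
    """Decode as UTF-8, mapping undecodable bytes to lone surrogates."""
    return body.decode("utf-8", errors="surrogateescape")


# Response body from the report: the byte F8 followed by a newline.
body = b"\xf8\n"

text = decode_tsv_body(body)

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(roundtrip == body)  # True
```

Valid UTF-8 input decodes normally under this handler, so only the bytes that are not UTF-8 are affected.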

blinkov added the st-wontfix (Known issue, no plans to fix it currently) label Sep 4, 2018
blinkov closed this as completed Sep 4, 2018