fix: fix the bytes encode/decode for redis cache#153
gontarzpawel merged 4 commits into ContentSquare:master
An example to reproduce the json marshal/unmarshal bug.
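Hello @wangxinalex! Thank you for your contribution. I'd like to understand a couple of aspects:
• What was the payload returned from ClickHouse that made you find the invalid UTF-8/UTF-16 bytes? Would it be possible to provide a failing test that is fixed by your change?
• It seems that the json encoder can be configured not to replace invalid bytes: SetEscapeHTML(false).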
Dear Gontarz,
Thank you for replying.
1. I actually found this issue when I tried to connect chproxy with DataGrip. The request is quite simple: `SELECT 1 FORMAT TabSeparatedWithNamesAndTypes`. The actual length of the response payload is 90, but the declared length is 62, and that causes the http write to report an error.
   1. When the result is first put into the redis cache, the payload is `bb012032a452485b03ae25a2d507665582120000000800000080310a55496e74380ade79cf087fb635049db816df195b016b820c0000000200000020310a`; the actual length and the declared length are both 62.
   2. However, when the same result is retrieved from the cache and unmarshaled, the payload becomes `efbfbd012032efbfbd52485b03efbfbd25efbfbdefbfbd076655efbfbd1200000008000000efbfbd310a55496e74380aefbfbd79efbfbd087fefbfbd3504efbfbdefbfbd16efbfbd195b016befbfbd0c0000000200000020310a`. The actual length becomes 90, so the actual length and the declared length differ, which causes the http write error.
   3. More detailed debug code can be found here: https://github.com/wangxinalex/chproxy/blob/1f0a5e7a94ae8c2351937188e1b0c94d140847f8/cache/redis_cache.go#L149
2. In my opinion, the root cause of this issue is that `string(bytes)` is not the canonical way to encode a byte array, especially when the payload is to be marshaled/unmarshaled. The root problem is described in the comment at `encoding/json/decode.go:96`:
// When unmarshaling quoted strings, invalid UTF-8 or
// invalid UTF-16 surrogate pairs are not treated as an error.
// Instead, they are replaced by the Unicode replacement
// character U+FFFD.
The original byte array is just an arbitrary byte stream, but the `Unmarshal` function interprets it as UTF-8/UTF-16 text and silently replaces some bytes. That is why the lengths of the original bytes and the unmarshaled bytes differ. As you can see, the frequent `efbfbd` in the retrieved result is actually the UTF-8 encoding of `U+FFFD`.
`SetEscapeHTML` cannot change this behavior, as shown in https://gist.github.com/wangxinalex/885f3d53047bae62c8a454de620c9717.
Therefore I use base64 to encode/decode the byte array and avoid the UTF-8/UTF-16 problem.
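The corruption described above can be reproduced with a short, self-contained sketch. The struct name `cachedPayload` and the helper `roundTrip` are hypothetical stand-ins, not chproxy's actual code; only the JSON field names follow the cached value shown in this thread. The sketch stores the raw bytes with `string(data)`, sends them through `encoding/json`, and compares lengths before and after the round trip.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// cachedPayload is a hypothetical stand-in for the cached struct.
// The payload is held as a plain string, as in the behavior being reported.
type cachedPayload struct {
	Length  int    `json:"l"`
	Payload string `json:"payload"`
}

// payloadHex is the payload quoted in the report above.
const payloadHex = "bb012032a452485b03ae25a2d507665582120000000800000080310a55496e74380ade79cf087fb635049db816df195b016b820c0000000200000020310a"

// roundTrip converts data with string(data), marshals it to JSON,
// unmarshals it again, and returns the bytes that come back.
func roundTrip(data []byte) ([]byte, error) {
	encoded, err := json.Marshal(cachedPayload{Length: len(data), Payload: string(data)})
	if err != nil {
		return nil, err
	}
	var decoded cachedPayload
	if err := json.Unmarshal(encoded, &decoded); err != nil {
		return nil, err
	}
	return []byte(decoded.Payload), nil
}

func main() {
	original, err := hex.DecodeString(payloadHex)
	if err != nil {
		panic(err)
	}
	got, err := roundTrip(original)
	if err != nil {
		panic(err)
	}
	fmt.Printf("before: %d bytes\n", len(original))
	fmt.Printf("after:  %d bytes\n", len(got))
	fmt.Printf("after (hex): %x\n", got)
}
```

With the payload quoted above, this should report 62 bytes before and 90 bytes after, since each of the 14 bytes that is not valid UTF-8 comes back as the three-byte sequence `efbfbd`.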
On March 30, 2022 at 19:54 +0800, Paweł Gontarz ***@***.***> wrote:
Hello @wangxinalex! Thank you for your contribution.
I'd like to understand a couple of aspects:
• What was the payload returned from ClickHouse that made you find the invalid UTF-8/UTF-16 bytes? Would it be possible to provide a failing test that is fixed by your change?
• It seems that the json encoder can be configured not to replace invalid bytes: SetEscapeHTML(false).
Thank you for your clear explanation!
I've launched tests locally to verify your change does not break anything and there's one test failing.
make test
truncated output
....
--- FAIL: TestServe (0.47s)
--- FAIL: TestServe/http_requests_with_caching_in_redis_ (0.01s)
main_test.go:369: result from cache query is wrong: {"l":4,"t":"text/plain; charset=utf-8","enc":"","payload":"T2suCg=="}
Could you adapt it to your change, please?
It would also be beneficial to have the failing example you provided added as a unit test. Could you do that as well?
FYI, we're working on adding a CI step to verify tests.
EDIT:
Please rebase your PR on master so that CI (GitHub Actions) is enabled.
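The `payload` value in the failing assertion, `T2suCg==`, looks like the base64 form of the response body, which suggests the test was still comparing against the raw text. A minimal sketch to check this, assuming the standard base64 alphabet (the helper name `decodePayload` is hypothetical):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodePayload decodes a base64 payload string as it appears in the
// cached JSON value. Standard encoding is an assumption here.
func decodePayload(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	raw, err := decodePayload("T2suCg==")
	if err != nil {
		panic(err)
	}
	// %q shows the bytes as a quoted string, including the newline.
	fmt.Printf("%q (%d bytes)\n", raw, len(raw))
}
```

The decoded value is `Ok.` followed by a newline, 4 bytes in total, which agrees with `"l":4` in the cached value.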
Dear @Garnek20, the failing case has been adapted and a new test case has been added for the changed behavior.
Using `string(data)` to convert the byte array to a string introduces errors in json marshal/unmarshal, and hence causes errors when returning a cached response from redis. The reason is that the `Unmarshal` function in `encoding/json` replaces invalid UTF-8 or invalid UTF-16 surrogate pairs with `U+FFFD`. The `payload` string in `redisCachePayload` therefore changes after json marshal/unmarshal whenever the byte array contains invalid UTF-8/UTF-16 bytes, and the payload length changes with it, so the http server finds that the length declared in the `Content-Length` header does not match the actual length of the payload. The fix is to base64-encode/decode the byte array to and from a string, which eliminates invalid UTF-8/UTF-16 bytes.
add test cases for base64 encode/decode the cached value
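A minimal sketch of the approach these commits describe, not the merged code itself: the struct `cachedPayload` and the helpers `encodePayload` and `decodePayload` are hypothetical names, and the use of standard base64 encoding is an assumption. The payload is base64-encoded before it is placed in the JSON string field, so the string handed to `encoding/json` is always valid UTF-8.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// cachedPayload is a hypothetical stand-in for the cached struct.
// Payload holds base64 text rather than the raw bytes.
type cachedPayload struct {
	Length  int    `json:"l"`
	Payload string `json:"payload"`
}

// encodePayload base64-encodes data and wraps it in the JSON envelope.
func encodePayload(data []byte) ([]byte, error) {
	return json.Marshal(cachedPayload{
		Length:  len(data),
		Payload: base64.StdEncoding.EncodeToString(data),
	})
}

// decodePayload unwraps the JSON envelope and restores the original bytes.
func decodePayload(encoded []byte) ([]byte, error) {
	var p cachedPayload
	if err := json.Unmarshal(encoded, &p); err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(p.Payload)
}

func main() {
	// A few bytes from the reported payload, including invalid UTF-8.
	original := []byte{0xbb, 0x01, 0x20, 0x32, 0xa4}
	encoded, err := encodePayload(original)
	if err != nil {
		panic(err)
	}
	decoded, err := decodePayload(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(original, decoded), len(decoded))
}
```

Declaring the field as `[]byte` instead of `string` would have a similar effect, since `encoding/json` encodes byte slices as base64 strings on its own. Whether that fits the existing cache format is a separate question.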
8226e08 to 648df1b
@Garnek20 It seems that |
@Garnek20 My best guess is that `time.Sleep(time.Millisecond * 5)` (line 254 in 5b23001) sleeps longer than it should in CI, which suppresses the queue_overflow_error. My suggestion is to minimize the sleep time and try again.
…s ci minimize the waiting time between two consecutive requests
gontarzpawel left a comment
Thank you for adding the test! One last comment and we will be ready to merge it 🙂
gontarzpawel left a comment
Thank you @wangxinalex for your contribution!