Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

tve · 2023-12-14T08:40:37Z

Moddable SDK version: 4.3
Target device: esp32

Description
In an HTTP prepareResponse handler I'm returning a string body that happens to contain NUL characters ('\0'). On the client side these come across as 0xc0 0x80 byte pairs and a byte is missing at the end.

Steps to Reproduce

Run the following modification of the httpserverputfile example:

import { Server } from "http"
import { File } from "file"
import Net from "net"

new Server({}).callback = function (message, value) {
  switch (message) {
    case Server.headersComplete: // prepare for request body
      return String

    case Server.prepareResponse: // request body received
      return { status: 200, body: "Here is a [\0] byte" }
  }
}

trace(`Available on Wi-Fi "${Net.get("SSID")}"\n`)
trace(
  `curl --data-binary "@/users/[your directory path here]/test.txt"  http://${Net.get(
    "IP"
  )}/test.txt -v\n`
)

Perform a curl request and observe the weird response:

> curl http://192.168.0.216/ | od -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    18  100    18    0     0    198      0 --:--:-- --:--:-- --:--:--   200
0000000   H   e   r   e       i   s       a       [ 300 200   ]       b
0000020   y   t
0000022

Notice the 0xc0 (0300) and 0x80 (0200) instead of the NUL and the missing 'e' at the end.

Run a tcpdump to verify that this is not a curl misbehavior and also that the final 'e' is actually transmitted but omitted by curl due to the content-length:

        0x0020:  5019 1fb4 1aed 0000 4854 5450 2f31 2e31  P.......HTTP/1.1
        0x0030:  2032 3030 204f 4b0d 0a63 6f6e 6e65 6374  .200.OK..connect
        0x0040:  696f 6e3a 2063 6c6f 7365 0d0a 636f 6e74  ion:.close..cont
        0x0050:  656e 742d 6c65 6e67 7468 3a20 3138 0d0a  ent-length:.18..
        0x0060:  0d0a 4865 7265 2069 7320 6120 5bc0 805d  ..Here.is.a.[..]
        0x0070:  2062 7974 65                             .byte

Update: any response string with a non-ascii unicode character will trigger the incorrect content-length issue. For example, return { status: 200, body: "Here is a [©] byte" } results in

(main) /h/s/m/j/host> curl http://192.168.0.216/ | od -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    18  100    18    0     0     34      0 --:--:-- --:--:-- --:--:--    34
0000000   H   e   r   e       i   s       a       [ 302 251   ]       b
0000020   y   t
0000022

Note the missing 'e' at the end.

The text was updated successfully, but these errors were encountered:

phoddie · 2023-12-15T01:27:58Z

Yes, this is known. No plans to fix it in this server. The caller can use TextEncoder to convert the string to binary UTF-8 and write that instead.

FWIW – the ECMA-419 HTTP client and server only operate on binary data (not strings). That's deliberate so that these text conversion issues remain external.

tve · 2023-12-15T01:29:58Z

Yes, this is known. No plans to fix it in this server.

Noting this in the docs would have saved me a bunch of time and aggravation...

tve · 2023-12-15T01:36:05Z

FWIW – the ECMA-419 HTTP client and server

Thanks for pointing this out again, ~~I'm not finding the HTTPServer implementation in the Moddable SDK, am I missing something?~~ found it

phoddie · 2023-12-15T18:39:05Z

(Added a note to the docs)

tve · 2023-12-16T01:34:22Z

FYI: the issue with NUL characters seems to also affect Array.fromString. Some uses, like in the ecma-419 mqtt client look vulnerable. I assume this is known and won't-fix.

phoddie · 2023-12-17T06:50:49Z

The default XS string encoding is almost UTF-8. The exception is NULLs which use the CESU-8 encoding. Here's a modified version of ArrayBuffer.fromString that handles the NULL encoding.

fx_ArrayBuffer_fromString

void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length;
	if (mxArgc < 1)
		mxTypeError("no argument");
	length = mxStringLength(fxToString(the, mxArgv(0)));
	
	txString c = mxArgv(0)->value.string, end = c + length - 1;
	txInteger nulls = 0;
	while (c < end) {
		if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1]))
			nulls += 1;
		c++;
	} 
	
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1])) {
				*out++ = 0;
				c += 2;
			}
			else
				*out++ = *c++;
		} 
	}
}

Here's the corresponding change to String.fromArrayBuffer:

fx_String_fromArrayBuffer

void fx_String_fromArrayBuffer(txMachine* the)
{
	txSlot* slot;
	txSlot* arrayBuffer = C_NULL, *sharedArrayBuffer = C_NULL;
	txSlot* bufferInfo;
	txInteger limit, offset;
	txInteger inLength, outLength = 0, nulls = 0;
	unsigned char *in;
	txString string;
	if (mxArgc < 1)
		mxTypeError("no argument");
	slot = mxArgv(0);
	if (slot->kind == XS_REFERENCE_KIND) {
		slot = slot->value.reference->next;
		if (slot) {
			bufferInfo = slot->next;
			if (slot->kind == XS_ARRAY_BUFFER_KIND)
				arrayBuffer = slot;
			else if (slot->kind == XS_HOST_KIND) {
				if (!(slot->flag & XS_HOST_CHUNK_FLAG) && bufferInfo && (bufferInfo->kind == XS_BUFFER_INFO_KIND))
					sharedArrayBuffer = slot;
			}
		}
	}
	if (!arrayBuffer && !sharedArrayBuffer)
		mxTypeError("argument is no ArrayBuffer instance");
	limit = bufferInfo->value.bufferInfo.length;
	offset = fxArgToByteLength(the, 1, 0);
	if (limit < offset)
		mxRangeError("out of range byteOffset %ld", offset);
	inLength = fxArgToByteLength(the, 2, limit - offset);
	if ((limit < (offset + inLength)) || ((offset + inLength) < offset))
		mxRangeError("out of range byteLength %ld", inLength);

	in = offset + (unsigned char *)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
	while (inLength > 0) {
		unsigned char first = c_read8(in++), clen;
		if (first < 0x80){
			if (0 == first)
				nulls += 1;
			inLength -= 1;
			outLength += 1;
			continue;
		}

		if (0xC0 == (first & 0xE0))
			clen = 2;
		else if (0xE0 == (first & 0xF0))
			clen = 3;
		else if (0xF0 == (first & 0xF0))
			clen = 4;
		else
			goto badUTF8;

		inLength -= clen;
		if (inLength < 0)
			goto badUTF8;

		outLength += clen;
		clen -= 1;
		do {
			if (0x80 != (0xc0 & c_read8(in++)))
				goto badUTF8;
		} while (--clen > 0);
	}

	string = fxNewChunk(the, outLength + nulls + 1);
	if (!nulls)
		c_memcpy(string, offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data), outLength);
	else {
		txString c = string, end = c + outLength + nulls;
		txString buf = offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
		while (c < end) {
			txByte b = *buf++;
			if (b)
				*c++ = b;
			else {
				*c++ = 0xC0;
				*c++ = 0x80;
			}
		}
	}
	string[outLength + nulls] = 0;
	mxResult->value.string = string;
	mxResult->kind = XS_STRING_KIND;

	return;

badUTF8:
	mxTypeError("invalid UTF-8");
}

Let me know if this works for you. If it does, I'll merge it.

tve · 2023-12-17T07:10:01Z

CESU-8: learn something new every day...
I'll plug it in and report back, thanks!

phoddie · 2023-12-18T19:42:21Z

I'll plug it in and report back, thanks!

Great. How'd it go?

tve · 2023-12-19T02:19:23Z

My test for Array.fromString looks good, ~the impact on larger buffers (1KB) is ~~2x but on smaller strings I would call it "in the noise"~~ (using esp32):

import Time from "time"
import TextEncoder from "text/encoder"

const ten = "0123456789"
const hundred = ten + ten + ten + ten + ten + ten + ten + ten + ten + ten
const thousand =
  hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred

const ten0 = "01234\x006789"
const hundred0 = ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0
const thousand0 =
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0

function timeit(desc: string, f: () => void): void {
  const t0 = Time.ticks
  f()
  const t1 = Time.ticks
  trace(`${desc} took ${t1 - t0}ms\n`)
}

const tests = [
  ["ten", ten],
  ["hundred", hundred],
  ["thousand", thousand],
  ["ten0", ten0],
  ["hundred0", hundred0],
  ["thousand0", thousand0],
]

trace("\n===== String NUL test =====\n")

for (const test of tests) {
  const ab = ArrayBuffer.fromString(test[1])
  const u8 = new Uint8Array(ab)
  const te = new TextEncoder().encode(test[1])
  const same = u8.byteLength == te.byteLength && u8.every((v, i) => v == te[i])
  trace(
    `${test[0]} ${test[1].length} ${ab.byteLength} ` +
      `${u8.slice(0, 8)}...${u8[u8.byteLength - 1]} same=${same}\n`
  )
}

for (const test of tests) {
  timeit(`${test[0]} fromString`, () => {
    const s = test[1]
    for (let i = 0; i < 100; i++) {
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
    }
  })
}

// Before fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 11 48,49,50,51,52,192,128,54...57 same=false
// hundred0 100 110 48,49,50,51,52,192,128,54...57 same=false
// thousand0 1000 1100 48,49,50,51,52,192,128,54...57 same=false
// ten fromString took 37ms
// hundred fromString took 40ms
// thousand fromString took 67ms
// ten0 fromString took 37ms
// hundred0 fromString took 41ms
// thousand0 fromString took 69ms
//
// After fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 10 48,49,50,51,52,0,54,55...57 same=true
// hundred0 100 100 48,49,50,51,52,0,54,55...57 same=true
// thousand0 1000 1000 48,49,50,51,52,0,54,55...57 same=true
// ten fromString took 26ms
// hundred fromString took 33ms
// thousand fromString took 107ms
// ten0 fromString took 22ms
// hundred0 fromString took 39ms
// thousand0 fromString took 174ms

~~similar test for String.fromArrayBuffer will follow~~
String.fromArrayBuffer looks good too.

Edit:

For ArrayBuffer.fromString 1000 bytes: post-fix-no-NUL to pre-fix is ~1.6x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.6x; for 100 bytes or less the difference is noise.
For String.fromArrayBuffer 1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.5x; for 100 bytes or less the difference is noise.

phoddie · 2023-12-19T02:46:25Z

Thanks for the tests and benchmarks. Given the need for an extra pass over the data, it is going to take more time. Since the actual work is trivial, memory bandwidth is the limiting factor. Since this isn't generally performance critical, I think it is an OK tradeoff and keeps the code small.

I just noticed that fx_ArrayBuffer_fromString can be optimized. The first thing it does is get the length of the string. That can be combined with the pass that searches for nulls. That should make the performance much closer to the original. Here's the updated version.

fx_ArrayBuffer_fromString 2

void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length = 0;
	if (mxArgc < 1)
		mxTypeError("no argument");
	
	txString c = mxArgv(0)->value.string;
	txInteger nulls = 0;
	while (true) {
		uint8_t b = (uint8_t)c_read8(c++);
		if (!b) break;

		length += 1;
		if ((0xc0 == b) && (0x80 == (uint8_t)c_read8(c)))
			nulls += 1;
	} 
	
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			uint8_t b = (uint8_t)c_read8(c++);
			if ((0xc0 == (uint8_t)b) && (0x80 == (uint8_t)c_read8(c))) {
				*out++ = 0;
				c += 1;
			}
			else
				*out++ = b;
		} 
	}
}

How's that look?

Separately, I would expect String.fromArrayBuffer() performance to be more-or-less unchanged. There's no additional pass over the data. If there are nulls, it will be a little slower because it can't use memcpy to transfer the bytes, but in the common case of no nulls it should pretty much match.

tve · 2023-12-19T04:42:17Z

For ArrayBuffer.fromString

1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.8x
for 100 bytes or less the difference is noise.

For String.fromArrayBuffer

1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.5x
for 100 bytes or less the difference is noise.

phoddie · 2023-12-19T17:39:08Z

Thanks for rechecking. The fromString optimization seems to be working nicely. I'll integrate the changes.

tve mentioned this issue Dec 14, 2023

HTTP server fails if body is a String object #1269

Closed

phoddie added the wontfix label Dec 15, 2023

phoddie closed this as completed Dec 15, 2023

mkellner pushed a commit that referenced this issue Dec 21, 2023

ArrayBuffer.fromString & String.fromArrayBuffer handle nulls #1271

9133fb3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

tve commented Dec 14, 2023 •

edited

phoddie commented Dec 15, 2023

tve commented Dec 15, 2023

tve commented Dec 15, 2023 •

edited

phoddie commented Dec 15, 2023

tve commented Dec 16, 2023 •

edited

phoddie commented Dec 17, 2023

tve commented Dec 17, 2023

phoddie commented Dec 18, 2023

tve commented Dec 19, 2023 •

edited

phoddie commented Dec 19, 2023 •

edited

tve commented Dec 19, 2023

phoddie commented Dec 19, 2023

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

Comments

tve commented Dec 14, 2023 • edited

phoddie commented Dec 15, 2023

tve commented Dec 15, 2023

tve commented Dec 15, 2023 • edited

phoddie commented Dec 15, 2023

tve commented Dec 16, 2023 • edited

phoddie commented Dec 17, 2023

tve commented Dec 17, 2023

phoddie commented Dec 18, 2023

tve commented Dec 19, 2023 • edited

phoddie commented Dec 19, 2023 • edited

tve commented Dec 19, 2023

phoddie commented Dec 19, 2023

tve commented Dec 14, 2023 •

edited

tve commented Dec 15, 2023 •

edited

tve commented Dec 16, 2023 •

edited

tve commented Dec 19, 2023 •

edited

phoddie commented Dec 19, 2023 •

edited