Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing a string with NUL characters to a socket results in 0xc0 0x80 utf-8 sequences #1271

Closed
tve opened this issue Dec 14, 2023 · 12 comments
Closed
Labels

Comments

@tve
Copy link
Contributor

tve commented Dec 14, 2023

Moddable SDK version: 4.3
Target device: esp32

Description
In an HTTP prepareResponse handler I'm returning a string body that happens to contain NUL characters ('\0'). On the client side these come across as 0xc0 0x80 byte pairs and a byte is missing at the end.

Steps to Reproduce

  1. Run the following modification of the httpserverputfile example:
import { Server } from "http"
import { File } from "file"
import Net from "net"

new Server({}).callback = function (message, value) {
  switch (message) {
    case Server.headersComplete: // prepare for request body
      return String

    case Server.prepareResponse: // request body received
      return { status: 200, body: "Here is a [\0] byte" }
  }
}

trace(`Available on Wi-Fi "${Net.get("SSID")}"\n`)
trace(
  `curl --data-binary "@/users/[your directory path here]/test.txt"  http://${Net.get(
    "IP"
  )}/test.txt -v\n`
)

  1. Perform a curl request and observe the weird response:
> curl http://192.168.0.216/ | od -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    18  100    18    0     0    198      0 --:--:-- --:--:-- --:--:--   200
0000000   H   e   r   e       i   s       a       [ 300 200   ]       b
0000020   y   t
0000022

Notice the 0xc0 (0300) and 0x80 (0200) instead of the NUL and the missing 'e' at the end.

  1. Run a tcpdump to verify that this is not a curl misbehavior and also that the final 'e' is actually transmitted but omitted by curl due to the content-length:
        0x0020:  5019 1fb4 1aed 0000 4854 5450 2f31 2e31  P.......HTTP/1.1
        0x0030:  2032 3030 204f 4b0d 0a63 6f6e 6e65 6374  .200.OK..connect
        0x0040:  696f 6e3a 2063 6c6f 7365 0d0a 636f 6e74  ion:.close..cont
        0x0050:  656e 742d 6c65 6e67 7468 3a20 3138 0d0a  ent-length:.18..
        0x0060:  0d0a 4865 7265 2069 7320 6120 5bc0 805d  ..Here.is.a.[..]
        0x0070:  2062 7974 65                             .byte

Update: any response string with a non-ascii unicode character will trigger the incorrect content-length issue. For example, return { status: 200, body: "Here is a [©] byte" } results in

(main) /h/s/m/j/host> curl http://192.168.0.216/ | od -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    18  100    18    0     0     34      0 --:--:-- --:--:-- --:--:--    34
0000000   H   e   r   e       i   s       a       [ 302 251   ]       b
0000020   y   t
0000022

Note the missing 'e' at the end.

@phoddie
Copy link
Collaborator

phoddie commented Dec 15, 2023

Yes, this is known. No plans to fix it in this server. The caller can use TextEncoder to convert the string to binary UTF-8 and write that instead.

FWIW – the ECMA-419 HTTP client and server only operate on binary data (not strings). That's deliberate so that these text conversion issues remain external.

@phoddie phoddie closed this as completed Dec 15, 2023
@tve
Copy link
Contributor Author

tve commented Dec 15, 2023

Yes, this is known. No plans to fix it in this server.

Noting this in the docs would have saved me a bunch of time and aggravation...

@tve
Copy link
Contributor Author

tve commented Dec 15, 2023

FWIW – the ECMA-419 HTTP client and server

Thanks for pointing this out again, I'm not finding the HTTPServer implementation in the Moddable SDK, am I missing something? found it

@phoddie
Copy link
Collaborator

phoddie commented Dec 15, 2023

(Added a note to the docs)

@tve
Copy link
Contributor Author

tve commented Dec 16, 2023

FYI: the issue with NUL characters seems to also affect Array.fromString. Some uses, like in the ecma-419 mqtt client look vulnerable. I assume this is known and won't-fix.

@phoddie
Copy link
Collaborator

phoddie commented Dec 17, 2023

The default XS string encoding is almost UTF-8. The exception is NULLs which use the CESU-8 encoding. Here's a modified version of ArrayBuffer.fromString that handles the NULL encoding.

fx_ArrayBuffer_fromString
void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length;
	if (mxArgc < 1)
		mxTypeError("no argument");
	length = mxStringLength(fxToString(the, mxArgv(0)));
	
	txString c = mxArgv(0)->value.string, end = c + length - 1;
	txInteger nulls = 0;
	while (c < end) {
		if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1]))
			nulls += 1;
		c++;
	} 
	
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			if ((0xc0 == (uint8_t)c[0]) && (0x80 == (uint8_t)c[1])) {
				*out++ = 0;
				c += 2;
			}
			else
				*out++ = *c++;
		} 
	}
}

Here's the corresponding change to String.fromArrayBuffer:

fx_String_fromArrayBuffer
void fx_String_fromArrayBuffer(txMachine* the)
{
	txSlot* slot;
	txSlot* arrayBuffer = C_NULL, *sharedArrayBuffer = C_NULL;
	txSlot* bufferInfo;
	txInteger limit, offset;
	txInteger inLength, outLength = 0, nulls = 0;
	unsigned char *in;
	txString string;
	if (mxArgc < 1)
		mxTypeError("no argument");
	slot = mxArgv(0);
	if (slot->kind == XS_REFERENCE_KIND) {
		slot = slot->value.reference->next;
		if (slot) {
			bufferInfo = slot->next;
			if (slot->kind == XS_ARRAY_BUFFER_KIND)
				arrayBuffer = slot;
			else if (slot->kind == XS_HOST_KIND) {
				if (!(slot->flag & XS_HOST_CHUNK_FLAG) && bufferInfo && (bufferInfo->kind == XS_BUFFER_INFO_KIND))
					sharedArrayBuffer = slot;
			}
		}
	}
	if (!arrayBuffer && !sharedArrayBuffer)
		mxTypeError("argument is no ArrayBuffer instance");
	limit = bufferInfo->value.bufferInfo.length;
	offset = fxArgToByteLength(the, 1, 0);
	if (limit < offset)
		mxRangeError("out of range byteOffset %ld", offset);
	inLength = fxArgToByteLength(the, 2, limit - offset);
	if ((limit < (offset + inLength)) || ((offset + inLength) < offset))
		mxRangeError("out of range byteLength %ld", inLength);

	in = offset + (unsigned char *)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
	while (inLength > 0) {
		unsigned char first = c_read8(in++), clen;
		if (first < 0x80){
			if (0 == first)
				nulls += 1;
			inLength -= 1;
			outLength += 1;
			continue;
		}

		if (0xC0 == (first & 0xE0))
			clen = 2;
		else if (0xE0 == (first & 0xF0))
			clen = 3;
		else if (0xF0 == (first & 0xF0))
			clen = 4;
		else
			goto badUTF8;

		inLength -= clen;
		if (inLength < 0)
			goto badUTF8;

		outLength += clen;
		clen -= 1;
		do {
			if (0x80 != (0xc0 & c_read8(in++)))
				goto badUTF8;
		} while (--clen > 0);
	}

	string = fxNewChunk(the, outLength + nulls + 1);
	if (!nulls)
		c_memcpy(string, offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data), outLength);
	else {
		txString c = string, end = c + outLength + nulls;
		txString buf = offset + (txString)(arrayBuffer ? arrayBuffer->value.arrayBuffer.address : sharedArrayBuffer->value.host.data);
		while (c < end) {
			txByte b = *buf++;
			if (b)
				*c++ = b;
			else {
				*c++ = 0xC0;
				*c++ = 0x80;
			}
		}
	}
	string[outLength + nulls] = 0;
	mxResult->value.string = string;
	mxResult->kind = XS_STRING_KIND;

	return;

badUTF8:
	mxTypeError("invalid UTF-8");
}

Let me know if this works for you. If it does, I'll merge it.

@tve
Copy link
Contributor Author

tve commented Dec 17, 2023

CESU-8: learn something new every day...
I'll plug it in and report back, thanks!

@phoddie
Copy link
Collaborator

phoddie commented Dec 18, 2023

I'll plug it in and report back, thanks!

Great. How'd it go?

@tve
Copy link
Contributor Author

tve commented Dec 19, 2023

My test for Array.fromString looks good, ~the impact on larger buffers (1KB) is 2x but on smaller strings I would call it "in the noise" (using esp32):

import Time from "time"
import TextEncoder from "text/encoder"

const ten = "0123456789"
const hundred = ten + ten + ten + ten + ten + ten + ten + ten + ten + ten
const thousand =
  hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred + hundred

const ten0 = "01234\x006789"
const hundred0 = ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0 + ten0
const thousand0 =
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0 +
  hundred0

function timeit(desc: string, f: () => void): void {
  const t0 = Time.ticks
  f()
  const t1 = Time.ticks
  trace(`${desc} took ${t1 - t0}ms\n`)
}

const tests = [
  ["ten", ten],
  ["hundred", hundred],
  ["thousand", thousand],
  ["ten0", ten0],
  ["hundred0", hundred0],
  ["thousand0", thousand0],
]

trace("\n===== String NUL test =====\n")

for (const test of tests) {
  const ab = ArrayBuffer.fromString(test[1])
  const u8 = new Uint8Array(ab)
  const te = new TextEncoder().encode(test[1])
  const same = u8.byteLength == te.byteLength && u8.every((v, i) => v == te[i])
  trace(
    `${test[0]} ${test[1].length} ${ab.byteLength} ` +
      `${u8.slice(0, 8)}...${u8[u8.byteLength - 1]} same=${same}\n`
  )
}

for (const test of tests) {
  timeit(`${test[0]} fromString`, () => {
    const s = test[1]
    for (let i = 0; i < 100; i++) {
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
      ArrayBuffer.fromString(s)
    }
  })
}

// Before fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 11 48,49,50,51,52,192,128,54...57 same=false
// hundred0 100 110 48,49,50,51,52,192,128,54...57 same=false
// thousand0 1000 1100 48,49,50,51,52,192,128,54...57 same=false
// ten fromString took 37ms
// hundred fromString took 40ms
// thousand fromString took 67ms
// ten0 fromString took 37ms
// hundred0 fromString took 41ms
// thousand0 fromString took 69ms
//
// After fix:
// ===== String NUL test =====
// ten 10 10 48,49,50,51,52,53,54,55...57 same=true
// hundred 100 100 48,49,50,51,52,53,54,55...57 same=true
// thousand 1000 1000 48,49,50,51,52,53,54,55...57 same=true
// ten0 10 10 48,49,50,51,52,0,54,55...57 same=true
// hundred0 100 100 48,49,50,51,52,0,54,55...57 same=true
// thousand0 1000 1000 48,49,50,51,52,0,54,55...57 same=true
// ten fromString took 26ms
// hundred fromString took 33ms
// thousand fromString took 107ms
// ten0 fromString took 22ms
// hundred0 fromString took 39ms
// thousand0 fromString took 174ms

similar test for String.fromArrayBuffer will follow
String.fromArrayBuffer looks good too.

Edit:

  • For ArrayBuffer.fromString 1000 bytes: post-fix-no-NUL to pre-fix is ~1.6x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.6x; for 100 bytes or less the difference is noise.
  • For String.fromArrayBuffer 1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is also ~1.5x; for 100 bytes or less the difference is noise.

@phoddie
Copy link
Collaborator

phoddie commented Dec 19, 2023

Thanks for the tests and benchmarks. Given the need for an extra pass over the data, it is going to take more time. Since the actual work is trivial, memory bandwidth is the limiting factor. Since this isn't generally performance critical, I think it is an OK tradeoff and keeps the code small.

I just noticed that fx_ArrayBuffer_fromString can be optimized. The first thing it does is get the length of the string. That can be combined with the pass that searches for nulls. That should make the performance much closer to the original. Here's the updated version.

fx_ArrayBuffer_fromString 2
void fx_ArrayBuffer_fromString(txMachine* the)
{
	txSize length = 0;
	if (mxArgc < 1)
		mxTypeError("no argument");
	
	txString c = mxArgv(0)->value.string;
	txInteger nulls = 0;
	while (true) {
		uint8_t b = (uint8_t)c_read8(c++);
		if (!b) break;

		length += 1;
		if ((0xc0 == b) && (0x80 == (uint8_t)c_read8(c)))
			nulls += 1;
	} 
	
	fxConstructArrayBufferResult(the, mxThis, length - nulls);
	if (!nulls)
		c_memcpy(mxResult->value.reference->next->value.arrayBuffer.address, mxArgv(0)->value.string, length);
	else {
		txString c = mxArgv(0)->value.string, end = c + length;
		txByte *out = mxResult->value.reference->next->value.arrayBuffer.address;
		while (c < end) {
			uint8_t b = (uint8_t)c_read8(c++);
			if ((0xc0 == (uint8_t)b) && (0x80 == (uint8_t)c_read8(c))) {
				*out++ = 0;
				c += 1;
			}
			else
				*out++ = b;
		} 
	}
}

How's that look?

Separately, I would expect String.fromArrayBuffer() performance to be more-or-less unchanged. There's no additional pass over the data. If there are nulls, it will be a little slower because it can't use memcpy to transfer the bytes, but in the common case of no nulls it should pretty much match.

@tve
Copy link
Contributor Author

tve commented Dec 19, 2023

For ArrayBuffer.fromString

  • 1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.8x
  • for 100 bytes or less the difference is noise.

For String.fromArrayBuffer

  • 1000 bytes: post-fix-no-NUL to pre-fix is ~1.1x, and post-fix-with-NUL to post-fix-no-NUL is ~1.5x
  • for 100 bytes or less the difference is noise.

@phoddie
Copy link
Collaborator

phoddie commented Dec 19, 2023

Thanks for rechecking. The fromString optimization seems to be working nicely. I'll integrate the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants