incorrect handling of NUL characters in strings #10958

ScottPJones · 2015-04-23T14:40:16Z

This gets a BoundsError: attempt to access 1-element Array{Char,1}
in convert at utf32.jl:37

jiahao · 2015-04-23T15:18:44Z

Looks like length(::UTF8String) isn't right.

julia> length("\0\uff")
0

julia> length(UTF8String("\0w"))
0

StefanKarpinski · 2015-04-23T15:32:43Z

I think we need to stop relying on the u8 support functions for UTF-8 various things. It assumes that strings are NUL-terminated, which in Julia they are not. There may be other assumptions it makes that are invalid. @JeffBezanson might know what those assumptions may be.

stevengj · 2015-04-23T16:57:16Z

We could use "modified UTF-8", like Java, and use the overlong encoding for the NUL codepoint if it occurs in the string.

The problem with encoding NUL as the 0 byte is that any Julia code that calls external C code expecting NUL-terminated strings could silently produce unexpected/truncated results if an arbitrary UTF8String is passed.

Or we could just disallow NUL bytes in UTF8 or ASCII strings.

Or we could allow NUL bytes, making sure all our own code is "NUL-clean" (i.e. pass an explicit length to any C string routines), and caveat emptor for other packages calling C string routines.

stevengj · 2015-04-23T17:08:52Z

Note that all of the libuv functions expect NUL-terminated strings and don't properly handle strings with embedded NUL characters.

The more I look at this, the more I think that allowing NUL characters in strings is a gaping security problem. If you want an arbitrary sequence of bytes, use a Vector{UInt8}.

stevengj · 2015-04-23T17:32:59Z

Similarly for all the Windows (e.g. GetFullPathNameW or _wrmdir) and Unix (e.g. rmdir) filesystem functions that we are calling, which accept NUL-terminated strings. If we allow strings containing NUL, then we need to insert a check before every one of these calls in order to avoid the possibility of silent truncation. Note that this applies to UTF-16 strings (e.g. in the Windows functions) as well as UTF-8 strings.

stevengj · 2015-04-23T18:09:25Z

See also RFC 3629, the discussion of invalid-byte-sequence handling in the Wikipedia UTF-8 article, and PEP 0383 on non-decodable bytes. It seems like a subtle issue to decide what to do with things like this, because may be really hard to be sure that no invalid byte-sequences crop up.

stevengj · 2015-04-23T18:15:27Z

One possibility might be to have a special Cstring type corresponding to NUL-terminated char* (and Cwstring for wchar_t*) which, when used in ccall, is equivalent to Ptr{Uint8} but inserts a check for NUL characters. cc: @vtjnash

stevengj · 2015-04-23T18:17:26Z

@JeffBezanson, you wrote a lot of the original UTF-8 code, right? What was the general philosophy on handling of embedded NUL characters, or of invalid UTF-8? Note that jl_array_to_string happily stores invalid UTF-8 in the resulting UTF8String.

ScottPJones · 2015-04-23T18:32:51Z

Please! Don’t make it so that you can’t have \0 characters in Julia strings!
There are lots of totally valid uses of embedded NUL characters...
Also, you would just be moving the checking, so that it happens on every conversion between Array{UInt8,1}, Array{UInt16,1}, Array{UInt32,1}... yech!!!!
Also, frankly, I don’t think is truly necessary - and I don’t think anybody else is doing anything of the sort... why should it matter to a C function if you pass a pointer to a string that has something after the NUL character, if it only cares about the characters beforehand?

Also, I can bend the UTF-8 rules a bit to accept a single technically invalid sequence, but my intention was to never produce any invalid UTF-8, UTF-16, or UTF-32 strings.

Also, ASCIIStrings also can have embedded \0 bytes in them... are you going to make that invalid as well?

stevengj · 2015-04-23T18:39:27Z

@ScottPJones, the point is that you think you are passing foo\0bar to the C function, but the C function thinks you are passing "foo". i.e. silent truncation is occuring, which is dangerous. Really, we should throw an exception in such a case.

What is the use-case for strings with embedded NUL bytes? (Most POSIX and ISO C functions don't allow them.)

Note that we are talking about strings, not Array{Intxx}, so there would only be a check when you convert an array into a string, and we already check for valid UTF-8 when such a conversion is performed.

ScottPJones · 2015-04-23T18:49:02Z

Delimited strings... people have been using them for many, many, many years.
C is NOT the world... there are many other languages out there that know nothing about \0 terminated strings.
C itself allows \0 bytes within strings... and even char * doesn’t imply that it is \0 terminated...
C++ std::string does NOT have \0 to terminate strings...
I think any modern APIs tend to use a pointer and a 32-bit or better a size_t length.

I always avoided using any functions that relied on \0 termination... it kills performance compared to having fixed width characters and a length.
I also wouldn’t hold up POSIX or ISO C APIs as being somehow well designed... a lot of that was just standardizing what was present in the first Unixes...

ScottPJones · 2015-04-23T18:53:19Z

About security risks: creating overlong UTF-8 sequences causes a security problem.
Accepting them in a conversion function (just the 0xc0 0xbf for \0), along with pairs of 3-byte surrogate characters, frequent from systems that treat UTF-16 as if it were still UCS-2), I don’t think really causes security issues, esp. since the convert functions for UTF8String, UTF16String, and UTF32String allow pretty much anything in currently, so it isn’t any worse than before.

ScottPJones · 2015-04-23T18:55:30Z

I’d also thought of having a Cstring / Cwstring type that did the checks, only for use when you really needed it (which is less and less an issue, since more APIs now use length/pointer instead).
To me, that’s the best approach.

ScottPJones · 2015-04-23T18:57:18Z

I might call it CZstring, and CZwstring... to make it clear that these are JUST for the case where nul termination is important... not all C strings are \0 terminated anyway (just literals, and ones you are making to pass to particular APIs that still expect them)

stevengj · 2015-04-23T19:01:33Z

@ScottPJones, it's totally not true that all modern APIs (defined as APIs in current widespread use, or even defined as APIs of libraries created in the last 10-15 years) avoid NUL-terminated strings, for better or for worse. In both Julia base and in the packages you can find lots of examples calling external C libraries which pass NUL-terminated strings. Essentially all of these need to have checks if we continue to allow strings with embedded NUL.

A char* in C is not necessarily a "string", it is is just a pointer, possibly to an array. In common parlance, the term "string" in C refers to NUL-terminated arrays of char or wchar_t, so I think Cstring is appropriate here. If you just have a byte array and a length, you can use Ptr{UInt8} in ccall.

Whether we should accept (and silently convert) modified UTF-8 to standard UTF-8 is a separate issue; I tend to agree, but let's keep that out of this discussion. After reading the RFCs, I agree that we shouldn't produce the overlong NUL encoding ourselves.

Using NUL as a delimiter inside of a string is cute, but is it really that useful? Is there any popular library that returns strings in this format, for example?

JeffBezanson · 2015-04-23T19:05:51Z

My intent was to support NUL characters inside julia strings. If you pass such a string to a C function that accepts NUL-terminated strings, that's a separate problem we can't do anything about.

I definitely don't want to use "modified UTF-8".

Not all the functions in utf8.c assume nul termination. Several accept lengths or have non-nul-terminated variants. For length we should just call u8_charnum.

stevengj · 2015-04-23T19:08:06Z

If we pass such a string to a C function that accepts NUL-terminated strings (and we potentially do that, in many places), that is our problem and we do need a check.

Some of the functions in utf8.c seem to assume valid UTF-8, which may not be produced e.g. by bytestring(ptr, len).

StefanKarpinski · 2015-04-23T19:26:26Z

I think the way forward for passing strings to C APIs that expect NUL-terminated strings is to ensures that the string is properly NUL-terminated and that it has no embedded NULs either – embedded NULs should be an error. This guarantees that the Julia and C sides agree on what the string is. This can be accomplished either with a type on the Julia side that guarantees this property upon construction (potentially sharing data with the original Julia string when possible), or with a function that the user is expected to call on the string.

ScottPJones · 2015-04-23T19:27:58Z

@stevengj I didn't say that all modern APIs do, just that they tend that way now... Also, as a C programmer for 35 years, I disagree with saying that in common parlance string means NUL-terminated... may be it depends on what particular community of C (and not C++) programmers you are in.
About using NUL as a delimiter inside a string, it is common in data from systems where \0 is just another character... it is one of the more frequent delimiters I've seen for storing multiple values in a single database field...

stevengj · 2015-04-23T19:32:10Z

@StefanKarpinski, adding yet another set of string types would be a huge hassle for everyone. We could add a checknullterm(s) function that returns s or throws an exception if it contains NUL, but we'd be adding calls to that to lots of ccalls. That's why I think it would be far more convenient (and hence more likely to be used) to have a Cstring type that is not a string type per se, but rather a synonym for Ptr{Uint8} that does validation on the cconvert call.

JeffBezanson · 2015-04-23T19:32:17Z

Leaving aside NULs for a moment, I don't think it's sane to expect every operation on invalid utf-8 data to have some defined behavior. Currently our getindex for utf-8 tries to validate as it goes, but it's incomplete, and to do full checking would need to be even bigger and slower. If we keep going down that path I don't think we can have competitive performance.

timholy · 2015-04-23T19:32:23Z

Examples like this one:
http://www.cplusplus.com/reference/string/string/c_str/
make me think that we'll be on thin ice if we really endorse the claim that the common conception of a C string does not imply that it is null terminated.

ScottPJones · 2015-04-23T19:34:19Z

@StefanKarpinski Yes, a new Cstring (and Cwstring) type with their own constructors would do the trick... I'd also say that the normal String types should actually not require a terminating \0.
I think that just hurts performance, and makes it harder to move between Array{UInt8,1}, Array{UInt16,1} and Array{UInt32,1} and ASCIIString or UTF8String, UTF16String, and UTF32String, respectively.
Could there be a flag in the internal representation that indicated if the string was \0 terminated, so that literals could be created as now \0 terminated (with it hidden), it could be reinterpreted directly as a Cstring?

stevengj · 2015-04-23T19:37:00Z

@ScottPJones, there is a difference between a new string type and a new type. I think Cstring should be a new type that you put in a ccall argument, but not be a subtype of AbstractString; it will be effectively a pointer type. But that is a technical detail.

I'm skeptical that there is a significant performance cost to the current approach for UTF-8 nul-termination, in which all UInt8 arrays have a hidden extra 0 byte appended.

StefanKarpinski · 2015-04-23T19:44:46Z

I suspect that the current approach of sneaking a NUL byte at the end of every UInt8 vector is not a performance issue. It is, however, annoying for other reasons and should probably, IMO, go away. C APIs that take NUL-terminated strings tend not to be used for very long strings, so I think that checking that the passed string has no embedded NULs and ensuring the trailing NUL (possibly by creating a copy with the NUL appended) will not be a big performance issue either.

Having a Cstring type that's just a pointer with special conversion behavior would be ok, but do we also use it for NUL-terminated UTF-16 and UTF-32 strings? Those are not what one would classically consider "C strings". Do NUL-terminated UTF-16 and UTF-32 strings require only a single trailing NUL byte? Or do they require an entire NUL code unit?

stevengj · 2015-04-23T19:45:13Z

@JeffBezanson, I agree, but I just want to make sure Julia never crashes on invalid UTF-8 data, even if it gives junk results.

ScottPJones · 2015-04-23T20:14:03Z

Yes, I'd like to contribute to the core of Julia there, and also with decimal floating point support, improved ODBC support, and maybe an Aerospike package...

pao · 2015-04-23T20:14:09Z

using at-time, to

@ScottPJones a sidenote: as a part of what has turned out to be a syntax decision with hilarious consequences, please try to always code-quote the names of macros with backticks. Otherwise, you create a GitHub notification to a (sometimes-irritable) user with the same name as the macro.

ScottPJones · 2015-04-23T20:15:19Z

@pao Oops!!! Thanks :-)

pao · 2015-04-23T20:17:44Z

np, took us a while to figure that out 😄

fix #10958: buggy handling of embedded NUL chars

… to ccall (see JuliaLang#10958, JuliaLang#10991)

(cherry picked from commit 1d90e97) ref PR #10991 Conflicts: base/string.jl base/utf8.jl base/utf8proc.jl test/unicode.jl

… to ccall (see JuliaLang#10958, JuliaLang#10991)

jiahao added the domain:unicode Related to unicode characters and encodings label Apr 23, 2015

jiahao mentioned this issue Apr 23, 2015

Incorrect behavior of length(“\0\uff”) #10957

Closed

stevengj changed the title ~~utf32(“\0\uff”) gives an error~~ incorrect handling of NUL characters in strings Apr 23, 2015

stevengj added the needs decision A decision on this change is needed label Apr 23, 2015

stevengj mentioned this issue Apr 24, 2015

fix #10958: buggy handling of embedded NUL chars #10991

Merged

stevengj closed this as completed in 1d90e97 Apr 24, 2015

stevengj added a commit that referenced this issue Apr 24, 2015

Merge pull request #10991 from stevengj/nullsafe

8bcdb3f

fix #10958: buggy handling of embedded NUL chars

stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

c794a96

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj mentioned this issue Apr 24, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall #10994

Merged

4 tasks

stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

ade3444

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

df1aa34

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

e97df66

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

5c19613

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

7841275

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

71ac26c

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

71ab84a

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

6f60b3c

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

874e63c

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

f53d861

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

7741fb4

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

6aba33f

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

d0ef0b7

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

0171240

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit that referenced this issue Apr 26, 2015

fix #10958: buggy handling of embedded NUL chars

b192bf0

(cherry picked from commit 1d90e97) ref PR #10991 Conflicts: base/string.jl base/utf8.jl base/utf8proc.jl test/unicode.jl

stevengj added a commit that referenced this issue Apr 27, 2015

fix #10958: buggy handling of embedded NUL chars

cccbb34

stevengj added a commit to stevengj/julia that referenced this issue Apr 27, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

2384f23

… to ccall (see JuliaLang#10958, JuliaLang#10991)

stevengj added a commit to stevengj/julia that referenced this issue Apr 28, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

896b1a6

… to ccall (see JuliaLang#10958, JuliaLang#10991)

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

fix JuliaLang#10958: buggy handling of embedded NUL chars

fd37c2e

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

add Cstring/Cwstring types for safe passing of NUL-terminated strings…

57e166b

… to ccall (see JuliaLang#10958, JuliaLang#10991)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

incorrect handling of NUL characters in strings #10958

incorrect handling of NUL characters in strings #10958

ScottPJones commented Apr 23, 2015

jiahao commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

JeffBezanson commented Apr 23, 2015

stevengj commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

JeffBezanson commented Apr 23, 2015

timholy commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

pao commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

pao commented Apr 23, 2015

incorrect handling of NUL characters in strings #10958

incorrect handling of NUL characters in strings #10958

Comments

ScottPJones commented Apr 23, 2015

jiahao commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

JeffBezanson commented Apr 23, 2015

stevengj commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

JeffBezanson commented Apr 23, 2015

timholy commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

stevengj commented Apr 23, 2015

StefanKarpinski commented Apr 23, 2015

stevengj commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

pao commented Apr 23, 2015

ScottPJones commented Apr 23, 2015

pao commented Apr 23, 2015