Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect handling of NUL characters in strings #10958

Closed
ScottPJones opened this issue Apr 23, 2015 · 51 comments
Closed

incorrect handling of NUL characters in strings #10958

ScottPJones opened this issue Apr 23, 2015 · 51 comments
Labels
domain:unicode Related to unicode characters and encodings needs decision A decision on this change is needed

Comments

@ScottPJones
Copy link
Contributor

This gets a BoundsError: attempt to access 1-element Array{Char,1}
in convert at utf32.jl:37

@jiahao
Copy link
Member

jiahao commented Apr 23, 2015

Looks like length(::UTF8String) isn't right.

julia> length("\0\uff")
0

julia> length(UTF8String("\0w"))
0

@jiahao jiahao added the domain:unicode Related to unicode characters and encodings label Apr 23, 2015
@StefanKarpinski
Copy link
Sponsor Member

I think we need to stop relying on the u8 support functions for UTF-8 various things. It assumes that strings are NUL-terminated, which in Julia they are not. There may be other assumptions it makes that are invalid. @JeffBezanson might know what those assumptions may be.

@stevengj
Copy link
Member

We could use "modified UTF-8", like Java, and use the overlong encoding for the NUL codepoint if it occurs in the string.

The problem with encoding NUL as the 0 byte is that any Julia code that calls external C code expecting NUL-terminated strings could silently produce unexpected/truncated results if an arbitrary UTF8String is passed.

Or we could just disallow NUL bytes in UTF8 or ASCII strings.

Or we could allow NUL bytes, making sure all our own code is "NUL-clean" (i.e. pass an explicit length to any C string routines), and caveat emptor for other packages calling C string routines.

@stevengj
Copy link
Member

Note that all of the libuv functions expect NUL-terminated strings and don't properly handle strings with embedded NUL characters.

The more I look at this, the more I think that allowing NUL characters in strings is a gaping security problem. If you want an arbitrary sequence of bytes, use a Vector{UInt8}.

@stevengj
Copy link
Member

Similarly for all the Windows (e.g. GetFullPathNameW or _wrmdir) and Unix (e.g. rmdir) filesystem functions that we are calling, which accept NUL-terminated strings. If we allow strings containing NUL, then we need to insert a check before every one of these calls in order to avoid the possibility of silent truncation. Note that this applies to UTF-16 strings (e.g. in the Windows functions) as well as UTF-8 strings.

@stevengj stevengj changed the title utf32(“\0\uff”) gives an error incorrect handling of NUL characters in strings Apr 23, 2015
@stevengj stevengj added the needs decision A decision on this change is needed label Apr 23, 2015
@stevengj
Copy link
Member

See also RFC 3629, the discussion of invalid-byte-sequence handling in the Wikipedia UTF-8 article, and PEP 0383 on non-decodable bytes. It seems like a subtle issue to decide what to do with things like this, because may be really hard to be sure that no invalid byte-sequences crop up.

@stevengj
Copy link
Member

One possibility might be to have a special Cstring type corresponding to NUL-terminated char* (and Cwstring for wchar_t*) which, when used in ccall, is equivalent to Ptr{Uint8} but inserts a check for NUL characters. cc: @vtjnash

@stevengj
Copy link
Member

@JeffBezanson, you wrote a lot of the original UTF-8 code, right? What was the general philosophy on handling of embedded NUL characters, or of invalid UTF-8? Note that jl_array_to_string happily stores invalid UTF-8 in the resulting UTF8String.

@ScottPJones
Copy link
Contributor Author

Please! Don’t make it so that you can’t have \0 characters in Julia strings!
There are lots of totally valid uses of embedded NUL characters...
Also, you would just be moving the checking, so that it happens on every conversion between Array{UInt8,1}, Array{UInt16,1}, Array{UInt32,1}... yech!!!!
Also, frankly, I don’t think is truly necessary - and I don’t think anybody else is doing anything of the sort... why should it matter to a C function if you pass a pointer to a string that has something after the NUL character, if it only cares about the characters beforehand?

Also, I can bend the UTF-8 rules a bit to accept a single technically invalid sequence, but my intention was to never produce any invalid UTF-8, UTF-16, or UTF-32 strings.

Also, ASCIIStrings also can have embedded \0 bytes in them... are you going to make that invalid as well?

@stevengj
Copy link
Member

@ScottPJones, the point is that you think you are passing foo\0bar to the C function, but the C function thinks you are passing "foo". i.e. silent truncation is occuring, which is dangerous. Really, we should throw an exception in such a case.

What is the use-case for strings with embedded NUL bytes? (Most POSIX and ISO C functions don't allow them.)

Note that we are talking about strings, not Array{Intxx}, so there would only be a check when you convert an array into a string, and we already check for valid UTF-8 when such a conversion is performed.

@ScottPJones
Copy link
Contributor Author

Delimited strings... people have been using them for many, many, many years.
C is NOT the world... there are many other languages out there that know nothing about \0 terminated strings.
C itself allows \0 bytes within strings... and even char * doesn’t imply that it is \0 terminated...
C++ std::string does NOT have \0 to terminate strings...
I think any modern APIs tend to use a pointer and a 32-bit or better a size_t length.

I always avoided using any functions that relied on \0 termination... it kills performance compared to having fixed width characters and a length.
I also wouldn’t hold up POSIX or ISO C APIs as being somehow well designed... a lot of that was just standardizing what was present in the first Unixes...

@ScottPJones
Copy link
Contributor Author

About security risks: creating overlong UTF-8 sequences causes a security problem.
Accepting them in a conversion function (just the 0xc0 0xbf for \0), along with pairs of 3-byte surrogate characters, frequent from systems that treat UTF-16 as if it were still UCS-2), I don’t think really causes security issues, esp. since the convert functions for UTF8String, UTF16String, and UTF32String allow pretty much anything in currently, so it isn’t any worse than before.

@ScottPJones
Copy link
Contributor Author

I’d also thought of having a Cstring / Cwstring type that did the checks, only for use when you really needed it (which is less and less an issue, since more APIs now use length/pointer instead).
To me, that’s the best approach.

@ScottPJones
Copy link
Contributor Author

I might call it CZstring, and CZwstring... to make it clear that these are JUST for the case where nul termination is important... not all C strings are \0 terminated anyway (just literals, and ones you are making to pass to particular APIs that still expect them)

@stevengj
Copy link
Member

@ScottPJones, it's totally not true that all modern APIs (defined as APIs in current widespread use, or even defined as APIs of libraries created in the last 10-15 years) avoid NUL-terminated strings, for better or for worse. In both Julia base and in the packages you can find lots of examples calling external C libraries which pass NUL-terminated strings. Essentially all of these need to have checks if we continue to allow strings with embedded NUL.

A char* in C is not necessarily a "string", it is is just a pointer, possibly to an array. In common parlance, the term "string" in C refers to NUL-terminated arrays of char or wchar_t, so I think Cstring is appropriate here. If you just have a byte array and a length, you can use Ptr{UInt8} in ccall.

Whether we should accept (and silently convert) modified UTF-8 to standard UTF-8 is a separate issue; I tend to agree, but let's keep that out of this discussion. After reading the RFCs, I agree that we shouldn't produce the overlong NUL encoding ourselves.

Using NUL as a delimiter inside of a string is cute, but is it really that useful? Is there any popular library that returns strings in this format, for example?

@JeffBezanson
Copy link
Sponsor Member

My intent was to support NUL characters inside julia strings. If you pass such a string to a C function that accepts NUL-terminated strings, that's a separate problem we can't do anything about.

I definitely don't want to use "modified UTF-8".

Not all the functions in utf8.c assume nul termination. Several accept lengths or have non-nul-terminated variants. For length we should just call u8_charnum.

@stevengj
Copy link
Member

If we pass such a string to a C function that accepts NUL-terminated strings (and we potentially do that, in many places), that is our problem and we do need a check.

Some of the functions in utf8.c seem to assume valid UTF-8, which may not be produced e.g. by bytestring(ptr, len).

@StefanKarpinski
Copy link
Sponsor Member

I think the way forward for passing strings to C APIs that expect NUL-terminated strings is to ensures that the string is properly NUL-terminated and that it has no embedded NULs either – embedded NULs should be an error. This guarantees that the Julia and C sides agree on what the string is. This can be accomplished either with a type on the Julia side that guarantees this property upon construction (potentially sharing data with the original Julia string when possible), or with a function that the user is expected to call on the string.

@ScottPJones
Copy link
Contributor Author

@stevengj I didn't say that all modern APIs do, just that they tend that way now... Also, as a C programmer for 35 years, I disagree with saying that in common parlance string means NUL-terminated... may be it depends on what particular community of C (and not C++) programmers you are in.
About using NUL as a delimiter inside a string, it is common in data from systems where \0 is just another character... it is one of the more frequent delimiters I've seen for storing multiple values in a single database field...

@stevengj
Copy link
Member

@StefanKarpinski, adding yet another set of string types would be a huge hassle for everyone. We could add a checknullterm(s) function that returns s or throws an exception if it contains NUL, but we'd be adding calls to that to lots of ccalls. That's why I think it would be far more convenient (and hence more likely to be used) to have a Cstring type that is not a string type per se, but rather a synonym for Ptr{Uint8} that does validation on the cconvert call.

@JeffBezanson
Copy link
Sponsor Member

Leaving aside NULs for a moment, I don't think it's sane to expect every operation on invalid utf-8 data to have some defined behavior. Currently our getindex for utf-8 tries to validate as it goes, but it's incomplete, and to do full checking would need to be even bigger and slower. If we keep going down that path I don't think we can have competitive performance.

@timholy
Copy link
Sponsor Member

timholy commented Apr 23, 2015

Examples like this one:
http://www.cplusplus.com/reference/string/string/c_str/
make me think that we'll be on thin ice if we really endorse the claim that the common conception of a C string does not imply that it is null terminated.

@ScottPJones
Copy link
Contributor Author

@StefanKarpinski Yes, a new Cstring (and Cwstring) type with their own constructors would do the trick... I'd also say that the normal String types should actually not require a terminating \0.
I think that just hurts performance, and makes it harder to move between Array{UInt8,1}, Array{UInt16,1} and Array{UInt32,1} and ASCIIString or UTF8String, UTF16String, and UTF32String, respectively.
Could there be a flag in the internal representation that indicated if the string was \0 terminated, so that literals could be created as now \0 terminated (with it hidden), it could be reinterpreted directly as a Cstring?

@stevengj
Copy link
Member

@ScottPJones, there is a difference between a new string type and a new type. I think Cstring should be a new type that you put in a ccall argument, but not be a subtype of AbstractString; it will be effectively a pointer type. But that is a technical detail.

I'm skeptical that there is a significant performance cost to the current approach for UTF-8 nul-termination, in which all UInt8 arrays have a hidden extra 0 byte appended.

@StefanKarpinski
Copy link
Sponsor Member

I suspect that the current approach of sneaking a NUL byte at the end of every UInt8 vector is not a performance issue. It is, however, annoying for other reasons and should probably, IMO, go away. C APIs that take NUL-terminated strings tend not to be used for very long strings, so I think that checking that the passed string has no embedded NULs and ensuring the trailing NUL (possibly by creating a copy with the NUL appended) will not be a big performance issue either.

Having a Cstring type that's just a pointer with special conversion behavior would be ok, but do we also use it for NUL-terminated UTF-16 and UTF-32 strings? Those are not what one would classically consider "C strings". Do NUL-terminated UTF-16 and UTF-32 strings require only a single trailing NUL byte? Or do they require an entire NUL code unit?

@stevengj
Copy link
Member

@JeffBezanson, I agree, but I just want to make sure Julia never crashes on invalid UTF-8 data, even if it gives junk results.

@ScottPJones
Copy link
Contributor Author

Yes, I'd like to contribute to the core of Julia there, and also with decimal floating point support, improved ODBC support, and maybe an Aerospike package...

@pao
Copy link
Member

pao commented Apr 23, 2015

using at-time, to

@ScottPJones a sidenote: as a part of what has turned out to be a syntax decision with hilarious consequences, please try to always code-quote the names of macros with backticks. Otherwise, you create a GitHub notification to a (sometimes-irritable) user with the same name as the macro.

@ScottPJones
Copy link
Contributor Author

@pao Oops!!! Thanks :-)

@pao
Copy link
Member

pao commented Apr 23, 2015

np, took us a while to figure that out 😄

stevengj added a commit that referenced this issue Apr 24, 2015
fix #10958: buggy handling of embedded NUL chars
stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 24, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 25, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 26, 2015
stevengj added a commit that referenced this issue Apr 26, 2015
(cherry picked from commit 1d90e97)
ref PR #10991

Conflicts:
	base/string.jl
	base/utf8.jl
	base/utf8proc.jl
	test/unicode.jl
stevengj added a commit to stevengj/julia that referenced this issue Apr 27, 2015
stevengj added a commit to stevengj/julia that referenced this issue Apr 28, 2015
mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings needs decision A decision on this change is needed
Projects
None yet
Development

No branches or pull requests

8 participants