Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall #10994

Merged
merged 3 commits into from
May 2, 2015

Conversation

stevengj
Copy link
Member

This adds two new types, Cstring and Cwstring, which act just like Ptr{UInt8} and Ptr{Cwchar_t} as arguments to ccall, except that they throw an error if the string contains an embedded NUL character. I went through the Julia code and used these types wherever they seemed necessary.

This is necessary for safe passing of strings to C routines expecting NUL-terminated strings, to avoid silent truncation. See #10958, #10991.

Note that you can avoid the test, for strings known not to contain NUL, simply by using Ptr{UInt8} as before.

Note also that I changed several of our own C API routines to accept a char* and a size_t length parameter, rather than assuming NUL termination.

Still missing:

  • Regression tests
  • Documentation
  • Check for NUL in Cmd strings (command-line params)
  • Change Cstring to a bitstype

cc: @JeffBezanson, @StefanKarpinski, @ScottPJones, @jiahao

@stevengj stevengj added the domain:unicode Related to unicode characters and encodings label Apr 24, 2015
# C NUL-terminated string pointers; these can be used in ccall
# instead of Ptr{UInt8} and Ptr{Cwchar_t}, respectively, to enforce
# a check for embedded NUL chars in the string (to avoid silent truncation).
immutable Cstring; p::Ptr{UInt8}; end
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Cchar although I'm not familiar with any systems where Cchar != UInt8 and we're likely to be broken for many reasons on any such system?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, that's the least convincing argument ever, pls ignore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(char) is always one.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was an amusing read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have had some signedness-of-char issues with unicode on XP in the past

@stevengj stevengj force-pushed the nullsafe2 branch 2 times, most recently from ade3444 to df1aa34 Compare April 24, 2015 22:06
@stevengj
Copy link
Member Author

Along the way, I noticed (and fixed + test) an unrelated bug where parse and tryparse were using length(s) instead of endof(s), and hence were giving incorrect answers for Unicode whitespace chars. Also I improved BigInt parsing so that it no longer makes a copy of the string when parsing a whole bytestring.

@ScottPJones
Copy link
Contributor

Yes, it should be Ptr{Cchar}, not Ptr{UInt8}, if you really want to be correct...
In most C implementations, by default char is signed, although on IBM systems the default is unsigned (because they needed to handle EBCDIC, which is 8-bit, not 7-bit like ASCII).
I remember how horrible it was back in 1989 when I first ported the product of the company I was consulting for (ended up consulting there for 29 years) to AIX... I had been careful in my changes to not assume anything about char, but other developers had not, and there was probably 20-30MB of C code then... (it was around 56MB as of the beginning of this year). What was worse, the IBM compiler had no flag at the time to switch char from unsigned to signed, and C89 (that added things like void, signed, volatile) was not available yet (and it took several years before we had it on all platforms we supported)

@ScottPJones
Copy link
Contributor

This is great, @stevengj !

@stevengj
Copy link
Member Author

Ptr{Cchar} vs. Ptr{Uint8} doesn't make a difference, regardless of whether char is signed; the same pointer address gets passed regardless.

@stevengj
Copy link
Member Author

@tkelman, any idea why I would be getting an undefined-var error on Windows for _getenvlen? Cwstring is defined in c.jl, which should have been loaded by this point.

jl_unprotect_stack at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
jl_throw at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
jl_undefined_var_error at c:\projects\julia\usr\bin\libjulia.dll (unknown line)
_getenvlen at env.jl:27
get at env.jl:49
tty_size at env.jl:195
term at markdown/render/terminal/render.jl:15

@ScottPJones
Copy link
Contributor

@StevenG There isn't anything in Julia that would look at whether Cchar is signed or unsigned on the platform, when dereferencing a Ptr{Cchar} from C? (I know, my newbieness with Julia is showing! 😀)

@stevengj
Copy link
Member Author

@ScottPJones, we are just talking about the argument type declared in ccall; at the binary level, all that is being passed is a memory address, no type information. It is equivalent in C to linking to a function compiled as void foo(char *c), but calling it from a prototype mis-declared as void foo(unsigned char *c). If we call foo((unsigned char*) a) with a char *a array, the same pointer to the same data is passed to the underlying foo routine regardless of whether char is signed — the ABI doesn't care what the pointer type declaration is, and nothing we do in the ccall affects the underlying foo machine code.

Not that it would hurt to change the declarations to Ptr{Cchar}, of course. We currently aren't very consistent, but I didn't have the patience to go through and change everything. I'll go ahead and change it in Cstring itself for tidiness' sake.

if findfirst(s.data, 0) != length(s.data)
throw(ArgumentError("embedded NUL chars are not allowed in C strings"))
end
return Cwstring(unsafe_convert(Ptr{wchar_t}, s))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Cwstring(unsafe_convert(Ptr{Cwchar_t}, s))
                                   ^

@stevengj stevengj force-pushed the nullsafe2 branch 4 times, most recently from 7841275 to 71ac26c Compare April 25, 2015 01:36
# instead of Ptr{Cchar} and Ptr{Cwchar_t}, respectively, to enforce
# a check for embedded NUL chars in the string (to avoid silent truncation).
immutable Cstring; p::Ptr{Cchar}; end
immutable Cwstring; p::Ptr{Cwchar_t}; end
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't consistent with some platform ABIs, unfortunately, and will break ccall on some platforms. it must be declared bitstype WORD_SIZE Cstring. i realize that creating new bits types is quite opaque currently and should be better documented. one extremely limited example is

julia/test/core.jl

Lines 2255 to 2257 in 54d8cc7

bitstype 24 Int24
Int24(x::Int) = Intrinsics.box(Int24,Intrinsics.trunc_int(Int24,Intrinsics.unbox(Int,x)))
Int(x::Int24) = Intrinsics.box(Int,Intrinsics.zext_int(Int,Intrinsics.unbox(Int24,x)))

pointer.jl is a more complex example (https://github.com/JuliaLang/julia/blob/master/base/pointer.jl)

(fwiw, and iirc, the only place it will actually matter (and thus segfault) is if this is used as a return type on x86 linux)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Unfortunately, WORD_SIZE is defined in int.jl, but I need it in c.jl which is included before int.jl in sysimg.jl. Any suggestions? I guess I can use Ptr.size*8 move the WORD_SIZE definition to base.jl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm getting ErrorException("invalid number of bits in type Cstring") ... does the 2nd argument to bitstype have to be a numeric literal?

It won't let me define a bitstype inside an if WORD_SIZE === 64 statement either; it says type definition not allowed inside a local scope, which is confusing since normally if does not start a new scope. (Hmm, later on it seems to work; I'm not sure what I was doing wrong.)

I am getting pretty frustrated with bitstype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have it working as a bitstype if I hard-code the size as 64. I'm trying to figure out how to use WORD_SIZE, but it is tricky because a lot of things aren't defined yet at this point in the bootstrap process.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example,

if Ptr.size === 8
    bitstype 64 Cstring
    bitstype 64 Cwstring
else
    bitstype 32 Cstring
    bitstype 32 Cwstring
end

doesn't work for some reason (I get a box: argument is of incorrect size error later in the build process).

... aha, the problem is that .size is an Int32.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And

if box(Int, unbox(Int32, Int.size)) === 8
    bitstype 64 Cstring
    bitstype 64 Cwstring
else
    bitstype 32 Cstring
    bitstype 32 Cwstring
end

is giving the mysterious type definition not allowed inside a local scope error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay,

const sizeof_Int = box(UInt32, unbox(Int32, Int.size))
if sizeof_Int === 0x00000008 # need === since == doesn't exist yet              
    bitstype 64 Cstring
    bitstype 64 Cwstring
else
    bitstype 32 Cstring
    bitstype 32 Cwstring
end

seems to work, though why I need the const declaration to avoid type definition not allowed inside a local scope is mysterious to me.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 25, 2015

with this commit, i think you can also get rid of the following check in SubString-to-Ptr conversion which is equivalent to this test for an embedded null:

julia/base/string.jl

Lines 650 to 652 in 54d8cc7

if s.offset+s.endof < endof(s.string)
throw(ArgumentError("a SubString must coincide with the end of the original string to be convertible to pointer"))
end

@stevengj
Copy link
Member Author

@vtjnash, no, that test is not equivalent. The point of that test is that if you have a substring whose end does not coincide with the end of the string, then the substring is not NUL-terminated. This is true regardless of whether there are embedded NULs anywhere.

Or maybe your point is that people will stop using Ptr{UInt8} for things that are supposed to be NUL-terminated, in which case that unsafe_convert logic can be moved to a Cstring conversion? It still can't be deleted, in any case. Unfortunately, for the foreseeable future, lots of code will continue to use Ptr{UInt8} even when they should be using Cstring.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 25, 2015

it's equivalent in the sense that it is a verification of whether the SubString can be used as a c-style null-terminated string.

but yes, i don't see any way of making this transition, since it is leaving the existing common case unsafe, while adding an alternative safe method.

# instead of Ptr{Cchar} and Ptr{Cwchar_t}, respectively, to enforce
# a check for embedded NUL chars in the string (to avoid silent truncation).
const sizeof_Int = box(UInt32, unbox(Int32, Int.size))
if sizeof_Int === 0x00000008 # need === since == doesn't exist yet
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, this is low level bit twiddling. the way boot.jl does this is probably cleaner (if is(Int,Int64) ... end)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh, yes, that would be cleaner...

@stevengj
Copy link
Member Author

We might as well delete the unsafe_convert for SubString, since it tries to guarantee that it is NUL-terminated but actually is wrong because it doesn't check for embedded NUL. It's not clear whether this case (a substring that coincides with the end of a string) is worth optimizing anyway.

@stevengj stevengj force-pushed the nullsafe2 branch 2 times, most recently from 874e63c to f53d861 Compare April 25, 2015 04:48
@tkelman
Copy link
Contributor

tkelman commented Apr 25, 2015

I can reproduce the file test giving rmdir: Directory not empty on a local Windows build of this branch.

@stevengj
Copy link
Member Author

Thanks @tkelman, what is the file that is in the directory?

@StefanKarpinski
Copy link
Sponsor Member

Does this work safely for non-ByteString strings? I.e. suppose we write

ccall(:func_that_takes_a_cstring, Cint, (Cstring,), str)

but str is some kind of abstract string type that doesn't actually store a contiguous bunch of bytes that Cstring can point at. You can materialize that contiguous bunch of bytes in the conversion process, but how is it rooted? Does cconvert take care of that somehow?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 28, 2015

yes, in v0.4 cconvert was introduced largely for the purpose of introducing gcroots for arbitrary conversions

@StefanKarpinski
Copy link
Sponsor Member

That's what I thought, just wanted to make sure. In that case I'm cool with this. @JeffBezanson's opinion would be good though. I have a feeling he might have one (he's at LL today, so it may take a bit).

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2015

This isn't merged onto master yet and I want to tag 0.3.8 today, so the sendfile bugfix can probably wait until 0.3.9.

@stevengj
Copy link
Member Author

@tkelman, note that there are two bugfix patches in this PR: one to sendfile, and one to tryparse.

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2015

Thanks, noted, but tryparse doesn't exist on release-0.3, does it?

@stevengj
Copy link
Member Author

stevengj commented May 1, 2015

Oh, right.

@vtjnash
Copy link
Sponsor Member

vtjnash commented May 1, 2015

fwiw, this seems like a good way to help avoid zero-day exploits such as described in http://lucumr.pocoo.org/2010/12/24/common-mistakes-as-web-developer/

stevengj added a commit that referenced this pull request May 2, 2015
add Cstring/Cwstring types for safe passing of NUL-terminated strings to ccall
@stevengj stevengj merged commit 803193e into JuliaLang:master May 2, 2015
@stevengj
Copy link
Member Author

stevengj commented May 2, 2015

@JeffBezanson said it was okay to merge.

@tkelman
Copy link
Contributor

tkelman commented May 4, 2015

I'm not sure exactly why, but this seems to cause an MSVC build to freeze while starting the second stage of bootstrap - no exports.jl appears, julia takes an entire core but an unchanging amount of memory (117,836 K to be specific, that might change from run to run though).

stevengj added a commit that referenced this pull request May 4, 2015
(cherry picked from commit 4ff969f)
ref PR #10994

Conflicts:
	base/fs.jl

use error instead of ArgumentError for release-0.3
to avoid changing the thrown exception type here
@tkelman
Copy link
Contributor

tkelman commented May 4, 2015

backported the sendfile bugfix in f5e0074

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request May 4, 2015
@stevengj stevengj deleted the nullsafe2 branch May 4, 2015 14:24
stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request May 5, 2015
@tkelman
Copy link
Contributor

tkelman commented May 12, 2015

Whatever my bootstrapping problem with an msvc build was seems to have gone away. Now just hitting the segfault from 8b8b261...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants