Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it expected that native code can modify Ruby Strings passed to it inplace? #1080

Open
eregon opened this issue Feb 15, 2024 · 8 comments
Open

Comments

@eregon
Copy link
Collaborator

eregon commented Feb 15, 2024

Ruby 3.3.0 FFI 1.16.3
It happens both for the :string and :pointer types:

$ irb -rffi
irb(main):001> s = "hello"
=> "hello"
irb(main):002* module MyLib
irb(main):003*   extend FFI::Library
irb(main):004*   ffi_lib 'c'
irb(main):005*   attach_function :strtok, [ :string, :string ], :pointer
irb(main):006> end
=> #<FFI::Function address=0x00007fa0de978680>
irb(main):007> MyLib.strtok(s, "e")
=> #<FFI::Pointer address=0x00007fa0b5dc1028>
irb(main):008> s
=> "h\u0000llo"
irb(main):009> s.ascii_only?
=> true

$ irb -rffi
irb(main):001* module MyLib
irb(main):002*   extend FFI::Library
irb(main):003*   ffi_lib 'c'
irb(main):004*   attach_function :strtok, [ :pointer, :string ], :pointer
irb(main):005> end
=> #<FFI::Function address=0x00007f371e1e5680>
irb(main):006> s = "hello"
=> "hello"
irb(main):007> MyLib.strtok(s, "e")
=> #<FFI::Pointer address=0x00007f36f562dcf0>
irb(main):008> s
=> "h\u0000llo"

This also makes it possible to break the code range of Ruby Strings, for example:

irb -rffi
irb(main):001* module MyLib
irb(main):002*   extend FFI::Library
irb(main):003*   ffi_lib 'c'
irb(main):004*   attach_function :strcpy, [ :pointer, :string ], :pointer
irb(main):005> end
=> #<FFI::Function address=0x00007f7c00898d10>
irb(main):006> s = "hello"
=> "hello"
irb(main):007> MyLib.strcpy(s, "ét")
=> #<FFI::Pointer address=0x00007f7bd7977d80>
irb(main):008> s
=> "ét\u0000o"
irb(main):009> s.ascii_only?
=> true # incorrect due to this mutation that Ruby does not know about
irb(main):011> s.bytes
=> [195, 169, 116, 0, 111]
irb(main):012> s.encoding
=> #<Encoding:UTF-8>
irb(main):014> s.bytes.pack('C*').force_encoding("UTF-8").ascii_only?
=> false

I found this by looking at a performance issue in TruffleRuby: oracle/truffleruby#3293 (comment)
On TruffleRuby the Ruby String passed to a FFI call needs to be converted from a byte[] to a native String (char*), like RSTRING_PTR() in C extensions.
That works and converting to native memory is anyway necessary.

But because of the semantics above this conversion needs to also change that Ruby String object to also use that native memory, otherwise it wouldn't reflect changes like above. And that can have really bad performance effects like in the linked issue when the same String is then matched against a Regexp and that needs a byte[] again, so then big strings go back and forth between native and managed byte[] memory which is really slow.

I think it would be better to change FFI to always .dup the String passed to native, so if that memory is changed it does not affect the original String.
OTOH using a String as some kind of native buffer can be useful too, but maybe this should only be done with a :buffer type. Maybe for :pointer too. But for :string it feels wrong.

@eregon
Copy link
Collaborator Author

eregon commented Feb 15, 2024

Interestingly on JRuby the mutations to the native memory have no effect on the Ruby String for :string, but it does for :pointer.
Sort of expected since I don't think JRuby has a way to represent a RubyString using native memory (but it could copy back from the native buffer after the call, and it seems to do that for :pointer and not for :string).

$ chruby jruby-9.4.5.0                                                                                                  
Using jruby-9.4.5.0: jruby 9.4.5.0 (3.1.4) 2023-11-02 1abae2700f OpenJDK 64-Bit Server VM 17.0.8+7 on 17.0.8+7 +jit [x86_64-linux]
$ irb -rffi
irb(main):001:0> module MyLib; extend FFI::Library; ffi_lib 'c'; attach_function :strtok, [ :string, :string ], :pointer; end
=> #<#<Class:0x2d5d001f> address=0x7fd087303680 size=0>
irb(main):002:0> s = "hello"
=> "hello"
irb(main):003:0> MyLib.strtok(s, "e")
=> #<FFI::Pointer address=0x7fd084001000>
irb(main):004:0> s
=> "hello"

$ irb -rffi
irb(main):001:0> module MyLib; extend FFI::Library; ffi_lib 'c'; attach_function :strtok, [ :pointer, :string ], :pointer; end
=> #<#<Class:0x7e514482> address=0x7fccb9e26680 size=0>
irb(main):002:0> s = "hello"
=> "hello"
irb(main):003:0> MyLib.strtok(s, "e")
=> #<FFI::Pointer address=0x7fccb87f90f0>
irb(main):004:0> s
=> "h\u0000llo"

FYI @headius @enebo

@eregon
Copy link
Collaborator Author

eregon commented Feb 15, 2024

IOW it comes done to whether the :string type is read-only or not.
It is explicitly read-only in the docs at https://github.com/ffi/ffi/wiki/Types#types (only saw it now, sorry).

:string should be considered to be const char * and the Ruby string must not be changed as long as it's accessed by the library. If the string buffer shall be modified from C or Ruby side, use :pointer and FFI::MemoryPointer instead.

So that resolves the question.
It's unfortunate that on CRuby mutating a :string actually has effect, but OTOH requiring an extra copy to avoid that seems too heavy perf-wise.

There is still a bug of FFI on CRuby though that it doesn't invalidate the coderange around a FFI call when passed a String to a :pointer argument. i.e. the 3rd example in the issue description. To fix that I think rb_str_modify() should be called on the String passed to a :pointer arg before the native call.

graalvmbot pushed a commit to oracle/truffleruby that referenced this issue Feb 15, 2024
…f a FFI call

* See #3293 (comment)
  and ffi/ffi#1080
* As that could cause extra conversions to managed later on.
graalvmbot pushed a commit to oracle/truffleruby that referenced this issue Feb 15, 2024
…f a FFI call

* See #3293 (comment)
  and ffi/ffi#1080
* As that could cause extra conversions to managed later on.
@headius
Copy link
Contributor

headius commented Feb 15, 2024

I think it would be better to change FFI to always .dup the String passed to native, so if that memory is changed it does not affect the original String.

Nobody should be modifying a String passed through FFI and expecting the original to reflect those changes. The docs make that clear: it should be treated as a const char *. Dup'ing would avoid changes from native code leaking back into the managed heap, albeit at the cost of copying.

A new type for FFI that passes strings by reference, or like a pointer, would be the way to support mutability (or just use a pointer type today).

Sort of expected since I don't think JRuby has a way to represent a RubyString using native memory (but it could copy back from the native buffer after the call, and it seems to do that for :pointer and not for :string).

Nothing is copied for pointer; it's just a pointer to a block of native memory that can be mutated in native code or from the managed runtime. it is expected that changes on one side are immediately reflected on the other, without copying.

RubyString may have a native version soon but it has not been a priority for us (and we would probably also dup when passing to FFI).

Note that any string dup should not be done in code shared across runtimes; that would essentially result in double-copying for all strings on JRuby and for managed strings on TruffleRuby. The dup should be done in code that is specifically preparing the Ruby string for a call into native code.

@eregon
Copy link
Collaborator Author

eregon commented Feb 15, 2024

@headius Makes sense.

I don't understand this bit though:

Nothing is copied for pointer; it's just a pointer to a block of native memory that can be mutated in native code or from the managed runtime. it is expected that changes on one side are immediately reflected on the other, without copying.

How can JRuby pass a Ruby String to native code with a :pointer argument, and have the changes reflected back in the Ruby String without copying from byte[] to native and back from native to byte[]? It copies both ways, right?
We can see the changes are correctly reflected back for :pointer in my last snippet above on JRuby.

@headius
Copy link
Contributor

headius commented Feb 15, 2024

How can JRuby pass a Ruby String to native code with a :pointer argument

Ah, I thought you meant a pointer allocated through FFI, not passing a String into a pointer. Yes, we copy out and in for that case.

@larskanis
Copy link
Member

There is still a bug of FFI on CRuby though that it doesn't invalidate the coderange around a FFI call when passed a String to a :pointer argument. i.e. the 3rd example in the issue description. To fix that I think rb_str_modify() should be called on the String passed to a :pointer arg before the native call.

I don't think this should be changed, given that the :pointer type is defined as read-only for strings in the wiki:

Strings can be passed directly to C, but should be considered read only on both Ruby and C side.

Adding rb_str_modify() would mean that we expect to get the String modified. It also prohibits calls with frozen strings.

Can we close this issue?

@eregon
Copy link
Collaborator Author

eregon commented Apr 1, 2024

I don't think this should be changed, given that the :pointer type is defined as read-only for strings in the wiki:

Given the comment for the :string type:

:string should be considered to be const char * and the Ruby string must not be changed as long as it’s accessed by the library. If the string buffer shall be modified from C or Ruby side, use :pointer and FFI::MemoryPointer instead.

I read that multiple times as "if you want to mutate the string from C, use :pointer".
Even though it does say "use :pointer and FFI::MemoryPointer instead".
I can try to reword that to avoid this misreading (EDIT: done).

Also I think I read the wiki back before the 18 July 2020 change: https://github.com/ffi/ffi/wiki/Types/_compare/dea572a4f062df5a3e296b7dd91bc0ae0ff16d22...2e2903e5f8af957e7fe59e83c073f65a2a633e56:

:string | If the function modifies the string buffer use :pointer instead. That is, :string should be considered to be const char *. Similar the Ruby string must not be changed as long as it's accessed by the library.

So maybe I remembered that part, which seems to clearly indicate a String + :pointer type is OK to mutate the String.

I am worried there is code out there relying on mutations to the Strings being kept after returns from C to Ruby, because the docs used to basically say this is fine. I wonder if the FFI test suite might even test that.
The fact that CRuby JRuby and TruffleRuby all keep the mutations for the :pointer type makes this even more likely.
Also it seems Charles had the same understanding as me regarding String + :pointer in #1080 (comment).

If we want to go the road of :pointer is read-only if given a String, then I feel we should .dup for that case to enforce it and add a test that changes from C are not reflected in the Ruby String, otherwise it will accidentally appear to work but can also break the coderange like in my example at the top. And then it also means that for String + :pointer we can avoid the copy-out in that case on JRuby and using a native String on TruffleRuby (which should result in better performance for both, and less side effects on TruffleRuby).

If instead we want to let :pointer mutate the String as it already sort of works and as the docs used to say, then we should do rb_str_modify().
It is indeed problematic if called with a frozen string. I guess either we wouldn't rb_str_modify() in that case and assume the C code won't mutate (but still dangerous) or we would do the rb_str_modify() always and so raise when passed a frozen string, which seems good as it seems risky to pass a frozen string to a C function (if it's mutated and it's an interned string it could be very hard to track where it comes from).

@eregon
Copy link
Collaborator Author

eregon commented Apr 1, 2024

I was curious and used rb_str_dup() in master...eregon:ffi:defensive-copy-ruby-strings. That passes the added specs ensuring that if the string is mutated in native code it's not reflected on the Ruby String. And no spec fails with that change.
I was wondering if I would need rb_str_dup() + rb_str_modify() on the copy but it seems not (would need to check the details of rb_str_dup() to be sure though, e.g. if it makes a COW copy in some cases it would not be enough).
EDIT: a quick look at rb_str_dup() seems to show it would be COW if not an embedded string, but not sure how to trigger it I tried with a longer string but still it didn't mutate the original.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants