Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving ergonomics of getting slices from a PyBuffer #2824

Open
itamarst opened this issue Dec 20, 2022 · 4 comments
Open

Improving ergonomics of getting slices from a PyBuffer #2824

itamarst opened this issue Dec 20, 2022 · 4 comments

Comments

@itamarst
Copy link
Contributor

I'm writing a function that accepts anything that can create a read-only buffer:

fn validate_cbor(&self, py: Python<'_>, cbor: &PyAny) -> PyResult<()> {
        let buffer = PyBuffer::<u8>::get(cbor)?;
        if !buffer.readonly() {
            return Err(PyValueError::new_err("Must be read-only byte buffer"));
        }
        let slice = buffer
            .as_slice(py)
            .ok_or_else(|| PyTypeError::new_err("Must be a contiguous sequence of bytes"))?;

        // Safety: The slice is &[ReadOnlyCell<u8>]. A ReadOnlyCell has the same
        // memory representation as the underlying data; it's
        // #[repr(transparent)] newtype around UnsafeCell. And per Rust docs
        // "UnsafeCell<T> has the same in-memory representation as its inner
        // type T". So the main issue is whether the data is _really_ read-only.
        // We do the read-only check above, and yes a caller can probably somehow
        // lie, but if they do that, that's really their fault.
        let cbor: &[u8] = unsafe { std::mem::transmute(slice) };
        
        // ... custom code here
}

It's not clear to me why PyBytes::as_bytes() shouldn't return &[ReadOnlyCell<u8>] too. After all, there is nothing preventing a particularly malicious user from overwriting the data in a Python bytes object, just need to write a little bit of C. We need to trust callers to some extent in a world where they can write C.

So my feeling is that as_slice() can just return Some(&[T]), unless there's some reason to expect buffer API implementations to lie about their readonly flag.

As a follow-up, if that makes sense then maybe the pattern above could be simplified and built-in to PyO3.

@itamarst
Copy link
Contributor Author

Taking a step back: accepting read-only buffers seems like a reasonably common use case for people who want memory-efficient APIs, so it'd be nice to have good ergonomics, especially for people who haven't heard about transmute and/or don't feel confident reasoning about whether they can use it.

@adamreichold
Copy link
Member

After all, there is nothing preventing a particularly malicious user from overwriting the data in a Python bytes object, just need to write a little bit of C. We need to trust callers to some extent in a world where they can write C.

Writing C code is not required, at least when NumPy is involved: If I change your example into

fn validate_cbor(py: Python<'_>, cbor: &PyAny, cb: &PyAny) -> PyResult<()> {
    let buffer = PyBuffer::<u8>::get(cbor)?;
    if !buffer.readonly() {
        return Err(PyValueError::new_err("Must be read-only byte buffer"));
    }
    let slice = buffer
        .as_slice(py)
        .ok_or_else(|| PyTypeError::new_err("Must be a contiguous sequence of bytes"))?;

    let before = slice.iter().map(|cell| cell.get()).collect::<Vec<_>>();

    cb.call0()?;

    let after = slice.iter().map(|cell| cell.get()).collect::<Vec<_>>();

    assert_eq!(before, after);

    Ok(())
}

to include a callback cb which enables interleaving of Rust and Python code, then checking that before and after are equals can fail if called like

def test_validate_cbor():
    x = np.arange(0, 10, dtype=np.uint8)

    y = x.view()
    y.flags.writeable = False

    def cb():
        x[3] = 42

    validate_cbor(y, cb)

yielding

>       validate_cbor(y, cb)
E       pyo3_runtime.PanicException: assertion failed: `(left == right)`
E         left: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`,
E        right: `[0, 1, 2, 42, 4, 5, 6, 7, 8, 9]`

which would not be sound if slice: &[T] instead of slice: &[ReadonlyCell<T>].

Of course, the view y into x "lies" about its read-only flag in a certain sense, i.e. one cannot manipulate the underlying memory via y, e.g.

>       y[3] = 23
E       ValueError: assignment destination is read-only

but it can still be manipulated using Python code via x.

For this reason, rust-numpy dynamically borrow checks all memory accesses done by safe Rust code so that safe Rust code cannot be the originator of memory unsafety, but considers both other native extension modules as well as Python code unsafe, meaning trusted to manually uphold aliasing discipline, c.f. https://docs.rs/numpy/latest/numpy/borrow/index.html

@itamarst
Copy link
Contributor Author

Hm, yeah, NumPy views are a compelling argument that you can't trust the readonly flag in the general case.

So perhaps a more modest approach would be just keeping people from having to do transmutes themselves?

use std::mem::size_of;

impl<T> PyBuffer<T> {
    /// Safety: ... document constraints here ...
    unsafe fn as_regular_slice(&self) -> Option<&[T]> {
        if !self.readonly() {
            return None;
        }
        let slice = self.as_slice()?;
        const _: () = assert!(
            size_of::<ReadOnlyCell<T>>() == size_of::<T>(),
            "ReadOnlyCell<T> is different size than T"
        );
        Some(unsafe { std::mem::transmute::<&[ReadOnlyCell<T>], &[T]>(slice) })
    }
}

@itamarst itamarst changed the title Maybe PyBuffer::as_slice() should just return Option<&[T]> Improving ergonomics of getting slices from a PyBuffer Dec 21, 2022
@adamreichold
Copy link
Member

Hm, yeah, NumPy views are a compelling argument that you can't trust the readonly flag in the general case.

As hinted at above, for rust-numpy we took the opposite approach. We do trust Python code as we trust other "unsafe" code. (We consider the read-only flag, but only to prevent safe Rust code writing to read-only buffers, not to assume immutability.)

So perhaps a more modest approach would be just keeping people from having to do transmutes themselves?

Yeah, I think such methods make sense, but they should come with extensive documentation on the issue of shared mutability in Python code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants