try_from ndarray to tensor using zero copy #841

Open
nmboavida opened this issue Jan 28, 2024 · 0 comments
nmboavida commented Jan 28, 2024

I noticed that the current try_from implementation for converting ndarrays into tensors copies the underlying data. This makes ndarray -> tensor interoperability $O(n)$; a zero-copy solution would bring it down to $O(1)$. For reference, the current implementation is:

// tensor/convert.rs
impl<T, D> TryFrom<ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        Self::try_from(&value)
    }
}

// ...

impl<T, D> TryFrom<&ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: &ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        let slice = value
            .as_slice()
            .ok_or_else(|| TchError::Convert("cannot convert to slice".to_string()))?;
        let tn = Self::f_from_slice(slice)?;
        let shape: Vec<i64> = value.shape().iter().map(|s| *s as i64).collect();
        tn.f_reshape(shape)
    }
}
// wrappers/tensor.rs
impl Tensor {
// ...
    /// Converts a slice to a tensor.
    pub fn f_from_slice<T: kind::Element>(data: &[T]) -> Result<Tensor, TchError> {
        let data_len = data.len();
        let data = data.as_ptr() as *const c_void;
        let c_tensor = unsafe_torch_err!(at_tensor_of_data(
            data,
            [data_len as i64].as_ptr(),
            1,
            T::KIND.elt_size_in_bytes(),
            T::KIND.c_int(),
        ));
        Ok(Tensor { c_tensor })
    }
}

And from tchlib/torch_api.cpp:

tensor at_tensor_of_data(void *vs, int64_t *dims, size_t ndims, size_t element_size_in_bytes, int type) {
  PROTECT(
    torch::Tensor tensor = torch::zeros(torch::IntArrayRef(dims, ndims), torch::ScalarType(type));
    if ((int64_t)element_size_in_bytes != tensor.element_size())
      throw std::invalid_argument("incoherent element sizes in bytes");
    void *tensor_data = tensor.data_ptr();
    memcpy(tensor_data, vs, tensor.numel() * element_size_in_bytes);
    return new torch::Tensor(tensor);
  )
  return nullptr;
}

This implementation is quite expensive and hurts performance compared to the Python API, which, if I am not mistaken, converts a numpy array into a tensor by reusing its data.
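
As a quick sanity check on the copying behaviour, one can compare the ndarray's buffer address with the tensor's storage address after conversion. A minimal sketch, assuming Tensor::data_ptr is available on the current tch API (the function name copies_data is just for illustration):

use std::convert::TryFrom;

use ndarray::Array2;
use tch::Tensor;

fn copies_data() -> bool {
    let nd = Array2::<f32>::zeros((8, 8));
    let nd_ptr = nd.as_ptr() as usize;
    // The existing TryFrom impl goes through f_from_slice / at_tensor_of_data.
    let tensor = Tensor::try_from(&nd).unwrap();
    // If the storage were shared, the two addresses would match.
    tensor.data_ptr() as usize != nd_ptr
}

With the current implementation this returns true, since at_tensor_of_data memcpys the data into a freshly allocated tensor.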

I am wondering if it would make sense to have an implementation similar to the one below:

fn ndarray_to_tensor<T, D>(array: ArrayBase<T, D>) -> Tensor
where
    T: ndarray::Data,
    T::Elem: kind::Element,
    D: ndarray::Dimension,
{
    let shape: Vec<i64> = array.shape().iter().map(|&s| s as i64).collect();
    let strides: Vec<i64> = array.strides().iter().map(|&s| s as i64).collect();
    let kind = get_kind::<T::Elem>();

    unsafe {
        let data_ptr = array.as_ptr();

        // Calculate the byte length of the array (sized by the element type, not the storage handle T)
        let num_bytes = array.len() * std::mem::size_of::<T::Elem>();

        // Create a byte slice from the data
        let byte_slice = std::slice::from_raw_parts(data_ptr as *const u8, num_bytes);

        // Ensure the ndarray is not dropped while the Tensor exists
        std::mem::forget(array);

        // Get the raw pointer of the byte slice
        let byte_slice_ptr = byte_slice.as_ptr();

        Tensor::from_blob(byte_slice_ptr, &shape, &strides, kind, Device::Cpu)
    }
}

pub fn get_kind<T: kind::Element>() -> Kind {
    T::KIND
}
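
For completeness, here is how a call site for the proposed function could look (a sketch; the array dimensions are arbitrary):

let nd = ndarray::Array2::<f32>::zeros((1_024, 1_024));
let tensor = ndarray_to_tensor(nd);
assert_eq!(tensor.size(), vec![1_024i64, 1_024]);
assert_eq!(tensor.kind(), tch::Kind::Float);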

The device type above is hardcoded. We could infer at runtime whether Cpu or Cuda should be used via the Rust API, but I did not find a way to do the same for the Mps or Vulkan device types. Possibly we could infer this on the C++ side at runtime?
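
Alternatively, the caller could pass the target device explicitly instead of the function guessing it. A minimal sketch (the name ndarray_to_tensor_on is hypothetical); since the ndarray buffer always lives in host memory, the zero-copy path only applies when the target is Cpu, and any other device still requires a transfer:

fn ndarray_to_tensor_on<T, D>(array: ArrayBase<T, D>, device: Device) -> Tensor
where
    T: ndarray::Data,
    T::Elem: kind::Element,
    D: ndarray::Dimension,
{
    // Wrap the host buffer without copying.
    let cpu_tensor = ndarray_to_tensor(array);
    if device == Device::Cpu {
        cpu_tensor
    } else {
        // Cuda / Mps / Vulkan targets still need an explicit device copy.
        cpu_tensor.to_device(device)
    }
}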

Performance comparison

I tested the proposed implementation vs. the current implementation and here's the average time taken to build the tensor:

For a ~40 MB tensor:

  • Current implementation: 6.581549ms
  • Proposed implementation: 37.708µs

For a ~400 MB tensor:

  • Current implementation: 153.497799ms
  • Proposed implementation: 50.508µs

For a ~800 MB tensor:

  • Current implementation: 394.885819ms
  • Proposed implementation: 68.493µs

The test I used to compute these numbers is the following (ideally we would benchmark this properly for a production solution):

use std::convert::TryFrom;
use std::time::{Duration, Instant};

use ndarray::Array3;
use tch::Tensor;

#[test]
fn from_ndarray() {
    let (nrows, ncols, ndepth) = (2_000, 500, 100);

    let iterations = 50;
    let mut total_duration_tensor = Duration::new(0, 0);
    let mut total_duration_tensor_2 = Duration::new(0, 0);

    for _ in 0..iterations {
        let nd = Array3::<f64>::zeros((nrows, ncols, ndepth));
        let nd_clone = nd.clone();

        // Timing for tensor
        let start = Instant::now();
        let tensor = Tensor::try_from(nd).unwrap();
        total_duration_tensor += start.elapsed();

        // Timing for tensor_2
        let start = Instant::now();
        let tensor_2 = ndarray_to_tensor(nd_clone);
        total_duration_tensor_2 += start.elapsed();

        // Check equality
        assert_eq!(tensor, tensor_2);
    }

    let avg_duration_tensor = total_duration_tensor / iterations;
    let avg_duration_tensor_2 = total_duration_tensor_2 / iterations;

    println!(
        "Average time taken to build tensor: {:?}",
        avg_duration_tensor
    );
    println!(
        "Average time taken to build tensor_2: {:?}",
        avg_duration_tensor_2
    );
}