I noticed that the current `try_from` implementation for converting ndarrays into tensors copies the underlying data. This makes the ndarray --> tensor conversion $O(n)$; if we implement a zero-copy solution we could bring it down to $O(1)$. For reference, the current implementation is:
```rust
// tensor/convert.rs
impl<T, D> TryFrom<ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        Self::try_from(&value)
    }
}

// ...

impl<T, D> TryFrom<&ndarray::ArrayBase<T, D>> for Tensor
where
    T: ndarray::Data,
    T::Elem: Element,
    D: ndarray::Dimension,
{
    type Error = TchError;

    fn try_from(value: &ndarray::ArrayBase<T, D>) -> Result<Self, Self::Error> {
        let slice = value
            .as_slice()
            .ok_or_else(|| TchError::Convert("cannot convert to slice".to_string()))?;
        let tn = Self::f_from_slice(slice)?;
        let shape: Vec<i64> = value.shape().iter().map(|s| *s as i64).collect();
        tn.f_reshape(shape)
    }
}
```
```rust
// wrappers/tensor.rs
impl Tensor {
    // ...

    /// Converts a slice to a tensor.
    pub fn f_from_slice<T: kind::Element>(data: &[T]) -> Result<Tensor, TchError> {
        let data_len = data.len();
        let data = data.as_ptr() as *const c_void;
        let c_tensor = unsafe_torch_err!(at_tensor_of_data(
            data,
            [data_len as i64].as_ptr(),
            1,
            T::KIND.elt_size_in_bytes(),
            T::KIND.c_int(),
        ));
        Ok(Tensor { c_tensor })
    }
}
```
and from `tchlib/torch_api.cpp`:
```cpp
tensor at_tensor_of_data(void *vs, int64_t *dims, size_t ndims,
                         size_t element_size_in_bytes, int type) {
  PROTECT(
    torch::Tensor tensor = torch::zeros(torch::IntArrayRef(dims, ndims), torch::ScalarType(type));
    if ((int64_t)element_size_in_bytes != tensor.element_size())
      throw std::invalid_argument("incoherent element sizes in bytes");
    void *tensor_data = tensor.data_ptr();
    memcpy(tensor_data, vs, tensor.numel() * element_size_in_bytes);
    return new torch::Tensor(tensor);
  )
  return nullptr;
}
```
This implementation is quite expensive and hurts performance compared to the Python API which, if I am not mistaken, can convert a numpy array into a tensor by reusing its data (`torch.from_numpy` shares the underlying buffer rather than copying it).
I am wondering if it would make sense to have an implementation similar to the below:
```rust
fn ndarray_to_tensor<T, D>(array: ArrayBase<T, D>) -> Tensor
where
    T: ndarray::Data,
    T::Elem: kind::Element,
    D: ndarray::Dimension,
{
    let shape: Vec<i64> = array.shape().iter().map(|&s| s as i64).collect();
    let strides: Vec<i64> = array.strides().iter().map(|&s| s as i64).collect();
    let kind = get_kind::<T::Elem>();
    unsafe {
        let data_ptr = array.as_ptr();
        // Calculate the byte length of the array. Note: size_of::<T::Elem>(),
        // not size_of::<T>() -- T is the data-ownership marker, not the element type.
        let num_bytes = array.len() * std::mem::size_of::<T::Elem>();
        // Create a byte slice from the data
        let byte_slice = std::slice::from_raw_parts(data_ptr as *const u8, num_bytes);
        // Ensure the ndarray is not dropped while the Tensor exists
        std::mem::forget(array);
        // Hand the raw bytes to libtorch without copying
        Tensor::from_blob(byte_slice.as_ptr(), &shape, &strides, kind, Device::Cpu)
    }
}

pub fn get_kind<T: kind::Element>() -> Kind {
    T::KIND
}
```
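One caveat with the sketch above: `std::mem::forget(array)` leaks the buffer unless the tensor's destructor eventually frees it (libtorch's `from_blob` can accept a deleter callback for exactly this purpose). The ownership hand-off can be modeled in plain std-only Rust as a guard that reconstructs and drops the forgotten allocation; all names below are hypothetical, for illustration:

```rust
// Hypothetical guard modeling a from_blob-style deleter: it owns the raw
// parts of a Vec whose normal drop was suppressed, and frees them when the
// guard itself is dropped -- so the buffer is reclaimed, not leaked.
struct BlobGuard {
    ptr: *mut f64,
    len: usize,
    cap: usize,
}

impl Drop for BlobGuard {
    fn drop(&mut self) {
        // Safety: ptr/len/cap came from a Vec whose drop we suppressed below.
        unsafe { drop(Vec::from_raw_parts(self.ptr, self.len, self.cap)) };
    }
}

fn leak_then_reclaim(v: Vec<f64>) -> BlobGuard {
    // Like mem::forget, but keeps access to the raw parts afterwards.
    let mut v = std::mem::ManuallyDrop::new(v);
    BlobGuard { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() }
}
```

In a real solution the deleter would be registered with the tensor so that libtorch calls it when the tensor's refcount hits zero.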
The device type above is hardcoded, though we could infer at runtime whether the device is Cpu or Cuda using the Rust API. However, I did not find a way to tell whether the type is Mps or Vulkan. Possibly we could infer this at C++ runtime?
### Performance comparison
I tested the proposed implementation vs. the current implementation and here's the average time taken to build the tensor:
For a ~40 MB tensor:
- Current implementation: 6.581549ms
- Proposed implementation: 37.708µs

For a ~400 MB tensor:
- Current implementation: 153.497799ms
- Proposed implementation: 50.508µs

For a ~800 MB tensor:
- Current implementation: 394.885819ms
- Proposed implementation: 68.493µs
The test I used to compute these is the following (ideally we would benchmark this properly for a production solution):
```rust
#[test]
fn from_ndarray() {
    let (nrows, ncols, ndepth) = (2_000, 500, 100);
    let iterations = 50;
    let mut total_duration_tensor = Duration::new(0, 0);
    let mut total_duration_tensor_2 = Duration::new(0, 0);
    for _ in 0..iterations {
        let nd = Array3::<f64>::zeros((nrows, ncols, ndepth));
        let nd_clone = nd.clone();

        // Timing for tensor
        let start = Instant::now();
        let tensor = Tensor::try_from(nd).unwrap();
        total_duration_tensor += start.elapsed();

        // Timing for tensor_2
        let start = Instant::now();
        let tensor_2 = ndarray_to_tensor(nd_clone);
        total_duration_tensor_2 += start.elapsed();

        // Check equality
        assert_eq!(tensor, tensor_2);
    }
    let avg_duration_tensor = total_duration_tensor / iterations;
    let avg_duration_tensor_2 = total_duration_tensor_2 / iterations;
    println!(
        "Average time taken to build tensor: {:?}",
        avg_duration_tensor
    );
    println!(
        "Average time taken to build tensor_2: {:?}",
        avg_duration_tensor_2
    );
}
```
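For a production decision it would be worth running this under a proper harness: warm up first, collect many samples, and report the median rather than the mean, which is less sensitive to outliers. A std-only sketch of that shape (the `bench` helper is hypothetical):

```rust
use std::time::{Duration, Instant};

// Minimal std-only micro-benchmark helper: run the closure a few times to
// warm caches and allocators, then time `samples` runs and return the median.
fn bench<F: FnMut()>(warmup: u32, samples: usize, mut f: F) -> Duration {
    for _ in 0..warmup {
        f();
    }
    let mut times: Vec<Duration> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    times.sort();
    times[samples / 2]
}
```

In practice a crate like criterion provides this (plus statistical analysis and outlier detection) out of the box.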