
Odd overwriting of tensor values while using LibTorch backend #1430

Closed
MichaelGoodale opened this issue Mar 7, 2024 · 2 comments · Fixed by #1434
Labels: bug (Something isn't working)

@MichaelGoodale

Hi, I have a rather odd problem where a tensor's values are sometimes overwritten by the result of a later calculation. It only happens with the LibTorch backend, not with the NdArray backend (the only other one I've tested).

It seems to be caused by some combination of reshaping and taking the exponent: sometimes, after taking a slice of a tensor (which is cloned), the original values are replaced by the exponent of the slice. It was a gnarly bug to track down in my code, but I managed to reduce it to a minimal example.

// Imports assumed for burn 0.12; exact paths may vary slightly by version.
use anyhow::Result;
use burn::backend::libtorch::LibTorchDevice;
use burn::backend::ndarray::NdArrayDevice;
use burn::backend::{LibTorch, NdArray};
use burn::tensor::Tensor;

#[test]
fn bizarre() -> Result<()> {
    let zeros = Tensor::<NdArray, 1>::zeros([2], &NdArrayDevice::default());
    zeros.clone().slice([1..2]).reshape([1]).exp();
    // Works as expected
    assert_eq!(
        zeros.to_data(),
        Tensor::<NdArray, 1>::zeros([2], &NdArrayDevice::default()).to_data()
    );

    let zeros = Tensor::<LibTorch, 1>::zeros([2], &LibTorchDevice::default());
    zeros.clone().slice([1..2]).reshape([1]).clone().exp();
    // Works as expected thanks to the second clone after reshaping.
    assert_eq!(
        zeros.to_data(),
        Tensor::<LibTorch, 1>::zeros([2], &LibTorchDevice::default()).to_data()
    );

    let zeros = Tensor::<LibTorch, 1>::zeros([2], &LibTorchDevice::default());
    zeros.clone().slice([1..2]).reshape([1]).exp();
    // Doesn't work: zeros ends up equal to [0.0, 1.0]
    assert_eq!(
        zeros.to_data(),
        Tensor::<LibTorch, 1>::zeros([2], &LibTorchDevice::default()).to_data()
    );

    Ok(())
}

The first two asserts pass, whereas the last one fails: the second value of zeros has been replaced by e^0 = 1. (If I try with different values, the overwritten number is consistently equal to e^x.)

I have no idea what could be causing this, but it happens both with the current release (0.12.1) and with the current commit on the main branch.

Thanks!

@antimora antimora added the bug Something isn't working label Mar 7, 2024
antimora (Collaborator) commented Mar 7, 2024

Tagging @nathanielsimard. Do you think there is a cloning problem with Tch Tensors like we had in the past?

nathanielsimard (Member) commented:

Thanks, @MichaelGoodale, for reporting the bug. We track the number of references on each tensor and reuse some buffers in place for improved performance and memory usage, without you having to do anything. However, there may be a bug in one of the operations that invalidates that state. With your example, it should be easy to fix. Thanks.
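The reference-counting strategy described in the comment above can be sketched in plain Rust. This is a hypothetical illustration, not Burn's actual implementation: an operation checks whether it holds the sole reference to a buffer (here via Arc::try_unwrap) and mutates in place only in that case, otherwise it copies. The bug in this issue corresponds to an operation taking the in-place path while another handle to the same buffer still exists.

```rust
use std::sync::Arc;

// Hypothetical sketch of refcount-based in-place reuse (not Burn's real code).
// If this is the only owner of the buffer, apply exp() in place; otherwise
// copy first so other handles never observe the mutation.
fn exp_maybe_in_place(data: Arc<Vec<f32>>) -> Arc<Vec<f32>> {
    match Arc::try_unwrap(data) {
        // Sole owner: mutating in place is unobservable from outside.
        Ok(mut owned) => {
            for x in owned.iter_mut() {
                *x = x.exp();
            }
            Arc::new(owned)
        }
        // Another handle still points at this buffer: copy, then apply.
        Err(shared) => Arc::new(shared.iter().map(|x| x.exp()).collect()),
    }
}

fn main() {
    let zeros = Arc::new(vec![0.0_f32, 0.0]);
    let alias = Arc::clone(&zeros); // second owner: in-place reuse must not happen
    let result = exp_maybe_in_place(alias);
    assert_eq!(*result, vec![1.0, 1.0]); // exp(0) = 1
    assert_eq!(*zeros, vec![0.0, 0.0]);  // original buffer untouched
    println!("ok");
}
```

If the refcount check were skipped (or an operation failed to account for a live handle, as suspected here), the in-place branch would run and the original zeros would also become [1.0, 1.0], which is exactly the symptom in the failing assert.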
