
np.einsum_path vs TensorOperations #156

Open
ejmeitz opened this issue Dec 7, 2023 · 3 comments

ejmeitz commented Dec 7, 2023

Hello,

I was wondering if there is much difference between the contraction order given by numpy.einsum_path and what is being done inside of TensorOperations? I've looked at the output of @macroexpand and it's not super clear to me what is being done. The specific contraction I'm looking at is:

`@macroexpand @tensor K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l]`

and I'm just trying to understand how things are being multiplied under the hood so I can re-implement this on the GPU in a way that takes advantage of sparsity. I've coded up what the numpy command outputs, and TensorOperations is still much faster. I believe they have the same computational complexity (O(N^4)) and similar memory footprints.
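(For reference, a minimal sketch of how one might inspect what the macro generates; the tensor values are not needed, since macro expansion is purely syntactic:)

```julia
using TensorOperations

# Print the code the @tensor macro generates for this contraction; the
# expansion shows the pairwise contraction calls it lowers to.
expr = @macroexpand @tensor K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l]
println(expr)
```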

lkdvos (Collaborator) commented Dec 8, 2023

For the contraction you give, there is actually no automatic optimization of the contraction order; the standard left-to-right multiplication is carried out. Nevertheless, you can ask for an optimized version, with various different measures to optimize over. You can find this in some more detail in this part of the documentation, but if anything is not clear, do let us know!
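(As a rough sketch of what requesting an optimized order can look like, following the cost-specification syntax in the TensorOperations documentation; the exact form may differ between versions, and the index costs below are placeholders rather than values from this issue:)

```julia
using TensorOperations

# Sketch: give @tensoropt a (placeholder) cost per index so it can choose a
# cheaper pairwise contraction order at macro-expansion time.
function K_opt(psi, P)
    @tensoropt (i => 100, j => 100, k => 100, n => 100, m => 100, l => 100) K_true[n,m,l] := psi[i,j,k] * P[i,n] * P[j,m] * P[k,l]
    return K_true
end
```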

For the rest, what is going on under the hood can be summarised as taking the given contraction and rewriting it in terms of the pairwise operations that are defined in the interface. In that sense, by defining the functions of the interface for your sparse array type, you could just use TensorOperations as is. As a small side note, GPU support is already in there, so you could also have a look at the cuTENSOR extension if you want inspiration as to what it takes to define a new backend/support a new type.
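(To illustrate the pairwise-operation idea, and where sparsity could enter, here is a minimal sketch that is not TensorOperations' internal code: each pairwise step amounts to a permuted/reshaped matrix multiplication, so a sparse storage format for psi could plug in at exactly that point.)

```julia
# Sketch only: the first pairwise step, psi[i,j,k] * P[i,n] -> K1[n,j,k],
# written as a reshaped matrix multiplication. A sparse matmul could be
# substituted here if psi were stored in a sparse format.
function contract_first_index(psi::AbstractArray{<:Number,3}, P::AbstractMatrix)
    Ni, Nj, Nk = size(psi)
    Nn = size(P, 2)
    M  = reshape(psi, Ni, Nj * Nk)   # group (j,k) into one column index
    K1 = transpose(P) * M            # K1[n,(j,k)] = sum_i P[i,n] * psi[i,(j,k)]
    return reshape(K1, Nn, Nj, Nk)   # back to a rank-3 array K1[n,j,k]
end
```

The remaining two steps have the same structure after permuting the next contracted index to the front.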

Hope this helps!

ejmeitz (Author) commented Dec 8, 2023

I guess left-to-right multiplication is still O(N^4) complexity, so that's fine. @tensor and @tensoropt give the same results.
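(To spell out the count, assuming every index has dimension N: each pairwise step contracts an N×N×N tensor with an N×N matrix, which is roughly N^4 multiply-adds, and both the left-to-right order and the einsum_path order consist of three such steps with O(N^3) intermediates, so both are O(N^4).)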

I also already use the GPU version, but I'm running into memory issues as my tensors are kind of big (it would be amazing if TensorOperations could automatically distribute an operation across GPUs). Taking advantage of sparsity isn't really faster, but it sure as hell helps with memory. I'm not sure I can overload the TensorOperations interface to use sparsity, since I want to use the GPU. Do you have any recommendations for reducing memory on the GPU (that psi is 99.9% zeros)?

While I'm on the topic of memory, I noticed that TensorOperations is missing out on some optimizations to reduce memory. I've pasted my two functions below. One is a simple one-liner and the other is the exact same contraction, but split into the parts that np.einsum_path prints out. In the split version I was able to re-use one of the intermediate arrays, which TensorOperations does not do. This reduced the memory by 1/3.

For example,

function K_tensopt(psi, P)
    @tensoropt begin
        K_true[n,m,l] := psi[i,j,k]*P[i,n]*P[j,m]*P[k,l]
    end
end

[screenshot: benchmark output for K_tensopt]

Whereas this function, which is equivalent (and is also just the sequence of contractions that numpy spits out), reuses one of the intermediates:

function K_tensopt_parts(psi, phi)

    @tensor begin
        K1[j,k,n] := psi[i,j,k]*phi[i,n]
    end

    @tensor begin
        K2[k,m,n] := K1[j,k,n]*phi[j,m]
    end 

    fill!(K1, 0.0)

    @tensor begin
        K1[n,m,l] = K2[k,m,n]*phi[k,l]
    end

    return K1
end

[screenshot: benchmark output for K_tensopt_parts]
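(A quick sanity check that the split version matches the one-liner; the sizes here are hypothetical, and all dimensions need to be equal for the K1 buffer reuse to be valid:)

```julia
# Hypothetical sizes; with N = 8 both functions return an 8×8×8 array.
psi = rand(8, 8, 8)
P   = rand(8, 8)
@assert K_tensopt(psi, P) ≈ K_tensopt_parts(psi, P)
```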

lkdvos (Collaborator) commented Dec 8, 2023

Thanks for showing that, that's quite interesting! We've never really considered reusing memory from the temporary intermediate objects within the same contraction. The biggest struggle with that is that the @tensor macro works at compile time, which means it does not have access to any of the array sizes. As far as I understand, the only reason you can reuse that memory is that you know all of these indices have the same dimensions (here psi is N×N×N and P is N×N), which is not true in generic contractions. Nevertheless, in this case you do have the additional information that at least some of the indices are equal, because your input tensors are used repeatedly, so there might be something you can do there. Sadly, this is not something we have implemented, and I am not sure there is a very quick fix.

Regarding the issue with memory usage: this is definitely an active and ongoing topic of interest. We rewrote part of the machinery to be able to experiment with different backends and allocation strategies, specifically to be able to address this issue. I would recommend having a look at AllocationKit.jl, where I have put some more information as well. These efforts are primarily focused on CPUs at the moment, but it might be the case that one of these approaches also works for reducing the GPU memory pressure.

In any case, if you do figure it out, or have some more questions, definitely don't hesitate to contact me; I am also very interested, both in working with GPUs and in dealing with memory pressure.
