<!-- <center> Joaquin Peñuela-Parra </center>
<center> Department of Mechanical Engineering and Materials Science </center>
<center> University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA </center>  -->

$$ \textrm{Joaquin Penuela-Parra} $$
$$ \textrm{Department of Mechanical Engineering and Materials Science} $$
$$ \textrm{University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA} $$ 

In [16]:
using ITensors
using CUDA #Nvidia GPU

In [17]:
CUDA.devices()

CUDA.DeviceIterator() for 1 devices:
0. NVIDIA GeForce RTX 4070 Laptop GPU

In [18]:
CUDA.memory_status()

Effective GPU memory usage: 56.37% (4.507 GiB/7.996 GiB)
Memory pool usage: 281.496 MiB (3.344 GiB reserved)


To use ITensors on a GPU, we only need to call cu() or NDTensors.cu() on an ITensor, which creates the same ITensor object with its data stored in GPU memory as a CuArray. The only difference between the two functions is that NDTensors.cu() preserves the element type of the tensor, while cu() converts it to single precision. However, **single precision can cause problems in DMRG or TEBD** algorithms: https://itensor.discourse.group/t/tebd-with-gpu-error-with-eigen/1266/5
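One plausible reason single precision is delicate in these algorithms is simply its limited resolution. The sketch below is plain Julia (no GPU needed) and is not taken from the linked thread. The value `small` is a hypothetical stand-in for a small singular value or truncation cutoff.

```julia
# Machine epsilon: the spacing between 1 and the next representable number.
@show eps(Float32)
@show eps(Float64)

small = 1e-8  # hypothetical weight, e.g. a small singular value or a truncation cutoff

# In single precision the small contribution is rounded away entirely...
lost_in_single = Float32(1) + Float32(small) == Float32(1)
# ...while double precision still resolves it.
kept_in_double = 1.0 + small != 1.0

@show lost_in_single kept_in_double
```

Quantities of this size appear routinely as truncation cutoffs, so using NDTensors.cu() to keep Float64 is the safer default when accuracy matters.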

**Block Sparse contraction**

Although block-sparse contractions on the GPU are still in development (https://itensor.discourse.group/t/ann-initial-release-of-new-itensor-gpu-backends/1227/3), we can already see some advantage from using the GPU:

In [19]:
#woGPU
i = Index([QN(0) => 1000, QN(1) => 1000]);
A = randomITensor(i', dag(i));

@time (A)'*(A);

  0.026541 seconds (64 allocations: 30.526 MiB)


In [20]:
typeof(A)

ITensor

In [21]:
#wGPU
A = cu(A)

@time (A)'*(A)

A_cpu = NDTensors.cpu(A) #This is how you can go back to the cpu.
@time (A_cpu)'*(A_cpu);


  0.000313 seconds (196 allocations: 11.594 KiB)
  0.015404 seconds (64 allocations: 15.267 MiB)
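A caveat on reading GPU timings like the one above: GPU work is generally launched asynchronously, so a plain `@time` may stop the clock before the device has finished. Note also that cu() converts to Float32, so the GPU run is compared with a Float64 CPU run. The sketch below shows one way to time the same contraction more conservatively, assuming `CUDA.@sync` is applied to the whole expression. The warm-up call and the `CUDA.functional()` guard are additions for illustration, not part of the original cells.

```julia
using ITensors
using CUDA

i = Index([QN(0) => 1000, QN(1) => 1000])
A_cpu = randomITensor(i', dag(i))

if CUDA.functional()            # skip the GPU part on machines without a usable GPU
    A = cu(A_cpu)               # as above: moves to the GPU and converts to Float32
    A' * A                      # warm-up call, so compilation is not part of the timing
    @time CUDA.@sync A' * A     # block until the GPU has finished before stopping the clock
end
```

Functions that return a plain number, such as inner(), have to wait for the result anyway, so their timings are less affected by this.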


In [22]:
@show A #We can see that it is in the GPU

ITensor ord=2
(dim=2000|id=848)' <Out>
 1: QN(0) => 1000
 2: QN(1) => 1000
(dim=2000|id=848) <In>
 1: QN(0) => 1000
 2: QN(1) => 1000
NDTensors.BlockSparse{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}, 2}

**Note:** If A is stored as a CuArray and we define another variable in terms of A, that variable is also stored as a CuArray:

In [23]:
C = 2*A - 3*A

ITensor ord=2
(dim=2000|id=848)' <Out>
 1: QN(0) => 1000
 2: QN(1) => 1000
(dim=2000|id=848) <In>
 1: QN(0) => 1000
 2: QN(1) => 1000
NDTensors.BlockSparse{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}, 2}

In [24]:
@time (C)'*(C)

ITensor ord=2
(dim=2000|id=848)'' <Out>
 1: QN(0) => 1000
 2: QN(1) => 1000
(dim=2000|id=848) <In>
 1: QN(0) => 1000
 2: QN(1) => 1000
NDTensors.BlockSparse{Float32, CuArray{Float32, 1, CUDA.DeviceMemory}, 2}

  0.000338 seconds (196 allocations: 11.594 KiB)


**Expectation value contractions (inner and apply functions)**

Pure states: $\langle A \rangle = \langle \Psi | A | \Psi \rangle$ 

In [25]:
#woGPU
sites = siteinds("S=1/2",50)
bond_dimension = 400 #The difference between times increases with this value.

A = randomMPS(sites, bond_dimension)
O = randomMPO(sites)

@time inner(A, apply(O, A))

3.453439719106022e-16

  8.899990 seconds (65.65 k allocations: 3.411 GiB, 4.11% gc time)


In [26]:
@time inner(A, apply(O, A))

  9.481926 seconds (65.65 k allocations: 3.411 GiB, 2.39% gc time)


3.453439719106022e-16

In [27]:
#wGPU
A = NDTensors.cu(A)
O = NDTensors.cu(O)

@time inner(A, apply(O, A))

  1.775505 seconds (159.57 k allocations: 19.890 MiB)


3.453439710454487e-16

Mixed states: $\langle A \rangle = \textrm{Tr} (\rho A)$. As an example, take $\rho$ to be the identity matrix and $A$ a random operator.

In [28]:
#woGPU
ρ = MPO(sites, "Id")
O = randomMPO(sites)

@time tr(apply(ρ, O))

  0.041562 seconds (113.03 k allocations: 30.104 MiB)


4.677849879209959e-18

In [29]:
#wGPU
ρ = NDTensors.cu(ρ)
O = NDTensors.cu(O)

@time tr(apply(ρ, O))

# We can also perform the trace manually by contracting the bra and ket indices of each site with delta tensors. This takes a similar time:
I = MPO(sites, "Id") #Contains all the delta tensors.
I = NDTensors.cu(I)

@time inner(I, apply(ρ, O))

  0.226028 seconds (257.09 k allocations: 33.948 MiB)
  0.230393 seconds (252.35 k allocations: 36.490 MiB)


4.677849879209973e-18

**Why is it not faster?** I think that in this case we see no advantage because the bond dimension of $ρ$ is 1, so the tensors are too small to benefit from the GPU. With an MPO of larger bond dimension, we would probably see an advantage, as in the MPS case.

In [30]:
ρ

MPO
[1] ((dim=2|id=113|"S=1/2,Site,n=1")', (dim=2|id=113|"S=1/2,Site,n=1"), (dim=1|id=238|"Link,l=1"))
[2] ((dim=2|id=398|"S=1/2,Site,n=2")', (dim=2|id=398|"S=1/2,Site,n=2"), (dim=1|id=441|"Link,l=2"), (dim=1|id=238|"Link,l=1"))
[3] ((dim=2|id=242|"S=1/2,Site,n=3")', (dim=2|id=242|"S=1/2,Site,n=3"), (dim=1|id=11|"Link,l=3"), (dim=1|id=441|"Link,l=2"))
[4] ((dim=2|id=217|"S=1/2,Site,n=4")', (dim=2|id=217|"S=1/2,Site,n=4"), (dim=1|id=298|"Link,l=4"), (dim=1|id=11|"Link,l=3"))
[5] ((dim=2|id=697|"S=1/2,Site,n=5")', (dim=2|id=697|"S=1/2,Site,n=5"), (dim=1|id=248|"Link,l=5"), (dim=1|id=298|"Link,l=4"))
[6] ((dim=2|id=839|"S=1/2,Site,n=6")', (dim=2|id=839|"S=1/2,Site,n=6"), (dim=1|id=39|"Link,l=6"), (dim=1|id=248|"Link,l=5"))
[7] ((dim=2|id=267|"S=1/2,Site,n=7")', (dim=2|id=267|"S=1/2,Site,n=7"), (dim=1|id=365|"Link,l=7"), (dim=1|id=39|"Link,l=6"))
[8] ((dim=2|id=562|"S=1/2,Site,n=8")', (dim=2|id=562|"S=1/2,Site,n=8"), (dim=1|id=247|"Link,l=8"), (dim=1|id=365|"Link,l=7"))
[9] ((dim=2|id=519|
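Rather than reading the link dimensions off the printed MPO, they can be queried directly. A minimal sketch, assuming the same ITensors version as the rest of this notebook, in which `maxlinkdim` is available after `using ITensors`:

```julia
using ITensors

sites = siteinds("S=1/2", 50)
ρ = MPO(sites, "Id")
O = randomMPO(sites)

# Largest link (bond) dimension of each MPO; small values mean small tensors.
@show maxlinkdim(ρ)
@show maxlinkdim(O)
```

If both values are small, every contraction involves tiny tensors, and the overhead of launching GPU operations is likely to outweigh any gain.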

**Garbage Collection** (references: https://cuda.juliagpu.org/stable/usage/memory/ and https://discourse.julialang.org/t/any-way-to-delete-an-object-and-free-memory/53600)

GPU memory usage can be monitored with the Windows Task Manager or with CUDA.memory_status():

In [31]:
CUDA.memory_status()

Effective GPU memory usage: 56.37% (4.507 GiB/7.996 GiB)
Memory pool usage: 3.223 GiB (3.344 GiB reserved)


In [None]:
#woGPU
sites = siteinds("S=1/2",50)
bond_dimension = 500 #The difference between times increases with this value.

A = randomMPS(sites, bond_dimension)
O = randomMPO(sites)

@time begin 
    inner(A, apply(O, A))
end

In [32]:
#wGPU
A = NDTensors.cu(A)
O = NDTensors.cu(O)

@time inner(A, apply(O, A))

  1.531794 seconds (159.59 k allocations: 19.890 MiB, 2.34% gc time)


-1.5617907963743223e-16

Because A has a large bond dimension (set by bond_dimension above), the GPU needed a lot of memory to perform the last operation:

In [33]:
CUDA.memory_status()

Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 2.639 GiB (3.875 GiB reserved)


If we run the code again, we must be careful: it can become slow once the GPU memory is exhausted, because ITensors has not yet cleaned up all of the cached memory from the previous runs.

In [34]:
for i = 1:10
    @time begin 
        inner(A, apply(O, A))
        CUDA.memory_status()    
        # CUDA.reclaim()
        # CUDA.memory_status()    
    end
end

Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 2.446 GiB (3.875 GiB reserved)
  1.546038 seconds (159.65 k allocations: 19.895 MiB, 0.79% gc time)
Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 2.322 GiB (3.875 GiB reserved)
  1.526758 seconds (159.79 k allocations: 19.906 MiB, 0.33% gc time)
Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 2.198 GiB (3.875 GiB reserved)
  1.508424 seconds (159.78 k allocations: 19.906 MiB, 0.18% gc time)
Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 2.074 GiB (3.875 GiB reserved)
  1.502198 seconds (159.78 k allocations: 19.906 MiB, 0.19% gc time)
Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 1.950 GiB (3.875 GiB reserved)
  1.494142 seconds (159.91 k allocations: 19.915 MiB, 0.19% gc time)
Effective GPU memory usage: 63.02% (5.039 GiB/7.996 GiB)
Memory pool usage: 1.826 GiB (3.875 GiB reserved)
  1.502902 se

When the dedicated GPU memory runs out, the computer starts using shared GPU memory (i.e. RAM). This creates a bottleneck, and the calculation can become even slower than running on the CPU alone.

In [134]:
CUDA.memory_status()

Effective GPU memory usage: 54.11% (17.174 GiB/31.739 GiB)
Memory pool usage: 5.751 GiB (16.719 GiB reserved)


To recover all the memory used to cache the calculation, we need to do the following:

In [23]:
@time GC.gc(true)
@time CUDA.reclaim()

  0.232856 seconds (99.16% gc time)
  0.045881 seconds (10 allocations: 608 bytes)


In [130]:
CUDA.memory_status()

Effective GPU memory usage: 19.84% (6.299 GiB/31.739 GiB)
Memory pool usage: 5.751 GiB (5.844 GiB reserved)


What if we clean up at every iteration of the loop?

In [41]:
for i = 1:8
    @time begin 
        inner(A, apply(O, A))
        CUDA.memory_status()    
    end
    @time begin
        GC.gc(true)
        CUDA.reclaim()
    end
end

Effective GPU memory usage: 22.98% (7.293 GiB/31.739 GiB)
Memory pool usage: 6.901 GiB (6.906 GiB reserved)
  0.796158 seconds (178.78 k allocations: 20.643 MiB)
  0.466342 seconds (10 allocations: 608 bytes, 93.54% gc time)
Effective GPU memory usage: 22.98% (7.293 GiB/31.739 GiB)
Memory pool usage: 6.901 GiB (6.906 GiB reserved)
  0.801689 seconds (178.77 k allocations: 20.643 MiB)
  0.492212 seconds (10 allocations: 608 bytes, 94.12% gc time)
Effective GPU memory usage: 22.98% (7.293 GiB/31.739 GiB)
Memory pool usage: 6.901 GiB (6.906 GiB reserved)
  0.814620 seconds (178.77 k allocations: 20.644 MiB)
  0.518652 seconds (10 allocations: 608 bytes, 93.98% gc time)
Effective GPU memory usage: 22.98% (7.293 GiB/31.739 GiB)
Memory pool usage: 6.901 GiB (6.906 GiB reserved)
  0.804223 seconds (178.77 k allocations: 20.643 MiB)
  0.480974 seconds (10 allocations: 608 bytes, 93.75% gc time)
Effective GPU memory usage: 22.98% (7.293 GiB/31.739 GiB)
Memory pool usage: 6.901 GiB (6.906 GiB re

In [44]:
CUDA.memory_status()

Effective GPU memory usage: 20.81% (1.664 GiB/7.996 GiB)
Memory pool usage: 481.346 MiB (512.000 MiB reserved)


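It works: memory usage stays under control, but each cleanup takes additional time (about 0.5 s per iteration in the run above, compared with about 0.8 s for the contraction itself).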
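A possible compromise is to clean up only when memory is actually under pressure, rather than at every iteration. The sketch below shows that policy in plain Julia so it can be tried without a GPU. The names `needs_cleanup` and `run_with_lazy_cleanup`, the callback arguments, and the 80% threshold are all hypothetical choices for illustration.

```julia
# Decide whether a full cleanup is worth its cost, based on memory pressure.
needs_cleanup(used_bytes, total_bytes; threshold = 0.8) =
    used_bytes / total_bytes > threshold

# Run `work()` n times. After each run, `query()` must return a
# (used_bytes, total_bytes) tuple; `cleanup()` is called only under pressure.
# Returns how many cleanups were performed.
function run_with_lazy_cleanup(work, query, cleanup, n; threshold = 0.8)
    ncleanups = 0
    for _ in 1:n
        work()
        used, total = query()
        if needs_cleanup(used, total; threshold)
            cleanup()
            ncleanups += 1
        end
    end
    return ncleanups
end

# Dry run with fake memory readings: 90% usage triggers a cleanup every time.
@show run_with_lazy_cleanup(() -> nothing, () -> (9, 10), () -> nothing, 3)
```

On the GPU, one would pass `work = () -> inner(A, apply(O, A))` and `cleanup = () -> (GC.gc(true); CUDA.reclaim())`. The `query` callback could be built from the `used` and `total` fields of `CUDA.NVML.memory_info(dev)`, as in the monitoring helper in the Multiple GPUs section.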

**Multiple GPUs** https://cuda.juliagpu.org/stable/usage/multigpu/

In [23]:
# Helper functions for monitoring and cleaning multiple GPUs.

function memory_info_all_gpus(print_info = true)

    percentages = []

    scale = 1/(1024^3) # convert bytes to GiB (labelled "GB" in the printed output)
    for (i, dev) in enumerate(CUDA.NVML.devices())

        name = CUDA.NVML.name(dev)
        mem_info = CUDA.NVML.memory_info(dev)
        total = round(mem_info.total*scale, sigdigits=4)
        used = round(mem_info.used*scale, sigdigits=4)
        percentage = round(used*100/total, sigdigits=4)

        print_info && println("$name #$i memory usage: $percentage % ($used GB/ $total GB)")

        push!(percentages, percentage)
    end

    return percentages
end

function clean_all_gpus()
    # Visit the devices in reverse order so that device 0 is the active one afterwards.
    for i in reverse(0:length(CUDA.devices()) - 1)
        CUDA.device!(i)
        GC.gc(true)    # full garbage collection, so unreachable GPU arrays are released
        CUDA.reclaim() # return the freed pool memory to the driver
    end
end

clean_all_gpus (generic function with 1 method)

In [74]:
clean_all_gpus()
memory_info_all_gpus()

Tesla V100-SXM2-32GB #1 memory usage: 9.566 % (3.061 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)


2-element Vector{Any}:
 9.566
 1.76

In [75]:
sites = siteinds("S=1/2",50)
bond_dimension = 3000 #The difference between times increases with this value.

#wGPU
A = NDTensors.cu(randomMPS(sites, bond_dimension); storagemode=CUDA.UnifiedMemory)
O = NDTensors.cu(randomMPO(sites); storagemode=CUDA.UnifiedMemory);

In [76]:
@time inner(A, apply(O, A))

 50.590041 seconds (246.03 k allocations: 25.780 MiB, 0.01% gc time)


1.1878975780263433e-16

In [78]:
memory_info_all_gpus()

Tesla V100-SXM2-32GB #1 memory usage: 62.22 % (19.91 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)


2-element Vector{Any}:
 62.22
  1.76

In [79]:
@time begin
    for i=1:5
        @time inner(A, apply(O, A))
        memory_info_all_gpus()
    end
end

 50.511294 seconds (246.04 k allocations: 25.781 MiB)
Tesla V100-SXM2-32GB #1 memory usage: 100.0 % (32.0 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 53.646136 seconds (246.12 k allocations: 25.790 MiB, 0.24% gc time)
Tesla V100-SXM2-32GB #1 memory usage: 100.0 % (32.0 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 51.979540 seconds (246.16 k allocations: 25.792 MiB)
Tesla V100-SXM2-32GB #1 memory usage: 100.0 % (32.0 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)


LoadError: InterruptException:

In [69]:
clean_all_gpus()
memory_info_all_gpus()

Tesla V100-SXM2-32GB #1 memory usage: 6.453 % (2.065 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)


2-element Vector{Any}:
 6.453
 1.76

In [70]:
@time begin
    for i=1:5
        @time inner(A, apply(O, A))
        CUDA.reclaim()
        memory_info_all_gpus()
    end
end

  7.884805 seconds (245.83 k allocations: 23.247 MiB)
Tesla V100-SXM2-32GB #1 memory usage: 99.41 % (31.81 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 10.885261 seconds (245.98 k allocations: 23.259 MiB)
Tesla V100-SXM2-32GB #1 memory usage: 99.41 % (31.81 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
  9.227167 seconds (245.93 k allocations: 23.258 MiB, 0.24% gc time)
Tesla V100-SXM2-32GB #1 memory usage: 99.41 % (31.81 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 11.051499 seconds (246.03 k allocations: 23.260 MiB)
Tesla V100-SXM2-32GB #1 memory usage: 99.41 % (31.81 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 10.496104 seconds (245.96 k allocations: 23.258 MiB, 0.17% gc time)
Tesla V100-SXM2-32GB #1 memory usage: 60.09 % (19.23 GB/ 32.0 GB)
Tesla V100-SXM2-32GB #2 memory usage: 1.76 % (0.5632 GB/ 32.0 GB)
 49.549965 seconds (1.23 M allocations: 