In [3]:
using Qaintessent
using Qaintensor
using Qaintensor: optimize_contraction_order!, network_graph, line_graph, tree_decomposition
using BenchmarkTools

┌ Info: Precompiling Qaintensor [c26a0288-a22f-4b8a-8630-7ac49e34be6b]
└ @ Base loading.jl:1278


# Optimizing contraction ordering for expectation-value-like TensorNetworks

In [Markov & Shi, Simulating quantum computation by contracting tensor networks](https://arxiv.org/abs/quant-ph/0511069), the authors study tensor networks that start with a product state and end with the projection onto another product state, as in the example below:

```
1 □—————————————————□———————————□
                    |
2 □———————————□—————□—————□—————□
              |           |
3 □—————□———————————————————————□
        |     |           |
4 □—————□—————□———————————□—————□
```

Networks of this kind arise when computing expectation values starting from a product state. We also see large time savings for MPS-type circuits, **as long as no leg remains uncontracted** (for example, when the output is projected onto another MPS). For good performance, the dimensions of all legs of the original tensors should be roughly equal, that is, within the same order of magnitude.

Markov and Shi prove that the contraction complexity for tensor networks with no open legs is <i>exponential in the treewidth of the underlying graph</i>, and give an algorithm for constructing a near-optimal contraction ordering given a near-optimal tree decomposition of the network graph.
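
To see why the contraction ordering matters at all, consider the smallest possible case: a chain of two matrices and a vector. The sketch below is plain Julia and independent of Qaintensor; the helper names `cost_left_first` and `cost_right_first` are made up for this illustration. It counts scalar multiplications for the two possible orders and checks that both give the same result.

```julia
# Toy illustration (not Qaintensor code): contract the chain A * B * v,
# with A and B of size n×n and v of length n, in two different orders.

# (A * B) * v: one matrix-matrix product (n^3) plus one matrix-vector product (n^2).
cost_left_first(n) = n^3 + n^2

# A * (B * v): two matrix-vector products (n^2 each).
cost_right_first(n) = 2 * n^2

n = 64
A, B, v = rand(n, n), rand(n, n), rand(n)

# Both orders give the same result up to floating-point rounding ...
@assert (A * B) * v ≈ A * (B * v)

# ... but the number of scalar multiplications differs by a factor of order n.
println("left first:  ", cost_left_first(n))
println("right first: ", cost_right_first(n))
```

In a circuit-sized network there are many more possible orders, and a poor one creates large intermediate tensors in the same way the matrix-matrix product does here.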

Substantial improvement is achieved in the following cases:

* <i>Circuits of logarithmic depth</i>: If the depth of the circuit is logarithmic in the number of gates $T$, the complexity is polynomial in $T$.
* <i>Circuits of locally interacting qubits</i>: The complexity scales as $\exp(\sqrt{r}\sqrt{T})$, with $r$ the maximum distance between interacting qubits and $T$ the total number of gates.
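
For reference, the general bound given in the abstract of the paper is that a circuit with $T$ gates whose underlying graph has treewidth $d$ can be simulated deterministically in time

$$
T^{O(1)} \exp[O(d)].
$$

The symbol $d$ follows the paper and is not used elsewhere in this notebook. The bound is polynomial in $T$ whenever $d = O(\log T)$, since then $\exp[O(d)] = T^{O(1)}$.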

Finding an optimal tree decomposition of a graph is NP-hard, but some heuristics perform reasonably well in finding a near-optimal one. We have implemented the _minimum fill-in_ heuristic to provide the tree decomposition. To optimize the contraction order, apply `optimize_contraction_order!` to a tensor network.
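
The idea behind the minimum fill-in heuristic can be sketched on a plain adjacency-set graph. This is a minimal illustration, not the implementation shipped in Qaintensor; the names `fill_in` and `min_fill_ordering` are hypothetical. The heuristic repeatedly eliminates the vertex whose neighbours need the fewest extra edges to become a clique. Each eliminated vertex together with its neighbours at that moment forms one bag of a tree decomposition, so the largest neighbourhood seen gives the width.

```julia
# Minimum fill-in heuristic on a Dict-of-Sets graph (illustrative sketch only).

# Number of edges missing between the neighbours of `v`.
function fill_in(adj::Dict{Int,Set{Int}}, v::Int)
    nbrs = collect(adj[v])
    missing_edges = 0
    for i in 1:length(nbrs), j in i+1:length(nbrs)
        if !(nbrs[j] in adj[nbrs[i]])
            missing_edges += 1
        end
    end
    return missing_edges
end

# Returns an elimination ordering and the width of the decomposition it induces.
function min_fill_ordering(adjacency::Dict{Int,Set{Int}})
    adj = Dict{Int,Set{Int}}(v => copy(nbrs) for (v, nbrs) in adjacency)  # work on a copy
    order = Int[]
    width = 0
    while !isempty(adj)
        # vertex with the smallest fill-in; ties are broken by the smallest label
        _, v = minimum(u -> (fill_in(adj, u), u), keys(adj))
        nbrs = collect(adj[v])
        width = max(width, length(nbrs))  # bag {v} ∪ nbrs has length(nbrs) + 1 vertices
        # connect the neighbours of v pairwise (these are the fill-in edges)
        for a in nbrs, b in nbrs
            a != b && push!(adj[a], b)
        end
        # remove v from the graph
        for a in nbrs
            delete!(adj[a], v)
        end
        delete!(adj, v)
        push!(order, v)
    end
    return order, width
end

# Example: the 4-cycle 1-2-3-4-1
cycle = Dict(1 => Set([2, 4]), 2 => Set([1, 3]), 3 => Set([2, 4]), 4 => Set([1, 3]))
order, width = min_fill_ordering(cycle)
println("elimination order: ", order, ", width: ", width)
```

The sketch recomputes every fill-in count in each round, which is simple but slow; a practical implementation would update only the counts of vertices near the eliminated one.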

Below, after some setup code, are benchmarks for typical circuits.

In [7]:
# TODO
# BEGIN DELETE: this code is copied from the `mps` branch; delete when merged

function ClosedMPS(T::AbstractVector{Tensor})
    l = length(T)
    @assert ndims(T[1]) == 2
    for i in 2:l-1
        @assert ndims(T[i]) == 3
    end
    @assert ndims(T[l]) == 2

    contractions = [Summation([1 => 2, 2 => 1]); [Summation([i => 3,i+1 => 1]) for i in 2:l-1]]
    openidx = reverse([1 => 1; [i => 2 for i in 2:l]])
    tn = TensorNetwork(T, contractions, openidx)
    return tn
end


function shift_summation(S::Summation, step)
   return Summation([S.idx[i].first + step => S.idx[i].second for i in 1:2])
end

# END DELETE
Base.ndims(T::Tensor) = ndims(T.data)

Base.copy(net::TensorNetwork) = TensorNetwork(copy(net.tensors), copy(net.contractions), copy(net.openidx))
crand(dims...) = rand(ComplexF64, dims...)

# generate expectation value tensor network
"""Build the closed tensor network for the expectation-value-like overlap of a random MPS run through circuit `cgc`; call `contract` on the result to obtain the number."""
function expectation_value(cgc::CircuitGateChain{N}; is_decompose = false) where N
    
    tensors = Tensor.([crand(2,2), [crand(2,2,2) for i in 2:N-1]..., crand(2,2)])
    T0 = ClosedMPS(tensors)
    
    T = copy(T0)
    tensor_circuit!(T, cgc, is_decompose = is_decompose)
    
    # measure
    T.contractions = [T.contractions; shift_summation.(T0.contractions, length(T.tensors))]
    for i in 1:N
        push!(T.tensors, T0.tensors[N+1-i])
        push!(T.contractions, Summation([T.openidx[end], (length(T.tensors) => T0.openidx[N+1-i].second)]))
        pop!(T.openidx)
    end
    T
end

expectation_value

# Benchmarking the contraction order that `optimize_contraction_order!` produces

For each tensor network we follow the same workflow:

1. Benchmark the contraction with the default contraction order
2. Optimize the contraction order and benchmark the contraction again
3. Benchmark the whole workflow (copy + optimize contraction order + contraction)

# QFT

In [8]:
N = 20

cgc = qft_circuit(N)
 
T0 = expectation_value(cgc);
contract(T0); # contract once to compile some of the functions 

In [9]:
T = copy(T0)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  6.90 GiB
  allocs estimate:  89151
  --------------
  minimum time:     3.004 s (22.22% GC)
  median time:      3.052 s (22.05% GC)
  mean time:        3.052 s (22.05% GC)
  maximum time:     3.100 s (21.89% GC)
  --------------
  samples:          2
  evals/sample:     1

In [10]:
optimize_contraction_order!(T)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  208.76 MiB
  allocs estimate:  36875
  --------------
  minimum time:     239.969 ms (7.64% GC)
  median time:      280.439 ms (10.50% GC)
  mean time:        320.508 ms (18.63% GC)
  maximum time:     669.458 ms (71.57% GC)
  --------------
  samples:          16
  evals/sample:     1

In [11]:
function copy_and_optimize(T0)
    T = copy(T0)
    optimize_contraction_order!(T)
    contract(T)
end

@benchmark copy_and_optimize($T0)

BenchmarkTools.Trial: 
  memory estimate:  511.85 MiB
  allocs estimate:  3279901
  --------------
  minimum time:     619.641 ms (11.58% GC)
  median time:      917.911 ms (35.76% GC)
  mean time:        884.713 ms (34.01% GC)
  maximum time:     1.013 s (42.21% GC)
  --------------
  samples:          6
  evals/sample:     1

# Testing with `is_decompose = true`

In [12]:
N = 20

cgc = qft_circuit(N)
 
T0 = expectation_value(cgc, is_decompose = true);

In [13]:
T = copy(T0)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  31.64 GiB
  allocs estimate:  167578
  --------------
  minimum time:     28.279 s (19.91% GC)
  median time:      28.279 s (19.91% GC)
  mean time:        28.279 s (19.91% GC)
  maximum time:     28.279 s (19.91% GC)
  --------------
  samples:          1
  evals/sample:     1

In [14]:
optimize_contraction_order!(T)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  142.09 MiB
  allocs estimate:  61949
  --------------
  minimum time:     148.687 ms (6.89% GC)
  median time:      174.553 ms (12.91% GC)
  mean time:        180.737 ms (13.40% GC)
  maximum time:     257.425 ms (24.24% GC)
  --------------
  samples:          28
  evals/sample:     1

In [15]:
@benchmark copy_and_optimize($T0)

BenchmarkTools.Trial: 
  memory estimate:  688.75 MiB
  allocs estimate:  6343072
  --------------
  minimum time:     698.267 ms (16.16% GC)
  median time:      795.603 ms (16.18% GC)
  mean time:        788.832 ms (16.44% GC)
  maximum time:     860.370 ms (14.25% GC)
  --------------
  samples:          7
  evals/sample:     1

# Testing adder without decompose

In [16]:
N = 7 # total qubits = 22
cgc = vbe_adder_circuit(N)
T0 = expectation_value(cgc);

In [17]:
T = copy(T0)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  6.25 GiB
  allocs estimate:  30905
  --------------
  minimum time:     5.272 s (27.56% GC)
  median time:      5.272 s (27.56% GC)
  mean time:        5.272 s (27.56% GC)
  maximum time:     5.272 s (27.56% GC)
  --------------
  samples:          1
  evals/sample:     1

In [18]:
optimize_contraction_order!(T)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  567.12 MiB
  allocs estimate:  14568
  --------------
  minimum time:     741.391 ms (6.44% GC)
  median time:      765.129 ms (7.46% GC)
  mean time:        843.136 ms (16.66% GC)
  maximum time:     1.125 s (36.06% GC)
  --------------
  samples:          6
  evals/sample:     1

In [19]:
@benchmark copy_and_optimize($T0)

BenchmarkTools.Trial: 
  memory estimate:  621.31 MiB
  allocs estimate:  581792
  --------------
  minimum time:     794.480 ms (10.09% GC)
  median time:      801.267 ms (9.65% GC)
  mean time:        868.801 ms (16.02% GC)
  maximum time:     1.051 s (32.74% GC)
  --------------
  samples:          6
  evals/sample:     1

# Testing adder with `is_decompose = true`

In [20]:
N = 7 # total qubits = 22
cgc = vbe_adder_circuit(N)
T0 = expectation_value(cgc, is_decompose = true);

In [21]:
T = copy(T0)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  46.82 GiB
  allocs estimate:  63580
  --------------
  minimum time:     30.052 s (21.97% GC)
  median time:      30.052 s (21.97% GC)
  mean time:        30.052 s (21.97% GC)
  maximum time:     30.052 s (21.97% GC)
  --------------
  samples:          1
  evals/sample:     1

In [22]:
optimize_contraction_order!(T)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  858.94 MiB
  allocs estimate:  21292
  --------------
  minimum time:     1.518 s (7.82% GC)
  median time:      1.555 s (10.80% GC)
  mean time:        1.709 s (17.34% GC)
  maximum time:     2.053 s (29.32% GC)
  --------------
  samples:          3
  evals/sample:     1

In [23]:
@benchmark copy_and_optimize($T0)

BenchmarkTools.Trial: 
  memory estimate:  955.28 MiB
  allocs estimate:  1107967
  --------------
  minimum time:     2.013 s (10.75% GC)
  median time:      2.094 s (18.03% GC)
  mean time:        2.195 s (17.21% GC)
  maximum time:     2.478 s (21.76% GC)
  --------------
  samples:          3
  evals/sample:     1

# Testing another adder

In [24]:
N = 7 # total qubits = 22
cgc = qcla_inplace_adder_circuit(N)
T0 = expectation_value(cgc);

In [25]:
T = copy(T0)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  17.50 GiB
  allocs estimate:  40758
  --------------
  minimum time:     13.106 s (24.46% GC)
  median time:      13.106 s (24.46% GC)
  mean time:        13.106 s (24.46% GC)
  maximum time:     13.106 s (24.46% GC)
  --------------
  samples:          1
  evals/sample:     1

In [26]:
optimize_contraction_order!(T)
@benchmark contract($T)

BenchmarkTools.Trial: 
  memory estimate:  552.77 MiB
  allocs estimate:  19702
  --------------
  minimum time:     1.097 s (3.35% GC)
  median time:      1.449 s (12.15% GC)
  mean time:        1.394 s (15.06% GC)
  maximum time:     1.580 s (28.54% GC)
  --------------
  samples:          4
  evals/sample:     1

In [27]:
@benchmark copy_and_optimize($T0)

BenchmarkTools.Trial: 
  memory estimate:  638.12 MiB
  allocs estimate:  838869
  --------------
  minimum time:     1.229 s (8.53% GC)
  median time:      1.336 s (16.52% GC)
  mean time:        1.343 s (17.80% GC)
  maximum time:     1.473 s (30.54% GC)
  --------------
  samples:          4
  evals/sample:     1