### TDA 2016, January 21st, Leuven


# TensorOperations.jl:
## Convenient tensor operations with Julia
### (and fun with metaprogramming)


### Jutho Haegeman
#### Department of Physics and Astronomy
#### UGent

### my motivation: quantum many body physics
* weirdness of quantum mechanics: Schrödinger's cat
![Schrödinger's cat](schrodinger.png)

### my motivation: quantum many body physics
* quantum bit (= qubit):
$$\vert\Psi\rangle = \alpha \vert 0\rangle + \beta \vert 1\rangle$$
with $\alpha,\beta\in\mathbb{C}$
* intrinsically non-deterministic:

    * $|\alpha|^2$: probability of measuring 0
    * $|\beta|^2$: probability of measuring 1

* for $N$ different qubits? (example with $N=5$)
$$\vert\Psi\rangle = \Psi_{00000} \vert 00000\rangle + \Psi_{00001} \vert 00001\rangle + \ldots + \Psi_{11111} \vert 11111\rangle$$
$\Rightarrow$ storing a quantum state of $N$ qubits requires $2^N$ complex numbers $\Psi_{i_1,i_2,\ldots,i_{N}}$

### my motivation: quantum many body physics
* quantum state is a high-order tensor / multidimensional array:
![State Psi](psi.png)
* Curse of dimensionality: exponential scaling in the number of degrees of freedom (qubits, spins, atoms, ...)
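
To make the exponential scaling concrete, here is a small sketch written for current Julia 1.x. The helper name `statebytes` is made up for this illustration. It counts the memory needed for the $2^N$ amplitudes and views a small state vector as an order-$N$ array:

```julia
# Bytes needed to store all 2^N complex amplitudes (16 bytes per ComplexF64).
# `statebytes` is a hypothetical helper, not part of any package.
statebytes(N) = big(2)^N * sizeof(ComplexF64)

for N in (10, 30, 50)
    println("N = ", N, ": ", statebytes(N), " bytes")
end

# For small N the state fits in memory and is naturally an order-N array
# with one index of dimension 2 per qubit.
N = 5
psi = randn(ComplexF64, 2^N)
psi ./= sqrt(sum(abs2, psi))            # normalise: probabilities sum to 1
Psi = reshape(psi, ntuple(_ -> 2, N))   # 2 x 2 x 2 x 2 x 2 array
```

Already at $N=50$ the amplitudes alone would need roughly 18 petabytes, which is why one cannot simply store the full array.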

![Tensor networks](tn.png)

### Tensors and tensor contractions
* graphical notation:
    * matrix - vector multiplication: ![matvec](matvec.png)
    * matrix - matrix multiplication: ![matmat](matmat.png)
* general tensor operations: permutations, partial traces, contractions
    * graphical: ![tensor operation](tensorcontraction.png)
    * index notation with Einstein summation convention:
    $D_{a,b,c} = A_{a,d,e,c}\cdot B_{f,e,b,d,f}+C_{c,b,a}$

### Tensor operations in Julia

In [6]:
n=3;
A=randn(n,n,n,n);
B=randn(n,n,n,n,n);
C=randn(n,n,n);

D2=zeros(n,n,n);
for a=1:n, b=1:n, c=1:n
    D2[a,b,c] += C[c,b,a]
    for d=1:n, e=1:n, f=1:n
        D2[a,b,c] += A[a,d,e,c]*B[f,e,b,d,f]
    end
end

using TensorOperations
using LinearAlgebra # provides norm (called vecnorm in Julia versions before 0.7)
@tensor D[a,b,c] := A[a,d,e,c]*B[f,e,b,d,f] + C[c,b,a];

norm(D-D2)

9.24156474531335e-15

In [7]:
function f1!(D,n,A,B,C)
    for a=1:n, b=1:n, c=1:n
        D[a,b,c] = C[c,b,a] # overwrite D, like the `=` in f2! below
        for d=1:n, e=1:n, f=1:n
            D[a,b,c] += A[a,d,e,c]*B[f,e,b,d,f]
        end
    end
    return D
end
function f2!(D,n,A,B,C)
    @tensor D[a,b,c] = A[a,d,e,c]*B[f,e,b,d,f] + C[c,b,a];
    return D
end

f2! (generic function with 1 method)

In [8]:
n=30;
A=randn(n,n,n,n);
B=randn(n,n,n,n,n);
C=randn(n,n,n);
D=zeros(n,n,n);

In [9]:
@time f1!(D,n,A,B,C);
@time f2!(D,n,A,B,C);

  5.910319 seconds (14.34 k allocations: 658.876 KB)
  0.019754 seconds (5.79 k allocations: 6.808 MB)


### What is going on underneath?
* Basic tensor operations (`op` can be the identity (doing nothing) or `conj`):
    * permutations and addition: `C = β*C + α*permutation(op(A))`
    * partial trace: `C = β*C + α*partialtrace(op(A))`
    * contraction: `C = β*C + α*contract(op(A),op(B))`
    
  (also accessible through ordinary functions such as `tensorcontract`, without the `@tensor` macro)
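
As a rough illustration of what these three building blocks compute, the sketch below spells them out with Base functions only (current Julia 1.x, `op` taken to be the identity). It does not use or reproduce the package's own kernels, and all variable names are local to the example:

```julia
# Only Base functions are used; this is not the TensorOperations.jl implementation.
α, β = 2.0, 0.5

# 1. permutation and addition: C[k,i,j] = β*C[k,i,j] + α*A[i,j,k]
A = randn(3, 4, 5)
C = randn(5, 3, 4)
Cold = copy(C)                       # keep the old values for comparison
C .= β .* C .+ α .* permutedims(A, (3, 1, 2))

# 2. partial trace over the first and third index: t[a] = Σ_b T[b,a,b]
T = randn(3, 4, 3)
t = [sum(T[b, a, b] for b in axes(T, 1)) for a in axes(T, 2)]

# 3. contraction: M[a,c] = Σ_b X[a,b]*Y[b,c] is an ordinary matrix product
X = randn(3, 4); Y = randn(4, 2)
M = X * Y
```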

### 1. Permutations

In [10]:
A=randn(10,10,10,10,10,10,10,10);
B=zeros(10,10,10,10,10,10,10,10);

In [12]:
@time permutedims!(B,A,[8,7,6,5,4,3,2,1]);
@time @tensor B[8,7,6,5,4,3,2,1] = A[1,2,3,4,5,6,7,8];

  1.906119 seconds (40 allocations: 1.406 KB)
  0.353497 seconds (32 allocations: 1.406 KB)


In [14]:
@time copy!(B,A);
@time permutedims!(B,A,[1,2,3,4,5,6,7,8]);
@time @tensor B[1,2,3,4,5,6,7,8] = A[1,2,3,4,5,6,7,8];

  0.101288 seconds (4 allocations: 160 bytes)
  0.127674 seconds (40 allocations: 1.406 KB)
  0.132622 seconds (32 allocations: 1.406 KB)


### 1. Permutations
* How to optimize permutations? Why are they slower than a plain copy?
* Even for matrix transposition?
  ```julia
  transpose!(dst,src)
  ```
  ![transpose](transpose.png)
  Memory is linear $\Rightarrow$ `transpose!` requires unfavorable (strided) memory access!
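
The access pattern can be made visible with a tiny example (a sketch for current Julia 1.x, using only Base functions): Julia arrays are stored column-major, so a naive transpose reads one array contiguously but writes the other with jumps.

```julia
m, n = 3, 4
A = reshape(collect(1:m*n), m, n)   # column-major: A[i,j] sits at position i + (j-1)*m
B = zeros(Int, n, m)

strides(A)   # (1, 3): stepping i moves 1 element, stepping j moves m elements
strides(B)   # (1, 4)

# Memory positions written in B while column j = 1 of A is read contiguously:
writes = [LinearIndices(B)[1, i] for i in 1:m]   # consecutive writes are n elements apart

for j in 1:n, i in 1:m    # i is the inner loop: contiguous reads from A ...
    B[j, i] = A[i, j]     # ... but strided writes into B
end
```

Swapping the loop order only moves the problem from the writes to the reads, which is why blocking is used instead.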

```julia
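# Cache-friendly transpose: small matrices use a plain double loop, larger ones
# are split recursively into blocks (`transposebaselength` is a global constant
# that sets the block size threshold).
function transpose!(B::StridedMatrix,A::StridedMatrix)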
    m, n = size(A)
    size(B,1) == n && size(B,2) == m || throw(DimensionMismatch("transpose"))

    if m*n<=4*transposebaselength
        @inbounds begin
            for j = 1:n
                for i = 1:m
                    B[j,i] = transpose(A[i,j])
                end
            end
        end
    else
        transposeblock!(B,A,m,n,0,0)
    end
    return B
end
# Transpose the m-by-n block of A that starts at offset (offseti,offsetj) into B.
function transposeblock!(B::StridedMatrix,A::StridedMatrix,m::Int,n::Int,offseti::Int,offsetj::Int)
    if m*n<=transposebaselength
        @inbounds begin
            for j = offsetj .+ (1:n)
                for i = offseti .+ (1:m)
                    B[j,i] = transpose(A[i,j])
                end
            end
        end
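    elseif m>n
        # more rows than columns: split the rows in half and recurse
        newm=m>>1
        transposeblock!(B,A,newm,n,offseti,offsetj)
        transposeblock!(B,A,m-newm,n,offseti+newm,offsetj)
    else
        # otherwise split the columns in half and recurse
        newn=n>>1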
        transposeblock!(B,A,m,newn,offseti,offsetj)
        transposeblock!(B,A,m,n-newn,offseti,offsetj+newn)
    end
    return B
end
```

### 1. Permutations
* How to generalize to multidimensional permutations?
    1. How to write nested loops depending on the dimensionality of the array?
    2. What is the best blocking (divide and conquer) strategy?

* Solution to 1: generated functions!

  parse -> expressions -> macro expansion -> new expression -> type inference -> generated functions -> compile -> run

  [TensorOperations.jl kernels](https://github.com/Jutho/TensorOperations.jl/tree/staged/src)

* Solution to 2: recursively split the dimension along which the minimum of the memory jumps (strides) of the two arrays is maximal.
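
The generated-function answer to question 1 can be sketched as follows for current Julia 1.x. This is a deliberately naive kernel without any blocking, not the code used in the package. The name `naivepermute!` and the choice to pass the permutation as a `Val` are assumptions made for the illustration. `@nloops` and `@nref` come from `Base.Cartesian`.

```julia
using Base.Cartesian   # provides @nloops and @nref

# Hypothetical kernel for illustration only:
#   B[i_P[1], ..., i_P[N]] = A[i_1, ..., i_N],
# the same convention as permutedims!(B, A, P). The permutation is passed as a
# Val so that it is known at the time the code is generated.
@generated function naivepermute!(B::AbstractArray{T,N}, A::AbstractArray{T,N},
                                  ::Val{P}) where {T,N,P}
    bidx = [Symbol(:i_, P[d]) for d in 1:N]   # index symbols of B, e.g. i_3, i_1, i_2
    quote
        @nloops $N i A begin                  # N nested loops over i_1, ..., i_N
            B[$(bidx...)] = @nref $N A i      # right-hand side is A[i_1, ..., i_N]
        end
        return B
    end
end

A = randn(2, 3, 4)
B = zeros(4, 2, 3)
naivepermute!(B, A, Val((3, 1, 2)))
B == permutedims(A, (3, 1, 2))
```

The body of the function runs once per combination of argument types and returns an expression, so the number of nested loops is fixed by the array dimensionality `N` before the code is compiled.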

### 2. Partial trace
* very similar to permutations, but requires somewhat more care

### 3. Tensor contraction: very similar to matrix multiplication

* Fastest algorithm: permute input arrays and reshape them such that you can use BLAS matrix multiplication
  ![simple contraction](simplecontraction.png)

```julia
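# contraction C[a,b,c] = A[a,d,e,c]*B[e,b,d]
dA1,dA2,dA3,dA4 = size(A)
dB1,dB2,dB3 = size(B)
Amat=reshape(permutedims(A,[1,4,2,3]),(dA1*dA4,dA2*dA3)) # rows (a,c), columns (d,e)
Bmat=reshape(permutedims(B,[3,1,2]),(dB3*dB1,dB2))       # rows (d,e), columns b
Cmat=Amat*Bmat                                           # BLAS matrix multiplication
C=permutedims(reshape(Cmat,(dA1,dA4,dB2)),[1,3,2])       # back to index order (a,b,c)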
```

```julia
using TensorOperations
C = tensorcontract(A,[1,2,3,4],B,[3,5,2],[1,5,4])
@tensor C[a,b,c] = A[a,d,e,c]*B[e,b,d]
```
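
The permute, reshape and multiply recipe can be checked against explicit loops. The following sketch (current Julia 1.x, Base only, with small arbitrary dimensions chosen for the example) does this for the contraction $C_{a,b,c} = A_{a,d,e,c} B_{e,b,d}$:

```julia
dA = (2, 3, 4, 5)                  # sizes of the indices a, d, e, c of A
A = randn(dA...)
B = randn(dA[3], 6, dA[2])         # B[e,b,d]

# reference result with explicit loops
Cref = zeros(dA[1], size(B, 2), dA[4])
for a in axes(A, 1), b in axes(B, 2), c in axes(A, 4), d in axes(A, 2), e in axes(A, 3)
    Cref[a, b, c] += A[a, d, e, c] * B[e, b, d]
end

# permute + reshape + matrix multiplication
Amat = reshape(permutedims(A, (1, 4, 2, 3)), dA[1] * dA[4], dA[2] * dA[3])
Bmat = reshape(permutedims(B, (3, 1, 2)), dA[2] * dA[3], size(B, 2))
C = permutedims(reshape(Amat * Bmat, dA[1], dA[4], size(B, 2)), (1, 3, 2))

maximum(abs, C - Cref)   # should be at the level of rounding errors
```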

### Future directions:
#### Contraction order matters!

* matrix - matrix - vector multiplication: `A*B*v`: `A*(B*v)` is much faster than `(A*B)*v` ($O(n^2)$ instead of $O(n^3)$ operations for $n\times n$ matrices)
* ![mera](mera.png)
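
A quick way to see the difference is to count scalar multiplications for the two orders. The sketch below (current Julia 1.x, the helper names `cost_left` and `cost_right` are made up for the example) also confirms that both orders give the same result:

```julia
n = 200
A = randn(n, n); B = randn(n, n); v = randn(n)

left  = (A * B) * v    # matrix-matrix product first: n^3 + n^2 multiplications
right = A * (B * v)    # two matrix-vector products:  2n^2 multiplications

left ≈ right           # same result up to rounding

# multiplication counts for the two orders (hypothetical helpers)
cost_left(n)  = n^3 + n^2
cost_right(n) = 2n^2
cost_left(n) / cost_right(n)   # the ratio grows linearly with n
```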

#### What is optimal contraction order?

* ![2dmerac](2dmerac.png)
* "Faster identification of optimal contraction sequences for tensor networks", Phys. Rev. E 90, 033315 (2014)
    
#### $\Rightarrow$ implement a new macro that takes `A[...]*B[...]*C[...]*D[...]` and transforms it into e.g. `A[...]*((B[...]*C[...])*D[...])` at compile time
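
To give a flavour of such a compile-time rewrite, here is a toy macro for ordinary matrix products in current Julia 1.x. It is only a sketch: the name `@rightassoc` is hypothetical, it always nests from the right, and it does not search for the cheapest order.

```julia
# Hypothetical macro for illustration: rewrites A*B*C*v into A*(B*(C*v)).
# It always nests from the right and does not search for the cheapest order.
macro rightassoc(ex)
    (ex isa Expr && ex.head == :call && ex.args[1] == :*) ||
        error("@rightassoc expects a product such as A*B*C*v")
    factors = ex.args[2:end]                         # Julia parses A*B*C*v as one n-ary call
    nested = foldr((x, y) -> :($x * $y), factors)    # build A*(B*(C*v))
    return esc(nested)
end

n = 50
A = randn(n, n); B = randn(n, n); C = randn(n, n); v = randn(n)
y = @rightassoc A * B * C * v     # evaluated with matrix-vector products only
y ≈ (A * B) * C * v
```

The rewrite happens when the macro is expanded, so no cost is paid at run time for choosing the order.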

#### More flexible index notation; mixed combinations of manual loops, creating slices and applying tensor operations
#### Multi-threading? GPU?