### TDA 2016, January 21st, Leuven

# TensorOperations.jl:
## Convenient tensor operations with Julia
### (and fun with metaprogramming)

### Jutho Haegeman
#### Department of Physics and Astronomy, UGent

## Overview


* **Motivation: Tensor Network Decompositions in Quantum Physics**
* **Intro to the Julia Language**
* **TensorOperations.jl**
* **Implementation of basic tensor operations with metaprogramming**
* **Optimization of tensor contraction order**
* **Outlook**

## Motivation: quantum many-body physics
* weirdness of quantum mechanics: Schrödinger's cat
<img src="schrodinger.png" style="width: 300px;"/>

## Motivation: quantum many-body physics
* quantum bit ( = qubit):

$$\vert\Psi\rangle = \alpha \vert 0\rangle + \beta \vert 1\rangle\quad\text{with}\quad\alpha,\beta\in\mathbb{C}$$

* intrinsically non-deterministic:

    * $|\alpha|^2$: probability of measuring 0
    * $|\beta|^2$: probability of measuring 1

* for $N$ qubits, e.g. $N = 5$:

$$\vert\Psi\rangle = \Psi_{00000}\vert 00000 \rangle + \Psi_{00001} \vert 00001\rangle + \ldots+ \Psi_{11111} \vert 11111\rangle$$

**$\Rightarrow$ storing a quantum state of $N$ qubits requires $2^N$ complex numbers: $\Psi_{i_1,i_2,\ldots,i_{N}}$**
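
To make the exponential growth concrete, here is a quick back-of-the-envelope sketch (written for Julia 1.x; the helper name `state_bytes` is made up for illustration). It assumes each amplitude is stored as one `ComplexF64`, i.e. 16 bytes.

```julia
# Memory needed to store all 2^N amplitudes of an N-qubit state,
# assuming one ComplexF64 (16 bytes) per amplitude.
# big(2) avoids integer overflow for large N.
state_bytes(N) = big(2)^N * sizeof(ComplexF64)

for N in (10, 30, 50)
    println("N = ", N, ": ", state_bytes(N), " bytes")
end
```

Under this assumption, 30 qubits already need 16 GiB and 50 qubits need 16 PiB.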

## Motivation: quantum many-body physics
* quantum state is a high-order tensor / multidimensional array:
  <img src="psi.png" style="width: 500px;"/>

* Curse of dimensionality: exponential scaling in $N$, the number of degrees of freedom (qubits, spins, atoms, ...)
  
* Realistic materials: $N$ is on the order of Avogadro's number, i.e. $O(10^{23})$
  <img src="graphene.jpg" style="width: 300px;"/>

## Motivation: tensor network decompositions
* graphical notation:
    * matrix - vector multiplication: <img src="matvec.png" style="width: 400px;"/>
    * matrix - matrix multiplication: <img src="matmat.png" style="width: 400px;"/>
* tensor network decompositions for efficient description of quantum states
  <img src="tn2.png" style="width: 600px;"/>
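
In the graphical notation, a line shared between two objects is an index that is summed over. As a minimal sketch (Julia 1.x; the function names `matvec` and `matmat` are chosen here for illustration), the two diagrams above correspond to the following explicit loops:

```julia
# Matrix-vector product y[i] = sum_j M[i,j] * x[j]: one connected line (index j).
function matvec(M, x)
    y = zeros(size(M, 1))
    for i in axes(M, 1), j in axes(M, 2)
        y[i] += M[i, j] * x[j]
    end
    return y
end

# Matrix-matrix product P[i,k] = sum_j M[i,j] * K[j,k]: again one connected line.
function matmat(M, K)
    P = zeros(size(M, 1), size(K, 2))
    for i in axes(M, 1), k in axes(K, 2), j in axes(M, 2)
        P[i, k] += M[i, j] * K[j, k]
    end
    return P
end

M = [1.0 2.0; 3.0 4.0]
K = [0.0 1.0; 1.0 0.0]
x = [1.0, -1.0]
println(matvec(M, x) == M * x)   # same result as the built-in product
println(matmat(M, K) == M * K)
```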

## Introduction to the Julia Language

  <img src="julia.png" style="width: 200px;"/>


* Selling point: dynamic high-level language with the speed of a statically-compiled language

* Key features:
    * Just-in-time compiled (using LLVM infrastructure)
    * Dynamic type system
    * Multiple dispatch:
        * define function behavior across many combinations of argument types
        * automatic generation of efficient, specialized code for different argument types
    * Good support for computational science, numerics, multidimensional arrays, ...
    * Powerful metaprogramming facilities
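
A small sketch of multiple dispatch (Julia 1.x; the function name `kind` is hypothetical): one generic function gets several methods, and Julia picks the most specific one from the types of all arguments.

```julia
# One generic function, several methods selected by argument types.
kind(x::Integer)       = "integer"
kind(x::AbstractFloat) = "floating point"
kind(x::Number)        = "some other number"       # fallback for any other Number
kind(x, y)             = kind(x) * " and " * kind(y)  # dispatch also covers argument count

println(kind(1))
println(kind(1.0))
println(kind(1 + 2im))
println(kind(1, 2.5))
```

For each combination of concrete argument types that is actually called, the compiler generates a separate specialized version.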

#### Code generation

In [8]:
function myabs(x)
    if x < 0
        return -x
    end
    return x
end
println("LLVM code for 64-bit integer")
code_llvm(myabs,Tuple{Int64})
println("LLVM code for 64-bit unsigned integer")
code_llvm(myabs,Tuple{UInt64})
println("LLVM code for 64-bit floating point")
code_llvm(myabs,Tuple{Float64})

LLVM code for 64-bit integer

define i64 @julia_myabs_21745(i64) {
top:
  %1 = icmp sgt i64 %0, -1
  br i1 %1, label %L, label %if

if:                                               ; preds = %top
  %2 = sub i64 0, %0
  ret i64 %2

L:                                                ; preds = %top
  ret i64 %0
}
LLVM code for 64-bit unsigned integer

define i64 @julia_myabs_21746(i64) {
L:
  ret i64 %0
}
LLVM code for 64-bit floating point

define double @julia_myabs_21747(double) {
top:
  %1 = fcmp uge double %0, 0.000000e+00
  br i1 %1, label %L, label %if

if:                                               ; preds = %top
  %2 = fmul double %0, -1.000000e+00
  ret double %2

L:                                                ; preds = %top
  ret double %0
}


#### Type inference

In [9]:
mysqrt(x) = x < 0 ? sqrt(complex(x)) : sqrt(x)
function summyabs(v::Vector)
    s = myabs(v[1])
    for i = 2:length(v)
        s += myabs(v[i])
    end
    return s
end
function summysqrt(v::Vector)
    s = mysqrt(v[1])
    for i = 2:length(v)
        s += mysqrt(v[i])
    end
    return s
end

summysqrt (generic function with 1 method)

In [10]:
code_typed(summyabs,Tuple{Vector{Int64}})

1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:v], Any[Any[Any[:v,Array{Int64,1},0],Any[:s,Int64,2],Any[symbol("#s41"),Int64,2],Any[:i,Int64,18],Any[:_var0,Int64,2],Any[:_var1,Int64,2]],Any[],Any[UnitRange{Int64},Tuple{Int64,Int64},Int64,Int64,Int64,Int64,Int64],Any[]], :(begin  # In[9], line 3:
        GenSym(2) = (Base.arrayref)(v::Array{Int64,1},1)::Int64
        unless (Base.slt_int)(GenSym(2),0)::Bool goto 6
        _var0 = (Base.box)(Int64,(Base.neg_int)(GenSym(2)))
        goto 7
        6: 
        _var0 = GenSym(2)
        7: 
        s = _var0::Int64 # In[9], line 4:
        GenSym(3) = (Base.arraylen)(v::Array{Int64,1})::Int64
        GenSym(0) = $(Expr(:new, UnitRange{Int64}, 2, :(((top(getfield))(Base.Intrinsics,:select_value)::I)((Base.sle_int)(2,GenSym(3))::Bool,GenSym(3),(Base.box)(Int64,(Base.sub_int)(2,1)))::Int64)))
        #s41 = (top(getfield))(GenSym(0),:start)::Int64
        unless (Base.box)(Base.Bool,(Base.not_int)(#s41::Int64 === (Base.box)(Base.Int,(Base.add

In [11]:
code_typed(summysqrt,Tuple{Vector{Int64}})

1-element Array{Any,1}:
 :($(Expr(:lambda, Any[:v], Any[Any[Any[:v,Array{Int64,1},0],Any[:s,Union{Complex{Float64},Float64},2],Any[symbol("#s41"),Int64,2],Any[:i,Int64,18],Any[:_var0,Union{Complex{Float64},Float64},2],Any[:_var1,Union{Complex{Float64},Float64},2]],Any[],Any[UnitRange{Int64},Tuple{Int64,Int64},Int64,Complex{Int64},Int64,Int64,Complex{Int64},Int64,Int64],Any[]], :(begin  # In[9], line 10:
        GenSym(2) = (Base.arrayref)(v::Array{Int64,1},1)::Int64
        unless (Base.slt_int)(GenSym(2),0)::Bool goto 6
        GenSym(3) = $(Expr(:new, Complex{Int64}, GenSym(2), 0))
        _var0 = (Base.sqrt)($(Expr(:new, Complex{Float64}, :((Base.box)(Float64,(Base.sitofp)(Float64,(top(getfield))(GenSym(3),:re)::Int64))), :((Base.box)(Float64,(Base.sitofp)(Float64,(top(getfield))(GenSym(3),:im)::Int64))))))::Complex{Float64}
        goto 7
        6: 
        _var0 = (Base.Math.box)(Base.Math.Float64,(Base.Math.sqrt_llvm)((Base.box)(Float64,(Base.sitofp)(Float64,GenSym(2)))))::Float

### Tensor operations in Julia
* general tensor operations: permutations, partial traces, contractions
    * graphical: ![tensor operation](tensorcontraction.png)
    * index notation with Einstein summation convention:
    $D_{a,b,c} = A_{a,d,e,c}\cdot B_{f,e,b,d,f}+C_{c,b,a}$

In [None]:
n=3;
A=randn(n,n,n,n);
B=randn(n,n,n,n,n);
C=randn(n,n,n);

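# Reference implementation with explicit loops
D2=zeros(n,n,n);
for a=1:n, b=1:n, c=1:n
    D2[a,b,c] += C[c,b,a]
    for d=1:n, e=1:n, f=1:n   # sum over the repeated indices d, e, f
        D2[a,b,c] += A[a,d,e,c]*B[f,e,b,d,f]
    end
end

# Same operation in index notation; `:=` allocates a new array D
using TensorOperations
@tensor D[a,b,c] := A[a,d,e,c]*B[f,e,b,d,f] + C[c,b,a];

# The difference should vanish up to rounding errors
vecnorm(D-D2)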

In [None]:
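# Explicit loops; D is overwritten (not accumulated into), matching f2! below
function f1!(D,n,A,B,C)
    for a=1:n, b=1:n, c=1:n
        D[a,b,c] = C[c,b,a]
        for d=1:n, e=1:n, f=1:n
            D[a,b,c] += A[a,d,e,c]*B[f,e,b,d,f]
        end
    end
    return D
end
# Index notation; `=` writes into the existing array D
function f2!(D,n,A,B,C)
    @tensor D[a,b,c] = A[a,d,e,c]*B[f,e,b,d,f] + C[c,b,a];
    return D
end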

In [None]:
n=30;
A=randn(n,n,n,n);
B=randn(n,n,n,n,n);
C=randn(n,n,n);
D=zeros(n,n,n);

In [None]:
# Note: the first call of each function also includes JIT compilation time
@time f1!(D,n,A,B,C);
@time f2!(D,n,A,B,C);

### What is going on underneath?
* Basic tensor operations (`op` can be the identity (doing nothing) or `conj`):
    * permutations and addition: `C = β*C + α*permutation(op(A))`
    * partial trace: `C = β*C + α*partialtrace(op(A))`
    * contraction: `C = β*C + α*contract(op(A),op(B))`
    
  (also available through a function-based interface)
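
To illustrate the first pattern, `C = β*C + α*permutation(op(A))`, here is a naive sketch in plain Julia 1.x. The helper `add_permuted!` is hypothetical and is not the TensorOperations.jl API; it allocates a temporary permuted copy, which an optimized kernel would presumably avoid.

```julia
# Hypothetical helper (not the TensorOperations.jl API):
# C = β*C + α*permute(op(A)), with op either `identity` or `conj`.
function add_permuted!(α, A, perm, β, C; op = identity)
    C .= β .* C .+ α .* permutedims(op.(A), perm)   # allocates a temporary
    return C
end

A = [1.0+1.0im 2.0; 3.0 4.0]
C = ones(ComplexF64, 2, 2)
add_permuted!(2, A, (2, 1), 1, C; op = conj)   # C = C + 2 * transpose(conj(A))
println(C)
```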

### 1. Permutations

In [None]:
A=randn(10,10,10,10,10,10,10,10);
B=zeros(10,10,10,10,10,10,10,10);

In [None]:
@time permutedims!(B,A,[8,7,6,5,4,3,2,1]);
@time @tensor B[8,7,6,5,4,3,2,1] = A[1,2,3,4,5,6,7,8];

In [None]:
@time copy!(B,A);
@time permutedims!(B,A,[1,2,3,4,5,6,7,8]);
@time @tensor B[1,2,3,4,5,6,7,8] = A[1,2,3,4,5,6,7,8];

### 1. Permutations
* How to optimize permutations? Why are they slower than a plain copy?
* Even for matrix transposition?
  ```julia
  transpose!(dst,src)
  ```
  ![transpose](transpose.png)
  Memory is linear $\Rightarrow$ `transpose` requires an unfavorable memory access pattern!
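
The linear memory layout can be inspected directly. The following sketch (Julia 1.x; the helper `offset` is introduced here for illustration only) shows that Julia arrays are stored column by column, so walking along a row jumps through memory:

```julia
A = reshape(collect(1:12), 3, 4)   # 3×4 matrix whose entries equal their memory position

# Step (in elements) between neighbours along each dimension:
# 1 along a column, 3 (= number of rows) along a row.
println(strides(A))

# Linear position of A[i,j] in memory, computed from the strides.
offset(A, i, j) = 1 + (i - 1) * stride(A, 1) + (j - 1) * stride(A, 2)
println(offset(A, 2, 3))
```

In `B[j,i] = A[i,j]`, whichever index runs in the inner loop, one of the two arrays is traversed with the large stride.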

```julia
function transpose!(B::StridedMatrix,A::StridedMatrix)
    m, n = size(A)
    size(B,1) == n && size(B,2) == m || throw(DimensionMismatch("transpose"))

    if m*n<=4*transposebaselength
        @inbounds begin
            for j = 1:n
                for i = 1:m
                    B[j,i] = transpose(A[i,j])
                end
            end
        end
    else
        transposeblock!(B,A,m,n,0,0)
    end
    return B
end
function transposeblock!(B::StridedMatrix,A::StridedMatrix,m::Int,n::Int,offseti::Int,offsetj::Int)
    if m*n<=transposebaselength
        @inbounds begin
            for j = offsetj+(1:n)
                for i = offseti+(1:m)
                    B[j,i] = transpose(A[i,j])
                end
            end
        end
    elseif m>n
        newm=m>>1
        transposeblock!(B,A,newm,n,offseti,offsetj)
        transposeblock!(B,A,m-newm,n,offseti+newm,offsetj)
    else
        newn=n>>1
        transposeblock!(B,A,m,newn,offseti,offsetj)
        transposeblock!(B,A,m,n-newn,offseti,offsetj+newn)
    end
    return B
end
```

### 1. Permutations
* How to generalize to multidimensional permutations?
    1. How to write nested loops depending on the dimensionality of the array?
    2. What is the best blocking (divide and conquer) strategy?

1. Solution to 1: generated functions!

parse -> expressions -> macro expansion -> new expression -> type inference -> generated functions -> compile -> run

[TensorOperations.jl kernels](https://github.com/Jutho/TensorOperations.jl/tree/staged/src)

2. Solution to 2: divide along the dimension for which the minimum of the memory strides (jumps) of the two arrays is maximal.
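
As a minimal sketch of how a generated function can write nested loops whose depth depends on the dimensionality (Julia 1.x syntax; `reverse_dims!` is a hypothetical example and not one of the TensorOperations.jl kernels, and it does no blocking):

```julia
# Hypothetical example: copy A into B with the order of all dimensions reversed,
# i.e. B[i_N, ..., i_1] = A[i_1, ..., i_N], for any number of dimensions N.
@generated function reverse_dims!(B::Array{T,N}, A::Array{T,N}) where {T,N}
    # This body runs once per concrete (T, N) and only sees the argument types.
    idx = [Symbol(:i_, d) for d in 1:N]               # loop variables i_1, ..., i_N
    body = :(B[$(reverse(idx)...)] = A[$(idx...)])    # innermost statement
    for d in 1:N                                      # wrap it in N nested loops;
        body = quote                                  # i_1 ends up as the innermost loop
            for $(idx[d]) = 1:size(A, $d)
                $body
            end
        end
    end
    # The returned expression becomes the method body for this (T, N).
    return quote
        size(B) == reverse(size(A)) || throw(DimensionMismatch("sizes do not match"))
        $body
        return B
    end
end

A = reshape(collect(1.0:24.0), 2, 3, 4)
B = zeros(4, 3, 2)
reverse_dims!(B, A)
println(B == permutedims(A, (3, 2, 1)))
```

The same definition works for matrices, rank-8 tensors, and so on; a separate loop nest is generated and compiled for each `N` that is used.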

### 2. Partial trace
* very similar, but requires somewhat more care

### 3. Tensor contraction: very similar to matrix multiplication

* Fastest algorithm: permute input arrays and reshape them such that you can use BLAS matrix multiplication
  ![simple contraction](simplecontraction.png)

```julia
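# A has indices (a,d,e,c), B has indices (e,b,d); d and e are contracted
dA1,dA2,dA3,dA4=size(A)
dB1,dB2,dB3=size(B)
Amat=reshape(permutedims(A,[1,4,2,3]),(dA1*dA4,dA2*dA3))  # rows (a,c), columns (d,e)
Bmat=reshape(permutedims(B,[3,1,2]),(dB3*dB1,dB2))        # rows (d,e), columns b
Cmat=Amat*Bmat                                            # BLAS matrix multiplication
C=permutedims(reshape(Cmat,(dA1,dA4,dB2)),[1,3,2])        # back to index order (a,b,c)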
```
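
A self-contained way to check the permute-reshape-multiply recipe against explicit loops (Julia 1.x; the function names `contract_gemm` and `contract_loops` and the array sizes are chosen here for illustration):

```julia
# Contract C[a,b,c] = sum over d,e of A[a,d,e,c] * B[e,b,d] in two ways.
function contract_gemm(A, B)
    dA1, dA2, dA3, dA4 = size(A)
    dB1, dB2, dB3 = size(B)
    (dB1, dB3) == (dA3, dA2) || throw(DimensionMismatch("contracted dimensions differ"))
    Amat = reshape(permutedims(A, (1, 4, 2, 3)), dA1 * dA4, dA2 * dA3)  # rows (a,c), columns (d,e)
    Bmat = reshape(permutedims(B, (3, 1, 2)), dB3 * dB1, dB2)           # rows (d,e), columns b
    Cmat = Amat * Bmat                                                  # matrix multiplication
    return permutedims(reshape(Cmat, dA1, dA4, dB2), (1, 3, 2))         # back to (a,b,c)
end

# Reference implementation with explicit loops.
function contract_loops(A, B)
    dA1, dA2, dA3, dA4 = size(A)
    dB2 = size(B, 2)
    C = zeros(dA1, dB2, dA4)
    for a in 1:dA1, b in 1:dB2, c in 1:dA4, d in 1:dA2, e in 1:dA3
        C[a, b, c] += A[a, d, e, c] * B[e, b, d]
    end
    return C
end

A = randn(2, 3, 4, 5)   # indices (a,d,e,c)
B = randn(4, 6, 3)      # indices (e,b,d)
println(contract_gemm(A, B) ≈ contract_loops(A, B))
```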

```julia
using TensorOperations
C = tensorcontract(A,[1,2,3,4],B,[3,5,2],[1,5,4])
@tensor C[a,b,c] = A[a,d,e,c]*B[e,b,d]
```

### Future directions:
#### Contraction order matters!

* matrix - matrix - vector multiplication: `A*B*v`: `A*(B*v)` is much faster than `(A*B)*v` ($O(n^2)$ versus $O(n^3)$ operations for $n\times n$ matrices)
* ![mera](mera.png)
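
A rough sketch of the matrix - matrix - vector case (Julia 1.x; the multiplication counts are the naive ones for dense $n\times n$ matrices, and the helper names are made up for illustration):

```julia
n = 200
A = randn(n, n); B = randn(n, n); v = randn(n)

right = A * (B * v)    # two matrix-vector products
left  = (A * B) * v    # one matrix-matrix product, then one matrix-vector product

# Both orders give the same vector up to rounding errors ...
println(isapprox(left, right; rtol = 1e-8))

# ... but the naive number of scalar multiplications differs by a factor (n + 1)/2.
mults_right(n) = 2n^2
mults_left(n)  = n^3 + n^2
println(mults_left(n) / mults_right(n))
```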

#### What is optimal contraction order?

* ![2dmerac](2dmerac.png)
* "Faster identification of optimal contraction sequences for tensor networks" (Phys. Rev. E 90, 033315 (2014))
    
####  $\Rightarrow$ implement a new macro that takes `A[...]*B[...]*C[...]*D[...]` and transforms it into e.g. `A[...]*((B[...]*C[...])*D[...])` at compile time
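
A toy illustration of such a compile-time rewrite (Julia 1.x; the macro `@rightassoc` is hypothetical). It always nests the product from the right and does not search for the optimal order, so it only shows the mechanism of transforming an expression before it is compiled:

```julia
# Hypothetical macro: rewrite a flat product a*b*c*...*z into the
# right-nested form a*(b*(c*(...*z))) before the code is compiled.
macro rightassoc(ex)
    (ex isa Expr && ex.head == :call && ex.args[1] == :*) ||
        error("@rightassoc expects a product such as A*B*v")
    factors = ex.args[2:end]                        # Julia parses A*B*v as one n-ary call
    nested = foldr((x, y) -> :($x * $y), factors)   # x * (rest of the product)
    return esc(nested)
end

A = [1 2; 3 4]; B = [0 1; 1 0]; v = [1, 2]
println(@rightassoc A * B * v)   # evaluated as A * (B * v)
```

A real implementation would additionally need the index labels and dimensions to estimate the cost of each candidate order.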

#### More flexible index notation; mixed combinations of manual loops, creating slices and applying tensor operations
#### Multi-threading? GPU?