# Quantum wavefunctions as restricted Boltzmann machines (RBM)
(This neural-network ansatz was introduced in the paper [Solving the quantum many-body problem with artificial neural networks](http://doi.org/10.1126/science.aag2302).)

The generic $N$-spin wavefunction in the Ising basis is 
$$
|\psi\rangle = \sum_{\{\sigma\}}\Psi(\sigma_1, \dots, \sigma_N) |\sigma_1, \dots, \sigma_N \rangle
$$
The RBM ansatz introduces a set of $M$ hidden binary variables $h_j$, and the wavefunction amplitudes are given in terms of a set of complex-valued weights $w_{ij}$ between the hidden and visible spins and corresponding biases (local fields) $a_i$ and $b_j$ for the visible and hidden spins, respectively.
$$
    \Psi(\{\sigma\}, \{h\}) = \frac{1}{Z} \exp{\left(\sum_i a_i \sigma_i + \sum_{ij} w_{ij}\sigma_ih_j + \sum_j b_j h_j\right)}
$$
This means we are characterizing a map from the set of $2^N$ spin configurations to the complex numbers by $N\times M+N+M$ complex parameters.

For a given set of weights $w'$ and biases $b'$, integrating out the hidden-layer variables (that is, summing over $h_j = \pm 1$) we get
$$
    \Psi(\{\sigma\}) = \frac{1}{Z} \exp{\left(\sum_i a_i \sigma_i\right)} \times  \prod_j 2\cosh(\theta_j) \quad \text{where} \quad
    \theta_j = \sum_{i} w'_{ij}\sigma_i + b'_j
$$

Note that $Z$ is introduced only for normalization (we will ignore it throughout because we explicitly normalize the wavefunctions and expectation values).
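As a sanity check of the closed form above, the sum over the hidden layer can be carried out by brute force for a small system. The following is a minimal sketch that assumes $h_j \in \{-1, +1\}$ and drops the normalization $1/Z$; the helper names `psi_bruteforce` and `psi_closed` and the `_s`-suffixed variables are illustrative and not part of the notebook.

```julia
# Unnormalized RBM amplitude with the hidden layer summed explicitly over h_j = ±1
function psi_bruteforce(s, a, b, w)
    M = length(b)
    total = zero(ComplexF64)
    for bits in 0:(2^M - 1)
        # decode the integer `bits` into a hidden configuration of ±1
        h = [((bits >> (j - 1)) & 1) == 1 ? 1 : -1 for j in 1:M]
        total += exp(sum(a .* s) + sum(b .* h) + transpose(s) * w * h)
    end
    return total
end

# Closed form after the hidden sum: exp(Σ_i a_i σ_i) Π_j 2cosh(θ_j)
function psi_closed(s, a, b, w)
    θ = transpose(w) * s + b
    return exp(sum(a .* s)) * prod(2 .* cosh.(θ))
end

# small random instance (names chosen so the notebook variables are not overwritten)
N_s, M_s = 4, 3
a_s = randn(ComplexF64, N_s)
b_s = randn(ComplexF64, M_s)
w_s = 0.1 * randn(ComplexF64, N_s, M_s)
s_s = rand([1, -1], N_s)

@show psi_bruteforce(s_s, a_s, b_s, w_s) ≈ psi_closed(s_s, a_s, b_s, w_s)
```

The brute-force sum costs $2^M$ terms, so it is only useful as a check; the closed form costs $O(NM)$.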

In [None]:
using LinearAlgebra

In [2]:
### Parameters
# `N` is the number of spins, `M` the number of hidden spins and `α=M/N` the ratio of hidden layer to visible layer
N = 16
α = 2
M = floor(Int, α * N)   # keep M an integer even if α is not
n_pars = N*M + N + M

# The weights of connections and biases for visible and hidden layer
scale = 0.1
w = scale * randn(ComplexF64, N, M)
#w = zeros(ComplexF64, N, M)
a = randn(ComplexF64, N)
#a = fill(1.0im, N)
b = randn(ComplexF64, M)
#b = zeros(ComplexF64, M)

# pick a random configuration
s = rand([1, -1], N)
θs = transpose(w)*s + b

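# definition of the (unnormalized) wavefunction; 1/Z and the constant factor 2^M from
# the 2cosh terms are dropped because only ratios of amplitudes are used below
wavefunction(s, a, θs) = exp(dot(s, a)) * prod(cosh.(θs))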

@show psi = wavefunction(s, a, θs)
@show s;

psi = wavefunction(s, a, θs) = 21.38825288305486 + 555.3118727599374im
s = [-1, -1, 1, 1, 1, -1, 1, -1, -1, -1, -1, 1, -1, -1, 1, -1]


## The optimization algorithm
### Gradient descent
We would like to represent the ground state of some Hamiltonian $H$ (for example, the quantum Ising model in a transverse field), so we minimize the variational energy of $\psi$, which is a function of the current weights $w$ and biases $a, b$:
$$
    E = \frac{\langle \psi | H | \psi \rangle}{\langle \psi | \psi \rangle}
$$
The variational energy is a real-valued function of the complex weights and biases, so its steepest-descent direction is set by the derivative with respect to the conjugate parameters $w^*_k$ (treating $w_k$ and $w^*_k$ as independent; here $w_k$ stands for any weight or bias). Dropping a constant factor of 2, which can be absorbed into the learning rate, the generalized force is
$$
    F_k(w) = \frac{\partial}{\partial w^*_k}E(w, w^*) =
    \sum_{\boldsymbol{\sigma}} 
    \frac{\left[\frac{\partial}{\partial w^*_k} \psi^*(\boldsymbol{\sigma} )\right]\langle \boldsymbol{\sigma}  | H | \psi \rangle}{\langle \psi | \psi \rangle} - 
    \sum_{\boldsymbol{\sigma}}
    \frac{\psi^*(\boldsymbol{\sigma} )\langle \boldsymbol{\sigma}  | H | \psi \rangle}{\langle \psi | \psi \rangle} 
    \sum_{\boldsymbol{\sigma} '} 
    \frac{\left[\frac{\partial}{\partial w^*_k} \psi^*(\boldsymbol{\sigma} ')\right]\psi(\boldsymbol{\sigma} ')}{\langle \psi | \psi \rangle}
$$
As usual, interpreting $p(\boldsymbol{\sigma} ) = \frac{|\psi(\boldsymbol{\sigma} )|^2}{\langle \psi | \psi \rangle}$ as the probability of configuration $\boldsymbol{\sigma}$ and defining the average (with some abuse of notation)
$\langle A\rangle = \sum_{\boldsymbol{\sigma}} p(\boldsymbol{\sigma}) A(\boldsymbol{\sigma})$, we have 
$$
F_k(w) =  \langle O^*_k \mathcal{E}\rangle - \langle O^*_k\rangle\langle \mathcal{E}\rangle
$$
where the local energy $\mathcal{E}(\boldsymbol{\sigma})$ and the log-derivatives $O_k(\boldsymbol{\sigma})$ are defined as 
$$
    \begin{aligned}
        \mathcal{E}(\boldsymbol{\sigma} ) & = \frac{\langle \boldsymbol{\sigma}  | H | \psi \rangle}{\psi(\boldsymbol{\sigma})}  \\
        O_k(\boldsymbol{\sigma}) & = \frac{\partial_k \psi(\boldsymbol{\sigma})}{\psi(\boldsymbol{\sigma})} 
    \end{aligned}
$$
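For a chain small enough to enumerate, the identity $E = \langle \mathcal{E} \rangle$ can be checked directly. The sketch below assumes the transverse-field Ising Hamiltonian $H = -\sum_i \sigma^z_i\sigma^z_{i+1} - g\sum_i \sigma^x_i$ with periodic boundaries and compares the $p$-weighted average of the local energy with $\langle\psi|H|\psi\rangle / \langle\psi|\psi\rangle$ computed from a dense matrix. All helper names (`rbm_amplitude`, `config`, `site_op`, `tfim_hamiltonian`, `local_energy`, `energy_two_ways`) are illustrative.

```julia
using LinearAlgebra

# Unnormalized RBM amplitude exp(Σ_i a_i σ_i) Π_j 2cosh(θ_j)
rbm_amplitude(s, a, b, w) = exp(sum(a .* s)) * prod(2 .* cosh.(transpose(w) * s + b))

# Spin configuration for basis index k (0-based); site 1 is the most significant bit,
# matching the ordering produced by `kron` below (bit 0 ↔ spin +1, bit 1 ↔ spin -1)
config(k, n) = [1 - 2 * ((k >> (n - i)) & 1) for i in 1:n]

# Dense operator acting with the 2×2 matrix `op` on site i of an n-site chain
site_op(op, i, n) = reduce(kron, [j == i ? op : [1 0; 0 1] for j in 1:n])

# H = -Σ_i σᶻ_i σᶻ_{i+1} - g Σ_i σˣ_i with periodic boundary conditions
function tfim_hamiltonian(n, g)
    σz, σx = [1 0; 0 -1], [0 1; 1 0]
    H = zeros(2^n, 2^n)
    for i in 1:n
        H -= site_op(σz, i, n) * site_op(σz, mod1(i + 1, n), n)
        H -= g * site_op(σx, i, n)
    end
    return H
end

# Local energy ℰ(σ) = ⟨σ|H|ψ⟩ / ψ(σ): a diagonal term plus one ratio per spin flip
function local_energy(s, a, b, w, g)
    n = length(s)
    ψs = rbm_amplitude(s, a, b, w)
    zz = sum(s[i] * s[mod1(i + 1, n)] for i in 1:n)
    x = zero(ComplexF64)
    for i in 1:n
        s_flip = copy(s)
        s_flip[i] *= -1
        x += rbm_amplitude(s_flip, a, b, w) / ψs
    end
    return -zz - g * x
end

# Energy from the p-weighted local energies and from the dense Hamiltonian
function energy_two_ways(a, b, w, g)
    n = length(a)
    configs = [config(k, n) for k in 0:(2^n - 1)]
    ψ = [rbm_amplitude(s, a, b, w) for s in configs]
    p = abs2.(ψ) ./ sum(abs2, ψ)
    E_local = sum(p[k] * local_energy(configs[k], a, b, w, g) for k in eachindex(configs))
    H = tfim_hamiltonian(n, g)
    E_direct = dot(ψ, H * ψ) / dot(ψ, ψ)
    return E_local, E_direct
end

E_local, E_direct = energy_two_ways(0.1 * randn(ComplexF64, 4), 0.1 * randn(ComplexF64, 3),
                                    0.1 * randn(ComplexF64, 4, 3), 1.0)
@show E_local ≈ E_direct
```

The dense matrix has $2^n \times 2^n$ entries, so this comparison is only feasible for a handful of spins, whereas the local energy needs just $n$ amplitude ratios per configuration.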
The usual gradient descent algorithm can be implemented to update the weights and biases with learning rate $\lambda$ according to 
$$
    w_k^{i+1} = w_k^i - \lambda F_k(w^i)
$$
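Before using the force in this update, it can be checked against a numerical derivative on a chain small enough to enumerate. The expression $\langle O^*_k \mathcal{E}\rangle - \langle O^*_k\rangle\langle \mathcal{E}\rangle$ equals the Wirtinger derivative $\partial E/\partial w^*_k = \tfrac{1}{2}\left(\partial_x + i\,\partial_y\right)E$ with $w_k = x + iy$. The sketch below is restricted to the visible biases $a_k$, for which $O_{a_k}(\boldsymbol{\sigma}) = \sigma_k$ follows directly from the ansatz, and assumes the transverse-field Ising chain with periodic boundaries. The helper names are illustrative, and `a` must be a complex vector.

```julia
# Unnormalized RBM amplitude exp(Σ_i a_i σ_i) Π_j 2cosh(θ_j)
rbm_amplitude(s, a, b, w) = exp(sum(a .* s)) * prod(2 .* cosh.(transpose(w) * s + b))

# Spin configuration (entries ±1) for the 0-based basis index k
config(k, n) = [1 - 2 * ((k >> (n - i)) & 1) for i in 1:n]

# Local energy of H = -Σ_i σᶻ_i σᶻ_{i+1} - g Σ_i σˣ_i (periodic boundaries)
function local_energy(s, a, b, w, g)
    n = length(s)
    ψs = rbm_amplitude(s, a, b, w)
    zz = sum(s[i] * s[mod1(i + 1, n)] for i in 1:n)
    x = zero(ComplexF64)
    for i in 1:n
        s_flip = copy(s)
        s_flip[i] *= -1
        x += rbm_amplitude(s_flip, a, b, w) / ψs
    end
    return -zz - g * x
end

# Exact energy and force on the visible biases from all 2^n configurations
function energy_and_force_a(a, b, w, g)
    n = length(a)
    configs = [config(k, n) for k in 0:(2^n - 1)]
    ψ = [rbm_amplitude(s, a, b, w) for s in configs]
    p = abs2.(ψ) ./ sum(abs2, ψ)
    Eloc = [local_energy(s, a, b, w, g) for s in configs]
    E = sum(p .* Eloc)
    # O_{a_k}(σ) = σ_k is real, so O* = O
    F = [sum(p .* getindex.(configs, k) .* Eloc) - sum(p .* getindex.(configs, k)) * E
         for k in 1:n]
    return real(E), F
end

# Wirtinger derivative ∂E/∂a*_k = ½(∂E/∂x + i ∂E/∂y) by central differences
function fd_force_a(a, b, w, g, k; ϵ = 1e-6)
    function energy_shifted(δ)
        a_shift = copy(a)
        a_shift[k] += δ
        return energy_and_force_a(a_shift, b, w, g)[1]
    end
    dEdx = (energy_shifted(ϵ) - energy_shifted(-ϵ)) / (2ϵ)
    dEdy = (energy_shifted(ϵ * im) - energy_shifted(-ϵ * im)) / (2ϵ)
    return (dEdx + im * dEdy) / 2
end

a_f = 0.1 * randn(ComplexF64, 4)
b_f = 0.1 * randn(ComplexF64, 3)
w_f = 0.1 * randn(ComplexF64, 4, 3)
E_f, F_f = energy_and_force_a(a_f, b_f, w_f, 1.0)
@show isapprox(F_f[1], fd_force_a(a_f, b_f, w_f, 1.0, 1); atol = 1e-6)
```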

Based on the definition of $\psi(\boldsymbol{\sigma})$, the derivatives are given by
$$
    \begin{aligned}
        \partial_{a_i} \ln \psi &= \sigma_i \\
        \partial_{b_j} \ln \psi &= \tanh \theta_j \\
        \partial_{w_{ij}} \ln \psi &= \sigma_i \tanh \theta_j
    \end{aligned}
$$
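These log-derivatives can be checked against finite differences. Because $\ln\psi$ is holomorphic in the parameters, a small real step is enough. The sketch below uses fixed illustrative parameter values and hypothetical helper names (`logpsi`, `log_derivatives`, `fd`); it takes the principal branch of the logarithm, which is safe here because the chosen $\cosh\theta_j$ stay away from the negative real axis.

```julia
# ln ψ(σ) for the RBM ansatz (normalization dropped)
logpsi(s, a, b, w) = sum(a .* s) + sum(log.(2 .* cosh.(transpose(w) * s + b)))

# Analytic log-derivatives: σ_i, tanh(θ_j) and σ_i tanh(θ_j)
function log_derivatives(s, a, b, w)
    t = tanh.(transpose(w) * s + b)
    return (da = ComplexF64.(s), db = t, dw = s * transpose(t))
end

# Central finite difference of f with respect to entry k (linear index) of x
function fd(f, x, k; ϵ = 1e-6)
    xp = copy(x); xp[k] += ϵ
    xm = copy(x); xm[k] -= ϵ
    return (f(xp) - f(xm)) / (2ϵ)
end

s_t = [1, -1, 1]
a_t = ComplexF64[0.2 + 0.1im, -0.3, 0.05im]
b_t = ComplexF64[0.1 - 0.2im, 0.4im]
w_t = ComplexF64[0.1 0.2im; -0.1im 0.05; 0.3 -0.2]
d_t = log_derivatives(s_t, a_t, b_t, w_t)

@show isapprox(fd(x -> logpsi(s_t, x, b_t, w_t), a_t, 2), d_t.da[2]; atol = 1e-6)
@show isapprox(fd(x -> logpsi(s_t, a_t, x, w_t), b_t, 1), d_t.db[1]; atol = 1e-6)
# linear index 5 of the 3×2 matrix w is the entry (i, j) = (2, 2)
@show isapprox(fd(x -> logpsi(s_t, a_t, b_t, x), w_t, 5), d_t.dw[5]; atol = 1e-6)
```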


### Monte Carlo evaluation
Since the configuration space contains $2^N$ configurations, it is generally impossible to evaluate the above averages exactly for problems of interest. Therefore, we estimate them by sampling. At each step of the optimization, Monte Carlo sampling generates a sequence of configurations $\boldsymbol{\sigma}_1, \dots, \boldsymbol{\sigma}_\ell$ and evaluates $O_k$ and $\mathcal{E}$ on them. A new configuration is proposed by flipping a random spin and is accepted with probability 
$$
    P_{\text{accept}}(\boldsymbol{\sigma}_{\ell+1}) = \min \left(1, \left|\frac{\psi(\boldsymbol{\sigma}_{\ell+1})}{\psi(\boldsymbol{\sigma}_\ell)}\right|^2\right)
$$
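To avoid recalculations, the set of $\theta_j$ is stored and, when spin $i$ is flipped, updated using only the weights connected to that spin: $\theta_j \to \theta_j - 2 w_{ij}\sigma_i$, where $\sigma_i$ is the value before the flip.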
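The shortcut for $\theta_j$ can be compared with a full recomputation. This is a small sketch with illustrative names (`theta_full`, `theta_after_flip`) and `_d`-suffixed variables so that the notebook state is left untouched.

```julia
# Full recomputation of θ_j = Σ_i w_ij σ_i + b_j, cost O(N M)
theta_full(s, w, b) = transpose(w) * s + b

# O(M) update after flipping spin i; `s` is the configuration before the flip
theta_after_flip(θ, s, w, i) = θ - 2 * s[i] .* w[i, :]

s_d = [1, -1, -1, 1]
w_d = 0.1 * randn(ComplexF64, 4, 3)
b_d = randn(ComplexF64, 3)
θ_d = theta_full(s_d, w_d, b_d)

i_flip = 2
s_flipped = copy(s_d)
s_flipped[i_flip] *= -1

@show theta_after_flip(θ_d, s_d, w_d, i_flip) ≈ theta_full(s_flipped, w_d, b_d)
```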

Note that this process amounts to a version of *stochastic* gradient descent, since the full gradient is not computed at each step. The configurations, however, are drawn by a Metropolis walk rather than uniformly at random, so that they follow the probability distribution induced by the wavefunction.
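The Metropolis walk on its own can be sketched as follows. The function name `metropolis_samples` and its `rng` keyword are illustrative choices, and the amplitude is recomputed from scratch at each proposal for clarity rather than speed. The current configuration is recorded after every proposal, whether or not the move was accepted, which is what makes the samples follow $p(\boldsymbol{\sigma})$. For the special case $w = 0$, $b = 0$ the distribution factorizes and $\langle\sigma_i\rangle = \tanh(2\,\mathrm{Re}\,a_i)$, which gives a simple reference value.

```julia
using Random

# Unnormalized RBM amplitude exp(Σ_i a_i σ_i) Π_j 2cosh(θ_j)
rbm_amplitude(s, a, b, w) = exp(sum(a .* s)) * prod(2 .* cosh.(transpose(w) * s + b))

# Single-spin-flip Metropolis walk targeting p(σ) ∝ |ψ(σ)|²
function metropolis_samples(a, b, w, n_samples; rng = Random.default_rng())
    n = length(a)
    s = rand(rng, [1, -1], n)
    ψ = rbm_amplitude(s, a, b, w)
    samples = Vector{Vector{Int}}(undef, n_samples)
    for l in 1:n_samples
        i = rand(rng, 1:n)
        s_new = copy(s)
        s_new[i] *= -1
        ψ_new = rbm_amplitude(s_new, a, b, w)
        # accept with probability min(1, |ψ_new/ψ|²)
        if rand(rng) < abs2(ψ_new / ψ)
            s, ψ = s_new, ψ_new
        end
        samples[l] = copy(s)   # a rejected move repeats the current configuration
    end
    return samples
end

# product-state check: with w = 0 and b = 0, ⟨σ_1⟩ = tanh(2 Re a_1)
a_m = ComplexF64[0.3, -0.2, 0.1]
b_m = zeros(ComplexF64, 2)
w_m = zeros(ComplexF64, 3, 2)
samples_m = metropolis_samples(a_m, b_m, w_m, 200_000; rng = MersenneTwister(1))
m1 = sum(x[1] for x in samples_m) / length(samples_m)
@show m1, tanh(2 * real(a_m[1]))
```

The two printed numbers should agree only up to statistical error, which shrinks roughly as the inverse square root of the number of samples.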

### Stochastic reconfiguration 

In [None]:
# Gradient-descent RBM learning for the ground state of the transverse-field Ising chain,
# H = -Σ_i σᶻ_i σᶻ_{i+1} - g Σ_i σˣ_i (periodic boundary conditions).
# Note: assigning to the globals s, θs, psi, a, b, w inside the loops relies on the
# soft-scope rules of the notebook/REPL.
λ = 0.05          # learning rate
g = 1.00          # transverse field
n_mcsweeps = 10   # Monte Carlo sweeps per learning step
n_samples = n_mcsweeps * N

for learning_step = 1:100
    # accumulators for ⟨O*⟩, ⟨ℰ⟩ and ⟨O* ℰ⟩
    O_av  = zeros(ComplexF64, n_pars)
    E_av  = zero(ComplexF64)
    EO_av = zeros(ComplexF64, n_pars)

    for l = 1:n_samples
        # propose flipping one random spin; θ only needs an O(M) update
        spin = rand(1:N)
        s_new = copy(s)
        s_new[spin] *= -1
        θs_new = θs - 2 * s[spin] .* w[spin, :]
        psi_new = wavefunction(s_new, a, θs_new)

        # Metropolis acceptance with probability min(1, |ψ_new/ψ|²)
        if abs2(psi_new / psi) > rand()
            psi = psi_new
            s = s_new
            θs = θs_new
        end

        # measure after every proposal: a rejected move counts the current configuration again
        t = tanh.(θs)
        # O* ordered as (a, b, w), with w flattened column-major to match `reshape` below
        O = conj([s; t; vec(s * transpose(t))])

        # local energy ℰ(σ) = -Σ_i σ_i σ_{i+1} - g Σ_i ψ(σ with spin i flipped)/ψ(σ)
        szsz = sum(s[i] * s[mod1(i + 1, N)] for i in 1:N)
        sx = zero(ComplexF64)
        for i = 1:N
            s_flip = copy(s)
            s_flip[i] *= -1
            sx += wavefunction(s_flip, a, θs - 2 * s[i] .* w[i, :]) / psi
        end
        E = -szsz - g * sx

        O_av += O
        E_av += E
        EO_av += E .* O
    end

    EO_av /= n_samples
    O_av /= n_samples
    E_av /= n_samples
    # generalized force F_k = ⟨O*_k ℰ⟩ - ⟨O*_k⟩⟨ℰ⟩
    F = EO_av - E_av .* O_av

    println(real(E_av))
    # update weights and biases
    a -= λ * F[1:N]
    b -= λ * F[N+1:N+M]
    w -= λ * reshape(F[N+M+1:n_pars], N, M)
    # the parameters changed, so refresh the cached θ and ψ of the current configuration
    θs = transpose(w) * s + b
    psi = wavefunction(s, a, θs)
end