# Emergent Classicality

In [148]:
%run 'main.py'

## Model

### Dipolar Distribution

In directional statistics, the $d$-dimensional dipolar distribution is a distribution of $d$-component unit vectors $x=(x_1,x_2,\cdots,x_d)$ with the PDF
$$p(x)=\frac{1}{\Omega_{d-1}}(1+x_d m),$$
where $\Omega_{d-1}=2 \pi^{d/2}/\Gamma(d/2)$ is the surface area of the unit sphere $S^{d-1}$ (e.g. $\Omega_0=2$, $\Omega_1=2\pi$, $\Omega_2 = 4\pi$). The distribution is controlled by a single parameter $m$, called the polarization, which must lie in $m\in[-1,1]$ to ensure that the probability is non-negative. It has the following statistical properties:

* mean: $(0,\cdots,0,m/d)$
* variance: diagonal components $(1/d,\cdots,1/d,1/d-(m/d)^2)$, off diagonal components all vanish
* entropy: $S(m)=\log\Omega_{d-1}+\frac{1}{2}{}_2F_1^{(0,1,0,0)}\big(-\frac{1}{2},0;\frac{d}{2};m^2\big)$, where ${}_2F_1^{(0,1,0,0)}$ is the first-order derivative of the hypergeometric function with respect to its second parameter. Special cases:
    * $d=1$: $S=-\frac{1+m}{2}\log\frac{1+m}{2}-\frac{1-m}{2}\log\frac{1-m}{2}$,
    * $d=3$: $S=\log(4\pi)+\frac{1}{2}-\frac{(1+m)^2}{4m}\log(1+m)+\frac{(1-m)^2}{4m}\log(1-m)$.
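
The $d=3$ properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library (it does not use `main.py`); the helper names `sphere_integral` and `check` are our own, and the quadrature relies on the fact that on $S^2$ a function of $x_3$ alone integrates as $\int\mathrm{d}\Omega\,f(x_3)=2\pi\int_{-1}^{1}\mathrm{d}x_3\,f(x_3)$.

```python
import math

def entropy_closed_form(m):
    """Closed-form d=3 entropy quoted above (needs 0 < |m| < 1)."""
    return (math.log(4 * math.pi) + 0.5
            - (1 + m) ** 2 / (4 * m) * math.log(1 + m)
            + (1 - m) ** 2 / (4 * m) * math.log(1 - m))

def sphere_integral(f, n=20000):
    """Midpoint rule for the integral of f(x_3) over S^2 (dOmega = 2*pi*dx_3)."""
    h = 2.0 / n
    return 2 * math.pi * h * sum(f(-1 + (k + 0.5) * h) for k in range(n))

def check(m):
    """Return (normalization, mean of x_3, variance of x_3, entropy) for d=3."""
    p = lambda t: (1 + m * t) / (4 * math.pi)
    norm = sphere_integral(p)
    mean = sphere_integral(lambda t: t * p(t))
    second = sphere_integral(lambda t: t * t * p(t))
    entropy = sphere_integral(lambda t: -p(t) * math.log(p(t)))
    return norm, mean, second - mean ** 2, entropy

m = 0.5
norm, mean, var, entropy = check(m)
print(norm, mean, var, entropy, entropy_closed_form(m))
```

For $m=0.5$ the numerical mean and variance of $x_3$ should agree with $m/3$ and $1/3-(m/3)^2$, and the numerical entropy with the closed form, up to quadrature error.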

To calculate the marginal distribution of $x_d$, we parameterize $x_d=\cos\theta$, so that $\mathrm{d}x_d=-\sin\theta\;\mathrm{d}\theta$ and, by definition,
$$p(x_d)\;\mathrm{d}x_d=-\sin^{d-2}\theta\;\mathrm{d}\theta\int\mathrm{d}\Omega_{d-2} p(x).$$
In general, we have $p(x_d)=\Omega_{d-2}\,p(x)\sin^{d-3}\theta=\frac{\Omega_{d-2}}{\Omega_{d-1}}(1+x_d m)(1-x_d^2)^{(d-3)/2}$. Only for $d=3$ does the $\sin$ factor drop out, leaving $p(x_3)=\frac{1}{2}(1+x_3 m)$, which is linear in $x_3$ just like $p(x)$. In this sense $d=3$ is special.

The dipolar distribution plays an important role in sampling weak measurement outcomes of a qubit. In particular, we will focus on the $d=1$ and $d=3$ cases, which are relevant to our later discussion.

* **$d=1$ case**. In this case, the unit vector $x$ degenerates to two points $\pm1$, and the distribution reduces to the Bernoulli distribution
$$p(x)=\frac{1}{2}(1+x m)=\left\{\begin{array}{ll}\frac{1+m}{2}&x=+1,\\\frac{1-m}{2}&x=-1.\end{array}\right.$$
It can be sampled by first drawing $z\sim\mathrm{Bernoulli}(p_z)$ with $p_z=(1-m)/2$, and then setting $x=1-2z$.

In [10]:
dp1 = Dipolar1D(0.)
dp1.sample()

tensor([-1.])

* **$d=3$ case**. We first sample $x_3$ from the marginal distribution
$$p(x_3)=\frac{1}{2}(1+x_3 m).$$
This can be done by first sampling $z\sim \mathrm{uniform}(0,1)$ and then generating
$$x_3=\frac{m-2+4z}{1+\sqrt{(m-1)^2+4mz}}.$$
Then we sample $\theta\sim\mathrm{uniform}(0,2\pi)$ and then generate
$$x_1+\mathrm{i} x_2=e^{\mathrm{i}\theta}\sqrt{1-x_3^2}.$$
With these, we can put together $x=(x_1,x_2,x_3)$.

In [11]:
dp3 = Dipolar3D(0.)
dp3.sample()

tensor([ 0.9789, -0.1061, -0.1749])
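
The $d=3$ recipe above can be sketched in plain Python as follows. This is an illustration with the standard library only, not the `Dipolar3D` implementation from `main.py`; the function name `sample_dipolar_3d` and the clamping step are our own additions.

```python
import math
import random

def sample_dipolar_3d(m, rng):
    """Draw a unit vector (x1, x2, x3) with density (1 + m*x3) / (4*pi)."""
    z = rng.random()
    # Inverse CDF of p(x3) = (1 + m*x3)/2, written so it stays finite at m = 0.
    x3 = (m - 2 + 4 * z) / (1 + math.sqrt((m - 1) ** 2 + 4 * m * z))
    x3 = max(-1.0, min(1.0, x3))  # guard against rounding just outside [-1, 1]
    theta = rng.uniform(0.0, 2 * math.pi)
    r = math.sqrt(1 - x3 * x3)
    return (r * math.cos(theta), r * math.sin(theta), x3)

rng = random.Random(0)  # fixed seed, only for reproducibility
m = 0.5
samples = [sample_dipolar_3d(m, rng) for _ in range(100000)]
mean_x1 = sum(s[0] for s in samples) / len(samples)
mean_x3 = sum(s[2] for s in samples) / len(samples)
print(mean_x1, mean_x3)
```

The sample mean of $x_3$ should approach $m/3$, while the means of $x_1$ and $x_2$ should approach zero, consistent with the statistical properties listed earlier.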

`Dipolar1D` and `Dipolar3D` are probability distributions that support two methods:
* `.sample(shape)`: draws multiple samples forming an array of the given `shape`.
* `.log_prob(x)`: gives the log-probability of the sample $x$.

Both also support parallel sampling of independent distributions with different polarization parameters, e.g. `Dipolar1D([m0,m1,m2,...])`.

In [12]:
dp3 = Dipolar3D([1.,0.5,0.,-1.])
x = dp3.sample([2])
x

tensor([[[-0.0108, -0.8156,  0.5785],
         [ 0.5874,  0.5201,  0.6201],
         [ 0.5066,  0.8566,  0.0978],
         [ 0.8619,  0.3848,  0.3303]],

        [[ 0.0416, -0.8498,  0.5254],
         [-0.2548, -0.0957,  0.9622],
         [ 0.1504,  0.9798, -0.1315],
         [-0.8639, -0.4240,  0.2719]]])

In [13]:
dp3.log_prob(x)

tensor([[-2.0746, -2.2610, -2.5310, -2.9319],
        [-2.1087, -2.1382, -2.5310, -2.8483]])

### Apparatus and Measurements

`Apparatus(epsilon, N, alpha0 = 0, scheme = 'fixed'|'random')` generates an apparatus to simulate the weak measurement. It has the following parameters:
* `epsilon`: strength of the measurement, defined via the measurement operator $K_i=e^{\epsilon S_i}$, with $S_i$ a unit-norm operator to be weakly measured, such that $\epsilon=0$ corresponds to no measurement and $\epsilon\to\infty$ to a projective measurement.
* `N`: the number of ancilla qubits in the apparatus (also referred to as the sample size).
* `alpha0`: initial $\alpha_0$ value of the cat state, defined via
$$|\Psi_0\rangle=\frac{1}{\sqrt{2\cosh\alpha_0}}\big(e^{+\alpha_0/2}|000\cdots\rangle+e^{-\alpha_0/2}|111\cdots\rangle\big)$$
* `scheme`: measurement scheme, can be `'fixed'` or `'random'`. This has to do with the choice of measurement axis
    * `'fixed'`: fixed axis weak measurement described by the Kraus operator
    $$K_{[x]} = \prod_i K_{x_i}=\prod_i \frac{e^{\epsilon x_i Z_i}}{\sqrt{2 \cosh \epsilon}},$$
    where the $i$th qubit measurement outcome $x_i=\pm1$ is binary along the measurement axis.
    * `'random'`: random axis weak measurement described by the Kraus operator
    $$K_{[x]} = \prod_i K_{x_i}=\prod_i \frac{e^{\epsilon (x_{i1} X_i+x_{i2} Y_i+x_{i3} Z_i)}}{\sqrt{2 \cosh \epsilon}},$$
    where the $i$th qubit measurement outcome $x_i=(x_{i1},x_{i2},x_{i3})$ is a unit vector that can orient in any direction.

Starting from the cat state $|\Psi_0\rangle$, the joint probability for the measurement outcome $[x]$ to appear is given by
$$p[x]=\langle\Psi_0|K_{[x]}|\Psi_0\rangle.$$
It is hard to sample $[x]$ directly from the joint distribution. However, since the Kraus operators on different qubits commute with each other, their order does not matter (i.e. the distribution $p[x]$ is invariant under permuting qubits in $[x]$), so we can sample $[x]$ autoregressively
$$p[x]=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)\cdots.$$
In fact, the regression is hidden Markovian, i.e. the condition on the history of $(x_1,x_2,\cdots,x_{i-1})$ can be replaced by the condition on the $\alpha_i$ parameter, such that $p(x_i|x_1,\cdots,x_{i-1})=p(x_i|\alpha_{i-1})$. The generative process is iterative
$$x_i \sim p(x_i|\alpha_{i-1}), \quad \alpha_{i}=\alpha_{i-1}+b(x_i),$$
initiated from $\alpha_0$. Depending on the measurement scheme, we have
* **Fixed Axis:** $p(x|\alpha)$ is 1D dipolar distribution with polarization $m=\tanh\alpha\tanh\epsilon$ and $b(x)=\epsilon x$.
* **Random Axis:** $p(x|\alpha)$ is 3D dipolar distribution with polarization $m=\tanh\alpha\tanh\epsilon$ and $b(x)=2\mathrm{arctanh}(x_3 \tanh(\epsilon/2))$.
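
For the fixed axis scheme, the iterative process above can be sketched in a few lines of standard-library Python. This is an illustration only, not the `Apparatus` implementation from `main.py`; the names `sample_fixed_axis` and `log_prob_fixed_axis` are hypothetical. The second function accumulates the chain-rule log-probability, which lets us check that $p[x]$ is normalized and permutation invariant.

```python
import itertools
import math
import random

def sample_fixed_axis(epsilon, n_qubits, alpha0=0.0, rng=None):
    """Draw x_i ~ p(x_i | alpha_{i-1}), then update alpha_i = alpha_{i-1} + epsilon * x_i."""
    rng = rng or random.Random()
    alpha, outcomes = alpha0, []
    for _ in range(n_qubits):
        m = math.tanh(alpha) * math.tanh(epsilon)    # polarization of the 1D dipolar
        x = 1 if rng.random() < (1 + m) / 2 else -1  # P(x = +1) = (1 + m) / 2
        outcomes.append(x)
        alpha += epsilon * x                         # b(x) = epsilon * x
    return outcomes, alpha

def log_prob_fixed_axis(outcomes, epsilon, alpha0=0.0):
    """Chain-rule log-probability of a full outcome sequence."""
    alpha, total = alpha0, 0.0
    for x in outcomes:
        m = math.tanh(alpha) * math.tanh(epsilon)
        total += math.log((1 + x * m) / 2)
        alpha += epsilon * x
    return total

eps, a0 = 0.3, 0.2
xs, alpha_final = sample_fixed_axis(eps, 10, a0, random.Random(0))
lp = log_prob_fixed_axis([1, 1, -1, 1], eps, a0)
lp_perm = log_prob_fixed_axis([-1, 1, 1, 1], eps, a0)
norm = sum(math.exp(log_prob_fixed_axis(seq, eps, a0))
           for seq in itertools.product([1, -1], repeat=4))
print(xs, alpha_final, lp, lp_perm, norm)
```

Since each conditional factor equals $\cosh(\alpha_{i})/(2\cosh\alpha_{i-1}\cosh\epsilon)$ for $x_i=\pm1$, the product telescopes and depends on the outcomes only through $\sum_i x_i$, which is why the two permuted sequences receive the same probability.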

In [14]:
app = Apparatus(0.2, 3, scheme = 'fixed')
app.sample(2)

tensor([[[-1.],
         [ 1.],
         [ 1.]],

        [[-1.],
         [-1.],
         [-1.]]])

In [15]:
app = Apparatus(0.2, 3, scheme = 'random')
app.sample(2)

tensor([[[ 0.5197,  0.4696, -0.7137],
         [-0.9007, -0.3837,  0.2040],
         [-0.9747,  0.1965,  0.1063]],

        [[-0.9821, -0.1821,  0.0476],
         [ 0.8693, -0.4577,  0.1866],
         [-0.1612,  0.8579, -0.4880]]])

### Encoder

For each sample, let $x_i$ be the measurement outcome of the $i$th qubit. We use the attention mechanism to come up with a permutationally invariant feature $h$ (hidden variable),
$$h[x] = \frac{\sum_i [x_i, v(x_i)]e^{-s(x_i)}}{\sum_i e^{-s(x_i)}},$$
where
* $s(x_i)$ is the score function, modeled by a neural network. The softmax of $-s(x_i)$ (over the index $i$) is the attention that $x_i$ receives, i.e. the weight used to aggregate the features from $x_i$.
* $v(x_i)$ is the value function, modeled by a neural network. It learns to capture relevant features in $x_i$ that can be used to infer the statistical properties of the latent variables. Its result $v(x_i)$ is further concatenated with $x_i$ to form a feature vector, which provides direct access to the input $x_i$ as a shortcut.

From the aggregated feature $h$, we infer the distribution of the latent variable $z$, modeled as a normal distribution with mean $\mu_z$ and standard deviation $\sigma_z$ (inferred in log scale). The inference maps are modeled by neural networks,
$$\mu_z = \mu(h), \quad \log\sigma_z = \log\sigma(h).$$
The resulting encoding distribution is
$$q(z|x) = \text{Normal}(\mu_z(h[x]), \sigma_z(h[x])).$$
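
To make the aggregation step concrete, here is a minimal sketch of the attention-weighted feature $h[x]$ in plain Python. The functions `score` and `value` are hypothetical fixed stand-ins for the learned networks $s$ and $v$ (they are not the ones in `main.py`); the point is only to show that the pooled feature does not depend on the order of the qubits.

```python
import math

def score(x):
    """Hypothetical stand-in for the learned score network s(x_i)."""
    return 0.5 * x[2] - 0.1 * x[0]

def value(x):
    """Hypothetical stand-in for the learned value network v(x_i)."""
    return [math.tanh(x[2]), x[0] * x[1]]

def aggregate(xs):
    """Average of [x_i, v(x_i)] with weights proportional to exp(-s(x_i))."""
    logits = [-score(x) for x in xs]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    feats = [list(x) + value(x) for x in xs]  # concatenate x_i with v(x_i)
    return [sum(w * f[k] for w, f in zip(weights, feats)) / total
            for k in range(len(feats[0]))]

# Three random-axis outcomes, taken from the sample output shown earlier.
xs = [(0.5197, 0.4696, -0.7137),
      (-0.9007, -0.3837, 0.2040),
      (-0.9747, 0.1965, 0.1063)]
h = aggregate(xs)
h_perm = aggregate([xs[2], xs[0], xs[1]])  # same outcomes, permuted
print(h)
print(h_perm)
```

Both calls should return the same five-component feature up to floating-point rounding, since the weighted sum runs over all qubits symmetrically.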

`Encoder(x_dim, z_dim, v_dim=None)` creates an encoder that takes the input $x$ of dimension `x_dim` and returns a distribution $q(z|x)$ in the latent space of dimension `z_dim`. Optionally, one can also specify `v_dim`, which is the dimension of the intermediate values (hidden features) used in the inference. The encoder returns the distribution object $q(z|x)$, from which the latent variable $z$ can be sampled. Note that, as a Gaussian distribution, $q(z|x)$ supports reparametrized sampling via `q.rsample()`. We should use this method in the VAE, so that gradient backpropagation is not blocked by the sampling.

In [16]:
enc = Encoder(1, 2)
x = Apparatus(0.2, 3, scheme = 'fixed').sample(3)
q = enc(x)
print(x,'\n->\n',q.loc)

tensor([[[ 1.],
         [ 1.],
         [ 1.]],

        [[-1.],
         [ 1.],
         [ 1.]],

        [[-1.],
         [-1.],
         [-1.]]]) 
->
 tensor([[ 0.1502, -0.1934],
        [-0.0245, -0.0901],
        [-0.4865,  0.1832]], grad_fn=<AddmmBackward>)


The model preserves the permutation symmetry among the qubits: two inputs related by a permutation give rise to the same latent Gaussian model (with identical parameters).

### Decoder

The decoder should provide a generative model $p(x)$ for the input samples. Here we are facing two choices
* **Plan A:** modeling the joint distribution as ensembles of product distributions, which are correlated by sharing same latent variable,
$$p[x]=\int\mathrm{d}z \prod_i p(x_i|z)p(z).$$
In this approach, we should sample a single instance of $z$ and distribute it over all qubits to evaluate the likelihood of the input data jointly.
* **Plan B:** modeling the joint distribution autoregressively,
$$p[x]=p(x_1)p(x_2|x_1)p(x_3|x_1,x_2)\cdots,$$
where each conditional distribution is modeled by a VAE. Let $[x]_i=(x_1,x_2,\cdots,x_i)$ be the data from the first $i$ qubits, and $[x]=[x]_N$ be the data from all the $N$ qubits. Assuming that the conditional distribution converges after some steps, we can approximate $p(z|[x]_i)$ by the variational ansatz $q(z|[x])$,
$$p(x_{i+1}|[x]_{i})=\int\mathrm{d}z p(x_{i+1}|z)p(z|[x]_i)\simeq \int\mathrm{d}z p(x_{i+1}|z)q(z|[x]).$$
In this approach, we should draw the latent variable $z_i\sim q(z|x)$ for each qubit $i$ independently and evaluate the likelihood of each qubit input independently.

Plan A is the canonical choice in the conventional VAE framework. However, we expect Plan B to explore the latent space more efficiently (as it samples many more latent variables in training), and the data will put more pressure on the encoding distribution $q(z|x)$ to separate the modes (as correlation among the data is only possible if $q(z|x)$ actively switches between different modes).

No matter which plan we choose, we need to describe how to model $p(x_i|z)$. Our probability model will be based on the dipolar distribution. We first use a decoder map to infer the polarization $m(z)$ from the latent variable $z$, then we construct the distribution $p(x_i|z)=p_\text{dipolar}(x_i;m(z))$. Following Plan B, we will need to sample $z_i$ for every qubit $i$ independently and construct the decoding distributions separately (in parallel).
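
The difference between the two plans can be sketched for the fixed axis ($d=1$) case with the standard library only. Here `decoder_m` is a hypothetical stand-in for the learned decoder map $m(z)$ (a plain `tanh`, not the network in `main.py`), and each function returns a single-sample estimate of $\sum_i\log p(x_i|z)$ given the mean `mu` and standard deviation `sigma` of $q(z|x)$.

```python
import math
import random

def decoder_m(z):
    """Hypothetical stand-in for the decoder map m(z); tanh keeps m in (-1, 1)."""
    return math.tanh(z)

def log_p_dipolar_1d(x, m):
    """Log-probability of x = +1 or -1 under the 1D dipolar distribution."""
    return math.log((1 + x * m) / 2)

def log_likelihood_plan_a(xs, mu, sigma, rng):
    """Plan A: a single latent z is shared by all qubits of the sample."""
    m = decoder_m(rng.gauss(mu, sigma))
    return sum(log_p_dipolar_1d(x, m) for x in xs)

def log_likelihood_plan_b(xs, mu, sigma, rng):
    """Plan B: an independent latent z_i is drawn for every qubit."""
    return sum(log_p_dipolar_1d(x, decoder_m(rng.gauss(mu, sigma))) for x in xs)

xs = [1, 1, -1, 1, 1]
rng = random.Random(0)  # fixed seed, only for reproducibility
ll_a = log_likelihood_plan_a(xs, 0.3, 0.5, rng)
ll_b = log_likelihood_plan_b(xs, 0.3, 0.5, rng)
# With sigma = 0 every draw equals mu, so the two plans must coincide.
ll_a0 = log_likelihood_plan_a(xs, 0.3, 0.0, rng)
ll_b0 = log_likelihood_plan_b(xs, 0.3, 0.0, rng)
print(ll_a, ll_b, ll_a0, ll_b0)
```

The two estimates differ only through the fluctuation of $z$: in Plan A all qubits see the same draw, while in Plan B the fluctuations are averaged over qubits within one sample.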

In [17]:
z = q.rsample([3]).transpose(0,1)
dec = Decoder(2)
p = dec(z)
print(z,'\n->\n',p.log_prob(x))

tensor([[[-1.2123,  0.3994],
         [-0.4460, -0.7534],
         [-1.2678, -0.4238]],

        [[ 0.7324, -0.3536],
         [ 1.4823, -0.6025],
         [ 1.9007, -1.1298]],

        [[-0.1863,  1.2557],
         [-0.8953, -0.7847],
         [-2.9996,  1.3832]]], grad_fn=<TransposeBackward0>) 
->
 tensor([[-0.9874, -1.0692, -1.0742],
        [-0.4558, -0.9962, -1.0723],
        [-0.6896, -0.4157, -0.5415]], grad_fn=<LogBackward>)


### Variational Autoencoder (VAE)

`VAE(z_dim, apparatus)` constructs a VAE that uses a latent space of dimension `z_dim` to study the weak measurement signals emitted from the `apparatus`. It can be trained to extract features from measurement signals and represent them in the latent space. It contains
* an encoder that takes the measurement outcomes $x$ (as input) to construct $q(z|x)$
* a decoder that takes the latent representation $z$ to construct $p(x|z)$

Forward pass:
1. construct $q(z|x)$, 
2. sample $z\sim q(z|x)$, 
3. construct $p(x|z)$, 
4. return $p(x|z)$ and $q(z|x)$.

Loss function:
$$\mathcal{L}=\mathcal{L}_\text{re}+\mathcal{L}_\text{kl}=-\mathbb{E}_{z\sim q(z|x)}\log p(x|z)+\mathsf{KL}[q(z|x)|p(z)].$$
The objective is to minimize the loss.
* reconstruction loss is estimated by sampling
$$\mathcal{L}_\text{re}=-\mathbb{E}_{z\sim q(z|x)}\log p(x|z)\simeq -\sum_\text{batch}\sum_{i=1}^{N}\log p(x_i|z).$$
* KL loss can be explicitly calculated given the mean $\mu_z$ and the standard deviation $\sigma_z$ of the encoding distribution $q(z|x)$
$$\mathcal{L}_\text{kl}=\mathsf{KL}[q(z|x)|p(z)]=\sum_\text{batch}\frac{1}{2}(\mu_z^2+\sigma_z^2-1)-\log \sigma_z.$$
Here we have assumed that the prior distribution $p(z)$ is the standard normal distribution, which regularizes the latent space.
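
As a sanity check of the closed-form KL term, the sketch below compares it with a direct numerical integration of $\int\mathrm{d}z\,q(z)\log\frac{q(z)}{p(z)}$ for a single one-dimensional latent variable. It uses only the standard library; the function names are our own, and the test values of $\mu_z$ and $\sigma_z$ are simply taken from the first row of the `[z_mean, z_std]` output shown in the training example.

```python
import math

def kl_closed_form(mu, sigma):
    """KL[Normal(mu, sigma) | Normal(0, 1)] as used in the loss."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1) - math.log(sigma)

def log_normal_pdf(z, mu, sigma):
    """Log-density of Normal(mu, sigma) at z."""
    return (-0.5 * ((z - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))

def kl_numeric(mu, sigma, n=20000, width=10.0):
    """Midpoint-rule estimate of the KL integral over mu +/- width * sigma."""
    lo = mu - width * sigma
    h = 2 * width * sigma / n
    total = 0.0
    for k in range(n):
        z = lo + (k + 0.5) * h
        log_q = log_normal_pdf(z, mu, sigma)
        log_p = log_normal_pdf(z, 0.0, 1.0)
        total += math.exp(log_q) * (log_q - log_p) * h
    return total

mu, sigma = 1.0422, 0.5248
kl_exact = kl_closed_form(mu, sigma)
kl_quad = kl_numeric(mu, sigma)
print(kl_exact, kl_quad)
```

The two numbers should agree up to quadrature error, and the closed form vanishes at $\mu_z=0$, $\sigma_z=1$, where $q(z|x)$ coincides with the prior.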

In [166]:
vae = VAE(1, Apparatus(0.2, 200, scheme='random'))
vae

VAE(
  (encoder): Encoder(
    (attention): Sequential(
      (0): Linear(in_features=3, out_features=96, bias=True)
      (1): ELU(alpha=1.0)
      (2): Linear(in_features=96, out_features=1, bias=True)
      (3): Softmax(dim=-2)
    )
    (value): Sequential(
      (0): Linear(in_features=3, out_features=96, bias=True)
      (1): ELU(alpha=1.0)
      (2): Linear(in_features=96, out_features=24, bias=True)
    )
    (mu): Linear(in_features=27, out_features=1, bias=True)
    (logstd): Linear(in_features=27, out_features=1, bias=True)
  )
  (decoder): Decoder(
    (predict): Sequential(
      (0): Linear(in_features=1, out_features=4, bias=True)
      (1): ELU(alpha=1.0)
      (2): Linear(in_features=4, out_features=1, bias=True)
      (3): Tanh()
    )
  )
)

#### Training

The VAE can be trained by the method `VAE.learn(steps, batch_size)` with
* `steps`: number of steps (batches) to train,
* `batch_size`: number of samples in each batch.

With the KL loss, it can be slow to learn the features at the beginning (although training will still converge eventually). A trick to boost the training in the beginning stage is to turn off the KL loss and then turn it on later. To this end, we introduce the regularization strength $r$ in the loss function
$$\mathcal{L}=\mathcal{L}_\text{re}+r \mathcal{L}_\text{kl}.$$
The multiplier $r$ can be accessed by `VAE.reg`.

In [184]:
vae.reg = 1.
loss = vae.learn(50, 50)
x = vae.apparatus.sample(10)
p, q = vae(x)
print('loss: ',loss)
print('[x3, m_predict, m_sample]:')
print(torch.stack([3*torch.mean(x[:,:,-1],-1).unsqueeze(-1), 
                   vae.decoder.predict(q.loc), 
                   torch.mean(p.m,-1).unsqueeze(-1)],dim=-1).squeeze().data)
print('[z_mean, z_std]:')
print(torch.stack([q.loc, q.scale],dim=-2).squeeze().data)

loss:  -0.0032532726936340332
[x3, m_predict, m_sample]:
tensor([[ 0.2904,  0.2245,  0.3425],
        [ 0.2714,  0.2081,  0.0990],
        [-0.2555, -0.1923, -0.2310],
        [-0.3137, -0.2397, -0.3144],
        [-0.1270, -0.1067,  0.0474],
        [-0.2029, -0.1730, -0.0145],
        [-0.3537, -0.2735, -0.0946],
        [-0.1021, -0.0925, -0.1578],
        [ 0.0466,  0.0372, -0.0419],
        [-0.1897, -0.1510,  0.0373]])
[z_mean, z_std]:
tensor([[ 1.0422,  0.5248],
        [ 0.9549,  0.5241],
        [-0.8968,  0.4905],
        [-1.1465,  0.4846],
        [-0.4737,  0.5000],
        [-0.7983,  0.4944],
        [-1.3327,  0.4831],
        [-0.4062,  0.5002],
        [ 0.1856,  0.5095],
        [-0.6883,  0.4922]])


Preliminary results:
* Polarization-Magnetization relation
    * When $N\epsilon^2$ is large, the VAE learns to predict the polarization 
$$m(\mu_z) \to \left\{\begin{array}{ll}\langle x\rangle & \text{fixed axis},\\
3\langle x_3\rangle & \text{random axis}.\end{array}\right.$$
that follows the $z$-magnetization of the weak measurement outcome. This is nontrivial for the random axis scheme, because the VAE must learn that among the three components of $x$, only the third component matters. The VAE learns this from the correlation among the signals from different qubits, since such correlation only appears in the $x_3$ channel.
    * When $N\epsilon^2$ is small, $m(\mu_z)$ approaches 0 and stops following $\langle x\rangle$. This is expected, as the data distribution is still centered around $\langle x\rangle\sim 0$. It is correct to regard the distribution as unbiased, as the correlation is not obvious yet.
    * **TODO:** We should quantify how $m(\mu_z)$ follows $\langle x\rangle$ as a function of $N$. We can collect the ratio $\langle x\rangle/\mathrm{arctanh}(m(\mu_z(x)))$ and plot that as a function of $N$ (or $\epsilon$).
* Sensitivity of this behavior.
    * Both Plan A and Plan B are able to converge to the solution in which the VAE realizes that the magnetization is the key. But Plan A seems to work better than Plan B in terms of convergence speed.
    * The result is very sensitive to the KL regularization $r$. If $r=0$ (no regularization), the VAE easily learns a $\mu_z$ that precisely follows $\langle x\rangle$ and also drives $\sigma_z$ towards zero. This overfits the data and drives the distribution of $m$ to peak at the $\pm1$ boundary (this is like combining $m=\pm1$ dipolar distributions to realize a uniform distribution). In this case the latent bifurcation will also follow the data bifurcation. But we do not want to pursue this at the price of overfitting. We should set $r=1$ to impose the correct regularization. A comment on the normalization of the reconstruction loss: we should not normalize it by $N$, because each qubit only has a very weak preference, and we need to accumulate the preference over the whole system to determine the system qubit state.
* The standard deviation $\sigma_z$ consistently decreases with $N\epsilon^2$. However, the relation is closer to $\sigma_z\sim(N\epsilon^2)^{-0.34}$ than to the expected exponent $-0.5$. There may be many reasons for this, e.g. the KL regularization puts pressure on the latent space to push distributions together, which distorts the meaning of $\sigma_z$. Nevertheless, one thing is for sure: as $\sigma_z$ decreases, the VAE gets more confident about its prediction of $\mu_z$ when $N$ gets large.
     * **TODO:** How to identify the modes in the latent space? EM algorithm (can it be formulated as an optimization problem)? Introspective learning, i.e. a higher-level machine that analyzes the latent space of the VAE?
* We can observe the distribution $p(z)=\sum_x q(z|x)p(x)$ develops *double-peak* structure. Bifurcation does happen. 
    * But the latent space and data space bifurcations do not happen together. With increasing $N\epsilon^2$, the data space bifurcates first (at $N\epsilon^2\sim 1$ for fixed axis and $N\epsilon^2\sim 3$ for random axis), and the latent space bifurcates later (at $N\epsilon^2\sim 4$ for fixed axis and $N\epsilon^2\sim 12$ for random axis). This is because the KL regularization acts to prevent the machine from drawing a conclusion too early.
    * **TODO:** Can we gain some analytical understanding of how the latent space bifurcation is delayed given the KL regularization? What is the theoretical best performance?
* We can observe the $N\epsilon^2$ universality. The evidence includes the statistics of $\sigma_z$ and $\mu_z$, which are almost identical for the same $N\epsilon^2$, and the distribution $p(z)$, which also looks very similar. This may not be surprising because the data has this universality. But for the VAE, it is not obvious that different sizes $N$ can collapse.
    * **TODO:** quantify decoherence and knowledge acquisition as a function of $N\epsilon^2$. There is no "time" involved.
* Generalizations:
    * Multiple system qubits, coupled selectively to ancilla qubits. The problem no longer has permutation symmetry. The construction needs to be modified. Both the encoder and decoder must memorize the position of qubits.
    * Suppose the CNOT circuit is followed by a local basis transformation on each qubit. Our previous argument is that, as the measurement is also along a random axis, the transformation can be absorbed into the measurement. The argument is fine for the encoder. The problem is on the decoder side, where we cannot use the dipolar distribution along the same axis. The preferential axis must match the axis learnt by the encoder. Can we insert an orthogonal rotation layer (shared between encoder and decoder) to deal with the basis choice? Or should we actually work with a flow-based model (because an orthogonal transformation is already an invertible flow)? The significance is to show that the VAE not only learns about the system qubit but also learns about the apparatus's preferential basis.
    * Can the VAE actively choose the measurement basis? Design the experiment like an experimentalist? Does this "speed up" the decoherence? (It should, because this effectively turns the random axis scheme into the fixed axis scheme, and we know that there is a speed-up by a factor of 3.)
* Is the attention mechanism really necessary? In the end it just provides a weight, but couldn't we just absorb the weight into the feature vector? In what scenario is this mechanism useful, i.e. when does the weight really depend on the pattern of the measurement signal?
    * For example, can we let $\epsilon$ vary from qubit to qubit, such that some qubits are more informative than others? In this case, we will need a nontrivial attention which is position dependent. By learning this position dependence, the machine can discover which quantum sensors are more defective/effective. This will be of practical use.
* Hierarchical network.
    * Can we consider a CNOT circuit that is actually hierarchical, such that the information spreading is exponential? Maybe this is closer to information propagation in realistic systems. To work with the hierarchy, should we also consider tree-like neural networks, such as neural RG? Can we use a flow-based model as the encoder and let the strong measurement emerge at the IR (top) layer?

#### TensorBoard Visualization

We can visualize the training process in TensorBoard. Run `tensorboard --logdir=runs` from the command line in the folder of this notebook. The result is hosted at http://localhost:6006/.

In [224]:
vae = VAE(1, Apparatus(0.1, 100, scheme='random'))
writer = SummaryWriter('runs/'+str(vae))

In [225]:
for i in range(100):
    vae.reg = np.tanh(vae.step/200)
    loss = vae.learn(20, 50)
    writer.add_scalar('training_loss', loss, vae.step)
    with torch.no_grad():
        x = vae.apparatus.sample(100)
        p, q = vae(x)
        writer.add_scalar('latent_norm', torch.mean(torch.norm(q.loc, dim=-1)), vae.step)
        writer.add_scalar('latent_fluctuation', torch.mean(q.scale), vae.step)
        writer.add_histogram('x3', torch.mean(x[:,:,-1],-1), vae.step)
        writer.add_histogram('m', p.m.squeeze(), vae.step)
        writer.add_histogram('z', torch.sum(q.sample([vae.sample_size]),dim=-1).squeeze(), vae.step)

In [53]:
%run 'main.py'
vae = VAE(2, Apparatus(0.1, 10, scheme='random'))

In [56]:
x = vae.apparatus.sample(100)
p, q = vae(x)
re_loss = - torch.sum(p.log_prob(x) + np.log(4*np.pi))
re_loss

tensor(43.1318, grad_fn=<NegBackward>)

In [122]:
np.log(2)

0.6931471805599453