# Theoretical Background
## Simulation-Based Inference
In short, simulation-based inference (SBI) uses simulated data, instead of data obtained from real-world experiments, to infer the underlying structure of real-world data.

In physics, almost every model contains parameters. Consider a specific model in a model class $\mathcal M$ that is parametrized by $\mathbf\theta$. The outcome of an experiment related to that model is uncertain, so it can be treated as a random variable whose probability density is determined by the model parameters and takes the form $p(\mathbf x| \mathbf{\theta},\mathcal M)$, where $\mathbf x$ denotes a possible outcome of the experiment and $\mathbf{\theta}=(\theta_1,\theta_2,...,\theta_n)$ are the parameters of the model. For example, $\mathbf x$ can be the temperatures measured in the pixels of a CMB sky map, and $\theta_1,\theta_2,...$ can be the parameters of a cosmological model, e.g. the Hubble constant $h$ and the cosmic density parameter $\Omega$.

In practice, we usually use experimental results to select a model from all the models we have come up with. The model selection procedure can be divided into two steps. The first step is to select a model class $\mathcal M$ that we believe contains the real-world model. In theory, this step can be done by comparing the [Bayes factors](https://en.wikipedia.org/wiki/Bayes_factor) of different model classes, although in practice it may be carried out in other ways. Now suppose we have selected a model class $\mathcal M$ parameterized by $\mathbf \theta$. How do we select a specific model from it as the final model that represents the laws of nature? This can be done using Bayes' theorem:
\begin{gather}
p(\mathbf{\theta}|\mathbf x)=\dfrac{p(\mathbf x| \mathbf{\theta})p(\theta)}{p(\mathbf x)},
\end{gather}
where the posterior $p(\mathbf{\theta}|\mathbf x)$ is the probability density that $\mathbf \theta$ is the parameter of the real-world model, given that the experimental result is $\mathbf x$ and assuming that the real-world model reflecting the laws of physics lies in the model class $\mathcal M$. After obtaining the posterior, we can choose $\text{argmax}_{\mathbf\theta}\, p(\mathbf{\theta}|\mathbf x)$, the maximum a posteriori estimate, as the parameter of our model.
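
As a minimal sketch of the Bayes update above, the posterior can be evaluated on a grid for a one-parameter toy model. The Gaussian likelihood, the grid, the observed value, and all function names below are assumptions made for illustration only; they are not part of the experiment described here.

```python
import math

def gaussian_likelihood(x, theta, sigma=1.0):
    """Toy likelihood p(x | theta): x is Gaussian around theta (an assumption)."""
    return math.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def grid_posterior(x_obs, thetas, prior):
    """Return p(theta | x_obs) on a uniform grid using Bayes' theorem."""
    step = thetas[1] - thetas[0]
    unnorm = [gaussian_likelihood(x_obs, t) * prior(t) for t in thetas]
    evidence = sum(unnorm) * step          # p(x) approximated by a Riemann sum
    return [u / evidence for u in unnorm]

# Uniform prior on [0, 10], discretised into 1001 grid points.
thetas = [10 * i / 1000 for i in range(1001)]
uniform_prior = lambda t: 0.1
posterior = grid_posterior(3.2, thetas, uniform_prior)

# Maximum a posteriori estimate: the grid point with the largest posterior.
theta_map = thetas[max(range(len(thetas)), key=posterior.__getitem__)]
```

With a uniform prior the posterior is proportional to the likelihood, so the argmax coincides with the maximum-likelihood value on the grid.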

To obtain the posterior, we need the likelihood $p(\mathbf x| \mathbf{\theta})$, the prior $p(\mathbf\theta)$, and the evidence $p(\mathbf x)$, which only normalizes the posterior. When we have no prior knowledge of $\mathbf\theta$, it is reasonable to assume that $p(\mathbf\theta)$ is uniform. To fit the likelihood, we need to sample observation-parameter pairs from the joint density $p(\mathbf x ,\mathbf\theta)$. In the real world, however, the parameter is fixed to some specific vector $\mathbf\theta_0$, so real observations are only samples from $p(\mathbf x| \mathbf{\theta_0})$, and we can never draw samples from the joint density by observation alone.

An alternative is to sample from the joint density by simulation. We build a simulator that accounts for every factor we can think of that plays a role between a model and the observations corresponding to it. The simulator takes any parameter $\mathbf\theta$ as input and outputs as many observations as we want. We hope and assume that the way the simulator generates observations for input $\mathbf \theta$ is approximately the same as the way observations arise in a world whose model parameter is $\mathbf\theta$, i.e. sampling $\mathbf x$ from $p(\mathbf x| \mathbf{\theta})$ can be replaced by feeding $\mathbf \theta$ to the simulator and generating observations. To sample one $\{\mathbf x,\mathbf\theta\}$ pair from the joint density, we first sample $\mathbf\theta$ from $p(\mathbf \theta)$ and then feed $\mathbf\theta$ to the simulator to generate $\mathbf x$.

In this experiment, we generate Gaussian random fields from a likelihood with a known analytic form, and the prior is set to a specified uniform distribution. We first use the sampled observation-parameter pairs to fit a likelihood ansatz, which is a specific kind of flow-based generative model, [RealNVP](https://lilianweng.github.io/posts/2018-10-13-flow-models/). We then compare the posterior obtained from the fitted likelihood with the true posterior.
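
The two-step sampling of the joint density described above can be sketched as follows. The simulator here is a hypothetical stand-in (a noisy copy of the parameter), and the prior range, noise level, and function names are assumptions for illustration, not the GRF simulator used in this experiment.

```python
import random

def sample_prior(rng, low=0.0, high=1.0):
    """Draw theta from a uniform prior p(theta) on [low, high]."""
    return rng.uniform(low, high)

def toy_simulator(theta, rng, n_obs=4, noise=0.1):
    """Hypothetical simulator: returns noisy observations x ~ p(x | theta)."""
    return [theta + rng.gauss(0.0, noise) for _ in range(n_obs)]

def sample_joint(n_pairs, seed=0):
    """Draw (x, theta) pairs from p(x, theta) = p(x | theta) p(theta)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        theta = sample_prior(rng)          # step 1: theta ~ p(theta)
        x = toy_simulator(theta, rng)      # step 2: x ~ p(x | theta)
        pairs.append((x, theta))
    return pairs

pairs = sample_joint(1000)
```

Pairs produced this way can serve as a training set for a conditional density estimator, since every observation comes with the parameter that generated it.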











## Gaussian Random Field
### Basic concepts and properties
A Gaussian random field (GRF) $\phi(\mathbf{x})$ is a field in which each point $\mathbf x$ of the space, or of a subset of the space, is associated with a random variable, such that $\forall n\in\mathbb Z^+, \forall \mathbf{x_1},\mathbf{x_2},...,\mathbf{x_n}$, the random vector $(\phi(\mathbf{x_1}),...,\phi(\mathbf{x_n}))$ follows a multivariate Gaussian distribution. GRFs can represent physical fields in several contexts. By the central limit theorem, a GRF can represent a coarse-grained field whose value at each point is a sum of physical quantities over a smaller spatial region. Cosmological observations and theoretical considerations also indicate that the initial density fluctuations of our universe are well described by a GRF. If the fluctuations of our universe form a GRF, it should satisfy the following relations:
$$
\begin{aligned}
\mu(\mathbf x)&\equiv\langle\phi(\mathbf x)\rangle=0,\\
C(\mathbf x_1, \mathbf x_2)&\equiv\langle(\phi(\mathbf x_1)-\mu(\mathbf x_1))(\phi(\mathbf x_2)-\mu(\mathbf x_2))\rangle\\
&=C(\mathbf x_1-\mathbf x_2)\quad \text{(homogeneous, translational symmetry)}\\
&=C(|\mathbf x_1-\mathbf x_2|)\quad \text{(isotropic, rotational symmetry)}.
\end{aligned}
$$
In fact, every $N$-point correlation function of the density-fluctuation field should have translational and rotational symmetry. Since a GRF is fully determined by its first and second moments, however, the equalities above are sufficient for the field to be homogeneous and isotropic. A GRF with translational and rotational symmetry is completely determined by its power spectrum. In the rest of this section we derive the probability density of the GRF in $k$-space.

Let $\phi(\mathbf{x})$ be a $d$-dimensional Gaussian random field, and let $r_{\mathbf{k}},i_{\mathbf{k}}$ be the real and imaginary parts of its $k$-space field. We have
\begin{gather}
\left\langle r_{\mathbf{k}} \right\rangle=\left\langle i_{\mathbf{k}} \right\rangle=0, \quad \text{($r_{\mathbf{k}}$ and $i_{\mathbf{k}}$ are linear superpositions of zero-mean Gaussian random variables)}
\end{gather}
and 
\begin{gather}
\begin{aligned}
\left\langle r_{\mathbf{k_1}} r_{\mathbf{k_2}} \right\rangle&=\dfrac{1}{(2\pi)^d}\left\langle\int d \mathbf{x} \, d \mathbf{x}' \phi(\mathbf{x})\phi(\mathbf{x}') \cos(\mathbf{k_1} \cdot \mathbf{x})\cos(\mathbf{k_2} \cdot \mathbf{x}')\right\rangle\\
&=\dfrac{1}{(2\pi)^d}\int d \mathbf{x} \, d \mathbf{x}' \cos(\mathbf{k_1} \cdot \mathbf{x}) \cos(\mathbf{k_2} \cdot \mathbf{x}')\langle\phi(\mathbf{x})\phi(\mathbf{x}')\rangle\\
&=\dfrac{1}{(2\pi)^d}\int d \mathbf{x} \, d \mathbf{x}'  \frac{1}{4} \left( e^{i (\mathbf{k_1} \cdot \mathbf{x} + \mathbf{k_2} \cdot \mathbf{x}')} + e^{-i (\mathbf{k_1} \cdot \mathbf{x} + \mathbf{k_2} \cdot \mathbf{x}')} + e^{i (\mathbf{k_1} \cdot \mathbf{x} - \mathbf{k_2} \cdot \mathbf{x}')} + e^{-i (\mathbf{k_1} \cdot \mathbf{x} - \mathbf{k_2} \cdot \mathbf{x}')} \right) C\left(\left|\mathbf{x} - \mathbf{x}'\right|\right).
\end{aligned}
\end{gather}

The first term in the sum is:
\begin{align*}
& \frac{1}{(2 \pi)^d} \int d \mathbf{x} \, d \mathbf{x}' \, \frac{e^{i \left( \mathbf{k_1} \cdot \mathbf{x} + \mathbf{k_2} \cdot \mathbf{x}' \right)}}{4} C\left(\left|\mathbf{x} - \mathbf{x}'\right|\right) \\
& = \frac{1}{(2 \pi)^d} \int d \mathbf{x} \, d \mathbf{x}' \, \frac{e^{i \left( \mathbf{k_1} + \mathbf{k_2} \right) \cdot \mathbf{x}} \cdot e^{-i \mathbf{k_2} \cdot (\mathbf{x}-\mathbf{x}')}}{4} C\left(\left|\mathbf{x} - \mathbf{x}'\right|\right) \\
& = \frac{1}{(2 \pi)^d} \int d \mathbf{x} \, d \mathbf{r} \left( \frac{e^{i \left( \mathbf{k_1} + \mathbf{k_2} \right) \cdot \mathbf{x}}}{4} \right) \cdot \left( e^{-i \mathbf{k_2} \cdot \mathbf{r}} C(r) \right) \\
& = \frac{\delta \left( \mathbf{k_1} + \mathbf{k_2} \right)}{4} \cdot \mathcal{F}(C(r))\left(\mathbf{k_1}\right),
\end{align*}
where $\mathbf r\equiv\mathbf x-\mathbf x'$ and the last step uses $\int d\mathbf x\, e^{i(\mathbf{k_1}+\mathbf{k_2})\cdot\mathbf x}=(2\pi)^d\delta(\mathbf{k_1}+\mathbf{k_2})$. Since $C(r)$ is spherically symmetric, $\mathcal{F}(C(r))(\mathbf{k})$ is also spherically symmetric, so on the support of the delta function $\mathcal{F}(C(r))(\mathbf{k_2})=\mathcal{F}(C(r))(-\mathbf{k_1})=\mathcal{F}(C(r))(\mathbf{k_1})$. The three remaining integrals can be done in a similar way. The final result is:

\begin{gather}
\left\langle r_{\mathbf{k_1}} r_{\mathbf{k_2}} \right\rangle= \frac{\delta \left( \mathbf{k_1} - \mathbf{k_2} \right)+\delta \left( \mathbf{k_1}+\mathbf{k_2} \right)}{2}(2\pi)^d P(k_1),
\end{gather}
where $P(k)\equiv\mathcal{F}(C(r))(\mathbf{k})/(2\pi)^d$.
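
For completeness, the remaining terms can be written out as follows. This is a sketch that uses only the definitions above, with $\mathbf r\equiv\mathbf x-\mathbf x'$ and $\mathcal F(C(r))(\mathbf k)\equiv\int d\mathbf r\, e^{-i\mathbf k\cdot\mathbf r}C(r)$. The third term is

\begin{align*}
& \frac{1}{(2 \pi)^d} \int d \mathbf{x} \, d \mathbf{x}' \, \frac{e^{i \left( \mathbf{k_1} \cdot \mathbf{x} - \mathbf{k_2} \cdot \mathbf{x}' \right)}}{4} C\left(\left|\mathbf{x} - \mathbf{x}'\right|\right) \\
& = \frac{1}{(2 \pi)^d} \int d \mathbf{x} \, d \mathbf{r} \left( \frac{e^{i \left( \mathbf{k_1} - \mathbf{k_2} \right) \cdot \mathbf{x}}}{4} \right) \cdot \left( e^{i \mathbf{k_2} \cdot \mathbf{r}} C(r) \right) \\
& = \frac{\delta \left( \mathbf{k_1} - \mathbf{k_2} \right)}{4} \cdot \mathcal{F}(C(r))\left(-\mathbf{k_2}\right) = \frac{\delta \left( \mathbf{k_1} - \mathbf{k_2} \right)}{4} \cdot \mathcal{F}(C(r))\left(\mathbf{k_1}\right).
\end{align*}

The second and fourth terms are the complex conjugates of the first and third. Because $C(r)$ is real and symmetric, $\mathcal F(C(r))$ is real, so each conjugate term has the same value as its partner, and the four terms add up to

\begin{align*}
\left\langle r_{\mathbf{k_1}} r_{\mathbf{k_2}} \right\rangle = \frac{\delta \left( \mathbf{k_1} - \mathbf{k_2} \right)+\delta \left( \mathbf{k_1}+\mathbf{k_2} \right)}{2}\, \mathcal{F}(C(r))\left(\mathbf{k_1}\right).
\end{align*}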


Similarly, we can calculate:
\begin{align*}
\left\langle i_{\mathbf{k_1}} i_{\mathbf{k_2}} \right\rangle &= \frac{\delta \left( \mathbf{k_1} - \mathbf{k_2} \right)-\delta \left( \mathbf{k_1}+\mathbf{k_2} \right)}{2}(2\pi)^d P(k_1) , \\
\left\langle r_{\mathbf{k_1}} i_{\mathbf{k_2}} \right\rangle&=0.
\end{align*}

We can further obtain that:



\begin{align*}
\left\langle i_{\mathbf{k}}^2 \right\rangle &= \left\langle r_{\mathbf{k}}^2 \right\rangle = \frac{\delta(0)}{2} \cdot (2 \pi)^d P(\mathbf{k}) = \frac{1}{2}\int_{V} d^d \mathbf{x} \, \textrm{e}^{\textrm{i}\,\mathbf{0}\cdot\mathbf{x}} \cdot P(\mathbf{k}) \quad \text{(the system has a finite volume $V$ after all)} \\
&= \frac{V}{2} P(\mathbf{k}), \quad \forall \mathbf{k} \neq 0
\end{align*}

\begin{align*}
\left\langle \left( r_{\mathbf{k}} - r_{\mathbf{-k}} \right)^2 \right\rangle &= \left\langle r_{\mathbf{k}}^2 \right\rangle + \left\langle r_{\mathbf{-k}}^2 \right\rangle - 2\left\langle r_{\mathbf{k}} r_{\mathbf{-k}} \right\rangle = \left( \frac{\delta(0)}{2} + \frac{\delta(0)}{2} - \delta(0) \right) \cdot (2 \pi)^d P(\mathbf{k}) = 0, \quad \forall \mathbf{k} \neq 0 \\
\left\langle \left( i_{\mathbf{k}} + i_{\mathbf{-k}} \right)^2 \right\rangle &= 0, \quad \forall \mathbf{k} \neq 0 \\
\left\langle i_{\mathbf{0}}^2 \right\rangle &= \frac{\delta(0) - \delta(0)}{2} (2 \pi)^d P(0) = 0 \\
\left\langle r_{\mathbf{0}}^2 \right\rangle &= \frac{\delta(0) + \delta(0)}{2} (2 \pi)^d P(0) = \delta(0)(2 \pi)^d P(0) = V P(0) \\
\left\langle \left| \phi_{\mathbf{k}} \right|^2 \right\rangle &= \left\langle r_{\mathbf{k}}^2 + i_{\mathbf{k}}^2 \right\rangle = V P(\mathbf{k}), \quad \forall \mathbf{k}
\end{align*}

These equalities imply that only half of the $r_{\mathbf{k}}$ and $i_{\mathbf{k}}$ are independent. We can choose the independent ones to be $r_{\mathbf 0}$ together with the $r_{\mathbf{k}}$ and $i_{\mathbf{k}}$ whose $\mathbf k$ lies in the "upper half" of $k$-space, $\left\{\mathbf{k}\,|\,k_1>0\right\} \cup\left\{\mathbf{k}\,|\,k_1=0,k_2>0\right\}  \cup \ldots \cup \left\{\mathbf{k}\,|\,k_1=k_2= \ldots =k_{d-1}=0,k_{d}>0\right\}$. The other half is determined by the relations below:
$$
r_{\mathbf{k}}=r_{\mathbf{-k}} ,\quad i_{\mathbf{k}}=-i_{\mathbf{-k}}.
$$

For the discrete case, where the field is defined on a $d$-dimensional hypercubic lattice with $N$ nodes, the situation is similar:
$$
\begin{aligned}
\langle r_{\mathbf{k_1}}r_{\mathbf{k_2}}\rangle &= \ldots = \dfrac{1}{N} \sum_{\mathbf{x}, \mathbf{x}'} \dfrac{\textrm{e}^{\textrm{i}(\mathbf{k_1}+\mathbf{k_2})\cdot\mathbf{x}}}{4} \cdot \textrm{e}^{-\textrm{i}\mathbf{k_2}\cdot(\mathbf{x}-\mathbf{x'})} C(|\mathbf{x}-\mathbf{x'}|) + \text{3 other terms} \\
&= \dfrac{\delta^\text{K}(\mathbf{k_1}+\mathbf{k_2},0)+\delta^\text{K}(\mathbf{k_1}-\mathbf{k_2},0)}{2}P(k_1),
\end{aligned}
$$
where $\delta^{\text K}$ is the Kronecker delta. The last equality holds only if the spatial correlation of the field decays fast enough, e.g. if the correlation length $\xi\ll L$, where $L$ is the scale of the system. The other equalities hold similarly, and in particular
$$
\left\langle\left|\phi_{\mathbf{k}}\right|^2\right\rangle= P(k), \quad \forall \mathbf{k},
$$
thus $P(k)$ is the power spectrum of the field.

Using the fact that the GRF is characterized by $N/2$ ($(N+1)/2$ if $N$ is odd) independent complex random variables $\phi_{\mathbf k}$, the probability of a field configuration in a volume element can be expressed as:
\begin{gather}
\begin{aligned}
&\phantom{=}p(r_{\mathbf{k_1}}\in\text{d}r_{\mathbf{k_1}},...,r_{\mathbf{k_{N/2}}}\in\text{d}r_{\mathbf{k_{N/2}}},i_{\mathbf{k_1}}\in\text{d}i_{\mathbf{k_1}},...,i_{\mathbf{k_{N/2}}}\in\text{d}i_{\mathbf{k_{N/2}}})\\
&=\prod_{i=1}^{N/2}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\text{e}^{-\frac{r_{\mathbf{k_i}}^2}{2(P(k_i)/2)}}\prod_{i=1}^{N/2}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\text{e}^{-\frac{i_{\mathbf{k_i}}^2}{2(P(k_i)/2)}}\prod_{i=1}^{N/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}}\\
&=\prod_{i=1}^{N}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\prod_{i=1}^{N/2} \exp\left(-\frac{r_{\mathbf{k_i}}^2 + i_{\mathbf{k_i}}^2}{2 \cdot (P(k_i)/2)}\right)\prod_{i=1}^{N/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}}\\
&=\prod_{i=1}^{N}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\prod_{i=1}^{N/2} \exp\left(-\frac{2|\phi_{\mathbf{k_i}}|^2 }{2 \cdot P(k_i)}\right)\prod_{i=1}^{N/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}}\\
&=\prod_{i=1}^{N}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\exp\left(-\frac{|\phi_{\mathbf{k_i}}|^2 }{2 \cdot P(k_i)}\right)\prod_{i=1}^{N/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}},
\end{aligned}
\end{gather}
we can also conduct a change of variable:
\begin{gather}
r_{\mathbf{k}}=|\phi_{\mathbf{k}}|\cos(\theta_{\mathbf{k}}),\\
i_{\mathbf{k}}=|\phi_{\mathbf{k}}|\sin(\theta_{\mathbf{k}}),
\end{gather}
and the new volume element is
$$
\text{d}r_{\mathbf{k}}\text{d}i_{\mathbf{k}}=|\phi_{\mathbf{k}}|\text{d}|\phi_{\mathbf{k}}|\text{d}\theta_{\mathbf{k}},
$$
so the probability becomes
$$
\prod_{i=1}^{N}\dfrac{1}{\sqrt{2\pi}\sqrt{P(k_i)/2}}\exp\left(-\frac{|\phi_{\mathbf{k_i}}|^2 }{2 \cdot P(k_i)}\right)\prod_{i=1}^{N/2}|\phi_{\mathbf{k_i}}|\text{d}|\phi_{\mathbf{k_i}}|\text{d}\theta_{\mathbf{k_i}}=\prod_{i=1}^{N}\dfrac{1}{\sqrt{P(k_i)/2}}\exp\left(-\frac{|\phi_{\mathbf{k_i}}|^2 }{2 \cdot P(k_i)}\right)\prod_{i=1}^{N/2}|\phi_{\mathbf{k_i}}|\text{d}|\phi_{\mathbf{k_i}}|\dfrac{\text{d}\theta_{\mathbf{k_i}}}{2\pi}.
$$
When $N$ is odd, the volume element $\prod_{i=1}^{N/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}}$ becomes $\prod_{i=1}^{(N-1)/2}\text{d}r_{\mathbf{k_i}}\text{d}i_{\mathbf{k_i}}\cdot\text{d}r_{\mathbf{0}}$, and a similar calculation shows that the probability density has the same form as in the even case, since $\langle r_{\mathbf0}^2\rangle=P(\mathbf0)$ instead of $P(\mathbf0)/2$.
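
As a quick consistency check (a sketch added for illustration, using only the per-mode Gaussian densities above), the density of a single independent mode with $\mathbf k\neq\mathbf 0$ factorizes in polar variables:

$$
\dfrac{1}{\pi P(k)}\exp\left(-\frac{r_{\mathbf k}^2+i_{\mathbf k}^2}{P(k)}\right)\text{d}r_{\mathbf{k}}\text{d}i_{\mathbf{k}}
=\left[\dfrac{2|\phi_{\mathbf k}|}{P(k)}\exp\left(-\frac{|\phi_{\mathbf k}|^2}{P(k)}\right)\text{d}|\phi_{\mathbf k}|\right]\cdot\left[\dfrac{\text{d}\theta_{\mathbf k}}{2\pi}\right].
$$

The amplitude $|\phi_{\mathbf k}|$ therefore follows a Rayleigh distribution and the phase $\theta_{\mathbf k}$ is uniform on $[0,2\pi)$. Substituting $u=s^2/P(k)$ recovers the power spectrum:

$$
\left\langle|\phi_{\mathbf k}|^2\right\rangle=\int_0^\infty s^2\,\dfrac{2s}{P(k)}\,\text{e}^{-s^2/P(k)}\,\text{d}s=P(k)\int_0^\infty u\,\text{e}^{-u}\,\text{d}u=P(k).
$$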

The probability distribution of a GRF is determined by its power spectrum $P(k)$. In this experiment, the power spectrum of the generated GRFs is set to
$$
P(k)=A\cdot k^{-B},
$$
where $A$ and $B$ play the role of the model parameters, and the GRFs generated for a specific $(A,B)$ play the role of the observations produced by the 'simulator'. We will use $\{\mathbf z, (A,B)\}$ pairs, where $\mathbf z$ is the latent vector into which the pretrained autoencoder compresses the GRF, to fit the likelihood and obtain the posterior. The two likelihoods $p(\mathbf z|(A,B))$ and $p(\phi(\mathbf x)|(A,B))$ need not be identical, but the two posteriors for the same GRF should be approximately equal, since the factor that the change of variables introduces into the likelihood is canceled by the same factor in the marginal probability.
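
To make the analytic likelihood concrete, the sketch below evaluates the log-density of a set of independent Fourier modes under $P(k)=A\cdot k^{-B}$, using the per-mode Gaussian form derived earlier (real and imaginary parts each with variance $P(k)/2$). The `(k, r, i)` tuple format and the function names are assumptions for illustration, and the $\mathbf k=\mathbf 0$ mode is left out because the power law is not defined there for $B>0$.

```python
import math

def power_spectrum(k, A, B):
    """Power-law spectrum P(k) = A * k**(-B), defined for k > 0."""
    return A * k ** (-B)

def log_density(modes, A, B):
    """Log-density of independent Fourier modes given (A, B).

    `modes` is a list of (k, r, i) tuples: the wavenumber magnitude and the
    real and imaginary parts of one independent mode with k != 0. Each part
    is Gaussian with zero mean and variance P(k)/2.
    """
    total = 0.0
    for k, r, i in modes:
        P = power_spectrum(k, A, B)
        # log of 1/(pi P) * exp(-(r^2 + i^2)/P)
        total += -math.log(math.pi * P) - (r * r + i * i) / P
    return total
```

Evaluating `log_density` on a grid of $(A,B)$ values and exponentiating gives, up to normalization, the posterior under a uniform prior.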

### Sampling of Gaussian Random Fields
The GRFs in our experiments are sampled as follows:
$$
\phi(\mathbf x)=\dfrac{1}{N}\text{Re}\left(\sum_{\mathbf k}(r_{\mathbf k}+\text{i}i_{\mathbf k})\text{e}^{\text{i}\mathbf k\cdot \mathbf x}\right),
$$
where $r_{\mathbf k}$ and $i_{\mathbf k}$ are sampled independently from $\mathcal N(0,P(k))$ instead of $\mathcal N(0,P(k)/2)$. This way of generating the field indeed gives what we want. Let $\phi_{\mathbf k'}=\sum_{\mathbf x}\phi(\mathbf x)\text{e}^{-\text{i}\mathbf k'\cdot\mathbf x}$ denote the Fourier transform of $\phi(\mathbf x)$, and let $\tilde r_{\mathbf k},\tilde i_{\mathbf k}$ denote the real and imaginary parts of $\phi_{\mathbf k}$. We have:
$$
\begin{aligned}
\phi_{\mathbf k'}&=\dfrac1N \sum_{\mathbf x}\sum_{\mathbf k} r_{\mathbf k}\cos(\mathbf k \cdot\mathbf x)\text{e}^{-\text{i}\mathbf k'\cdot\mathbf x}-\dfrac1N \sum_{\mathbf x}\sum_{\mathbf k} i_{\mathbf k}\sin(\mathbf k \cdot\mathbf x)\text{e}^{-\text{i}\mathbf k'\cdot\mathbf x}\\
&=\dfrac1N\sum_{\mathbf k} r_{\mathbf k}\sum_{\mathbf x}\dfrac{\text{e}^{\text{i}(\mathbf k-\mathbf k')\cdot\mathbf x}+\text{e}^{-\text{i}(\mathbf k+\mathbf k')\cdot\mathbf x}}{2}-\dfrac1N\sum_{\mathbf k} i_{\mathbf k}\sum_{\mathbf x}\dfrac{\text{e}^{\text{i}(\mathbf k-\mathbf k')\cdot\mathbf x}-\text{e}^{-\text{i}(\mathbf k+\mathbf k')\cdot\mathbf x}}{2\text{i}}\\
&=\dfrac{r_{\mathbf k'}+r_{\mathbf {-k'}}}{2}+\text{i}\dfrac{i_{\mathbf k'}-i_{\mathbf {-k'}}}{2},
\end{aligned}
$$
this implies that

$$
\tilde r_{\mathbf k}=\dfrac{r_{\mathbf k}+r_{\mathbf {-k}}}{2}=\tilde r_{\mathbf {-k}},\quad\tilde i_{\mathbf k}=\dfrac{i_{\mathbf k}-i_{\mathbf {-k}}}{2}=-\tilde i_{\mathbf {-k}},\quad \tilde r_{\mathbf 0}=r_{\mathbf 0},
$$
since $r_{\mathbf k},r_{\mathbf {-k}},i_{\mathbf k},i_{\mathbf {-k}}$ are independent Gaussian random variables with zero mean and variance $P(k)$, the variance of $\tilde r_{\mathbf k}$ and $\tilde i_{\mathbf k}$ is $P(k)/2$ for $\mathbf k\neq\mathbf 0$, and the variance of $\tilde r_{\mathbf 0}$ is $P(0)$. Moreover, for all $\mathbf k_1\neq\mathbf k_2$ in the "upper half" of $k$-space, $\tilde r_{\mathbf k_1},\tilde r_{\mathbf k_2},\tilde i_{\mathbf k_1},\tilde i_{\mathbf k_2}$ are independent because $\{r_{\mathbf k}\}$ and $\{i_{\mathbf k}\}$ are independent. The sampling method therefore indeed yields GRFs determined by the power spectrum $P(k)$.
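
The sampling formula above can be sketched with NumPy's FFT for a 2D lattice. This is an illustrative sketch, not the code used for the experiments: the grid size, the use of integer wavenumbers, and setting the $\mathbf k=\mathbf 0$ mode to zero power are all assumptions, and `sample_grf` is a hypothetical name.

```python
import numpy as np

def sample_grf(n, A, B, rng):
    """Sample an n x n field via phi(x) = (1/N) Re sum_k (r_k + i i_k) e^{i k.x}."""
    # Integer wavenumbers on the FFT grid (the unit of k is an assumption).
    freqs = np.fft.fftfreq(n, d=1.0 / n)
    kx, ky = np.meshgrid(freqs, freqs, indexing="ij")
    k = np.sqrt(kx ** 2 + ky ** 2)

    # P(k) = A * k**(-B); the k = 0 mode is given zero power (an assumption,
    # since the power law diverges there for B > 0).
    power = np.zeros_like(k)
    power[k > 0] = A * k[k > 0] ** (-B)

    # r_k and i_k are independent N(0, P(k)) on the full grid.
    r = rng.normal(size=k.shape) * np.sqrt(power)
    i = rng.normal(size=k.shape) * np.sqrt(power)
    coeffs = r + 1j * i

    # np.fft.ifft2 applies the 1/N factor and the e^{+i k.x} convention.
    phi = np.fft.ifft2(coeffs).real
    return phi, coeffs

rng = np.random.default_rng(0)
phi, coeffs = sample_grf(32, A=1.0, B=2.0, rng=rng)
```

Taking the real part is what symmetrizes the coefficients: the forward transform of `phi` equals the average of each coefficient and the complex conjugate of its partner at $-\mathbf k$, which is the relation derived above.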

# Summary of the results of training the autoencoder to compress the GRFs

## The model architecture and default hyperparameters
The autoencoder architectures we examined are summarised in Figure 4.1, and Table 4.1 explains the hyperparameters varied during the hyperparameter scan. Unless specified otherwise, the experiments use the hyperparameter values shown in Table 4.2.

<center> <img src="data_and_images/archi.svg"  width="1000" height="2000" style="margin: 10px;" /> </center>
<center><h5>Figure 4.1: The architecture of the autoencoders examined. knum and Dnum are list-valued hyperparameters specifying the structure of the autoencoder: knum[i] gives the number of channels of the i-th Conv2D layer of the encoder and of the (len(knum)-2-i)-th Conv2DTransposed layer of the decoder, and Dnum[i] gives the number of neurons of the i-th hidden Dense layer of the encoder and of the (len(knum)-1-i)-th Dense layer of the decoder.</h5></center>

<center><h5>Table 4.1: Meanings of model structure hyperparameters varied during hyperparameter scan and Default hyperparameters.</h5></center>

| Hyperparameter | Type | Description |
| ------------------- | ---------- | -------------------- |
| knum | list | knum[i] gives the number of channels of the i-th Conv2D layer of the encoder and the len(knum)-2-i th Conv2DTransposed layer of the decoder. |
| Dnum | list | Dnum[i] gives the number of neurons of the i-th hidden Dense layer of the encoder and the len(knum)-1-i th Dense layer of the decoder. |
| z_dim | int | Dimension of the latent vector $z$ |

<center><h5>Table 4.2: Default hyperparameter settings.</h5></center>

| Hyperparameter | Value |
| ------------------- | -------------------- |
|knum|[512,1024]|
|Dnum|[]|
|z_dim|8|
|kernel size|3|
|stride|2|
|padding|1|
|output_padding of ConvTranspose2d|1|
|bias of Conv2d layers|False|
|Dropout rate|0.2, except that the dropout rate after the last Dense layer of the decoder is 0| 
|Training set, validation set, and test set|The first 800 GRFs, the next 100 GRFs after the training set, and the next 100 GRFs after the validation set|
|loss|MSE between output and input |
| optimizer |Adam|
| initial learning rate| 1e-4|
| weight-decay | 1e-5 |
|batch size|512|
|epochs|3000|
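
The kernel size, stride, padding, and output_padding in Table 4.2 fix how the spatial size changes from layer to layer. The sketch below traces those sizes using the standard output-size formulas for (transposed) convolutions with dilation 1. The 64 x 64 input size and the assumption that the decoder mirrors the encoder layer by layer are illustrative choices, not values taken from the experiments.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Spatial size after a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def trace_sizes(input_size, n_layers):
    """Sizes seen by the encoder, then by a mirrored decoder."""
    down = [input_size]
    for _ in range(n_layers):
        down.append(conv_out(down[-1]))
    up = [down[-1]]
    for _ in range(n_layers):
        up.append(conv_transpose_out(up[-1]))
    return down, up

# 64 x 64 input is an assumed size; two layers match len(knum) = 2.
down, up = trace_sizes(64, n_layers=2)
```

With these settings each convolution halves the spatial size and each transposed convolution doubles it, so output_padding=1 is what lets the decoder return exactly to the input size.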

## The training results
### The training result of default hyperparameter settings
Figure 4.2 shows the training and validation loss curves during training, which indicate slight overfitting. Both curves eventually converge. The final train_loss, val_loss, and test_loss are 0.0077, 0.0089, and 0.0088. Figure 4.3 shows four original GRFs from the test dataset, their reconstructions by the trained autoencoder, and the reconstructions obtained when specific components of the latent vector $z$ are set to 0. Based on Figure 4.3, we find that:

- The reconstructed fields are coarse-grained versions of the original ones.
- There is no one-to-one correspondence between the separate, roughly single-color regions of the reconstructed GRFs and the individual components of $z$.
- $z_3$ is the most important component, since the reconstructed GRF changes most significantly when $z_3$ is set to 0.
- Setting any single component other than $z_3$ to zero does not change the reconstructed GRF significantly. However, if we keep only $z_3$ and set all other components to zero, the result deviates significantly from the reconstruction with all components preserved, suggesting that each single-color region of the reconstructed GRF contains contributions from every component.
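
The latent-ablation test behind these observations can be sketched as below. The helper name, the zero-based reading of $z_3$ as index 3, and the example latent vector are assumptions for illustration; in practice each ablated vector would be passed through the trained decoder.

```python
def ablate_latent(z, zero=(), keep_only=None):
    """Return a copy of latent vector z with selected components set to 0.

    zero:      indices to set to zero (e.g. (3,) to remove z_3).
    keep_only: if given, every component except these indices is zeroed.
    """
    out = list(z)
    for idx in range(len(out)):
        if idx in zero or (keep_only is not None and idx not in keep_only):
            out[idx] = 0.0
    return out

z = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1, 0.9, -0.4]   # made-up 8-dim latent
without_z3 = ablate_latent(z, zero=(3,))           # drop only z_3
only_z3 = ablate_latent(z, keep_only=(3,))         # keep only z_3
```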

<center> <img src="data_and_images/loss_ae_default.png"  width="800" height="400" style="margin: 10px;" /> </center>
<center><h5>Figure 4.2: Training and Validation Loss curves of the autoencoder training process under default hyperparameter setting</h5></center>

<center> <img src="data_and_images/autoencoder_reconstructions_Default.png"  width="800" height="400" style="margin: 10px;" /> </center>
<center><h5>Figure 4.3: Four original GRFs from the test dataset, their reconstructions by the trained autoencoder, and the reconstructions obtained when specific components of the latent vector $z$ are set to 0. A, B: the parameters of our theory/model mentioned in the first section.</h5></center>

### The training results of hyperparameter scan
Since the compression quality of the GRFs plays an important role in fitting the likelihood, we further conduct a hyperparameter scan over different autoencoder architectures to try to achieve better compression quality, i.e. lower reconstruction loss. Table 4.3 lists the architectures examined in the scan; for knum=[1024,2048] and [512,1024,2048], only Dnum=[],[16],[32],[64] are scanned. Since the default hyperparameter setting exhibits overfitting, we mainly try models with few trainable parameters. Each architecture is trained with two different initial learning rates, $10^{-3}$ and $10^{-4}$. The EarlyStopping callback and the ReduceLROnPlateau scheduler are used for all training runs, with patience 400 and 200 respectively; the decay factor of ReduceLROnPlateau is set to 0.3 or 0.5. The bias of the Conv2d layers is set to True. The training, validation, and test sets are the first 4000 GRFs, the next 500 GRFs after the training set, and the next 500 GRFs after the validation set, respectively. We use a larger dataset to give the autoencoder more information.
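
How the two patience values interact can be illustrated with a simplified replay of a validation-loss history. This is a sketch of the general idea only: library implementations of EarlyStopping and ReduceLROnPlateau have extra options (thresholds, cooldown, minimum learning rate) that are ignored here, and `run_schedule` is a hypothetical helper.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.5, lr_patience=200, stop_patience=400):
    """Replay validation losses through plateau-based LR decay and early
    stopping. Returns (stop_epoch, final_lr, best_loss)."""
    best = float("inf")
    since_best = 0       # epochs since the best val_loss (early stopping)
    since_drop = 0       # epochs without improvement since the last LR change
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best, since_drop = loss, 0, 0
        else:
            since_best += 1
            since_drop += 1
        if since_drop > lr_patience:
            lr *= factor             # plateau detected: decay the learning rate
            since_drop = 0
        if since_best >= stop_patience:
            return epoch, lr, best   # no improvement for too long: stop
    return len(val_losses) - 1, lr, best
```

Because the early-stopping patience is twice the scheduler patience, a stalled run gets at least one learning-rate reduction before training is stopped.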

<center><h5>Table 4.3: The architectures examined in the hyperparameter scan, including basically all the combinations $\{\mathrm{knum}\}\times\{\mathrm{Dnum}\}\times\{\mathrm{z\_dim}\}$</h5></center>

| knum | Dnum | z_dim |
| ------------------- | ---------- | -------------------- |
| [2,4],[3,9],[4,16],[32,64],[64,128],[128,256],[256,512],[1024,2048],[2,4,8],[3,9,27],[16,32,64],[32,64,128],[64,128,256],[128,256,512],[256,512,1024],[512,1024,2048] | [],[16],[32],[64],[64,32],[32,16] | 8,16,32 |

Figure 4.4 shows the relation between the generalization ability of the model and the number of trainable parameters. Based on figure 4.4, we find that:

- Increasing the latent dimension from 8 to 32 improves the val_loss from around 0.055 to 0.033.
- Increasing the size of the training dataset from 1000 to 4000, while keeping the other hyperparameters the same as in the default setting, improves the val_loss from 0.0089 to around 0.0055.
- The number of trainable parameters, within the interval $[10^3,10^8]$, has only a slight influence on generalization.
- For z_dim=8 and 16, the architecture with knum=[16,32,64] and Dnum=[] is preferable to the others; for z_dim=32, the architecture with knum=[32,64,128] and Dnum=[] is the preferred one. All three share two advantages:
   - 1. They are far smaller than every model that achieves better performance, each of which has 10 to 1000 times as many trainable parameters, yet their val_losses are very close to the best val_loss, with a deviation of around $5\times 10^{-5}$.
   - 2. Their training and validation losses during training (see Figure 4.5) finally converge to roughly the same value.
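
To give a feel for the size gap between architectures, the sketch below counts the trainable parameters of the encoder's convolution stack for a given knum, using kernel size 3 and bias=True as in the scan. A single input channel is an assumption, and the Dense layers and the decoder are not counted, so these numbers are only a partial, illustrative estimate.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=True):
    """Trainable parameters of one 2D convolution layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def conv_stack_params(knum, in_ch=1, kernel=3, bias=True):
    """Total parameters of the encoder's convolution stack for a given knum."""
    total = 0
    for out_ch in knum:
        total += conv_params(in_ch, out_ch, kernel, bias)
        in_ch = out_ch   # the next layer consumes this layer's channels
    return total

small = conv_stack_params([16, 32, 64])
large = conv_stack_params([512, 1024, 2048])
```

Under these assumptions the knum=[512,1024,2048] stack has roughly a thousand times as many convolution parameters as the knum=[16,32,64] stack.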
   
The MSEs between the original GRFs and their reconstructions by these three models, evaluated on 100000 GRFs generated independently of the training set, are 0.0045, 0.0037, and 0.0029 for z_dim=8, 16, and 32. This indicates strong generalization ability and better performance than the model trained with the default hyperparameter setting. Figure 4.6 shows the original GRFs and their reconstructions by the three selected autoencoders. As z_dim increases, more and more single-color regions emerge and the reconstructions become more and more fine-grained.

<center> <img src="data_and_images/scan_Minimum_val_loss_Number_of_trainable_parameters.png" width="400" height="400"  style="margin: 10px;" /> </center>
<center><h5>Figure 4.4: Minimum val_loss of each training process vs Number of trainable parameters</h5></center>

<div style="text-align:center;">
(a) <img src="data_and_images/loss_ae5000_8.png" width="300" height="300" style="margin: 10px;" />  
(b) <img src="data_and_images/loss_ae5000_16.png"  width="300" height="300" style="margin: 10px;" />
(c)  <img src="data_and_images/loss_ae5000_32.png"  width="300" height="300" style="margin: 10px;" />  
</div>
<center><h5>Figure 4.5: Loss curves of the training process for the three architectures mentioned in the text. (a) z_dim=8. (b) z_dim=16. (c) z_dim=32.</h5></center>

<div style="text-align:center;">
(a) <img src="data_and_images/autoencoder_reconstructions_8.png" width="500" height="500" style="margin: 10px;" />  
(b) <img src="data_and_images/autoencoder_reconstructions_16.png"  width="500" height="500" style="margin: 10px;" />
(c)  <img src="data_and_images/autoencoder_reconstructions_32.png"  width="500" height="500" style="margin: 10px;" />  
</div>
<center><h5>Figure 4.6: The original GRFs and their reconstructions by the three selected autoencoders. The original GRFs are in the first row of each panel. (a) z_dim=8. (b) z_dim=16. (c) z_dim=32.</h5></center>



