<a href="https://colab.research.google.com/github/Hamid-Mofidi/The-Principles-of-Deep-Learning-Theory/blob/main/Ch.%201%3A%20Pretraining%20/1.1%20Gaussian%20Integrals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this book is to develop principles that enable a theoretical understanding of deep learning. **Perhaps the most important principle is that wide and deep neural networks are governed by nearly-Gaussian distributions.** Thus, to make it through the book, you will need to achieve mastery of **Gaussian integration** and **perturbation theory**.

We begin in $§1.1$ with an extended discussion of Gaussian integrals.
Our emphasis will be on calculational tools for computing averages of monomials against Gaussian distributions, culminating in a derivation of Wick’s theorem.

Next, in $§1.2$, we give a general discussion of expectation values and observables.

In $§1.3$, we introduce the negative log probability or action representation of a probability distribution and explain how the action lets us systematically deform Gaussian distributions in order to give a compact representation of non-Gaussian distributions.

In particular, we specialize to **nearly-Gaussian** distributions, for which deviations from Gaussianity are implemented by small couplings in the action, and show how perturbation theory can be used to connect the non-Gaussian couplings to observables such as the connected correlators. **By treating such couplings perturbatively, we can transform any correlator of a nearly-Gaussian distribution into a sum of Gaussian integrals**; each integral can then be evaluated by the tools **we developed in §1.1**. This will be one of our most important tricks, as the neural networks we’ll study are all governed by nearly-Gaussian distributions, with non-Gaussian couplings that become perturbatively small as the networks become wide.

## 1.1 Gaussian Integrals

### Single-variable Gaussian integrals

The simplest single-variable Gaussian function is
$$
e^{-\frac{z^2}{2}}  \hspace{2in} (1.1)
$$

The corresponding Gaussian integral is
$$
I_1 = \int_{-∞}^{∞} dz ~~e^{-\frac{z^2}{2}}  = \sqrt {2 π} \hspace{2in} (1.5)
$$

Dividing the Gaussian function (1.1) by this normalization factor, we define the **Gaussian probability distribution** with **unit variance** as
$$
p(z) ≡ \dfrac{1}{\sqrt {2 π} } e^{-\frac{z^2}{2}}  \hspace{2in} (1.6)
$$
which is now properly normalized, i.e., $\int_{-∞}^{∞} dz ~p(z)= 1$. Such a distribution with **zero mean** and **unit variance** is sometimes called the **standard normal distribution**.

Extending this result to a Gaussian distribution with variance $K > 0$ is super-easy. The corresponding normalization factor is given by
$$
I_K = \int_{-∞}^{∞} dz ~~e^{-\frac{z^2}{2K}}  = \sqrt {2 πK}. \hspace{2in} (1.7)
$$

We can then define the Gaussian distribution with variance $K$ as
$$
p(z) ≡ \dfrac{1}{\sqrt {2 πK} } e^{-\frac{z^2}{2K}}.  \hspace{2in} (1.8)
$$
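As a quick sanity check on (1.7), the normalization factor can be approximated numerically. The sketch below is only illustrative: the helper name `gaussian_normalization`, the truncation of the infinite range at `half_width` standard deviations, and the trapezoidal rule are all assumptions made here, not part of the text.

```python
import math

def gaussian_normalization(K, half_width=12.0, steps=20_000):
    """Approximate I_K = integral of exp(-z^2 / (2K)) dz with the trapezoidal rule."""
    # Assumption: truncate the infinite range at +/- half_width standard
    # deviations, where the integrand is negligibly small.
    limit = half_width * math.sqrt(K)
    dz = 2.0 * limit / steps
    total = 0.0
    for i in range(steps + 1):
        z = -limit + i * dz
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-z * z / (2.0 * K))
    return total * dz

# Compare the numerical estimate with the closed form sqrt(2 pi K).
for K in (0.5, 1.0, 3.0):
    print(K, gaussian_normalization(K), math.sqrt(2.0 * math.pi * K))
```

For each $K$ the two printed numbers should agree closely, and $K = 1$ recovers the unit-variance result (1.5).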

More generally, we can shift the center of the bell curve as
$$
p(z) ≡ \dfrac{1}{\sqrt {2 πK} } e^{-\frac{(z-s)^2}{2K}},  \hspace{2in} (1.9)
$$
so that it is now symmetric around $z = s$. This center value $s$ is called the **mean of the distribution**, because it is the expected value of $z$, i.e., $E[z] = s$.
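The roles of $s$ and $K$ in (1.9) can be checked numerically by computing the mean and variance directly from the density. This is a minimal sketch; the helper names `shifted_gaussian_pdf` and `expectation`, the example values of `s` and `K`, and the truncated trapezoidal integration are assumptions chosen for illustration.

```python
import math

def shifted_gaussian_pdf(z, s, K):
    """Density (1.9): Gaussian with mean s and variance K."""
    return math.exp(-(z - s) ** 2 / (2.0 * K)) / math.sqrt(2.0 * math.pi * K)

def expectation(func, s, K, half_width=12.0, steps=20_000):
    """Approximate E[func(z)] under (1.9) with the trapezoidal rule."""
    # Assumption: integrate over s +/- half_width standard deviations only.
    spread = half_width * math.sqrt(K)
    lower = s - spread
    dz = 2.0 * spread / steps
    total = 0.0
    for i in range(steps + 1):
        z = lower + i * dz
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(z) * shifted_gaussian_pdf(z, s, K)
    return total * dz

s, K = 1.5, 2.0  # example values, chosen arbitrarily
mean = expectation(lambda z: z, s, K)
variance = expectation(lambda z: (z - mean) ** 2, s, K)
print(mean, variance)
```

The printed mean and variance should be close to the chosen `s` and `K`, respectively.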

Focusing on Gaussian distributions (1.8) with zero mean, let’s consider other expectation values for general functions $\mathcal{O}(z)$, i.e.,
$$
E[\mathcal{O}(z)] = \dfrac{1}{\sqrt {2 πK} } \int_{-∞}^{∞} dz ~~e^{-\frac{z^2}{2K}} \mathcal{O}(z)   \hspace{1in} (1.11)
$$
We’ll often refer to such functions $\mathcal{O}(z)$ as **observables**, since they can correspond to measurement outcomes of experiments. A special class of expectation values, called **moments**, corresponds to the insertion of $z^M$ into the integrand for a non-negative integer $M$. The moment integral vanishes for any odd exponent $M$, because then the integrand is **odd** with respect to the sign flip $z \leftrightarrow -z$. For an even number $M = 2m$ of $z$ insertions, we have
$$
\begin{aligned}
I_{K,m} &≡ \int_{-∞}^{∞} dz ~~e^{-\frac{z^2}{2K}} ~z^{2m} \\
&= \sqrt{2\pi} K^{\frac{2m+1}{2}}(2m-1)(2m-3)\cdots 1. \hspace{1in} (1.14)
\end{aligned}
$$
Therefore, we see that the even moments are given by the simple formula

$$
E[z^{2m}] = \dfrac{I_{K,m}}{\sqrt{2\pi K}} = K^m (2m-1)!! \hspace{1in} (1.15)
$$
where $(2m-1)!! ≡ (2m-1)(2m-3)\cdots 1$ is the double factorial. The result (1.15) is **Wick’s theorem for single-variable Gaussian distributions**.
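The moment formula (1.15) is easy to test against direct numerical integration. The sketch below is illustrative only: the function names `double_factorial`, `wick_moment`, and `numeric_moment` are hypothetical helpers introduced here, and the integral is truncated and evaluated with the trapezoidal rule.

```python
import math

def double_factorial(n):
    """Return n!! = n (n-2) (n-4) ... for odd n >= -1, with (-1)!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def wick_moment(K, m):
    """Closed form (1.15): E[z^(2m)] = K^m (2m-1)!!."""
    return K ** m * double_factorial(2 * m - 1)

def numeric_moment(K, M, half_width=14.0, steps=40_000):
    """Approximate E[z^M] for a zero-mean Gaussian of variance K."""
    # Assumption: truncate at +/- half_width standard deviations.
    limit = half_width * math.sqrt(K)
    dz = 2.0 * limit / steps
    total = 0.0
    for i in range(steps + 1):
        z = -limit + i * dz
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * z ** M * math.exp(-z * z / (2.0 * K))
    return total * dz / math.sqrt(2.0 * math.pi * K)

K = 2.0  # example variance, chosen arbitrarily
for m in (1, 2, 3):
    print(2 * m, numeric_moment(K, 2 * m), wick_moment(K, m))
print(3, numeric_moment(K, 3))  # an odd moment, which should be close to zero
```

For example, $m = 2$ gives $E[z^4] = 3K^2$, and the odd moment evaluates to approximately zero, consistent with the parity argument above.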


### Multivariable Gaussian integrals

The multivariable Gaussian function is defined as
$$
f(z) = \exp \Big[-\frac{1}{2} ∑_{\mu , \nu =1}^{N} z_{\mu}(K^{-1})_{\mu \nu} z_\nu  \Big], \hspace{1.5in} (1.24)
$$
where the variance or covariance matrix $K_{\mu \nu}$ is an $N\times N$ **symmetric positive definite matrix**.  Now, to construct a probability distribution from the Gaussian function (1.24), we again need to evaluate the normalization factor
$$
\begin{aligned}
I_K &= \int d^N z ~\exp \Big[-\frac{1}{2} ∑_{\mu , \nu =1}^{N} z_{\mu}(K^{-1})_{\mu \nu} z_\nu  \Big] \\
 & = \sqrt{|2\pi K|} = \sqrt{\prod_{\mu =1}^N (2\pi \lambda_\mu)}, \hspace{2.4in} (1.30)
\end{aligned}
$$
where $|2\pi K|$ denotes the determinant of the $N \times N$ matrix $2\pi K$, which equals $(2\pi)^N |K|$, and $\lambda_\mu$ are the eigenvalues of $K$, whose product is the determinant $|K|$.
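For $N = 2$ the normalization factor can be checked by brute-force integration on a grid and compared with $\sqrt{|2\pi K|} = (2\pi)^{N/2}\sqrt{|K|}$. This is a rough sketch under stated assumptions: the function name `normalization_2d`, the example matrix `K_example`, and the truncated midpoint-rule grid are all choices made here for illustration.

```python
import math

def normalization_2d(K, half_width=10.0, steps=400):
    """Approximate I_K for a 2x2 covariance matrix K on a midpoint-rule grid."""
    (k11, k12), (k21, k22) = K
    det = k11 * k22 - k12 * k21
    # Explicit inverse of a 2x2 matrix.
    inv11, inv12, inv21, inv22 = k22 / det, -k12 / det, -k21 / det, k11 / det
    # Assumption: truncate each axis at half_width marginal standard deviations.
    lim1 = half_width * math.sqrt(k11)
    lim2 = half_width * math.sqrt(k22)
    h1 = 2.0 * lim1 / steps
    h2 = 2.0 * lim2 / steps
    total = 0.0
    for i in range(steps):
        z1 = -lim1 + (i + 0.5) * h1
        for j in range(steps):
            z2 = -lim2 + (j + 0.5) * h2
            # Quadratic form sum_{mu,nu} z_mu (K^{-1})_{mu nu} z_nu.
            quad = inv11 * z1 * z1 + (inv12 + inv21) * z1 * z2 + inv22 * z2 * z2
            total += math.exp(-0.5 * quad)
    return total * h1 * h2

K_example = ((2.0, 0.6), (0.6, 1.0))  # symmetric, positive definite example
det_K = K_example[0][0] * K_example[1][1] - K_example[0][1] * K_example[1][0]
closed_form = math.sqrt((2.0 * math.pi) ** 2 * det_K)  # sqrt(|2 pi K|) for N = 2
print(normalization_2d(K_example), closed_form)
```

The two printed values should agree closely. Note that the factor of $2\pi$ enters once per dimension, which is why it sits inside the determinant.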

Having figured out the normalization factor, we can define the zero-mean multivariable Gaussian probability distribution with variance $K_{\mu \nu}$ as
$$
p(z) = \dfrac{1}{\sqrt{|2\pi K|}} ~\exp \Big[-\frac{1}{2} ∑_{\mu , \nu =1}^{N} z_{\mu}K^{\mu \nu} z_\nu  \Big]  \hspace{1in} (1.34)
$$
where we use the shorthand notation
$$
K^{\mu \nu} = (K^{-1})_{\mu \nu}. \hspace{2in} (1.32)
$$

Next, let’s consider the moments of the mean-zero multivariable Gaussian distribution
$$
\begin{aligned}
E[z_{\mu_1} \cdots z_{\mu_M}] &≡ \int d^Nz ~p(z)~ z_{\mu_1} \cdots  z_{\mu_M}\\
&= \dfrac{I_{K,(\mu_1,\cdots,\mu_M)}}{I_K}, \hspace{1.1in} (1.36)
\end{aligned}
$$
where $I_K$ was defined in (1.30) and,
$$
I_{K,(\mu_1,\cdots,\mu_M)} ≡ \int d^Nz    ~\exp \Big[-\frac{1}{2} ∑_{\mu , \nu =1}^{N} z_{\mu}K^{\mu \nu} z_\nu  \Big]   ~ z_{\mu_1} \cdots  z_{\mu_M} \quad (1.37)
$$
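The ratio (1.36) can be evaluated numerically for a small example before deriving any closed form. The sketch below assumes $N = 2$ and uses a hypothetical helper `gaussian_moment_2d` with a truncated midpoint grid. The comparison values $E[z_\mu z_\nu] = K_{\mu\nu}$ are the standard two-point result for a zero-mean Gaussian, quoted here as a known fact rather than something derived in the text so far.

```python
import math

def gaussian_moment_2d(K, powers, half_width=10.0, steps=400):
    """Approximate E[z_1^p1 z_2^p2] as the ratio I_{K,(...)} / I_K from (1.36)."""
    (k11, k12), (k21, k22) = K
    p1, p2 = powers
    det = k11 * k22 - k12 * k21
    inv11, inv12, inv21, inv22 = k22 / det, -k12 / det, -k21 / det, k11 / det
    # Assumption: truncate each axis at half_width marginal standard deviations.
    lim1 = half_width * math.sqrt(k11)
    lim2 = half_width * math.sqrt(k22)
    h1 = 2.0 * lim1 / steps
    h2 = 2.0 * lim2 / steps
    numerator = 0.0    # integral (1.37), up to the common cell area h1 * h2
    denominator = 0.0  # integral (1.30), up to the same cell area
    for i in range(steps):
        z1 = -lim1 + (i + 0.5) * h1
        for j in range(steps):
            z2 = -lim2 + (j + 0.5) * h2
            quad = inv11 * z1 * z1 + (inv12 + inv21) * z1 * z2 + inv22 * z2 * z2
            weight = math.exp(-0.5 * quad)
            numerator += weight * z1 ** p1 * z2 ** p2
            denominator += weight
    return numerator / denominator  # the cell area cancels in the ratio

K_example = ((2.0, 0.6), (0.6, 1.0))  # symmetric, positive definite example
print(gaussian_moment_2d(K_example, (2, 0)), K_example[0][0])  # E[z_1 z_1]
print(gaussian_moment_2d(K_example, (1, 1)), K_example[0][1])  # E[z_1 z_2]
print(gaussian_moment_2d(K_example, (2, 1)))  # odd total power, near zero
```

Because the same grid is used for numerator and denominator, the normalization factor never needs to be known explicitly, mirroring the structure of (1.36).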