<!-- File automatically generated using DocOnce (https://github.com/doconce/doconce/):
doconce format ipynb NeuralSampling.do.txt  -->

## Markov Chain Monte Carlo sampling
This section gives an intuitive introduction to Markov Chain Monte Carlo (MCMC) sampling.

Markov Chain: To start, a Markov chain is a type of mathematical model that describes a sequence of events in which the probability of each event depends only on the state of the previous event. 
This is often referred to as the Markov property.

Monte Carlo Method: The Monte Carlo method is a statistical technique that allows you to make numerical estimates using random sampling. 
For example, if you wanted to estimate the value of pi, you could randomly throw darts at a square with a circle inside it, 
and use the ratio of darts that land inside the circle to the total number of darts thrown to estimate pi. 
This method gets its name from the Monte Carlo Casino in Monaco where games of chance (like roulette) exemplify the generation of random outcomes.
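
The dart-throwing estimate of pi described above can be sketched in a few lines. This is a minimal illustration, not part of the original notes: the function name `estimate_pi`, the number of darts, and the fixed seed are all assumptions made for the example.

```python
import numpy as np

def estimate_pi(num_darts, seed=0):
    """Estimate pi by throwing darts uniformly at the square [-1, 1] x [-1, 1]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=num_darts)
    y = rng.uniform(-1.0, 1.0, size=num_darts)
    # A dart lands inside the unit circle if its distance to the origin is at most 1.
    inside = (x**2 + y**2) <= 1.0
    # The circle covers a fraction pi / 4 of the square, so rescale the hit ratio by 4.
    return 4.0 * inside.mean()

pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

The estimate fluctuates from run to run when the seed changes, and its error shrinks roughly like one over the square root of the number of darts.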

Markov Chain Monte Carlo (MCMC): MCMC combines these two methods. 
It's used to estimate the distribution of a complex model by generating a Markov chain of random samples. 
In other words, MCMC is a technique for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution.

To make this more concrete, let's consider the common MCMC method known as the Metropolis-Hastings algorithm:

**Initialization**: Start with an arbitrary point to be the current state.

**Proposal**: Generate a candidate for the next state by randomly perturbing the current state. 
The way this is done can vary. For example, in the case of a simple random walk, you might just add a small random number to the current state.

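**Acceptance/Rejection**: Decide whether to accept or reject the candidate. 
If the candidate has a higher probability under the target distribution than the current state (in Bayesian inference, a higher posterior probability), it is always accepted. 
If it has a lower probability, it is accepted with a probability equal to the ratio of the two probabilities (assuming a symmetric proposal), so slightly worse candidates are accepted often and much worse ones rarely. If the candidate is rejected, the chain stays at the current state. 
This step is crucial because it means that even if you start with a poor initial state, over time, the algorithm will move towards areas of higher probability.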

**Iteration**: Return to the proposal step, using the new state (whether it was the candidate state or the previous state). 
Repeat this process as many times as necessary.

After running this process for a certain number of iterations, the generated samples will approximate the target distribution (provided certain mathematical conditions are met). 
The idea is that you "forget" your initial state and converge to the equilibrium distribution.

One thing to keep in mind is the concept of a "burn-in" period. 
Because it takes a while for the Markov chain to "forget" its starting state and converge to the target distribution, 
it's common to throw away some number of the initial samples. These discarded samples make up the "burn-in" period.
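
The steps above (initialization, proposal, acceptance/rejection, iteration, burn-in) can be sketched as a random-walk Metropolis sampler. This is a minimal sketch under stated assumptions: the target is a one-dimensional standard normal, the proposal is a symmetric Gaussian step, and the names `metropolis`, `log_std_normal`, `step_size` and `burn_in` are chosen for illustration only.

```python
import numpy as np

def metropolis(log_target, x0, num_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = np.random.default_rng(seed)
    chain = np.empty(num_steps)
    x = x0
    log_p = log_target(x)
    for t in range(num_steps):
        # Proposal: perturb the current state with a symmetric Gaussian step.
        candidate = x + step_size * rng.normal()
        log_p_candidate = log_target(candidate)
        # Accept with probability min(1, p(candidate) / p(x)), evaluated in log space.
        if np.log(rng.random()) < log_p_candidate - log_p:
            x, log_p = candidate, log_p_candidate
        # The current state is recorded whether or not the candidate was accepted.
        chain[t] = x
    return chain

def log_std_normal(x):
    # Log density of a standard normal, up to an additive constant.
    return -0.5 * x**2

# Start deliberately far from the bulk of the distribution.
chain = metropolis(log_std_normal, x0=10.0, num_steps=20_000)
burn_in = 2_000                 # assumed burn-in length for this example
samples = chain[burn_in:]
print(samples.mean(), samples.std())
```

Only the ratio of target probabilities enters the acceptance test, so the normalization constant of the target never has to be computed.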

In summary, MCMC methods provide a powerful way to draw samples from complex, high-dimensional distributions, 
making it possible to estimate these distributions or calculate integrals that would be otherwise difficult or impossible to compute. 
They are used extensively in Bayesian statistics and machine learning.

Markov Chain Monte Carlo (MCMC) methods are crucial tools for Bayesian inference. 
In Bayesian inference, we're interested in the posterior distribution of the parameters of our model given the data. 
For many models, this posterior distribution is complex and high-dimensional, making it impossible to calculate directly or sample from directly.

MCMC provides a way to generate samples from these complex posterior distributions. 
Once we have these samples, we can use them to make inferences about the parameters. 
For example, we can estimate the mean of the distribution, its standard deviation, or any other statistics of interest. 
We can also generate predictions by sampling from the predictive distribution, which is derived from the posterior distribution of the parameters.

So, to sum up, MCMC is a method used to perform Bayesian inference when the posterior distribution is too complex to handle with simpler methods. 
Through MCMC, we can obtain a set of samples which approximates the posterior distribution, and these samples can then be used to make inferences about the parameters of the model.

## Formal definition of MCMC
More formally, a Markov chain $M$ in discrete time is defined by a set $S$ of states (we consider only the case where $S$ is finite, with size denoted by $|S|$) 
together with a transition operator $T$. 
The operator $T$ is a conditional probability distribution $T(s|s')$ over the next state $s$ given a preceding state $s'$. 

The Markov chain $M$ is started in some initial state $s(0)$, and moves through a trajectory of states $s(t)$ via iterated application of the stochastic transition operator $T$. 
Therefore, if $s(t-1)$ is the state at time $t-1$, then the next state $s(t)$ is drawn from the conditional probability distribution $T(s|s(t-1))$. 

An important theorem from probability theory (see, e.g., p. 232 in [[grimmett2020probability]](#grimmett2020probability)) states that if $M$ is irreducible 
(i.e., any state in $S$ can be reached from any other state in $S$ in finitely many steps with probability $w > 0$) and aperiodic 
(i.e., its state transitions cannot be trapped in deterministic cycles), then the probability $p(s(t)=s|s(0))$ converges for 
$t \to \infty$ to a probability $p(s)$ that does not depend on the initial state $s(0)$. 

This state distribution $p$ is called the invariant distribution of $M$. 
The irreducibility of $M$ implies that it is the only distribution over the states $S$ that is invariant under its transition operator $T$, i.e.

<!-- Equation labels as ordinary links -->
<div id="eq:invariant_p"></div>

$$
\begin{equation}
p(s) = \sum_{s' \in S} T(s|s')\cdot p(s').
\label{eq:invariant_p} \tag{1}
\end{equation}
$$

Thus, in order to carry out probabilistic inference for a given
distribution $p$, it suffices to construct an irreducible and aperiodic
Markov chain $M$ that leaves $p$ invariant, i.e., satisfies equation ([1](#eq:invariant_p)).
Then one can answer numerous probabilistic inference questions
regarding $p$ without any numerical computations of probabilities.
Rather, one plugs in the observed values for some of the random
variables (RVs) and simply collects samples from the conditional
distribution over the other RVs of interest when the Markov chain
approaches its invariant distribution.
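
To make the convergence statement and equation (1) concrete, here is a small numerical sketch. The three-state transition matrix is an arbitrary example chosen for illustration; it stores $T(s|s')$ in row $s$ and column $s'$, so each column sums to one. All entries are positive, which makes the chain irreducible and aperiodic.

```python
import numpy as np

# T[s, s_prev] = T(s | s_prev); every column is a probability distribution.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])
assert np.allclose(T.sum(axis=0), 1.0)

# Two different deterministic initial states s(0).
start_a = np.array([1.0, 0.0, 0.0])
start_b = np.array([0.0, 0.0, 1.0])

# Distribution over states after 200 applications of the transition operator.
T_many = np.linalg.matrix_power(T, 200)
p_a = T_many @ start_a
p_b = T_many @ start_b

print(p_a)
print(p_b)
# Invariance check, equation (1): p(s) = sum_{s'} T(s | s') p(s').
print(np.allclose(T @ p_a, p_a))
```

Both starting states lead to the same distribution, and applying $T$ once more leaves it unchanged.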

## Example: MCMC sampling for Bayesian inference
Let's say we have a coin and we don't know if it's fair or not. Our goal is to estimate the probability, p, that the coin lands heads.

In Bayesian terms, p is the parameter of interest and we want to estimate its posterior distribution given some observed data.

Here are our assumptions:

1. We start with a prior belief that every probability p between 0 and 1 is equally likely. This uniform distribution is our prior; a beta distribution is often used in such a scenario, but we use the uniform distribution for simplicity.
2. We then flip the coin several times and record whether it lands heads or tails. Let's say we flip the coin 10 times and get 7 heads and 3 tails.

We now want to update our belief about the bias of the coin based on this data. In Bayesian terms, we want to calculate the posterior distribution of p given the data. This is where MCMC comes in.

First, we initialize p at some value. It doesn't matter exactly where we start because the algorithm will eventually converge to the true posterior, but we might start at 0.5 just as a guess.

Next, we propose a new value of p. This proposal is a random value chosen from some neighborhood of the current value of p. This randomness is where the "Monte Carlo" in MCMC comes from.

Once we've proposed a new value of p, we calculate the ratio of two probabilities:

1. The probability of the data given the proposed value of p (this is the likelihood of the proposal).
2. The probability of the data given the current value of p (this is the likelihood of the current state).

In our case, the likelihood is given by the binomial distribution because each coin flip is a Bernoulli trial, i.e., it's a trial with two possible outcomes (heads or tails).

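The ratio we've calculated is a simplified version of the acceptance ratio in the Metropolis-Hastings algorithm. The simplification is valid here because the prior is uniform and, provided the proposal is symmetric, both cancel from the full ratio; a proposal outside the interval from 0 to 1 has zero prior probability and is always rejected. 
We always accept a proposal if its likelihood is higher than the likelihood of the current state. If its likelihood is lower, we accept it with a probability equal to the acceptance ratio.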

We then repeat these steps (propose a new value of p, calculate the acceptance ratio, decide whether to accept the proposal) many times. 
Over time, the sequence of values of p visited by the chain, where the current value is recorded again whenever a proposal is rejected, will form a sample from the posterior distribution of p.

Once we have a sample from the posterior, we can use it to make inferences about p. 
For example, the mean of the sample gives an estimate of the expected value of p. 
We could also calculate a credible interval for p (which is a Bayesian analogue of a confidence interval) by finding the range that contains 95% of the sample.
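
The coin example can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the Gaussian proposal, its width `step_size`, the chain length, and the burn-in length are assumptions made for illustration. With a uniform prior and 7 heads out of 10 flips, the exact posterior is a Beta(8, 4) distribution with mean 2/3, which gives a reference value for the sample mean.

```python
import numpy as np

def log_likelihood(p, heads=7, tails=3):
    """Binomial log-likelihood up to a constant; -inf outside the prior's support."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf
    return heads * np.log(p) + tails * np.log(1.0 - p)

rng = np.random.default_rng(0)
num_steps, burn_in, step_size = 50_000, 5_000, 0.1   # assumed settings

p_current = 0.5                      # initial guess
chain = np.empty(num_steps)
for t in range(num_steps):
    # Symmetric proposal in a neighborhood of the current value.
    p_proposed = p_current + step_size * rng.normal()
    # Log of the likelihood ratio; the uniform prior cancels inside (0, 1).
    log_ratio = log_likelihood(p_proposed) - log_likelihood(p_current)
    if np.log(rng.random()) < log_ratio:
        p_current = p_proposed
    chain[t] = p_current             # rejected proposals repeat the current value

posterior = chain[burn_in:]
posterior_mean = posterior.mean()
lower, upper = np.percentile(posterior, [2.5, 97.5])
print(posterior_mean, (lower, upper))
```

The pair `(lower, upper)` is a 95% credible interval obtained directly from the sample, as described above.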

## Gibbs sampling
Gibbs sampling is a specific type of Markov Chain Monte Carlo (MCMC) sampling algorithm. 
MCMC methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. 
The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.

Here's a general description of how MCMC methods work:

1. **Initialization**: Start from an arbitrary point in the state space.
2. **Transition**: Define a probability distribution over the state space that depends only on the current state (this is the Markov property). Draw a new state from this distribution.
3. **Convergence**: Repeat the transition step many times. Under certain conditions, the distribution of the chain's state will converge to the desired distribution.

Now, Gibbs sampling is a specific MCMC method that simplifies the transition step by updating one dimension at a time. 
This can make the sampling process more efficient, because it's often easier to sample from the conditional distribution of one variable given the others, 
than from the joint distribution of all variables.

Here's how Gibbs sampling works:

1. **Initialization**: Start from an arbitrary point in the state space.
2. **Transition**: For each dimension in turn, draw a new value from the distribution of that dimension conditional on the current values of the other dimensions. This set of new values becomes the new state.
3. **Convergence**: Repeat the transition step many times. The distribution of the chain's state will converge to the desired distribution.

Gibbs sampling is particularly useful when the joint distribution is complex, but the conditional distributions are relatively simple and easy to sample from. 
This makes it a popular method for sampling in Bayesian statistics, where the posterior distribution can be complex and high-dimensional, 
but the conditional distributions are often standard forms that are easy to work with. 
In the context of training a Restricted Boltzmann Machine (RBM), Gibbs sampling is used to sample from the model's distribution over its visible and hidden units.
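
As a small illustration of the one-dimension-at-a-time update, the sketch below runs Gibbs sampling on a bivariate normal distribution with zero means, unit variances, and correlation `rho`. This target is an assumption chosen for the example because its conditionals are known in closed form: given $y$, the variable $x$ is normal with mean $\rho y$ and variance $1 - \rho^2$, and symmetrically for $y$ given $x$.

```python
import numpy as np

def gibbs_bivariate_normal(rho, num_steps, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_std = np.sqrt(1.0 - rho**2)
    x, y = 0.0, 0.0                       # arbitrary starting point
    samples = np.empty((num_steps, 2))
    for t in range(num_steps):
        # Update each dimension in turn from its conditional distribution.
        x = rng.normal(rho * y, cond_std)
        y = rng.normal(rho * x, cond_std)
        samples[t] = (x, y)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, num_steps=20_000)
# Discard an assumed burn-in of 1000 steps before estimating the correlation.
empirical_rho = np.corrcoef(samples[1000:].T)[0, 1]
print(empirical_rho)
```

No proposal is ever rejected in Gibbs sampling, because each new value is drawn exactly from the conditional distribution.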

## Basic principles of the Restricted Boltzmann Machine
In an RBM, we have a set of visible units, $v$, and a set of hidden units, $h$. These are typically represented as binary variables. The energy of a state $(v, h)$ in an RBM is given by:

<!-- Equation labels as ordinary links -->
<div id="_auto1"></div>

$$
\begin{equation}
E(v,h) = -a^T v - b^T h - v^T W h
\label{_auto1} \tag{2}
\end{equation}
$$

where $a$ and $b$ are the bias terms for the visible and hidden layers, respectively, and $W$ is the matrix of weights connecting the visible and hidden units.

The probability of a particular state $(v, h)$ is given by the Boltzmann distribution:

<!-- Equation labels as ordinary links -->
<div id="_auto2"></div>

$$
\begin{equation}
P(v, h) = \frac{e^{-E(v,h)}}{Z}
\label{_auto2} \tag{3}
\end{equation}
$$

where $Z$ is the partition function, defined as the sum of $e^{-E(v,h)}$ over all possible states of $v$ and $h$:

<!-- Equation labels as ordinary links -->
<div id="_auto3"></div>

$$
\begin{equation}
Z = \sum_{v,h} e^{-E(v,h)}
\label{_auto3} \tag{4}
\end{equation}
$$

In practice, $Z$ is intractable to compute for large networks, as it involves a sum over all possible states of $v$ and $h$.
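
For a very small RBM the partition function can still be computed by brute force, which makes the exponential cost visible. The sketch below uses randomly drawn parameters and sizes (3 visible and 2 hidden units) that are assumptions for illustration; the function names are not from the original notes.

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    """Energy E(v, h) = -a^T v - b^T h - v^T W h from equation (2)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def all_binary_states(n):
    """All 2**n binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

rng = np.random.default_rng(0)
num_visible, num_hidden = 3, 2
a = rng.normal(size=num_visible)
b = rng.normal(size=num_hidden)
W = rng.normal(size=(num_visible, num_hidden))

# Partition function, equation (4): sum over every joint configuration.
Z = sum(np.exp(-energy(v, h, a, b, W))
        for v in all_binary_states(num_visible)
        for h in all_binary_states(num_hidden))

def joint_prob(v, h):
    """Boltzmann probability P(v, h), equation (3)."""
    return np.exp(-energy(v, h, a, b, W)) / Z

total = sum(joint_prob(v, h)
            for v in all_binary_states(num_visible)
            for h in all_binary_states(num_hidden))
num_terms = 2 ** (num_visible + num_hidden)
print(Z, total, num_terms)
```

The sum has $2^{n_v + n_h}$ terms, so an RBM with 64 visible and 100 hidden units would already require $2^{164}$ terms.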

Training an RBM involves adjusting the weights and biases to maximize the likelihood of the training data. 
This can be done using gradient ascent on the log-likelihood. 
The gradient of the log-likelihood with respect to the weights is given by:

<!-- Equation labels as ordinary links -->
<div id="_auto4"></div>

$$
\begin{equation}
\frac{\partial \log P(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
\label{_auto4} \tag{5}
\end{equation}
$$

where $\langle \rangle_{data}$ denotes an expectation value taken over the data distribution, and $\langle \rangle_{model}$ denotes an expectation value taken over the distribution defined by the model. 
The terms $\langle v_i h_j \rangle_{data}$ and $\langle v_i h_j \rangle_{model}$ represent the correlation between the visible unit $v_i$ and the hidden unit $h_j$ under the data and model distributions, respectively.

The weight updates in the RBM implementation later in this document follow this gradient to try to maximize the log-likelihood of the data. 
The "positive phase" corresponds to the $\langle v_i h_j \rangle_{data}$ term and the "negative phase" corresponds to the $\langle v_i h_j \rangle_{model}$ term. 
The positive phase increases the probability of training data, while the negative phase decreases the probability of samples generated by the model.

The difficulty lies in computing the expectation value under the model distribution, $\langle v_i h_j \rangle_{model}$. In the code, we use Gibbs sampling to get an approximation of the model expectation. 
The algorithm used is called Contrastive Divergence (CD), which starts a Gibbs chain from the training data and runs it for a small fixed number of steps. 
The states visited by the Gibbs chain are then used as a proxy for the model distribution.

The training process is not guaranteed to find the global maximum of the log-likelihood, and it is sensitive to the choice of learning rate and other hyperparameters.

This is a basic overview of the theory behind RBMs. There are many details and subtleties that I haven't covered here, but hopefully, this gives you a good starting point for understanding how they work.

## Deriving the gradient of the log likelihood
Here we outline the derivation of the gradient of the log-likelihood for an RBM with respect to the weights. 

Given a dataset $\mathcal{D} = \{v^{(1)}, \ldots, v^{(N)}\}$ of $N$ independent observations, 
the likelihood of the data given the parameters $W$ (weights), $a$ (visible biases), and $b$ (hidden biases) is given by the product of the probabilities assigned to each individual observation:

<!-- Equation labels as ordinary links -->
<div id="_auto5"></div>

$$
\begin{equation}
P(\mathcal{D} | W, a, b) = \prod_{n=1}^{N} P(v^{(n)} | W, a, b)
\label{_auto5} \tag{6}
\end{equation}
$$

Taking the log gives the log-likelihood:

<!-- Equation labels as ordinary links -->
<div id="_auto6"></div>

$$
\begin{equation}
\log P(\mathcal{D} | W, a, b) = \sum_{n=1}^{N} \log P(v^{(n)} | W, a, b)
\label{_auto6} \tag{7}
\end{equation}
$$

Here $P(v^{(n)} | W, a, b)$ is the marginal probability of observing $v^{(n)}$, which we can obtain by summing out the hidden units from the joint distribution $P(v,h)$:

<!-- Equation labels as ordinary links -->
<div id="_auto7"></div>

$$
\begin{equation}
P(v^{(n)} | W, a, b) = \sum_{h} P(v^{(n)}, h | W, a, b)
\label{_auto7} \tag{8}
\end{equation}
$$

Substituting for the joint distribution using the Boltzmann distribution formula:

<!-- Equation labels as ordinary links -->
<div id="_auto8"></div>

$$
\begin{equation}
P(v^{(n)} | W, a, b) = \frac{\sum_{h} e^{-E(v^{(n)}, h)}}{\sum_{v,h} e^{-E(v,h)}}
\label{_auto8} \tag{9}
\end{equation}
$$

Taking the log gives:

<!-- Equation labels as ordinary links -->
<div id="_auto9"></div>

$$
\begin{equation}
\log P(v^{(n)} | W, a, b) = -E(v^{(n)}) - \log \sum_{v,h} e^{-E(v,h)}
\label{_auto9} \tag{10}
\end{equation}
$$

Here $E(v^{(n)})$ is the free energy of the visible configuration $v^{(n)}$, given by $E(v^{(n)}) = -\log \sum_{h} e^{-E(v^{(n)}, h)}$.

Now, to compute the gradient of the log-likelihood with respect to the weights, we need to differentiate this expression. Differentiating term by term gives:

<!-- Equation labels as ordinary links -->
<div id="_auto10"></div>

$$
\begin{equation}
\frac{\partial \log P(v^{(n)} | W, a, b)}{\partial W_{ij}} = -\frac{\partial E(v^{(n)})}{\partial W_{ij}} - \frac{\partial}{\partial W_{ij}} \log \sum_{v,h} e^{-E(v,h)}
\label{_auto10} \tag{11}
\end{equation}
$$

The first term is easy to compute. Because the energy function is linear in the weights, we have:

<!-- Equation labels as ordinary links -->
<div id="_auto11"></div>

$$
\begin{equation}
\frac{\partial E(v^{(n)})}{\partial W_{ij}} = -v^{(n)}_i \sum_h h_j e^{-E(v^{(n)}, h)} / \sum_h e^{-E(v^{(n)}, h)} = -v^{(n)}_i \langle h_j \rangle_{v^{(n)}}
\label{_auto11} \tag{12}
\end{equation}
$$

The second term is trickier. Using the identity $\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$ and swapping the order of summation and differentiation, we get:

<!-- Equation labels as ordinary links -->
<div id="_auto12"></div>

$$
\begin{equation}
\frac{\partial}{\partial W_{ij}} \log \sum_{v,h} e^{-E(v,h)} = \frac{1}{\sum_{v,h} e^{-E(v,h)}} \sum_{v,h} e^{-E(v,h)} \frac{\partial (-E(v,h))}{\partial W_{ij}} = \sum_{v,h} P(v,h)\, v_i h_j = \langle v_i h_j \rangle
\label{_auto12} \tag{13}
\end{equation}
$$

This expectation is taken with respect to the distribution $P(v,h)$ defined by the model.

So putting it all together, we have:

<!-- Equation labels as ordinary links -->
<div id="_auto13"></div>

$$
\begin{equation}
\frac{\partial \log P(v^{(n)} | W, a, b)}{\partial W_{ij}} = \langle v_i h_j \rangle_{v^{(n)}} - \langle v_i h_j \rangle
\label{_auto13} \tag{14}
\end{equation}
$$

This is the expression we use to update the weights in the training algorithm. 
The challenge lies in computing the expectation $\langle v_i h_j \rangle$, which requires summing over all possible configurations of the visible and hidden units - 
an intractable operation for larger networks. 
The Contrastive Divergence (CD) algorithm approximates this expectation using Gibbs sampling, 
starting from the observed data.
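
For a tiny RBM the gradient formula can be checked numerically, because both expectations can be enumerated exactly. The sketch below compares the analytic expression (data term minus model term) with a central finite-difference derivative of $\log P(v^{(n)})$. The sizes, the random parameters, the test vector `v_n`, and all function names are assumptions made for this check.

```python
import itertools
import numpy as np

def binary_states(n):
    """Array with all 2**n binary vectors of length n as rows."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def neg_energy_table(a, b, W):
    """Table of -E(v, h) for every visible (rows) and hidden (columns) configuration."""
    V, H = binary_states(len(a)), binary_states(len(b))
    return (V @ a)[:, None] + (H @ b)[None, :] + V @ W @ H.T

def log_prob_visible(v, a, b, W):
    """Exact log P(v) by enumeration, following equation (9)."""
    H = binary_states(len(b))
    neg_E_vh = v @ a + H @ b + H @ (W.T @ v)
    return np.log(np.exp(neg_E_vh).sum()) - np.log(np.exp(neg_energy_table(a, b, W)).sum())

def analytic_gradient(v, a, b, W):
    """Data term minus model term, as in equation (14)."""
    V, H = binary_states(len(a)), binary_states(len(b))
    hidden_probs = 1.0 / (1.0 + np.exp(-(b + v @ W)))   # <h_j> given v
    data_term = np.outer(v, hidden_probs)
    P = np.exp(neg_energy_table(a, b, W))
    P /= P.sum()                                         # joint P(v, h)
    model_term = V.T @ P @ H                             # sum_{v,h} P(v,h) v_i h_j
    return data_term - model_term

def numerical_gradient(v, a, b, W, eps=1e-6):
    """Central finite differences of log P(v) with respect to each W_ij."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_plus, W_minus = W.copy(), W.copy()
            W_plus[i, j] += eps
            W_minus[i, j] -= eps
            grad[i, j] = (log_prob_visible(v, a, b, W_plus)
                          - log_prob_visible(v, a, b, W_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
a, b, W = rng.normal(size=3), rng.normal(size=2), rng.normal(size=(3, 2))
v_n = np.array([1.0, 0.0, 1.0])
analytic = analytic_gradient(v_n, a, b, W)
numerical = numerical_gradient(v_n, a, b, W)
print(np.max(np.abs(analytic - numerical)))
```

The enumeration in `neg_energy_table` is exactly the step that becomes intractable for larger networks, which is why sampling is needed for the model term.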

The key insight of CD is that we don't need to sample from the model's equilibrium distribution to get a useful estimate of this expectation. 
Instead, we can start a Gibbs chain at the observed data and run it for only a small number of steps. The resulting samples come from a distribution that's "close" to the observed data rather than from the equilibrium distribution, but they are often good enough to drive learning.

Here's the basic idea of CD:

1. **Positive phase**: Start with a data vector on the visible units and sample the hidden states given the visible units. This is often referred to as an "up pass". 
You compute the activation of each hidden unit given the visible units and sample its state from a Bernoulli distribution defined by the logistic sigmoid of its activation. 
This gives you the expectation $\langle v_i h_j \rangle_{\text{data}}$, which is the first term in the gradient of the log-likelihood.

2. **Negative phase**: Now perform Gibbs sampling for several steps, where each step involves updating all the visible units given the hidden units (a "down pass") and 
then updating all the hidden units given the visible units (an "up pass"). The result is a "fantasy" vector on the visible units and its corresponding hidden states, 
which are an approximate sample from the model's distribution. 
This gives you an estimate of the expectation $\langle v_i h_j \rangle_{\text{model}}$, which is the second term in the gradient of the log-likelihood.

3. **Update the weights**: Now you can compute the gradient of the log-likelihood and use it to update the weights. 
The update rule is $\Delta W_{ij} = \epsilon (\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}})$, where $\epsilon$ is the learning rate.

It's important to note that CD is an approximation of the true gradient of the log-likelihood. 
The accuracy of the approximation depends on the number of Gibbs sampling steps. With more steps, the distribution of the "fantasy" vectors will get closer to the model's 
equilibrium distribution, giving a more accurate estimate of the expectation. However, more steps also mean more computation, so there's a trade-off between accuracy and efficiency. 
In practice, it's been found that even a single step of Gibbs sampling (CD-1) often works well for training RBMs.
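
The three steps can be sketched as a single CD-k update with explicit bias vectors. This is a sketch under assumptions, not the implementation used in the next section: the function name `cd_k_update`, the use of hidden probabilities rather than sampled states in the associations, and the bias updates (which are analogous to the weight update but were not derived above) are choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_data, W, a, b, k=1, learning_rate=0.1, rng=None):
    """One CD-k update for a batch v_data of shape (num_examples, num_visible)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = np.random.default_rng() if rng is None else rng
    num_examples = v_data.shape[0]

    # Positive phase (up pass): hidden probabilities given the data.
    h_probs_data = sigmoid(v_data @ W + b)
    pos_assoc = v_data.T @ h_probs_data

    # Negative phase: k Gibbs steps, each a down pass followed by an up pass.
    h_states = (rng.random(h_probs_data.shape) < h_probs_data).astype(float)
    for _ in range(k):
        v_probs = sigmoid(h_states @ W.T + a)
        v_states = (rng.random(v_probs.shape) < v_probs).astype(float)
        h_probs_model = sigmoid(v_states @ W + b)
        h_states = (rng.random(h_probs_model.shape) < h_probs_model).astype(float)
    neg_assoc = v_states.T @ h_probs_model

    # Weight update: epsilon * (<v h>_data - <v h>_model), averaged over the batch.
    W_new = W + learning_rate * (pos_assoc - neg_assoc) / num_examples
    # Assumed analogous updates for the visible and hidden biases.
    a_new = a + learning_rate * (v_data - v_states).mean(axis=0)
    b_new = b + learning_rate * (h_probs_data - h_probs_model).mean(axis=0)
    return W_new, a_new, b_new

rng = np.random.default_rng(0)
v_batch = (rng.random((16, 6)) < 0.5).astype(float)   # random binary toy batch
W = 0.01 * rng.normal(size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
W_new, a_new, b_new = cd_k_update(v_batch, W, a, b, k=3, rng=rng)
print(W_new.shape, a_new.shape, b_new.shape)
```

Setting `k=1` gives CD-1; larger values of `k` trade more computation for a chain that has moved further from the data.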

## Implementing the RBM
We'll use the RBM to model the sklearn digits dataset.

In [1]:
import numpy as np
from tqdm import tqdm

class RBM:

    def __init__(self, num_visible, num_hidden, verbose=True):
        self.num_hidden = num_hidden
        self.num_visible = num_visible
        self.verbose = verbose

        # Initialize a weight matrix of dimensions (num_visible + 1) x (num_hidden + 1)
        # (the extra row and column hold the bias weights), using a uniform
        # distribution between -sqrt(6. / (num_hidden + num_visible))
        # and sqrt(6. / (num_hidden + num_visible)). One could vary the
        # spread of the initial weights by scaling this interval.
        np_rng = np.random.RandomState(1234)

        self.weights = np.asarray(np_rng.uniform(
            low=-np.sqrt(6. / (num_hidden + num_visible)),
            high=np.sqrt(6. / (num_hidden + num_visible)),
            size=(num_visible + 1, num_hidden + 1)))

        # Set the weights for the bias units (first row and first column) to zero.
        self.weights[0, :] = 0
        self.weights[:, 0] = 0


    def train(self, data, max_epochs=1000, learning_rate=0.1):
        num_examples = data.shape[0]

        # Insert bias units of 1 into the first column.
        data = np.insert(data, 0, 1, axis=1)
        
        # Initialize error trace list
        self.error_trace = []
        iterator = tqdm(range(max_epochs), disable=not self.verbose)
        for epoch in iterator:
            # Clamp to the data and sample from the hidden units. 
            # (This is the "positive CD phase", aka the reality phase.)
            pos_hidden_probs = self._logistic(np.dot(data, self.weights))
            pos_hidden_probs[:, 0] = 1 # Fix the hidden bias unit.
            pos_hidden_states = pos_hidden_probs > np.random.rand(num_examples, self.num_hidden + 1)
            # Note that we're using the activation *probabilities* of the hidden states, not the hidden states       
            # themselves, when computing associations. We could also use the states; see section 3 of Hinton's 
            # "A Practical Guide to Training Restricted Boltzmann Machines" for more.
            pos_associations = np.dot(data.T, pos_hidden_probs)

            # Reconstruct the visible units and sample again from the hidden units.
            # (This is the "negative CD phase", aka the daydreaming phase.)
            neg_visible_probs = self._logistic(np.dot(pos_hidden_states, self.weights.T))
            neg_visible_probs[:,0] = 1 # Fix the bias unit.
            
            neg_hidden_probs = self._logistic(np.dot(neg_visible_probs, self.weights))
            neg_hidden_probs[:, 0] = 1 # Fix the hidden bias unit.
            # Note, again, that we're using the activation *probabilities* when computing associations, not the states 
            # themselves.
            neg_associations = np.dot(neg_visible_probs.T, neg_hidden_probs)

            # Update weights.
            self.weights += learning_rate*((pos_associations - neg_associations) / num_examples)

            error = np.sum((data - neg_visible_probs) ** 2)
            self.error_trace.append(error)
            iterator.set_postfix({'Error': error})
                
    def run_gibbs(self, v):
        """
        Runs one step of the Gibbs chain starting from visible state v.
        """
        # Compute probabilities of hidden states given visible states.
        hidden_probs = self._logistic(np.dot(v, self.weights))
        hidden_states = hidden_probs > np.random.rand(self.num_hidden + 1)
        hidden_states[..., 0] = True # Fix the hidden bias unit.

        # Compute probabilities of visible states given hidden states.
        visible_probs = self._logistic(np.dot(hidden_states, self.weights.T))
        visible_states = visible_probs > np.random.rand(self.num_visible + 1)
        visible_states[..., 0] = True # Fix the visible bias unit.

        return visible_states



    def _logistic(self, x):
        return 1.0 / (1 + np.exp(-x))

Train the model on the digits dataset:

In [2]:
import matplotlib.pyplot as plt  # needed for plt.plot below
from sklearn import datasets

# Load Data
digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))
data[data<7] = 0 # Make the data binary - this is often a good idea in RBMs...
data[data>=7] = 1

# Create RBM
rbm = RBM(num_visible = data.shape[1], num_hidden = 100)

# Train RBM
rbm.train(data, max_epochs = 10000)
plt.plot(rbm.error_trace)

Visualize the weights:

In [3]:
%matplotlib inline

import matplotlib.pyplot as plt

# Visualize the weights
weights = rbm.weights[1:, 1:] # remove the bias row and the bias column
weights = weights.reshape((8, 8, -1))

fig, axes = plt.subplots(10, 10, figsize=(10, 10))

for i, ax in enumerate(axes.ravel()):
    if i < weights.shape[2]:
        ax.imshow(weights[:, :, i], cmap=plt.cm.gray_r)
    ax.axis('off')

Visualize the generated data:

In [4]:
# Generate data
num_samples = 100
v = np.random.rand(1, rbm.num_visible + 1) > 0.5 # start from a random visible state
v[:, 0] = True # the first entry is the bias unit, which is always on
samples = []
for _ in range(num_samples):
    for _ in range(100): # run the chain for 100 steps before sampling
        v = rbm.run_gibbs(v)
    samples.append(v)

samples = np.concatenate(samples)

# Visualize the generated data
fig, axes = plt.subplots(10, 10, figsize=(10, 10))
samples = samples[:, 1:] # remove the bias unit

for i, ax in enumerate(axes.ravel()):
    if i < num_samples:
        ax.imshow(samples[i].reshape((8, 8)), cmap=plt.cm.gray_r)
    ax.axis('off')

plt.show()

## Neural sampling
[[Buesing2011]](#Buesing2011) introduce a model in which the recurrent activity of a network of spiking neurons performs Markov chain Monte Carlo (MCMC) sampling from a desired target distribution.

The authors propose that such networks can approximate Boltzmann machines, a type of stochastic recurrent neural network that can represent and solve complex combinatorial problems. 
In Boltzmann machines (and their simplified counterparts, Restricted Boltzmann Machines), the nodes are binary and they update their states asynchronously based on the weighted sum of inputs they receive. 
In the spiking neuron model proposed in the paper, neurons fire spikes, and the state of the network is defined by the precise timing of these spikes.

The basic idea is that the stochastic firing of neurons can be used to sample from the target distribution, similar to how the binary nodes in a Boltzmann machine sample from the target distribution. 
To be more specific, a neuron's membrane potential corresponds to the activation energy in the Boltzmann machine, 
and the neuron's probabilistic spiking mechanism plays the same role as the stochastic updating of node states in the Boltzmann machine.

While this model is conceptually similar to a Boltzmann machine, it is a significant generalization that is closer to biological reality. 
It offers a possible answer to the question of how the brain can perform complex computations with noisy, probabilistic neurons, 
by suggesting that this noise is not a bug, but a feature that allows the network to explore a wide range of possible solutions to a problem.

In terms of the simple RBM above, this spiking neuron model would replace the simple binary nodes with more complex spiking neurons, 
but the overall process of training the model and sampling from it would be similar. 
In both cases, the learning rule involves updating the weights to increase the likelihood of the observed data and decrease the likelihood of the data sampled from the model. 
The key difference is that in the spiking neuron model, the sampling process is implemented through the temporal dynamics of the spiking neurons, rather than through a simple asynchronous updating rule.

Let $p(z_1, \ldots ,z_K)$ be some arbitrary joint distribution over $K$ binary variables $z_1, \ldots ,z_K$ that only takes on values $> 0$, i.e., every configuration has nonzero probability. 
[[Buesing2011]](#Buesing2011) showed that under a certain computability assumption on $p$, a network $\mathcal{N}$ consisting of $K$ spiking neurons $n_1, \ldots ,n_K$ can sample from $p$ using its inherent stochastic dynamics. 

More precisely, they showed that the stochastic firing activity of $\mathcal{N}$ can be viewed as a non-reversible Markov chain that samples from the given probability distribution $p$. 
If a subset $o$ of the variables are observed, modeled as the corresponding neurons being "clamped" to the observed values, the remaining network samples from the conditional 
distribution of the remaining variables given the observables. Hence, the approach offers a quite natural implementation of probabilistic inference.
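
A discrete-time toy version of this idea can be sketched as follows. It is loosely inspired by the construction in [[Buesing2011]](#Buesing2011), but the details as written here are assumptions of this sketch and are not taken from the text above: each neuron carries a refractory counter `zeta` that is set to `tau` when it spikes and counts down, the binary variable $z_k$ is 1 while the counter is positive, a neuron that is not refractory spikes with probability $\sigma(u_k - \log \tau)$, and the target $p$ is a Boltzmann distribution with symmetric weights `W` (zero diagonal) and biases `b`, so that the membrane potential is $u_k = b_k + \sum_i W_{ki} z_i$.

```python
import itertools
import numpy as np

def neural_sampling(W, b, tau, num_steps, seed=0):
    """Discrete-time sketch of sampling with stochastic neurons and refractory counters."""
    rng = np.random.default_rng(seed)
    K = len(b)
    log_tau = np.log(tau)
    zeta = np.zeros(K, dtype=int)            # refractory counters in {0, ..., tau}
    z_trace = np.empty((num_steps, K), dtype=int)
    for t in range(num_steps):
        for k in range(K):                   # neurons are updated one after another
            if zeta[k] > 1:
                zeta[k] -= 1                 # still refractory: count down
            else:
                z = (zeta > 0).astype(float)
                u_k = b[k] + W[k] @ z        # membrane potential (W[k, k] is 0)
                spike_prob = 1.0 / (1.0 + np.exp(-(u_k - log_tau)))
                zeta[k] = tau if rng.random() < spike_prob else 0
        z_trace[t] = zeta > 0                # z_k = 1 while neuron k is refractory
    return z_trace

# Two coupled neurons; the values are arbitrary illustration choices.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
b = np.array([-0.5, 0.3])
z_trace = neural_sampling(W, b, tau=5, num_steps=100_000)

# Exact Boltzmann probabilities p(z) proportional to exp(b.z + z.W.z / 2).
states = [np.array(s) for s in itertools.product([0, 1], repeat=2)]
unnormalized = np.array([np.exp(b @ s + 0.5 * s @ W @ s) for s in states])
target = unnormalized / unnormalized.sum()
empirical = np.array([np.mean(np.all(z_trace[1000:] == s, axis=1)) for s in states])
print(target)
print(empirical)
```

The chain over the counters is not reversible, since a counter can only count down deterministically, yet the fraction of time the network spends in each binary state approaches the target probabilities.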

## Exercise 1: Implement a spiking neuron model

In this exercise, you will implement a spiking neuron model that can sample from a given distribution.
Implement a leaky integrate-and-fire (LIF) model using brian2.
Then update the RBM class to use the spiking neuron model instead of the simple binary nodes.

In [5]:
def run_gibbs(self, v, h, t):
    """
    Runs one step of the Gibbs chain starting from visible state v and hidden state h.
    Skeleton only: `self.spiking_dynamics` is the method to be implemented in this exercise.
    """
    # Update hidden states given visible states.
    h_new = self.spiking_dynamics(v, h, t)

    # Update visible states given hidden states.
    v_new = self.spiking_dynamics(h_new, v, t+1)

    return v_new, h_new