# for more: https://youtu.be/_bqa_I5hNAo?si=zc7ikfvusdMhW-N6

For most of their history, computers were seen as purely logical machines, mechanically crunching numbers to produce rigid, unambiguous solutions.

There was no place for creativity or ambiguity.

After all, when calculating a trajectory to launch a rocket into space, the last thing you want is your calculator dreaming up some funky, non-existing formula or improvising on the spot.

Fifty years ago, if you had asked anyone whether a computer program would sooner master driving a car or composing a song, the answer would have been unanimous.

Fast forward to 2024, however, and we still haven't quite achieved autonomous driving, while generative AI of all flavors is taken for granted.

So what sparked this shift?

At what point do neural networks transcend mere deterministic computation and begin to create, synthesizing things that never existed before?

Meet the Boltzmann machine, a type of neural network that dared to embrace chaos and change the course of AI forever.

Developed in the 1980s, Boltzmann machines introduced a radical notion.

What if we built uncertainty and randomness into the very fabric of machine learning?

What if, instead of storing rigid facts and performing deterministic computations, our AI could grasp the underlying probabilistic rules that govern the world around us?

In this video, we will build a Boltzmann machine from first principles and explore how concepts of probability and inherent uncertainty can be reconciled with the seemingly rigid nature of computer operations.

If you're interested, stay tuned.

To understand Boltzmann machines, we must first understand their simpler predecessors, associative memory networks, also known as Hopfield networks.

We explored these in depth in the previous video.

So if you haven't seen it, I highly recommend watching it before continuing with this one, as we'll be directly building on those ideas.

But here's a quick refresher.

A Hopfield network is a model of associative memory inspired by the brain's ability to recall complete patterns from partial or noisy inputs.

It operates by assigning a specific energy value to each possible state, and then iteratively minimizing this energy by descending along the energy surface into the nearest well, thus recalling the best matching stored memory.

This energy landscape is shaped by network weights, which are learned by observing data points, patterns we want to memorize, and adjusting the weights to lower the energy associated with those patterns.
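The refresher above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the previous video: the Hebbian outer-product rule in `store_patterns`, the +1/-1 state encoding, and all function names are assumptions chosen for the example.

```python
import numpy as np

def hopfield_energy(W, x):
    """Energy of state x (entries +1/-1) for symmetric W with zero diagonal.

    Equals -sum_{i<j} w_ij x_i x_j; the 0.5 undoes the double counting of pairs.
    """
    return -0.5 * x @ W @ x

def store_patterns(patterns):
    """Assumed Hebbian rule: average outer product of patterns, no self-connections."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / P.shape[0]
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, x, sweeps=10):
    """Deterministic updates: each neuron takes the sign of its weighted input."""
    x = x.copy()
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = 1.0 if W[i] @ x >= 0 else -1.0
    return x

pattern = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
W = store_patterns([pattern])

noisy = pattern.copy()
noisy[:2] *= -1          # corrupt the first two entries
restored = recall(W, noisy)
```

Each update can only lower the energy or leave it unchanged, which is why the state slides into the nearest well and never leaves it.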

Given enough neurons, a Hopfield network has essentially perfect memory and excels at mechanical tasks like pattern completion.

Think of it as a virtuoso classical musician who can recognize and flawlessly reproduce a well-known masterpiece from just a few initial notes.

However, while impressive, a Hopfield network's ability to recall and complete patterns is limited to reproducing what it has explicitly learned.

It cannot create new patterns or understand the underlying structure of the data it has seen.

This is where Boltzmann machines come in, offering a more flexible and creative approach to information processing.

To illustrate the difference, let's extend our musical analogy.

Imagine a jazz musician who has internalized not just specific songs, but also the fundamental rules and structures inherent to the music itself.

When given a few opening notes, this musician doesn't simply recall and play an existing piece.

Instead, they leverage a deep understanding of musical theory combined with creativity to improvise and produce something entirely new.

This jazz musician represents a Boltzmann machine.

Unlike an associative network, it doesn't just memorize data points.

Instead, it learns the underlying probability distribution of the data, capturing the essence of what makes a pattern belong to a particular category or style, while incorporating inherent uncertainty into its computations.

At first glance, these two systems might seem fundamentally different, with little in common algorithmically.

However, in fact, they are very closely related.

Just two key technical modifications can transform any Hopfield network into a Boltzmann machine, namely stochasticity and hidden units.

Let's explore each of them in detail.

We will first sprinkle in a dash of randomness and talk about how Boltzmann machines earned their name.

We begin in Austria, 19th century, where a young physicist, Ludwig Boltzmann, is grappling with a fundamental problem.

Imagine a system of particles, like a gas.

Each particle has its own energy, determined by factors such as its velocity.

We can measure the average energy of particles on a macroscopic scale by measuring the temperature.

But what happens at the individual particle level?

We might imagine that particles probably differ in terms of exact energy values.

Indeed, collisions can cause some particles to move faster than others, resulting in a range of energies.

Boltzmann's quest was to understand this energy distribution.

In other words, if we randomly select a particle, what is the probability that it will have a specific energy value?

Boltzmann's insight was to link a state's probability to its energy through an exponential relationship.

Specifically, the probability of a state S with energy E is proportional to the exponential of the negative energy divided by temperature.

Intuitively, lower energy states are more probable than higher energy states and this fundamental relationship quantifies exactly how much more probable.

To understand why the exponent arises here, imagine energy levels as steps on a staircase with particles jumping between them.

Each step represents a small energy increment, ε (epsilon).

For a particle to move up one step, it must gain epsilon units of energy, perhaps through a collision with another particle.

Let's call the probability of such a collision p.

Given a large number of particles, this probability is essentially constant and depends only on the average particle velocity or temperature.

If a particle jumps up one level with a probability p, it might immediately jump again with the same probability.

Since probabilities multiply for independent events, the chance of jumping two levels is p squared, three levels is p cubed, and so on.

We see a pattern.

The probability of jumping n levels is p to the power of n.

Now, consider a particle increasing its energy by ΔE (delta E).

How many steps must it climb?

Since the gap between the steps is constant, the number of steps is ΔE over ε.

Thus, the probability of making this transition to a higher energy state is \(p^{\Delta E / \varepsilon}\).

To bring it into a more familiar form, we can repackage different constants.

We can move the temperature dependence of p into the exponent and change the base to e, Euler's number, which is conventionally used in exponentials.

Since p is less than one while e is greater than one, and the temperature is always positive, the exponent must carry a minus sign in front of the energy.

Consequently, the probability of an energy increase ΔE is equal to the exponential of minus ΔE over temperature.
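One way to make this repackaging of constants explicit is sketched below. The definition of \(T\) in terms of \(p\) and \(\varepsilon\) is an assumption made for illustration (with the Boltzmann constant absorbed into \(T\)); it is not spelled out in the narration.

$$p^{\Delta E/\varepsilon} = e^{(\Delta E/\varepsilon)\ln p} = e^{-\Delta E/T}, \qquad T \equiv \frac{\varepsilon}{\ln(1/p)}$$

Because \(0 < p < 1\), \(\ln(1/p)\) is positive, so \(T\) is positive and the minus sign appears in the exponent. A larger \(p\) (more energetic collisions) corresponds to a larger \(T\).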

Oh, and by the way, in textbooks you will usually find a version of this with the Boltzmann constant k in front of the temperature.

This constant merely converts temperature measured in kelvins into energy measured in joules.

In this video, we absorb the Boltzmann constant into the temperature.

This equation gives us the relative probability of transitioning from one state to another as a function of the energy difference between them.

But how can we find the absolute probability of a particular energy state?

Consider a toy example.

Suppose there are only three states with energy values of one, two and three respectively.

Temperature = 1.

The equation tells us that state two is 1/e as likely as state one.

State three is 1/e² as likely as state one.

But what about absolute values?

We don't know the baseline probability of state one.

Absolute probabilities must sum to one.

Let probability of state one = x.

Use ratios to express probabilities of others.

Use total probability rule to solve for x.

This demonstrates how we move from relative probabilities to absolute probabilities.
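Worked out for the three-state toy example (energies 1, 2, 3 at temperature 1), the steps above read:

$$x + \frac{x}{e} + \frac{x}{e^{2}} = 1 \quad\Longrightarrow\quad x = \frac{1}{1 + e^{-1} + e^{-2}} \approx 0.665$$

$$p_1 = x \approx 0.665, \qquad p_2 = \frac{x}{e} \approx 0.245, \qquad p_3 = \frac{x}{e^{2}} \approx 0.090$$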

Now plug the absolute energy values into the exponential formula.

Plot relative probabilities vs energy.

Plot absolute probabilities.

Notice: one shape is a vertically rescaled version of the other.

Therefore the absolute probability of a state with energy E is proportional to exp(−E).

Divide by Z, where Z is the partition function.

Z ensures the probabilities sum to one.

This is the final Boltzmann distribution linking energy to probability.

Compute Z by summing exp(−energy) across all states.

Probability of a given state = exp(−E) / Z.
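As a minimal sketch, the normalization can be written as a small NumPy function. The name `boltzmann_distribution` and the `temperature` argument are illustrative choices; subtracting the minimum energy is a standard numerical-stability trick that is not part of the narration.

```python
import numpy as np

def boltzmann_distribution(energies, temperature=1.0):
    """Return p(s) = exp(-E(s)/T) / Z for a list of state energies."""
    energies = np.asarray(energies, dtype=float)
    # Shifting all energies by a constant rescales every weight and Z by the
    # same factor, so the probabilities are unchanged but exp() cannot overflow.
    weights = np.exp(-(energies - energies.min()) / temperature)
    Z = weights.sum()          # partition function of the shifted energies
    return weights / Z

# Toy example from the text: three states with energies 1, 2, 3 at T = 1.
probs = boltzmann_distribution([1.0, 2.0, 3.0])
```

The ratio between neighboring states stays at 1/e, as in the relative picture; only the overall scale is fixed by Z.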

Now apply this to Hopfield networks to make them more stochastic.

Recall: Hopfield neurons update deterministically.

Boltzmann machines embrace randomness.

Instead of always choosing lowest energy, they choose probabilistically using Boltzmann distribution.

Consider a single neuron i.

At a given update step, two candidate states: on or off.

Compute energy of each.

First term: edges of neuron i.

Second term: energy of rest of network.

Energy term from rest of network cancels.

Probability depends only on local connections.

Probability of switching on is a function of energy difference.

Energy difference between the off and on states = 2 × weighted input.

Substituting gives sigmoid(2 × weighted input / T).

When input positive: high probability of switching on.

When input negative: low probability.
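The cancellation can be checked numerically. The sketch below compares the probability computed from the full network energies of the two candidate states with the local sigmoid formula. The random weights, the +1/-1 encoding, and the helper names are assumptions made for the check.

```python
import numpy as np

def energy(W, x):
    """Network energy -sum_{i<j} w_ij x_i x_j for symmetric W with zero diagonal."""
    return -0.5 * x @ W @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
W = (A + A.T) / 2.0            # symmetric weights
np.fill_diagonal(W, 0.0)       # no self-connections
x = rng.choice([-1.0, 1.0], size=n)
i, T = 2, 1.5

# Global route: Boltzmann weights of the two candidate states of neuron i.
x_on, x_off = x.copy(), x.copy()
x_on[i], x_off[i] = 1.0, -1.0
w_on = np.exp(-energy(W, x_on) / T)
w_off = np.exp(-energy(W, x_off) / T)
p_global = w_on / (w_on + w_off)

# Local route: only the weighted input of neuron i is needed.
p_local = sigmoid(2.0 * (W[i] @ x) / T)
```

Both routes give the same number up to floating-point error, because the energy of the rest of the network multiplies `w_on` and `w_off` by the same factor.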

Stochastic update rule:

Compute weighted input.

Compute sigmoid probability.

Generate random number 0–1.

If it is less than the probability → neuron = 1. Else neuron = −1.
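The four steps above translate almost line for line into code. This is a sketch under the same assumptions as the rest of the notes (states +1/-1, symmetric weights, zero diagonal); the function names and the random sweep order are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_update(W, x, i, temperature, rng):
    """Update neuron i in place and return its probability of switching on."""
    weighted_input = W[i] @ x                             # 1. weighted input
    p_on = sigmoid(2.0 * weighted_input / temperature)    # 2. sigmoid probability
    u = rng.random()                                      # 3. random number in [0, 1)
    x[i] = 1.0 if u < p_on else -1.0                      # 4. compare and set state
    return p_on

def sweep(W, x, temperature, rng):
    """Visit every neuron once, in random order."""
    for i in rng.permutation(len(x)):
        stochastic_update(W, x, i, temperature, rng)
    return x

rng = np.random.default_rng(0)
W = np.array([[0.0, 5.0],
              [5.0, 0.0]])      # two neurons that strongly prefer to agree
x = np.array([1.0, 1.0])
p = stochastic_update(W, x, 0, 1.0, rng)
```

Here neuron 0 receives a weighted input of 5, so `p` is very close to 1, but never exactly 1: there is always a small chance of moving to the higher-energy state.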

Stochasticity allows moves to higher energy states.

High temp → more random.

Low temp → more deterministic.

Temperature is a hyper-parameter controlling creativity.
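To see the effect of temperature in numbers, the sketch below evaluates the switch-on probability for one fixed positive weighted input at several temperatures. The input value and the temperature grid are arbitrary illustrative choices.

```python
import numpy as np

def p_on(weighted_input, temperature):
    """Probability that a +1/-1 neuron switches on, given its weighted input."""
    return 1.0 / (1.0 + np.exp(-2.0 * weighted_input / temperature))

temperatures = [0.1, 1.0, 10.0, 100.0]
probs = [p_on(1.0, T) for T in temperatures]

for T, p in zip(temperatures, probs):
    print(f"T = {T:6.1f}  ->  p(on) = {p:.4f}")
```

At low temperature the probability saturates near 1 and the update behaves like the deterministic Hopfield rule. At high temperature it approaches 0.5, a coin flip that ignores the input.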

This stochastic rule is essential.

It allows the network to escape local minima.

It explores many states.

Enables learning complex probability distributions.

Random update rule is the key modification for inference.

But does stochasticity change learning?

Yes → leads to contrastive learning rule.

In Hopfield networks, learning = lowering energy of stored patterns.

In Boltzmann machines, goal = learn probability distribution of data.

Ideally, network should spend more time in states corresponding to training data.

These states must have higher probability.

From Boltzmann distribution → higher probability = lower energy relative to others.

But lowering one energy affects Z.

Z depends on energies of all states.

Learning goal: maximize probability of training data while accounting for entire energy landscape.

Objective: maximize log probability of data.

Take log of joint probability.

Substitute Boltzmann distribution.

Insight: maximize log probability =

Minimize energy of training patterns.

Minimize partition function.

Minimize Z means raising energy of non-training states.

Two forces:

Digging wells around data.

Raising surface for unrealistic states.
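Both forces fall out of differentiating the objective with respect to a single weight. The sketch below assumes \(T = 1\) and the pairwise energy \(E(x) = -\sum_{i<j} w_{ij} x_i x_j\) used elsewhere in these notes.

$$\log P = -\sum_{n=1}^{N} E\bigl(x^{(n)}\bigr) - N \log Z$$

$$\frac{\partial}{\partial w_{ij}}\Bigl(-E\bigl(x^{(n)}\bigr)\Bigr) = x^{(n)}_i x^{(n)}_j$$

$$\frac{\partial \log Z}{\partial w_{ij}} = \frac{1}{Z}\sum_{s} e^{-E(s)}\, s_i s_j = \sum_{s} p(s)\, s_i s_j = \langle x_i x_j \rangle_{\text{model}}$$

$$\frac{1}{N}\frac{\partial \log P}{\partial w_{ij}} = \langle x_i x_j \rangle_{\text{data}} - \langle x_i x_j \rangle_{\text{model}}$$

The first average digs wells around the data. The second, which comes from \(\log Z\), raises the energy of states the model visits on its own.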

Weight update rule = contrastive Hebbian learning.

First term: average product xi xj when clamped to data (Hebbian).

Second term: average product xi xj during free-running (anti-Hebbian).

Interpretation:

Strengthen correlations present in data.

Weaken correlations present only in hallucinated states.

Called contrastive because it contrasts constrained vs free phases.

Positive phase: clamp visible neurons to data.

Negative phase: allow free-running sampling.

In practice:

For Hebbian term: iterate over training examples.

For anti-Hebbian term: run network freely from random states.

Repeat many times → iterative learning.

Learning alternates between positive and negative phases.

This gradually shapes the energy landscape so valleys = data, peaks = unrealistic states.
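A minimal sketch of this loop for a network with visible units only is shown below. The learning rate, number of chains, number of sweeps, and the toy dataset are all illustrative assumptions. Restarting the free-running chains from random states every epoch follows the notes above but is only one of several possible choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(W, x, rng, temperature=1.0):
    """One pass of stochastic updates over all neurons, in place."""
    for i in range(len(x)):
        p_on = sigmoid(2.0 * (W[i] @ x) / temperature)
        x[i] = 1.0 if rng.random() < p_on else -1.0
    return x

def train(data, epochs=200, lr=0.05, chains=20, sweeps=5, seed=0):
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    W = np.zeros((n, n))
    # Positive (Hebbian) phase: correlations with the units clamped to data.
    data_corr = data.T @ data / len(data)
    for _ in range(epochs):
        # Negative (anti-Hebbian) phase: free-running samples from random states.
        samples = rng.choice([-1.0, 1.0], size=(chains, n))
        for x in samples:              # each row is a view, updated in place
            for _ in range(sweeps):
                gibbs_sweep(W, x, rng)
        model_corr = samples.T @ samples / chains
        # Contrastive update: strengthen data correlations, weaken model ones.
        W += lr * (data_corr - model_corr)
        np.fill_diagonal(W, 0.0)       # keep self-connections at zero
    return W

# Toy data: units 0 and 1 always agree, and always disagree with units 2 and 3.
data = np.array([[1, 1, -1, -1],
                 [-1, -1, 1, 1]], dtype=float)
W = train(data)
```

After training, the weight between units that agree in the data is positive and the weight between units that disagree is negative, which is the landscape-shaping effect described above.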

So far: only visible units.

Final modification: hidden units.

Hidden units encode internal representations.

Capture abstract features.

Visible units = observation.

Hidden units = latent structure.

Implementation:

Add neurons.

Some = visible, some = hidden.

Hidden units update same as visible.

In learning:

Positive phase: clamp visible, let hidden update.

Negative phase: let all run freely.

Apply contrastive learning to all weights.
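The two phases differ only in which units are allowed to change, which the sketch below makes explicit with a `free_units` argument. The layout (visible units first, hidden units after), the single sample per training example, and all names and hyperparameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_phase(W, x, free_units, rng, sweeps=5):
    """Stochastically update only `free_units`; all other units stay clamped."""
    for _ in range(sweeps):
        for i in free_units:
            p_on = sigmoid(2.0 * (W[i] @ x))
            x[i] = 1.0 if rng.random() < p_on else -1.0
    return x

def contrastive_step(W, data, n_hidden, rng, lr=0.05):
    n_visible = data.shape[1]
    n = n_visible + n_hidden
    hidden = range(n_visible, n)
    pos = np.zeros((n, n))
    neg = np.zeros((n, n))
    for v in data:
        # Positive phase: visible units clamped to data, hidden units free.
        x = np.concatenate([v, rng.choice([-1.0, 1.0], size=n_hidden)])
        run_phase(W, x, hidden, rng)
        pos += np.outer(x, x)
        # Negative phase: every unit runs freely from a random state.
        x = rng.choice([-1.0, 1.0], size=n)
        run_phase(W, x, range(n), rng)
        neg += np.outer(x, x)
    # The same contrastive rule is applied to all weights, hidden ones included.
    W = W + lr * (pos - neg) / len(data)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(0)
data = np.array([[1, 1, -1, -1],
                 [-1, -1, 1, 1]], dtype=float)
W = np.zeros((6, 6))                   # 4 visible + 2 hidden units
for _ in range(100):
    W = contrastive_step(W, data, n_hidden=2, rng=rng)
```

Nothing in the update tells the hidden units what to represent. Their weights change only through the correlations they happen to develop with the clamped visible units.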

Hidden units learn to represent structure without being explicitly taught.

Over time they learn abstract features ∼ early deep learning.

Restricted Boltzmann Machines (RBM):

No visible–visible connections.

No hidden–hidden connections.

Only visible ↔ hidden.

Allows parallel unit updates.

Much more efficient.
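Because no unit is connected to another unit in its own layer, a whole layer can be sampled in one vectorized step. The sketch below assumes +1/-1 units (hence the factor of 2 inside the sigmoid), no bias terms, and a weight matrix of shape `(n_visible, n_hidden)`; the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_layer(inputs, rng, temperature=1.0):
    """Sample a whole layer of +1/-1 units at once from their weighted inputs."""
    p_on = sigmoid(2.0 * inputs / temperature)
    return np.where(rng.random(p_on.shape) < p_on, 1.0, -1.0)

def gibbs_step(W, v, rng):
    """All hidden units given v in parallel, then all visible units given h."""
    h = sample_layer(v @ W, rng)       # inputs to hidden units: sum_i w_ij v_i
    v_new = sample_layer(W @ h, rng)   # inputs to visible units: sum_j w_ij h_j
    return v_new, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))     # 6 visible, 3 hidden units
v = rng.choice([-1.0, 1.0], size=6)
v, h = gibbs_step(W, v, rng)
```

In a general Boltzmann machine the same sweep needs one sequential update per neuron. Here it takes two matrix-vector products regardless of layer size.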

Despite restrictions, RBMs are expressive and practical.

Summary:

Hopfield networks → store/recall.

Add randomness + probability → Boltzmann machines.

Add hidden units → abstract representation + generation.

Contrastive learning → captures distributions.

Foundation for modern generative AI.



## I. Physics of Energy & Probability (Boltzmann Distribution Foundations)

1. Energy increase after climbing \(n\) steps of size \(\varepsilon\)  
$$\Delta E = n\,\varepsilon$$

2. Probability of climbing one step  
$$p$$

3. Probability of climbing \(n\) steps  
$$p^n$$

4. Probability of energy increase by \(\Delta E\)  
$$p^{\Delta E / \varepsilon}$$

5. Exponential form of transition probability  
$$\Pr(\Delta E)=e^{-\Delta E/T}$$

(Where temperature absorbs Boltzmann constant: \(kT \rightarrow T\))

---

## II. From Relative Probabilities to Absolute Probabilities

6. Relative probability of a state with energy \(E\) (at \(T = 1\))  
$$\tilde{p}(E)=e^{-E}$$

7. Partition function  
$$Z=\sum_{s} e^{-E(s)}$$

8. Boltzmann distribution (final form)  
$$p(s)=\frac{e^{-E(s)}}{Z}$$

---

## III. Energy in Hopfield / Boltzmann Networks

9. Energy of a network configuration  
$$E(x)= -\sum_{i<j} w_{ij} x_i x_j$$

---

## IV. Local Neuron Update (Boltzmann Machine Stochastic Rule)

10. Energy when neuron \(i\) is ON  
$$E(x_i=1) = -\sum_j w_{ij} x_j + C$$

11. Energy when neuron \(i\) is OFF  
$$E(x_i=-1) = +\sum_j w_{ij} x_j + C$$

12. Energy difference between states, \(\Delta E = E(x_i=1) - E(x_i=-1)\)  
$$\Delta E = -2\sum_j w_{ij} x_j$$

13. Boltzmann probability of neuron \(i\) turning ON  
$$p(x_i = 1)=\frac{1}{1 + e^{\Delta E/T}}$$

14. Sigmoid form  
$$p(x_i = 1)=\sigma\left( \frac{2}{T}\sum_j w_{ij} x_j \right)$$

---

## V. Learning Objective (Maximum Likelihood)

15. Log-probability of dataset \(\{x^{(1)},\dots,x^{(N)}\}\)  
$$\log P = \sum_{n=1}^{N} \log p(x^{(n)})$$

16. Substitute Boltzmann distribution  
$$\log P = \sum_{n=1}^{N} \left( -E(x^{(n)}) - \log Z \right)$$

17. Derivative (core learning gradient, per training example)  
$${1 \over N}{\partial \log P \over \partial w_{ij}}
= \langle x_i x_j \rangle_{\text{data}}
- \langle x_i x_j \rangle_{\text{model}}$$

---

## VI. Contrastive Hebbian Learning Rule

18. Final weight update rule  
$$\Delta w_{ij} = \eta\left( \langle x_i x_j \rangle_{\text{data}} - \langle x_i x_j \rangle_{\text{model}} \right)$$

Positive phase:  
$$\langle x_i x_j \rangle_{\text{data}}$$

Negative phase:  
$$\langle x_i x_j \rangle_{\text{model}}$$

---

## VII. Hidden Units (same rules, extended)

19. Full Boltzmann machine energy  
$$
E(v,h)=
-\sum_{i,j} w_{ij} v_i h_j
-\sum_{i<j} a_{ij} v_i v_j
-\sum_{i<j} b_{ij} h_i h_j
$$

20. Restricted Boltzmann Machine (RBM) energy  
$$E(v,h) = -\sum_{i,j} w_{ij} v_i h_j$$

---

## VIII. Inference in RBM (Parallel Updates)

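21. Hidden unit activation (units taking values in \(\{0,1\}\), \(T = 1\))  
$$p(h_j = 1 \mid v) = \sigma\left( \sum_i w_{ij} v_i \right)$$

22. Visible unit activation (same convention)  
$$p(v_i = 1 \mid h) = \sigma\left( \sum_j w_{ij} h_j \right)$$

(With \(\pm 1\) units, as in Section IV, the argument of \(\sigma\) gains a factor of 2.)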

---

## Complete Equation List (Quick Index)

$$\Delta E = n\varepsilon$$  
$$p, \; p^n, \; p^{\Delta E/\varepsilon}$$  
$$e^{-\Delta E/T}$$  
$$e^{-E}$$  
$$Z=\sum_s e^{-E(s)}$$  
$$p(s)=e^{-E(s)}/Z$$  
$$E(x)= -\sum_{i<j} w_{ij} x_i x_j$$  
$$E(x_i=1)= -\sum_j w_{ij} x_j + C$$  
$$E(x_i=-1)= +\sum_j w_{ij} x_j + C$$  
$$\Delta E = -2\sum_j w_{ij} x_j$$  
$$p(x_i=1)=\frac{1}{1+e^{\Delta E/T}}$$  
$$\sigma\left( \frac{2}{T}\sum_j w_{ij} x_j \right)$$  
$$\log P = \sum_n \log p(x^{(n)})$$  
$$\log P = \sum_n (-E(x^{(n)}) - \log Z)$$  
$${1 \over N}{\partial \log P \over \partial w_{ij}}
= \langle x_i x_j \rangle_{\text{data}}
- \langle x_i x_j \rangle_{\text{model}}$$  
$$\Delta w_{ij}
= \eta(\langle x_i x_j \rangle_{\text{data}}
- \langle x_i x_j \rangle_{\text{model}})$$  
$$
E(v,h)=
-\sum_{i,j} w_{ij} v_i h_j
-\sum_{i<j} a_{ij} v_i v_j
-\sum_{i<j} b_{ij} h_i h_j
$$  
$$E(v,h)= -\sum_{i,j} w_{ij} v_i h_j$$  
$$p(h_j=1|v)=\sigma\left(\sum_i w_{ij} v_i\right)$$  
$$p(v_i=1|h)=\sigma\left(\sum_j w_{ij} h_j\right)$$
