# How physics advanced Generative AI

## Bibliography

https://www.americanscientist.org/article/first-links-in-the-markov-chain

https://www.kdnuggets.com/2023/01/introduction-markov-chains.html

1. Paper: [https://arxiv.org/pdf/1503.03585.pdf](https://arxiv.org/pdf/1503.03585.pdf)
2. Paper: https://arxiv.org/pdf/2006.11239.pdf
3. Paper: https://arxiv.org/pdf/2102.09672.pdf
4. Paper: https://arxiv.org/pdf/2105.05233.pdf

[https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/](https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/)

https://www.assemblyai.com/blog/how-physics-advanced-generative-ai/

[What are Diffusion Models?](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)

https://www.youtube.com/watch?v=fbLgFrlTnGU

https://www.youtube.com/watch?v=HoKDTa5jHvg

https://www.youtube.com/watch?v=TBCRlnwJtZU

https://www.youtube.com/watch?v=687zEGODmHA

[Modern Computer Vision and Deep Learning (CS 198-126)](https://ml-berkeley.notion.site/Modern-Computer-Vision-and-Deep-Learning-CS-198-126-0e28ffea0c4140f28399dd823c527bec)

https://learnopencv.com/denoising-diffusion-probabilistic-models/#Forward-Diffusion-Process

https://mathematica.stackexchange.com/questions/269181/diffusion-probabilistic-model-in-deep-generative-modeling

https://e-dorigatti.github.io/math/deep%20learning/2023/06/25/diffusion.html

https://thiago-lira.medium.com/a-toy-diffusion-model-you-can-run-on-your-laptop-20e9e5a83462

https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf


# Markov chains

In probability, a discrete-time Markov chain (DTMC) is a sequence of random variables, known as a stochastic process, in which the value of the next variable depends only on the value of the current variable, and not on any variables in the past. For instance, a machine may have two states, A and E. When it is in state A, there is a 40% chance of it moving to state E and a 60% chance of it remaining in state A. When it is in state E, there is a 70% chance of it moving to A and a 30% chance of it staying in E. The sequence of states of the machine is a Markov chain. If we denote the chain by $X_{0},X_{1},X_{2},\ldots$ then $X_{0}$ is the state in which the machine starts and $X_{10}$ is the random variable describing its state after 10 transitions. The process continues forever, indexed by the natural numbers.

![image.png](attachment:image.png)
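
The two-state machine above can be written as a transition matrix and pushed forward in time. The following is a minimal sketch: the starting state (A with certainty) and the variable names are assumptions made for illustration, not part of the original example.

```python
import numpy as np

# Transition matrix of the two-state machine: rows = current state, columns = next state.
states = ["A", "E"]
P = np.array([[0.6, 0.4],   # from A: stay in A, move to E
              [0.7, 0.3]])  # from E: move to A, stay in E

# Assumption for illustration: the machine starts in state A with certainty.
start = np.array([1.0, 0.0])

# Distribution of X_10: apply the transition matrix ten times to the start distribution.
dist_10 = start @ np.linalg.matrix_power(P, 10)

# Long-run (stationary) distribution: left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

print(dict(zip(states, dist_10.round(4))))
print(dict(zip(states, stationary.round(4))))
```

Solving $\pi P = \pi$ by hand gives $\pi = (7/11, 4/11)$, and after 10 transitions the chain has essentially forgotten its starting state, so both printed distributions should agree with that value.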

__Definition__

A discrete-time Markov chain is a sequence of random variables ${\displaystyle X_{0},X_{1},X_{2},...}$ with the Markov property, namely that the probability of moving to the next state depends only on the present state and not on the previous states:

$${\displaystyle \Pr(X_{n+1}=x\mid X_{1}=x_{1},X_{2}=x_{2},\ldots ,X_{n}=x_{n})=\Pr(X_{n+1}=x\mid X_{n}=x_{n}),}$$ 

if both conditional probabilities are well defined, that is, if 

$${\displaystyle \Pr(X_{1}=x_{1},\ldots ,X_{n}=x_{n})>0.}$$

The possible values of $X_i$ form a countable set $S$ called the state space of the chain.

Markov chains are often described by a sequence of directed graphs, where the edges of graph $n$ are labeled by the probabilities of going from one state at time $n$ to the other states at time $n+1$, $\Pr(X_{n+1}=x\mid X_{n}=x_{n})$. The same information is represented by the transition matrix from time $n$ to time $n+1$. However, Markov chains are frequently assumed to be time-homogeneous, in which case the graph and matrix are independent of $n$ and are thus not presented as sequences.

## Counting Vowels and Consonants: application to Alexander S. Pushkin’s poem “Eugene Onegin”

![image-3.png](attachment:image-3.png)

![image-5.png](attachment:image-5.png)

Markov’s sample comprised the first 20,000 letters of the poem, which is about an eighth of the total. He eliminated all punctuation and white space, jamming the characters into one long, unbroken sequence. In the first phase of his analysis he arranged the text in 200 blocks of 10×10 characters, then counted the vowels in each row and column. From this tabulation he was able to calculate both the mean number of vowels per 100-character block and the variance, a measure of how widely samples depart from the mean. Along the way he tallied up the total number of vowels (8,638) and consonants (11,362).

In a second phase Markov returned to the unbroken sequence of 20,000 letters, combing through it to classify pairs of successive letters according to their pattern of vowels and consonants. He counted 1,104 vowel-vowel pairs and was able to deduce that there were 3,827 double consonants; the remaining 15,069 pairs must consist of a vowel and a consonant in one order or the other.

With these numbers in hand, Markov could estimate to what extent Pushkin’s text violates the principle of independence. The probability that a randomly chosen letter is a vowel is 8,638/20,000, or about 0.43. If adjacent letters were independent, then the probability of two vowels in succession would be $(0.43)^2$, or about 0.19. A sample of 19,999 pairs would be expected to have 3,731 double vowels, more than three times the actual number. Thus we have strong evidence that the letter probabilities are not independent; there is an exaggerated tendency for vowels and consonants to alternate. (Given the phonetic structure of human language, this finding is not a surprise.)
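
The arithmetic behind this independence check is easy to reproduce. The sketch below uses only the counts quoted above; the last quantity, the share of vowels that are followed by another vowel, is a rough estimate added here for illustration (it divides by all vowels, including a possible final one).

```python
# Counts reported for Markov's sample of the poem.
n_letters = 20_000
n_vowels = 8_638
observed_vv = 1_104            # vowel-vowel pairs actually counted
n_pairs = n_letters - 1        # adjacent pairs in one unbroken sequence

# Under independence, P(vowel, vowel) = P(vowel)^2.
p_vowel = n_vowels / n_letters
expected_vv = n_pairs * p_vowel ** 2

# Rough estimate of P(next letter is a vowel | current letter is a vowel).
p_vowel_after_vowel = observed_vv / n_vowels

print(f"P(vowel)                     = {p_vowel:.4f}")
print(f"expected vowel-vowel pairs   = {expected_vv:.0f}")
print(f"expected / observed          = {expected_vv / observed_vv:.2f}")
print(f"P(vowel | previous is vowel) = {p_vowel_after_vowel:.3f}")
```

The gap between the unconditional vowel probability and the conditional one is exactly the dependence on the previous letter that a Markov chain captures.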

## Time Series Example
 

Here is an example of how a Markov chain might be used to model the evolution of a time series:

Suppose we have a time series of stock prices, and we want to use a Markov chain to model the evolution of the stock's price over time. We can define a set of states that the stock's price can take on (e.g. "increasing," "decreasing," and "stable"), and specify the probability of transitioning between these states. For example, we might define the transition probabilities as follows:

![image.png](attachment:image.png)

This matrix specifies the probability of transitioning from one state to another given the current state. For example, if the stock's price is currently increasing, there is a 60% chance that it will continue to increase, a 30% chance that it will decrease, and a 10% chance that it will remain stable.

Once we have defined the Markov chain and its transition probabilities, we can use it to predict the future evolution of the stock's price by simulating the transitions between states. At each time step, we would use the current state and the transition probabilities to determine the probability of transitioning to each possible next state.

### Example in Python
 

Here is an example of how to implement a Markov chain in Python:

Modeling the stock price to predict whether it is increasing, decreasing, or stable.

In [1]:
import numpy as np

# Define the states of the Markov chain
states = ["increasing", "decreasing", "stable"]

# Define the transition probabilities
transition_probs = np.array([[0.6, 0.3, 0.1], [0.4, 0.4, 0.2], [0.5, 0.3, 0.2]])

# Set the initial state
current_state = "increasing"

# Set the number of time steps to simulate
num_steps = 10

# Simulate the Markov chain for the specified number of time steps
for i in range(num_steps):
    # Get the probability of transitioning to each state
    probs = transition_probs[states.index(current_state)]
    
    # Sample a new state from the distribution
    new_state = np.random.choice(states, p=probs)
    
    # Update the current state
    current_state = new_state
    
    # Print the current state
    print(f"Step {i+1}: {current_state}")

Step 1: increasing
Step 2: increasing
Step 3: increasing
Step 4: decreasing
Step 5: stable
Step 6: stable
Step 7: decreasing
Step 8: decreasing
Step 9: increasing
Step 10: decreasing


This code defines a simple Markov chain with three states ("increasing," "decreasing," and "stable") and specifies the transition probabilities between these states. It then simulates the Markov chain for 10 time steps, sampling a new state at each time step according to the transition probabilities and updating the current state accordingly. The output of this code will be a sequence of states representing the evolution of the system over time, as shown below:

![image.png](attachment:image.png)

If we set the current state to “decreasing” and run the code, we obtain the following output:

![image-2.png](attachment:image-2.png)

Note that this is a very simplified example, and in practice, you may need to use more states and consider more complex transition probabilities in order to model the behavior of the system accurately. However, this illustrates the basic idea of how a Markov chain can be implemented in Python.

Markov chains can be applied to a wide range of problems, and they can be implemented in Python using a variety of tools and libraries, including NumPy and SciPy's scipy.stats module.
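
In practice the transition probabilities are usually not given but estimated from data, by counting transitions much as Markov counted letter pairs. The sketch below simulates a long path from the matrix used above and recovers the matrix from the counts; the seed, the path length and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

states = ["increasing", "decreasing", "stable"]
transition_probs = np.array([[0.6, 0.3, 0.1],
                             [0.4, 0.4, 0.2],
                             [0.5, 0.3, 0.2]])

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Simulate a long path, storing state indices instead of names.
num_steps = 20_000
path = np.empty(num_steps, dtype=int)
path[0] = 0  # start in "increasing"
for t in range(1, num_steps):
    path[t] = rng.choice(3, p=transition_probs[path[t - 1]])

# Count transitions i -> j, then normalize each row to get probabilities.
counts = np.zeros((3, 3))
np.add.at(counts, (path[:-1], path[1:]), 1)
estimated = counts / counts.sum(axis=1, keepdims=True)

print(estimated.round(3))
```

With real price data, `path` would come from labeling each observed time step with one of the three states instead of from a simulation.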

# Generative machine learning models: Density Modeling for Data Synthesis

Assume that all data comes from a distribution $p_{\text{data}}(x)$:
* The goal of generative machine learning models is to learn this distribution to the best of their ability; the distribution approximated by the model is denoted $p_\theta(x)$
* We generate new data by sampling from the learned distribution
* In practice, we train models to maximize the expected log-likelihood of the data under $p_\theta(x)$ (equivalently, to minimize the negative log-likelihood), or to minimize a divergence between $p_\theta(x)$ and $p_{\text{data}}(x)$
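
The three points above can be illustrated with the simplest possible model family. In this sketch $p_{\text{data}}$ is assumed to be a one-dimensional Gaussian and $p_\theta$ is a Gaussian with $\theta = (\mu, \sigma)$, for which the maximum-likelihood fit has a closed form; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for p_data: a Gaussian whose parameters the model does not know.
data = rng.normal(loc=2.0, scale=0.5, size=10_000)

# Model family p_theta: Gaussian with theta = (mu, sigma).
# For this family the maximum-likelihood estimate is the sample mean and std.
mu_hat = data.mean()
sigma_hat = data.std()

def neg_log_likelihood(x, mu, sigma):
    """Average negative log-likelihood of x under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2))

nll_fitted = neg_log_likelihood(data, mu_hat, sigma_hat)
nll_other = neg_log_likelihood(data, 0.0, 1.0)  # an arbitrary, worse theta

# Generate new data by sampling from the learned distribution.
new_samples = rng.normal(loc=mu_hat, scale=sigma_hat, size=5)

print(f"fitted theta: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"NLL at fitted theta: {nll_fitted:.3f}, NLL at (0, 1): {nll_other:.3f}")
print("new samples:", new_samples.round(3))
```

For images no closed form exists, which is why the models below replace the explicit fit with a learned, step-by-step transformation.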

![image.png](attachment:image.png)


# Diffusion models

![image.png](attachment:image.png)

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. 

Diffusion models are a class of generative models that have gained significant attention in recent years due to their ability to model complex and high-dimensional data distributions. These models are based on the diffusion process, which is a stochastic process that describes the spread of particles or molecules from regions of high concentration to regions of low concentration. In the context of machine learning, diffusion models use the diffusion process to model the probability density function of the data distribution, allowing for the generation of realistic and diverse samples.


Diffusion models are another kind of generative modeling technique that takes inspiration from physics (non-equilibrium statistical physics and stochastic differential equations, to be more exact).

Main idea: convert a well-known and simple base distribution (like a Gaussian) to the target (data) distribution iteratively, with small step sizes, via a Markov chain.

## Physical picture

If a drop of food coloring is placed into a glass of water, the food coloring spreads out to eventually create a uniform color in the glass. Why is this?

![image.png](attachment:image.png)

The uniform color is a result of the atoms of food coloring spreading out over time. There are many more ways for the billions of atoms to be in different places than all in the same place, just as there are many more ways for 50% of coins to land heads-up than 100% of them. When all of the atoms are concentrated in a single drop, they can be considered to be “100% heads-up”; when the atoms are spread out evenly, they can be considered to be “50% heads-up”.

This process is called diffusion, and it inspires models like DALL-E 2 and Stable Diffusion.

Diffusion Models view the pixels of images as atoms. Similarly to how the random motion of food coloring will always lead to a uniform color, the “random motion” of pixels will always lead to “TV static”, which is the image equivalent of uniform food coloring.

![image-2.png](attachment:image-2.png)

Importantly, no matter where we place the initial drop of food coloring, over time all possible starting positions will yield this same final state of uniform color.

![image-3.png](attachment:image-3.png)

Note in particular that it is impossible to go backward and figure out where the drop initially was from this uniform state, since all initial states lead to it. The map from initial states to the final state is not injective, which makes it impossible to go backward in general.
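
This loss of information about the starting point can be seen numerically. The sketch below is a one-dimensional stand-in for the glass of water: a single "pixel value" is repeatedly shrunk slightly and perturbed with Gaussian noise. The constant noise variance `beta`, the number of steps and the two starting values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

beta = 0.02        # assumed constant noise variance per step
num_steps = 500
n = 10_000         # number of independent copies to simulate

def diffuse(x0):
    """Run the noising chain from the same start x0 for n independent copies."""
    x = np.full(n, x0, dtype=float)
    for _ in range(num_steps):
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)
    return x

left = diffuse(-3.0)   # "drop" placed on one side
right = diffuse(3.0)   # "drop" placed on the other side

print(f"start -3: mean={left.mean():.3f}, std={left.std():.3f}")
print(f"start +3: mean={right.mean():.3f}, std={right.std():.3f}")
```

Both runs end up with approximately zero mean and unit standard deviation, so the final samples alone no longer reveal which start they came from.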

![image-4.png](attachment:image-4.png)

We always know how drops will diffuse in forward time, but we don’t know how to reverse-diffuse the uniform coloring due to this lack of injectivity. However, if we restrict our attention to one particular drop, then we can model this process both forward and backward in time.

![image-5.png](attachment:image-5.png)

Diffusion Models use this same principle in the image domain. In particular, the different “drops” for Diffusion Models correspond to different types of images. For example, these drops could correspond to images of dogs, images of humans, and images of handwritten digits.

![image-6.png](attachment:image-6.png)

By picking just one type of image, say images of dogs, Diffusion Models can learn to go backwards in time for that one type of image, just like how we can learn to go backwards in time from the uniform color by picking just one drop.

![image-7.png](attachment:image-7.png)

## Image generation with Diffusion Models

It may be unclear why we would want to do this - if we have a dataset of images of dogs, why would we want to go forward and backward like this? The answer lies in the fact that the figure directly above is slightly deceptive - a particular image of a dog is not analogous to the drop of food coloring - it is the entire class of images of dogs that is analogous to the drop of food coloring.

Particular images of dogs are actually analogous to particular atoms in the drop of food coloring. Recall from above that restricting our attention to one initial drop allowed us to model the diffusion process forward and backward in time.

![image.png](attachment:image.png)

Understanding how the diffusion process works in reverse-time allows us to trace individual atoms back to their starting points in the drop. In particular, we pick a random atom from the uniform food coloring, and then reverse time to see where in the initial drop of food coloring it started from.

![image-2.png](attachment:image-2.png)

We mimic this process with Diffusion Models. Analogously, we pick a random image of TV static (“atom”) and then go backwards through time to figure out where it started in the data distribution (“initial drop”). That is, we determine which image of a dog led to that image of TV static in forward-time.

![image-3.png](attachment:image-3.png)

With Diffusion Models, we model the physics that maps our data distribution to TV static. Since TV static is easy to generate, we pick a random image of TV static and run physics in reverse-time to generate a new image.

![image-4.png](attachment:image-4.png)

Diffusion Models lie at the foundation of much of the progress in Generative AI in the image domain. Text-to-image models like Imagen and DALL-E 2 augment this process, allowing us to tell the model what we want the generated image to look like.

## Applications:

Diffusion models have been applied to various applications in machine learning, including image generation, image denoising, and anomaly detection, among others.

### Image generation:

One of the most popular applications of diffusion models is image generation, where the models are trained to generate realistic and diverse images. The generated images are often of higher quality and diversity than those generated by other generative models, such as GANs and VAEs; paper 4 in the bibliography reports this comparison against GANs on standard image benchmarks.

![image.png](attachment:image.png)

### Image denoising:

Diffusion models can also be used for image denoising, where the models are trained to remove noise from images. The denoising is achieved by applying the forward transformations of the diffusion process to the noisy images, and then applying the inverse transformations to obtain denoised images.

### Anomaly detection:

Another application of diffusion models is anomaly detection, where the models are trained to detect anomalous data points in a given dataset. This is achieved by estimating the likelihood of the data points using the diffusion process, and then identifying the data points with low likelihood as anomalies.

## Steps

* Forward Process
* Reverse Process


### Step 1
![image.png](attachment:image.png)

![image-3.png](attachment:image-3.png)

### Step 2
![image-2.png](attachment:image-2.png)
![image-4.png](attachment:image-4.png)

The reverse process is learned using a neural network.

## Math

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

## Diffusion Models Architecture:

Diffusion models can be formulated as a continuous-time diffusion process, which is defined by a stochastic differential equation; the models discussed here use its discrete-time counterpart, a Markov chain with a finite number of steps. The basic idea behind diffusion models is to transform a simple and tractable distribution, such as a Gaussian distribution, into a complex and high-dimensional distribution that matches the true data distribution.

A diffusion model consists of two Markov chains. The forward chain is fixed and has no learned parameters: each step adds a small amount of Gaussian noise to the previous state to obtain the next state, with a noise variance that starts from a small value and increases over time, until the data is essentially indistinguishable from Gaussian noise. The reverse chain is learned: a neural network, typically a convolutional U-Net as in the implementation below, receives the noisy state and the time step and predicts how to undo one noising step. The individual steps are stochastic, so they are not invertible transformations in the strict sense; the network learns a distribution over the previous state, not an exact inverse.

To generate a sample, the model draws from the simple distribution, such as a Gaussian distribution, and applies the learned reverse steps one after another, from the last time step back to the first. The result is an approximate sample from the data distribution.
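
The sampling loop can be sketched on a toy problem where no training is needed. The sketch below assumes one-dimensional data distributed as $N(m, s^2)$, for which the expected noise given a noisy sample is known in closed form; the function `predict_noise` stands in for the trained neural network. The schedule (1000 steps, variance from $10^{-4}$ to $0.02$) follows paper 2 in the bibliography, and the update rule is the ancestral sampling step from that paper with the step variance set to the forward noise variance. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1-D data ~ N(m, s^2) and a linear noise schedule.
m, s = 2.0, 0.5
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # noise variance of each forward step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal fraction after each step

def predict_noise(x_t, t):
    """Exact E[noise | x_t] for Gaussian data; stands in for the trained network."""
    ab = alpha_bars[t]
    return np.sqrt(1 - ab) * (x_t - np.sqrt(ab) * m) / (ab * s**2 + 1 - ab)

# Start from pure Gaussian noise and apply the reverse steps, last step first.
x = rng.standard_normal(5_000)
for t in range(T - 1, -1, -1):
    eps_hat = predict_noise(x, t)
    mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise on the final step
    x = mean + np.sqrt(betas[t]) * noise

print(f"generated samples: mean={x.mean():.3f}, std={x.std():.3f}")
```

The generated samples should have mean and standard deviation close to `m` and `s`, even though the loop only ever sees Gaussian noise and the noise predictor. For images the closed-form predictor is replaced by a network trained as described in the next section.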

## Training Procedure:

The training procedure for diffusion models involves two key steps: the data pre-processing step and the diffusion step.

In the data pre-processing step, the data is pre-processed to obtain a set of samples that are normalized and centered around zero. This is typically achieved by subtracting the mean of the data and dividing by the standard deviation. The pre-processed data is then used to train the diffusion model.

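In the diffusion step, the model is trained to approximate the probability density function of the pre-processed data. The exact likelihood is intractable, so training maximizes a variational lower bound on the log-likelihood, which decomposes into one term per step of the diffusion process. In the simplified objective of paper 2 in the bibliography, each term reduces to a mean squared error between the noise that was added to a sample and the noise predicted by the network. The objective is optimized using stochastic gradient descent, with gradients computed by backpropagation. Langevin dynamics appears in the closely related score-based generative models, where it is used to draw samples and not to estimate training gradients.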
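
The noise-prediction form of the objective can be illustrated at a single noise level. In this sketch the data are assumed to be one-dimensional Gaussian, the "network" is a linear function of the noisy sample, and closed-form least squares stands in for stochastic gradient descent; these are simplifications for illustration, not the training setup of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1-D data ~ N(m, s^2) and one fixed noise level.
m, s = 2.0, 0.5
alpha_bar = 0.5          # fraction of signal variance kept at this step
n = 100_000

# Build training pairs (noisy sample, noise that was added).
x0 = rng.normal(m, s, n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# Fit eps_hat = w * x_t + c by minimizing the mean squared error.
features = np.column_stack([x_t, np.ones(n)])
(w, c), *_ = np.linalg.lstsq(features, eps, rcond=None)

loss_fitted = np.mean((w * x_t + c - eps) ** 2)
loss_zero = np.mean(eps ** 2)  # baseline: always predict zero noise

# For Gaussian data the optimal slope is known analytically.
w_exact = np.sqrt(1 - alpha_bar) / (alpha_bar * s**2 + 1 - alpha_bar)

print(f"fitted slope={w:.4f}, analytic slope={w_exact:.4f}")
print(f"loss fitted={loss_fitted:.4f}, loss of zero predictor={loss_zero:.4f}")
```

A real diffusion model repeats this regression across all noise levels at once, with the time step as an extra input to a much more flexible network.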

# Implementation in Mathematica

Here we follow:

https://mathematica.stackexchange.com/questions/269181/diffusion-probabilistic-model-in-deep-generative-modeling

For simplicity, we consider the example of generating handwritten digit images, learning from the MNIST dataset. First, we corrupt the target images by gradually adding Gaussian noise to them, eventually turning the original data distribution into an isotropic Gaussian distribution of the same dimension as the original data. Thereafter, we learn a hierarchy of neural nets to reverse the noisification process. Finally, starting from an isotropic Gaussian, we sequentially sample using the learned hierarchy of neural nets and obtain novel samples of the target distribution.

Below are the implementation details. First, let's load the data. Notice that, in the unconditional generation setting, we only need the 28×28 handwritten digit images. The class labels for these images are therefore discarded.

In [None]:
data=ResourceData[ResourceObject["MNIST"]][[;;,1]]; 
RandomSample[data,20]

We define the hyperparameters of the denoising diffusion probabilistic model:

In [None]:
size = {28, 28};(*dimension of the target data image*)
channel = 1;(*channel of the target data image,1 for grey scale,3 for RGB*)
T = 200;(*how many steps for corrupting the image*)
c = 32;(*base channel size of the UNET applied here*)
Tc = 16;(*encoding of time step for corrupting the image*)
depth = 3;(*depth of the UNET*)
batch = 256;(*mini-batch size for training*)
b1 = 10^-4;(*initial variance of noise*)
bT = 0.02;(*final variance of noise*)
b[t_] := b1 + (t - 1)/(T - 1)*(bT - b1);(*linear schedule for increasing variance of noise*)


We define a forward noising process that sequentially produces latents $X_1$ through $X_T$ by imposing Gaussian noise of variance $b_t$:

![image.png](attachment:image.png)

Here, $b_t$ is scheduled to increase linearly from $b_1$ to $b_T$:

![image-3.png](attachment:image-3.png)

Given the Gaussian nature of this corruption strategy, the distribution of the corrupted image at time step $t$ conditioned on its true value, as can be easily proved via mathematical induction, is:

![image-2.png](attachment:image-2.png)

where 

![image-4.png](attachment:image-4.png)
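
For reference, the formulas of this section can be written out in LaTeX. The version below is a reconstruction that matches the definitions of b, a and GXT in the code cells; it assumes the standard variance-preserving noising step, in which the previous state is scaled by $\sqrt{1-b_t}$, which is what the product form of $a_t$ implies.

$$
\begin{aligned}
q(X_t \mid X_{t-1}) &= \mathcal{N}\!\left(X_t;\ \sqrt{1-b_t}\,X_{t-1},\ b_t I\right), &
b_t &= b_1 + \frac{t-1}{T-1}\,(b_T - b_1), \\
q(X_t \mid X_0) &= \mathcal{N}\!\left(X_t;\ \sqrt{a_t}\,X_0,\ (1-a_t)\, I\right), &
a_t &= \prod_{i=1}^{t} (1 - b_i).
\end{aligned}
$$

The induction step goes as follows. Suppose $X_{t-1} = \sqrt{a_{t-1}}\,X_0 + \sqrt{1-a_{t-1}}\,\epsilon'$ with $\epsilon' \sim \mathcal{N}(0, I)$. One more noising step with independent $\epsilon_t \sim \mathcal{N}(0, I)$ gives

$$
X_t = \sqrt{1-b_t}\,X_{t-1} + \sqrt{b_t}\,\epsilon_t
    = \sqrt{(1-b_t)\,a_{t-1}}\,X_0 + \sqrt{(1-b_t)(1-a_{t-1})}\,\epsilon' + \sqrt{b_t}\,\epsilon_t .
$$

The two independent Gaussian terms add up to a single Gaussian with variance $(1-b_t)(1-a_{t-1}) + b_t = 1 - (1-b_t)\,a_{t-1} = 1 - a_t$, and the coefficient of $X_0$ is $\sqrt{a_t}$, which is the claimed form at step $t$.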

Therefore, we can directly obtain corrupted image samples at any time step t:

In [None]:
a[t_]:=(Table[1-b[i],{i,t}])/.List->Times;
GXT[x0_,t_,noise_]:=Sqrt[a[t]]*x0+Sqrt[1-a[t]]*noise
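
For readers following along in Python, the two Mathematica definitions above can be ported to NumPy as in the sketch below. The hyperparameter values are copied from the cell that defines them; the names `a_table` and `gxt` and the random placeholder image are choices made here for illustration.

```python
import numpy as np

# Hyperparameters copied from the Mathematica cell above.
T = 200
b1, bT = 1e-4, 0.02

def b(t):
    """Linear schedule for the noise variance at step t (1-based)."""
    return b1 + (t - 1) / (T - 1) * (bT - b1)

# a_table[t] = prod_{i=1..t} (1 - b(i)); a_table[0] = 1 is the empty product.
a_table = np.concatenate([[1.0], np.cumprod(1.0 - b(np.arange(1, T + 1)))])

def gxt(x0, t, noise):
    """Corrupted image at step t, sampled directly from the clean image x0."""
    return np.sqrt(a_table[t]) * x0 + np.sqrt(1.0 - a_table[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.random((28, 28))  # placeholder for one MNIST image with values in [0, 1]

for t in (0, 1, 100, 200):
    xt = gxt(x0, t, rng.standard_normal(x0.shape))
    print(f"t={t:3d}  a[t]={a_table[t]:.4f}  std of corrupted image={xt.std():.3f}")
```

The printed `a[t]` is the fraction of the original signal variance that survives at step `t`. With these hyperparameters it is still clearly above zero at `t = 200`, so the final images are heavily corrupted but not yet exactly Gaussian.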

Below are examples of the corrupted images across time steps. As $t \to \infty$, we can always transform any original data distribution into an isotropic Gaussian distribution.

In [None]:
Block[{steps=Join[{0,1},Range[10,200,10]],selects=RandomSample[data,6]},
TableForm[Table[Image[GXT[ImageData[slc],t,RandomReal[NormalDistribution[0,1],size]]],{slc,selects},{t,steps}],
TableSpacing->{.5,.5},TableAlignments->Center,TableHeadings->{None,Map[Style["t="<>ToString[#],{FontFamily->"Arial",10}]&,steps]}]]

Next, we learn a hierarchy of neural nets to sequentially reverse the noisification. Specifically, we need to learn $p_{\theta_t}(X_{t-1} \mid X_t)$ for $t \in [1, T]$, where $p_{\theta_t}$ is a probability distribution function parameterized by a neural net with parameter $\theta_t$. The loss function for training this hierarchy of neural nets is a reformulation of the negative log likelihood (for proof, see Equations 47-58 in https://arxiv.org/pdf/2208.11970.pdf).

# Advantages and limitations:

Diffusion models have several advantages over other generative models. One advantage is that they can generate high-quality and diverse samples, as they are able to model complex and high-dimensional data distributions. Their training procedure is also simpler and more stable than that of models such as GANs, whose adversarial training can be difficult to tune.

Another advantage of diffusion models is their ability to handle missing data and incomplete data. This is because the diffusion process can be applied to incomplete data, and the missing values can be imputed by sampling from the estimated probability distribution.

However, diffusion models also have some limitations. One limitation is that they can be computationally expensive to train, especially for large datasets and high-dimensional data. Sampling is also slow compared with single-pass generators such as GANs, because generating one sample requires one evaluation of the neural network per reverse step.

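Another limitation concerns the number of steps. Each reverse step is modeled as a Gaussian, which is unimodal; this is a good approximation only when the noise added per step is small, so many steps are needed. The data distribution itself is not assumed to be unimodal, since the chain of many small steps can represent distributions with multiple modes. With too few steps, however, the generated samples can become overly smooth and lack diversity.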

## Future Research:

Despite the limitations of diffusion models, they hold great potential for future research in machine learning. One promising direction is to combine diffusion models with other generative models, such as GANs and VAEs, to create hybrid models that have the advantages of both models. For example, a hybrid model could use the diffusion process to model the low-frequency components of the data distribution and use a GAN to model the high-frequency components.

Another promising direction is to explore the use of diffusion models for unsupervised and semi-supervised learning tasks. Diffusion models can be used to estimate the probability distribution of the data, which can be used for clustering, dimensionality reduction, and other unsupervised learning tasks. They can also be used for semi-supervised learning, where a small set of labeled data is used to guide the generation of new samples.

## Conclusion:

Diffusion models are a class of generative models that have gained significant attention in recent years due to their ability to model complex and high-dimensional data distributions. These models are based on the diffusion process, which is a stochastic process that describes the spread of particles or molecules from regions of high concentration to regions of low concentration. A diffusion model pairs a fixed forward Markov chain, which gradually adds noise to the data, with a learned reverse chain, in which each step is applied to the current state to obtain the next, less noisy state.

Diffusion models have been applied to various applications in machine learning, including image generation, image denoising, and anomaly detection, among others. They have several advantages over other generative models, including their ability to generate high-quality and diverse samples and their simpler and more stable training. However, they also have some limitations, such as their computational cost and the large number of steps needed to generate a sample.

Future research in diffusion models holds great promise, particularly in the areas of hybrid models and unsupervised and semi-supervised learning. With continued research and development, diffusion models have the potential to become a valuable tool for machine learning and data analysis.

__Exercises:__

Follow the tutorials:

https://e-dorigatti.github.io/math/deep%20learning/2023/06/25/diffusion.html

https://thiago-lira.medium.com/a-toy-diffusion-model-you-can-run-on-your-laptop-20e9e5a83462

https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html