# Generative AI with ODEs and SDEs: An Introduction to Flow Matching and Diffusion Models

With the recent advancements in the Machine Learning ecosystem, Flow Matching & Diffusion (or Score Matching Denoising Diffusion) algorithms have become the state of the art for generative AI in non-sequential modalities. These frameworks drive powerful generative tools, from image generation (see [Stable Diffusion](https://example.com) or [Midjourney](https://www.midjourney.com/explore?tab=video_top)) to video generation (see [Sora](https://openai.com/sora/)) and even protein structure synthesis ([RF-Diffusion](https://www.bakerlab.org/2023/07/11/diffusion-model-for-protein-design/)). Their origins are rooted in physics, with ODEs and SDEs at their core.

This series of articles intends to:
- introduce Flow and Diffusion models from first principles in a simplified manner, along with the minimal amount of mathematics they require
- make the connections between these models and their foundations in physics more apparent
- guide the audience on how to implement and apply these algorithms

**Prerequisites**: I assume that you have a basic background in vector calculus and probability theory.


**Acknowledgements**: The content of this tutorial is the result of my efforts to understand MIT 6.S184: Generative AI with Stochastic Differential Equations by [Peter Holderrieth](https://www.peterholderrieth.com/) and the Flow Matching paper by [Yaron Lipman](https://scholar.google.com/citations?user=vyteiT4AAAAJ&hl=en).


#### In this article we will formalize what it means to "generate an image" and construct Flow and Diffusion Models.

## 1. From Generation to Sampling 

**Goal** : Formalize what it means to generate something.

#### 1.1 We represent objects that we want to generate as vectors

$$
z \in \mathbb{R}^d
$$

- Images are generally represented as a vector $z$ in the vector space $\mathbb{R}^{H \times W \times 3}$:

  $$z \in \mathbb{R}^{H \times W \times 3}$$

  where $H$ is the height of the image in pixels, $W$ is the width, and 3 is the number of color channels (Red, Green, Blue).
<br><br>
- Videos are represented in a similar manner, as a vector $z$ with an additional time axis:

  $$z \in \mathbb{R}^{T \times H \times W \times 3}$$

  where $T$ is the number of frames along the temporal axis of the video, and each frame is an image.
  <br><br>
- Molecular Structures, in a similar manner:

  $$z \in \mathbb{R}^{N \times 3}$$

  where $N$ is the number of atoms, and each atom has 3 coordinates.

For a generative task in any modality, be it Protein Structures or Images or Videos, our goal is to generate a corresponding vector representation.
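
As a minimal sketch of this idea, the snippet below builds random arrays with the shapes above and flattens each one into a single vector $z \in \mathbb{R}^d$. The sizes `H`, `W`, `T`, `N` and the helper `to_vector` are hypothetical choices made only for illustration, and the random values merely stand in for real pixels or atom positions.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
H, W = 4, 6      # image height and width in pixels
T = 5            # number of video frames
N = 10           # number of atoms in a molecule

rng = np.random.default_rng(0)

image = rng.random((H, W, 3))        # one RGB image
video = rng.random((T, H, W, 3))     # T RGB frames
molecule = rng.normal(size=(N, 3))   # N atoms, each with (x, y, z) coordinates


def to_vector(x):
    """Flatten an array of any shape into a single vector z in R^d."""
    return x.reshape(-1)


z_image = to_vector(image)
z_video = to_vector(video)
z_molecule = to_vector(molecule)

# d is simply the product of the axis sizes.
print(z_image.shape, z_video.shape, z_molecule.shape)

# Flattening loses nothing: reshaping recovers the original array.
print(np.array_equal(z_image.reshape(H, W, 3), image))
```

Whatever the modality, the generative model only ever sees a vector of $d$ numbers, where $d$ is the product of the axis sizes.
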
<br><br>

#### 1.2 Now let's think about what it means to successfully **generate** something.
<center>  Prompt: "Picture of a wolf"</center>

| ![Noise](./blogAssets/0_noise.png) | ![Frozen River](./blogAssets/1_river.png) | ![Eagle in Snow](./blogAssets/2_eagle.png) | ![Wolf in Snow](./blogAssets/wolf.png) |
|---|---|---|---|
| <center> Useless </center>| <center>Bad (better than just noise)</center> | <center>Wrong subject, Not Wolf (better than just a river)</center> |<center> Great! </center> |

We can qualitatively rank these images by how well they align with the prompt, but these are subjective statements. How do we *mathematically formalize* them?

In generative modeling, the language of **probability theory** is widely used to *mathematically formalize* "how good a generated sample is".

We introduce an object called the **data distribution**, which converts the statement "how good something is" into "how likely something is under a specific data distribution".

Assuming our data distribution is all the images on the internet, we pose the evaluation of images (or, more generally, samples) as: "how *likely* are we to find this image on the internet, given this specific prompt?"

<center>  Prompt: "Picture of a wolf"</center>

| ![Noise](./blogAssets/0_noise.png) | ![Frozen River](./blogAssets/1_river.png) | ![Eagle in Snow](./blogAssets/2_eagle.png) | ![Wolf in Snow](./blogAssets/wolf.png) |
|---|---|---|---|
| <center> Impossible </center>| <center>Extremely Rare</center> | <center>Unlikely</center> |<center> Very Likely! </center> |

<div style="border: 1px solid #ccc; padding: 10px; text-align: center; margin: 15px auto;">
  <h4 style="margin: 0;">How good an image is → How likely it is under the data distribution</h4>
</div>

With that, we have translated subjective evaluations into statements of probability theory.

#### 1.3 Formalizing Generation as Sampling from the Data Distribution
**Data Distribution**: the distribution $P_{data}$ of the objects that we want to generate.

In the case of continuous variables, data distributions are typically represented by probability density functions (PDFs). Hence, we will refer to the data distribution and its corresponding density function interchangeably as $P_{data}$.

$$\textbf{Probability Density}\space P_{data} : \mathbb{R}^d \rightarrow \mathbb{R}_{\ge 0}$$

$$z \mapsto P_{data}(z)$$

$P_{data}$ maps the vector space $\mathbb{R}^d$, which is the space of our objects, to the non-negative real numbers.

Given an object $z$, the value $P_{data}(z)$ tells you how likely that object is under $P_{data}$. Strictly speaking, it is a probability density rather than a probability, so it can exceed 1.

From this context we can establish that *generating high-quality samples* is equivalent to drawing samples from the target distribution $P_{data}$, i.e., generating samples $z \in \mathbb{R}^d$ for which $P_{data}(z)$ is high.
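
To make the notion of a density concrete, here is a small sketch that uses a 1-D mixture of two Gaussians as a stand-in for $P_{data}$. The mixture parameters and the function name `toy_density` are assumptions made for illustration; a real $P_{data}$ over images has no such closed form.

```python
import numpy as np

# Toy stand-in for P_data (an assumption for illustration):
# a 1-D mixture of two Gaussians.
WEIGHTS = np.array([0.3, 0.7])
MEANS = np.array([-2.0, 3.0])
STDS = np.array([0.5, 1.0])


def toy_density(z):
    """Evaluate the mixture density at the points z (scalar or 1-D array)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))[:, None]   # shape (n, 1)
    # Gaussian density of every component at every point, shape (n, 2).
    comps = np.exp(-0.5 * ((z - MEANS) / STDS) ** 2) / (STDS * np.sqrt(2 * np.pi))
    return comps @ WEIGHTS                                    # shape (n,)


# Points at the two modes get a high density, a point between them a low one.
print(toy_density([3.0, -2.0, 0.5]))

# A density is non-negative and integrates to 1 (checked with a Riemann sum).
grid = np.linspace(-10.0, 12.0, 20001)
dz = grid[1] - grid[0]
total_mass = np.sum(toy_density(grid)) * dz
print(total_mass)
```

In this toy setting, "good" objects are simply the points near $-2$ and $3$, where the density is large.
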

**Note**: We do not know what $P_{data}$ is; we postulate it as the distribution that we want to sample objects from. For a model that is supposed to generate images of dogs, we postulate a distribution $P_{data}$ that assigns high probability to images of dogs and lower probability to images that do not resemble dogs. We have access neither to the entire continuous distribution nor to an analytical form of its *PDF*, which is intractable given the high dimensionality of the vector space $\mathbb{R}^d$. Instead, we devise a framework that lets us sample objects $z$ from $P_{data}$ while relying only on a sparse, finite set of samples (our dataset) from this continuous distribution over $\mathbb{R}^d$.

With that, we have formalized **Generation as Sampling** from $P_{data}$:
$$
    z \sim P_{data} \implies z \text{ is a high-quality image, i.e., a sample with high probability under}\space P_{data}
$$

We roll the dice and get a sample $z \sim P_{data}$.

If we somehow do this correctly, we should get high-quality samples $z$ that have high likelihood under the postulated $P_{data}$.
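
The following sketch "rolls the dice" for a toy case where $P_{data}$ is known: a 1-D mixture of two Gaussians. The mixture parameters and the helper `sample_toy_data` are assumptions chosen for illustration. For real data we cannot sample like this, which is exactly the problem flow and diffusion models address.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for P_data (an assumption for illustration):
# a 1-D mixture of two Gaussians.
WEIGHTS = np.array([0.3, 0.7])
MEANS = np.array([-2.0, 3.0])
STDS = np.array([0.5, 1.0])


def sample_toy_data(n):
    """Draw n samples z ~ P_data: pick a component, then sample from it."""
    component = rng.choice(len(WEIGHTS), size=n, p=WEIGHTS)
    return rng.normal(loc=MEANS[component], scale=STDS[component])


samples = sample_toy_data(10_000)

# Most samples land near a mode (within 3 standard deviations of a mean),
# i.e., in regions where the density is high.
near_mode = (np.abs(samples[:, None] - MEANS) < 3 * STDS).any(axis=1)
print(f"fraction of samples near a mode: {near_mode.mean():.3f}")

# Roughly 70% of the samples come from the heavier component around z = 3.
print(f"fraction of samples above 0.5:   {(samples > 0.5).mean():.3f}")
```

Sampling correctly means the samples concentrate where the density is high, in proportion to how much probability mass each region carries.
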

#### 1.4 Dataset
As discussed before, we have access neither to the entire target distribution $P_{data}$ nor to an analytical expression for its PDF. However, we do have access to a dataset
$$\mathcal{D} = \{z_i\}_{i=1}^N, \quad z_i \sim P_{data}, \quad z_i \in \mathbb{R}^d$$
that consists of a finite number of independent and identically distributed samples from $P_{data}$.

As a consequence, in practice we compute Monte Carlo estimates of the expected loss and its gradients under $P_{data}$ using samples from $\mathcal{D}$ in order to optimize a network, as we will discuss further.
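
A Monte Carlo estimate replaces an expectation under $P_{data}$ with an average over samples: $\mathbb{E}_{z \sim P_{data}}[f(z)] \approx \frac{1}{N}\sum_{i=1}^{N} f(z_i)$. The sketch below illustrates this with assumed ingredients: a standard Gaussian stands in for $P_{data}$, and $f(z) = \lVert z \rVert^2$ stands in for a per-sample loss, chosen because its exact expectation is known to be $d$.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # dimension of each object z
N = 50_000     # dataset size

# Toy dataset: N i.i.d. samples from a standard Gaussian standing in for P_data.
dataset = rng.normal(size=(N, d))


def f(z):
    """Stand-in for a per-sample loss: squared norm of each row of z."""
    return np.sum(z ** 2, axis=-1)


# Monte Carlo estimate over the full dataset: (1/N) * sum_i f(z_i).
full_estimate = f(dataset).mean()

# A random mini-batch gives a cheaper but noisier estimate of the same quantity.
batch_idx = rng.choice(N, size=256, replace=False)
batch_estimate = f(dataset[batch_idx]).mean()

print(f"exact expectation:   {d}")
print(f"full-dataset estimate: {full_estimate:.3f}")
print(f"mini-batch estimate:   {batch_estimate:.3f}")
```

Both estimates are unbiased; the mini-batch version trades accuracy for speed.
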

A dataset consists of **finite samples** from the data distribution:
$$ z_{1}, \ldots, z_{N} \sim P_{data}$$

Examples:
- Images : Publicly available images on the internet
- Videos : YouTube
- Protein Structures : Scientific Data (e.g. Protein Data Bank)


#### 1.5 Conditional and Unconditional Generation
**Unconditional generation** means sampling from $P_{data}$ to generate high-quality samples that are likely under $P_{data}$.

$$Z \sim P_{data}$$

This can be implicitly thought of as fixed-prompt generation, where the samples reflect whatever the distribution $P_{data}$ encapsulates.

![owl1](./blogAssets/1_owl.png)![owl2](./blogAssets/2_owl.png)![owl3](./blogAssets/3_owl.png)

**Conditional Generation** lets us generate samples conditioned on a variable $y$, which can be a verbose prompt like *"moonlit night over a lotus pond in starry night style"* or a class label like "cat", "dog", or "owl". For this we introduce an object called the conditional data distribution, denoted $P_{data}(\cdot|y)$: the distribution of the data given the prompt $y$.

Conditional generation means sampling from the conditional data distribution:
$$ Z \sim P_{data}(\cdot|y)$$
For high-quality samples from conditional generation, it is imperative that the data distribution contains samples corresponding to the condition $y$.

| ![pond](./blogAssets/1_pond.png) | ![venice](./blogAssets/2_venice.png) | ![sunset](./blogAssets/3_sunset.png) |
|---|---|---|
| <center> Lotus Pond Starry Night </center>| <center>Venice canals in water color</center> | <center>Sunset snow covered mountains</center>|
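
The difference between the two modes of generation can be sketched with a toy model in which each class label $y$ has its own 2-D Gaussian. Everything here (the class means, `CLASS_STD`, the uniform prior over labels, and the helper names) is a hypothetical setup for illustration, not how a real text-to-image model represents prompts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: each label y has its own 2-D Gaussian P_data(. | y).
LABELS = ["cat", "dog", "owl"]
CLASS_MEANS = np.array([
    [-4.0, 0.0],   # cat
    [0.0, 4.0],    # dog
    [4.0, 0.0],    # owl
])
CLASS_STD = 0.5


def sample_conditional(y, n):
    """Draw n samples z ~ P_data(. | y) for a given label y."""
    mean = CLASS_MEANS[LABELS.index(y)]
    return mean + CLASS_STD * rng.normal(size=(n, 2))


def sample_unconditional(n):
    """Draw n samples z ~ P_data: pick a label uniformly, then sample z given y."""
    idx = rng.integers(len(LABELS), size=n)
    return CLASS_MEANS[idx] + CLASS_STD * rng.normal(size=(n, 2))


owls = sample_conditional("owl", 5_000)
anything = sample_unconditional(6_000)

print("mean of samples given y = 'owl':", owls.mean(axis=0))
print("mean of unconditional samples:  ", anything.mean(axis=0))
```

Conditioning on `"owl"` restricts the samples to one cluster, while unconditional sampling spreads them over all three.
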


The article is in progress, thank you for your patience :)