# Getting started: `gfn` from scratch

The goal of this notebook is two-fold:

* to take you through the intuitions of why and how to use GFlowNets
* to teach you the basic structures and components of `gfn` to train your own GFlowNets

This is neither a theoretical course on GFlowNets (see resources at the end) nor a guide on how to invent your own `gfn` components (we have [other tutorials for that](link)).

If you're already familiar with GFlowNets, you can skip to **part 3** where we look at the code.

Let's get started!

## 1. GFlowNets in a few words

GFlowNets are a family of methods to construct *generative models*. In other words, we want to be able to obtain / create / sample new objects from a given distribution.

The key notion with GFlowNets is that we will *sequentially* build these objects, and we will train neural networks (you'll soon understand where they are needed) such that when we're done, the probability to obtain a sample is *proportional* to its "quality".

This *quality* of a sample must be measurable by what we will call a *reward function*. If we're creating Lego spaceships block by block, then we need to be able to assess how "good" a resulting spaceship is. Let's say that we can, *i.e.* that we have a function that takes a Lego construction and returns an unnormalized score for it.

Now what the GFlowNet procedure does is:
* Start from an initial state (empty construction)
* Give that state to a neural network that will output a probability distribution over potential next blocks to add to the construction (and where to place them)
  * There's also a special action the network can take that means "Stop, I'm done"
* When that stop action is sampled, then we give that sample to the reward function and obtain a score, to tell the neural network how good a job it did
* Then we update the neural network and start over from the initial state
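
The steps above can be sketched as a plain Python loop. This is only an illustration under stated assumptions: `toy_policy`, `toy_reward`, `BLOCKS` and `MAX_BLOCKS` are hypothetical stand-ins (a uniform random policy instead of a neural network, a made-up score instead of a real reward), and none of this is the `gfn` API.

```python
import random

# Hypothetical toy setup (not the `gfn` API): a construction is a list of
# block names, and the "network" is a placeholder uniform policy.
BLOCKS = ["brick", "wing", "engine"]
STOP = "stop"
MAX_BLOCKS = 5


def toy_policy(state):
    """Stand-in for the neural network: a distribution over next actions."""
    if len(state) >= MAX_BLOCKS:
        # Force the stop action so that every trajectory is finite.
        return {STOP: 1.0}
    actions = BLOCKS + [STOP]
    return {action: 1.0 / len(actions) for action in actions}


def toy_reward(state):
    """Stand-in reward function: an unnormalized, non-negative score."""
    return 1.0 + state.count("engine")


def sample_trajectory(rng):
    state = []  # initial state: the empty construction
    while True:
        probs = toy_policy(state)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        if action == STOP:  # the special "Stop, I'm done" action
            return state, toy_reward(state)
        state = state + [action]  # add a block and keep building


rng = random.Random(0)
sample, reward = sample_trajectory(rng)
print(sample, reward)
# A real training loop would now use `reward` to update the network.
```

The only thing a trained GFlowNet changes in this picture is `toy_policy`: instead of being uniform, it is learned so that finished constructions come out with probability proportional to their reward.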

Forget about losses etc. for now. The GFlowNet jargon sounds a lot like Reinforcement Learning and to some extent, it *is*. The key difference, though, is that the GFlowNet is not trained to *maximize* its return (=discounted sum of rewards); it is trained to sample *proportionally to it*.

What this means is that a Lego spaceship with quality / reward `r` is twice as likely to be generated by the GFlowNet as a spaceship with reward `r / 2`! I'll insist because this is one of the **most important** features of GFlowNets: they will sample proportionally to the reward, and that ensures *diversity*.
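
To make "proportional to the reward" concrete, here is a tiny worked example. The three spaceships and their rewards are made up for illustration; the point is only how unnormalized rewards turn into target sampling probabilities.

```python
# Hypothetical rewards for three finished spaceships.
rewards = {"ship_a": 4.0, "ship_b": 2.0, "ship_c": 2.0}

# Target distribution of a converged GFlowNet: reward / sum of all rewards.
total = sum(rewards.values())
target = {name: r / total for name, r in rewards.items()}
print(target)  # {'ship_a': 0.5, 'ship_b': 0.25, 'ship_c': 0.25}

# A pure reward maximizer would instead always return the single best ship.
best = max(rewards, key=rewards.get)
print(best)  # ship_a
```

`ship_a` has twice the reward of `ship_b`, so it is sampled twice as often, but `ship_b` and `ship_c` still show up half of the time between them.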

In other words, if you have a multimodal distribution, a converged GFlowNet is guaranteed to sample *all modes*, proportionally to their value / reward. And while GFlowNets are not the only method with that capability, they are the only ones with neural networks at their core to "intelligently" (in a data-driven way and across samples & trajectories) explore the space they sample from.



## 2. Key concepts

As explained, the **reward function** is the function that takes in a sample and outputs its quality. That reward is expected to be non-negative and the higher the better.

Sometimes the reward function is not easily accessible, and one can use a **proxy** instead. A proxy can be, for instance, a neural network trained on some data sets to produce the quantity of interest.

A **state** is a representation of the object we seek to construct. It does not have to be a complete object: there is the *initial* state we start from, *intermediate* states (which could have a value), and *terminal* states when we say: this is an actual, finished sample from the distribution.

The possible steps that update a state in order to construct an object to sample are called **actions**.

The sequence of states and actions taken from the initial state to a terminal state is called a **trajectory**. You take an action to go from one state to the next in the trajectory that sequentially builds the final sample from the initial state.

If you see a state as a *node* and an action as an *edge* then the space of all possible trajectories from the initial state (called **source node**) can be seen as a **DAG**: a directed acyclic graph. This is actually a *condition* for GFlowNets to work: when one creates an environment, they must prevent any action that would create a cycle.

The probability distribution we assign to the actions that can be taken from a given state is called a **forward policy**.

Sometimes (depending on the loss used) you need to go backwards in the graph: access a node's parents and not its children. The probability distribution we assign to the actions that *would have led to the current state* is called a **backward policy**.
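
Here is a minimal sketch of states, children, parents and the two policies on a tiny grid. Everything in it is an assumption made for illustration (a 3x3 grid, actions that can only increment one coordinate, uniform policies) and it does not use the `gfn` API. Because a coordinate can only grow, no sequence of actions can ever come back to a previous state, which is what makes the graph a DAG.

```python
SIZE = 3  # hypothetical 3x3 grid; a state is an (x, y) tuple


def children(state):
    """States reachable with one action: increment x, or increment y."""
    x, y = state
    out = []
    if x + 1 < SIZE:
        out.append((x + 1, y))
    if y + 1 < SIZE:
        out.append((x, y + 1))
    return out


def parents(state):
    """States from which one action leads to `state`."""
    x, y = state
    out = []
    if x > 0:
        out.append((x - 1, y))
    if y > 0:
        out.append((x, y - 1))
    return out


def uniform_policy(neighbors):
    """Placeholder policy: equal probability for every neighbor."""
    return {s: 1.0 / len(neighbors) for s in neighbors}


print(children((0, 0)))  # [(1, 0), (0, 1)]
print(parents((1, 1)))   # [(0, 1), (1, 0)]

forward = uniform_policy(children((0, 0)))  # forward policy at the source node
backward = uniform_policy(parents((1, 1)))  # backward policy at (1, 1)
```

In a trained GFlowNet, the forward policy (and sometimes the backward policy) would be produced by a learned estimator rather than being uniform.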

We want to *learn* transitions in the graph. This will be done by **estimators**. Typically, estimators will be neural networks.

In the **Flow-Matching Loss** (ref: paper), one seeks to match the incoming flows to the outgoing flows of states. We'll therefore have an estimator for the *forward policy* and apply it to:

1. All the parent nodes of a given `s` such that `p -> s`
2. All the children nodes of a given `s` such that `s -> c`
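
As a sketch of what "matching incoming and outgoing flows" means, the snippet below checks the condition on a tiny hand-written DAG. The edge flows, state names and the squared log-ratio residual are assumptions chosen for illustration (the residual is one common way to write the objective); in practice the flows would come from a learned estimator, and this is not the `gfn` implementation.

```python
import math

# Hypothetical edge flows on a diamond-shaped DAG:
# s0 -> a -> t and s0 -> b -> t, where t is the only terminal state.
edge_flow = {
    ("s0", "a"): 1.0,
    ("s0", "b"): 2.0,
    ("a", "t"): 1.0,
    ("b", "t"): 2.0,
}
reward = {"t": 3.0}  # reward of terminal states; 0 for all other states


def inflow(s):
    """Sum of the flows on edges p -> s, over all parents p of s."""
    return sum(f for (p, c), f in edge_flow.items() if c == s)


def outflow(s):
    """Sum of the flows on edges s -> c, over all children c of s."""
    return sum(f for (p, c), f in edge_flow.items() if p == s)


def flow_matching_residual(s):
    """Squared log-ratio between what enters s and what leaves it.

    The reward counts as flow leaving the graph at terminal states.
    """
    return (math.log(inflow(s)) - math.log(reward.get(s, 0.0) + outflow(s))) ** 2


# The source s0 has no parents, so the condition is only checked elsewhere.
residuals = {s: flow_matching_residual(s) for s in ["a", "b", "t"]}
print(residuals)
```

With these hand-picked flows every residual is zero: what enters each state equals what leaves it, and the flow reaching `t` equals its reward. Training with a flow-matching objective pushes estimated flows towards that situation.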

In the **Trajectory-Balance Loss**, one additionally needs to estimate the *backward policy*. We therefore need to have a second estimator for it. We also need an estimator for Z, the flow's normalizing constant.
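
For a single complete trajectory ending in a sample `x`, trajectory balance asks that `Z` times the product of the forward probabilities equals `R(x)` times the product of the backward probabilities. The sketch below writes the corresponding squared log-difference; the function name and the numbers are hypothetical, and in practice `log_Z` and the per-step probabilities would come from the estimators.

```python
import math


def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, reward):
    """Squared gap between log(Z * prod P_F) and log(R(x) * prod P_B)."""
    lhs = log_Z + sum(log_pf_steps)
    rhs = math.log(reward) + sum(log_pb_steps)
    return (lhs - rhs) ** 2


# Hypothetical two-step trajectory: each forward step has probability 0.5,
# each backward step has probability 1 (every state has a single parent).
loss = trajectory_balance_loss(
    log_Z=math.log(4.0),
    log_pf_steps=[math.log(0.5), math.log(0.5)],
    log_pb_steps=[math.log(1.0), math.log(1.0)],
    reward=1.0,
)
print(loss)
```

Here `4 * 0.5 * 0.5 = 1 * 1 * 1`, so the loss is zero up to floating-point error. Any mismatch between the two sides gives a positive loss that training tries to reduce.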

In other words, an **estimator** is a function that is updated / learned from trajectories.

The last but maybe most important concept to understand is the **environment**: this is the aggregation of how states are represented, what the reward is for terminal states (samples), which actions are allowed from a given state, which actions lead to a given state, etc. In that sense, it is very close to a Reinforcement Learning environment: the GFlowNet takes actions and gets states and rewards from the environment.

## 3. `gfn` in practice

Let's start with a toy task:

* we're given an environment (a 100x100, 2D grid)
* and its reward function, that places high rewards in the corners of the grid

<p align='center'>
    <img src='../../assets/imgs/examples/notebooks/2D-grid-reward.png' width=200>
</p>
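
To give a feel for such a reward, here is a hypothetical function with the same qualitative shape: low everywhere, high near the four corners. The function name, constants and thresholds are made up for illustration and are not necessarily those used by the environment in the picture.

```python
SIZE = 100  # the grid is SIZE x SIZE


def corner_reward(x, y, size=SIZE, base=0.1, bonus=2.0, width=0.25):
    """Hypothetical reward: `base` everywhere, plus `bonus` near the corners."""
    # Distance of each coordinate from the grid center, rescaled to [0, 0.5].
    dx = abs(x / (size - 1) - 0.5)
    dy = abs(y / (size - 1) - 0.5)
    near_corner = dx > width and dy > width
    return base + (bonus if near_corner else 0.0)


# Normalizing the rewards gives the distribution we would like to sample from.
total = sum(corner_reward(x, y) for x in range(SIZE) for y in range(SIZE))
target = {
    (x, y): corner_reward(x, y) / total
    for x in range(SIZE)
    for y in range(SIZE)
}
print(target[(0, 0)] / target[(50, 50)])  # how much likelier a corner is
```

On a grid this small we can afford to enumerate every point and normalize by hand. The GFlowNet is meant for cases where that enumeration is out of reach.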

Remember that we want to **sample** proportionally to the reward. In other words, we want our GFlowNet to generate 2D points in the 100x100 space (=samples) and that the probability to obtain any point in the grid is proportional to its reward (=the value of that point).

In this example, not only do we know the exact reward function, but the state-space is relatively small: there's only `100 * 100 = 1e4` possible points in the grid. But if you had a 5D grid of size 1000 that would be `1e15` possible values