In [5]:
from gfn import GFNAgent

In [6]:
agent = GFNAgent(epochs=200)

First, let's take a look at the environment. The default is a 2D 8x8 grid with high reward in the corners.

In [None]:
agent.env.plot_reward_2d()

We can also look at the model structure. Notice that in this implementation, the learned parameter `z0` is separate from the neural net, and that the neural net has two output "heads": `forward_policy` and `backward_policy`.

In [None]:
agent.model.summary()

In [None]:
agent.z0

For this demonstration, we'll just show that the GFlowNet can learn a policy that generates trajectories ending at each position with probability proportional to that position's reward. To do that, we'll first sample a large training set using the untrained, random policy.

In [None]:
agent.sample(5000)
agent.plot_sampled_data_2d()
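
To make "proportional to the reward" concrete: the target distribution is just the reward normalized over all grid positions. Here is a minimal sketch in plain Python. The reward values (`0.1` and `2.0`) and the helper names `toy_reward` and `target_distribution` are made up for illustration; they are not the values or functions used by `agent.env`.

```python
# Hypothetical 8x8 reward: a small base reward everywhere and a
# larger reward in the four corner cells (values are made up).
GRID = 8
BASE_REWARD = 0.1
CORNER_REWARD = 2.0

def toy_reward(x, y, n=GRID):
    """Return the made-up reward for grid position (x, y)."""
    is_corner = x in (0, n - 1) and y in (0, n - 1)
    return CORNER_REWARD if is_corner else BASE_REWARD

def target_distribution(n=GRID):
    """Normalize the reward so it sums to 1 over all positions."""
    rewards = {(x, y): toy_reward(x, y, n) for x in range(n) for y in range(n)}
    total = sum(rewards.values())
    return {pos: r / total for pos, r in rewards.items()}

target = target_distribution()
# A perfectly trained sampler would end in a corner cell
# CORNER_REWARD / BASE_REWARD times as often as in any other cell.
ratio = target[(0, 0)] / target[(3, 3)]
print(f"corner vs. interior sampling ratio: {ratio:.1f}")
```

Normalizing by the total reward is what turns "reward" into "probability of ending there"; the ratio between any two cells is unchanged by the normalization.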

In [None]:
u_modes, u_positions = agent.count_modes()
print(f'There are {u_modes} unique modes and {u_positions} unique positions in the training data.')
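
For intuition, counting of this kind can be sketched in a few lines. This is not the implementation of `agent.count_modes()`: the definition of a "mode" as any position whose reward meets a threshold, and the names `count_unique` and `toy_mode_reward`, are assumptions for illustration.

```python
def count_unique(samples, reward_fn, mode_threshold):
    """Count distinct sampled positions and how many of them are modes.

    A "mode" here is any position whose reward is at least
    `mode_threshold`; this definition is an assumption.
    """
    unique_positions = set(samples)
    unique_modes = {p for p in unique_positions if reward_fn(p) >= mode_threshold}
    return len(unique_modes), len(unique_positions)

# Made-up reward: the corners of an 8x8 grid are the high-reward modes.
def toy_mode_reward(pos):
    return 2.0 if all(c in (0, 7) for c in pos) else 0.1

# Made-up terminal positions; (0, 0) appears twice but is counted once.
toy_samples = [(0, 0), (0, 0), (1, 0), (0, 1), (7, 0)]
n_modes, n_positions = count_unique(toy_samples, toy_mode_reward, mode_threshold=1.0)
print(n_modes, n_positions)  # prints "2 4"
```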

Before training, the policy is essentially uniform: at every point, moving vertically, moving laterally (arrows), and terminating (red octagon) are about equally likely.

The probability of terminating at each position is plotted below. Without training, it looks nothing like the reward environment we plotted above: the termination probabilities are large enough that a trajectory is unlikely to travel far from the origin (bottom left).

In [None]:
agent.plot_policy_2d()
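
The concentration near the origin can be checked by hand. The sketch below propagates probability mass through the grid under an exactly uniform policy. It assumes a simplified setup that may differ from the actual environment: trajectories start at `(0, 0)`, the only actions are "move +x", "move +y" and "terminate", and a move that would leave the grid is simply not offered. The function name `uniform_termination_probs` is hypothetical.

```python
def uniform_termination_probs(n=8):
    """Probability of terminating at each cell under a uniform policy.

    Assumed setup: trajectories start at (0, 0); the actions are
    "move +x", "move +y" and "terminate"; a move that would leave the
    grid is not offered, and the remaining actions are equally likely.
    """
    reach = [[0.0] * n for _ in range(n)]  # probability of visiting a cell
    term = [[0.0] * n for _ in range(n)]   # probability of stopping there
    reach[0][0] = 1.0
    for x in range(n):
        for y in range(n):
            # (x, y) only receives mass from (x-1, y) and (x, y-1),
            # both of which this loop order has already processed.
            can_x = x < n - 1
            can_y = y < n - 1
            p = 1.0 / (1 + can_x + can_y)  # uniform over the valid actions
            term[x][y] = reach[x][y] * p
            if can_x:
                reach[x + 1][y] += reach[x][y] * p
            if can_y:
                reach[x][y + 1] += reach[x][y] * p
    return term

term_probs = uniform_termination_probs()
near_origin = term_probs[0][0] + term_probs[1][0] + term_probs[0][1]
print(f"P(stop at origin)       = {term_probs[0][0]:.3f}")
print(f"P(stop within one step) = {near_origin:.3f}")
print(f"P(stop at far corner)   = {term_probs[7][7]:.2e}")
```

Under these assumptions, a third of all trajectories stop at the origin and more than half stop within one step of it, while the far corner is reached only a fraction of a percent of the time.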

In [None]:
l1_error_before = agent.compare_env_to_model_policy()

Let's train it and see if we can do better!

In [None]:
agent.train()

Let's plot the trained policy and sample from it to get a probability distribution over the environment:

In [None]:
agent.plot_policy_2d()

In [None]:
l1_error_after = agent.compare_env_to_model_policy()

In [None]:
print(f'L1 error before {l1_error_before:.2f} and after {l1_error_after:.2f}')
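
As a reference point for reading these numbers, here is one plausible way to compute an L1 error between a target distribution and a set of samples. This is an assumption for illustration; `compare_env_to_model_policy` may compute it differently. The names `l1_error`, `toy_target`, `good_samples` and `bad_samples` are hypothetical.

```python
from collections import Counter

def l1_error(target, samples):
    """Sum of |target(pos) - empirical(pos)| over all positions.

    `target` maps position -> probability; `samples` is a list of
    sampled terminal positions. Both names are illustrative.
    """
    counts = Counter(samples)
    n = len(samples)
    positions = set(target) | set(counts)
    return sum(abs(target.get(p, 0.0) - counts.get(p, 0) / n) for p in positions)

# Toy 1D example with three positions (made-up numbers).
toy_target = {0: 0.5, 1: 0.25, 2: 0.25}
good_samples = [0, 0, 1, 2]   # matches the target exactly
bad_samples = [0, 0, 0, 0]    # everything stuck at position 0

print(l1_error(toy_target, good_samples))  # prints 0.0
print(l1_error(toy_target, bad_samples))   # prints 1.0
```

With this definition the error is 0 for a perfect match and can be at most 2, which is reached when the two distributions place their mass on entirely different positions.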

While not perfect, the model has clearly learned to generate trajectories through the environment with probability roughly proportional to the reward! It's a far better approximation than the untrained policy, and it would likely improve with some tweaks (e.g. more training or a different optimizer).

In [2]:
from gfn import GFNAgent
agent = GFNAgent(epochs=200)
l = agent.sample_trajectories(2)

[[0.376489   0.25777655 0.36573445]
 [0.37648899 0.25777654 0.36573447]]
[[0. 1. 0.]
 [0. 0. 1.]]
[[0.34950482 0.31951355 0.33098163]
 [0.37648899 0.25777654 0.36573447]]
[[0. 0. 1.]
 [0. 0. 1.]]
[[0.34950482 0.31951355 0.33098163]
 [0.33333333 0.33333333 0.33333333]]
[[0. 0. 1.]
 [0. 0. 1.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 0. 1.]
 [1. 0. 0.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 1. 0.]
 [0. 0. 1.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[1. 0. 0.]
 [1. 0. 0.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 0. 1.]
 [1. 0. 0.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 1. 0.]
 [0. 1. 0.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 0. 1.]
 [0. 1. 0.]]
[[0.33333333 0.33333333 0.33333333]
 [0.33333333 0.33333333 0.33333333]]
[[0. 1. 0.]
 [1. 0. 0.]]
[[0.33333333 0.33333

In [None]:
l[0].shape, l[1].shape

In [None]:
import tensorflow as tf
one_hot_positions = tf.one_hot(l[0][0], 8, axis=-1)
one_hot_positions[0]

In [None]:
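import tensorflow_probability as tfp
tfd = tfp.distributions
# Sample one action per row of probabilities, then print True where the
# sample is the last action index (presumably the "terminate" action).
for action in tfd.Categorical(probs=l[1][0]).sample().numpy():
    print(action == (agent.action_space - 1))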

In [None]:
l[1][0]