# Exercises

1. How would you define reinforcement learning? How is it different from regular supervised or unsupervised learning?
2. Can you think of three possible applications of RL that were not mentioned in this lesson? For each of them, what is the environment? What is the agent? What are some possible actions? What are the possible rewards?
3. What is the discount factor? Can the optimal policy change if you modify the discount factor?
4. How do you measure the performance of a reinforcement learning agent?
5. What is the credit assignment problem? When does it occur? How can you alleviate it?
6. What is the point of using a replay buffer?
7. What is an off-policy RL algorithm?
8. Use policy gradients to solve OpenAI gym's LunarLander-v2 environment. You will need to install the Box2D dependencies (`pip install --user gym[box2d]`).
9. Use tf-agents to train an agent that can achieve a superhuman level at SpaceInvaders-v4 using any of the available algorithms.
10. If you have $100 to spare, you can purchase a Raspberry Pi 3 plus some cheap robotics components, install TensorFlow on the Pi, & go wild! For an example, check out the posts by Lukas Biewald, or take a look at GoPiGo or BrickPi. Start with simple goals like making the robot turn around to find the brightest angle (if it has a light sensor) or the closest object (if it has a sonar sensor), & move in that direction. Then you can start using deep learning: for example, if the robot has a camera, you can try to implement an object detection algorithm so it detects people & moves toward them. You can also try to use RL to make the agent learn on its own how to use the motors to achieve that goal. Have fun!

---

1. Reinforcement learning is a field of machine learning in which an agent makes observations & takes actions within an environment, & in return receives rewards. It learns to act in a way that maximises its expected reward over time. Reinforcement learning does have some form of supervision, through rewards: although we do not directly tell the agent how to perform a task, the rewards let it know when it is making progress or when it is failing. Supervised & unsupervised learning generally find patterns in the data & use them to make predictions, while in reinforcement learning the goal is to find a good policy. Reinforcement learning also has to find the right balance between exploring the environment, looking for new ways to get rewards, & exploiting sources of rewards it already knows, while supervised & unsupervised learning systems don't worry about exploration. Finally, training instances are typically assumed to be independent in supervised & unsupervised learning, while in reinforcement learning consecutive observations are generally strongly correlated. Sampling at random from a replay memory (buffer) reduces this correlation.
2. (*a*) Self-driving cars: the environment is the real world, the agent is the car, & the actions are the car's controls: forward & reverse acceleration, steering, etc. The rewards could be mostly penalties: a small penalty for each time step (you want to go from point A to B in as little time as possible) & larger penalties for breaking traffic laws or causing accidents. (*b*) YouTube's ads, or any ad or recommender system: the environment is YouTube (its users & content), the agent is the recommender system, the actions are the recommendations, & the rewards are views or traffic. (*c*) Drug discovery & design: the environment would be a simulated chemical environment used to analyse interactions between molecules & atoms in a human body, the agent would be the algorithm designing the compound, the actions could be moving the developed chemical compound through its chemical pathway, & the reward would be whether the compound does what it is intended to do.
3. One problem that reinforcement learning algorithms face is the credit assignment problem. When an agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. To tackle this problem, you can evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor $\gamma$ at each step: $R_n + \gamma R_{n + 1} + \gamma^2 R_{n + 2} + ...$ where $R_n$ is the reward earned at time step $n$. The discount factor is a value between 0 & 1: the closer it is to 0, the less future rewards count compared to immediate ones. Typical discount factors vary from 0.9 to 0.99. Changing the discount factor can change the optimal policy, because the weight of the future rewards changes as well.
4. Sum up the rewards it gets over an episode. If there are multiple episodes, look at the total reward per episode, averaged over the episodes.
5. Answered already (see answer 3).
6. Since consecutive observations are correlated, using a replay buffer helps to reduce the correlations in the training batch. All experiences are stored in the replay buffer & sampled at random at each training iteration, which suits gradient descent, since it works best when the training instances are roughly independent & identically distributed.
7. An off-policy algorithm is an algorithm where the policy being trained is not necessarily the one being executed. For example, the Q-learning example is an off-policy algorithm: the policy being executed (the exploration policy) is completely random, while the policy being trained will always choose the action with the highest Q-value. This is the opposite of an on-policy algorithm, which explores the world using the policy being trained (e.g., the policy gradients algorithm).
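
The off-policy idea in answer 7 can be made concrete with a minimal tabular Q-learning sketch. The three-state chain environment, the `step` and `q_learning` helpers, & the hyperparameters are all assumptions made up for illustration; they do not come from the lesson. The agent acts completely at random, yet the values it learns describe the greedy policy.

```python
import random

# Hypothetical 3-state chain: states 0 and 1 are non-terminal, state 2 is terminal.
# Action 0 moves left (staying put in state 0), action 1 moves right.
# Reaching state 2 gives a reward of 1.0; every other transition gives 0.
N_STATES, N_ACTIONS, TERMINAL = 3, 2, 2

def step(state, action):
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q_values = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_episodes):
        state = 0
        while state != TERMINAL:
            # Behaviour (exploration) policy: completely random.
            action = rng.randrange(N_ACTIONS)
            next_state, reward = step(state, action)
            # Target policy: greedy, hence the max over the next state's Q-values.
            target = reward + gamma * max(q_values[next_state])
            q_values[state][action] += alpha * (target - q_values[state][action])
            state = next_state
    return q_values

q_values = q_learning()
# Greedy action in each non-terminal state, read off the learned Q-values.
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_values[s][a])
                 for s in range(TERMINAL)]
print(greedy_policy)
```

Even though the actions were sampled uniformly at random, the greedy policy extracted at the end moves right in both non-terminal states, because the update bootstraps from the best next action rather than from the action the random policy would actually take.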

# 8.

In [None]:
import gymnasium

env = gymnasium.make("LunarLander-v2", render_mode = "human")
env.observation_space
obs, info = env.reset()  # Gymnasium's reset() returns (observation, info)
obs

In [None]:
env.action_space

In [None]:
import tensorflow as tf
from tensorflow import keras

n_inputs = env.observation_space.shape[0]
n_outputs = env.action_space.n

model = keras.models.Sequential([
    keras.layers.Input(shape = [n_inputs]),
    keras.layers.Dense(32, activation = "relu"),
    keras.layers.Dense(32, activation = "relu"),
    keras.layers.Dense(n_outputs, activation = "softmax")
])

In [None]:
import numpy as np

def lander_play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        probas = model(obs[np.newaxis])
        logits = tf.math.log(probas + keras.backend.epsilon())
        action = tf.random.categorical(logits, num_samples = 1)
        loss = tf.reduce_mean(loss_fn(action, probas))
    grads = tape.gradient(loss, model.trainable_variables)
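    obs, reward, terminated, truncated, info = env.step(action[0, 0].numpy())
    # The episode ends when the env terminates it or the time limit truncates it
    return obs, reward, terminated or truncated, grads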

def lander_play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs, info = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = lander_play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

In [None]:
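def discount_rewards(rewards, discount_rate):
    # Use floats so that integer rewards are not truncated by the in-place update
    discounted = np.array(rewards, dtype = np.float64)
    # Walk backwards: each return is its own reward plus the discounted next return
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted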

def discount_and_normalise_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
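
A small worked example may help check the discounting by hand. The sketch below re-implements the same backward recursion under a separate, hypothetical name (`discounted_returns`) so that it runs on its own. The two reward sequences are made up for illustration: they show how the preferred choice can flip when the discount factor changes.

```python
import numpy as np

def discounted_returns(rewards, discount_rate):
    # Same backward recursion as discount_rewards: G_t = R_t + discount_rate * G_{t+1}
    returns = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        returns[step] += returns[step + 1] * discount_rate
    return returns

# Worked example: -50 stays as is, 0 + 0.8 * (-50) = -40, 10 + 0.8 * (-40) = -22
example = discounted_returns([10, 0, -50], discount_rate=0.8)
print(example)

# Two hypothetical reward sequences: a small reward now vs. a larger reward later.
reward_now = [1.0, 0.0, 0.0]
reward_later = [0.0, 0.0, 2.0]
for gamma in (0.5, 0.9):
    return_now = discounted_returns(reward_now, gamma)[0]
    return_later = discounted_returns(reward_later, gamma)[0]
    preferred = "now" if return_now > return_later else "later"
    print(f"gamma={gamma}: now={return_now:.2f}, later={return_later:.2f}, preferred={preferred}")
```

With a discount factor of 0.5 the immediate reward has the larger return (1.0 against 0.5), while with 0.9 the delayed reward wins (1.62 against 1.0).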

In [None]:
n_iterations = 200
n_episodes_per_update = 16
n_max_steps = 1000
discount_rate = 0.99

optimiser = keras.optimizers.Nadam(learning_rate = 0.005)
loss_fn = keras.losses.sparse_categorical_crossentropy

In [None]:
mean_rewards = []

for iteration in range(n_iterations):
    all_rewards, all_grads = lander_play_multiple_episodes(env, n_episodes_per_update,
                                                           n_max_steps, model, loss_fn)
    mean_reward = sum(map(sum, all_rewards)) / n_episodes_per_update
    print("\rIteration: {}/{}, mean reward: {:.1f} ".format(iteration + 1,
                                                            n_iterations,
                                                            mean_reward), end = "")
    mean_rewards.append(mean_reward)
    all_final_rewards = discount_and_normalise_rewards(all_rewards, discount_rate)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean([final_reward * all_grads[episode_index][step][var_index]
                                     for episode_index, final_rewards in enumerate(all_final_rewards)
                                     for step, final_reward in enumerate(final_rewards)], axis = 0)
        all_mean_grads.append(mean_grads)
    optimiser.apply_gradients(zip(all_mean_grads, model.trainable_variables))
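
The per-iteration mean rewards tend to be noisy, so a smoothed curve is often easier to read when judging whether the agent is improving. A minimal sketch, assuming a hypothetical `moving_average` helper & window size, neither of which is part of the notebook; the numbers are placeholders rather than real training results.

```python
import numpy as np

def moving_average(values, window):
    # Average each run of `window` consecutive values (no padding at the edges).
    values = np.asarray(values, dtype=np.float64)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return np.convolve(values, np.ones(window) / window, mode="valid")

# Illustrative numbers only, standing in for the `mean_rewards` list collected above.
fake_mean_rewards = [-200.0, -180.0, -190.0, -150.0, -120.0, -130.0]
smoothed = moving_average(fake_mean_rewards, window=3)
print(smoothed)
```

In the notebook, the same helper could be applied to `mean_rewards` once training has finished.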

# 9.

In [None]:
import gym

env = gym.make("SpaceInvaders-v0", render_mode = "human")
height, width, channels = env.observation_space.shape
actions = env.action_space.n
env.unwrapped.get_action_meanings()

In [None]:
model = keras.models.Sequential([
    keras.layers.Input(shape = [3, height, width, channels]),
    keras.layers.Conv2D(32, (8, 8), 4, activation = "relu"),
    keras.layers.Conv2D(64, (4, 4), 2, activation = "relu"),
    keras.layers.Conv2D(64, (3, 3), activation = "relu"),
    keras.layers.Flatten(),
    keras.layers.Dense(512, activation = "relu"),
    keras.layers.Dense(256, activation = "relu"),
    keras.layers.Dense(actions, activation = "linear")
])

In [None]:
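# Agent, replay memory & exploration policy classes from keras-rl
from rl.agents import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import EpsGreedyQPolicy, LinearAnnealedPolicy

def build_agent(model, actions):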
    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr = "eps", value_max = 1.,
                                  value_min = 0.1, value_test = 0.2, nb_steps = 10000)
    memory = SequentialMemory(limit = 1000, window_length = 3)
    dqn = DQNAgent(model = model, memory = memory, policy = policy,
                   enable_dueling_network = True, dueling_type = "avg",
                   nb_actions = actions, nb_steps_warmup = 1000)
    return dqn
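
The exploration rate is annealed over training rather than kept fixed. The sketch below shows the linear schedule implied by the arguments above (`value_max = 1.`, `value_min = 0.1`, `nb_steps = 10000`). The function `annealed_epsilon` is a hypothetical stand-in written for illustration, not the library's own code, & it leaves out `value_test`, which is assumed to apply only when the agent is evaluated.

```python
def annealed_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    # Decrease linearly from value_max at step 0 to value_min at nb_steps, then stay flat.
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)

for step in (0, 5000, 10000, 20000):
    print(step, round(annealed_epsilon(step), 3))
```

Early in training almost every action is random (epsilon close to 1), & by step 10000 only about one action in ten is.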

In [None]:
dqn = build_agent(model, actions)
dqn.compile(keras.optimizers.Adam(learning_rate = 1e-4))
dqn.fit(env, nb_steps = 10000, verbose = 2)
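
The memory passed to the agent plays the role of the replay buffer from answer 6. Its core idea can be sketched with a bounded queue & uniform random sampling. The `ReplayBuffer` class below is a hypothetical, simplified illustration: it is not the library's implementation & it ignores the `window_length` frame stacking.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, limit, seed=None):
        # A bounded deque drops the oldest experience once `limit` is reached.
        self.experiences = deque(maxlen=limit)
        self.rng = random.Random(seed)

    def append(self, obs, action, reward, next_obs, done):
        self.experiences.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of consecutive steps.
        return self.rng.sample(list(self.experiences), batch_size)

    def __len__(self):
        return len(self.experiences)

buffer = ReplayBuffer(limit=1000, seed=42)
for t in range(1500):
    # Dummy transitions: the observation is just the time step.
    buffer.append(obs=t, action=0, reward=0.0, next_obs=t + 1, done=False)
batch = buffer.sample(batch_size=32)
print(len(buffer), len(batch))
```

After 1500 insertions only the most recent 1000 transitions remain, & each sampled batch mixes steps from across that window instead of 32 consecutive ones.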