 # Warning !
This workshop assumes you know how to use an RL environment and build a neural network in PyTorch.\
You might want to familiarize yourself with [Q learning](https://github.com/PoCInnovation/Workshops/tree/master/ai/Reinforcement_Learning) and [PyTorch](https://github.com/PoCInnovation/Workshops/tree/master/ai/Pytorch) before you begin this workshop.

<center>

# DQN - Deep Q Network implementation in PyTorch

> We set out to create a single algorithm that would be able to develop
> a wide range of competencies on a varied range of challenging tasks [...]
> To achieve this, we developed a novel agent, a deep Q-network
> (DQN), which is able to combine reinforcement learning with a class
> of artificial neural network known as deep neural networks.

<cite>
- Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning (2015).
</cite>

<br>

<img src="./landing.gif" style="border-radius: 10px; margin: 10px; height: 300px; width: 500px">

</center>




The agent pictured above is the result of 30 minutes of training over 1000 episodes of at most 1000 frames each. During this workshop, your job will be to solve the [LunarLander](https://gymnasium.farama.org/environments/box2d/lunar_lander/) environment by implementing a DQN.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import random
import gymnasium as gym

env = gym.make("LunarLander-v2", render_mode="rgb_array")

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

<center>

## 1. Q learning Recap

</center>

Let's recap the most important things to remember from the [Q learning](https://github.com/PoCInnovation/Workshops/tree/master/ai/Reinforcement_Learning) workshop in a series of exercises:

### a. Q function

Implement this function in Python:

![Q Function](https://wikimedia.org/api/rest_v1/media/math/render/svg/7c8c6f219d5ceabd052cb058a5135bfdac86dc0c)


First define the `target_value()` function, then use it to compute `q_new`.
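
As a quick sanity check before working with tensors, the update rule can be traced by hand on plain numbers. The sketch below is only an illustration: `td_update` and the sample values are hypothetical and not part of the workshop code. It is written in the "current estimate plus learning rate times error" form, which is algebraically the same as `(1 - learning_rate) * q_current + learning_rate * target`.

```python
# Illustrative scalar version of the Q-learning update
# (hypothetical helper, not the workshop's expected solution).
def td_update(q_current: float, reward: float, q_next_max: float,
              learning_rate: float, gamma: float) -> float:
    # Target: immediate reward plus the discounted value of the best next action
    target = reward + gamma * q_next_max
    # Move the current estimate a fraction `learning_rate` of the way towards the target
    return q_current + learning_rate * (target - q_current)

q_new_example = td_update(q_current=2.0, reward=1.0, q_next_max=3.0,
                          learning_rate=0.1, gamma=0.9)
print(f"{q_new_example:.2f}")  # 2.17
```

Here the target is `1.0 + 0.9 * 3.0 = 3.7`, so the estimate moves from `2.0` towards `3.7` by 10% of the gap.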

In [None]:
torch.manual_seed(42)

LEARNING_RATE = 0.05
GAMMA = 0.99

q_current_m = torch.rand(4).unsqueeze(-1)
q_next_m = torch.rand(4).unsqueeze(-1)
reward_m = 100

def target_value(reward: float, gamma: float, q_next: torch.Tensor) -> torch.Tensor:
    # Enter your code here
    return None

# Enter your code here (use the `target_value()` function)
q_new = None

print(f"q_new is: {q_new}\n")
print(f"Expected: {torch.as_tensor([5.8575, 5.8990, 5.3764, 5.9506]).unsqueeze(-1)}")

### b. Epsilon Greedy

Implement the epsilon greedy algorithm in Python:

```
with probability `epsilon`: act randomly
otherwise: act greedily
```

Use `action_space.sample()` to get a random action and `greedy_action` to get the greedy action
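
The pseudocode above can be sketched on a plain list of Q-values, without Gymnasium or PyTorch. This is a minimal illustration under assumptions: `pick_action` is a hypothetical name, and it draws the random action with Python's `random` module rather than with `action_space.sample()`.

```python
import random

# Hypothetical stand-alone sketch: returns an index into a plain list of Q-values.
def pick_action(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.7, 0.3, 0.2]
print(pick_action(q_values, epsilon=0.0))  # 1 (always greedy when epsilon is 0)
```

With `epsilon=1.0` the same call always explores, which is why the assertion in the next cell only checks that the action is in range.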

In [None]:
greedy_action_m = q_current_m.argmax().detach().item()
action_space_m = env.action_space

def epsilon_greedy_action(epsilon: float, greedy_action: int, action_space: gym.spaces.Discrete):
    pass

action = epsilon_greedy_action(1, greedy_action_m, action_space_m)
assert (0 <= action < action_space_m.n), "action should be between 0 and 3"

### c. Neural Network

Implement a Neural Network using PyTorch:

- You can use any architecture you want but a [linear transformation](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) will suffice.
- Use `env` for your [input](https://www.gymlibrary.ml/content/api/#gym.Env.observation_space) and [output](https://www.gymlibrary.ml/content/api/#gym.Env.action_space) layer sizes (see links for documentation) because this model will be used to predict the best `action` for each given `state` (observation)
- Don't forget the [activation functions](https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions)

In [None]:
class NeuralNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()

        # Enter your code here: ~ 3 lines depending on the number of layers you want
        
        

        #

    def forward(self, x):
        # Enter your code here: ~ 3 lines depending on the number of layers you want
        
        

        #
        return x

    def predict(self, x):
        return x.argmax().detach().item()

print(NeuralNetwork(env))

Nice ! The functions you've defined during this little recap will come in handy during the rest of the workshop, so make sure they work as they should !

<center>

## 2. DQN

</center>

Now, it's time for us to learn what a DQN is !

### a. The Algorithm Explained

The original DQN paper, [Human-level control through deep reinforcement learning](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf), defines the algorithm as follows (see page 7 of the document):

![algorithm](algo.png)

You may notice some familiar elements:

- <b>action-value function Q with random weights θ</b>:\
this is the neural network we defined earlier
- <b>with probability ε</b>:\
this is our epsilon-greedy strategy
- <b>execute action in emulator</b>:\
`env.step(action)`
- <b>observe reward r / set s<sub>t+1</sub></b>:\
`new_state, reward, done, _, _ = env.step(action)`
- <b>targets function</b>:\
it is the same function we defined earlier in `target_value()`
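
To make the last bullet concrete, here is a minimal scalar sketch of the target from the algorithm, including its special case for transitions that end the episode. The name `dqn_target`, the `done` convention (1.0 when the episode terminated, 0.0 otherwise) and the sample numbers are assumptions made for illustration; the workshop code works on batched tensors instead.

```python
GAMMA = 0.99  # discount factor, same value as in the recap

# Hypothetical helper: target for a single transition, using plain Python numbers.
def dqn_target(reward: float, done: float, next_q_values: list, gamma: float = GAMMA) -> float:
    # Value of the best action available in the next state
    best_next = max(next_q_values)
    # (1 - done) removes the bootstrap term when the episode ended on this transition
    return reward + gamma * (1.0 - done) * best_next

print(dqn_target(reward=1.0, done=1.0, next_q_values=[5.0, 7.0]))  # 1.0
```

When `done` is 0.0, the same call returns `1.0 + 0.99 * 7.0`, the reward plus the discounted best next value.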

With these in mind, the algorithm should already make more sense to you.
But let's try to understand the new elements:
- <b>replay memory D</b>:\
The DQN relies on the agent's past experiences for its training.\
The replay memory `D` is a list of the moments in the agent's life: each entry holds the `state`, `action`, `reward`, `done` and `new_state` of one iteration of our `for` loop.\
Its capacity `N` is the size of this memory: once `N` elements are stored, the next element replaces the oldest one in the memory (one good way to implement this in Python is to use a [deque](https://docs.python.org/3/library/collections.html#collections.deque)).\
We use this memory to [retrieve a minibatch](https://www.w3schools.com/python/ref_random_sample.asp) of these moments, aka `transitions` for our training. 
- <b>target action-value function Q<sup>-</sup> with weights θ<sup>-</sup> = θ</b>:\
The DQN uses two neural networks: one generally called the `online_network`, and its clone, the `target_network`, which copies the `online_network`'s weights every `C` steps.\
This method is necessary to stabilize learning and prevent [catastrophic forgetting](https://en.wikipedia.org/wiki/Catastrophic_interference).\
`target_value()` will use the value returned by `target_network.forward()` as its `q_next` argument.
- <b>gradient descent</b>:\
We update the `online_network`'s weights with a [gradient descent](https://pytorch.org/docs/stable/optim.html#taking-an-optimization-step) step.
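
The behaviour of the replay memory described above can be observed on a toy example. This is only a sketch under assumptions: the capacity of 3 and the placeholder transitions are made up for illustration, and the workshop's `Memory` class wraps the same ideas in methods.

```python
import random
from collections import deque

memory = deque(maxlen=3)  # capacity N = 3, tiny on purpose

for step in range(5):
    # Each transition is (state, action, reward, done, new_state); values are placeholders
    memory.append((step, 0, 1.0, False, step + 1))

# Only the 3 most recent transitions remain: the oldest ones were dropped automatically
print([t[0] for t in memory])  # [2, 3, 4]

# A minibatch is a random sample of stored transitions, drawn without replacement
minibatch = random.sample(list(memory), 2)
```

Sampling at random, rather than replaying transitions in order, is what lets the agent train on a mix of old and recent experiences.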

### b. Implement the algorithm

Let's try to implement all of these concepts in Python; feel free to scroll back up if you don't remember the explanations. They will make more sense to you as you go through each of them one at a time.

<center>

#### Replay Memory

In [None]:
from collections import deque
import numpy as np
BATCH_SIZE = 32

class Memory():
    def __init__(self, N):
        self.D = deque(maxlen=N)
        
    # note that we also store the `done` value in our memory
    # we will use it when we set the target_value 
    # for the `if episode terminates at step j + 1` condition
    def store_transition(self, state, action, reward, done, new_state):
        # Enter your code here: ~ 1 line
        return None
    
    def retrieve_transitions(self):
        # Enter your code here: ~ 1 line
        transitions = None

        # Retrieving each element from sample
        states = ([t[0] for t in transitions])
        actions = ([t[1] for t in transitions])
        rewards = ([t[2] for t in transitions])
        dones = ([t[3] for t in transitions])
        new_states = ([t[4] for t in transitions])

        # Converting elements to tensors 
        # and adding a dimension where needed with unsqueeze()
        states_t = torch.as_tensor(np.array(states), dtype=torch.float32)
        actions_t = torch.as_tensor(np.array(actions), dtype=torch.int64).unsqueeze(-1)
        rewards_t = torch.as_tensor(np.array(rewards), dtype=torch.float32).unsqueeze(-1)
        dones_t = torch.as_tensor(np.array(dones), dtype=torch.float32).unsqueeze(-1)
        new_states_t = torch.as_tensor(np.array(new_states), dtype=torch.float32)

        return states_t, actions_t, rewards_t, dones_t, new_states_t

<center>

#### Networks

- Find a method that copies a network's weights onto another network.
> You might find something like that in the [official PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html)... (no need to save the parameters beforehand, there is an easier method for our purpose which fits in one line)
- We will also set up the [optimizer](https://pytorch.org/docs/stable/optim.html) for our online network and a [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions)

In [None]:
LEARNING_RATE = 5e-4

def update_target_network():
    # Enter your code here: ~ 1 line
    pass
    #

online_network = NeuralNetwork(env)
target_network = NeuralNetwork(env)

# Choose an optimizer and set it to `online_network`'s parameters
optimizer = None
# Choose a loss function
criterion = None

<center>

#### Algorithm

#### [Don't know how to use Tensorboard with PyTorch ?](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html)

In [None]:
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

# N
MEMORY_CAPACITY = 10000
# M
EPISODES = 1000
# T
FRAMES = 1000
# C
UPDATE_FREQUENCY = 1000


Before we train the algorithm, we need to fill our replay memory with transitions collected by taking random actions:

In [None]:
replay = Memory(MEMORY_CAPACITY)

state, _ = env.reset()
for frame in range(MEMORY_CAPACITY):
    action = env.action_space.sample()

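    new_state, reward, terminated, truncated, _ = env.step(action)

    # only `terminated` is stored as `done`: a time-limit truncation is not a terminal state
    replay.store_transition(state, action, reward, terminated, new_state)

    state = new_state

    # start a new episode when the current one has ended for either reason
    if terminated or truncated:
        state, _ = env.reset()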

If all your methods are well implemented, you should have a functional DQN agent.

Don't forget to view the progress using Tensorboard\
(usually on `localhost:6006` after running `tensorboard --logdir=runs` inside the workshop directory).
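
The training cell that follows decreases `epsilon` by `2 / EPISODES` at the start of every episode, down to a floor of 0.1. The sketch below replays that schedule on its own so you can see when exploration stops decreasing. `epsilon_schedule` is a hypothetical helper written for this illustration, not part of the workshop code.

```python
EPISODES = 1000  # same value as in the workshop

# Hypothetical helper: returns the epsilon used in each episode.
def epsilon_schedule(episodes: int = EPISODES, start: float = 1.0, floor: float = 0.1) -> list:
    epsilon = start
    values = []
    for _ in range(episodes):
        # Same update as in the training loop: linear decay, clipped at the floor
        epsilon = max(floor, epsilon - 1.0 / episodes * 2)
        values.append(epsilon)
    return values

schedule = epsilon_schedule()
# Index of the first episode that runs with epsilon at (or within rounding error of) the floor
first_floor_episode = next(i for i, e in enumerate(schedule) if e <= 0.1 + 1e-9)
print(first_floor_episode, schedule[-1])
```

With these values the floor is reached around episode 450, a little before the halfway point, and every later episode keeps 10% of random actions.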

In [None]:
epsilon = 1.0
steps = 0

for episode in range(EPISODES):
    state, _ = env.reset()
    # epsilon decays linearly and reaches its floor of 0.1 a little before half the episodes are finished
    epsilon = max(0.1, epsilon - 1.0 / EPISODES * 2)

    episode_reward = 0
    episode_loss = []

    for frame in range(FRAMES):
        # we need to convert state into a tensor to pass it into our network
        state_t = None
        q_values = online_network.forward(state_t)
        greedy_action = online_network.predict(q_values)
        
        action = epsilon_greedy_action(epsilon, greedy_action, env.action_space)

        new_state, reward, done, _, _ = env.step(action)
        episode_reward += reward

        replay.store_transition(state, action, reward, done, new_state)

        states, actions, rewards, dones, new_states = replay.retrieve_transitions()

        # if dones is 1, meaning the episode terminated on that transition,
        # the target_value becomes Y = rewards
        # because everything else is multiplied by (1 - dones) = 0
        Y = None

        action_q_values = online_network.forward(states).gather(dim=1, index=actions)

        # Enter your code: ~ 1 line
        loss = None
        
        episode_loss.append(loss.item())

        # Perform a gradient descent step using the optimizer and the loss: ~ 3 lines
        
        

        #

        if steps % UPDATE_FREQUENCY == 0:
            update_target_network()

        steps += 1

        state = new_state

        if episode % 50 == 0:
            env.render()

        if done: 
            break

    writer.add_scalar("Reward/train", episode_reward, episode)
    writer.add_scalar("Loss/train", np.mean(episode_loss), episode)

    if episode % 10 == 0:
        print(f"Episode {episode}:")
        print(f"\tReward:\t{episode_reward}")
        print(f"\tLoss:\t{np.mean(episode_loss)}")
        print(f"\tEpsilon:\t{epsilon}")

### c. Test the model

By running the code below, you will see the final version of your model after 1000 episodes of training.

In [None]:
env = gym.make("LunarLander-v2", render_mode="human")
state, _ = env.reset()
while True:
    state_t = torch.as_tensor(np.array(state), dtype=torch.float32)
    q_values = online_network.forward(state_t)
    action = online_network.predict(q_values)

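    new_state, reward, terminated, truncated, _ = env.step(action)

    state = new_state

    env.render()

    # reset when the episode ends, either on landing/crashing or on the time limit
    if terminated or truncated:
        state, _ = env.reset()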

<center>

## 3. Improvements

</center>

Well, you've trained a DQN to play the LunarLander environment and it's doing pretty well, huh ?

Here are a few things you could do if you're interested in learning more within the field of Reinforcement Learning:

- Try changing the hyperparameters to find the optimal implementation of the algorithm:
    - the `BATCH_SIZE` or the `MEMORY_CAPACITY` could have an impact on the agent's long-term memory if you run into problems related to catastrophic forgetting after a few episodes
    - the learning rate, the optimizer and the loss function could have an impact on how well your agent learns: a popular optimizer for DQN is [RMSProp](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html) and the preferred loss function is [SmoothL1Loss](https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss)
    - maybe you can change the sizes of the hidden layers or their number
- Turn this DQN into a [DoubleDQN](https://arxiv.org/pdf/1509.06461.pdf)
- Remove the `target_network` entirely by using [DeepMellow](https://cs.brown.edu/~gdk/pubs/deepmellow.pdf)
- Try out [other environments](https://gymnasium.farama.org/): only some minor changes are required for most algorithms. For [Atari games](https://gymnasium.farama.org/environments/atari/), for example, you only need to preprocess the observations using [gym wrappers](https://gymnasium.farama.org/api/wrappers/) and use [convolutions](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html). Be warned, though, because the Atari environments will take a <b>lot</b> longer to train than LunarLander !