## Local Setup

If you prefer to work locally, see the following instructions for setting up Python in a virtual environment.
You can then ignore the instructions in "Colab Setup".

If you haven't yet, create a [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) environment using:
```
conda create --name rl_exercises
conda activate rl_exercises
```
Torch recommends installation using conda rather than pip, so run e.g.:
```
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
```
For this exercise, you require a CUDA-enabled GPU, as training an image-based model on the CPU takes a very long time.
Visit [the installation page](https://pytorch.org/get-started/locally/) to see the options available for different CUDA versions.
The remaining dependencies can be installed with pip:
```
pip install matplotlib numpy tqdm ipykernel "gymnasium[atari, accept-rom-license, classic-control, other]" stable-baselines3
```

Even if you are running the Jupyter notebook locally, please run the code cells in **Colab Setup**, because they define some global variables required later.

## Colab Setup

Google Colab provides you with a temporary environment for Python programming.
While this conveniently works on any platform and handles dependency issues internally, it also requires you to set up the environment from scratch every time.
The "Colab Setup" section below will be part of **every** exercise and contains utilities that are needed before getting started.

**IMPORTANT**: For this exercise, you require a GPU runtime environment, as training an image-based model on the CPU takes a very long time.
To do this, select "Change runtime type" from the context menu in the top right corner (next to the **Connect** button), and select **T4 GPU**.

Colab has a timeout of about 12 hours while it is active (and less if you close your browser window).
Any changes you make to the Jupyter notebook itself should be saved to your Google Drive.
We also save all recordings and logs in it by default so that you won't lose your work in the event of an instance timeout.
However, you will need to re-mount your Google Drive and re-install packages with every new instance.

In [None]:
"""Your work will be stored in a folder called `rl_ws23` by default to prevent Colab 
instance timeouts from deleting your edits.
We do this by mounting your google drive on the virtual machine created in this colab 
session. For this, you will likely need to sign in to your Google account and allow
access to your Google Drive files.
"""

from pathlib import Path
try:
    from google.colab import drive
    drive.mount("/content/gdrive")
    COLAB = True
except ImportError:
    COLAB = False

# Create paths in your google drive
if COLAB:
    DATA_ROOT = Path("/content/gdrive/My Drive/rl_ws23")
    DATA_ROOT.mkdir(parents=True, exist_ok=True)

    DATA_ROOT_STR = str(DATA_ROOT)
    %cd "$DATA_ROOT"
else:
    DATA_ROOT = Path.cwd() / "rl_ws23"

# Install python packages
if COLAB:
    %pip install matplotlib numpy tqdm "gymnasium[atari, accept-rom-license, classic-control, other]" stable-baselines3

# Exercise 2

Designed by Ge Li (ge.li@kit.edu) and Balázs Gyenes, inspired by the official PyTorch DQN implementation and the cleanrl implementation of DQN on Atari.

In this homework, we are going to implement the Deep Q-Network algorithm and apply it to the Atari 2600 game "Breakout".
The Atari 2600 was a popular game console during the 1980s, and Breakout is a game where a few layers of bricks line the top of the screen and the goal is to destroy them all by repeatedly bouncing a ball off a paddle into them.

![The Breakout Environment](https://gymnasium.farama.org/_images/breakout.gif)

Refer to <https://gymnasium.farama.org/environments/atari/breakout/> for more details.

Gymnasium's Atari environment is used to simulate the game.
The game's action space has 4 discrete actions: **NOOP**, **FIRE**, **RIGHT**, **LEFT**.
(**NOOP** is the term for an action that has no effect, short for "no-operation".)
The player starts each game (or each trajectory) with 5 lives, and loses a life whenever they miss an incoming ball.
A new ball must be triggered with the **FIRE** action.
The player gets one point per brick that they hit with the ball, after which this brick is destroyed.
If the player loses all 5 lives, the game (and the trajectory) will be over.
The final score for each game is the accumulated score of all 5 lives.

You can play the game yourself by running the cell below! (Only works locally, not in Google Colab)

In [None]:
if not COLAB:
    import gymnasium as gym
    from gymnasium.utils.play import play

    env = gym.make(
        "ALE/Breakout-v5",
        repeat_action_probability=0.0,  # deterministic mode
        render_mode="rgb_array",  # render to a numpy array for display with other tools
    )
    print(f"Action meanings: {env.unwrapped.get_action_meanings()}")
    print(
        "Try Breakout yourself! Press the spacebar to start, and a/d to move the paddle left and right. Press Esc to exit."
    )
    play(
        env,
        keys_to_action={
            " ": 1,  # restart after lost life
            "d": 2,  # right
            "a": 3,  # left
        },
        noop=0,
        fps=20,
        zoom=4,
    )

### Code Overview

The code below is organized into the following blocks.
The bolded sections require your input:
 * Import statements and utility functions for plotting
 * Environment setup including Gymnasium wrappers
 * Hyperparameters
 * Updating the exploration rate, epsilon
 * The replay buffer
 * The QNetwork based on a CNN
 * **Action selection during policy rollout**
 * **Model optimization based on samples from the replay buffer**
 * **The main training loop, which coordinates rollouts, updating the replay buffer, and network updates**

In [None]:
import random  # the standard library's random module is the best way to sample from a deque
from collections import deque  # the replay buffer will be implemented as a deque
from collections import namedtuple  # transitions will be stored as namedtuples
import time
from typing import Sequence

# Progress bar
from tqdm import tqdm

# Plotting
import matplotlib.pyplot as plt
from IPython import display

# Although we use Deep Learning, we still need to use some CPU memory for the
# Replay Buffer, since Image data normally takes a lot of space and GPU memory
# is more expensive than CPU memory
import numpy as np

# Deep Learning Platform
import torch
import torch.nn as nn
import torch.optim as optim

SEED = 2

# Set random seeds (the standard library's random module is used to sample from the replay buffer)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Use GPU to speed up your training
# detail: the code below will detect the hardware you have and set the
# device to Nvidia "cuda" instead of "cpu".
assert (
    torch.cuda.is_available()
), "Change your Colab runtime to use a GPU, as training an image-based network takes forever on a CPU."
device = torch.device("cuda")


# Required to display matplotlib figures as cell output
%matplotlib inline

def plot_rewards(num_games: Sequence[int], average_score: Sequence[float]) -> None:
    """
    Plot average game score
    Args:
        num_games: x axis for different number of games
        average_score: y axis for average game scores

    Returns:
        None
    """
    display.clear_output(wait=True)
    plt.clf()  # clear the previous curve so that lines do not pile up on the same figure
    plt.plot(num_games, average_score)
    plt.xlabel("Num of games")
    plt.ylabel("Mean game score")
    display.display(plt.gcf())

### Environment setup
Create the environment and apply some wrappers to preprocess the data.
On each step, the wrapper will return the last 4 grayscale screenshots as the observation (or state).
For more preprocessing details, please refer to [Machado et al. (2018)](https://arxiv.org/abs/1709.06009).

In [None]:
import gymnasium as gym
from gymnasium.wrappers import FrameStack, AtariPreprocessing
from gymnasium.utils.save_video import save_video
from stable_baselines3.common.atari_wrappers import FireResetEnv

# create the environment, which has some wrappers by default
env = gym.make(
    "ALE/Breakout-v5",
    frameskip=1,  # disable frameskip, as it's applied in AtariPreprocessing
    repeat_action_probability=0.0,  # deterministic mode
    render_mode="rgb_array_list",  # render to a numpy array for saving with other tools
)

# apply typical preprocessing used for Atari environments, according to Machado et al. (2018)
# Noop Reset: Obtains the initial state by taking a random number of no-ops on reset, default max 30 no-ops.
# Frame skipping: Repeats each action for frame_skip=4 frames. This is why frameskip=1 was passed to gym.make above.
# Max-pooling: Pools over the most recent two observations from the frame skips
# Termination signal when a life is lost: When the agent loses a life, the episode is terminated.
# Resize to a square image: Resizes the original Atari observation from 210x160 to 84x84 by default
# Grayscale observation: Converts the colour observation to greyscale (enabled by default).
env = AtariPreprocessing(env, frame_skip=4, terminal_on_life_loss=True)

# Each time we reset the environment, press the "FIRE" button to launch the next ball
env = FireResetEnv(env)

# the agent observes the last 4 frames from the environment
env = FrameStack(env, num_stack=4)

# reset and apply random seed
env.reset(seed=SEED)
NUM_ACTIONS = env.action_space.n

# show the stack of wrappers and the underlying environment
print(env)

### Set hyperparameters

In [None]:
# In Google Colab, this code block renders as a form where hyperparameters can be adjusted.

# @markdown Number of transitions to sample from the replay buffer for each update:
BATCH_SIZE = 32  # @param {type: "integer"}
# @markdown Discount factor:
GAMMA = 0.99  # @param {type: "number"}
# @markdown Initial exploration rate (epsilon):
EPS_START = 1.0  # @param {type:"slider", min:0, max:1, step:0.05}
# @markdown Final exploration rate (epsilon) after epsilon decay is complete:
EPS_END = 0.05  # @param {type:"slider", min:0, max:1, step:0.05}
# @markdown Exploration rate (epsilon) decays linearly during initial fraction of training:
EPS_FRACTION = 0.1  # @param {type:"slider", min:0, max:1, step:0.05}
# @markdown Target network is updated every N steps of the environment
TARGET_UPDATE_INTERVAL = 10000  # @param {type: "integer"}

# @markdown Learning rate (alpha)
LEARNING_RATE = 1e-4  # @param {type: "number"}
# @markdown Replay buffer size
BUFFER_SIZE = 100000  # @param {type: "integer"}
# @markdown Minimum number of environment samples collected before learning of the Q function begins
LEARNING_STARTS = 10000  # @param {type: "integer"}
# @markdown Total number of environment steps for training
TOTAL_STEPS = 500000  # @param {type: "integer"}

# @markdown After backpropagation, clip the gradients if their total magnitude is greater than this:
MAX_GRAD_NORM = 10  # @param {type: "number"}

# @markdown Optimize the NN every N environment steps:
TRAIN_FREQ = 4  # @param {type: "integer"}

### Update exploration rate
The exploration rate (epsilon) will decrease from 1 to 0.05 during the first 10% of training steps.
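
As a rough illustration of such a schedule, here is a minimal sketch that assumes plain linear interpolation between the start and end values. `linear_epsilon` and its arguments are hypothetical names used only for this example; the values passed in mirror the default hyperparameters above.

```python
def linear_epsilon(
    step: int, total_steps: int, start: float, end: float, fraction: float
) -> float:
    """Linearly interpolate from `start` to `end` over the first `fraction` of training."""
    # Share of the decay period that has elapsed, capped at 1 once decay is complete
    progress = min(step / (total_steps * fraction), 1.0)
    return start + progress * (end - start)


# Epsilon at the start, halfway through the decay, at its end, and long after it
for step in (0, 25_000, 50_000, 250_000):
    print(step, round(linear_epsilon(step, 500_000, 1.0, 0.05, 0.1), 3))
# 0 -> 1.0, 25000 -> 0.525, 50000 -> 0.05, 250000 -> 0.05
```

With these numbers, the agent acts fully at random at the start and keeps a 5% chance of a random action for the remaining 90% of training.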

In [None]:
def update_eps(total_time_steps: int) -> float:
    """
    This is a helper function to update exploration rate, the exploration rate
    will decrease from 1 to 0.05 during the first 10% of the training steps

    Args:
        total_time_steps: total time steps from the start of the training

    Returns:
        eps: exploration rate

    """
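    # Fraction of the decay period that has elapsed, capped at 1
    progress = min(total_time_steps / (TOTAL_STEPS * EPS_FRACTION), 1.0)
    # Linear interpolation, so that EPS_END is reached exactly when the decay period ends
    return EPS_START + progress * (EPS_END - EPS_START)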

### Replay buffer

The replay buffer stores the last N transitions, where a transition consists of a state, an action, a reward, the associated next state, and flags indicating whether the trajectory terminated or was truncated.
Typically, the replay buffer stores the last 1 million transitions; we use 100K transitions for simplicity.
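
The two mechanisms the buffer relies on can be tried out in isolation. This is a toy sketch only: `Step` is a hypothetical, reduced transition type with made-up contents, and the buffer size of 3 is chosen just to make the eviction visible.

```python
import random
from collections import deque, namedtuple

# Hypothetical reduced transition, for illustration only
Step = namedtuple("Step", ("state", "action", "reward"))

buffer = deque([], maxlen=3)  # keeps only the 3 most recent items
for t in range(5):
    buffer.append(Step(state=f"s{t}", action=t % 2, reward=float(t)))

# The two oldest entries (s0 and s1) were evicted automatically
oldest_to_newest = [step.state for step in buffer]  # ['s2', 's3', 's4']

# Sample with replacement, then transpose a list of Steps into one Step of tuples
random.seed(0)
samples = random.choices(buffer, k=4)
batch = Step(*zip(*samples))
# batch.state, batch.action and batch.reward are now tuples of length 4
```

Note that `random.choices` samples with replacement, so the same transition can appear more than once in a batch.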

In [None]:
# Definition of transition stored by the replay buffer
Transition = namedtuple(
    "Transition",
    ("state", "action", "next_state", "reward", "terminated", "truncated"),
)


class ReplayBuffer:
    """
    Definition of replay buffer used by DQN
    """

    def __init__(self, capacity: int) -> None:
        """Initialize the ReplayBuffer with certain capacity"""
        self.memory = deque([], maxlen=capacity)

    def push(self, *args: np.ndarray) -> None:
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(
        self,
        batch_size: int,
    ) -> tuple[
        torch.Tensor,
        torch.Tensor,
        torch.Tensor,
        torch.Tensor,
        torch.Tensor,
        torch.Tensor,
    ]:
        """
        Sample transitions and transfer them from numpy array to PyTorch tensor
        Args:
            batch_size: mini batch size

        Returns:
            6 batches of data: states, actions, next states, rewards,
            terminated flags and truncated flags
        """
        # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
        # a detailed explanation). This converts batch-array of Transitions
        # to Transition of batch-arrays.
        batch = Transition(*zip(*random.choices(self.memory, k=batch_size)))

        # prepare each element by concatenating, converting to tensor and pushing
        # to device
        terminated_b = torch.tensor(batch.terminated, device=device)
        truncated_b = torch.tensor(batch.truncated, device=device)
        state_b = torch.from_numpy(np.stack(batch.state)).to(device)
        next_state_b = torch.from_numpy(np.stack(batch.next_state)).to(device)
        action_b = torch.from_numpy(np.stack(batch.action)).to(device)
        reward_b = torch.from_numpy(np.stack(batch.reward)).to(device, torch.float)
        return state_b, action_b, next_state_b, reward_b, terminated_b, truncated_b

    def __len__(self) -> int:
        """Length of the ReplayBuffer"""
        return len(self.memory)


# Instantiate
replay_buffer = ReplayBuffer(BUFFER_SIZE)

### Deep Q-Network with CNN

The following neural network is trained to represent the Q function, and implicitly represents the agent's policy as well.
The input is the state `s`, and the output is a vector of the Q values for all possible actions in this state.
Each state is a 3-dimensional Tensor of shape `(channel, height, width)`.
During rollout, there is no batch dimension, as there is only one environment.
Therefore, the input to the network is only 3-dimensional.
During optimization, the network receives a batch of states sampled from the replay buffer, and the input is therefore 4-dimensional, with an extra batch dimension in the front.
The underlying PyTorch code always expects 4-dimensional Tensors, so we add a batch dimension where needed and remove it again before returning.
Images of uint8 type \[0, 255\] are converted to floats and rescaled to \[0, 1\] before being passed to the network.
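
The network in the next cell flattens the convolutional features into a vector of length 3136 before the first fully connected layer. The short calculation below shows where that number comes from, assuming unpadded convolutions on 84x84 inputs with the kernel sizes and strides used in the class definition. `conv_out` is a hypothetical helper written for this check.

```python
def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one spatial axis."""
    return (size - kernel) // stride + 1


size = 84  # height and width of one preprocessed frame
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)  # 84 -> 20 -> 9 -> 7

flat_features = 64 * size * size  # 64 output channels of 7x7 each
print(flat_features)  # 3136
```

If you change the input resolution or the convolution parameters, the input size of the first linear layer has to be recomputed in the same way.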

In [None]:
class CnnQNetwork(nn.Module):
    """
    Q-Network with CNN
    """

    def __init__(self) -> None:
        """
        Initialize the network, which contains 2D convolution layers (image
        processing), normalization layers (for better numerical stability),
        and fully connected layers.
        The input has 4 channels (the stacked frames) and the output has
        NUM_ACTIONS entries, one Q value per action.
        """
        super().__init__()
        self.network = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Flatten(),  # flatten all convolutional features into one long vector
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, NUM_ACTIONS),
        )

    def forward(self, x: torch.Tensor):
        """
        Forward pass function, given a state s, compute value of all actions
        i.e. Q(s, a)
        Called with either one element to determine next action, or a batch
        during optimization.
        Args:
            x: state

        Returns:
            q_s_a = value of actions given this state
        """
        # Shape of x:
        # [num_frames=4, height=84, width=84] or [batch_size, num_frames=4, height=84, width=84]
        #
        # Shape of q_s_a:
        # [NUM_ACTIONS=4] or [batch_size, NUM_ACTIONS=4]

        if (ndim := x.ndim) == 3:
            # add a batch dimension of size 1 (out-of-place, so the caller's tensor is unchanged)
            x = x.unsqueeze(dim=0)
        else:
            assert ndim == 4

        # convert to float and rescale to [0, 1], then do network forward pass
        q_s_a = self.network(x / 255.0)

        # restore original dimensionality
        if ndim == 3:
            q_s_a = q_s_a.squeeze(dim=0)
        return q_s_a


# Instantiate policy net and its optimizer and push to GPU
policy_net = CnnQNetwork().to(device)
optimizer = optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)

# Instantiate target net using NN parameters of the policy net
target_net = CnnQNetwork().to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.requires_grad_(False)  # Never record gradients with respect to these weights

## Sample action using epsilon-greedy exploration method (3pts)

In the following block, you are going to select actions given a single **state** (3-dimensional Tensor) and an exploration rate **eps** (epsilon).
Each action is an integer from the set {0, 1, 2, 3}.
The program flow is:
* Apply the epsilon-greedy method to decide whether to explore or exploit.
* If exploitation, use the policy net to compute **Q(s,a)** and select the
 action with the highest value.
* If exploration, randomly choose an action.

Hints:
* The epsilon-greedy method can be achieved by an if-statement comparing eps with a random float bounded by \[0, 1\), try `random.random()` to get a random number.
* For exploitation:
    * You do not need to track gradients when computing **Q(s,a)** (think about why!), so you should use a context scope `with torch.no_grad():`.
    Disabling gradients when you don't need them saves memory and computation.
    * You can apply **argmax()** to get the action, which is the index of the max **Q(s,a)**.
* For exploration:
    * Randomly picking actions can be achieved by generating a tensor of random integers.
    `torch.randint()` is useful here, but be careful with the range and shape.
    These integers will later be used to index PyTorch tensors, and must therefore be 64 bit integers.
    Ensure this by passing `dtype` as `torch.long`.
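
To get a feeling for the shapes and dtypes involved, the two calls mentioned in the hints can be tried on a toy tensor. This is only a sketch with made-up Q values, not the solution to the task; in particular, it does not use the policy net.

```python
import torch

q_values = torch.tensor([0.1, 0.7, -0.2, 0.4])  # made-up Q(s, a) for 4 actions

# Index of the largest entry, returned as a 0-dimensional int64 tensor
greedy = q_values.argmax()

# One random integer in {0, 1, 2, 3}: `high` is exclusive, and size=() gives a 0-dim tensor
random_action = torch.randint(low=0, high=4, size=(), dtype=torch.long)

print(greedy, greedy.dtype, greedy.ndim)
print(random_action.dtype, random_action.ndim)
```

Both results pass the three assertions at the end of `select_action`: they are tensors, of dtype `torch.long`, with zero dimensions.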


In [None]:
def select_action(state: torch.Tensor, eps: float) -> torch.Tensor:
    """

    Args:
        state: state tensor of shape [num_frames=4, height=84, width=84]
        eps: exploration rate

    Returns:
        selected_action: Discrete action as a scalar (a tensor of dimension 0). The action is in the set {0, 1, 2, 3}

    """
    ## TODO ##
    selected_action = ...

    # Some code to help check the validity of the output
    assert isinstance(selected_action, torch.Tensor)
    assert selected_action.dtype == torch.long
    assert selected_action.ndim == 0

    return selected_action

## Optimize the model (7pts)

In the following block, you are going to compute a loss which will be used to optimize the parameters of the network.

The workflow is:
* Define a loss function (use the Huber loss).
* Sample a mini batch of transitions from the replay buffer.
* Query the policy network for Q values for the states.
* Choose the Q values corresponding to the actions that were taken in the sampled transitions.
* Query the target network for Q values for the next states.
* Choose the maximum Q value over all actions for each next state.
* Compute targets with $r + \gamma~\max_{a'} Q(s', a')$.
Take care to add the max Q of the next state only if the transition was not terminal!
* Use the loss function to compute the Huber loss.
* Carry out back propagation to get the gradients, clip them, and step the optimizer.
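
With an explicit termination indicator, the target from the steps above can be written as follows. The symbols $d$, $\theta$ and $\theta^-$ are notation introduced here for illustration and are not used elsewhere in the notebook.

$$
y = r + \gamma \, (1 - d) \, \max_{a'} Q_{\theta^-}(s', a'), \qquad d = \begin{cases} 1 & \text{if the transition terminated} \\ 0 & \text{otherwise} \end{cases}
$$

Here $Q_{\theta^-}$ stands for the target network, and the Huber loss compares $y$ with $Q_{\theta}(s, a)$ from the policy network for the action $a$ that was actually taken.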

Hints:
* You can use `torch.logical_not` or the `~` operator to invert a Tensor of boolean values.
* To pick data from tensor A using tensor B as index, use `torch.gather`.
* To add an extra dimension to a Tensor at position `i`, use `torch.unsqueeze(dim=i)`. The new dimension will have a length of 1.
* To remove dimension `i` from a Tensor, use `torch.squeeze(dim=i)`. This only works if dimension `i` has a length of 1.
* To get maximum values of a Tensor along some axis, use `torch.max`.
* Multiplying a Tensor of booleans with a Tensor of floats does what you expect and may be useful.
* Once you instantiate a loss function (e.g. `nn.HuberLoss()`), you can call that instance to compute a loss value.
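
The tensor operations from the hints can be explored on small, made-up tensors before applying them to real batches. The sketch below only demonstrates the individual calls; how to combine them into the loss is left to you.

```python
import torch

# Made-up Q values of shape [batch=2, actions=4]
q = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0]])
actions = torch.tensor([2, 0])  # one action index per row, dtype int64
terminated = torch.tensor([False, True])

# gather needs an index with the same number of dimensions as the input,
# so add a trailing dimension and remove it again afterwards
chosen = q.gather(dim=1, index=actions.unsqueeze(dim=1)).squeeze(dim=1)  # tensor([3., 5.])

# torch.max along a dimension returns both values and indices
row_max = q.max(dim=1).values  # tensor([4., 8.])

# A boolean tensor multiplied by floats behaves like a 1.0 / 0.0 mask
masked = (~terminated) * row_max  # tensor([4., 0.])
```

The index tensor passed to `gather` must be of dtype `torch.long`, which is why actions are stored as 64 bit integers.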

In [None]:
## TODO ##
loss_func = ...


def optimize_model():
    # Do not train until we have enough data in the buffer
    if len(replay_buffer) < LEARNING_STARTS:
        return

    # Sample mini-batch of transitions
    (
        states,
        actions,
        next_states,
        rewards,
        terminateds,
        truncateds,
    ) = replay_buffer.sample(BATCH_SIZE)

    ## TODO ##
    loss = ...

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), MAX_GRAD_NORM)
    optimizer.step()

## Main loop (5pts)

In the following block, you are going to finish the main loop of the training procedure.

The program flow is:
* First, initialize variables for program state, like game score, number of games, total steps, etc.
* Start the game to get its first state
* While loop until max training steps has been reached:
    * While loop until **terminated or truncated** (usually because a life was lost):
        * Update exploration rate
        * Apply a `torch.no_grad()` context so that gradients are not recorded for rollouts:
            * Convert state from numpy array to torch Tensor and push to GPU
            * Select action
            * Push action to CPU and convert to numpy array
            * Execute action in the environment, receiving a tuple of results, which includes the next state
            * Save transition into replay buffer
            * Move to the next state
            * Update counters
        * Optimize policy net when necessary
        * Update target net when necessary
    * Update progress bar, plot of running average game score
    * Save video of policy rollout
    * Reset environment to prepare for next trajectory

Hints:
* To convert a numpy array to torch, use `torch.from_numpy`.
* A torch Tensor can be pushed to the CPU using `x.cpu()` or to the GPU using `x.to(device)`.
* A torch Tensor **on the CPU** can be converted to a numpy array using `x.numpy()`.
* Use `env.step(action)` to perform an action.
This function returns a tuple of 5 values: `(next_state, reward, terminated, truncated, info)`.
The info is not relevant for this exercise.
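
The conversions from the hints look as follows on stand-in data. This sketch stays on the CPU so that it runs anywhere; the zero-filled array and the fixed action are placeholders, and calling `np.asarray` first is an assumption that is harmless for real arrays and helps if an observation arrives as some other array-like object.

```python
import numpy as np
import torch

# Stand-in for one observation: 4 stacked 84x84 grayscale frames
frames = np.zeros((4, 84, 84), dtype=np.uint8)

# np.asarray returns real arrays unchanged and converts other array-like objects
state = torch.from_numpy(np.asarray(frames))  # dtype uint8, shares memory with `frames`

# Stand-in for a selected action; .cpu() is a no-op for tensors already on the CPU
action = torch.tensor(2)
action_np = action.cpu().numpy()  # 0-dimensional numpy array of dtype int64

print(state.shape, state.dtype)
print(action_np.shape, action_np.dtype)
```

In the actual loop, you would additionally move `state` to the GPU with `.to(device)` before passing it to the network.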

When you start training, the scores in the early stage will be quite noisy.
**It may take 20+ minutes for the average score to exceed 1**.
The entire training should take about 45 minutes, and the average score at the end should be around 2-3.
Every 250 episodes, a video of the policy is saved (either to Google Drive or locally).
You can use this to visually confirm that the policy is learning.

In [None]:
list_num_game = []
list_mean_game_scores = []

# Main Loop of training
state, _ = env.reset(seed=SEED)
terminated, truncated = False, False
eps = EPS_START
num_time_steps = 0
num_game = 0
game_score = 0
game_length = 0
game_scores = deque([], maxlen=10)  # Store the score of the latest 10 games
video_folder = DATA_ROOT / "exercise_2" / time.strftime("%Y-%m-%d_%H-%M")

# Progress bar
with tqdm(total=TOTAL_STEPS, position=0, leave=True, unit="steps") as pbar:
    # Loop until total time steps has been reached
    while num_time_steps < TOTAL_STEPS:
        # Loop until trajectory is over
        while not (terminated or truncated):
            # Get exploration rate
            eps = update_eps(num_time_steps)

            # Rollout
            with torch.no_grad():
                ## TODO ##

                # Update game score and length and total time steps
                game_score += reward
                game_length += 1
                num_time_steps += 1

            # Optimize model
            if num_time_steps % TRAIN_FREQ == 0:
                optimize_model()

            # Update the target network
            if num_time_steps % TARGET_UPDATE_INTERVAL == 0:
                target_net.load_state_dict(policy_net.state_dict())

        # record game score and length
        game_scores.append(game_score)
        mean_game_score = np.asarray(game_scores).mean().item()
        pbar.update(game_length)
        num_game += 1

        # Reset game score and length
        game_score = 0
        game_length = 0

        # Print some result in the progress bar for every 10 games
        if num_game % 10 == 0:
            pbar.set_description(
                f"Game #{num_game - 9}-{num_game}, Avg_score: {mean_game_score:.3f},"
                f" eps: {eps:.3f}"
            )

        # Plot the average reward curve for every 50 games
        if num_game % 50 == 0:
            list_num_game.append(num_game)
            list_mean_game_scores.append(mean_game_score)
            plot_rewards(list_num_game, list_mean_game_scores)

        # save video of last game, only every 250 games
        save_video(
            env.render(),
            video_folder=video_folder,
            episode_trigger=lambda x: x % 250 == 0,
            name_prefix="breakout",
            episode_index=num_game,
            fps=30,
        )

        # reset environment for next trajectory
        state, _ = env.reset()
        terminated, truncated = False, False

    pbar.close()
    print("Finished!")

## Self-test questions (optional)

Where and how can we change the model to get a Double DQN?

.## TODO ##



After training on Breakout for 500000 steps, the agent manages to break a handful of bricks each game.
But the lecture said that DQN can achieve superhuman performance on Atari.
Why is our agent so bad?

.## TODO ##


