<h1> Approximating the value function with neural networks </h1>

In these two tutorials, we will approximate the state-action value function $Q(s,a)$ with a neural network. The environment we will use is the cart-pole environment from the OpenAI gym library. In this environment, the goal is to balance an inverted pendulum mounted on a cart. Once the pendulum falls, the episode terminates. As long as the pendulum is more or less upright, you obtain a reward of $+1$ for each step. `CartPole-v0` ends every episode after at most $200$ steps, and the environment is considered solved once the average return over $100$ consecutive episodes reaches $195$. Actions are $0$ and $1$, pushing the cart to the left or to the right, respectively. The state space contains four continuous variables: cart position, cart velocity, pole angle, and pole angular velocity.

See [https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py) and [https://gym.openai.com/envs/CartPole-v0/](https://gym.openai.com/envs/CartPole-v0/) for more details.

In [None]:
%matplotlib inline
import gym
from gym.wrappers import Monitor

import matplotlib
import matplotlib.pyplot as plt
from PIL import Image
from IPython import display as ipythondisplay

!apt-get install x11-utils > /dev/null 2>&1 
!pip install pyglet > /dev/null 2>&1 

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

env = gym.make('CartPole-v0')

We will break the algorithm into several blocks and implement them separately. The first and simplest one is the replay memory: a cyclic buffer that stores triplets of state, action, and the sampled $Q(s,a)$. Once the buffer is full, each new sample overwrites the oldest one.

In [None]:
class ReplayMemory:
  def __init__(self, capacity):
    # TODO: create a cyclic buffer of the given size
    self.capacity = capacity
    self.memory = None  # TODO

  def put(self, state, action, q_state_action):
    # TODO: store a sample in the buffer
    pass

  def sample(self, number):
    # TODO: draw the given number of samples uniformly at random from the buffer
    return []

  def size(self):
    # TODO: return the number of samples currently stored in the buffer
    # we will need this method later ...
    return 0

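One possible way to get cyclic-buffer behaviour in Python is `collections.deque` with `maxlen`: once the deque is full, appending a new item drops the oldest one. The sketch below is only an illustration, not the reference solution. The class name `DequeReplayMemory` is hypothetical, and sampling without replacement via `random.sample` is an assumption about what "uniformly" should mean here.

```python
import random
from collections import deque


class DequeReplayMemory:
    """Illustrative cyclic buffer (hypothetical name, not the tutorial's solution)."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards the oldest element when a new one
        # is appended to a full buffer
        self.memory = deque(maxlen=capacity)

    def put(self, state, action, q_state_action):
        self.memory.append((state, action, q_state_action))

    def sample(self, number):
        # uniform sampling without replacement (an assumption of this sketch)
        return random.sample(list(self.memory), number)

    def size(self):
        return len(self.memory)


# fill a buffer of capacity 3 with 5 dummy samples: only the last 3 survive
buffer = DequeReplayMemory(capacity=3)
for step in range(5):
    buffer.put(state=step, action=step % 2, q_state_action=float(step))

stored_states = [state for state, _, _ in buffer.memory]
batch = buffer.sample(2)
```
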
In the next step, we need to create a simple neural network to approximate the $Q$-values. If you do not know how to create a simple neural network, try visiting, for example, [this tutorial](https://towardsdatascience.com/building-neural-network-using-pytorch-84f6e75f9a).

In [None]:
from torch import nn

class Network(nn.Module):
  def __init__(self):
    super().__init__()
    self.hidden = None  # TODO: define the hidden layer
        
  def forward(self, x):
    # TODO implement the forward pass through the network.
    x = self.hidden(x)

    # TODO

    return x

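For CartPole, the network maps the four state variables to one $Q$-value per action. A minimal sketch of such a network follows. The class name `SmallQNetwork`, the hidden width of 64, and the ReLU activation are arbitrary choices made for illustration, not values prescribed by this tutorial.

```python
import torch
from torch import nn


class SmallQNetwork(nn.Module):
    """Illustrative Q-network: 4 state variables in, 2 Q-values out."""

    def __init__(self, n_inputs=4, n_hidden=64, n_actions=2):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.output = nn.Linear(n_hidden, n_actions)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        # no activation on the output: Q-values are unbounded real numbers
        return self.output(x)


net = SmallQNetwork()
batch_of_states = torch.zeros(8, 4)  # 8 dummy observations
q_values = net(batch_of_states)      # one row per observation, one column per action
```
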
Next, we will create a helper class <code>Model</code> that we will use to access the network. We will implement the greedy and $\varepsilon$-greedy policies, together with the optimization step.

In [None]:
import torch
import numpy as np
import random

class Model:
  def __init__(self) -> None:
    super().__init__()
    self.network = Network()
    # we will need a proper loss function and an optimizer
    # you may get inspired here:
    # https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
    # or select one from the list here:
    # https://neptune.ai/blog/pytorch-loss-functions
    # https://pytorch.org/docs/stable/optim.html
    self.criterion = None  # TODO choose your loss
    self.optimizer = None  # TODO choose your optimizer

  def greedy_policy(self, observation):
    # TODO implement the greedy policy, return 0/1
    with torch.no_grad(): 
      # TODO
      return 0

  def greedy_q_value(self, observation):
    # TODO implement the method to estimate U(s) as max_a Q(s,a)
    # we will need this method to calculate the target value to learn
    with torch.no_grad(): 
      return 0.0

  def eps_greedy(self, observation, epsilon):
    # TODO implement epsilon-greedy policy
    return 0

  def optimize_batch(self, replay_memory, batch_size):
    # in this method, we will actually train the neural network

    # if there are not enough samples in the history, skip learning
    if replay_memory.size() < batch_size:
      return

    sample = replay_memory.sample(batch_size)

    prediction_vals = None  # TODO calculate the values predicted by the network on the sample
    targets = None  # TODO calculate what the network should have predicted

    # actual training
    loss = self.criterion(prediction_vals, targets)
    self.optimizer.zero_grad()
    loss.backward()
  
    # clip every gradient element to [-1, 1]; parameters without a gradient are skipped
    torch.nn.utils.clip_grad_value_(self.network.parameters(), 1.0)
    self.optimizer.step()

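The $\varepsilon$-greedy policy picks a uniformly random action with probability $\varepsilon$ and the greedy action otherwise. The sketch below shows the idea on a plain list of $Q$-values so that it runs without a trained network. The function name `eps_greedy_from_q_values` and the optional `rng` argument are hypothetical helpers, and the $Q$-values are made up.

```python
import random


def eps_greedy_from_q_values(q_values, epsilon, rng=random):
    """Return an action index given one Q-value per action."""
    if rng.random() < epsilon:
        # explore: every action is equally likely
        return rng.randrange(len(q_values))
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 1.5]  # made-up Q-values for actions 0 and 1

# epsilon = 0 is the purely greedy policy
greedy_action = eps_greedy_from_q_values(q_values, epsilon=0.0)

# epsilon = 1 is the purely random policy
random_actions = {eps_greedy_from_q_values(q_values, epsilon=1.0) for _ in range(200)}
```
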

Finally, we are ready to put everything together and train the model in the OpenAI gym environment.

In [None]:
# TODO feel free to modify
NUMBER_OF_EPISODES = 1000
REPLAY_MEMORY_SIZE = 3000
BATCH_SIZE = 256

model = Model()
history = ReplayMemory(REPLAY_MEMORY_SIZE)

for e in range(NUMBER_OF_EPISODES):
  last_obs = env.reset()

  for i in range(500): # upper bound on steps; CartPole-v0 itself ends an episode after 200 steps
    epsilon = None # TODO calculate the epsilon

    # do the step
    last_action = model.eps_greedy(last_obs, epsilon)
    obs, reward, done, _ = env.step(last_action)

    # store into history and optimize the model
    td_target = None # TODO calculate the sampled value
    history.put(last_obs, last_action, td_target)

    model.optimize_batch(history, BATCH_SIZE)

    last_obs = obs

    # break and print how long we were able to balance the pendulum
    if done:
      print(i + 1) # length of the episode in steps (i starts at 0)
      break

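The two values left open in the training loop can be sketched as follows, under stated assumptions. Exploration is commonly reduced over time, and the sampled value is the one-step target $r + \gamma \max_{a'} Q(s', a')$, with no bootstrapping after a terminal step. The function names, the exponential decay schedule, and all constants (`eps_start`, `eps_end`, `decay`, `gamma`) are hypothetical choices for illustration. In the tutorial's code, `next_state_value` would come from `model.greedy_q_value(obs)`.

```python
import math


def epsilon_for_episode(episode, eps_start=1.0, eps_end=0.05, decay=200.0):
    """Exponentially decay epsilon from eps_start towards eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode / decay)


def td_target(reward, next_state_value, done, gamma=0.99):
    """One-step target r + gamma * max_a' Q(s', a'); no bootstrap after a terminal step."""
    if done:
        return reward
    return reward + gamma * next_state_value


first_epsilon = epsilon_for_episode(0)      # starts at eps_start
late_epsilon = epsilon_for_episode(10_000)  # approaches eps_end
running_target = td_target(reward=1.0, next_state_value=10.0, done=False)
terminal_target = td_target(reward=1.0, next_state_value=10.0, done=True)
```
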
Now, if we want to test our approach, we can visualize it using the following code. Feel free to skip this step.

In [None]:
# just a single test loop
for e in range(1):
  last_obs = env.reset()

  for i in range(500):
    last_action = model.greedy_policy(last_obs)
    obs, reward, done, _ = env.step(last_action)

    # render the current frame; otherwise the same image is shown in every step
    screen = env.render(mode='rgb_array')
    plt.imshow(screen)
    ipythondisplay.clear_output(wait=True)
    ipythondisplay.display(plt.gcf())

    last_obs = obs

    if done:
      print(i + 1) # number of steps the pendulum stayed up
      break

## A second neural network?

You may have already noticed that training the neural network this way is not very stable. We are trying to match its outputs to $Q$-values that are not static: the sampled $Q$-value $r + \gamma \max_{a'} Q(s', a')$ is computed by the same network we are updating, so every optimization step also moves the targets. This drift can cause large errors between the sampled $Q$-value and the predicted $Q(s,a)$. To stabilize the learning, a common approach is to add a second network whose weights are kept fixed between periodic updates and to use it as the target reference.

A very good explanation of why we need a second network is in the answer to this query:
https://stackoverflow.com/questions/54237327/why-is-a-target-network-required.
Also, an explanation of how the second network is used is here:
https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/, or in this example: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Therefore, your next task is to modify your code to include a second network. Is learning faster? More stable?

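A minimal sketch of the mechanics, assuming a periodic hard copy of the weights: the target network starts as a copy of the online network and is refreshed with `load_state_dict` every fixed number of optimization steps. The `nn.Sequential` stand-in, the constant `SYNC_EVERY`, and the helper `maybe_sync` are hypothetical and only illustrate the idea.

```python
import copy

import torch
from torch import nn

# stand-in for the tutorial's Network; the layer sizes are arbitrary
online_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

# the target network starts as an exact copy and is never trained directly
target_net = copy.deepcopy(online_net)
target_net.eval()

SYNC_EVERY = 100  # hypothetical number of optimization steps between syncs


def maybe_sync(step):
    """Copy the online weights into the target network every SYNC_EVERY steps."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())


# simulate a training update by shifting the output bias of the online network
with torch.no_grad():
    online_net[2].bias.add_(1.0)

state = torch.zeros(1, 4)
with torch.no_grad():
    differs_before_sync = not torch.equal(online_net(state), target_net(state))
    maybe_sync(step=100)
    equal_after_sync = torch.equal(online_net(state), target_net(state))
```
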
## Challenge 1 - learn from images

Deep reinforcement learning is often associated with learning directly from images. Here, the state can be formed by the difference between images of the cart-pole taken in two consecutive steps: we capture the screen in both steps and pass their difference to a **convolutional neural network** to predict the next movement. The difference captures information about the velocity of the cart, the angular velocity of the pendulum, and positional information. Your task is to take your code and modify it to work this way. Of course, you do not need to implement everything yourself; for example, feel free to copy the method <code>get_screen</code> (and take other inspiration) from https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html.

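The frame-difference idea can be illustrated without a renderer. The sketch below builds two tiny synthetic grayscale frames in which a single bright pixel moves one column to the right, and subtracts them; in the actual challenge the frames would come from `env.render(mode='rgb_array')`. The helper name `frame_difference` is hypothetical. The frames are cast to floating point first, because subtracting `uint8` arrays directly would wrap around instead of producing negative values.

```python
import numpy as np


def frame_difference(previous_frame, current_frame):
    """Signed per-pixel difference of two frames, as float32."""
    return current_frame.astype(np.float32) - previous_frame.astype(np.float32)


previous_frame = np.zeros((3, 5), dtype=np.uint8)
current_frame = np.zeros((3, 5), dtype=np.uint8)
previous_frame[1, 1] = 255  # object at column 1
current_frame[1, 2] = 255   # object moved to column 2

# negative where the object left, positive where it arrived, zero elsewhere
diff = frame_difference(previous_frame, current_frame)
```
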
## Challenge 2 - double inverted pendulum

If you do not like the previous challenge, you might try a different one - a double inverted pendulum. The problem is very similar to the inverted pendulum; however, it is much harder to balance the double pendulum. Also, the state space is much larger. Therefore, you will need a larger network and more computational time.

See https://github.com/openai/gym/blob/master/gym/envs/mujoco/inverted_double_pendulum.py for documentation. Note that the action is continuous; you will need to modify your network to account for the continuous, and therefore infinite, action space.

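One simple option, which is an assumption of this sketch rather than something the challenge prescribes, is to discretize the action interval into a fixed number of bins so that the network can still output one $Q$-value per bin. The bounds $-1$ and $1$ and the number of bins are assumed here; check `env.action_space.low` and `env.action_space.high` for the real bounds. The names `N_BINS`, `bin_centres`, and `index_to_action` are hypothetical.

```python
import numpy as np

N_BINS = 11                          # hypothetical resolution
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed bounds of the action space

# evenly spaced candidate actions covering the whole interval
bin_centres = np.linspace(ACTION_LOW, ACTION_HIGH, N_BINS)


def index_to_action(index):
    """Map a discrete action index to a 1-element continuous action."""
    return np.array([bin_centres[index]], dtype=np.float32)


leftmost = index_to_action(0)
middle = index_to_action(N_BINS // 2)
rightmost = index_to_action(N_BINS - 1)
```
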
In [None]:
!apt-get install -y \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common

!apt-get install -y patchelf

!pip install free-mujoco-py
import mujoco_py

env = gym.make('InvertedDoublePendulum-v2')