<a href="https://colab.research.google.com/github/Fjoelsak/pong-from-pixels/blob/THM-workshop/scripts/pong-from-pixels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="left"> <img src="https://www.thm.de/_thm/logos/thm.svg" width="200"></p>

# Our challenge: Pong from pixels

In the following, we apply the reinforcement learning algorithm **Vanilla Policy Gradients**, introduced in the lecture, to the **Pong** environment from OpenAI Gym.
Instead of loading a package with a ready-made implementation of the algorithm, the implementation is given here once more so that we can work with it and try things out.

![alt text](https://miro.medium.com/max/160/1*P4l2XZUffcJfJQjQ125wSw.gif?raw=true)



First, we install a few packages for visualizing the agent in the environment.

In [1]:
!pip install -q gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -q -y xvfb python-opengl ffmpeg > /dev/null 2>&1

from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from gym.wrappers import Monitor
import base64
# IO
from pathlib import Path

Then we define a function that plays back the recorded video.

In [2]:
display = Display(visible=0, size=(1400, 900))
display.start()

def show_video():
    html = []
    for mp4 in Path("video").glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 300px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

## Das Umgebungsmodell

We initialize the environment and first take a closer look at it.

In [3]:
import gym

# initialize the gym environment
env = gym.make("Pong-v0")

# wrap the environment for visualization purposes
env = Monitor(env, './video', force=True, video_callable=lambda episode: True) 
env.reset()
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video()

What exactly happens here?

*   The environment is **initialized** and **reset**.
*   In the while loop, the agent's **action** is chosen and one time step (**step**) is executed.


That is, the agent performs the action in the environment, which changes accordingly. In addition, a reward signal is returned. `done` is merely a flag indicating whether the episode has ended, and `info` holds further information about the time step.
In this simple example, actions are chosen at random and executed. So we have an environment without a learning agent.
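
To make the `reset`/`step` contract concrete without the Atari environment, here is a minimal sketch. `CoinFlipEnv`, its reward rule, and all `*_toy` names are made up for illustration and are not part of gym. Only the shape of the loop mirrors the cell above.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment (NOT part of gym) with a gym-style reset/step interface."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial observation

    def sample_action(self):
        # stand-in for env.action_space.sample()
        return self.rng.choice([0, 1])

    def step(self, action):
        self.t += 1
        obs = self.t                            # new observation after the action
        reward = 1.0 if action == 1 else -1.0   # reward signal for this step (made-up rule)
        done = self.t >= self.max_steps         # True once the episode is over
        info = {"t": self.t}                    # extra diagnostic information
        return obs, reward, done, info

env_toy = CoinFlipEnv()
obs_toy = env_toy.reset()
done_toy = False
rewards_toy = []
while not done_toy:
    action_toy = env_toy.sample_action()        # random action, nothing is learned
    obs_toy, reward_toy, done_toy, info_toy = env_toy.step(action_toy)
    rewards_toy.append(reward_toy)

print(len(rewards_toy), "steps, episode return:", sum(rewards_toy))
```

As in the Pong cell, the loop only consumes `done`; the observations and rewards are returned but no agent uses them to improve its choices.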

The Pong environment is one of many Atari environments that share a common base class. This explains the initially surprising result that the ```action_space``` consists of six discrete **actions**.

In [11]:
print("Action:", env.action_space)
env.unwrapped.get_action_meanings()

Action: Discrete(6)


['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

Action 0 has no effect. Actions 2 and 4 move the agent's own paddle up, and actions 3 and 5 move it down.
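
The mapping described above can be written down as a small lookup table. This is only an illustrative sketch: `PADDLE_EFFECT` and `describe_action` are hypothetical names, and the effect of action 1 is left open because the text does not describe it.

```python
# Action meanings as printed by env.unwrapped.get_action_meanings() above.
ACTION_MEANINGS = ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

# Paddle effect per action index, taken from the text (action 1 is not described there).
PADDLE_EFFECT = {0: "stay", 2: "up", 4: "up", 3: "down", 5: "down"}

def describe_action(action):
    """Return a readable description for an action index (hypothetical helper)."""
    effect = PADDLE_EFFECT.get(action, "not described in the text")
    return f"{action} ({ACTION_MEANINGS[action]}): {effect}"

for a in range(len(ACTION_MEANINGS)):
    print(describe_action(a))
```

Since two actions each produce the same paddle movement, a policy that only chooses between one "up" and one "down" action, as the training code further down does with actions 2 and 3, covers both movement directions.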



As its **observation**, the agent receives a 210x160 RGB image of the game (hence the name Pong from pixels). With 3 color channels, this amounts to an input of 210 · 160 · 3 = 100,800 values per frame.

In [16]:
print("Observation", obs.shape)

Observation (210, 160, 3)
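
The input size follows directly from the printed shape. A short sketch of the arithmetic (the variable names are chosen here for illustration):

```python
height, width, channels = 210, 160, 3   # shape printed above

n_pixels = height * width               # spatial positions in one frame
n_values = n_pixels * channels          # numbers the agent receives per frame

print(n_pixels)  # 33600
print(n_values)  # 100800
```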


Later, the image can be reduced further, since some regions, such as the scoreboard and the white stripe, are not needed for learning a game strategy. Moreover, it is enough to recognize the two paddles and the ball, so a grayscale view is entirely sufficient. More on this later!

![alt text](https://miro.medium.com/max/160/1*TzrWM3-3l9EHr0oBXN3hnA.gif?raw=true)

The **reward** signal, which the agent can use to evaluate its actions, is simply +1 for every round the agent wins and -1 for every round the computer wins.

In [12]:
print("Reward:", reward)

Reward: -1.0


In [19]:
import numpy as np

def prepro(I):
  """ preprocessing the 210x160x3 uint8 frame into 6000 (75x80) 1D float vector """
  I = I[35:185] # crop - keep rows 35..184 (removes 35px at the top & 25px at the bottom) to drop redundant parts of the image
  I = I[::2,::2,0] # downsample by factor of 2 and keep only the first color channel
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1. this effectively makes the image binary
  return I.astype(np.float64).ravel() # np.float was removed from NumPy; ravel flattens the array into a 1D vector

prepro(obs).shape

(6000,)
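
Where the 6000 comes from can be checked without any image data. The following sketch applies the same slices as `prepro` to plain index ranges; `kept_rows`, `kept_cols`, and `n_inputs` are names introduced here for illustration.

```python
rows = range(210)   # row indices of the raw frame
cols = range(160)   # column indices of the raw frame

kept_rows = rows[35:185][::2]   # crop to rows 35..184, then keep every second row
kept_cols = cols[::2]           # keep every second column

n_inputs = len(kept_rows) * len(kept_cols)
print(len(kept_rows), len(kept_cols), n_inputs)  # 75 80 6000
```

Compared with the 100,800 raw values, the preprocessed input is therefore smaller by a factor of 16.8.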

## Policy Gradients

In the following, we take another look at the reinforcement learning algorithm Vanilla Policy Gradients from the lecture.

First, a few remarks:

- the input the agent receives (**observation**)

In [21]:
""" Majority of this code was copied directly from Andrej Karpathy's gist:
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5 """

""" Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym. """
import numpy as np
import pickle
import gym

from gym import wrappers

#################################################
# hyperparameters to tune
#################################################
H = 200 # number of hidden layer neurons
batch_size = 10 # used to perform a RMS prop param update every batch_size episodes
learning_rate = 1e-3 # learning rate used in RMS prop
gamma = 0.99 # discount factor for reward
decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2

# Config flags - video output and res
resume = True # resume training from a previous checkpoint? (save.p must already exist; set to False for a fresh start)
render = False # render video output?

# model initialization
D = 75 * 80 # input dimensionality: 75x80 grid
if resume:
  model = pickle.load(open('save.p', 'rb'))
else:
  model = {}
  model['W1'] = np.random.randn(H,D) / np.sqrt(D) # "Xavier" initialization - Shape will be H x D
  model['W2'] = np.random.randn(H) / np.sqrt(H) # Shape will be H

grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # update buffers that add up gradients over a batch
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # rmsprop memory

def sigmoid(x):
  return 1.0 / (1.0 + np.exp(-x)) # sigmoid "squashing" function to interval [0,1]

def prepro(I):
  """ preprocessing the 210x160x3 uint8 frame into 6000 (75x80) 1D float vector """
  I = I[35:185] # crop - keep rows 35..184 (removes 35px at the top & 25px at the bottom) to drop redundant parts of the image
  I = I[::2,::2,0] # downsample by factor of 2 and keep only the first color channel
  I[I == 144] = 0 # erase background (background type 1)
  I[I == 109] = 0 # erase background (background type 2)
  I[I != 0] = 1 # everything else (paddles, ball) just set to 1. this effectively makes the image binary
  return I.astype(np.float64).ravel() # np.float was removed from NumPy; ravel flattens the array into a 1D vector

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  """ this function discounts from the action closest to the end of the completed game backwards
  so that the most recent action has a greater weight """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)): 
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

def policy_forward(x):
  """This is a manual implementation of a forward prop"""
  h = np.dot(model['W1'], x) # (H x D) . (D x 1) = (H x 1) (200 x 1)
  h[h<0] = 0 # ReLU introduces non-linearity
  logp = np.dot(model['W2'], h) # This is a logits function and outputs a decimal.   (1 x H) . (H x 1) = 1 (scalar)
  p = sigmoid(logp)  # squashes output to  between 0 & 1 range
  return p, h # return probability of taking action 2 (UP), and hidden state

def policy_backward(eph, epx, epdlogp):
  """ backward pass. (eph is array of intermediate hidden states) """
  """ Manual implementation of a backward prop"""
  """ It takes an array of the hidden states that corresponds to all the images that were
  fed to the NN (for the entire episode, so a bunch of games) and their corresponding logp"""
  dW2 = np.dot(eph.T, epdlogp).ravel()
  dh = np.outer(epdlogp, model['W2'])
  dh[eph <= 0] = 0 # backprop through the ReLU
  dW1 = np.dot(dh.T, epx)
  return {'W1':dW1, 'W2':dW2}


#####################################################################
# main program
#####################################################################
env = gym.make("Pong-v0")
env = wrappers.Monitor(env, './video', force=True, video_callable=lambda episode_id: episode_id%10==0)
observation = env.reset()
prev_x = None # used in computing the difference frame
xs,hs,dlogps,drs = [],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 0
while episode_number < 21:
  if render: env.render()

  # preprocess the observation, set input to network to be difference image
  cur_x = prepro(observation)
  # we take the difference in the pixel input, since this is more likely to account for interesting information
  # e.g. motion
  x = cur_x - prev_x if prev_x is not None else np.zeros(D)
  prev_x = cur_x

  # forward the policy network and sample an action from the returned probability
  aprob, h = policy_forward(x)
  # Sample the action: draw a uniform random number and compare it with the probability of UP
  # that the network outputs for this image. If the number is smaller than that probability, go UP,
  # otherwise go DOWN. The randomness introduces 'exploration' of the agent
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice! 2 is UP, 3 is DOWN, 0 is stay the same

  # record various intermediates (needed later for backprop).
  # This code would have otherwise been handled by a NN library
  xs.append(x) # observation
  hs.append(h) # hidden state
  y = 1 if action == 2 else 0 # a "fake label" - this is the label that we're passing to the neural network
  # to fake labels for supervised learning. It's fake because it is generated algorithmically, and not based
  # on a ground truth, as is typically the case for Supervised learning

  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  reward_sum += reward
  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

  if done: # an episode finished
    episode_number += 1

    # stack together all inputs, hidden states, action gradients, and rewards for this episode
    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    xs,hs,dlogps,drs = [],[],[],[] # reset array memory

    # compute the discounted reward backwards through time
    discounted_epr = discount_rewards(epr)
    # standardize the rewards to be unit normal (helps control the gradient estimator variance)
    discounted_epr -= np.mean(discounted_epr)
    discounted_epr /= np.std(discounted_epr)

    epdlogp *= discounted_epr # modulate the gradient with advantage (Policy Grad magic happens right here.)
    grad = policy_backward(eph, epx, epdlogp)
    for k in model: grad_buffer[k] += grad[k] # accumulate grad over batch

    # perform rmsprop parameter update every batch_size episodes
    if episode_number % batch_size == 0:
      for k,v in model.items():
        g = grad_buffer[k] # gradient
        rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
        model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
        grad_buffer[k] = np.zeros_like(v) # reset batch gradient buffer

    # boring book-keeping
    running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
    print ('resetting env. episode reward total was %f. running mean: %f' % (reward_sum, running_reward))
    if episode_number % 20 == 0: pickle.dump(model, open('save.p', 'wb'))
    reward_sum = 0
    observation = env.reset() # reset env
    prev_x = None

  if reward != 0: # Pong has either +1 or -1 reward exactly when game ends.
    print (('ep %d: game finished, reward: %f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!'))

show_video()

ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: 1.000000 !!!!!!!!
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: -1.000000
ep 0: game finished, reward: 1.000000 !!!!!!!!
resetting env. episode reward total was -19.000000. running mean: -19.000000
ep 1: game finished, reward: -1.000000
ep 1: game
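
To see what `discount_rewards` and the subsequent standardization do to an episode, here is a pure-Python sketch on a tiny hand-made reward sequence. `discount_rewards_list`, `standardize`, and the example values are introduced here for illustration; a small `gamma` of 0.5 is assumed only to keep the numbers readable (the training code uses 0.99).

```python
import math

def discount_rewards_list(rewards, gamma=0.99):
    """Pure-Python sketch of discount_rewards for a plain list of floats."""
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_add = 0.0   # a point was scored: restart the sum (Pong specific)
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Two rounds: the first is lost after three steps, the second is won after two.
example_rewards = [0.0, 0.0, -1.0, 0.0, 1.0]
example_returns = discount_rewards_list(example_rewards, gamma=0.5)
print(example_returns)  # [-0.25, -0.5, -1.0, 0.5, 1.0]

example_advantages = standardize(example_returns)
```

Every step leading up to the lost point receives a negative value and every step leading up to the won point a positive one, with steps closer to the point weighted more strongly. After standardization these values multiply the per-step gradients, so actions before a win are encouraged and actions before a loss are discouraged.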

References:

[https://towardsdatascience.com/intro-to-reinforcement-learning-pong-92a94aa0f84d](https://towardsdatascience.com/intro-to-reinforcement-learning-pong-92a94aa0f84d)

[http://karpathy.github.io/2016/05/31/rl/](http://karpathy.github.io/2016/05/31/rl/)