# Policy Gradients algorithm for Pong

Trains an agent with (stochastic) Policy Gradients on Pong. Uses OpenAI Gym.

Original source: [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/)

## Imports

In [1]:
import numpy as np
import _pickle as pickle
import gym

## Hyperparameters

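*H*: Number of neurons in the hidden layer<br>
*batch_size*: Number of episodes between parameter updates<br>
*learning_rate*: Step size for the RMSProp update<br>
*gamma*: Discount factor for reward<br>
*decay_rate*: Decay factor for the RMSProp running average of squared gradients<br>
*resume*: Whether to resume from a previous checkpoint<br>
*render*: Whether to render the game while training<br>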

In [2]:
H = 200
batch_size = 10 
learning_rate = 1e-4
gamma = 0.99 
decay_rate = 0.99

resume = False
render = False

## Model initialization

In [3]:
def init_model():
    '''
    Initialization of the neural network model
    '''
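    D = 80 * 80 #Input dimensionality: 80x80 grid
    if resume:
        with open('save.p', 'rb') as f:
            model = pickle.load(f) #Load the checkpoint written during training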
    else:
        model = {}
        model['W1'] = np.random.randn(H,D) / np.sqrt(D) #"Xavier" init
        model['W2'] = np.random.randn(H) / np.sqrt(H)

    #Update buffers that add up gradients over a batch
    grad_buffer = { k : np.zeros_like(v) for k,v in model.items()}  
    #Rmsprop memory
    rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items()}
    
    return model, grad_buffer, rmsprop_cache

## Activation function

In [4]:
def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

## Preprocessing

In [5]:
def prepro(I):
    '''
    Preprocess 210x160x3 uint8 frame into (80x80) 1D float vector
    '''
    I = I[35:195] #Crop
    I = I[::2,::2,0] #Downsample by factor of 2
    I[I==144] = 0 #Erase background type 1
    I[I==109] = 0 #Erase background type 2
    I[I != 0] = 1 #Everything else set to 1
    return I.astype(np.float64).ravel() #np.float was removed in NumPy 1.24
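
To see what the preprocessing produces, here is a minimal sketch that runs the same steps on a synthetic frame rather than a real Gym observation. The name `prepro_sketch`, the frame contents, and the pixel value 236 are assumptions for illustration only. Unlike `prepro`, this version copies the slice so the caller's frame is left unmodified.

```python
import numpy as np

def prepro_sketch(frame):
    """Crop, downsample, and binarize a 210x160x3 uint8 frame into a 6400-vector."""
    img = frame[35:195]            # keep the 160 rows that hold the playing field
    img = img[::2, ::2, 0].copy()  # every second pixel, first channel; copy so the input is untouched
    img[img == 144] = 0            # erase background type 1
    img[img == 109] = 0            # erase background type 2
    img[img != 0] = 1              # everything else (paddles, ball) becomes 1
    return img.astype(np.float64).ravel()

# Synthetic frame (assumption): uniform background with a 4x4 "ball" patch.
frame = np.full((210, 160, 3), 144, dtype=np.uint8)
frame[100:104, 80:84, 0] = 236

x = prepro_sketch(frame)
print(x.shape, x.sum())
```

After downsampling by 2 in each direction, the 4x4 patch covers 2x2 pixels, so only four entries of the 6400-element vector are non-zero.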

## Rewarding

In [6]:
def discount_reward(r):
    '''
    Take 1D float array of rewards and compute discounted reward
    '''
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0,r.size)):
        if r[t] != 0: running_add = 0 #Reset the sum, since this was a game boundary
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
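
A small worked example may help show how the running sum resets at each non-zero reward. This is a sketch under assumptions: `discount_sketch` is a hypothetical standalone copy of the loop above that takes `gamma` as an argument, and `gamma=0.5` is chosen only to keep the arithmetic easy to follow (the notebook uses 0.99).

```python
import numpy as np

def discount_sketch(r, gamma=0.99):
    """Discounted return per step, resetting whenever a non-zero reward is seen."""
    out = np.zeros_like(r, dtype=np.float64)
    running = 0.0
    for t in reversed(range(r.size)):
        if r[t] != 0:
            running = 0.0  # a point was scored here, so later rewards must not leak backwards
        running = running * gamma + r[t]
        out[t] = running
    return out

# Two points: one won at step 2, one lost at step 4.
rewards = np.array([0.0, 0.0, 1.0, 0.0, -1.0])
returns = discount_sketch(rewards, gamma=0.5)
print(returns)  # [ 0.25  0.5   1.   -0.5  -1.  ]
```

The steps leading up to the win receive positive credit that decays with distance, while the step between the win and the loss is credited only with the loss.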

## Forward propagation

In [7]:
def policy_forward(x):
    h = np.dot(model['W1'],x)
    h[h<0] = 0 #ReLU
    logp = np.dot(model['W2'],h)
    p = sigmoid(logp)
    return p,h #Return prob and hidden state
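
The forward pass can be sketched in isolation with toy dimensions to make the shapes explicit. The sizes, the random seed, and the name `forward_sketch` are assumptions for illustration. The weights are initialized the same way as in `init_model`, but they are local to this sketch rather than read from the global `model`.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 4  # toy sizes; the notebook uses D = 80 * 80 and H = 200
W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal(H) / np.sqrt(H)

def forward_sketch(x):
    """Two-layer policy: ReLU hidden layer, then a sigmoid over one logit."""
    h = np.maximum(W1 @ x, 0.0)       # hidden activations, shape (H,)
    logit = W2 @ h                    # scalar
    p = 1.0 / (1.0 + np.exp(-logit))  # probability of choosing action 2
    return p, h

# An all-zero input, like the first frame of an episode where no difference image exists yet.
p_zero, h_zero = forward_sketch(np.zeros(D))
print(p_zero, h_zero.shape)  # 0.5 (4,)
```

Because the network has no bias terms, an all-zero input always yields a probability of exactly 0.5, so the first action of every episode is a fair coin flip.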

## Backprop

In [8]:
def policy_backward(eph, epdlogp):
    '''
    Backward pass (eph is the array of intermediate hidden states;
    epx, the stacked episode inputs, is read from the global scope)
    '''
    dW2 = np.dot(eph.T, epdlogp).ravel()
    dh = np.outer(epdlogp, model['W2'])
    dh[eph <= 0] = 0 #Backprop through ReLU
    dW1 = np.dot(dh.T, epx)
    return {'W1':dW1, 'W2': dW2}
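
One way to gain confidence in the backward pass is a numerical gradient check. The sketch below is not part of the original notebook. It assumes the quantity being differentiated is the advantage-weighted log-likelihood of the sampled actions, and it uses toy sizes, random data, and hypothetical helper names (`forward`, `objective`, `analytic_grads`, `numeric_grad`). The analytic formulas mirror those in `policy_backward`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 5, 6, 3                  # toy sizes, not the notebook's 6400 and 200
W1 = rng.standard_normal((H, D))
W2 = rng.standard_normal(H)
epx = rng.standard_normal((T, D))  # stacked inputs, one row per time step
y = rng.integers(0, 2, size=(T, 1)).astype(np.float64)  # sampled "fake labels"
adv = rng.standard_normal((T, 1))  # stand-in for the standardized returns

def forward(W1, W2):
    eph = np.maximum(epx @ W1.T, 0.0)      # hidden states, shape (T, H)
    p = 1.0 / (1.0 + np.exp(-(eph @ W2)))  # action probabilities, shape (T,)
    return eph, p.reshape(-1, 1)

def objective(W1, W2):
    _, p = forward(W1, W2)
    logp = y * np.log(p) + (1.0 - y) * np.log(1.0 - p)
    return float(np.sum(adv * logp))

def analytic_grads(W1, W2):
    eph, p = forward(W1, W2)
    epdlogp = adv * (y - p)          # same modulation as in the training loop
    dW2 = (eph.T @ epdlogp).ravel()
    dh = np.outer(epdlogp, W2)
    dh[eph <= 0] = 0                 # backprop through ReLU
    dW1 = dh.T @ epx
    return dW1, dW2

def numeric_grad(W, eps=1e-6):
    """Central differences; W is perturbed in place and restored."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = objective(W1, W2)
        W[idx] = old - eps
        minus = objective(W1, W2)
        W[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

dW1, dW2 = analytic_grads(W1, W2)
max_err_W1 = np.max(np.abs(dW1 - numeric_grad(W1)))
max_err_W2 = np.max(np.abs(dW2 - numeric_grad(W2)))
print(max_err_W1, max_err_W2)
```

Both errors should be tiny. A large mismatch would most likely point to a transposed matrix product or a missing ReLU mask.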

## Training

In [9]:
def train():
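    #policy_forward and policy_backward read model (and epx) from the global scope
    global model, grad_buffer, rmsprop_cache, epx
    model, grad_buffer, rmsprop_cache = init_model()
    env = gym.make('Pong-v0')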
    observation = env.reset()
    prev_x = None #Computing the difference frame
    xs,hs,dlogps,drs = [],[],[],[]
    running_reward = None
    reward_sum = 0
    episode_number = 0
    while True:
        if render: env.render()

        #Preprocess the observation, set input to network to be difference image
        cur_x = prepro(observation)
        x = cur_x - prev_x if prev_x is not None else np.zeros_like(cur_x)
        prev_x = cur_x

        #Forward the policy network and sample action from the returned prob
        aprob, h = policy_forward(x)
        action = 2 if np.random.uniform() < aprob else 3 #roll the dice!

        #Record various intermediates (needed later for backprop)
        xs.append(x) #Observation
        hs.append(h) #Hidden state
        y = 1 if action == 2 else 0 #A 'fake label'
        dlogps.append(y - aprob) #Grad that encourages the action that was taken to be taken

        #Step the environment and get new measurements
        observation, reward, done, info = env.step(action)
        reward_sum += reward

        drs.append(reward) # Record reward (has to be done after we call step() to get reward for previous action)

        if done:
            episode_number += 1

            #Stack together all inputs, hs, action gradients, and rewards for this episode
            epx = np.vstack(xs)
            eph = np.vstack(hs)
            epdlogp = np.vstack(dlogps)
            epr = np.vstack(drs)
            xs,hs,dlogps,drs = [],[],[],[]

            #Compute the discounted reward backwards through time
            discounted_epr = discount_reward(epr)
            #Standardize the rewards to be unit normal
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)

            #Policy gradient magic
            epdlogp *= discounted_epr # Modulate the gradient with advantage
            grad = policy_backward(eph, epdlogp)
            for k in model: grad_buffer[k] += grad[k] #Accumulate grad over batch

            #Perform rmsprop parameter update every batch_size episodes
            if episode_number % batch_size == 0:
                for k,v in model.items():
                    g = grad_buffer[k] #Gradient
                    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
                    model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
                    grad_buffer[k] = np.zeros_like(v) #Reset batch gradient buffer

            #Boring book-keeping
            running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
            print('Resetting env. episode reward total was %.3f. running mean: %.3f' % (reward_sum, running_reward))
            if episode_number % 100 == 0:
                with open('save.p', 'wb') as f:
                    pickle.dump(model, f) #Checkpoint the model
            reward_sum = 0
            observation = env.reset()
            prev_x = None

        if reward != 0: #Pong has either +1 or -1 reward exactly when game ends
            print(('Ep %d: game finished, reward: %.f' % (episode_number, reward)) + ('' if reward == -1 else ' !!!!!!!!'))
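
The RMSProp update inside the training loop can also be looked at on its own. The following is a minimal sketch with a hypothetical helper `rmsprop_step` and made-up gradient values. It uses the same formula and default hyperparameters as the notebook, including the `+` sign, since the gradient here is ascended rather than descended.

```python
import numpy as np

def rmsprop_step(param, grad, cache, learning_rate=1e-4, decay_rate=0.99, eps=1e-5):
    """One RMSProp ascent step; returns the updated parameter and cache."""
    cache = decay_rate * cache + (1 - decay_rate) * grad ** 2
    param = param + learning_rate * grad / (np.sqrt(cache) + eps)
    return param, cache

w = np.zeros(3)
cache = np.zeros(3)
g = np.array([1.0, -2.0, 0.0])  # illustrative batch gradient

w, cache = rmsprop_step(w, g, cache)
print(cache)  # [0.01 0.04 0.  ]
print(w)
```

On the first step from an empty cache, every non-zero component moves by roughly `learning_rate / sqrt(1 - decay_rate)`, about 1e-3 here, regardless of the gradient's magnitude. Only the sign of each component carries over.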

## Warning

As of 12/10/17, the Pong environment is not available on Windows.

## References

1. [Deep Reinforcement Learning: Pong from Pixels](http://karpathy.github.io/2016/05/31/rl/)
2. [Simple Reinforcement Learning with Tensorflow: Part 2 - Policy-based Agents](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724)
3. [Reinforcement learning with policy gradient](http://minpy.readthedocs.io/en/latest/tutorial/rl_policy_gradient_tutorial/rl_policy_gradient.html)
4. [Deep Deterministic Policy Gradients in TensorFlow](http://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html)
5. [Simple reinforcement learning methods to learn CartPole](http://kvfrans.com/simple-algoritms-for-solving-cartpole/)
6. [REINFORCEMENT LEARNING (RL) – POLICY GRADIENTS I](https://theneuralperspective.com/2016/11/25/reinforcement-learning-rl-policy-gradients-i/)
7. [Solving the Basic Game of Pong](https://www.youtube.com/watch?v=pN7ETkOizGM&ab_channel=SirajRaval)
8. [Pong-v0](https://gym.openai.com/envs/Pong-v0/)