# Pong with PPO algorithm :D

Train an agent to play Pong!


First, let's install the required dependencies.

In [None]:
!pip install gym[atari]
!pip install gym[accept-rom-license]
!pip install torchvision
!pip install ipython
!pip install progressbar

Now, we can start setting up the game! Let's create the environment... (this is taken from https://www.gymlibrary.dev/environments/atari/pong/)

In [None]:
import gym
from parallelEnv import parallelEnv 
env = gym.make('PongDeterministic-v4')

Notice this environment is Deterministic, meaning it produces the same result for a particular input. This is easier to model, but a stochastic environment would be better. **Why do you think that is?** (Hint: https://www.gymlibrary.dev/environments/atari/)

The environment was created! Now let's explore... try finding out what the observation space and action space are! (Hint: https://www.gymlibrary.dev/content/basic_usage/). **What do you think this observation space corresponds to?**

In [None]:
#------------TO DO---------------#
print("The state (observation) space is: ", None)

print("The action space is: ", None)
#--------------------------------#

print("List of available actions: ", env.unwrapped.get_action_meanings())

As you can see, the action space consists of 6 possible actions the agent can perform. However, in this model we will only use two of them: 'RIGHTFIRE' = 4 and 'LEFTFIRE' = 5. This keeps our policy simple: it outputs the probability of going right, so the probability of going left is just 1 minus that probability. The 'FIRE' part of these actions just ensures that the game starts again after you lose a life.
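
To make the two-action setup concrete, here is a minimal sketch of how a single probability can drive the choice between the two actions. The input probabilities are made-up numbers standing in for the policy's output, and `sample_actions` is a hypothetical helper that is not part of the notebook.

```python
import numpy as np

RIGHT = 4  # 'RIGHTFIRE'
LEFT = 5   # 'LEFTFIRE'

def sample_actions(p_right, rng):
    """Pick RIGHT with probability p_right, otherwise LEFT (one draw per entry)."""
    p_right = np.asarray(p_right, dtype=float)
    draws = rng.random(p_right.shape)
    actions = np.where(draws < p_right, RIGHT, LEFT)
    # probability the policy assigned to the action that was actually taken
    taken_prob = np.where(actions == RIGHT, p_right, 1.0 - p_right)
    return actions, taken_prob

rng = np.random.default_rng(0)
actions, taken_prob = sample_actions([0.9, 0.5, 0.1], rng)
print(actions, taken_prob)
```

Storing the probability of the action that was actually taken is what later lets the training code compare the new policy with the old one.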

Good! Now we know a little bit about how the environment works. Let's create the policy the agent will have to follow. The input is a stack of two consecutive frames (which captures the movement), and the output is a number $P_{\rm right}$, the probability of moving right. Note that $P_{\rm left}= 1-P_{\rm right}$. Don't worry, this is already implemented for you, just run the cell!

In [None]:
# Import the Necessary Packages
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):

    def __init__(self):
        super(Policy, self).__init__()
        # 80x80x2 to 38x38x4
        # 2 channels from the stacked frames
        # (80-6)/2 + 1 = 38 --> 38x38x4
        self.conv1 = nn.Conv2d(2, 4, kernel_size=6, stride=2, bias=False)
        # 38x38x4 to 9x9x16
        # (38-6)/4 + 1 = 9 --> 9x9x16
        self.conv2 = nn.Conv2d(4, 16, kernel_size=6, stride=4)
        self.size=9*9*16
        
        # two fully connected layers
        self.fc1 = nn.Linear(self.size, 256)
        self.fc2 = nn.Linear(256, 1)

        # Sigmoid to squash the output into a probability between 0 and 1
        self.sig = nn.Sigmoid()
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(-1,self.size)
        x = F.relu(self.fc1(x))
        return self.sig(self.fc2(x))
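
The value `self.size = 9*9*16` follows from the usual output-size formula for a convolution without padding, `(size - kernel_size) / stride + 1`. As a quick check, assuming no padding and no dilation (the `nn.Conv2d` defaults), the arithmetic can be sketched in plain Python; `conv_output_size` is a hypothetical helper, not part of the notebook.

```python
def conv_output_size(size, kernel_size, stride):
    """Output width/height of a convolution with no padding and no dilation."""
    return (size - kernel_size) // stride + 1

after_conv1 = conv_output_size(80, kernel_size=6, stride=2)           # 80 -> 38
after_conv2 = conv_output_size(after_conv1, kernel_size=6, stride=4)  # 38 -> 9
flattened = after_conv2 * after_conv2 * 16                            # 16 output channels
print(after_conv1, after_conv2, flattened)
```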

Here we can better visualize the structure of our policy, and we also define our optimizer.

In [None]:
device = torch.device("cpu")
policy = Policy().to(device)
print(policy)
import torch.optim as optim
optimizer = optim.Adam(policy.parameters(), lr=1e-4)

Now, let's define some functions that will be useful in order to visualize our environment, and play in it.


In [None]:
RIGHT=4
LEFT=5
import numpy as np
import torch
import matplotlib.pyplot as plt
from IPython.display import HTML, display 
from matplotlib import animation
import random as rand
# convert outputs of parallelEnv to inputs to pytorch neural net
# this is useful for batch processing 
def preprocess_batch(images, bkg_color = np.array([144, 72, 17])):
    list_of_images = np.asarray(images)
    if len(list_of_images.shape) < 5:
        list_of_images = np.expand_dims(list_of_images, 1)
    # subtract bkg and crop
    list_of_images_prepro = np.mean(list_of_images[:,:,34:-16:2,::2]-bkg_color, axis=-1)/255.
    batch_input = np.swapaxes(list_of_images_prepro,0,1)
    return torch.from_numpy(batch_input).float().to(device)


def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim.to_jshtml())


# function to animate a list of frames
def animate_frames(frames):
    plt.axis('off')

    # color option for plotting
    # use Greys for greyscale
    cmap = None if len(frames[0].shape)==3 else 'Greys'
    patch = plt.imshow(frames[0], cmap=cmap)  

    fanim = animation.FuncAnimation(plt.gcf(), lambda x: patch.set_data(frames[x]), frames = len(frames), interval=30)
    

    display(display_animation(fanim)) 
    
# play a game and display the animation
# nrand = number of random steps before using the policy
def play(env, policy, time=2000, preprocess=None, nrand=5):
    env.reset()

    # start game
    env.step(1)
    
    # perform nrand random steps in the beginning
    for _ in range(nrand):
        frame1, reward1, is_done, _, _ = env.step(np.random.choice([RIGHT,LEFT]))
        frame2, reward2, is_done, _, _ = env.step(0)
    
    anim_frames = []
    
    for _ in range(time):
        
        frame_input = preprocess_batch([frame1, frame2])
        prob = policy(frame_input)
        
        # RIGHT = 4, LEFT = 5
        action = RIGHT if rand.random() < prob else LEFT
        frame1, _, is_done, _, _ = env.step(action)
        frame2, _, is_done, _, _ = env.step(0)

        if preprocess is None:
            anim_frames.append(frame1)
        else:
            anim_frames.append(preprocess(frame1))

        if is_done:
            break
    
    env.close()
    
    animate_frames(anim_frames)
    return 
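
The cropping and downsampling inside `preprocess_batch` can be checked on dummy data. The sketch below assumes the usual 210x160 RGB Atari frame (check this against the observation space you printed earlier) and repeats the same NumPy steps, without the final conversion to a PyTorch tensor.

```python
import numpy as np

# two dummy frames, assuming a 210x160 RGB observation
frame1 = np.zeros((210, 160, 3), dtype=np.uint8)
frame2 = np.zeros((210, 160, 3), dtype=np.uint8)
bkg_color = np.array([144, 72, 17])

frames = np.asarray([frame1, frame2])     # (2, 210, 160, 3)
frames = np.expand_dims(frames, 1)        # (2, 1, 210, 160, 3): add an environment axis
cropped = frames[:, :, 34:-16:2, ::2]     # (2, 1, 80, 80, 3): crop rows, keep every 2nd pixel
gray = np.mean(cropped - bkg_color, axis=-1) / 255.0  # (2, 1, 80, 80): average the color channels
batch = np.swapaxes(gray, 0, 1)           # (1, 2, 80, 80): (environments, frames, height, width)
print(batch.shape)
```

The two stacked frames end up on the channel axis, which is why the first convolution of the policy takes 2 input channels.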



Great! We can use our play() function defined before to visualize our Pong game and how our agent does with the policy we created. In this Pong game, you control the right paddle (your agent is the green paddle), and you compete against the left paddle controlled by the computer. You each try to keep deflecting the ball away from your goal and into your opponent’s goal.

Keep in mind that the policy hasn't been trained yet, so right now the agent will mostly just take random actions when playing. You can change the 'time' parameter in the function below to visualize more or fewer frames.

In [None]:
play(env, policy, time=100) 

As we can see, our agent is a very bad player right now. Let's try actually implementing the PPO algorithm to try and make our agent beat the computer! **Here we will define this PPO function, and your job is to complete the lines of code that are implementing the clip step of the algorithm.**
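
For reference, the clipped surrogate objective from the PPO paper can be written with this notebook's quantities as shown below. Here $r_t(\theta)$ corresponds to `ratio`, $\hat{R}_t$ to the normalized future rewards (used here in place of an advantage estimate), and $\epsilon$ to `epsilon`. How this maps onto `torch.clamp` and `torch.min` is left for the exercise.

$$
L_{\rm sur}^{\rm clip}(\theta) = \frac{1}{T}\sum_{t} \min\Big( r_t(\theta)\,\hat{R}_t,\ {\rm clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{R}_t \Big), \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)}
$$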

In [None]:
import numpy as np
import torch

RIGHT=4
LEFT=5

# convert states to probability, passing through the policy
def states_to_prob(policy, states):
    states = torch.stack(states)
    policy_input = states.view(-1,*states.shape[-3:])
    return policy(policy_input).view(states.shape[:-3])

def clipped_surrogate(policy, old_probs, states, actions, rewards, discount=0.995, epsilon=0.1, beta=0.01):

    discount = discount**np.arange(len(rewards))
    rewards = np.asarray(rewards)*discount[:,np.newaxis]
    
    # convert rewards to future rewards
    rewards_future = rewards[::-1].cumsum(axis=0)[::-1]
    
    mean = np.mean(rewards_future, axis=1)
    std = np.std(rewards_future, axis=1) + 1.0e-10

    rewards_normalized = (rewards_future - mean[:,np.newaxis])/std[:,np.newaxis]
    
    # convert everything into pytorch tensors
    actions = torch.tensor(actions, dtype=torch.int8, device=device)
    old_probs = torch.tensor(old_probs, dtype=torch.float, device=device)
    rewards = torch.tensor(rewards_normalized, dtype=torch.float, device=device)

    # convert states to policy (or probability)
    new_probs = states_to_prob(policy, states)
    new_probs = torch.where(actions == RIGHT, new_probs, 1.0-new_probs)
    
    # ratio for clipping
    ratio = new_probs/old_probs



    #---------------------------------------TO DO---------------------------------------------#
    # clipped function
    clip = torch.clamp(None) #use the epsilon variable here
    clipped_surrogate = torch.min(None) #you need to use the clip variable here
    #-----------------------------------------------------------------------------------------#



    # include a regularization term
    # this steers new_policy towards 0.5
    # add in 1.e-10 to avoid log(0) which gives nan
    entropy = -(new_probs*torch.log(old_probs+1.e-10)+(1.0-new_probs)*torch.log(1.0-old_probs+1.e-10))

    
    # this returns an average of all the entries of the tensor
    # effectively computing L_sur^clip / T
    # averaged over time-steps and number of trajectories
    # this is desirable because we have normalized our rewards
    return torch.mean(clipped_surrogate + beta*entropy)
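
The reward handling at the top of `clipped_surrogate` is easier to follow on toy numbers. The sketch below repeats the same NumPy steps on a made-up reward table with 3 time steps and 2 parallel environments. The discount of 0.5 is chosen only to make the effect visible and is not a value used in the notebook.

```python
import numpy as np

# toy rewards with shape (time steps, parallel environments) = (3, 2)
rewards = np.array([[1.0, 0.0],
                    [0.0, 0.0],
                    [0.0, 1.0]])
gamma = 0.5  # deliberately small so the discounting is easy to see

discounts = gamma ** np.arange(len(rewards))   # one factor per time step
discounted = rewards * discounts[:, np.newaxis]

# reverse, accumulate, reverse again:
# each row becomes the sum of itself and all later rows
rewards_future = discounted[::-1].cumsum(axis=0)[::-1]

# normalize across the environments at each time step
mean = rewards_future.mean(axis=1)
std = rewards_future.std(axis=1) + 1.0e-10
rewards_normalized = (rewards_future - mean[:, np.newaxis]) / std[:, np.newaxis]

print(rewards_future)
print(rewards_normalized)
```

Note that the discount factors are counted from the start of the trajectory, so a reward collected at step `t` is weighted by `gamma**t` in every future-reward sum it contributes to.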

Nice job! Now you have your PPO algorithm all done. For this training, we will use batch processing of multiple environments to try and make our agent learn even faster. Let's first define the function that will get the trajectories of each of these environments. Just run the cell!

In [None]:
import numpy as np
import torch
import torch.optim as optim
from parallelEnv import parallelEnv 
RIGHT=4
LEFT=5

# collect trajectories for a parallelized parallelEnv object
def collect_trajectories(envs, policy, tmax=200, nrand=5):
    
    # number of parallel instances
    n=len(envs.ps)

    #initialize returning lists and start the game!
    state_list=[]
    reward_list=[]
    prob_list=[]
    action_list=[]

    envs.reset()
    
    # start all parallel agents
    envs.step([1]*n)
    
    # perform nrand random steps
    for _ in range(nrand):
        #print( envs.step(np.random.choice([RIGHT, LEFT],n)))
        fr1, re1, _, info = envs.step(np.random.choice([RIGHT, LEFT],n))
        fr2, re2, _, info = envs.step([0]*n)
    
    for t in range(tmax):

        # prepare the input
        # preprocess_batch properly converts two frames into 
        # shape (n, 2, 80, 80), the proper input for the policy
        # this is required when building CNN with pytorch
        batch_input = preprocess_batch([fr1,fr2])
        
        # probs will only be used as the pi_old
        # no gradient propagation is needed
        # so we move it to the cpu
        probs = policy(batch_input).squeeze().cpu().detach().numpy()
        
        action = np.where(np.random.rand(n) < probs, RIGHT, LEFT)
        probs = np.where(action==RIGHT, probs, 1.0-probs)
        
        
        # advance the game (0=no action)
        # we take one action and skip game forward
        fr1, re1, is_done, info = envs.step(action)
        fr2, re2, is_done, info = envs.step([0]*n)

        reward = re1 + re2
        
        # store the result
        state_list.append(batch_input)
        reward_list.append(reward)
        prob_list.append(probs)
        action_list.append(action)
        
        # stop if any of the trajectories is done
        # we want all the lists to be rectangular
        if is_done.any():
            break


    # return the old action probabilities, states, actions, rewards
    return prob_list, state_list, action_list, reward_list



We are now ready to train our policy!

Depending on your CPU, this may take from 45 minutes to 2 hours. 

**While this trains, briefly explain in your own words what is happening during this training phase. Look closely at each line of code inside the two 'for' loops.**

In [None]:
from parallelEnv import parallelEnv
import numpy as np
import progressbar as pb
RIGHT=4
LEFT=5
# reuse the device the policy already lives on, so inputs and model stay on the same device
device = next(policy.parameters()).device

# training loop max iterations
episode = 400

# widget bar to display progress

widget = ['training loop: ', pb.Percentage(), ' ', pb.Bar(), ' ', pb.ETA()]


timer = pb.ProgressBar(widgets=widget, maxval=episode).start()


envs = parallelEnv('PongDeterministic-v4', n=8, seed=42)

discount_rate = .99
epsilon = 0.1
beta = .01
tmax = 320
SGD_epoch = 4


mean_rewards = []

for e in range(episode):

  
    old_probs, states, actions, rewards = collect_trajectories(envs, policy, tmax=tmax)
        
    total_rewards = np.sum(rewards, axis=0)



    for _ in range(SGD_epoch):
        L = -clipped_surrogate(policy, old_probs, states, actions, rewards, discount=discount_rate, epsilon=epsilon, beta=beta)

        optimizer.zero_grad()
        L.backward()
        optimizer.step()
        del L
    
    
    # the clipping parameter reduces as time goes on
    epsilon*=.999
    
    # the regularization term also reduces
    # this reduces exploration in later runs
    beta*=.995
    
    # get the average reward of the parallel environments
    mean_rewards.append(np.mean(total_rewards))
    
    # display some progress every 20 iterations
    if (e+1)%20 ==0 :
        print("Episode: {0:d}, score: {1:f}".format(e+1,np.mean(total_rewards)))
        print(total_rewards)
        
    # update progress widget bar
    timer.update(e+1)
    
timer.finish()
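
To get a feel for how much the clipping parameter and the regularization weight shrink over the run, the two decay rules from the loop can be replayed on their own. This is a small standalone sketch using the same starting values and decay factors as the training cell.

```python
episodes = 400
epsilon, beta = 0.1, 0.01

for _ in range(episodes):
    epsilon *= 0.999  # clipping range narrows slowly
    beta *= 0.995     # regularization weight shrinks faster

print("final epsilon: {:.4f}, final beta: {:.5f}".format(epsilon, beta))
```

With these factors, `beta` falls to roughly an eighth of its starting value over 400 episodes, while `epsilon` only drops by about a third.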

Nice! Now your policy is all trained up. Let's visualize how the rewards your agent obtained changed each episode. **Copy and paste this graph in your document and explain what is happening.**

In [None]:
plt.plot(mean_rewards)
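
The raw curve can be noisy from one episode to the next. One optional way to make the trend easier to read is a moving average. The sketch below uses made-up numbers in place of `mean_rewards`, and `moving_average` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def moving_average(values, window=20):
    """Mean over a sliding window; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# stand-in data; in the notebook you would pass mean_rewards instead
fake_rewards = [-21, -20, -19, -18, -17]
smoothed = moving_average(fake_rewards, window=3)
print(smoothed)

# in the notebook: plt.plot(moving_average(mean_rewards, window=20))
```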

It looks like your agent obtained decent rewards in the later episodes. That's what we wanted! If you trained for more episodes, your agent might get even better rewards. But I'm sure the training you did will be enough to beat the computer now... let's test it out.

In [None]:
play(env, policy, time=600) 

Yay! Our agent did amazing against the computer. You can now save the policy you trained for future occasions. And if you ever want to use it again, you can just load it back up! Thanks for playing Pong :D

In [None]:
# save your policy!
torch.save(policy, 'PPO.policy')

# load policy if needed
# (recent PyTorch versions may require weights_only=False to load a full pickled model)
# policy = torch.load('PPO.policy')

# BONUS

Try training your policy with two other numbers of episodes (they can be higher or lower than 400). Put the mean rewards graphs you obtain in your document and analyze why they look the way they do.

***IMPORTANT: Before uploading the modified .ipynb files, be sure to clear all the outputs.***