<h1>Table of Contents</h1>

<ol>
    <li><span style="color:green">OpenAI Gym setup: Pong and CartPole</span></li>
    <li><span style="color:green">Generate Experience:</span></li>
    <ul>
        <li>Buffer to store trajectories</li>
        <li>Reward Transformations: Advantage, GAE etc</li>
        <li>Normalizing reward</li>
        <li>Observations preprocessing methods</li>
    </ul>
    <li><span style="color:green">Models code: Actor, critic etc</span></li>
    <li>Training Codes</li>
    <ul>
        <li>Optimizers: LR, Initialization etc</li>
        <li>Metrics to track: reward, entropy, etc.</li>
        <li>Tensorboard</li>
        <li>Logging</li>
        <li>Model Checkpoints</li>
        <li>Hyper-parameter search Techniques</li>
        <li>Script to automate process</li>
    </ul>
    <li>Update Rules: Policy Algos</li>
    <ul>
        <li>Actor update rule: Vanilla, TRPO, PPO</li>
        <li>Value function fit: regression on discounted reward, TD(1)</li>
    </ul>
    <li>Documentation and Good Reference </li>
</ol>    


<p><b>References:</b>

1. OpenAI Spinning Up  2. Karpathy, Pong from Pixels  3. 
    
A. https://spinningup.openai.com/en/latest/algorithms/vpg.html
    
   https://github.com/openai/spinningup/tree/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/vpg

B. https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5    
</p>


<b>Important Observations</b>

- Even for episodic tasks, use GAE or discounted rewards for each observation leading up to the terminal reward.

# Pong-v0
https://medium.com/gradientcrescent/fundamentals-of-reinforcement-learning-automating-pong-in-using-a-policy-model-an-implementation-b71f64c158ff


- Actions: 0 -> no movement, 2 -> up, 3 -> down
- Reward: +1 -> win, -1 -> loss, 0 -> otherwise
- Example reward sequence: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0 …]

On its own, this sparse signal is not very useful: it says nothing about which of the preceding actions mattered, and the number of actions preceding a reward varies. To address this, we apply a discount function that uses a decay rate (gamma) to distribute the earned reward backwards across the preceding frames, so that frames closer to the reward receive more credit. The discounted rewards are then normalized.
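
As a minimal sketch of the discounting described above (the helper name `discount_with_reset` and the toy reward sequence are assumptions for illustration, not part of the referenced article): every non-zero reward marks the end of a Pong rally, so the running sum restarts there and credit does not leak from one rally into the previous one.

```python
import numpy as np

def discount_with_reset(rewards, gamma=0.99):
    """Discounted reward-to-go; the running sum restarts at every non-zero reward."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.size)):
        if rewards[t] != 0:
            running = 0.0  # rally boundary: start a fresh sum
        running = running * gamma + rewards[t]
        out[t] = running
    return out

# Two short rallies: the first is lost (-1), the second is won (+1).
toy_rewards = [0.0, 0.0, -1.0, 0.0, 1.0]
discounted = discount_with_reset(toy_rewards, gamma=0.5)
print(discounted)
```

With gamma = 0.5 the frames of the lost rally receive -0.25, -0.5 and -1, and the frames of the won rally receive 0.5 and 1, so each frame is credited only with the outcome of its own rally.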



In [218]:
import torch
import torch.nn as nn
from torch.distributions.normal import Normal
from torch.distributions.categorical import Categorical
from torch.optim import Adam

import numpy as np
import scipy.signal

import gym
import time

import matplotlib.pyplot as plt

In [219]:
def discount_cumsum(x, discount):
    """
    magic from rllab for computing discounted cumulative sums of vectors.
    input: 
        vector x, 
        [x0, 
         x1, 
         x2]
    output:
        [x0 + discount * x1 + discount^2 * x2,  
         x1 + discount * x2,
         x2]
    """
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]



In [259]:
class Actor(nn.Module):
    
    def __init__(self, obs_dim, act_dim, hidden_sizes=(256,), model=None):
        super().__init__()

        self.obs_dim = obs_dim
        self.act_dim = act_dim

        if model is None:
            # Default MLP: obs_dim --> hidden_sizes... --> act_dim
            ind = self.obs_dim
            layers = []
            for l in hidden_sizes:
                layers.append(nn.Linear(ind,l))
                layers.append(nn.ReLU())
                ind = l

            layers.append(nn.Linear(ind, self.act_dim))
            model = nn.Sequential(*layers)

        self.logits_net = model

    def _distribution(self, obs):
        logits = self.logits_net(obs)
        return Categorical(logits=logits)

    def _log_prob_from_distribution(self, pi, act):
        return pi.log_prob(act)

    def forward(self, obs, act=None):
        # Produce action distributions for given observations, and 
        # optionally compute the log likelihood of given actions under
        # those distributions.
        pi = self._distribution(obs)
        logp_a = None
        if act is not None:
            logp_a = pi.log_prob(act)
        return pi, logp_a


class Critic(nn.Module):

    def __init__(self, obs_dim, hidden_sizes=(256,), model=None):
        super().__init__()
        
        if model is None:
            ind = obs_dim
            layers = []
            for l in hidden_sizes:
                layers.append(nn.Linear(ind,l))
                layers.append(nn.ReLU())
                ind = l

            layers.append(nn.Linear(ind, 1))
            model = nn.Sequential(*layers)                          
        
        self.v_net = model
            
    def forward(self, obs):
        return torch.squeeze(self.v_net(obs), -1) # Critical to ensure v has right shape.

    

class Agent(nn.Module):
    
    """
    Data collection
    1. step:     obs --> step --> act, v, logp_a

    Training
    2. self.pi:  obs --> pi --> distribution / logp_a
    3. self.v:   obs --> v  --> value estimate
    """


    def __init__(self, obs_dim, act_dim, 
                 hidden_sizes=[64,64], actor_model=None,critic_model=None):
        super().__init__()
        
        
        self.pi = Actor(obs_dim, act_dim, hidden_sizes, actor_model) 

        # build value function
        self.v  = Critic(obs_dim, hidden_sizes, critic_model)

    def step(self, obs):
        with torch.no_grad():
            pi = self.pi._distribution(obs)
            a = pi.sample()
            logp_a = pi.log_prob(a)  #self.pi._log_prob_from_distribution(pi, a)
            v = self.v(obs)
        return a.cpu().numpy(), v.cpu().numpy(), logp_a.cpu().numpy()

    def act(self, obs):
        return self.step(obs)[0]    
    

In [255]:
# Pong

"""
action:: 2 --> up
         3 --> down

"""


def prepro(I):
    """ prepro 210x160x3 frame into 6400 (80x80) 1D float vector """
    I = I[35:195] # crop
    I = I[::2,::2,0] # downsample by factor of 2
    I[I == 144] = 0 # erase background (background type 1)
    I[I == 109] = 0 # erase background (background type 2)
    I[I != 0] = 1 # everything else (paddles, ball) just set to 1
    return I.astype(np.float32).ravel()  # the np.float alias was removed in NumPy 1.24




class BufferNew:
    """
    A buffer for storing trajectories experienced by agent interacting
    with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
    for calculating the advantages of state-action pairs.
    """

    def discount_rewards(self, r, gamma):
        """ take 1D float array of rewards and compute discounted reward """
        r = np.asarray(r, dtype=np.float64)  # force float so integer input is not truncated
        discounted_r = np.zeros_like(r)
        running_add = 0.0

        for t in reversed(range(0, r.size)):
            if r[t] != 0: running_add = 0.0 # a point was scored (Pong specific): reset the sum
            running_add = running_add * gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r
    
    def normalize(self, r):
        r = np.array(r, dtype=np.float64)
        r -= np.mean(r)           # shift to zero mean
        r /= (np.std(r) + 1e-8)   # scale to unit std; epsilon guards against division by zero
        return r
        
    def reset(self, gamma=0.99, lam=0.95):
        self.obs_buf = [] 
        self.act_buf = [] 
        self.rew_buf = [] 
        
        self.val_buf = [] 
        self.logp_buf = []
        
        # Used for training
        self.adv_buf = [] 
        self.ret_buf = []

        
        
        #---------environmet----------------
        self.obs_buf_ep = [] 
        self.act_buf_ep = [] 
        self.rew_buf_ep = [] 
        
        #---------Model---------------------
        self.val_buf_ep = [] 
        self.logp_buf_ep = []

        
        #-------For GAE training-------------
        self.adv_buf_ep = [] 
        self.ret_buf_ep = [] 
        
        self.gamma, self.lam = gamma, lam
    
    def __init__(self, gamma=0.99, lam=0.95, dev = torch.device('cpu')):
        self.epochs = 0 
        self.reset(gamma, lam)
        self.device = dev
        
        

    def store(self, obs =0, act = 0, rew = 0, val = 0, logp = 0):
        """
        Append one timestep of agent-environment interaction to the buffer.
        """
#         print("store called")
        self.obs_buf_ep.append(obs)
        self.act_buf_ep.append(act)
        self.rew_buf_ep.append(rew)
        
        self.val_buf_ep.append(val)
        self.logp_buf_ep.append(logp)


    def finish_path(self, last_val=0):
        """
        Call this at the end of a trajectory, or when one gets cut off
        by an epoch ending. This looks back in the buffer to where the
        trajectory started, and uses rewards and value estimates from
        the whole trajectory to compute advantage estimates with GAE-Lambda,
        as well as compute the rewards-to-go for each state, to use as
        the targets for the value function.
        The "last_val" argument should be 0 if the trajectory ended
        because the agent reached a terminal state (died), and otherwise
        should be V(s_T), the value function estimated for the last state.
        This allows us to bootstrap the reward-to-go calculation to account
        for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
        """
        #print("inside finish_path")
        self.epochs += 1
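        rews = np.append(np.array(self.rew_buf_ep, dtype=np.float64), last_val)
        vals = np.append(np.array(self.val_buf_ep, dtype=np.float64), last_val)
        
        # GAE-Lambda: discounted sum of the TD residuals (deltas).
        deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
        # NOTE: discount_rewards() resets its running sum whenever its input is
        # non-zero, which is almost always the case for deltas, so it would return
        # the deltas unchanged. Accumulate here instead and reset only where a
        # point was scored (non-zero reward); this is an assumption about the intent.
        adv = np.zeros_like(deltas)
        running_add = 0.0
        for t in reversed(range(deltas.size)):
            if rews[t] != 0:
                running_add = 0.0
            running_add = running_add * self.gamma * self.lam + deltas[t]
            adv[t] = running_add
        self.adv_buf_ep = self.normalize(adv)
#         print("advantage calculated")
        # the next line computes rewards-to-go, to be targets for the value function
        self.ret_buf_ep = self.discount_rewards(rews, self.gamma)[:-1]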
#         print("ret done")
        
        #print(f"rews : {rews.shape} \n vals: {vals.shape} \n adv_buf : {self.adv_buf_ep.shape} \n ret: {self.ret_buf_ep.shape}")

#        print("path finished")
        
#         if self.epochs % 100 == 0:
#             x = np.arange(0,len(self.adv_buf_ep))
#             plt.clf()
#             plt.title("reward/advantage curve")
#             plt.plot(x, self.adv_buf_ep, label="advantage")
#             plt.plot(x, self.ret_buf_ep, label="rew to go")
#             plt.xlabel("samples")
#             plt.ylabel("adv/rewtogo")
#             plt.legend(loc='best')
#             plt.savefig(os.path.join(os.getcwd(),"Graph", str(self.epochs)+"_"+"adv_rew.png"))
        

        
        
        
        self.obs_buf.append(self.obs_buf_ep) 
        self.act_buf.append(self.act_buf_ep) 
        self.rew_buf.append(self.rew_buf_ep)
        
        self.val_buf.append(self.val_buf_ep) 
        self.logp_buf.append(self.logp_buf_ep)
        
        # Used for training
        self.adv_buf.append(self.adv_buf_ep) 
        self.ret_buf.append(self.ret_buf_ep)
#         print("done all ")
        #store episodes
    

    def get(self):
        """
        Call this at the end of an epoch to get all of the data from
        the buffer as float32 tensors on self.device. Advantages were
        already normalized per episode in finish_path(). The buffer is
        not cleared here; call reset() before collecting the next epoch.
        """
        data = dict(obs=self.obs_buf, act=self.act_buf, ret=self.ret_buf,
                    adv=self.adv_buf, logp=self.logp_buf)
        
        data_t = {}
        for k, v in data.items():
            # Concatenate episodes along the time axis so that episodes of
            # different lengths can share one batch.
            v = np.concatenate([np.asarray(ep) for ep in v], axis=0)
            data_t[k] = torch.as_tensor(v, dtype=torch.float32).to(self.device)
        return data_t
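
The advantage and reward-to-go computation that `finish_path` describes can be checked on a toy trajectory. The following is a minimal NumPy sketch of plain GAE-Lambda, without the Pong-specific reset at scored points; the function name `gae_advantages` and the toy numbers are assumptions for illustration, not part of the buffer above.

```python
import numpy as np

def gae_advantages(rewards, values, last_val=0.0, gamma=0.99, lam=0.95):
    """GAE-Lambda advantages and discounted rewards-to-go for one trajectory."""
    rews = np.append(np.asarray(rewards, dtype=np.float64), last_val)
    vals = np.append(np.asarray(values, dtype=np.float64), last_val)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rews[:-1] + gamma * vals[1:] - vals[:-1]
    adv = np.zeros_like(deltas)
    ret = np.zeros_like(deltas)
    running_adv, running_ret = 0.0, float(last_val)
    for t in reversed(range(deltas.size)):
        running_adv = deltas[t] + gamma * lam * running_adv  # A_t
        running_ret = rews[t] + gamma * running_ret          # reward-to-go
        adv[t] = running_adv
        ret[t] = running_ret
    return adv, ret

# Toy trajectory: reward only at the end, constant value estimate of 0.5.
adv, ret = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
adv_td, _ = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=0.0)
```

With `gamma = lam = 1` the advantage reduces to reward-to-go minus the value estimate (0.5 at every step here), while `lam = 0` leaves only the one-step TD residual (0, 0, 0.5). Intermediate values of `lam` trade off between these two extremes.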



In [269]:
# Set up function for computing VPG policy loss
def compute_loss_pi(data, ac):
    obs, act, adv, logp_old = data['obs'], data['act'], data['adv'], data['logp']

#     print(f"{obs.device} {act.device}")
    # Policy loss
    pi, logp = ac.pi(obs, act)
    loss_pi = -(logp * adv).mean()
    #print(f"\n compute_loss_pi:: \n\n obs: {obs.shape} \n adv: {adv.shape}\n logp {logp.shape}\n loss_pi = -(logp * adv).mean() --> {loss_pi.shape} ")

    # Useful extra info
    approx_kl = (logp_old - logp).mean().item()
    ent = pi.entropy().mean().item()
    pi_info = dict(kl=approx_kl, ent=ent)

    return loss_pi, pi_info

# Set up function for computing value loss
def compute_loss_v(data, ac):
    obs, ret = data['obs'], data['ret']
    l = ((ac.v(obs) - ret)**2).mean()
    #print(f"\n compute_loss_v \nobs: {obs.shape} \n ret: {ret.shape} \n ((ac.v(obs) - ret)**2).mean() : {l.shape} ")
    return l

def update(data, pi_optimizer, train_v_iters, vf_optimizer, ac):

    # Train policy with a single step of gradient descent.
    # pi_info holds the KL and entropy measured before the step.
    pi_optimizer.zero_grad()
    loss_pi, pi_info = compute_loss_pi(data, ac)
    pi_loss = loss_pi.item()
    loss_pi.backward()
    pi_optimizer.step()

    # Value function learning
    v_loss = float('nan')  # stays NaN only if train_v_iters == 0
    for i in range(train_v_iters):
        vf_optimizer.zero_grad()
        loss_v = compute_loss_v(data, ac)
        loss_v.backward()
        v_loss = loss_v.item()
        vf_optimizer.step()

    # Report the losses and the pre-update policy entropy
    ent = pi_info['ent']
    return pi_loss, v_loss, ent
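
To make the policy loss and the entropy metric concrete, here is a minimal NumPy sketch of the same quantities that `compute_loss_pi` obtains from the `Categorical` distribution. The helper names `log_softmax` and `vpg_policy_loss` and the toy inputs are assumptions for illustration; the training code itself uses PyTorch.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def vpg_policy_loss(logits, actions, advantages):
    """Return -(log pi(a|s) * A).mean() and the mean entropy of the policy."""
    logp_all = log_softmax(np.asarray(logits, dtype=np.float64))
    actions = np.asarray(actions, dtype=np.int64)
    # Log-probability of the action actually taken at each step.
    logp = logp_all[np.arange(actions.size), actions]
    loss = -(logp * np.asarray(advantages, dtype=np.float64)).mean()
    entropy = -(np.exp(logp_all) * logp_all).sum(axis=-1).mean()
    return loss, entropy

# Two steps, two actions, uniform policy, advantages of opposite sign.
loss, entropy = vpg_policy_loss(
    logits=[[0.0, 0.0], [0.0, 0.0]], actions=[0, 1], advantages=[1.0, -1.0])
```

For a uniform two-action policy the entropy is ln 2 (about 0.693), which is the maximum possible with two actions. Read against that scale, an entropy near 0.06 in a training log indicates a policy that is already close to deterministic.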
    

In [281]:
from torch.utils.tensorboard import SummaryWriter
%matplotlib inline  

# default `log_dir` is "runs" - we'll be more specific here
writer = SummaryWriter('pg_800')
# %tensorboard --logdir=runs/vpg_1
# #model
# writer.add_graph(net, images)
# writer.close()
import os
cwd = os.getcwd()

checkpt = os.path.join(cwd,"model")#,str(5))
# checkpt += ".pt"
print(checkpt)



epochs=10000
gamma=0.99
pi_lr=1e-4
vf_lr=1e-4 
train_v_iters=60
lam=0.97
max_ep_len=1000  # no trailing comma: it would turn this into the tuple (1000,)
save_freq=50
log_freq = 10


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

last_epoch = 5450

loadpth = os.path.join(checkpt,"agent_" + str(last_epoch) + ".pt")

# Create actor-critic module
ac = Agent(80*80, 2, hidden_sizes = [256,64])
ac.load_state_dict(torch.load(loadpth, map_location=device))  # also loads GPU checkpoints on a CPU-only machine
ac.to(device)

buf = BufferNew(gamma, lam, device)

# Set up optimizers for policy and value function
pi_optimizer = Adam(ac.pi.parameters(), lr=pi_lr)
vf_optimizer = Adam(ac.v.parameters(), lr=vf_lr)



D:\Reinforcement\PolicyGradient\model
cuda


In [249]:
logDict = {}

In [282]:
print(len(logDict))

437


In [283]:


# Prepare for interaction with environment
start_time = time.time()
env = gym.make("Pong-v0") 
o, ep_ret, ep_len = env.reset(), 0, 0

epr,epl = 0,0
pilrun,vlrun, entrun = 0,0,0
# Main loop: collect experience in env and update/log each epoch
for epoch in range(last_epoch+1,epochs):
    buf.reset(gamma, lam)  # pass the hyper-parameters; reset() alone falls back to its defaults (lam=0.95)
    while True:
        o = prepro(o)
        a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32).to(device))
        if a==0:
            act = 2
        else:
            act = 3
        next_o, r, d, _ = env.step(act)
        ep_ret += r
        ep_len += 1
        #print(ep_len)
        # save and log
        buf.store(o, a, r, v, logp)
        #logger.store(VVals=v)

        # Update obs (critical!)
        o = next_o
        
        if d:
            buf.finish_path(0)
            epr += ep_ret
            epl += ep_len
            #print(f"completed one game  r:{ep_ret} in {ep_len}")
            o, ep_ret, ep_len = env.reset(), 0, 0
            break
            
#         end = d or ep_len>1000
#         if end:
#             #print("done*****************\n****************\n***********")
#             if d:
#                 buf.finish_path(0)
#                 print(ep_ret)
#                 o, ep_ret, ep_len = env.reset(), 0, 0
#                 break
#             else:
#                 val = ac.v(o)
#                 buf.finish_path(val)
#                 print(ep_ret)
#                 o, ep_ret, ep_len = env.reset(), 0, 0
#                 break
                

  

    # Perform VPG update!
    data = buf.get()
    piloss,vloss, ent = update(data, pi_optimizer, train_v_iters, vf_optimizer, ac)
    pilrun += piloss
    vlrun += vloss
    entrun += ent
    if (epoch % save_freq == 0):
        savepth = os.path.join(checkpt,"agent_" + str(epoch) + ".pt")
#         savepth += ".pt"
        torch.save(ac.state_dict(), savepth)

    if (epoch % log_freq == 0):
        piloss_avg = float(pilrun)/float(log_freq)
        pilrun = 0
        vloss_avg = float(vlrun)/float(log_freq)
        vlrun = 0
        ent_avg = float(entrun)/float(log_freq)
        entrun = 0
        
        epr_avg = float(epr)/float(log_freq)
        epl_avg = float(epl)/float(log_freq)
        epr = 0
        epl = 0
        
        writer.add_scalar('pi_loss', piloss_avg, epoch)
        writer.add_scalar('v_loss', vloss_avg, epoch)
        
        writer.add_scalar('ep_length', epl_avg, epoch)
        writer.add_scalar('reward', epr_avg, epoch)
        writer.add_scalar('entropy_pi', ent_avg, epoch)
        print(f"{epoch}: rew_aw: {epr_avg} || len_Av: {epl_avg} || pi_loss:{piloss_avg} || v_loss:{vloss_avg} || entropy: {ent_avg}")
        
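        # Keep a per-epoch record of the averaged metrics.
        # Named log_entry so it does not shadow the training batch `data`.
        log_entry = {
            'pi_loss': piloss_avg,  # plain float (a trailing comma here would store a tuple)
            'v_loss': vloss_avg,
            'ep_length': epl_avg,
            'reward': epr_avg,
            'entropy_pi': ent_avg,
        }
        
        logDict[epoch] = log_entry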
        

        

5460: rew_aw: -10.8 || len_Av: 5822.5 || pi_loss:-0.0013523441448342055 || v_loss:0.12162694707512856 || entropy: 0.06388965025544166
5470: rew_aw: -14.3 || len_Av: 5571.4 || pi_loss:-0.0008889262782759034 || v_loss:0.10476735085248948 || entropy: 0.06247897632420063
5480: rew_aw: -14.4 || len_Av: 5390.5 || pi_loss:0.0019264152884716167 || v_loss:0.09798915758728981 || entropy: 0.06134676150977612
5490: rew_aw: -12.8 || len_Av: 5542.4 || pi_loss:-0.001411511363403406 || v_loss:0.1108429528772831 || entropy: 0.06101363413035869
5500: rew_aw: -14.5 || len_Av: 5255.3 || pi_loss:-0.0009891746667562984 || v_loss:0.10279479995369911 || entropy: 0.061430885642766955
5510: rew_aw: -12.7 || len_Av: 5864.4 || pi_loss:-0.0011533689219504594 || v_loss:0.1087005116045475 || entropy: 0.061104702204465865
5520: rew_aw: -14.2 || len_Av: 5489.9 || pi_loss:0.0040759362513199445 || v_loss:0.10360387489199638 || entropy: 0.060480961576104164
5530: rew_aw: -15.0 || len_Av: 4943.4 || pi_loss:0.0002245357783

6080: rew_aw: -15.2 || len_Av: 5038.4 || pi_loss:0.0014235732975066639 || v_loss:0.10886259153485298 || entropy: 0.06478055976331235
6090: rew_aw: -14.7 || len_Av: 5284.2 || pi_loss:0.0007753332596621476 || v_loss:0.1064748503267765 || entropy: 0.061037341877818105
6100: rew_aw: -13.8 || len_Av: 5672.3 || pi_loss:0.0007935044646728784 || v_loss:0.10678702890872956 || entropy: 0.0632092732936144
6110: rew_aw: -13.7 || len_Av: 5430.6 || pi_loss:0.0010873624618398026 || v_loss:0.10592545792460442 || entropy: 0.0642358459532261
6120: rew_aw: -15.8 || len_Av: 4990.8 || pi_loss:0.0017842415254563093 || v_loss:0.09516115821897983 || entropy: 0.06389915123581887
6130: rew_aw: -13.9 || len_Av: 5715.4 || pi_loss:2.5905435904860497e-05 || v_loss:0.106280467659235 || entropy: 0.06771675273776054
6140: rew_aw: -13.2 || len_Av: 5548.2 || pi_loss:-0.001507756661158055 || v_loss:0.11112130507826805 || entropy: 0.06665247604250908
6150: rew_aw: -15.0 || len_Av: 5250.9 || pi_loss:-0.00032100065727718173

6700: rew_aw: -15.5 || len_Av: 5460.0 || pi_loss:0.0015312048664782197 || v_loss:0.09818611741065979 || entropy: 0.05922613367438316
6710: rew_aw: -14.9 || len_Av: 5142.6 || pi_loss:-8.655198034830391e-05 || v_loss:0.10708827003836632 || entropy: 0.06046402081847191
6720: rew_aw: -13.5 || len_Av: 5378.1 || pi_loss:0.0008051005366723985 || v_loss:0.10888257399201393 || entropy: 0.05996761023998261
6730: rew_aw: -14.7 || len_Av: 5239.2 || pi_loss:-0.0006084573513362556 || v_loss:0.10091160088777543 || entropy: 0.0590776152908802
6740: rew_aw: -12.7 || len_Av: 5574.3 || pi_loss:0.0016791721514891832 || v_loss:0.11300623789429665 || entropy: 0.05801100879907608
6750: rew_aw: -14.5 || len_Av: 5500.5 || pi_loss:0.0006515280016174074 || v_loss:0.10656441077589988 || entropy: 0.057997891679406166
6760: rew_aw: -13.9 || len_Av: 5278.4 || pi_loss:-0.0005196622398216278 || v_loss:0.10489431321620941 || entropy: 0.057822690531611445
6770: rew_aw: -13.9 || len_Av: 5107.0 || pi_loss:-0.0001703441637

7320: rew_aw: -15.1 || len_Av: 5542.8 || pi_loss:-0.0005178372128284536 || v_loss:0.1058199681341648 || entropy: 0.0586431585252285
7330: rew_aw: -13.7 || len_Av: 5894.1 || pi_loss:-0.0003241919432184659 || v_loss:0.10835234969854354 || entropy: 0.05789959952235222
7340: rew_aw: -13.8 || len_Av: 6006.4 || pi_loss:-0.0006266797165153549 || v_loss:0.0978843092918396 || entropy: 0.05772297009825707
7350: rew_aw: -15.4 || len_Av: 5385.5 || pi_loss:0.0006230512633919715 || v_loss:0.11130209043622016 || entropy: 0.05508105270564556


KeyboardInterrupt: 

In [232]:
%reload_ext tensorboard
#--logdir=runs/vpg_1
# %tensorboard --logdir=runs/vpg_1

In [233]:
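# Note: SummaryWriter('pg_800') in this notebook writes its logs to ./pg_800,
# so --logdir must point there to view the current run.
%tensorboard --logdir=runs/vpg_5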

ERROR: Timed out waiting for TensorBoard to start. It may still be running as pid 29676.

In [None]:
# TODO: logger
# TODO: multi env