<a href="https://colab.research.google.com/github/CM134/RL_Project_02456/blob/large_impala/large_IMPALA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with PPO and ProcGen

Here's a bit of code that should help you get started on your projects.

The cell below installs `procgen` and downloads a small `utils.py` script that contains some utility functions. You may want to inspect the file for more details.

In [None]:
!pip uninstall -y imgaug
!pip install imgaug==0.2.6

!pip install procgen
!wget https://raw.githubusercontent.com/nicklashansen/ppo-procgen-utils/main/utils.py

Found existing installation: imgaug 0.2.6
Uninstalling imgaug-0.2.6:
  Would remove:
    /usr/local/lib/python3.7/dist-packages/imgaug-0.2.6.dist-info/*
    /usr/local/lib/python3.7/dist-packages/imgaug/*
Proceed (y/n)? y
  Successfully uninstalled imgaug-0.2.6
Collecting imgaug==0.2.6
  Using cached imgaug-0.2.6-py3-none-any.whl
Installing collected packages: imgaug
Successfully installed imgaug-0.2.6
--2021-11-22 20:25:35--  https://raw.githubusercontent.com/nicklashansen/ppo-procgen-utils/main/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14807 (14K) [text/plain]
Saving to: ‘utils.py.3’


2021-11-22 20:25:35 (78.1 MB/s) - ‘utils.py.3’ saved [14807/14807]



Hyperparameters. These values should be a good starting point. You can modify them later once you have a working implementation.

In [None]:
# Hyperparameters
total_steps = 1000000
num_envs = 32
num_levels = 10
num_steps = 256
num_epochs = 3
batch_size = 512
eps = .2
grad_eps = .5
value_coef = .5
entropy_coef = .01
feature_dim = 256


# Environment 
envname = "coinrun"
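
As a rough sanity check on these values, the sketch below works out how much data each training iteration collects and how many gradient updates follow from it. It assumes that the storage generator splits every rollout into disjoint minibatches of `batch_size` transitions, which `utils.py` may or may not do exactly; the helper name `training_budget` is made up for illustration.

```python
import math

def training_budget(total_steps, num_envs, num_steps, num_epochs, batch_size):
    """Derive rollout and update counts from the hyperparameters (illustrative helper)."""
    # Transitions gathered per iteration: num_steps steps in each of num_envs environments
    rollout_size = num_envs * num_steps
    # The training loop keeps going until the step counter reaches total_steps
    iterations = math.ceil(total_steps / rollout_size)
    # Assumption: each epoch visits the rollout once in disjoint minibatches
    minibatches_per_epoch = rollout_size // batch_size
    updates_per_iteration = num_epochs * minibatches_per_epoch
    return rollout_size, iterations, minibatches_per_epoch, updates_per_iteration

budget = training_budget(
    total_steps=1000000, num_envs=32, num_steps=256, num_epochs=3, batch_size=512
)
print(budget)  # (8192, 123, 16, 48)
```

The rollout size of 8192 is the amount by which the step counter grows in each line of the training log.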

Network definitions. We have defined a policy network for you in advance. The cell below defines both the popular `NatureDQN` encoder and a larger `Impala` encoder from [this paper](https://arxiv.org/pdf/1802.01561.pdf) (without the LSTM); the `Impala` encoder is the one used for training here. In both cases the policy and value functions are linear projections of the encodings. There is plenty of room to experiment with architectures, so feel free to do so!
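
The input sizes of the linear layers in both encoders (`1024` and `32*8*8`) follow from the 64x64 observations and the layer settings. A minimal sketch of that arithmetic is given here, using the standard output-size formula for convolution and pooling layers with dilation 1; `conv_out` is an illustrative helper, not part of `utils.py`.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a Conv2d or MaxPool2d layer (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# NatureDQN-style encoder: three unpadded convolutions on a 64x64 input
size = 64
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel_size, stride)
nature_features = 64 * size * size  # last conv has 64 output channels

# IMPALA encoder: three blocks, each a size-preserving 3x3 conv
# followed by a max pool (kernel 3, stride 2, padding 1) that halves the size
size = 64
for _ in range(3):
    size = conv_out(size, kernel_size=3, stride=1, padding=1)  # conv keeps the size
    size = conv_out(size, kernel_size=3, stride=2, padding=1)  # pool halves it
impala_features = 32 * size * size  # last block has 32 output channels

print(nature_features, impala_features)  # 1024 2048
```

If you change the observation size, kernel sizes, or strides, the `in_features` of the corresponding linear layer has to be recomputed in the same way.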

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from utils import make_env, Storage, orthogonal_init


def xavier_uniform_init(module, gain=1.0):
    if isinstance(module, nn.Linear) or isinstance(module, nn.Conv2d):
        nn.init.xavier_uniform_(module.weight.data, gain)
        nn.init.constant_(module.bias.data, 0)
    return module


class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)


class Encoder(nn.Module):
  def __init__(self, in_channels, feature_dim):
    super().__init__()
    self.layers = nn.Sequential(
        nn.Conv2d(in_channels=in_channels, out_channels=32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1), nn.ReLU(),
        Flatten(),
        nn.Linear(in_features=1024, out_features=feature_dim), nn.ReLU()
    )
    self.apply(orthogonal_init)

  def forward(self, x):
    return self.layers(x)


# Large IMPALA encoder 

class ResidualBlock(nn.Module):
  def __init__(self, in_channels):
    super().__init__()
    self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=3, stride=1, padding=1)
    self.conv2 = nn.Conv2d(in_channels=in_channels, out_channels=in_channels, kernel_size=3, stride=1, padding=1)

  def forward(self, x):
    out = F.relu(x)
    out = self.conv1(out)
    out = F.relu(out)
    out = self.conv2(out)
    return out + x


class ImpalaBlock(nn.Module):
  def __init__(self, in_channels, out_channels):
    super().__init__()
    self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, stride=1, padding=1)
    self.res1 = ResidualBlock(out_channels)
    self.res2 = ResidualBlock(out_channels)

  def forward(self, x):
    x = self.conv1(x)
    x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
    x = self.res1(x)
    x = self.res2(x)
    return x


class ImpalaModel(nn.Module):
  def __init__(self, in_channels, feature_dim):
    super().__init__()
    self.imp1 = ImpalaBlock(in_channels=in_channels, out_channels=16)
    self.imp2 = ImpalaBlock(in_channels=16, out_channels=32)
    self.imp3 = ImpalaBlock(in_channels=32, out_channels=32)
    self.fc1 = nn.Linear(in_features=32*8*8, out_features=feature_dim)
    self.output_dim = feature_dim
    self.apply(xavier_uniform_init)

  def forward(self, x):
    x = self.imp1(x)
    x = self.imp2(x)
    x = self.imp3(x)
    x = F.relu(x)
    x = torch.flatten(x, start_dim=1)
    x = self.fc1(x)
    x = F.relu(x)
    return x


class Policy(nn.Module):
  def __init__(self, encoder, feature_dim, num_actions):
    super().__init__()
    self.encoder = encoder
    self.policy = orthogonal_init(nn.Linear(feature_dim, num_actions), gain=.01)
    self.value = orthogonal_init(nn.Linear(feature_dim, 1), gain=1.)

  def act(self, x):
    with torch.no_grad():
      x = x.cuda().contiguous()
      dist, value = self.forward(x)
      action = dist.sample()
      log_prob = dist.log_prob(action)
    
    return action.cpu(), log_prob.cpu(), value.cpu()

  def forward(self, x):
    x = self.encoder(x)
    logits = self.policy(x)
    value = self.value(x).squeeze(1)
    dist = torch.distributions.Categorical(logits=logits)

    return dist, value


# Define environment
# check the utils.py file for info on arguments
env = make_env(num_envs, num_levels=num_levels, env_name=envname)

print('Observation space:', env.observation_space)
print('Action space:', env.action_space.n)


# Define network
#encoder = Encoder(in_channels=env.observation_space.shape[0], feature_dim=feature_dim)
encoder = ImpalaModel(in_channels=env.observation_space.shape[0], feature_dim=feature_dim)
policy = Policy(encoder=encoder, feature_dim=feature_dim, num_actions=env.action_space.n)
policy.cuda()

# Define optimizer
# these are reasonable values but probably not optimal
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4, eps=1e-5)

# Define temporary storage
# we use this to collect transitions during each iteration
storage = Storage(
    env.observation_space.shape,
    num_steps,
    num_envs
)

# Run training
obs = env.reset()
step = 0
while step < total_steps:

  # Use policy to collect data for num_steps steps
  policy.eval()
  for _ in range(num_steps):
    # Use policy
    action, log_prob, value = policy.act(obs)
    
    # Take step in environment
    next_obs, reward, done, info = env.step(action)

    # Store data
    storage.store(obs, action, reward, done, info, log_prob, value)
    
    # Update current observation
    obs = next_obs

  # Add the last observation to collected data
  _, _, value = policy.act(obs)
  storage.store_last(obs, value)

  # Compute return and advantage
  storage.compute_return_advantage()

  # Optimize policy
  policy.train()
  for epoch in range(num_epochs):

    # Iterate over batches of transitions
    generator = storage.get_generator(batch_size)
    for batch in generator:
      b_obs, b_action, b_log_prob, b_value, b_returns, b_advantage = batch

      # Get current policy outputs
      new_dist, new_value = policy(b_obs)
      new_log_prob = new_dist.log_prob(b_action)

      # Clipped policy objective
      ratio = torch.exp(new_log_prob - b_log_prob)
      surr1 = ratio * b_advantage
      surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * b_advantage
      pi_loss = - torch.min(surr1, surr2)

      # Value function objective (squared error, not clipped)
      value_loss = value_coef * (b_returns - new_value).pow(2)

      # Entropy loss
      entropy_loss = entropy_coef * new_dist.entropy()

      # Backpropagate losses
      loss = pi_loss + value_loss - entropy_loss
      loss.mean().backward()

      # Clip gradients
      torch.nn.utils.clip_grad_norm_(policy.parameters(), grad_eps)

      # Update policy
      optimizer.step()
      optimizer.zero_grad()

  # Update stats
  step += num_envs * num_steps
  print(f'Step: {step}\tMean reward: {storage.get_reward()}')

print('Completed training!')
torch.save(policy.state_dict(), 'checkpoint.pt')

Observation space: Box(0.0, 1.0, (3, 64, 64), float32)
Action space: 15
Step: 8192	Mean reward: 0.0
Step: 16384	Mean reward: 0.3125
Step: 24576	Mean reward: 0.625
Step: 32768	Mean reward: 1.5625
Step: 40960	Mean reward: 4.6875
Step: 49152	Mean reward: 7.1875
Step: 57344	Mean reward: 7.5
Step: 65536	Mean reward: 9.6875
Step: 73728	Mean reward: 9.0625
Step: 81920	Mean reward: 11.25
Step: 90112	Mean reward: 12.5
Step: 98304	Mean reward: 10.625
Step: 106496	Mean reward: 14.0625
Step: 114688	Mean reward: 15.9375
Step: 122880	Mean reward: 15.625
Step: 131072	Mean reward: 19.0625
Step: 139264	Mean reward: 20.625
Step: 147456	Mean reward: 20.0
Step: 155648	Mean reward: 23.4375
Step: 163840	Mean reward: 24.6875
Step: 172032	Mean reward: 24.0625
Step: 180224	Mean reward: 27.8125
Step: 188416	Mean reward: 18.4375
Step: 196608	Mean reward: 25.0
Step: 204800	Mean reward: 22.5
Step: 212992	Mean reward: 29.375
Step: 221184	Mean reward: 25.625
Step: 229376	Mean reward: 21.5625
Step: 237568	Mean reward
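
The clipped policy objective in the training loop above can be illustrated on single samples. The sketch below mirrors the `ratio`, `surr1`, `surr2`, and `pi_loss` lines in plain Python rather than PyTorch; `clipped_surrogate` and the probabilities used are illustrative choices, not part of the notebook.

```python
import math

def clipped_surrogate(new_log_prob, old_log_prob, advantage, eps=0.2):
    """Per-sample PPO policy loss, mirroring pi_loss in the training loop."""
    ratio = math.exp(new_log_prob - old_log_prob)
    surr1 = ratio * advantage
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    surr2 = clipped_ratio * advantage
    # Take the more pessimistic of the two surrogates, negated to form a loss
    return -min(surr1, surr2)

old = math.log(0.5)
# Ratio 1.5 with a positive advantage: the gain is capped at 1 + eps
print(round(clipped_surrogate(math.log(0.75), old, advantage=1.0), 4))   # -1.2
# Ratio 0.5 with a positive advantage: the unclipped term is smaller and is kept
print(round(clipped_surrogate(math.log(0.25), old, advantage=1.0), 4))   # -0.5
# Ratio 1.5 with a negative advantage: the unclipped penalty is kept
print(round(clipped_surrogate(math.log(0.75), old, advantage=-1.0), 4))  # 1.5
```

In the first case the loss no longer depends on the ratio, so that sample contributes no policy gradient once the action probability has moved more than `eps` in the favorable direction.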

The cell below can be used for policy evaluation; it also saves the rendered frames to an mp4 file for you to view.

In [None]:
import imageio

# Make evaluation environment
eval_env = make_env(num_envs, start_level=num_levels, num_levels=num_levels, env_name=envname)
obs = eval_env.reset()

frames = []
total_reward = []

# Evaluate policy
policy.eval()
for _ in range(512):

  # Use policy
  action, log_prob, value = policy.act(obs)

  # Take step in environment
  obs, reward, done, info = eval_env.step(action)
  total_reward.append(torch.Tensor(reward))

  # Render environment and store
  frame = (torch.Tensor(eval_env.render(mode='rgb_array'))*255.).byte()
  frames.append(frame)

# Calculate average return
total_reward = torch.stack(total_reward).sum(0).mean(0)
print('Average return:', total_reward)

# Save frames as video
frames = torch.stack(frames)
imageio.mimsave('vid.mp4', frames, fps=25)

Average return: tensor(56.7062)
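
Note that the loop above sums rewards over a fixed 512 steps per environment, so the reported number covers every episode an environment plays in that window rather than a single episode's return. A minimal sketch of per-episode bookkeeping follows, assuming `done` is a per-environment flag that is true on the step an episode ends and that the environment then resets automatically; `episode_returns` is a hypothetical helper and the toy rewards are invented for illustration.

```python
def episode_returns(rewards, dones):
    """Split one environment's reward stream into completed-episode returns."""
    returns, running = [], 0.0
    for reward, done in zip(rewards, dones):
        running += reward
        if done:
            # Episode finished on this step: record its return and start over
            returns.append(running)
            running = 0.0
    return returns  # rewards from an unfinished final episode are discarded

rewards = [0, 0, 10, 0, 0, 0, 10, 0]
dones = [False, False, True, False, True, False, True, False]
per_episode = episode_returns(rewards, dones)
print(per_episode)   # [10.0, 0.0, 10.0]
print(sum(rewards))  # 20
```

Averaging `per_episode` across environments gives a mean episode return, which is easier to compare between runs than a sum over a fixed number of steps.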


In [None]:
#Average return: tensor(13.2928)

#Step: 8192	Mean reward: 3.75
#Step: 16384	Mean reward: 4.21875
#Step: 24576	Mean reward: 3.5
#Step: 32768	Mean reward: 4.28125
#Step: 40960	Mean reward: 4.375
#Step: 49152	Mean reward: 5.15625
#Step: 57344	Mean reward: 5.28125