### Built-in environments

The environments fall into several categories: Classic Control (CartPole, Pendulum), the canonical environments used in textbooks; Box2D, environments built on a 2D physics engine for games; Toy Text (Blackjack), small and simple environments used to debug RL algorithms; and others.

In [1]:
import gymnasium as gym
import numpy as np
import torch
for i in gym.envs.registry.keys():
    print(i)

CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
phys2d/CartPole-v0
phys2d/CartPole-v1
phys2d/Pendulum-v0
LunarLander-v3
LunarLanderContinuous-v3
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v3
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v0
Taxi-v3
tabular/Blackjack-v0
tabular/CliffWalking-v0
Reacher-v2
Reacher-v4
Reacher-v5
Pusher-v2
Pusher-v4
Pusher-v5
InvertedPendulum-v2
InvertedPendulum-v4
InvertedPendulum-v5
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
InvertedDoublePendulum-v5
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
HalfCheetah-v5
Hopper-v2
Hopper-v3
Hopper-v4
Hopper-v5
Swimmer-v2
Swimmer-v3
Swimmer-v4
Swimmer-v5
Walker2d-v2
Walker2d-v3
Walker2d-v4
Walker2d-v5
Ant-v2
Ant-v3
Ant-v4
Ant-v5
Humanoid-v2
Humanoid-v3
Humanoid-v4
Humanoid-v5
HumanoidStandup-v2
HumanoidStandup-v4
HumanoidStandup-v5
GymV21Environment-v0
GymV26Environment-v0


In [4]:
env = gym.make("CartPole-v1", render_mode='human')

In [5]:
print("observation space: ", env.observation_space)

observation space:  Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)


The observation space has four dimensions:

* Cart position (-4.8, 4.8)
* Cart velocity (-inf, +inf)
* Pole angle (-0.4189, +0.4189), in radians
* Pole angular velocity (-inf, +inf)

In [6]:
observation, info = env.reset()
print("observation: ", observation)


observation:  [ 0.0454559   0.00248629  0.04908514 -0.01469923]
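
To make the four components easier to read, the raw array can be unpacked into named fields. This is only a small sketch: `CartPoleObservation` is a hypothetical helper, not part of Gymnasium, and the values are copied from the output above.

```python
from collections import namedtuple

# Hypothetical helper (not a Gymnasium type): names for the four observation entries.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values copied from the printed observation above.
obs = CartPoleObservation(0.0454559, 0.00248629, 0.04908514, -0.01469923)
print("pole angle (rad):", obs.pole_angle)
```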


In [7]:
print("action space: ", env.action_space)

action space:  Discrete(2)


This means there are a total of two actions an agent can take.

* 0: Push the cart to the left
* 1: Push the cart to the right

In [27]:
# env = gym.make("CartPole-v1", render_mode = 'human')
env = gym.make("CartPole-v1")
SEED = 1111
env.reset(seed=SEED)

np.random.seed(SEED)
torch.manual_seed(SEED)


<torch._C.Generator at 0x1d6dd646090>

### Using a simple policy gradient agent

The network below maps observed states to actions: given an input observation, it outputs one score (logit) per action, from which the agent's action is later sampled.

In [28]:
import torch.nn as nn
import torch.nn.functional as F
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        x = self.layer1(x)
        x = self.dropout(x)
        x = F.relu(x)
        x = self.layer2(x)
        return x

### Reward collection

It is common to discount future rewards with a discount factor and to normalize the array of stepwise returns, which helps keep training smooth and stable.

In [29]:
def calculate_stepwise_returns(rewards, discount_factor):
    # Accumulate discounted returns from the last step back to the first.
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)
    returns = torch.tensor(returns)
    # Normalize the returns to zero mean and unit standard deviation.
    normalized_returns = (returns - returns.mean()) / returns.std()
    return normalized_returns
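
As a worked example of the discounting and normalization described in this section, the sketch below redoes the same arithmetic in plain Python on a three-step episode. The function names and the discount factor of 0.5 are chosen only for illustration; `statistics.stdev` is used because it is the sample standard deviation, which is also what `torch.Tensor.std()` computes by default.

```python
from statistics import mean, stdev

def discounted_returns(rewards, discount_factor):
    # Walk backwards so each return is r_t + discount_factor * G_{t+1}.
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + discount_factor * running
        returns.append(running)
    returns.reverse()
    return returns

def normalize(values):
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [(v - mu) / sigma for v in values]

returns = discounted_returns([1.0, 1.0, 1.0], 0.5)
print(returns)  # [1.75, 1.5, 1.0]
normalized = normalize(returns)
print(normalized)
```

Earlier steps receive larger returns because they are credited with all the rewards that follow them.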

### Forward Pass

The forward pass runs the agent under the current policy until the episode ends, collecting the stepwise rewards and the log probabilities of the sampled actions. The steps are:

* Reset the environment to its initial state
* Initialize buffers to store the action log probabilities, the rewards, and the cumulative return
* Use the .step() function to iteratively run the agent in the environment until the episode terminates or is truncated:
    * Get the observation of the environment's state
    * Get the action scores (logits) predicted by the policy from the observation
    * Use the softmax function to convert the scores into action probabilities
    * Build a categorical probability distribution from these probabilities
    * Sample the distribution to get the agent's action
    * Compute the log probability of the sampled action under the distribution
    * Append the log probability of the action and the reward from the step to their respective buffers
* Compute the normalized, discounted returns at each step based on the rewards

In [30]:
from torch import distributions
def forward_pass(env, policy, discount_factor):
    log_prob_actions = []
    rewards = []
    done = False
    episode_return = 0
    policy.train()
    observation, info = env.reset()
    while not done:
        observation = torch.FloatTensor(observation).unsqueeze(0)
        action_pred = policy(observation)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = distributions.Categorical(action_prob)
        action = dist.sample()
        log_prob_action = dist.log_prob(action)
        observation, reward, terminated, truncated, info = env.step(action.item())
        # No explicit env.render() call: with render_mode='human', Gymnasium renders during step().
        done = terminated or truncated
        log_prob_actions.append(log_prob_action)
        rewards.append(reward)
        episode_return += reward
    log_prob_actions = torch.cat(log_prob_actions)
    stepwise_returns = calculate_stepwise_returns(rewards, discount_factor)
    return episode_return, stepwise_returns, log_prob_actions
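
To make the softmax, sampling, and log-probability steps concrete, here is a minimal plain-Python sketch of what `F.softmax` and `distributions.Categorical` do for a single step. The helper names and the example logits are illustrative assumptions, not part of the notebook's code.

```python
import math
import random

def softmax(logits):
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    probs = softmax(logits)
    # Draw one action index with probability given by probs (a categorical sample).
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Log probability of the action that was actually sampled.
    return action, math.log(probs[action])

rng = random.Random(0)
probs = softmax([0.0, 0.0])  # equal logits give equal probabilities
action, log_prob = sample_action([2.0, 0.5], rng)
print(probs, action, log_prob)
```

Sampling, rather than always taking the highest-scoring action, is what lets the agent explore both actions while the policy is still being learned.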

In [31]:
def calculate_loss(stepwise_returns, log_prob_actions):
    loss = -(stepwise_returns * log_prob_actions).sum()
    return loss
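
Written out, the loss computed by `calculate_loss` has the form of the standard REINFORCE (policy gradient) objective. The notation here is an assumption introduced for illustration: $\pi_\theta$ is the policy, $a_t$ and $s_t$ are the action and observation at step $t$, $\gamma$ is the discount factor, $T$ is the episode length, and $\hat{G}_t$ is the normalized version of the discounted return $G_t$.

```latex
$$
L(\theta) = -\sum_{t=0}^{T-1} \hat{G}_t \, \log \pi_\theta(a_t \mid s_t),
\qquad
G_t = r_t + \gamma \, G_{t+1} = \sum_{k=t}^{T-1} \gamma^{\,k-t} \, r_k
$$
```

Minimizing this loss raises the log probability of actions followed by above-average returns and lowers it for actions followed by below-average returns.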

### Policy update

The code below backpropagates the loss defined above, which updates the policy's parameters.

In [32]:
def update_policy(stepwise_returns, log_prob_actions, optimizer):
    stepwise_returns = stepwise_returns.detach()
    loss = calculate_loss(stepwise_returns, log_prob_actions)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In [35]:
from torch import optim
def main(): 
    MAX_EPOCHS = 500
    DISCOUNT_FACTOR = 0.99
    N_TRIALS = 25
    REWARD_THRESHOLD = 475
    PRINT_INTERVAL = 10
    INPUT_DIM = env.observation_space.shape[0]
    HIDDEN_DIM = 128
    OUTPUT_DIM = env.action_space.n
    DROPOUT = 0.5
    episode_returns = []
    policy = PolicyNetwork(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)
    LEARNING_RATE = 0.01
    optimizer = optim.Adam(policy.parameters(), lr = LEARNING_RATE)
    for episode in range(1, MAX_EPOCHS+1):
        episode_return, stepwise_returns, log_prob_actions = forward_pass(env, policy, DISCOUNT_FACTOR)
        _ = update_policy(stepwise_returns, log_prob_actions, optimizer)
        episode_returns.append(episode_return)
        mean_episode_return = np.mean(episode_returns[-N_TRIALS:])
        if episode % PRINT_INTERVAL == 0:
            print(f'| Episode: {episode:3} | Mean Rewards: {mean_episode_return:5.1f} |')
        if mean_episode_return >= REWARD_THRESHOLD:
            print(f'Reached reward threshold in {episode} episodes')
            break
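
The stopping rule in `main()` compares the mean of the last `N_TRIALS` episode returns against `REWARD_THRESHOLD`. The sketch below isolates that rule in a hypothetical helper, `make_return_tracker`, using a small window and made-up returns so the behavior is easy to follow; it is not part of the notebook's code.

```python
from collections import deque

def make_return_tracker(n_trials, reward_threshold):
    # Hypothetical helper: keeps only the last n_trials episode returns.
    window = deque(maxlen=n_trials)

    def record(episode_return):
        window.append(episode_return)
        mean_return = sum(window) / len(window)
        return mean_return, mean_return >= reward_threshold

    return record

record = make_return_tracker(n_trials=3, reward_threshold=475)
for ret in [100, 500, 500, 500]:
    mean_return, solved = record(ret)
    print(f"return={ret} mean={mean_return:.1f} solved={solved}")
```

Because the window is bounded, a single early low return stops affecting the mean once enough later episodes have been recorded.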

In [36]:
main()


| Episode:  10 | Mean Rewards:  28.8 |
| Episode:  20 | Mean Rewards:  26.8 |
| Episode:  30 | Mean Rewards:  31.0 |
| Episode:  40 | Mean Rewards:  45.6 |
| Episode:  50 | Mean Rewards:  62.7 |
| Episode:  60 | Mean Rewards:  78.8 |
| Episode:  70 | Mean Rewards: 108.1 |
| Episode:  80 | Mean Rewards: 129.2 |
| Episode:  90 | Mean Rewards: 148.8 |
| Episode: 100 | Mean Rewards: 159.8 |
| Episode: 110 | Mean Rewards: 130.6 |
| Episode: 120 | Mean Rewards: 104.9 |
| Episode: 130 | Mean Rewards: 182.3 |
| Episode: 140 | Mean Rewards: 206.2 |
| Episode: 150 | Mean Rewards: 149.4 |
| Episode: 160 | Mean Rewards:  69.2 |
| Episode: 170 | Mean Rewards:  81.0 |
| Episode: 180 | Mean Rewards: 180.2 |
| Episode: 190 | Mean Rewards: 293.8 |
| Episode: 200 | Mean Rewards: 380.9 |
| Episode: 210 | Mean Rewards: 324.4 |
| Episode: 220 | Mean Rewards: 250.0 |
| Episode: 230 | Mean Rewards: 222.4 |
| Episode: 240 | Mean Rewards: 195.4 |
| Episode: 250 | Mean Rewards: 142.5 |
| Episode: 260 | Mean Rew