# Day 27 - OpenAI Gym API and Gymnasium

## The anatomy of the agent

In [1]:
import random
from typing import List

In [2]:
class Environment:
    def __init__(self):
        self.steps_left = 10

    def get_observation(self) -> List[float]:
        return [0.0, 0.0, 0.0]

    def get_actions(self) -> List[int]:
        return [0, 1]

    def is_done(self) -> bool:
        return self.steps_left == 0

    def action(self, action: int) -> float:
        if self.is_done():
            raise Exception("Game is over")
        self.steps_left -= 1
        return random.random()

In [3]:
class Agent:
    def __init__(self):
        self.total_reward = 0.0

    def step(self, env: Environment):
        current_obs = env.get_observation()
        actions = env.get_actions()
        reward = env.action(random.choice(actions))
        self.total_reward += reward

In [4]:
env = Environment()
agent = Agent()

while not env.is_done():
    agent.step(env)
    
print(f"Total reward got: {agent.total_reward:.4f}")

Total reward got: 6.0649


## The OpenAI Gym API and Gymnasium

### The action space

### The observation space

* Gymnasium's `Space` classes expose one property and three methods that are important to us:
    1. `shape`: the shape of a sample, just like a NumPy shape
    2. `sample()`: returns a random sample from the space
    3. `contains(x)`: returns `True` if `x` is part of the space
    4. `seed()`: seeds the space's random number generator for reproducible runs
* Here are some examples:

In [5]:
import numpy as np
import gymnasium as gym
from gymnasium.spaces import Tuple, Box, Discrete

Tuple(spaces=(
    Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),
    Discrete(n=3),
    Discrete(n=2),
))

Tuple(Box(-1.0, 1.0, (3,), float32), Discrete(3), Discrete(2))

### The environment

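* An environment (the `Env` class in Gymnasium) has an `action_space`, an `observation_space`, a `reset()` method, and a `step()` method
* `reset()` returns `obs, info`, and `step()` returns `obs, reward, terminated, truncated, info`
    * `terminated` signals that the episode reached a terminal state, while `truncated` signals that it was cut off externally, for example by a time limit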
    * Here, `info` is a dictionary with optional information that the environment can include
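
To make the shape of these return values concrete, here is a sketch that recasts the toy environment from the first section in the `reset()`/`step()` form. `ToyEnv` is a hypothetical stand-in written for this note, not a `gymnasium.Env` subclass, and the names `terminated` and `truncated` follow Gymnasium's documentation for the two end-of-episode flags.

```python
import random
from typing import Any, Dict, List, Tuple


class ToyEnv:
    """Hypothetical stand-in that only mimics the shape of reset() and step()."""

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self) -> Tuple[List[float], Dict[str, Any]]:
        self.steps_taken = 0
        return [0.0, 0.0, 0.0], {}

    def step(self, action: int) -> Tuple[List[float], float, bool, bool, Dict[str, Any]]:
        self.steps_taken += 1
        obs = [0.0, 0.0, 0.0]
        reward = random.random()
        terminated = False  # this toy task has no goal or failure state
        truncated = self.steps_taken >= self.max_steps  # time limit reached
        info = {"steps_taken": self.steps_taken}
        return obs, reward, terminated, truncated, info


toy = ToyEnv(max_steps=10)
obs, info = toy.reset()
total_reward = 0.0
while True:
    obs, reward, terminated, truncated, info = toy.step(random.choice([0, 1]))
    total_reward += reward
    # An episode is over when either flag is set.
    if terminated or truncated:
        break

print(info["steps_taken"], terminated, truncated)  # 10 False True
```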

### Creating an environment

* To create an environment, Gymnasium provides the `make` function, whose only required argument is the id of an environment; optional keyword arguments such as `render_mode` can be passed as well

### The CartPole session

In [6]:
e = gym.make("CartPole-v1")

In [7]:
obs, info = e.reset()
obs, info

(array([-0.03194962, -0.04086227, -0.0243945 ,  0.01727769], dtype=float32),
 {})

In [8]:
e.action_space, e.observation_space

(Discrete(2),
 Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32))

In [9]:
e.step(0)

(array([-0.03276687, -0.23562603, -0.02404895,  0.30216515], dtype=float32),
 1.0,
 False,
 False,
 {})

In [10]:
e.action_space.sample(), e.observation_space.sample()

(1, array([2.2864687 , 0.26136494, 0.24771014, 0.39573586], dtype=float32))

## The random CartPole agent

In [11]:
env = gym.make("CartPole-v1")

In [12]:
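def random_episode(env):
    total_steps = 0
    total_reward = 0.0
    env.reset()

    while True:
        action = env.action_space.sample()
        # step() returns obs, reward, terminated, truncated, info
        _, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        total_steps += 1
        # Stop on a terminal state or when the time limit cuts the episode off
        if terminated or truncated:
            break
    return total_steps, total_reward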

In [13]:
episodes = [random_episode(env) for _ in range(500)]
steps, returns = [], []
for s, ret in episodes:
    steps.append(s)
    returns.append(ret)

average_steps = np.mean(steps)
average_return = np.mean(returns)

print(f"Average episode done in {average_steps:.2f} steps. Average return: {average_return:.2f}.")

Average episode done in 22.76 steps. Average return: 22.76.


## Extra Gym API functionality

### Wrappers

* As it is often convenient to transform the observations, keep track of the last $n$ frames, or perform steps like reward normalization, Gymnasium provides a `Wrapper` class to wrap environments
* For convenience, there are two properties:
    1. `env`, which is the environment being wrapped by this wrapper
    2. `unwrapped`, which is the base environment at the center of all wrappers
* There also exist `ObservationWrapper`, `RewardWrapper`, and `ActionWrapper` classes
* These allow selective wrapping, requiring the overriding of the `observation(obs)`, `reward(rew)`, and `action(a)` methods respectively
* As an example, this wrapper makes any agent epsilon-greedy with $\varepsilon=0.1$:

In [14]:
class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env: gym.Env, epsilon: float = 0.1):
        super().__init__(env)
        self.epsilon = epsilon

    def action(self, action: gym.core.WrapperActType) -> gym.core.WrapperActType:
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()
            print(f"Random action: {action}")
        return action

In [15]:
env = RandomActionWrapper(gym.make("CartPole-v1"))
random_episode(env)

Random action: 1
Random action: 0
Random action: 1
Random action: 1


(52, 52.0)

### Rendering the environment

* With the way I run Jupyter, the code below will not work:
```python 
env = gym.make("CartPole-v1", render_mode="rgb_array")
env = gym.wrappers.HumanRendering(env)
random_episode(env)
```
* Instead, we will skip to recording a video and displaying it in the notebook

In [16]:
video_folder = "./DRL/videos/"

In [17]:
import warnings

warnings.filterwarnings('ignore', message='.*Overwriting existing videos.*')

env = gym.make("CartPole-v1", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(env, video_folder=video_folder)
random_episode(env)
env.close()

In [18]:
from IPython.display import Video

# RecordVideo numbers episodes from 0, so the single episode above is saved as episode 0
Video(video_folder + "rl-video-episode-0.mp4")