# Custom Environments and Wrappers in Gymnasium

Before I can build and run a version of the Embodied Communication Game, I must learn how to build custom environments in Gymnasium. Here I'm going to run through a few tutorials on building custom `Envs` and modifying them with custom `wrappers`.

## Designing Custom Environments

In [1]:
from typing import Optional
import numpy as np
import gymnasium as gym

In [5]:
class GridWorldEnv(gym.Env):

    #Initializes the environment with specific attributes including size, observation_space, action_space, and any other variables 
    # defining the agent, environment, or reward structure.
    def __init__(self, size: int = 5):
        # The size of the square grid
        self.size = size

        # Define the agent and target location; randomly chosen in `reset` and updated in `step`
        self._agent_location = np.array([-1, -1], dtype=np.int32)
        self._target_location = np.array([-1, -1], dtype=np.int32)

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is encoded as an element of {0, ..., `size`-1}^2
        self.observation_space = gym.spaces.Dict(
            {
                "agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = gym.spaces.Discrete(4)
        # Dictionary maps the abstract actions to the directions on the grid
        self._action_to_direction = {
            0: np.array([1, 0]),  # right
            1: np.array([0, 1]),  # up
            2: np.array([-1, 0]),  # left
            3: np.array([0, -1]),  # down
        }
    #A common design pattern is to include a _get_obs method for translating state into an observation. However, this helper method
    # is not mandatory, and you might want to compute observations directly in env.reset and env.step, which may be preferable if 
    # you want to compute them differently in each method call.
    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}
    
    #A similar pattern, _get_info can be used to return auxiliary information. In this Env, we would like to calculate and return
    # Manhattan distance from the agent to the target square.
    def _get_info(self):
        return {
            "distance": np.linalg.norm(
                self._agent_location - self._target_location, ord=1
            )
        }
    #Reset is called to initiate a new episode for an environment and has two parameters, seed and options. Seed initializes the
    # random number generator to allow us to consistently generate the same environment when there are random variables involved.
    # Options is a dict containing any additional parameters we might want to specify during the reset.
    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        #We need the following line to seed self.np_random
        super().reset(seed=seed)
        #Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0,self.size,size=2,dtype=int)
        #Sample random target locations until they do not coincide with the agent's starting location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location,self._agent_location):
            self._target_location = self.np_random.integers(
                0,self.size,size=2,dtype=int
            )

        
        observation = self._get_obs()
        info = self._get_info()
        
        return observation,info
    
    def step(self, action):
        #Map the action (element of {0,1,2,3}) to a direction on the map, using our helper dictionary
        direction = self._action_to_direction[action]
        #We use `np.clip` to make sure we don't leave the grid bounds
        self._agent_location = np.clip(
            self._agent_location + direction, 0, self.size - 1)

        #The episode terminates once the agent reaches the target location
        terminated = np.array_equal(self._agent_location, self._target_location)
        truncated = False
        reward = 1 if terminated else 0
        observation = self._get_obs()
        info = self._get_info()

        return observation, reward, terminated, truncated, info

#Now that we've defined the environment in its own class, we can register it with gymnasium under a particular id,
# which we can then pass to gym.make() to instantiate this custom environment.
gym.register(id="GridWorld-v0",
             entry_point = GridWorldEnv)
    
        

  logger.warn(f"Overriding environment {new_spec.id} already in registry.")


In [6]:
#Instantiating the registered version of our custom environment using gym.make().
gym.make("GridWorld-v0")


<OrderEnforcing<PassiveEnvChecker<GridWorldEnv<GridWorld-v0>>>>

## Designing My Own Custom Environment

### Simplified single-player color-matching game

Now I'll design my own custom environment in grid world. It will have elements of the ECG, but be designed to be solvable by a single player. In this environment, the agent has to travel to the square whose color matches a given target color. The squares are arranged in an NxN grid (2x2 by default, giving four squares), and the colors and the agent's starting position are randomized.

In [51]:
class SimpleColorGame(gym.Env):
    #Initializes the Env, including observation space and action space. This one initializes the Observation space as a grid
    # of boxes with colors assigned to them, and the action space as the movement of the agent along the grid.
    def __init__(self,size=2,step_limit=200):
        #The size of one side of the square grid. It will be NxN squares in area, where N is self.size
        self.size = size
        self._num_colors = size**2

        #This is a time limit on the number of steps the agent is allowed to take in the game. This is necessary to
        # prevent the game from running forever if the agent's policy prevents it from moving or reaching the target.
        self._step_limit = step_limit
        #Integer to keep track of the number of steps taken in a particular iteration of the game
        self._step_count = 0

        #The agent location is stored inside of a local variable.
        self._agent_location = np.array([-1,-1], dtype=np.int32)

        #The colors of the boxes are also stored in a local variable and are shuffled in reset(). For this
        # version of the game, I will substitute integer values for colors.
        self._square_colors = np.arange(self._num_colors).reshape(size,size)

        #The target color is an integer from 0 to self._num_colors-1. The value drawn here is only a placeholder;
        # it is re-drawn from the seeded generator in the reset() method.
        self._target_color = np.random.randint(0,self._num_colors)

        # Observations are dictionaries with the agent's location, the grid of square colors, and the target color.
        # The agent's location is encoded as an element of {0, ..., `size`-1}^2
        self.observation_space = gym.spaces.Dict(
            {
                "agent location": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
                "square colors": gym.spaces.Box(0, self._num_colors-1, shape=(size,size), dtype=int),
                "target color": gym.spaces.Discrete(self._num_colors)
            }
        )
        
        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = gym.spaces.Discrete(4)
        
        # Dictionary maps the abstract actions to the directions on the grid
        self._action_to_direction = {
            0: np.array([1, 0]),  # right
            1: np.array([0, 1]),  # up
            2: np.array([-1, 0]),  # left
            3: np.array([0, -1]),  # down
        }

    #Helper method used to get the observation from the state, useful in reset and step methods. This version returns
    # the properties of agent location, square colors, and the target color.
    def _get_obs(self):
        return {
                "agent location": self._agent_location,
                "square colors": self._square_colors,
                "target color": self._target_color
        }

    #Helper method used to get auxiliary information from the state. Currently returns nothing.
    def _get_info(self):
        info = { "info": None }
        return info

    #Helper method for calculating the reward from the state. This will be useful as I can override it in child classes.
    def _get_reward(self):
        reward = 1 if (self._square_colors[tuple(self._agent_location)] == self._target_color) else 0
        return reward

    
    #Reset the environment to an initial configuration. The initial state may involve some randomness, so the seed argument
    # is used to guarantee an identical initial state whenever reset() is called with that seed. Options is a dict containing
    # any additional parameters we might want to specify during the reset.
    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        
        #Firstly, we will call this method to seed self.np_random with the seed argument if given.
        super().reset(seed=seed)

        #Reset the step count to 0 for the new iteration of the game
        self._step_count = 0

        #Now randomly generate a starting location for the agent using self.np_random. We generate an array of size two
        # representing the agent's starting coordinates.
        self._agent_location = self.np_random.integers(0,self.size,size=2)

        #Generate a random permutation of the square colors, and reshape them into a sizeXsize grid.
        self._square_colors = self.np_random.permutation(self._num_colors).reshape(self.size,self.size)

        #Now we generate the target color, a random integer from 0 to self._num_colors-1 (the upper bound is exclusive).
        self._target_color = self.np_random.integers(0,self._num_colors)

        #Now we can return the observation and auxiliary info
        observation = self._get_obs()
        info = self._get_info()

        return observation, info

    #Takes an action as input and updates the state of the Env according to that Action. Step then returns an observation
    # containing the new Env state, as well as some other additional variables and info.
    def step(self, action):
        #First, iterate the step count by one
        self._step_count += 1
        
        #Next, we convert our action to a direction.
        direction = self._action_to_direction[action]
        
        #Then we add the direction coordinates to the agent coordinates to get the new agent location. We must clip the
        # agent location at the Box boundary, so the agent's coordinates stay between 0 and self.size-1.
        self._agent_location = np.clip(self._agent_location + direction,0,self.size-1)
        
        #Now we terminate the game and give the agent a reward if the square it's standing on is the target color.
        terminated = (self._square_colors[tuple(self._agent_location)] == self._target_color)
        
        #We also truncate the game if self._step_count > self._step_limit.
        truncated = (self._step_count > self._step_limit)
        
        #Reward is 1 if we are on the target color square, otherwise 0
        reward = self._get_reward()

        #Finally, use the helper functions to generate Obs and Info.
        observation = self._get_obs()
        info = self._get_info()
        
        return observation, reward, terminated, truncated, info

#Now let's register this environment with a namespace and try calling gym.make on it later
gym.register(id="SimpleColorGame-v0",
            entry_point = SimpleColorGame)

In [70]:
#Let's see if the registration worked! We ran into a few errors with incorrect method/variable names, but after fixing those it appears this
# game was able to load!
env = gym.make("SimpleColorGame-v0")

## Learning in the Custom Env with Stable-Baselines

Now that I have a custom environment defined, I can try unleashing the Stable-Baselines RL algorithms on it to see if I can train a model to succeed. Of course, I have no idea currently if the environment is working properly, so I'll have to find out through trial and error.

In [74]:
#IMPORTS
from stable_baselines3 import A2C, PPO, DQN
from stable_baselines3.common.evaluation import evaluate_policy

In [71]:
print("Mean Reward of A2C Model, Simple Color Game, NxN Grid, 10000 training timesteps\n")
for i in range(2,7):
    env = gym.make("SimpleColorGame-v0",size=i,step_limit=(i**3))
    model_A2C = A2C("MultiInputPolicy", env, verbose = 0)
    
    mean_reward_pre, std_reward_pre = evaluate_policy(model_A2C, env, n_eval_episodes = 100) #Pre-training evaluation     
    model_A2C.learn(total_timesteps=10000)#training the model
    mean_reward, std_reward = evaluate_policy(model_A2C, env, n_eval_episodes = 100) #Post-training evaluation
    
    print(f"{i}x{i}: Pre-training  [{mean_reward_pre:.2f} +/- {std_reward_pre:.2f}], Post-training [{mean_reward:.2f} +/- {std_reward:.2f}]")

Mean Reward of A2C Model, Simple Color Game, NxN Grid, 10000 training timesteps
2x2: Pre-training  [0.18 +/- 0.38], Post-training [0.67 +/- 0.47]
3x3: Pre-training  [0.20 +/- 0.40], Post-training [0.17 +/- 0.38]
4x4: Pre-training  [0.15 +/- 0.36], Post-training [0.12 +/- 0.32]
5x5: Pre-training  [0.05 +/- 0.22], Post-training [0.07 +/- 0.26]
6x6: Pre-training  [0.03 +/- 0.17], Post-training [0.12 +/- 0.32]


## A2C learned the simple color game!

It took some finagling, but the A2C model now works with my custom color game! I had to make one major modification to the Env to allow the model to learn the game: I added a time limit, *self._step_limit*, which sets the **truncated** flag to *True* and unceremoniously ends the game if the agent takes more than *self._step_limit* steps without reaching the target square. I had to do this because the untrained A2C policy was stubbornly staying in the corner and not moving at all, causing the game to last forever.
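To see why the corner case never ends on its own, here is a minimal stand-alone sketch of the movement rule. It is plain Python rather than the Gymnasium class, and `move` and `run_constant_policy` are hypothetical helper names I made up for illustration; the clipping and the `step_count > step_limit` check are meant to mirror `SimpleColorGame.step`.

```python
# Hypothetical stand-alone mirror of the movement rule in SimpleColorGame.step.
# Plain Python, so it runs without Gymnasium or NumPy.
ACTION_TO_DIRECTION = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}


def move(location, action, size):
    """Apply one action and clip the result to the grid, like np.clip in step()."""
    dx, dy = ACTION_TO_DIRECTION[action]
    x = min(max(location[0] + dx, 0), size - 1)
    y = min(max(location[1] + dy, 0), size - 1)
    return (x, y)


def run_constant_policy(start, target, action, size, step_limit):
    """Repeat one action until the target is reached or the step limit is exceeded."""
    location, step_count = start, 0
    while True:
        step_count += 1
        location = move(location, action, size)
        terminated = location == target
        truncated = step_count > step_limit
        if terminated or truncated:
            return step_count, terminated, truncated


# An agent in the (0, 0) corner that always chooses "left" (action 2) never moves,
# so only the step limit ends the episode.
print(run_constant_policy(start=(0, 0), target=(1, 1), action=2, size=2, step_limit=8))
# prints: (9, False, True)
```

Without the `truncated` check the `while` loop above would spin forever, which is the same failure I hit with the untrained policy.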

However, it looks like the model may just be learning to wander around the environment as much as possible to stumble upon the square of the correct color. I want it to use the color grid and the target color to reach the target square as fast as possible.
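One way to check the "just wandering" suspicion is to compare against a walker that picks actions uniformly at random. The sketch below is a rough baseline under my own assumptions: it is plain Python rather than the registered environment, `random_walk_success_rate` is a hypothetical helper, and the target square is drawn uniformly (which is what a random color permutation plus a random target color should amount to). Note that `evaluate_policy` acts deterministically by default, so an untrained model is not the same thing as this random walker.

```python
import random

ACTION_TO_DIRECTION = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}


def random_walk_success_rate(size, step_limit, episodes=2000, seed=0):
    """Fraction of episodes in which a uniformly random walker reaches the target."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        agent = (rng.randrange(size), rng.randrange(size))
        target = (rng.randrange(size), rng.randrange(size))
        # The environment truncates once step_count > step_limit,
        # so an episode lasts at most step_limit + 1 steps.
        for _ in range(step_limit + 1):
            dx, dy = ACTION_TO_DIRECTION[rng.randrange(4)]
            agent = (
                min(max(agent[0] + dx, 0), size - 1),
                min(max(agent[1] + dy, 0), size - 1),
            )
            if agent == target:
                successes += 1
                break
    return successes / episodes


for n in range(2, 7):
    rate = random_walk_success_rate(n, step_limit=n**3)
    print(f"{n}x{n}: random-walk success rate {rate:.2f}")
```

Because the reward in `SimpleColorGame` is 1 only on the terminating step, the mean episode reward is the success rate, so a trained model whose mean reward is no higher than these numbers is doing no better than wandering.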

### Time-Discounting The Reward

Adding a time penalty to the reward the agent receives should incentivize the model to reach the target square as quickly as possible. A simple way to do this is to subtract a small fixed amount from the reward for every step taken.
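To get a feel for the numbers, here is a small worked calculation of the episode return under a per-step penalty. It assumes the penalty is `1/size**2` per step, as in the `_get_reward` override of `TimedColorGame`, and that a truncated episode lasts `step_limit + 1` steps; `timed_return` is a hypothetical helper, not part of the environment.

```python
def timed_return(steps, size, reached_target):
    """Undiscounted episode return when every step costs 1 / size**2."""
    penalty_per_step = 1.0 / size**2
    bonus = 1.0 if reached_target else 0.0
    return bonus - steps * penalty_per_step


# Reaching the target on a 2x2 grid after 1, 2, and 3 steps.
for steps in (1, 2, 3):
    print(steps, timed_return(steps, size=2, reached_target=True))
# prints: 1 0.75 / 2 0.5 / 3 0.25 (one pair per line)

# Worst case: the episode is truncated after step_limit + 1 steps without success.
for size in range(2, 7):
    step_limit = size**3
    print(f"{size}x{size}: {timed_return(step_limit + 1, size, False):.2f}")
```

With `step_limit = size**3` the worst-case return is roughly `-size`, so returns from different grid sizes are not directly comparable. With the default `step_limit=200` on a 2x2 grid, a fully truncated episode returns -50.25.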

In [59]:
#I'll make this class extend the SimpleColorGame class. We just want to modify the reward function to subtract a fixed penalty for each step taken.
class TimedColorGame(SimpleColorGame):
    #Override the _get_reward() method to subtract 1/self._num_colors (one over the number of squares) from the normal reward.
    # This creates an incentive to reach the target color square quickly.
    def _get_reward(self):
        return super()._get_reward() - (1.0/self._num_colors)

gym.register(id="TimedColorGame-v0",
             entry_point = TimedColorGame)
env = gym.make("TimedColorGame-v0")

In [60]:
model_A2C = A2C("MultiInputPolicy", env, verbose = 0)

mean_reward, std_reward = evaluate_policy(model_A2C, env, n_eval_episodes = 100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
      
model_A2C.learn(total_timesteps=10000)

mean_reward, std_reward = evaluate_policy(model_A2C, env, n_eval_episodes = 100)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

mean reward: -38.01 +/- 21.78
mean reward: 0.37 +/- 0.28


## Is the A2C algorithm finding the correct square efficiently?

It's hard to tell if the A2C algorithm is finding the optimal path to the square of the target color, or if it's just moving around the tiny 2x2 grid. Let's instantiate some larger versions of this game to see how quickly the model is approaching the target square.
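As a yardstick for "efficiently", the shortest path on an open grid is the Manhattan distance between the agent and the target square. The sketch below enumerates every placement to get the average optimal path length per grid size. It is a plain-Python estimate under my own simplification that the agent and target start on different squares; `mean_shortest_path` is a hypothetical helper.

```python
from itertools import product


def mean_shortest_path(size):
    """Mean Manhattan distance over all agent/target pairs on distinct squares."""
    cells = list(product(range(size), repeat=2))
    distances = [
        abs(ax - tx) + abs(ay - ty)
        for (ax, ay), (tx, ty) in product(cells, repeat=2)
        if (ax, ay) != (tx, ty)
    ]
    return sum(distances) / len(distances)


for n in range(2, 7):
    print(f"{n}x{n}: mean shortest path {mean_shortest_path(n):.2f} steps")
```

An agent that really reads the color grid should finish in about this many steps on average, far below a step limit of `n**3`.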

In [66]:
#We'll run the A2C algorithm for 5 instances of the game, from a 2x2 to a 6x6. Performance before and after learning
# will be compared for each game size.
print("Mean Rewards for Timed Color Game with NxN grid-size, using A2C, 10000 training timesteps")
for i in range(2,7):
    #Instantiate the TimedColorGame with size=i and step_limit = i^3, so any improvement in the trained model over
    # an untrained model should be apparent. Previously even the trained models appeared to be running into the step
    # limit.
    env = gym.make("TimedColorGame-v0",size=i,step_limit=(i**3))
    model_A2C = A2C("MultiInputPolicy", env, verbose = 0)
    
    mean_reward_pre, std_reward_pre = evaluate_policy(model_A2C, env, n_eval_episodes = 100) #Pre-training evaluation     
    model_A2C.learn(total_timesteps=10000)#training the model
    mean_reward, std_reward = evaluate_policy(model_A2C, env, n_eval_episodes = 100) #Post-training evaluation
    print(f"{i}x{i}: Pre-training  [{mean_reward_pre:.2f} +/- {std_reward_pre:.2f}], Post-training [{mean_reward:.2f} +/- {std_reward:.2f}]")

Mean Rewards for Timed Color Game with NxN grid-size, using A2C, 10000 training timesteps
2x2: Pre-training  [-1.62 +/- 1.22], Post-training [0.37 +/- 0.27]
3x3: Pre-training  [-2.31 +/- 1.60], Post-training [-2.36 +/- 1.56]
4x4: Pre-training  [-3.47 +/- 1.61], Post-training [-3.37 +/- 1.72]
5x5: Pre-training  [-4.50 +/- 1.71], Post-training [-4.20 +/- 2.07]
6x6: Pre-training  [-5.61 +/- 1.66], Post-training [-5.82 +/- 1.19]


### After testing the model repeatedly, it appears that it is not finding the correct square efficiently.

The model only shows significant improvement on the 2x2 grid, which I assume is because it's just randomly walking around the edge until it stumbles upon the square of the correct color. I'll have to try modifying the algorithm or the game so that it finds the right square more efficiently.

### Let's try adjusting the model.

In [None]:
#I'm going to try out different models and training paradigms on a 3x3 grid to see which, if any, of them can show
# improvement on the simple color game. I'm now realizing that these models come with a pre-programmed discount
# factor, so adding a time penalty to the reward structure is probably unnecessary at best and counterproductive at worst.
env = gym.make("SimpleColorGame-v0",size=3)
#Use separate variable names so the imported PPO, DQN, and A2C classes are not overwritten by their instances.
model_PPO = PPO("MultiInputPolicy",env,verbose=0)
model_DQN = DQN("MultiInputPolicy",env,verbose=0)
model_A2C = A2C("MultiInputPolicy",env,verbose=0)

for model in [model_PPO,model_DQN,model_A2C]:
    mean_reward_pre, std_reward_pre = evaluate_policy(model, env, n_eval_episodes = 100) #Pre-training evaluation     
    model.learn(total_timesteps=10000)#training the model
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes = 100) #Post-training evaluation
    print(f"3x3: Pre-training  [{mean_reward_pre:.2f} +/- {std_reward_pre:.2f}], Post-training [{mean_reward:.2f} +/- {std_reward:.2f}]")

3x3: Pre-training  [0.12 +/- 0.32], Post-training [0.14 +/- 0.35]
3x3: Pre-training  [0.12 +/- 0.32], Post-training [0.19 +/- 0.39]
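
The comment in the cell above mentions the built-in discount factor. Here is a quick sketch of how strongly it rewards short paths, assuming `gamma = 0.99` (which, as far as I know, is the Stable-Baselines3 default for A2C, PPO, and DQN) and a single reward of 1 on the final step. `discounted_return` is a hypothetical helper for illustration.

```python
GAMMA = 0.99  # assumed default discount factor in Stable-Baselines3


def discounted_return(steps_to_target, gamma=GAMMA):
    """Discounted return at episode start for a single reward of 1 on the final step."""
    return gamma ** (steps_to_target - 1)


for steps in (1, 2, 4, 10, 28):
    print(f"{steps:>2} steps: {discounted_return(steps):.3f}")
```

With this value of `gamma`, finishing in 2 steps instead of 4 changes the discounted return by only about 0.02, so on small grids the discount factor alone is a fairly weak push toward the shortest path.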
