### CDS NYU
### DS-GA 3001 | Reinforcement Learning
### Lab 07
### March 13, 2025


# Implement Custom Gym RL Environments

<br>

---

## Professor
Jeremy Curuksu, PhD -- jeremy.cur@nyu.edu

## Section Leaders
Akshitha Kumbam – ak11071@nyu.edu

Kushagra Khatwani – kk5395@nyu.edu

## Goal of Today's Lab 

In this Lab, we will implement custom RL environments by building on existing Gym environments. We will see how to change the state-space and/or action-space definition, how to import custom data to redefine the environment, and how to register a new environment in Gym so as to continue leveraging Gym capabilities for a custom RL problem. 

Today we will customize pre-existing Gym environments, but we could also get rid of Gym entirely and develop an environment from scratch. There are other RL frameworks available too, which we will discuss later this semester (for example: Google DeepMind ACME, Amazon SageMaker RL, Meta ReAgent, etc.). Note that these frameworks tend to specialize in some functionalities, such as specific types of agents or distributed computing, and are often compatible with Gym. Gym remains the most widely used standard benchmark of RL environments.

Gym has lots to offer to avoid reinventing the wheel. Using Gym can help make sure you (the developer) focus your time and effort on what is truly new, on the innovative RL problem you are trying to solve. But at this time in the course you should also start getting comfortable thinking about how you would create a custom environment for *any kind* of RL problem, whichever interests you the most. For the project in this course, you are free to use Gym or not, and/or any other RL framework.


We will cover three case studies today, the first of which (GridWorld) is the official Gym tutorial available at the Gymnasium documentation link shown below. We will then develop a custom stock trading agent and a custom state-action space for a Super Mario Bros agent.

## Resources

* https://gymnasium.farama.org/


# 1. Implement a custom Gridworld environment in Gym

This case study comes directly from the official Gym documentation tutorial, which you can find here: https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/

Since we will go all the way from writing a new Gym environment as a Python class to registering it in Gymnasium as a Python package, we will write code in scripts rather than in the Jupyter Notebook. We use a standard development layout with the following hierarchy of folders and files. Please copy this entire structure of folders/files to your computer:

`gym-examples/`<br>
  &emsp;&emsp;`README.md`<br>
  &emsp;&emsp;`setup.py`<br>
  &emsp;&emsp;`gym_examples/`<br>
    &emsp;&emsp;&emsp;&emsp;`__init__.py`<br>
    &emsp;&emsp;&emsp;&emsp;`envs/`<br>
      &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;`__init__.py`<br>
      &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;`grid_world.py`<br>
      
In this notebook we use `pygmentize` to read each script, but feel free to read the scripts outside Jupyter Notebook using your favorite Python code editor.

The custom GridWorld environment will be a two-dimensional square grid of custom size where the RL agent can move vertically or horizontally between grid cells at each timestep. The goal of the agent is to navigate to a target (red square) on the grid that is placed randomly at the beginning of each episode.

* The observation (state) is the location of the target and the agent

* The agent can take one of 4 possible actions at each step corresponding to the movements “right”, “up”, “left”, or “down”

* Done is set to `True` (episode terminated) as soon as the agent has navigated to the grid cell where the target is located

* Rewards are binary and sparse: the immediate reward is always zero, except when the agent reaches the target, where the reward is +1

* An episode in this environment (with size=5) might look like this:

<br>

<img src="./CustomGridSize5.png" width="500">

<br>

where the blue dot is the agent and the red square is the target.
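
The rules above can be sketched as a single transition function. This is a dependency-free illustration, not the tutorial's code: the names `ACTION_TO_DIRECTION` and `transition` are hypothetical, and both the action numbering (0 = right, 1 = up, 2 = left, 3 = down) and the clipping at the grid borders are assumptions modeled on the Gymnasium tutorial.

```python
# Hypothetical mapping from an action index to a (dx, dy) move on the grid.
ACTION_TO_DIRECTION = {
    0: (1, 0),   # right
    1: (0, 1),   # up
    2: (-1, 0),  # left
    3: (0, -1),  # down
}


def transition(agent, target, action, size=5):
    """Apply one action and return (new_agent, reward, terminated)."""
    dx, dy = ACTION_TO_DIRECTION[action]
    # Clip the new position so the agent stays inside the size x size grid.
    x = min(max(agent[0] + dx, 0), size - 1)
    y = min(max(agent[1] + dy, 0), size - 1)
    new_agent = (x, y)
    # The episode terminates only when the agent lands on the target cell.
    terminated = new_agent == target
    # Sparse binary reward: +1 on reaching the target, 0 otherwise.
    reward = 1 if terminated else 0
    return new_agent, reward, terminated


agent, target = (0, 0), (1, 0)
print(transition(agent, target, action=2))  # left at the border: the agent stays put
print(transition(agent, target, action=0))  # right: the agent reaches the target
```

Keeping the transition rule in one pure function like this makes it easy to unit-test the reward and termination logic before wiring it into a full environment class.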



## Write a custom environment class which inherits from the parent class `gym.Env`

The custom environment class will have seven components:

1. Initialization of attributes: `__init__()`
2. Construction of observations from the environment state: `_get_obs()`
3. Auxiliary information: `_get_info()`
4. Reset method: `reset()`
5. Step method: `step()`
6. Render method: `render()`/`_render_frame()`
7. Close method: `close()`
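
To see how these seven components fit together, here is a minimal outline. It is a sketch under stated assumptions, not the tutorial's implementation: the class name `GridWorldSketch` is hypothetical, it does not inherit from `gym.Env` or declare `observation_space`/`action_space` (so that it runs without Gym installed), and it renders to text instead of a graphical window.

```python
import random


class GridWorldSketch:
    """Dependency-free outline; the real class subclasses gym.Env."""

    # Assumed action numbering: 0 = right, 1 = up, 2 = left, 3 = down
    _moves = [(1, 0), (0, 1), (-1, 0), (0, -1)]

    def __init__(self, size=5):                  # 1. initialize attributes
        self.size = size
        self._agent = None
        self._target = None

    def _get_obs(self):                          # 2. observation from the state
        return {"agent": self._agent, "target": self._target}

    def _get_info(self):                         # 3. auxiliary information
        dx = abs(self._agent[0] - self._target[0])
        dy = abs(self._agent[1] - self._target[1])
        return {"distance": dx + dy}             # Manhattan distance to the target

    def reset(self, seed=None):                  # 4. start a new episode
        rng = random.Random(seed)
        cells = [(x, y) for x in range(self.size) for y in range(self.size)]
        self._agent, self._target = rng.sample(cells, 2)  # two distinct cells
        return self._get_obs(), self._get_info()

    def step(self, action):                      # 5. apply one action
        dx, dy = self._moves[action]
        x = min(max(self._agent[0] + dx, 0), self.size - 1)
        y = min(max(self._agent[1] + dy, 0), self.size - 1)
        self._agent = (x, y)
        terminated = self._agent == self._target
        reward = 1 if terminated else 0
        # This sketch never truncates by itself, hence the constant False.
        return self._get_obs(), reward, terminated, False, self._get_info()

    def render(self):                            # 6. text rendering (A = agent, T = target)
        rows = []
        for y in reversed(range(self.size)):
            rows.append("".join(
                "A" if (x, y) == self._agent else "T" if (x, y) == self._target else "."
                for x in range(self.size)))
        return "\n".join(rows)

    def close(self):                             # 7. nothing to release in this sketch
        pass


env = GridWorldSketch(size=5)
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(0)
print(env.render())
env.close()
```

The private helpers `_get_obs()` and `_get_info()` are called from both `reset()` and `step()`, which is why it is convenient to keep them as separate methods.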


The script `grid_world.py` displayed below is copied from the code provided in the Gymnasium tutorial. You can read the explanation of the code in the comments inserted in the script itself, or in a better-looking HTML format directly in the documentation: https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/


In [None]:

!pygmentize gym-examples/gym_examples/envs/grid_world.py


In other environments, `close()` might also close files that were opened or release other resources. You should not interact with the environment after calling `close()`.

## Register the environment in Gym

In order for the custom environment to be detected by Gym, it must be registered as follows in the file ``gym-examples/gym_examples/__init__.py``:

In [None]:

!pygmentize gym-examples/gym_examples/__init__.py


The environment ID consists of three components, two of which are
optional: an optional namespace (here: ``gym_examples``), a mandatory
name (here: ``GridWorld``) and an optional but recommended version
(here: v0). It might have also been registered as ``GridWorld-v0`` (the
recommended approach), ``GridWorld`` or ``gym_examples/GridWorld``, and
the appropriate ID should then be used during environment creation.
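
To make the three components concrete, the sketch below splits an ID string into its namespace, name, and version. The helper `split_env_id` and its regular expression are hypothetical and written only for illustration; they are not the parsing code Gymnasium uses internally.

```python
import re

# Assumed ID layout: optional "namespace/", mandatory name, optional "-v<integer>".
ENV_ID_PATTERN = re.compile(
    r"^(?:(?P<namespace>[\w-]+)/)?(?P<name>[\w-]+?)(?:-v(?P<version>\d+))?$"
)


def split_env_id(env_id):
    """Return (namespace, name, version); missing optional parts are None."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        raise ValueError(f"Malformed environment ID: {env_id!r}")
    version = match.group("version")
    return (
        match.group("namespace"),
        match.group("name"),
        int(version) if version is not None else None,
    )


for env_id in ["gym_examples/GridWorld-v0", "GridWorld-v0",
               "GridWorld", "gym_examples/GridWorld"]:
    print(env_id, "->", split_env_id(env_id))
```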

The keyword argument ``max_episode_steps=300`` will ensure that
GridWorld environments that are instantiated via ``gymnasium.make`` will
be wrapped in a ``TimeLimit`` wrapper (see the wrapper documentation at https://gymnasium.farama.org/api/wrappers/ for more information).
An episode will then end if the agent has reached the target *or* 300 steps
have been executed in the current episode. With the five-value ``step()`` API used in this lab, the two cases are reported separately: ``terminated`` is ``True`` when the target is reached and ``truncated`` is ``True`` when the step limit is hit. Older Gym versions exposed truncation through ``info["TimeLimit.truncated"]`` instead.
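
One way to picture the time limit is as a thin wrapper that counts steps and flags the episode as truncated once the limit is reached. The sketch below is an illustration of that idea only: `TimeLimitSketch` and `NeverEndingEnv` are hypothetical classes, not Gymnasium's own `TimeLimit` implementation.

```python
class TimeLimitSketch:
    """Hypothetical stand-in for a time-limit wrapper (not Gymnasium's class)."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self.max_episode_steps = max_episode_steps
        self._elapsed = 0

    def reset(self, **kwargs):
        self._elapsed = 0  # a new episode restarts the step counter
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._elapsed += 1
        # Truncation is a time-out; it is independent of reaching the target.
        if self._elapsed >= self.max_episode_steps:
            truncated = True
        return obs, reward, terminated, truncated, info


class NeverEndingEnv:
    """Toy environment that never reaches a terminal state."""

    def reset(self):
        return 0, {}

    def step(self, action):
        return 0, 0.0, False, False, {}


env = TimeLimitSketch(NeverEndingEnv(), max_episode_steps=300)
env.reset()
steps = 0
while True:
    _, _, terminated, truncated, _ = env.step(action=0)
    steps += 1
    if terminated or truncated:
        break
print(steps, terminated, truncated)  # 300 False True
```

Keeping the two flags separate matters for learning: a truncated episode did not reach a terminal state, so value-based agents typically should still bootstrap from the last observation.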


## Create a Python package

The last step is to structure our code as a Python package. This
involves configuring ``gym-examples/setup.py``. A minimal example of how
to do so is as follows:

In [None]:

!pygmentize gym-examples/setup.py


## Install the custom package

Install the package locally (``pip install -e gym-examples`` needs to be run from the directory that contains the package, outside the package tree of files).

In [None]:
!pip install -e gym-examples

## Create an instance of the custom environment

Now let's simulate and visualize the environment we created! Since we registered our custom environment in Gym, we can create an instance of the environment with all our usual Gym commands:

In [None]:
import gym
import gym_examples

# Load custom environment we created 
env = gym.make('gym_examples/GridWorld-v0', render_mode = "human", size=5) 

# Set to initial state
env.reset()  

# Loop over 200 steps
for _ in range(200):
    action = env.action_space.sample()                                  # Choose a random action
    new_state, reward, terminated, truncated, info = env.step(action)  # Carry out the action

    # Start a new episode when the target is reached or the time limit is hit
    if terminated or truncated:
        env.reset()

env.close()

# 2. Exercises: Implement your own custom financial trading environment

## Example solution: 
An example of a custom trading environment was implemented in the previous lab, where we added engineered feature time series to the agent state. These features are financial indicators called `momentum_rsi`, `volume_obv`, and `trend_macd_diff`, computed by the external `ta` library. Any indicator computed by this library can be added to the state space, defining a new custom trading environment (solution below).

In fact, just downloading a specific set of stocks for specific training/testing timelines becomes a new, custom financial trading environment for the agent to explore and learn from.


### Set up the S&P 500 stock index trading environment

In [None]:
# Quick fix for the M1 architecture (M2/M3 might also need it): importing stable-baselines3 fails on Apple Silicon for some versions.
# import os
# os.environ['KMP_DUPLICATE_LIB_OK']='True'

On some Apple Silicon machines, stable-baselines3 causes a segmentation fault when it loads the model on the CPU. A workaround (admittedly a hack) is to run on the GPU through MPS (Metal Performance Shaders) until we find a proper fix. Uncomment the code below if your kernel crashes while training the A2C model.

In [None]:
#RUN THIS ONLY IF YOU'RE ON APPLE SILICON (And/or want to switch to Apple's GPU)
# import torch
# if torch.backends.mps.is_available():
#     mps_device = torch.device("mps")
#     x = torch.ones(1, device=mps_device)
#     print (x)
# else:
#     print ("MPS device not found.")

In [None]:
import gym
from stable_baselines3.common.vec_env import DummyVecEnv 
from stable_baselines3 import A2C 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gym_anytrading  # https://github.com/AminHP/gym-anytrading
from gym_anytrading.envs import StocksEnv
import yfinance as yf
from pandas_datareader import data as pdr
yf.pdr_override() # bug fix, for details see https://stackoverflow.com/questions/74862453/why-am-i-getting-a-typeerror-string-indices-must-be-integer-message-when-tryi
from ta import add_all_ta_features # Method from TA (Technical Analysis) library to engineer financial indicators

In [None]:
data = pdr.get_data_yahoo('SPY', start='2021-01-01', end='2023-01-01')
# Engineer financial indicators using the method imported above from the "TA" library
df2 = add_all_ta_features(data, open='Open', high='High', low='Low', close='Close', volume='Volume', fillna=True)

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df2

In [None]:
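def my_processed_data(env):
    # Include window_size extra rows before the frame so the first observation has a full window
    start = env.frame_bound[0] - env.window_size
    end = env.frame_bound[1]
    # Price series used by the environment
    prices = env.df.loc[:, 'Low'].to_numpy()[start:end]
    # Features observed by the agent: raw price/volume plus three engineered TA indicators
    signal_features = env.df.loc[:, ['Close', 'Volume', 'momentum_rsi', 'volume_obv', 'trend_macd_diff']].to_numpy()[start:end]
    return prices, signal_features

# Override the data-processing method of StocksEnv with the function above
class MyCustomEnv(StocksEnv):
    _process_data = my_processed_data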
    

In [None]:
env2 = MyCustomEnv(df=df2, window_size=5, frame_bound=(5, 400))
print(env2.observation_space.shape)
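
To see where a shape like `(window_size, number_of_features)` comes from, here is a dependency-free sketch of how a windowed observation can be cut out of the `signal_features` array. It assumes that each observation is simply the last `window_size` rows of the feature matrix, which is our understanding of `gym-anytrading` but should be checked against the installed version. The function name `window_observation` and the toy numbers are hypothetical.

```python
def window_observation(signal_features, current_tick, window_size):
    """Return the rows for ticks (current_tick - window_size + 1) .. current_tick."""
    start = current_tick - window_size + 1
    if start < 0:
        raise ValueError("not enough history for a full window")
    return signal_features[start:current_tick + 1]


# Toy feature matrix: 8 time steps x 5 features (values are arbitrary).
n_features = 5
signal_features = [[float(t * 10 + f) for f in range(n_features)] for t in range(8)]

obs = window_observation(signal_features, current_tick=5, window_size=5)
shape = (len(obs), len(obs[0]))
print(shape)  # (5, 5)
```

With five feature columns and `window_size=5`, each observation is a 5 x 5 block: one row per recent time step, one column per feature.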

### Train the `A2C` algorithm from `stable-baselines3` with the engineered features

In [None]:
model = A2C('MlpPolicy', env2, verbose=1)
model.learn(total_timesteps=100000)

### Exploit the trained agent in test simulations

In [None]:
env = MyCustomEnv(df=df2, window_size=5, frame_bound=(400, 502))
obs, info = env.reset()

while True:
    # Query the trained policy for an action given the current observation
    action, _states = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        print(info)
        break

In [None]:
plt.figure(figsize=(15, 6))
plt.cla()
env.render_all()
plt.show()

## Second example solution: 
The agent state can also be defined from forecasts produced by an RNN deep learning model trained on trends in recent stock prices. A solution for this can be found in the following AWS blog post (a bit outdated, it was published in 2018): https://aws.amazon.com/blogs/machine-learning/forecasting-time-series-with-dynamic-deep-learning-on-aws/

## Thank you everyone!