### CDS NYU
### DS-GA 3001 | Reinforcement Learning
### Lab 06
### March 06, 2025


# Implement RL algorithms with Keras-RL or Stable-Baselines

<br>

---

## Section Leader


Akshitha Kumbam – ak11071@nyu.edu

Kushagra Khatwani – kk5395@nyu.edu


## Goal of Today's Lab 

In this lab, we implement RL algorithms by building on existing RL libraries, so we don't have to implement RL agents from scratch as we did in the past few weeks. We give up some control over the details of the implementation, but development is much faster (**we will cover 4 case studies today:** `CartPole`, `SpaceInvaders`, `CarRacing`, and `StockTrading`), and components available in public libraries tend to be high quality and efficient. 

Let us start with the DQN algorithm. We can use external open-source Python packages that implement each of the key DQN components (e.g., experience replay, action selection, etc). These components are developed, maintained, and optimized for robustness to different scenarios and for overall performance.

Using these packages instead of implementing each DQN component from scratch is generally faster and leads to a more reliable and efficient program, yet still gives you a lot of control over the details of the implementation and the hyperparameters.

We will focus on DQN in the first part of the lab, but these packages also offer implementations of most commonly used RL algorithms such as A3C, PPO, etc., so we will start looking at these too.

## Resources

* https://gymnasium.farama.org/


# 1. Implement DQN with Keras-RL Methods

Keras-RL is an older reinforcement learning library that does not support the latest versions of TensorFlow and Gym. To ensure compatibility, we need to downgrade both TensorFlow and Gym to specific versions that work with Keras-RL.

### Install Required Dependencies

To install the required versions, run the following commands:

```bash
pip install tensorflow==2.15
pip install gym==0.15.4
pip install pyglet==1.5.11
```

**Note:** We use `gym` instead of `gymnasium` because Keras-RL was originally built for `gym` and does not support `gymnasium`. Using these specific versions ensures that we can successfully implement and train a Deep Q-Network (DQN) using Keras-RL.


# 1.1 Solve *Cart Pole*  with DQN from Keras-rl

Most of this case study is the same as in the previous lab, but we build some of the key components of the DQN agent from components available in the Keras-RL library.


## Imports

In [None]:
import time 
import gym
from pyglet.window import key 
from tensorflow.keras.models import Sequential  
from tensorflow.keras.layers import Dense  
from tensorflow.keras.layers import Activation 
from tensorflow.keras.layers import Flatten  
from tensorflow.keras.optimizers.legacy import Adam  # Adam optimizer


# Import DQN methods from the keras-rl2 library (keras-rl is tagged "rl" in Python)
# Quick fix if python cannot import name '__version__' from 'tensorflow.keras'
import tensorflow as tf
from keras import __version__
tf.keras.__version__ = __version__

from rl.agents.dqn import DQNAgent 

## Set up the `CartPole` Gym environment

In [None]:
# https://stackoverflow.com/questions/56904270/difference-between-openai-gym-environments-cartpole-v0-and-cartpole-v1
env_name = ENV_NAME = 'CartPole-v1'
env = gym.make(env_name)  

# Same as last week:
num_actions = env.action_space.n
num_observations = env.observation_space.shape[0]
print(f"There are {num_actions} possible actions and {num_observations} observations")


## Execute random actions just to get familiar with the environment

In [None]:
# Load the CartPole Gym environment to visualize it with graphical rendering
env_test = gym.make("CartPole-v1")
# Set to initial state
env_test.reset()

# Loop over 200 steps
for _ in range(200):
    env_test.render()                                      # Render on the screen
    action = env_test.action_space.sample()                # Choose a random action
    new_state, reward, done, info = env_test.step(action)  # Carry out the action
    time.sleep(0.03)
    if done:
        env_test.reset()

env_test.close()


## Implement an Artificial Neural Network
To build our network, we first need to know how many actions and observations our environment has.
We can get this information either from the source code (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py) or from `env.action_space.n` and `env.observation_space.shape[0]`, as we did when setting up the environment above.

As in the previous lab, we build a simple ANN with 2 hidden layers of 16 and 32 neurons, each followed by a ReLU activation. The output layer has 2 nodes, one for each action.

In [None]:
model = Sequential()
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))

model.add(Dense(16))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(num_actions))
model.add(Activation('linear'))

print(model.summary())
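As a sanity check on the summary printed above, the number of trainable parameters of a fully connected network can be worked out by hand. The sketch below assumes the 4-dimensional `CartPole` observation and the layer sizes used in this lab; `dense_param_count` is a hypothetical helper written for illustration, not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a stack of fully connected layers.

    layer_sizes lists the width of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


# CartPole: 4 observations -> 16 -> 32 -> 2 actions (Flatten adds no parameters)
cartpole_params = dense_param_count([4, 16, 32, 2])
print(cartpole_params)
```

If the total reported by `model.summary()` differs from this number, the input shape or a layer width is probably not what you intended.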

## Implement the DQN framework with Keras-RL

The DQN agent `DQNAgent` from Keras-RL takes the following parameters:

1. `model` = The ANN


2. `nb_actions` = The number of actions (2 in this case)


3. `memory` = The experience replay memory. By far the most common choice is `SequentialMemory()` 


4. `nb_steps_warmup` = Number of steps used to fill the memory prior to starting to update the ANN parameters


5. `target_model_update` = How often (in number of steps) to update the target model


6. `policy` = You can choose between `LinearAnnealedPolicy()`, `SoftmaxPolicy()`, `EpsGreedyQPolicy()`, `GreedyQPolicy()`, `BoltzmannQPolicy()`, `MaxBoltzmannQPolicy()` and `BoltzmannGumbelQPolicy()`. We will use `LinearAnnealedPolicy`, but feel free to try the others and see which works best here


There are some more parameters you can pass to the DQN Agent, feel free to explore them on your own.

Let's initialize a circular buffer with a limit of 20000 and a window length of 1 (the window length is the number of consecutive observations stacked to define a state).


In [7]:
from rl.memory import SequentialMemory  # Sequential Memory for storing observations (optimized circular buffer)

memory = SequentialMemory(limit=20000, window_length=1)
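To make the idea of a circular buffer concrete, here is a minimal sketch in plain Python. It is not how `SequentialMemory` is implemented; `TinyReplayBuffer` is a hypothetical class that only illustrates the two properties that matter here: the oldest transitions are evicted once the limit is reached, and training batches are sampled uniformly at random.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Fixed-size FIFO store of transitions (illustrative only)."""

    def __init__(self, limit):
        # A deque with maxlen silently drops its oldest item when full
        self.transitions = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between steps
        return random.sample(list(self.transitions), batch_size)

    def __len__(self):
        return len(self.transitions)


buffer = TinyReplayBuffer(limit=3)
for step in range(5):
    buffer.append(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

oldest_state = buffer.transitions[0][0]  # steps 0 and 1 were evicted
batch = buffer.sample(batch_size=2)
print(len(buffer), oldest_state, len(batch))
```

With `limit=20000` and 20000 training steps, as in this lab, nothing is evicted during training; a smaller limit would make the agent forget its earliest experience.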


Then we define the action selection policy: <br />
We use *LinearAnnealedPolicy* to implement an epsilon-greedy strategy with a linearly decaying epsilon. <br />
*LinearAnnealedPolicy* takes an inner action selection policy, the name of the attribute to anneal, its maximum and minimum values, and the number of steps over which to anneal it. <br/>
The smallest value epsilon can reach during training is 0.1.<br />
For testing/evaluation of the trained agent, let's set epsilon to 0.05.


In [8]:
# LinearAnnealedPolicy linearly decays epsilon for the epsilon-greedy strategy
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), 
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=.05,
                              nb_steps=20000) 
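The schedule described above can be sketched in a few lines. This is an illustration of a linear decay clipped at a minimum, written from the description in this lab rather than copied from the library; `linear_annealed_value` is a hypothetical helper whose defaults mirror the values passed to the policy.

```python
def linear_annealed_value(step, value_max=1.0, value_min=0.1, nb_steps=20000):
    """Linearly interpolate from value_max to value_min, then stay at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)


# Epsilon at the start, halfway, at the end, and past the end of the schedule
for step in (0, 10000, 20000, 30000):
    print(step, round(linear_annealed_value(step), 3))
```

Because `nb_steps` equals the number of training steps used below, the agent in this lab only reaches its minimum exploration rate at the very end of training.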


Now we create the DQN Agent based on the defined model (**model**), the possible actions (**num_actions**) (left and right in this case), the circular buffer (**memory**), the warmup phase (**10**), how often the target model gets updated (**100**) and the policy (**policy**)


In [9]:
dqn = DQNAgent(model = model, nb_actions = num_actions, memory = memory, nb_steps_warmup = 10,
               target_model_update = 100, policy = policy)
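The effect of `target_model_update = 100` can be illustrated without any neural network. The sketch below uses plain lists as stand-ins for weights; `hard_update` and `soft_update` are hypothetical helpers. The soft variant is included for comparison because, to our understanding, Keras-RL treats a `target_model_update` value below 1 as a soft update rate; check the library documentation before relying on that.

```python
def hard_update(online_weights):
    """Return a copy of the online weights to use as the new target weights."""
    return list(online_weights)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]


online = [1.0, 2.0]
target = [0.0, 0.0]
target_model_update = 100  # same value as in the agent above

for step in range(1, 201):
    online = [w + 0.01 for w in online]  # stand-in for a gradient step
    if step % target_model_update == 0:  # copy the weights every 100 steps
        target = hard_update(online)

soft = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.5)
print(target == online, soft)
```

Between two copies the target network is frozen, which keeps the regression targets stable while the online network is being updated.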



We can now compile the DQN with the Adam optimizer and a learning rate of 0.001.<br />
We log the Mean Absolute Error

In [None]:
# Use learning_rate instead of lr if you get a warning
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae']) 

Let's run the training for 20000 steps. You can set `visualize=True` if you want to watch your model learn.
Keep in mind that this increases the running time.


## Train the DQN agent

In [None]:
dqn.fit(env, nb_steps=20000, visualize=False, verbose=2)

After just 1-2 minutes of training (for reference, it takes < 1 min on a MacBook Air with M2), we already achieve great results. The custom DQN implemented from scratch in the previous lab needed at least 15 minutes of training to reach a similar level of performance.

The reason for this is that keras-rl has implemented many optimization strategies (e.g., the optimized replay buffer) which lead to a significantly faster convergence than the DQN we implemented by hand.

In [None]:
# After training is done, we can save the final weights.
dqn.save_weights(f'dqn_{env_name}_weights.h5f', overwrite=True)

## Exploit learned Q values in test simulations

The Keras-RL agents also offer methods to perform tests in Gym, with some parameters e.g. to decide whether to render the simulation graphically.

In [None]:
# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

All in all, we obtained a better agent (trained more efficiently) with much less code than in the previous lab, thanks to Keras-RL!

---------

# 1.2 Solve *Space Invaders*  with Convolutional DQN from Keras-rl
Also install the Atari extras for Gym:
```bash
pip install "gym[atari,accept-rom-license]"
```

## Imports

In [18]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.optimizers.legacy import Adam

# Quick fix if python cannot import name '__version__' from 'tensorflow.keras'
import tensorflow as tf
from keras import __version__
tf.keras.__version__ = __version__

from rl.agents import DQNAgent
from rl.memory import SequentialMemory
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

import gym

env = gym.make('SpaceInvaders-v0')


## Execute random actions just to get familiar with the environment

In [None]:
env = gym.make('SpaceInvaders-v0')

episodes = 5

for episode in range(1, episodes + 1):  # Run exactly `episodes` episodes
    state = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        state, reward, done, info = env.step(env.action_space.sample())
        print(state.shape)
        score += reward
    print('Episode: {}\nScore: {}'.format(episode, score))
    
env.close()

## Implement a Convolutional Neural Network

In [20]:
def build_model(height, width, channels, actions):
    model = Sequential()
    model.add(Conv2D(32, (8,8), strides=(4,4), activation='relu', input_shape=(3, height, width, channels)))
    model.add(Conv2D(64, (4,4), strides=(2,2), activation='relu'))
    model.add(Conv2D(64, (4,4), strides=(2,2), activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(256, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [None]:
height, width, channels = env.observation_space.shape
actions = env.action_space.n
print(height, width, channels)
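The spatial size of the feature maps produced by `build_model` can be worked out by hand from the kernel sizes and strides. The sketch below assumes a 210 x 160 frame (the height and width printed above for `SpaceInvaders`) and the default `'valid'` padding of `Conv2D`; `conv_output_size` is a hypothetical helper, and the numbers are per frame, before the frames in the window are combined.

```python
def conv_output_size(size, kernel, stride):
    """Output length of a convolution with 'valid' padding (the Keras default)."""
    return (size - kernel) // stride + 1


h, w = 210, 160  # assumed Atari frame size
for kernel, stride in [(8, 4), (4, 2), (4, 2)]:  # the three Conv2D layers
    h = conv_output_size(h, kernel, stride)
    w = conv_output_size(w, kernel, stride)
    print(h, w)

features_per_frame = h * w * 64  # 64 filters in the last Conv2D layer
print(features_per_frame)
```

This kind of arithmetic is useful when changing kernel sizes or strides, because a layer whose kernel is larger than its input will fail at model construction time.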

In [None]:
# del model  # Optional: uncomment to discard a previously built model before rebuilding it

In [22]:
model = build_model(height, width, channels, actions)

## Implement the DQN framework 

In [23]:
def build_agent(model, actions):
    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.2, nb_steps=10000)
    memory = SequentialMemory(limit=2000, window_length=3)
    dqn = DQNAgent(model=model, memory=memory, policy=policy,
                  enable_dueling_network=False, dueling_type='avg',
                  nb_actions=actions, nb_steps_warmup=1000)
    return dqn

In [24]:
dqn = build_agent(model, actions)

In [None]:
dqn.compile(Adam(learning_rate=0.001))

In [None]:
dqn.fit(env, nb_steps = 1000, visualize = False, verbose = 1) # Train for 1000 steps in class, try 40000 at home :)

In [None]:
#dqn.save_weights('SpaceInvaderTrainedModel/dqn.h5f')

## Exploit learned Q values in test simulations

In [None]:
#dqn.load_weights('SpaceInvaderTrainedModel/dqn.h5f')

In [None]:
env = gym.make('SpaceInvaders-v0') 

scores = dqn.test(env, nb_episodes = 10, visualize = True)  # The visualize parameter may require specific package versions (currently handled by re-creating the environment on the line above)
print(np.mean(scores.history['episode_reward']))

# 2. Implement RL algorithms with Stable-baselines

Instead of implementing DQN from scratch (previous lab), or assembling it from individual components provided by external packages as in the section above, we can use packages that offer a "complete" implementation of the most popular RL algorithms (DQN, A3C, PPO, DDPG, etc).

Many such packages exist. The most popular ones include `Stable-baselines` (a community-maintained fork of OpenAI `Baselines`), DeepMind `Acme`, AWS `SageMaker RL`, Meta AI `ReAgent`, Ray `RLlib`, and Intel AI `Coach`. 

Today we will focus exclusively on `Stable-baselines` because it descends from OpenAI `Baselines`, which was designed in tandem with `Gym` environments. In the final labs of the semester we will introduce several other frameworks.


These implementations of RL algorithms have been optimized for ease of use, robustness to different scenarios, and overall performance. Their drawback is that you get less control over the details of the implementation and the hyperparameters, compared to packages such as `Keras-RL` which only implement key *components* of RL algorithms.

Please install the following:
```bash
pip install stable-baselines3
```

# 2.1 Play 'Car Racing' with Convolutional PPO from Stable-baselines

## Imports

In [5]:
import gymnasium as gym

#Quick fix for M1 architecture (M2/M3 might also need this).
# import os
# os.environ['KMP_DUPLICATE_LIB_OK']='True'

from stable_baselines3 import PPO 
from stable_baselines3.common.evaluation import evaluate_policy

## Set up the `CarRacing` Gym environment

In [6]:
env = gym.make('CarRacing-v3')
env.observation_space.sample().shape
env.action_space.sample()
#env.reset()
#env.render()
env.close()

## Execute random actions just to get familiar with the environment

In [None]:
env_test = gym.make('CarRacing-v3', render_mode = 'human')
episodes = 5
for episode in range(episodes):
    state, info = env_test.reset()
    done = False
    score = 0
    
    while not done:
        env_test.render()
        action = env_test.action_space.sample()
        n_state, reward, terminated, trunc, info = env_test.step(action)
        done = terminated or trunc  # Stop on termination or on time-limit truncation
        score+=reward
    print("Episode: {} Score: {}".format(episode, score))
    
env_test.close()  # Close the environment created in this cell

In [50]:
# from stable_baselines3.common.env_util import make_vec_env
# env = make_vec_env(lambda: gym.make('CarRacing-v3'), n_envs=1)
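Note that the two halves of this lab use different environment APIs: with the older `gym` used in Section 1, `step` returns four values, while with `gymnasium` it returns five, with the end of an episode split into a termination flag and a truncation flag. The helper below is a hypothetical sketch showing how the two conventions relate; it works on plain tuples and is not part of either library.

```python
def unpack_step(result):
    """Return (obs, reward, done, info) from a 4- or 5-element step result."""
    if len(result) == 5:  # gymnasium style: terminated and truncated are separate
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result  # older gym style
    return obs, reward, done, info


old_style = ("obs", 1.0, False, {})
new_style = ("obs", 1.0, False, True, {})  # episode cut off by a time limit
print(unpack_step(old_style)[2], unpack_step(new_style)[2])
```

The same difference applies to `reset`, which returns only the observation in the older API and an `(observation, info)` pair in `gymnasium`, as seen in the cells of this lab.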

## Train the `PPO` algorithm from `stable-baselines`

In [None]:
model = FILL

In [None]:
# Train the model
# TODO


## Exploit the trained agent in test simulations

In [None]:
# Evaluate the policy
# TODO

In [6]:
env.close()

# 2.2 Trade an S&P 500 stock index with A2C from Stable-baselines
Please install the following:
```bash
pip install gym_anytrading
pip install yfinance  
pip install pandas_datareader
pip install TA
```


In [28]:
import gymnasium as gym

#Quick fix for M1 architecture (M2/M3 might also need this).
# import os
# os.environ['KMP_DUPLICATE_LIB_OK']='True'

from stable_baselines3.common.vec_env import DummyVecEnv 
from stable_baselines3 import A2C 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For this case study please install the following libraries: gym_anytrading, yfinance, pandas_datareader, TA
import gym_anytrading
from gym_anytrading.envs import StocksEnv
import yfinance as yf
from pandas_datareader import data as pdr
# yf.pdr_override() # bug fix, for details see https://stackoverflow.com/questions/74862453/why-am-i-getting-a-typeerror-string-indices-must-be-integer-message-when-tryi
from ta import add_all_ta_features # Method from TA (Technical Analysis) library to engineer financial indicators


## Set up the S&P 500 stock index trading environment

Read some daily time series stock data

In [None]:
df = yf.download('SPY', start='2021-01-01', end='2023-01-01', multi_level_index=False)
df.head()

In [None]:
df = df.reset_index()  # Move the 'Date' index into a regular column
df['Date'] = pd.to_datetime(df['Date'])
df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
df.head()

Set up the Gym trading environment

In [31]:
env = gym.make('stocks-v0', df=df, frame_bound=(5, 400), window_size=5)

In [None]:
env.action_space

## Execute random actions just to get familiar with the environment

In [None]:
import time
state, info = env.reset()

# Run the environment with random actions (Buy or Sell)
while True:  # Run until the episode ends
    action = env.action_space.sample()  # Sample random action (0 or 1)
    next_state, reward, done, trunc, info = env.step(action)  # Take the action
    
    # Print the action and reward
    print(f"Action: {action}, Reward: {reward}, Done: {done}")
    
    if done or trunc:
        print("Episode finished. Info:", info)
        break

plt.figure(figsize=(15, 6))
plt.cla()
env.unwrapped.render_all()  # Render the environment
plt.show()

# Close the environment
env.close()


The green dots represent buying while the red dots represent selling stocks.

## Train an `A2C` algorithm from `stable-baselines`

In [None]:
env = gym.make('stocks-v0', df=df, frame_bound=(5, 400), window_size=5)
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

## Exploit the trained agent in test simulations

In [None]:
env = gym.make('stocks-v0', df=df, frame_bound=(400, 502), window_size=5)
obs, info = env.reset()
while True:
    # obs = obs[np.newaxis, ...]
    action, states = model.predict(obs)
    obs, rewards, done, trunc, info = env.step(action)
    
    if done or trunc:
        print(info)
        break

In [None]:
plt.figure(figsize=(15, 6))
plt.cla()
env.unwrapped.render_all()
plt.show()

Again, the green dots represent buying while the red dots represent selling stocks. If the agent is capable of making intelligent trading decisions, it will tend to buy (green dots) when the price is relatively low, and to sell (red dots) when the price is relatively high. 

As you can see, the agent needs fine tuning to perform well. So to close the lab, let us fine tune the agent by engineering features ("financial indicators") and adding these additional indicators to the definition of the RL state, so the agent has additional information every day to make trading decisions.

## Engineer features (financial indicators) for the agent to make smarter decisions

In [None]:
data = yf.download('SPY', start='2021-01-01', end='2023-01-01', multi_level_index=False)
# Engineer financial indicators using the method imported above from the "TA" library
df2 = add_all_ta_features(data, open='Open', high='High', low='Low', close='Close', volume='Volume', fillna=True)
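To give a feel for what one of these indicators measures, here is a simplified sketch of the Relative Strength Index (the quantity behind the `momentum_rsi` column). It uses plain averages over the last few price changes, so its values will generally not match those of the `TA` library, which smooths the gains and losses differently; `simple_rsi` is a hypothetical helper for illustration only.

```python
def simple_rsi(prices, window):
    """Relative Strength Index over the last `window` price changes.

    Uses plain averages; library implementations usually smooth differently.
    """
    changes = [b - a for a, b in zip(prices[:-1], prices[1:])][-window:]
    avg_gain = sum(c for c in changes if c > 0) / window
    avg_loss = sum(-c for c in changes if c < 0) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


print(simple_rsi([1, 2, 1, 2, 1], window=4))  # equal gains and losses
print(simple_rsi([1, 2, 3, 4, 5], window=4))  # only gains
```

Values near 100 indicate that recent moves were mostly gains and values near 0 that they were mostly losses, which is the kind of summary the agent cannot easily infer from a 5-day window of raw prices.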

In [41]:
pd.set_option('display.max_columns', None)

In [None]:
df2.head()

In [43]:
def my_processed_data(env):
    start = env.frame_bound[0] - env.window_size
    end = env.frame_bound[1]
    prices = env.df.loc[:, 'Low'].to_numpy()[start:end]
    signal_features = env.df.loc[:, ['Close', 'Volume', 'momentum_rsi', 'volume_obv', 'trend_macd_diff']].to_numpy()[start:end]
    return prices, signal_features

class MyCustomEnv(StocksEnv):
    _process_data = my_processed_data
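The slicing in `my_processed_data` also explains how `frame_bound` and `window_size` interact. The sketch below reproduces only the index arithmetic, using the training and test bounds from this lab; `rows_seen` is a hypothetical helper and no market data is involved.

```python
def rows_seen(frame_bound, window_size):
    """Row indices an environment reads, following my_processed_data above."""
    start = frame_bound[0] - window_size
    end = frame_bound[1]
    return range(start, end)


train_rows = rows_seen(frame_bound=(5, 400), window_size=5)
test_rows = rows_seen(frame_bound=(400, 502), window_size=5)
shared = sorted(set(train_rows) & set(test_rows))

print(train_rows[0], train_rows[-1])  # first and last training row
print(test_rows[0], test_rows[-1])    # first and last test row
print(shared)                         # rows read by both environments
```

Under this arithmetic, the first test observation is built from the last `window_size` rows of the training range, and the lower bound of `frame_bound` must be at least `window_size` so that `start` is not negative.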
    

In [44]:
env2 = MyCustomEnv(df=df2, window_size=5, frame_bound=(5, 400))

## Re-train the `A2C` algorithm from `stable-baselines` with the new engineered features

In [None]:
model = A2C('MlpPolicy', env2, verbose=1)
model.learn(total_timesteps=10000)

## Exploit the trained agent in test simulations

In [None]:
env = MyCustomEnv(df=df2, window_size=5, frame_bound=(400, 502))
obs, info = env.reset()

while True:
    # obs = obs[np.newaxis, ...]
    action, states = model.predict(obs)
    obs, rewards, done, trunc, info = env.step(action)
    
    if done or trunc:
        print(info)
        break

In [None]:
plt.figure(figsize=(15, 6))
plt.cla()
env.render_all()
plt.show()

Hopefully you now see more green dots (buy stocks) when the price is relatively low, and more red dots (sell stocks) when the price is relatively high!

**Exercise 1**: What other fine tuning can you try to improve the agent's trading performance?

**Exercise 2**: Implement the `CartPole` and `SpaceInvaders` agents using the DQN algorithms from `stable-baselines`.  

**Exercise 3**: Use the `PPO` algorithm from `stable-baselines` to learn to play the `FlappyBird` (lab 5) and `SpaceInvaders` games.

## Thank you everyone!