# **Introduction**

This notebook implements Soft Actor-Critic (SAC) on a custom 2D navigation environment, ``Nav2D-v0``. The goal is to quantify the performance of SAC on a simple 2D navigation task, so that the result can later be used for incremental learning in subsequent environments.

# **Imports**

This section imports the necessary packages for this implementation.

In [6]:
# import gymnasium-related packages:
import gymnasium as gym
from gymnasium.utils.env_checker import check_env
from gymnasium.wrappers import RescaleAction

# import custom environments and wrappers:
import nav2d

# import Stable-Baselines3 (SB3) components:
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor

# other necessary imports:
from tqdm import tqdm
import pyautogui
import numpy as np
import pandas as pd

# **Environment Definition and Hyperparameters**

This section defines and verifies the environment, defines the hyperparameters for the model, and creates a model.

In [7]:
# make the environment:
env = gym.make("Nav2D-v0")

# check the environment:
try:
    check_env(env.unwrapped)
    print("Environment passes all checks!")
except Exception as e:
    print(f"Environment has the following issues: \n{e}")

Environment has the following issues: 
The first element returned by `env.reset()` is not within the observation space.
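
The message above is raised when the observation returned by `env.reset()` is not a member of `env.observation_space`. For a `Box` space, membership usually fails for one of three reasons: a dtype that cannot be safely cast to the dtype of the space (for example `float64` observations in a `float32` space), a shape mismatch, or a value outside the bounds. The following is a minimal diagnostic sketch using only NumPy. The helper name `diagnose_box_membership` and the example bounds are hypothetical, and the actual cause in ``Nav2D-v0`` cannot be determined from this output alone.

```python
import numpy as np

def diagnose_box_membership(obs, low, high, dtype=np.float32):
    """Return a list of reasons why `obs` would fall outside a Box-like space.

    Hypothetical helper: `low`, `high`, and `dtype` stand in for the
    attributes of the environment's observation space.
    """
    obs = np.asarray(obs)
    low = np.asarray(low, dtype=dtype)
    high = np.asarray(high, dtype=dtype)
    problems = []

    # dtype: the observation must be safely castable to the space's dtype
    if not np.can_cast(obs.dtype, dtype):
        problems.append(f"dtype {obs.dtype} cannot be safely cast to {np.dtype(dtype)}")

    # shape: must match the space exactly
    if obs.shape != low.shape:
        problems.append(f"shape {obs.shape} != expected {low.shape}")
    else:
        # bounds: every element must lie within [low, high]
        if np.any(obs < low) or np.any(obs > high):
            problems.append("one or more values lie outside [low, high]")

    return problems

# example with made-up bounds: a float64 observation in a float32 space
low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
bad_obs = np.array([0.0, 0.5, 2.0])                      # float64, last value out of bounds
good_obs = np.array([0.0, 0.5, 1.0], dtype=np.float32)   # matches dtype, shape, and bounds

print(diagnose_box_membership(bad_obs, low, high))
print(diagnose_box_membership(good_obs, low, high))
```

To apply this to the real environment, the observation from `env.reset()` and the `low`, `high`, and `dtype` attributes of `env.observation_space` would be passed in place of the made-up values.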


Define hyperparameters:

In [8]:
# recreate the environment with an episode step limit and on-screen rendering:
env = gym.make("Nav2D-v0", max_episode_steps = 1000, render_mode = "human")

# hyperparameters:
policy = "MlpPolicy"
gamma = 0.99
learning_rate = 3e-4
buffer_size = int(1e6)
batch_size = 64
tau = 5e-3
ent_coef = "auto_0.1"
train_freq = 1
learning_starts = 0
target_update_interval = 1
gradient_steps = 4
target_entropy = "auto"
action_noise = None
verbose = 2
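
To make the roles of `tau` and `target_update_interval` concrete: SAC keeps a slowly moving copy of each critic (the target network), and at every target update it moves the target weights a fraction `tau` toward the online weights (Polyak averaging). With `target_update_interval = 1`, this happens on every gradient step. The sketch below applies that rule to plain Python lists instead of network parameters; `soft_update` is an illustrative name, not an SB3 function.

```python
def soft_update(online, target, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

tau = 5e-3                      # same value as in the hyperparameter cell
online = [1.0, -2.0]            # stand-in for the online critic weights
target = [0.0, 0.0]             # stand-in for the target critic weights

# one update moves the target 0.5% of the way toward the online weights
target = soft_update(online, target, tau)
print(target)

# with the online weights held fixed, the remaining gap shrinks by a
# factor of (1 - tau) per update, so 1000 updates in total leave (1 - tau)**1000
for _ in range(999):
    target = soft_update(online, target, tau)
gap = abs(online[0] - target[0])
print(gap)
```

A small `tau` therefore trades slower target tracking for more stable bootstrapped value estimates.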

Create model:

In [9]:
# model creation using SB3:
model = SAC(policy = policy, 
            env = env,
            learning_rate = learning_rate,
            buffer_size = buffer_size,
            batch_size = batch_size,
            tau = tau,
            ent_coef = ent_coef,
            train_freq = train_freq,
            learning_starts = learning_starts,
            target_update_interval = target_update_interval,
            gradient_steps = gradient_steps,
            target_entropy = target_entropy,
            action_noise = action_noise, 
            verbose = verbose)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
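
Two of the string-valued settings used above can be read as follows. `ent_coef = "auto_0.1"` asks for a learned entropy coefficient initialized at 0.1, and `target_entropy = "auto"` selects the common SAC heuristic of minus the number of action dimensions. The sketch below reproduces that convention in plain Python as an illustration only; it is not SB3's code, the function names are hypothetical, and the action shape `(2,)` is an assumption because the action space of ``Nav2D-v0`` is not shown in this notebook.

```python
import math

def parse_ent_coef(ent_coef):
    """Split an 'auto' or 'auto_<init>' setting into (learned, initial value)."""
    if isinstance(ent_coef, str) and ent_coef.startswith("auto"):
        # an optional suffix after the underscore gives the initial value
        init_value = float(ent_coef.split("_")[1]) if "_" in ent_coef else 1.0
        return True, init_value
    # a plain number means a fixed, non-learned coefficient
    return False, float(ent_coef)

def auto_target_entropy(action_shape):
    """Heuristic target entropy: minus the number of action dimensions."""
    return -float(math.prod(action_shape))

learned, init_value = parse_ent_coef("auto_0.1")
print(learned, init_value)

# (2,) is an assumed action shape, used only for illustration
print(auto_target_entropy((2,)))
```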


# **Train the model**

Using the instantiated SB3 model, train on the ``Nav2D-v0`` environment.

In [10]:
# train using SB3's built-in model.learn loop:
model.learn(total_timesteps = 25000, log_interval = 50)

current: 29.08 | desired: 45.98 | rew_heading: -0.91 | rew_dist: -0.96 | total: -2.87                                           

KeyboardInterrupt: 
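
For a sense of the training budget that the `model.learn` call requests, the hyperparameters can be combined into a rough count of gradient updates. This is back-of-the-envelope arithmetic that assumes a single environment and a run that completes all 25,000 steps, which the interrupted run above did not.

```python
total_timesteps = 25_000
train_freq = 1          # train after every environment step
gradient_steps = 4      # gradient updates per training call
learning_starts = 0     # no warm-up phase before training begins
batch_size = 64

# environment steps that trigger a training call (single environment assumed)
training_calls = (total_timesteps - learning_starts) // train_freq

# total gradient updates and total transitions sampled from the replay buffer
gradient_updates = training_calls * gradient_steps
samples_drawn = gradient_updates * batch_size

print(gradient_updates)
print(samples_drawn)
```

With four gradient steps per environment step, each collected transition is reused heavily, which raises sample efficiency at the cost of more computation per step.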

In [None]:
env.close()

# **Visualization**

This section visualizes the learned policy.

In [None]:
# # render settings:
# width = 1280
# height = 1280
# default_camera_config = {"azimuth" : 90.0, "elevation" : -90.0, "distance" : 3, "lookat" : [0.0, 0.0, 0.0]}
# camera_id = 2

# DEFAULT_CAMERA = "overhead_camera"
# ENABLE_FRAME = True
# RENDER_EVERY_FRAME = True 

# # make a single environment:
# env = gym.make("Nav2D-v0", 
#                render_mode = "human", 
#                width = width, 
#                height = height,
#                default_camera_config = default_camera_config, 
#                camera_id = camera_id, 
#                max_episode_steps = 500)

# # NOTE: action_bounds is not defined in this notebook and must be set before enabling this line:
# env = RescaleAction(env, min_action = -action_bounds, max_action = action_bounds)

# if DEFAULT_CAMERA=="overhead_camera": pyautogui.press('tab')
# if ENABLE_FRAME: pyautogui.press('e') 
# if not RENDER_EVERY_FRAME: pyautogui.press('d') 

# # for every test episode:
# for eps in range(10):
#     obs, _ = env.reset()
#     done = False

#     # step until the episode terminates or is truncated:
#     while not done:
#         action, _ = model.predict(obs, deterministic = True)
#         nobs, reward, term, trunc, _ = env.step(action)
#         done = term or trunc

#         # advance the observation (the outer loop resets the environment once done):
#         obs = nobs
        
#         # render for user:
#         env.render()

# # close when done:
# env.close()
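
The evaluation loop above relies on the Gymnasium conventions that `reset()` returns an `(observation, info)` pair and `step()` returns a five-element tuple. The sketch below runs the same loop structure against a stand-in environment so it can be executed without ``Nav2D-v0`` or a trained model. `StubEnv` and `stub_policy` are hypothetical placeholders, and the rewards they produce are meaningless.

```python
import random

class StubEnv:
    """Hypothetical stand-in that follows the Gymnasium reset/step signatures."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0], {}                   # (observation, info)

    def step(self, action):
        self.t += 1
        obs = [float(self.t), float(action)]
        reward = -1.0
        terminated = False
        truncated = self.t >= self.max_steps    # time-limit truncation
        return obs, reward, terminated, truncated, {}

def stub_policy(obs):
    """Hypothetical policy; a trained model's predict call would go here."""
    return random.uniform(-1.0, 1.0)

env = StubEnv(max_steps=5)
episode_returns = []

for episode in range(3):
    obs, _ = env.reset()                        # unpack (obs, info)
    done = False
    total = 0.0
    while not done:
        action = stub_policy(obs)
        obs, reward, term, trunc, _ = env.step(action)
        done = term or trunc                    # stop on termination or truncation
        total += reward
    episode_returns.append(total)

print(episode_returns)
```

Collecting per-episode returns in this way gives a simple quantitative summary to accompany the rendered rollouts.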