# **Introduction** 

This notebook implements the Soft Actor-Critic (SAC) algorithm developed by Haarnoja et al. [[1]](https://arxiv.org/abs/1801.01290)[[2]](https://arxiv.org/abs/1812.05905)[[3]](https://arxiv.org/abs/1812.11103). SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework.

In the maximum entropy framework, the actor tries to maximize both the expected return and the entropy of its policy at the same time, which improves exploration and robustness. The three key components of the SAC architecture are:

1. an actor-critic architecture, separating policy and value function into two distinct networks,
2. an off-policy formulation allowing the use of a replay buffer, and
3. the use of entropy maximization to encourage both stability and exploration.
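
To make the entropy term concrete, here is a minimal sketch of how an entropy-augmented return could be estimated from a single trajectory. It uses $-\log \pi(a_t \mid s_t)$ as a one-sample estimate of the policy entropy at each step. The function name `entropy_augmented_return` and the default values of `alpha` (temperature) and `gamma` (discount) are illustrative assumptions, not part of this notebook's implementation.

```python
import math


def entropy_augmented_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    """Discounted sum of r_t - alpha * log pi(a_t | s_t) over one trajectory.

    Hypothetical helper for illustration only; alpha and gamma are assumed values.
    """
    total = 0.0
    for t, (reward, log_prob) in enumerate(zip(rewards, log_probs)):
        # -log_prob is a single-sample estimate of the policy entropy at step t.
        total += gamma ** t * (reward - alpha * log_prob)
    return total


# Toy trajectory: three steps, each action taken with probability 0.5.
rewards = [1.0, 1.0, 1.0]
log_probs = [math.log(0.5)] * 3

plain_return = entropy_augmented_return(rewards, log_probs, alpha=0.0, gamma=1.0)
soft_return = entropy_augmented_return(rewards, log_probs, alpha=0.2, gamma=1.0)
```

With `alpha=0.0` the result reduces to the ordinary return. With a positive `alpha`, a trajectory sampled from a more random policy scores higher in this toy case, which is the incentive that drives exploration.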

This implementation uses the `InvertedPendulum` environment from `Gymnasium`.

# **Import Packages**

This section imports the necessary packages for this implementation.

In [5]:
# Environment, numerics, filesystem access, progress bars, and Keras model components
import gymnasium as gym
import numpy as np
import os
from tqdm import tqdm
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout # type: ignore
from tensorflow.keras.optimizers import Adam # type: ignore

Create the environment to check that the installation works:

In [6]:
env = gym.make("InvertedPendulum-v5")