# **Introduction** 

This notebook implements the Soft Actor-Critic (SAC) algorithm developed by Haarnoja et al. in the following papers: [[1]](https://arxiv.org/abs/1801.01290)[[2]](https://arxiv.org/abs/1812.05905)[[3]](https://arxiv.org/abs/1812.11103). SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework.

In the maximum entropy framework, the actor attempts to maximize both the expected return and the entropy of its policy at the same time, which improves both exploration and robustness. The three key components of the SAC architecture are:

1. an actor-critic architecture, separating policy and value function into two distinct networks,
2. an off-policy formulation allowing the use of a replay buffer, and
3. the use of entropy maximization to encourage both stability and exploration.
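
Following [1], the maximum entropy objective described above can be written as shown here, where $\alpha$ is the temperature that weights the entropy term $\mathcal{H}$ against the reward and $\rho_\pi$ denotes the state-action distribution induced by the policy $\pi$:

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$

In the limit $\alpha \to 0$ the entropy term vanishes and the standard expected-return objective is recovered.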

This implementation uses the `InvertedPendulum` environment provided by `Gymnasium`.
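
The replay buffer from item 2 of the component list can be illustrated with a short sketch. The `ReplayBuffer` class and its method names are hypothetical helpers introduced only for illustration and are not part of this notebook's agent. The sketch assumes transitions are stored as `(state, action, reward, next_state, done)` tuples in a bounded `deque` and sampled uniformly at random.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size buffer of transitions (illustration only)."""

    def __init__(self, buffer_size: int, seed=None):
        # a deque with maxlen silently discards the oldest item once it is full
        self.storage = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        # store one transition as a tuple
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # draw a uniform random batch without replacement, then regroup
        # the transitions into one tuple per field
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)


# usage: a buffer of size 3 keeps only the 3 most recent transitions
buffer = ReplayBuffer(buffer_size=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0.0, reward=float(step), next_state=step + 1, done=False)

print(len(buffer))  # 3
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
```

Because the data in the buffer may have been collected by older versions of the policy, only an off-policy algorithm such as SAC can reuse it for training.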

# **Import Packages**

This section imports the necessary packages for this implementation.

In [38]:
# import these:
import gymnasium as gym
import numpy as np
import os
from collections import deque
from tqdm import tqdm
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout # type: ignore
from tensorflow.keras.optimizers import Adam # type: ignore

Function for making neural networks:

In [39]:
# function for making a keras model based on user inputs:
def make_model(rate : float, 
               layers : int, 
               neurons : int, 
               input_shape : int, 
               output_shape : int, 
               loss_function : str, 
               output_activation : str):
    """ 
    this is a function for making simple Keras sequential models. these models contain no batch normalization
    or dropout layers, so they have no built-in normalization or regularization; they are simply fully connected
    feedforward neural networks with ReLU-activated hidden layers.

    rate:                   a float representing the learning rate of the optimizer used, which is Adam
    layers:                 an int representing the number of hidden layers in the network
    neurons:                an int representing the number of neurons in each hidden layer of the network
    input_shape:            an int representing the shape of the input data (input_shape, )
    output_shape:           an int representing the number of outputs of the network
    loss_function:          a string naming the loss function the model is compiled with
    output_activation:      a string representing the activation function of the output layer
    
    """
    # instantiate model:
    model = keras.Sequential()

    # add input layer (done outside the loop so it also exists when layers == 0):
    model.add(Input(shape = (input_shape, )))

    # add hidden layers:
    for i in range(layers):
        model.add(Dense(neurons, activation = 'relu', name = f"hidden_layer_{i+1}"))

    # add output layer:
    model.add(Dense(output_shape, activation = output_activation, name = 'output_layer'))

    # compile the model:
    model.compile(optimizer = Adam(learning_rate = rate),
                  loss = loss_function)
    
    # return to user:
    return model
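
To make the architecture built by `make_model` concrete, the number of trainable parameters can be worked out by hand: a `Dense` layer with `n_in` inputs and `n_out` units holds `n_in * n_out` weights plus `n_out` biases. The helper `count_parameters` below is a hypothetical illustration and not part of this notebook. It assumes at least one hidden layer, and the example sizes (4 inputs, 2 hidden layers of 256 neurons, 1 output) are assumed values chosen only for the arithmetic.

```python
def count_parameters(layers: int, neurons: int, input_shape: int, output_shape: int) -> int:
    """Count weights and biases of a fully connected network (illustration only)."""
    if layers < 1:
        raise ValueError("this sketch assumes at least one hidden layer")

    total = 0
    n_in = input_shape
    for _ in range(layers):
        # weights (n_in * neurons) plus one bias per neuron
        total += n_in * neurons + neurons
        n_in = neurons

    # output layer: weights plus one bias per output
    total += n_in * output_shape + output_shape
    return total


# assumed example sizes: 4 inputs, 2 hidden layers of 256 neurons, 1 output
example_total = count_parameters(layers=2, neurons=256, input_shape=4, output_shape=1)
print(example_total)  # 67329 = (4*256 + 256) + (256*256 + 256) + (256*1 + 1)
```

For a model returned by `make_model` with the same sizes, Keras' `model.count_params()` should report the same figure.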

Class for making SAC agents:

In [None]:
# define class:
class SAC_Agent:
    ####################### INITIALIZATION #######################
    # constructor:
    def __init__(self,
                env: gym.Env,
                lr_a: float,
                lr_c: float,
                gamma: float,
                layers: int,
                neurons: int,
                batch_size: int,
                buffer_size: int,
                gradient_steps: int,
                polyak_coefficient: float,
                target_update_interval: int,
                ):
        """ 
        this is the constructor for the agent. this agent uses the soft actor-critic (SAC) algorithm to learn an optimal policy.
        the theory behind this implementation comes from maximum entropy reinforcement learning, which seeks to improve the robustness and the
        exploratory behavior of the agent by changing the learning objective to maximize both the expected return and the policy entropy.

        the base SAC implementation in [1] is brittle with respect to the temperature. this is because the SAC algorithm is very sensitive to the scaling of the
        rewards: the reward scale acts as the inverse of the temperature, which determines the relative importance of the entropy term versus the reward.

        the modified implementation in [2] removes the need to hand-tune the temperature by learning it automatically. the learning objective is
        reformulated as a constrained problem: the learned stochastic policy attempts to achieve maximal expected return subject to a minimum
        expected entropy constraint.

        env:                        a gymnasium environment
        lr_a:                       a float value representing the learning rate of the actor, α_a
        lr_c:                       a float value representing the learning rate of the critic, α_c
        gamma:                      a float value representing the discount factor, γ
        layers:                     an int value indicating the number of hidden layers in a given network
        neurons:                    an int value indicating the number of neurons in each hidden layer of a given network
        batch_size:                 an int value indicating the number of samples to sample from the replay buffer
        buffer_size:                an int value indicating the size of the replay buffer
        gradient_steps:             an int value indicating how many gradient steps to apply
        polyak_coefficient:         a float value indicating the target smooth coefficient (polyak coefficient)
        target_update_interval:     an int value indicating how often to apply the smooth target network update

        nS:                 an int representing the dimension of the continuous state (observation) space
        nA:                 an int representing the dimension of the continuous action space
        actor:              a Keras sequential neural network representing the actor network
        critic_1:           a Keras sequential neural network representing the first critic network
        critic_2:           a Keras sequential neural network representing the second critic network
        experience:         an empty deque used to hold the experience history of the agent, limited by 'buffer_size'
        entropy_target:     an int value representing the desired entropy target, set to -nA

        """
        # object parameters:
        self.env = env
        self.lr_a = lr_a
        self.lr_c = lr_c
        self.gamma = gamma
        self.layers = layers
        self.neurons = neurons
        self.batch_size = batch_size
        self.buffer_size = buffer_size
        self.gradient_steps = gradient_steps
        self.polyak_coefficient = polyak_coefficient
        self.target_update_interval = target_update_interval

        # get the environmental dimensions (number of states and number of actions):
        self.nS = self.env.observation_space.shape[0]
        self.nA = self.env.action_space.shape[0]
        self.entropy_target = -self.nA      # see appendix D in [2]

        # create networks:
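        # for a continuous action space the actor parameterizes a Gaussian policy, so it outputs a mean and a
        # log standard deviation per action dimension (2 * nA values). a softmax over a single output would always
        # return 1. the compiled loss is only a placeholder, since the SAC actor objective needs a custom training step.
        self.actor = make_model(rate = self.lr_a,
                           layers = self.layers,
                           neurons = self.neurons,
                           input_shape = self.nS,
                           output_shape = 2 * self.nA,
                           loss_function = "mse",
                           output_activation = "linear")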
        
        # the critics are soft Q-functions Q(s, a), so they take the concatenated state and action as input:
        self.critic_1 = make_model(rate = self.lr_c,
                            layers = self.layers,
                            neurons = self.neurons,
                            input_shape = self.nS + self.nA,
                            output_shape = 1,
                            loss_function = "mse",
                            output_activation = "linear")
        
        self.critic_2 = make_model(rate = self.lr_c,
                            layers = self.layers,
                            neurons = self.neurons,
                            input_shape = self.nS + self.nA,
                            output_shape = 1,
                            loss_function = "mse",
                            output_activation = "linear")
        
        # initialize the experience buffer:
        self.experience = deque(maxlen = self.buffer_size)
