# **Introduction**

This notebook implements an advantage Actor-Critic method (A2C) to learn an optimal policy for the `CartPole` environment. The method uses two networks: the actor maps states to a probability distribution over the actions, and the critic estimates the value of a state to guide the actor. The general idea is that the actor updates its policy in the direction suggested by the critic.
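
As a rough sketch of what "the direction suggested by the critic" means, one common formulation is written out here. The symbols $R_t$ (the bootstrapped n-step return), $V_\phi$ (the critic), and $\pi_\theta$ (the actor) are notation introduced for illustration and are not taken from the notebook's code:

$$
A(s_t, a_t) = R_t - V_\phi(s_t), \qquad
\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]
$$

Under this reading, a positive advantage makes the chosen action more likely and a negative one makes it less likely, while the critic is fitted by regressing $V_\phi(s_t)$ toward $R_t$.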

Following [this paper](https://proceedings.mlr.press/v97/ahmed19a/ahmed19a.pdf), I also implement entropy regularization in an attempt to reduce variance and improve performance. In addition, I vectorize the environment, running `n_envs` concurrent environments, so that the implemented algorithm is closer to [its original introduction](https://arxiv.org/pdf/1602.01783).
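
To make the entropy term concrete, here is a minimal NumPy sketch, assuming the usual Shannon entropy of the actor's action distribution. The helper `policy_entropy`, the example probabilities, and the placeholder `pg_loss` value are all hypothetical and only serve the illustration:

```python
import numpy as np

def policy_entropy(probs, eps=1e-8):
    """Shannon entropy H(pi) = -sum_a pi(a) * log pi(a) along the last axis."""
    probs = np.asarray(probs, dtype=np.float64)
    # eps guards against log(0) for actions with zero probability
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# two hypothetical policies over two actions:
uniform = policy_entropy([0.5, 0.5])    # maximally uncertain policy
peaked = policy_entropy([0.99, 0.01])   # nearly deterministic policy

# hypothetical regularized actor loss: subtracting beta * entropy means that
# minimizing the loss favors policies that keep exploring
beta = 0.01
pg_loss = 0.3  # placeholder policy-gradient loss, not a real result
regularized_loss = pg_loss - beta * uniform
```

The uniform policy has the larger entropy, so it receives the larger bonus; `beta` controls how strongly that bonus competes with the policy-gradient term.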

# **Import Packages**

This section imports the necessary packages.

In [1]:
# import these:
import gymnasium as gym
from gymnasium.vector import AsyncVectorEnv
import numpy as np
import tensorflow as tf
from tqdm import tqdm
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

# **Environment Setup**

This section sets up the environment and defines the relevant functions needed for this implementation.

##### Function for making vectorized environments:

In [2]:
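# function that returns an environment factory for the given env_name and seed:
def make_env(env_name: str, seed: int):
    # vector environments expect a list of callables that each build one environment,
    # so return a function that makes the environment rather than the environment itself:
    def intermediary():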
        # make the environment based on the provided env_name:
        env = gym.make(env_name)

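        # wrap environment (SwingUpWrapper is not defined in the cells shown here;
        # it must be defined or imported before this cell runs):
        env = SwingUpWrapper(env)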

        # reset and seed the environment:
        env.reset(seed = seed)

        # return env to user:
        return env
    return intermediary
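
The nested function is needed because gymnasium's vector environments take a list of callables and call each one themselves (for `AsyncVectorEnv`, inside a worker process). The sketch below shows the same factory pattern without gymnasium; `DummyEnv`, `make_dummy_env`, and `build_all` are hypothetical stand-ins, not part of any library:

```python
class DummyEnv:
    """Hypothetical stand-in for a gymnasium environment."""
    def __init__(self, name):
        self.name = name
        self.seed = None

    def reset(self, seed=None):
        self.seed = seed

def make_dummy_env(env_name, seed):
    # return a function that builds the environment, not the environment itself
    def thunk():
        env = DummyEnv(env_name)
        env.reset(seed=seed)
        return env
    return thunk

def build_all(env_fns):
    # stand-in for a vector environment: each factory is called only at this point
    return [fn() for fn in env_fns]

n_envs = 4
env_fns = [make_dummy_env("CartPole-v1", seed=i) for i in range(n_envs)]
envs = build_all(env_fns)
seeds = [env.seed for env in envs]
```

With the notebook's own function, the analogous call would be along the lines of `AsyncVectorEnv([make_env(env_name, seed + i) for i in range(n_envs)])`, giving each copy a different seed.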

##### Function for making Keras models:

In [3]:
# function for making a keras model based on user inputs:
def make_model(layers, neurons, rate, input_shape, output_shape, loss_function, output_activation):
    # instantiate model:
    model = keras.Sequential()

    # add input layer:
    model.add(Input(shape = (input_shape, )))

    # add hidden layers:
    for i in range(layers):
        model.add(Dense(neurons, activation = 'relu', name = f"hidden_layer_{i+1}"))

    # add output layer:
    model.add(Dense(output_shape, activation = output_activation, name = 'output_layer'))

    # compile the model:
    model.compile(optimizer = Adam(learning_rate = rate),
                  loss = loss_function)
    
    # return to user:
    return model

##### Vectorized A2C class:

In [None]:
# A2C class:
class A2C_Agent:
    ####################### INITIALIZATION #######################
    # constructor:
    def __init__(self,
                envs: AsyncVectorEnv,
                gamma: float, 
                lr_a: float,
                lr_c: float, 
                beta: float,
                layers: int, 
                neurons: int,
                n_envs: int,
                n_steps: int = 5):
        """
        this is the constructor for the agent. this agent uses the advantage actor-critic (A2C) algorithm to learn an optimal policy,
        through the use of two approximator networks. the first network, called the actor, is responsible for providing the probability 
        distribution over all actions given a state. the second network, called the critic, estimates the value of a state; that estimate
        is used to compute the advantage that guides the learning of the actor.

        this implementation uses entropy regularization to encourage exploration, and utilizes a vectorized environment to help stabilize
        the training procedure.

        envs:               asynchronously vectorized gymnasium environments
        gamma:              a float value indicating the discount factor, γ
        lr_a:               a float value indicating the learning rate of the actor, α_a
        lr_c:               a float value indicating the learning rate of the critic, α_c
        beta:               a float value indicating the entropy regularization parameter, β
        layers:             an int value indicating the number of layers in a network
        neurons:            an int value indicating the number of neurons per layer
        n_steps:            an int value indicating the number of steps to use when computing the return
        n_envs:             an int value indicating the number of parallel environments used

        nS:                 an int representing the number of continuous state variables in an observation
        nA:                 an int representing the number of discrete actions that can be taken

        actor:              a Keras sequential neural network representing the actor
        critic:             a Keras sequential neural network representing the critic

        obs_buf:            a list used to hold the states collected during the n-step rollout
        act_buf:            a list used to hold the actions collected during the n-step rollout
        rew_buf:            a list used to hold the rewards collected during the n-step rollout
        val_buf:            a list used to hold the critic's value estimates collected during the n-step rollout
        done_buf:           a list used to hold the dones collected during the n-step rollout
        
        """
        # object parameters:
        self.envs       = envs
        self.gamma      = gamma
        self.lr_a       = lr_a
        self.lr_c       = lr_c
        self.beta       = beta
        self.n_steps    = n_steps
        self.n_envs     = n_envs

        # get environment dimensions:
        self.nS = envs.single_observation_space.shape[0]
        self.nA = envs.single_action_space.n

        # initialize the networks:
        self.actor = make_model(layers = layers,
                                neurons = neurons,
                                rate = lr_a,
                                input_shape = self.nS,
                                output_shape = self.nA,
                                output_activation = "softmax",
                                loss_function = "categorical_crossentropy")

        self.critic = make_model(layers = layers,
                                neurons = neurons,
                                rate = lr_c,
                                input_shape = self.nS,
                                output_shape = 1,  # the critic outputs a single state value V(s)
                                output_activation = "linear",
                                loss_function = "mse")

        # initialize buffers for rollout:
        self.obs_buf = []
        self.act_buf = []
        self.rew_buf = []
        self.val_buf = []
        self.done_buf = []

    ####################### TRAINING #######################
    # function for calculating discounted returns:
    def discounted_returns(self, rewards, dones, last_value):
        # compute the discounted n-step return for a vectorized environment,
        # bootstrapping from the critic's value of the final state:
        returns = np.zeros((self.n_steps, self.n_envs))
        running_return = last_value

        # work backwards through the rollout, cutting the bootstrap at episode ends:
        for t in reversed(range(self.n_steps)):
            running_return = rewards[t] + self.gamma * running_return * (1 - dones[t])
            returns[t] = running_return
    
        # return to user:
        return returns

    # decorated training step function:
    @tf.function
    def training_step(self, states, actions, returns, advantages):
        # convert values to tensors, if not already:
        states = tf.convert_to_tensor(states, dtype = tf.float32)
        actions = tf.cast(actions, dtype = tf.int32)
        returns = tf.cast(returns, dtype = tf.float32)
        advantages = tf.cast(advantages, dtype = tf.float32)

        # CRITIC UPDATE:




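
The update inside `training_step` is not shown above, so the following is only a NumPy sketch of the quantities a typical A2C update works with: bootstrapped n-step returns, advantages, and the actor and critic losses. It assumes one common formulation rather than the notebook's final TensorFlow code; `n_step_returns`, `a2c_losses`, and all toy values are hypothetical:

```python
import numpy as np

def n_step_returns(rewards, dones, last_value, gamma):
    """Bootstrapped discounted returns; rewards and dones have shape (n_steps, n_envs)."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = np.asarray(last_value, dtype=np.float64)
    for t in reversed(range(rewards.shape[0])):
        # (1 - done) stops value from leaking across an episode boundary
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

def a2c_losses(probs, actions, returns, values, beta, eps=1e-8):
    """Loss terms for one batch; probs has shape (batch, n_actions)."""
    advantages = returns - values                       # treated as constants for the actor
    chosen = probs[np.arange(len(actions)), actions]    # pi(a_t | s_t)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    actor_loss = -np.mean(np.log(chosen + eps) * advantages) - beta * np.mean(entropy)
    critic_loss = np.mean((returns - values) ** 2)
    return actor_loss, critic_loss

# toy rollout: 3 steps, 2 environments; environment 1 terminates at step 1
rewards = np.ones((3, 2))
dones = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
last_value = np.array([10.0, 10.0])
returns = n_step_returns(rewards, dones, last_value, gamma=0.5)

# flatten (n_steps, n_envs) into one batch, as a training step would
flat_returns = returns.reshape(-1)
values = np.zeros_like(flat_returns)             # placeholder critic outputs
probs = np.full((flat_returns.size, 2), 0.5)     # placeholder uniform policy
actions = np.zeros(flat_returns.size, dtype=int)
actor_loss, critic_loss = a2c_losses(probs, actions, flat_returns, values, beta=0.01)
```

In the toy rollout, the return for environment 1 at step 1 is just its reward, because the episode ended there and nothing is bootstrapped past that point. In a TensorFlow version, the same two losses would be computed inside a `tf.GradientTape` so that gradients can be applied to the actor and critic separately.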