# Reinforcement learning project

Contact: Elie KADOCHE, eliekadoche78@gmail.com

This is a reinforcement learning project on policy gradient methods.
In this Jupyter notebook you will find all the code and instructions.
More explanations will be given on the board during the practical sessions.
Do not hesitate to ask for help if you need it.
The overall project is graded out of 20.
The grading scale is given for information and may change slightly.
Your solutions should all appear in this Jupyter notebook, with both explanations and code.
- You will write your answers, thoughts and approaches for each Section.
Even if you make some mistakes, you can describe them and explain how you corrected them.
This will be taken into account in the grading.
- Each time there is a comment `# ---> TODO:`, there is some code missing that you need to write.
Be mindful of the code quality.
Each line of code should be accompanied by a comment describing what it does.

In [None]:
!pip install gymnasium
!pip install numpy
!pip install scipy
!pip install torch

Collecting gymnasium
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-1.0.0


In [None]:
import math

import gymnasium as gym
import numpy as np
import torch
from gymnasium import spaces
from gymnasium.utils import seeding
from numpy.random import default_rng
from torch import nn as nn
from torch import optim as optim
from torch.distributions import Categorical
from torch.nn import functional as F

## 1) Cartpole environment (/2)

Your objective is to understand how the cartpole environment works and what problem we want to solve.
You do not need to write any code for this Section.
You need to describe the states, the rewards and the actions.
Write down the Markov Decision Process associated with the problem.
You can read Section 5 of this [document](http://incompleteideas.net/papers/barto-sutton-anderson-83.pdf) for help.
Below are: the environment and a script to test a random policy.

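**State:**  
x - position of the cart on the track,  
θ - angle of the pole with the vertical,  
ẋ - cart velocity, and  
θ̇ - rate of change of the angle.  
In the code, the state vector is ordered as (x, ẋ, θ, θ̇).

**Action:**  
There are two discrete actions: push the cart to the left (action 0) or to the right (action 1) with a force of fixed magnitude.

**Reward:**  
In the reference paper, the feedback is a failure signal, i.e., a penalty received when the pole falls or the cart leaves the track. In the environment below, the agent instead receives a reward of 1 at every time step until the episode ends, so the total reward is the number of steps the pole stayed balanced (at most 500).

**Markov Decision Process:**  
The MDP is defined by the states, the actions, the transition probabilities $P(s_{t+1} | s_t, a_t)$ and the reward function. At each time step, the policy picks an action from the current state and the environment moves to the next state. The Markov property holds because the next state depends only on the current state and action, not on the earlier history.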
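
To make the failure condition concrete, here is a minimal sketch that checks whether a cart position and a pole angle end the episode. The two thresholds are copied from the environment code below; the helper name `is_failure` is hypothetical and is not part of the notebook.

```python
import math

# Thresholds copied from the environment definition in this notebook
X_THRESHOLD = 2.4
THETA_THRESHOLD = 12 * 2 * math.pi / 360  # 12 degrees, in radians


def is_failure(x, theta):
    """Return True if the cart left the track or the pole tilted too far.

    Hypothetical helper, for illustration only.
    """
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_failure(0.0, 0.0))   # upright pole, centered cart
print(is_failure(2.5, 0.0))   # cart beyond the track limit
print(is_failure(0.0, 0.25))  # pole tilted by roughly 14 degrees
```

The environment also ends the episode after 500 time steps, which this sketch leaves out on purpose.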

In [None]:
"""Cartpole environment (default)."""

class CartpoleEnvV0(gym.Env):
    """Cartpole environment."""

    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 60}

    def __init__(self, env_context=None, render_mode=None):
        """Initialize environment.

        Args:
            env_context (dict): environment configuration.
            render_mode (str): render mode.
        """
        # Variables
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masspole + self.masscart
        self.length = 0.5  # Actually half the pole's length
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02  # Seconds between state updates
        self.kinematics_integrator = "euler"

        # Angle at which to fail the episode
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4

        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds
        high = np.array([
            self.x_threshold * 2,
            np.finfo(np.float32).max,
            self.theta_threshold_radians * 2,
            np.finfo(np.float32).max,
        ], dtype=np.float32)

        # Action and observation (state) spaces
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)

        # Render mode
        self.render_mode = render_mode

        # Others
        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True
        self.state = None

    def _process_state(self):
        """Process state before returning it.

        Returns:
            processed_state (numpy.array): processed state.
        """
        # No modifications
        processed_state = self.state

        return processed_state

    def reset(self, seed=None, options=None):
        """Reset the environment.

        Args:
            seed (int): seed for reproducibility.
            options (dict): additional information.

        Returns:
            state (numpy.array): the processed state.

            info (dict): auxiliary diagnostic information.
        """
        # Reset seed
        if seed is not None:
            self._np_random, seed = seeding.np_random(seed)

        # Current time step
        self._time_step = 0

        # Reset state
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        self.state = self.state.astype(np.float32)

        # Render if needed
        if self.render_mode == "human":
            self.render()

        return self._process_state(), {}

    def step(self, action):
        """Go from current step to next one.

        Args:
            action (int): action of the agent.

        Returns:
            state (numpy.array): state.

            reward (float): reward.

            terminated (bool): whether a terminal state is reached.

            truncated (bool): whether a truncation condition is reached.

            info (dict): auxiliary diagnostic information.
        """
        # Check if action is valid
        err_msg = f"{action!r} ({type(action)}) invalid"
        assert self.action_space.contains(action), err_msg
        assert self.state is not None, "Call reset before using step method."

        # Compute variables
        x_tmp = self.state
        x, x_dot, theta, theta_dot = x_tmp[0], x_tmp[1], x_tmp[2], x_tmp[3]
        force = self.force_mag if action == 1 else -self.force_mag
        costheta = math.cos(theta)
        sintheta = math.sin(theta)

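        # https://coneural.org/florian/papers/05_cart_pole.pdf
        # The applied force and the centrifugal term are both divided by the
        # total mass, hence the parentheses
        m = self.polemass_length
        temp = (force + m * theta_dot**2 * sintheta) / self.total_mass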
        thetaacc = (self.gravity * sintheta - costheta * temp)
        thetaacc /= 4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass
        thetaacc /= self.length
        xacc = temp - m * thetaacc * costheta / self.total_mass

        # Update system
        if self.kinematics_integrator == "euler":
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # Semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        # Full system state
        self.state = np.array([
            x,
            x_dot,
            theta,
            theta_dot,
        ], dtype=np.float32)

        # Reward is 1
        reward = 1.0

        # Increase time step
        self._time_step += 1

        # Check if episode is finished
        terminated = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
            or self._time_step >= 500
        )

        # Render if needed
        if self.render_mode == "human":
            self.render()

        return self._process_state(), reward, terminated, False, {}

    def render(self):
        """Render environment.

        Note:
            Do not pay too much attention to this function. It is just to
            display a nice animation of the environment.
        """
        import pygame
        from pygame import gfxdraw

        # Initialize render mode if needed
        if self.render_mode is None:
            self.render_mode = "human"

        # Initialize objects
        if self.screen is None:
            pygame.init()
            if self.render_mode == "human":
                pygame.display.init()
                self.screen = pygame.display.set_mode(
                    (self.screen_width, self.screen_height))
            else:  # mode == "rgb_array"
                self.screen = pygame.Surface(
                    (self.screen_width, self.screen_height))

        # Initialize clock
        if self.clock is None:
            self.clock = pygame.time.Clock()

        # Objects
        world_width = self.x_threshold * 2
        scale = self.screen_width / world_width
        polewidth = 10.0
        polelen = scale * (2 * self.length)
        cartwidth = 50.0
        cartheight = 30.0

        # Get state
        if self.state is None:
            return None
        x = self.state

        # Get surface
        self.surf = pygame.Surface((self.screen_width, self.screen_height))
        self.surf.fill((255, 255, 255))

        # Computations
        l = -cartwidth / 2
        r = cartwidth / 2
        t = cartheight / 2
        b = -cartheight / 2
        axleoffset = cartheight / 4.0
        cartx = x[0] * scale + self.screen_width / 2.0  # MIDDLE OF CART
        carty = 100  # TOP OF CART
        cart_coords = [(l, b), (l, t), (r, t), (r, b)]
        cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
        gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))

        l, r, t, b = (
            -polewidth / 2,
            polewidth / 2,
            polelen - polewidth / 2,
            -polewidth / 2,
        )

        pole_coords = []
        for coord in [(l, b), (l, t), (r, t), (r, b)]:
            coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
            coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
            pole_coords.append(coord)
        gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
        gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))

        gfxdraw.aacircle(self.surf,
                         int(cartx),
                         int(carty + axleoffset),
                         int(polewidth / 2),
                         (129, 132, 203))
        gfxdraw.filled_circle(self.surf,
                              int(cartx),
                              int(carty + axleoffset),
                              int(polewidth / 2),
                              (129, 132, 203))

        gfxdraw.hline(self.surf, 0, self.screen_width, carty, (0, 0, 0))

        # Display
        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))

        # Human mode
        if self.render_mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        # RGB array mode
        elif self.render_mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)),
                axes=(1, 0, 2)
            )

    def close(self):
        """Close the environment.

        Note:
            Do not pay too much attention to this function. It is just to close
            the environment.
        """
        if self.screen is not None:
            import pygame
            pygame.display.quit()
            pygame.quit()
            self.isopen = False

In [None]:
# Script to run a random policy
# ------------------------------------------

# Create environment
env = CartpoleEnvV0()

# Create random generator
generator = default_rng(seed=None)

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Select a random action
    action = generator.integers(0, 2)

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Accumulate the reward and render the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

total_reward = 11.0


## 2) Deep neural network policy (/2)

We want to build a policy $\pi_\theta(a | s) = P(a | s, \theta)$ that gives the probability of choosing an action $a$ in state $s$.
The policy is a deep neural network parameterized by some weights $\theta$.
The policy is also referred to as "actor".
Depending on what you understood from the cartpole environment, you need to change the `input_size` and `nb_actions` variables in the actor code.
In the code running the policy on cartpole, you need to find how to select an action from the output of the policy.
Explain your method.
How does the policy perform on cartpole?
Why?
Below are: an example of a simple neural network in PyTorch, the actor code for Cartpole and a script to run the policy.

**Your answer**:   
The policy outputs a probability for each of the two actions, and we select the action with the highest probability (argmax). We use a policy gradient algorithm, which searches directly for the policy parameters that maximize the expected return from the initial state by gradient ascent, rather than solving the Bellman equation. This relies on the assumption that the problem is a Markov decision process. At this stage the policy performs poorly and consistently achieves a reward under 15. This is because the network still has its random initial weights: we have not yet run backpropagation or any optimizer step.
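
As an illustration of the two usual ways to turn action probabilities into an action, the sketch below compares greedy selection with sampling. It uses plain Python rather than PyTorch, the probabilities are made up, and the helper names `select_greedy` and `select_sampled` are hypothetical.

```python
import random


def select_greedy(probabilities):
    """Pick the index of the most probable action."""
    return max(range(len(probabilities)), key=lambda a: probabilities[a])


def select_sampled(probabilities, rng):
    """Draw an action index at random, weighted by the probabilities."""
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]


rng = random.Random(0)  # seeded only to make the illustration repeatable
probabilities = [0.3, 0.7]  # made-up policy output for a two-action problem

# Greedy selection always returns the same action for the same state
greedy_action = select_greedy(probabilities)

# Sampling returns action 1 about 70% of the time, which keeps exploring
sampled_actions = [select_sampled(probabilities, rng) for _ in range(1000)]
print(greedy_action, sum(sampled_actions) / len(sampled_actions))
```

Greedy selection is deterministic, while sampling keeps some exploration, which is why the training script in the next Section samples from a categorical distribution.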

In [None]:
"""Example of a simple neural network with PyTorch."""

import torch


class Net(torch.nn.Module):
    """Network class."""

    def __init__(self):
        """Init."""
        super(Net, self).__init__()
        self.A = torch.nn.Linear(1, 32)
        self.B = torch.nn.Linear(32, 32)
        self.C = torch.nn.Linear(32, 1)

    def forward(self, x):
        """Forward."""
        x = torch.relu(self.A(x))
        x = torch.relu(self.B(x))
        x = self.C(x)
        return x


# Create model
model = Net()

# Create the optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)

# Training loop
while True:

    # We create some random integers, stored as floats. The requires_grad
    # flag is not strictly needed on the inputs: the model parameters
    # already track gradients, which is what the optimizer relies on
    x = torch.randint(
        0, 7, (250, 1), dtype=torch.float32, requires_grad=True)

    # We create the real y values. The objective for the model is to give
    # the squared value of x
    y_true = torch.square(x)

    # We use our model to predict the y values
    y = model(x)

    # We compute the loss
    loss = torch.mean(torch.square(y - y_true))

    # Print loss: it should decrease!
    print("{:.2E}".format(loss.item()))

    # Reset the gradients to 0
    optimizer.zero_grad()

    # Compute the gradients of the model parameters relative to the loss
    loss.backward()

    # Update the network
    optimizer.step()

2.96E+02
3.16E+02
3.15E+02
2.72E+02
2.32E+02
2.89E+02
2.64E+02
2.33E+02
2.15E+02
2.11E+02
1.81E+02
1.83E+02
1.34E+02
1.32E+02
1.32E+02
1.12E+02
8.83E+01
6.32E+01
3.94E+01
3.71E+01
3.00E+01
2.59E+01
2.70E+01
3.03E+01
3.20E+01
3.51E+01
3.81E+01
3.48E+01
3.21E+01
2.94E+01
2.15E+01
1.95E+01
1.95E+01
2.28E+01
2.37E+01
2.36E+01
2.00E+01
2.11E+01
1.98E+01
2.00E+01
1.50E+01
1.57E+01
1.56E+01
1.53E+01
1.54E+01
1.72E+01
1.44E+01
1.23E+01
1.17E+01
1.11E+01
1.01E+01
1.07E+01
1.11E+01
1.09E+01
9.02E+00
8.40E+00
7.55E+00
6.89E+00
7.56E+00
7.78E+00
7.57E+00
6.87E+00
7.40E+00
6.41E+00
6.30E+00
7.17E+00
6.86E+00
6.72E+00
6.47E+00
6.34E+00
5.74E+00
6.36E+00
6.00E+00
5.75E+00
6.46E+00
6.36E+00
5.95E+00
5.37E+00
5.68E+00
6.33E+00
5.75E+00
5.50E+00
5.52E+00
5.70E+00
5.20E+00
5.46E+00
5.02E+00
5.34E+00
4.67E+00
5.02E+00
4.97E+00
4.73E+00
4.87E+00
4.10E+00
4.32E+00
4.32E+00
4.46E+00
4.05E+00
4.11E+00
3.99E+00
3.90E+00
4.13E+00
3.70E+00
3.48E+00
3.45E+00
3.80E+00
3.52E+00
3.32E+00
3.27E+00
3.11E+00
3.65E+00
3

KeyboardInterrupt: 

In [None]:
"""Actor network (default)."""

class ActorModelV0(nn.Module):
    """Deep neural network."""

    # By default, use CPU
    DEVICE = torch.device("cpu")

    def __init__(self):
        """Initialize model."""
        super(ActorModelV0, self).__init__()
        # ---> TODO: change input and output sizes depending on the environment
        input_size = 4
        nb_actions = 2

        # Build layer objects
        self.fc0 = nn.Linear(input_size, 128)
        self.fc1 = nn.Linear(128, 128)
        self.policy = nn.Linear(128, nb_actions)

    def _preprocessor(self, state):
        """Preprocessor function.

        Args:
            state (numpy.array): environment state.

        Returns:
            x (torch.tensor): preprocessed state.
        """
        # Add batch dimension
        x = np.expand_dims(state, 0)

        # Transform to torch.tensor
        x = torch.from_numpy(x).float().to(self.DEVICE)

        return x

    def forward(self, x):
        """Forward pass.

        Args:
            x (numpy.array): environment state.

        Returns:
            actions_prob (torch.tensor): list with the probability of each
                action over the action space.
        """
        # Preprocessor
        x = self._preprocessor(x)

        # Input layer
        x = F.relu(self.fc0(x))

        # Middle layers
        x = F.relu(self.fc1(x))

        # Policy
        action_prob = F.softmax(self.policy(x), dim=-1)

        return action_prob

In [None]:
# Script to run a deep neural network policy
# ------------------------------------------

# Create environment and policy
env = CartpoleEnvV0()
policy = ActorModelV0()

# Testing mode
policy.eval()
print(policy)

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Use the policy to generate the probabilities of each action
    probabilities = policy(state)

    # ---> TODO: how to select an action
    # Greedy selection: take the action with the highest probability
    action = probabilities.argmax().item()

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Accumulate the reward and render the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

ActorModelV0(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)
total_reward = 9.0


## 3) REINFORCE algorithm (/6)

We want to find the parameters $\theta$ that maximize the performance measure $J(\theta) = \mathbb{E}_{\pi_\theta}[ G_0 ]$ with $G_t = \sum_{k=0}^{\infty} \beta^k r_{t+k+1}$ and $\beta \in [0, 1]$ being a discount factor.
To do so, we use the gradient ascent method: $\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta_k} J(\theta_k)$ with $\alpha$ being the learning rate.
The performance measure depends on both the action selection and the distribution of states.
Both are affected by the policy parameters, which makes the computation of the gradient challenging.

The policy gradient theorem gives an expression for $\nabla_\theta J(\theta)$ that does not involve the derivative of the state distribution.
The expectation is over all possible state-action trajectories under the policy $\pi_\theta$:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \sum_{t=0}^{\infty} G_t \nabla_\theta \ln \pi_\theta(a_t | s_t) ]$.
In the REINFORCE algorithm, we use a Monte-Carlo estimate over one episode, i.e., one trajectory:
$\nabla_\theta J(\theta) \approx \sum_{t=0}^{\infty} G_t \nabla_\theta \ln \pi_\theta(a_t | s_t)$.
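
The returns $G_t$ used in these formulas can be computed in a single backward pass over the rewards of an episode, using the recursion $G_t = r_{t+1} + \beta G_{t+1}$. The following is a minimal sketch in plain Python; the function name `discounted_returns` is hypothetical, and `rewards[t]` is assumed to hold the reward received after the action taken at step `t`.

```python
def discounted_returns(rewards, discount):
    """Compute the discounted return of every time step of one episode."""
    returns = []
    running = 0.0
    # Walk backward so each return reuses the one computed just after it
    for reward in reversed(rewards):
        running = reward + discount * running
        returns.append(running)
    # Put the returns back in chronological order
    returns.reverse()
    return returns


print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

With three rewards of 1 and a discount of 0.5, the last step is worth 1, the one before it 1 + 0.5 × 1 = 1.5, and the first one 1 + 0.5 × 1.5 = 1.75.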

Your objective is to complete the REINFORCE algorithm to train the policy until convergence. To solve the problem, you need to achieve a cumulative reward of at least 500 when training the policy. Below are: the code of the REINFORCE algorithm and a script to test your policy once it is trained.

1.  Write the code to compute the discounted sum of rewards. (/1)
2.  Write the REINFORCE loss for the policy. (/3)
3.  Change the $\alpha$ and the $\beta$ parameters to solve the problem. (/1)
4.  Find a good metric to decide when the training should be stopped. (/1)

After experimenting with the learning rate and the discount factor, I set the learning rate $\alpha$ to 0.001 and the discount factor $\beta$ to 0.99. These values favor a slower but more stable convergence. The first score of 500 consistently appears around the 300th iteration. Since the rewards tend to become more stable as training goes on, I stop the training after 1000 iterations. I also stop earlier if the last 200 recorded rewards are all equal to 500. This consistently yields a high cumulative reward.
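
One way to implement such a stopping rule is a fixed-size window of recent episode rewards. The sketch below is only an illustration under assumptions: the class name `StopWhenSolved` is hypothetical, and the window is shrunk to 3 episodes so the example stays readable.

```python
from collections import deque


class StopWhenSolved:
    """Signal a stop once the last `window` episodes all reach `target`."""

    def __init__(self, target, window):
        self.target = target
        # A bounded deque drops the oldest reward automatically
        self.recent = deque(maxlen=window)

    def update(self, episode_reward):
        """Record one episode reward and return True if training can stop."""
        self.recent.append(episode_reward)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and min(self.recent) >= self.target


stopper = StopWhenSolved(target=500, window=3)  # tiny window for the demo
rewards = [120, 500, 500, 480, 500, 500, 500]  # made-up episode rewards
decisions = [stopper.update(r) for r in rewards]
print(decisions)
```

The window has to be created once, before the training loop, so that it keeps its content from one episode to the next.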

In [None]:
# Script to train a policy with the REINFORCE algorithm
# ------------------------------------------

# Maximum environment length
HORIZON = 500

# ---> TODO: change the discount factor to solve the problem
DISCOUNT_FACTOR = 0.99

# ---> TODO: change the learning rate to solve the problem
LEARNING_RATE = 0.001


best_reward = -float("inf")

# Create environment and policy
env = CartpoleEnvV0()
actor = ActorModelV0()
actor_path = "./actor_0.pt"

# Training mode
actor.train()
print(actor)

# Create optimizer with the policy parameters
actor_optimizer = optim.Adam(actor.parameters(), lr=LEARNING_RATE)

# ---> TODO: when do we stop the training?

# Run infinitely many episodes
training_iteration = 0
while True:

    # Experience
    # ------------------------------------------

    # Reset the environment
    state, _ = env.reset()

    # During experience, we will save:
    # - the log-probability of the chosen action at each time step
    #   ln pi(at|st)
    # - the rewards received at each time step ri
    saved_probabilities = list()
    saved_rewards = list()

    # Prevent infinite loop
    for t in range(HORIZON + 1):

        # Use the policy to generate the probabilities of each action (the
        # actor converts the NumPy state to a batched tensor itself)
        probabilities = actor(state)

        # Create a categorical distribution over the list of probabilities
        # of actions and sample an action from it
        distribution = Categorical(probabilities)
        action = distribution.sample()

        # Take the action
        state, reward, terminated, _, _ = env.step(action.item())

        # Save the log-probability of the chosen action and the reward
        saved_probabilities.append(distribution.log_prob(action))
        saved_rewards.append(reward)

        # End episode
        if terminated:
            env.close()
            break

    # Compute discounted sum of rewards
    # ------------------------------------------

    # Current discounted reward
    discounted_reward = 0.0

    # List of all the discounted rewards, for each time step
    discounted_rewards = list()

    # ---> TODO: compute discounted rewards
    # The rewards are traversed in reverse order so that each return can be
    # computed from the next one: G_t = r_t + discount * G_{t+1}
    for r in saved_rewards[::-1]:
        discounted_reward = r + DISCOUNT_FACTOR * discounted_reward
        discounted_rewards.insert(0, discounted_reward)

    # Optionally normalize for stability purposes
    discounted_rewards = torch.tensor(discounted_rewards)
    mean, std = discounted_rewards.mean(), discounted_rewards.std()
    discounted_rewards = (discounted_rewards - mean) / (std + 1e-7)

    # Update policy parameters
    # ------------------------------------------

    # For each time step
    actor_loss = list()
    for log_prob, g in zip(saved_probabilities, discounted_rewards):

        # ---> TODO: compute policy loss
        time_step_actor_loss = -g * log_prob

        # Save it
        actor_loss.append(time_step_actor_loss)

    # Sum all the time step losses
    actor_loss = torch.cat(actor_loss).sum()

    # Reset gradients to 0.0
    actor_optimizer.zero_grad()

    # Compute the gradients of the loss (backpropagation)
    actor_loss.backward()

    # Update the policy parameters (gradient ascent)
    actor_optimizer.step()

    # Logging
    # ------------------------------------------

    # Episode total reward
    episode_total_reward = sum(saved_rewards)

    # Sliding window of the last 200 episode rewards. It is created once, on
    # the first iteration, so that it persists from one episode to the next
    if training_iteration == 0:
        recent_rewards = []
    recent_rewards.append(episode_total_reward)
    if len(recent_rewards) > 200:
        recent_rewards.pop(0)

    # Stopping criteria: the last 200 episodes all reached the maximum reward,
    # or the budget of 1000 training iterations is exhausted
    training_iteration += 1
    solved = len(recent_rewards) == 200 and min(recent_rewards) >= HORIZON
    stop_training = solved or training_iteration >= 1000

    # Save neural network periodically and when training stops
    log_frequency = 5
    if training_iteration % log_frequency == 0 or stop_training:
        torch.save(actor, actor_path)

    # Print results
    if training_iteration % 100 == 0:
        print("iteration {} - last reward: {:.2f}".format(
            training_iteration, episode_total_reward))

    # Stop the training
    if solved:
        print("Achieved sustained high performance - stopping training")
    if stop_training:
        break


ActorModelV0(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)
iteration 100 - last reward: 91.00
iteration 200 - last reward: 195.00
iteration 300 - last reward: 500.00
iteration 400 - last reward: 500.00
iteration 500 - last reward: 500.00
iteration 600 - last reward: 500.00
iteration 700 - last reward: 177.00
iteration 800 - last reward: 500.00
iteration 900 - last reward: 500.00
iteration 1000 - last reward: 500.00


In [None]:
# Script to run a deep neural network policy (after training)
# ------------------------------------------

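# Create environment
env = CartpoleEnvV0()
actor_path = "./actor_0.pt"

# Load the trained policy. The whole model object was saved with torch.save,
# so full unpickling is required (weights_only=False); only do this with
# files you trust
policy = torch.load(actor_path, weights_only=False)

# Testing mode
policy.eval()
print(policy)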

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Use the policy to generate the probabilities of each action (the
    # policy converts the NumPy state to a batched tensor itself)
    probabilities = policy(state)

    # Create distribution and sample action
    distribution = Categorical(probabilities)
    action = distribution.sample().item()

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Accumulate the reward and render the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

ActorModelV0(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)


total_reward = 500.0


## 4) Markov property (/5)

Usually, in reinforcement learning, we distinguish two objects: the *state*, which fully describes the environment, and the *observation*, which describes what an agent (i.e., a policy) can observe from the state.
So far, the state and the observation have been equal.
In this section, we work on a slightly different version of the environment where the state differs from the observation: the velocities are removed and are therefore not visible to the agent.
Below are: the new cartpole environment, a new actor and a new REINFORCE script.
1) Because the observation is then of size 2, the actor is redefined accordingly.
Run the REINFORCE algorithm again with the redefined environment and actor.
Does it converge?
Why? (/1)
2) Find a way to modify the states in the new environment, without using the velocities, so that the environment becomes Markovian again.
Update the actor network accordingly and train the model again with the REINFORCE algorithm.
Does it converge?
Why? (/4)

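1. No, the REINFORCE algorithm does not converge in this case. With only the position and the angle, the observation is no longer Markovian: the same observation can correspond to a pole falling to the left or to the right, so the policy cannot know which way to push.

2. We restore the Markov property by adding the previous position and angle to the observation, which becomes $(x_{t-1}, x_t, \theta_{t-1}, \theta_t)$. The velocities are not given explicitly, but the network can infer them from the difference between two consecutive values. The model now converges, but more slowly and after many more iterations.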
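
To see why two consecutive positions (or angles) carry the missing velocity information, here is a small finite-difference sketch. The time step of 0.02 seconds comes from the environment code; the function name `estimate_velocity` and the sample values are made up for illustration.

```python
TAU = 0.02  # seconds between state updates, as in the environment


def estimate_velocity(previous_value, current_value, tau=TAU):
    """Finite-difference estimate of a rate of change over one time step."""
    return (current_value - previous_value) / tau


# Made-up consecutive observations: positions in meters, angles in radians
cart_velocity = estimate_velocity(0.10, 0.12)
pole_velocity = estimate_velocity(0.03, 0.02)
print(cart_velocity, pole_velocity)
```

The network is not given this formula; it only receives the four values and has to learn an equivalent combination by itself, which may explain the slower convergence.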

In [None]:
"""Cartpole environment (modified)."""

class CartpoleEnvV1(gym.Env):
    """Cartpole environment."""

    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 60}

    def __init__(self, env_context=None, render_mode=None):
        """Initialize environment.

        Args:
            env_context (dict): environment configuration.
            render_mode (str): render mode.
        """
        # Variables
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = self.masspole + self.masscart
        self.length = 0.5  # Actually half the pole's length
        self.polemass_length = self.masspole * self.length
        self.force_mag = 10.0
        self.tau = 0.02  # Seconds between state updates
        self.kinematics_integrator = "euler"

        # Angle at which to fail the episode
        self.theta_threshold_radians = 12 * 2 * math.pi / 360
        self.x_threshold = 2.4

        # Angle limit set to 2 * theta_threshold_radians so failing observation
        # is still within bounds
        high = np.array([
            self.x_threshold * 2,
            np.finfo(np.float32).max,
            self.theta_threshold_radians * 2,
            np.finfo(np.float32).max,
        ], dtype=np.float32)

        # Action and observation (state) spaces
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)

        # Render mode
        self.render_mode = render_mode

        # Others
        self.screen_width = 600
        self.screen_height = 400
        self.screen = None
        self.clock = None
        self.isopen = True
        self.state = None

    def _process_state(self):
        """Process state before returning it.

        Returns:
            processed_state (numpy.array): processed state.
        """
        # Velocities are removed from the state
        # processed_state = np.array([self.state[0], self.state[2]])

        # ---> TODO: if no velocities, determine a new working state
        # Unused alternative: a stability index built from the failure
        # thresholds, penalizing large angles disproportionately
        # stability_index = (
        #     (abs(self.state[2]) / self.theta_threshold_radians)**2
        #     + abs(self.state[0]) / self.x_threshold)

        # Working state: (x_{t-1}, x_t, theta_{t-1}, theta_t). The previous
        # position and angle are approximated by stepping back one time step
        # from the current ones, using the simulator's internal velocities
        prev_x = self.state[0] - self.tau * self.state[1]
        prev_theta = self.state[2] - self.tau * self.state[3]

        # Stack previous and current values into the observation
        processed_state = np.array(
            [prev_x, self.state[0], prev_theta, self.state[2]],
            dtype=np.float32)

        return processed_state

    def reset(self, seed=None, options=None):
        """Reset the environment.

        Args:
            seed (int): seed for reproducibility.
            options (dict): additional information.

        Returns:
            state (numpy.array): the processed state.

            info (dict): auxiliary diagnostic information.
        """
        # Reset seed
        if seed is not None:
            self._np_random, seed = seeding.np_random(seed)

        # Current time step
        self._time_step = 0

        # Reset state
        self.state = self.np_random.uniform(low=-0.05, high=0.05, size=(4,))
        self.state = self.state.astype(np.float32)

        # ---> TODO: if no accelerations, determine a new working state
        #initial_theta = self.state[2]
        #initial_x = self.state[0]
        #initial_stability = (abs(initial_theta) / self.theta_threshold_radians)**2 + (abs(initial_x) / self.x_threshold)

        # Eventually render
        if self.render_mode == "human":
            self.render()

        return self._process_state(), {}

    def step(self, action):
        """Go from current step to next one.

        Args:
            action (int): action of the agent.

        Returns:
            state (numpy.array): state.

            reward (float): reward.

            terminated (bool): whether a terminal state is reached.

            truncated (bool): whether a truncation condition is reached.

            info (dict): auxiliary diagnostic information.
        """
        # Check if action is valid
        err_msg = f"{action!r} ({type(action)}) invalid"
        assert self.action_space.contains(action), err_msg
        assert self.state is not None, "Call reset before using step method."

        # Compute variables
        x_tmp = self.state
        x, x_dot, theta, theta_dot = x_tmp[0], x_tmp[1], x_tmp[2], x_tmp[3]
        force = self.force_mag if action == 1 else -self.force_mag
        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        # https://coneural.org/florian/papers/05_cart_pole.pdf
        m = self.polemass_length
        temp = force + m * theta_dot**2 * sintheta / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * temp)
        thetaacc /= 4.0 / 3.0 - self.masspole * costheta**2 / self.total_mass
        thetaacc /= self.length
        xacc = temp - m * thetaacc * costheta / self.total_mass

        # Update system
        if self.kinematics_integrator == "euler":
            x = x + self.tau * x_dot
            x_dot = x_dot + self.tau * xacc
            theta = theta + self.tau * theta_dot
            theta_dot = theta_dot + self.tau * thetaacc
        else:  # Semi-implicit euler
            x_dot = x_dot + self.tau * xacc
            x = x + self.tau * x_dot
            theta_dot = theta_dot + self.tau * thetaacc
            theta = theta + self.tau * theta_dot

        # ---> TODO: if no accelerations, determine a new working state

        # Full system state
        self.state = np.array([
            x,
            x_dot,
            theta,
            theta_dot,
        ], dtype=np.float32)

        # Reward is 1
        reward = 1.0

        # Increase time step
        self._time_step += 1

        # Check if episode is finished
        terminated = bool(
            x < -self.x_threshold
            or x > self.x_threshold
            or theta < -self.theta_threshold_radians
            or theta > self.theta_threshold_radians
            or self._time_step >= 500
        )

        # Eventually render
        if self.render_mode == "human":
            self.render()

        return self._process_state(), reward, terminated, False, {}

    def render(self):
        """Render environment.

        Note:
            Do not pay too much attention to this function. It is just to
            display a nice animation of the environment.
        """
        import pygame
        from pygame import gfxdraw

        # Initialize render mode if needed
        if self.render_mode is None:
            self.render_mode = "human"

        # Initialize objects
        if self.screen is None:
            pygame.init()
            if self.render_mode == "human":
                pygame.display.init()
                self.screen = pygame.display.set_mode(
                    (self.screen_width, self.screen_height))
            else:  # mode == "rgb_array"
                self.screen = pygame.Surface(
                    (self.screen_width, self.screen_height))

        # Initialize clock
        if self.clock is None:
            self.clock = pygame.time.Clock()

        # Objects
        world_width = self.x_threshold * 2
        scale = self.screen_width / world_width
        polewidth = 10.0
        polelen = scale * (2 * self.length)
        cartwidth = 50.0
        cartheight = 30.0

        # Get state
        if self.state is None:
            return None
        x = self.state

        # Get surface
        self.surf = pygame.Surface((self.screen_width, self.screen_height))
        self.surf.fill((255, 255, 255))

        # Computations
        l = -cartwidth / 2
        r = cartwidth / 2
        t = cartheight / 2
        b = -cartheight / 2
        axleoffset = cartheight / 4.0
        cartx = x[0] * scale + self.screen_width / 2.0  # MIDDLE OF CART
        carty = 100  # TOP OF CART
        cart_coords = [(l, b), (l, t), (r, t), (r, b)]
        cart_coords = [(c[0] + cartx, c[1] + carty) for c in cart_coords]
        gfxdraw.aapolygon(self.surf, cart_coords, (0, 0, 0))
        gfxdraw.filled_polygon(self.surf, cart_coords, (0, 0, 0))

        l, r, t, b = (
            -polewidth / 2,
            polewidth / 2,
            polelen - polewidth / 2,
            -polewidth / 2,
        )

        pole_coords = []
        for coord in [(l, b), (l, t), (r, t), (r, b)]:
            coord = pygame.math.Vector2(coord).rotate_rad(-x[2])
            coord = (coord[0] + cartx, coord[1] + carty + axleoffset)
            pole_coords.append(coord)
        gfxdraw.aapolygon(self.surf, pole_coords, (202, 152, 101))
        gfxdraw.filled_polygon(self.surf, pole_coords, (202, 152, 101))

        gfxdraw.aacircle(self.surf,
                         int(cartx),
                         int(carty + axleoffset),
                         int(polewidth / 2),
                         (129, 132, 203))
        gfxdraw.filled_circle(self.surf,
                              int(cartx),
                              int(carty + axleoffset),
                              int(polewidth / 2),
                              (129, 132, 203))

        gfxdraw.hline(self.surf, 0, self.screen_width, carty, (0, 0, 0))

        # Display
        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))

        # Human mode
        if self.render_mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        # RGB array mode
        elif self.render_mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)),
                axes=(1, 0, 2)
            )

    def close(self):
        """Close the environment.

        Note:
            Do not pay too much attention to this function. It is just to close
            the environment.
        """
        if self.screen is not None:
            import pygame
            pygame.display.quit()
            pygame.quit()
            self.isopen = False

In [None]:
"""Actor network (modified)."""

class ActorModelV1(nn.Module):
    """Deep neural network."""

    # By default, use CPU
    DEVICE = torch.device("cpu")

    def __init__(self):
        """Initialize model."""
        super(ActorModelV1, self).__init__()
        # ---> TODO: change input and output sizes depending on the environment
        input_size = 4
        nb_actions = 2

        # Build layer objects
        self.fc0 = nn.Linear(input_size, 128)
        self.fc1 = nn.Linear(128, 128)
        self.policy = nn.Linear(128, nb_actions)

    def _preprocessor(self, state):
        """Preprocessor function.

        Args:
            state (numpy.array): environment state.

        Returns:
            x (torch.tensor): preprocessed state.
        """
        # Add batch dimension
        x = np.expand_dims(state, 0)

        # Transform to torch.tensor
        x = torch.from_numpy(x).float().to(self.DEVICE)

        return x

    def forward(self, x):
        """Forward pass.

        Args:
            x (numpy.array): environment state.

        Returns:
            actions_prob (torch.tensor): list with the probability of each
                action over the action space.
        """
        # Preprocessor
        x = self._preprocessor(x)

        # Input layer
        x = F.relu(self.fc0(x))

        # Middle layers
        x = F.relu(self.fc1(x))

        # Policy
        action_prob = F.softmax(self.policy(x), dim=-1)

        return action_prob

In [None]:
# Script to train a policy with the REINFORCE algorithm (modified)
# ------------------------------------------

# Maximum environment length
HORIZON = 500

# ---> TODO: change the discount factor to solve the problem
DISCOUNT_FACTOR = 0.99

# ---> TODO: change the learning rate to solve the problem
LEARNING_RATE = 0.001

# Create environment and policy
env = CartpoleEnvV1()
actor = ActorModelV1()
actor_path = "./actor_1.pt"

# Training mode
actor.train()
print(actor)

# Create optimizer with the policy parameters
actor_optimizer = optim.Adam(actor.parameters(), lr=LEARNING_RATE)

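# ---> TODO: when do we stop the training?
# Training stops once the rewards stored in `recent_rewards` are all at
# least 400, or after 10000 training iterations

# Rewards of the most recently logged episodes (used by the stopping criterion)
recent_rewards = []

# Run episodes until a stopping condition is met
training_iteration = 0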
while True:

    # Experience
    # ------------------------------------------

    # Reset the environment
    state, _ = env.reset()

    # During experience, we will save:
    # - the probability of the chosen action at each time step pi(at|st)
    # - the rewards received at each time step ri
    saved_probabilities = list()
    saved_rewards = list()

    # Prevent infinite loop
    for t in range(HORIZON + 1):

        # Use the policy to generate the probabilities of each action
        probabilities = actor(state)

        # Create a categorical distribution over the list of probabilities
        # of actions and sample an action from it
        distribution = Categorical(probabilities)
        action = distribution.sample()

        # Take the action
        state, reward, terminated, _, _ = env.step(action.item())

        # Save the probability of the chosen action and the reward
        saved_probabilities.append(probabilities[0][action])
        saved_rewards.append(reward)

        # End episode
        if terminated:
            env.close()
            break

    # Compute discounted sum of rewards
    # ------------------------------------------

    # Current discounted reward
    discounted_reward = 0.0

    # List of all the discounted rewards, for each time step
    discounted_rewards = list()

    # ---> TODO: compute discounted rewards
    for r in saved_rewards[::-1]:
        discounted_reward = r + DISCOUNT_FACTOR * discounted_reward
        discounted_rewards.insert(0, discounted_reward)

    # Eventually normalize for stability purposes
    discounted_rewards = torch.tensor(discounted_rewards)
    mean, std = discounted_rewards.mean(), discounted_rewards.std()
    discounted_rewards = (discounted_rewards - mean) / (std + 1e-7)

    # Update policy parameters
    # ------------------------------------------

    # For each time step
    actor_loss = list()
    for p, g in zip(saved_probabilities, discounted_rewards):

        # ---> TODO: compute policy loss
        time_step_actor_loss = -g * torch.log(p)

        # Save it
        actor_loss.append(time_step_actor_loss)

    # Sum all the time step losses
    actor_loss = torch.cat(actor_loss).sum()

    # Reset gradients to 0.0
    actor_optimizer.zero_grad()

    # Compute the gradients of the loss (backpropagation)
    actor_loss.backward()

    # Update the policy parameters (gradient ascent)
    actor_optimizer.step()

    # Logging
    # ------------------------------------------

    # Episode total reward
    episode_total_reward = sum(saved_rewards)

    # ---> TODO: when do we stop the training?

    # Log results
    log_frequency = 5
    training_iteration += 1
    if training_iteration % log_frequency == 0:

        # Stop after 10000 training iterations at most
        if training_iteration > 10000:
            break

        # Save neural network
        torch.save(actor, actor_path)

        # Keep the rewards of the last 200 logged episodes
        recent_rewards.append(episode_total_reward)
        if len(recent_rewards) > 200:
            recent_rewards.pop(0)

        # Stopping criterion: the last 200 logged episodes (one every
        # log_frequency iterations) all reached a reward of at least 400
        if (len(recent_rewards) == 200
                and all(r >= 400 for r in recent_rewards)):
            print("Achieved sustained high performance - stopping training")
            break

        # Print results
        if training_iteration % 100 == 0:
            print("iteration {} - last reward: {:.2f}".format(
                training_iteration, episode_total_reward))

ActorModelV1(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)
iteration 100 - last reward: 19.00
iteration 200 - last reward: 13.00
iteration 300 - last reward: 16.00
iteration 400 - last reward: 59.00
iteration 500 - last reward: 39.00
iteration 600 - last reward: 48.00
iteration 700 - last reward: 32.00
iteration 800 - last reward: 26.00
iteration 900 - last reward: 41.00
iteration 1000 - last reward: 12.00
iteration 1100 - last reward: 52.00
iteration 1200 - last reward: 31.00
iteration 1300 - last reward: 23.00
iteration 1400 - last reward: 13.00
iteration 1500 - last reward: 45.00
iteration 1600 - last reward: 38.00
iteration 1700 - last reward: 36.00
iteration 1800 - last reward: 30.00
iteration 1900 - last reward: 88.00
iteration 2000 - last reward: 44.00
iteration 2100 - last reward: 10.00
iteration 2200 - last reward: 45.00
iteration 2300 -

In [None]:
# Script to run a deep neural network policy (after training, modified)
# ------------------------------------------

# Create environment and policy
env = CartpoleEnvV1()
policy = ActorModelV1()
actor_path = "./actor_1.pt"

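# Load the trained policy (the whole module was saved with torch.save,
# so it must be unpickled with weights_only=False)
policy = torch.load(actor_path, weights_only=False)

# Testing mode
policy.eval()
print(policy)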

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Use the policy to generate the probabilities of each action
    probabilities = policy(state)

    # ---> TODO: how to select an action
    distribution = Categorical(probabilities)
    action = distribution.sample().item()

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Render (or not) the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

ActorModelV1(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)


total_reward = 500.0


## 5) Open questions (/5)

You can choose to work on one or several of the following subjects.
You are also encouraged to work on other subjects, if you have any ideas.
Do not hesitate to ask for validation of a subject not listed here.

### Different environments (moderately easy)
You can find other reinforcement learning problems with their corresponding environments and adapt your REINFORCE training script and your actor model to solve them.
You could also create your own simple reinforcement learning problem, determine the states, the actions and the rewards, and code your own environment to solve it.
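
As an illustration of what "coding your own environment" involves, here is a minimal sketch of a toy problem that follows the same `reset` / `step` return convention as the Cartpole class in this notebook. The corridor problem, the class name `CorridorEnv`, and its reward values are hypothetical choices made for this example; they are not part of the project. The class does not inherit from `gym.Env`, so that the sketch stays self-contained.

```python
import random


class CorridorEnv:
    """Hypothetical toy problem: walk right along a corridor to reach a goal."""

    def __init__(self, length=5, horizon=20):
        self.length = length      # index of the goal cell
        self.horizon = horizon    # maximum number of steps per episode
        self.position = 0
        self._time_step = 0

    def reset(self, seed=None, options=None):
        # The start is deterministic, so the seed is unused in this sketch
        self.position = 0
        self._time_step = 0
        return self.position, {}

    def step(self, action):
        # Action 1 moves right, action 0 moves left (the cart cannot go below cell 0)
        assert action in (0, 1), "invalid action"
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        self._time_step += 1

        # Reaching the goal ends the episode with a positive reward,
        # every other step costs a small penalty
        terminated = self.position >= self.length
        reward = 1.0 if terminated else -0.1

        # Running out of time is reported separately as a truncation
        truncated = self._time_step >= self.horizon and not terminated
        return self.position, reward, terminated, truncated, {}


# Roll out one episode with a uniformly random policy
env = CorridorEnv()
state, _ = env.reset()
rng = random.Random(0)
total_reward = 0.0
done = False
while not done:
    action = rng.choice([0, 1])
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
```

To train on such an environment with the REINFORCE script, the integer state would still have to be converted to a `numpy` array, and the input size of the actor changed accordingly.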

**Your answer**: TODO.

### REINFORCE with baseline (moderately difficult)
To reduce the variance of the REINFORCE method, we can subtract a baseline $v(s_t)$ from $G_t$,
with $v(s)$ being the estimate of the state value given by another neural network, called the critic.
Based on the actor network, build the network for the critic; the code is almost identical (only the outputs differ).
Then, based on the REINFORCE training script, build the REINFORCE with baseline training script.
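
As a reminder of what the baseline changes, the update can be sketched as follows, with $\pi_\theta$ denoting the actor and $v_w$ the critic (the parameter symbols $\theta$ and $w$ are notation assumed here, not taken from the project statement):

$$
\nabla_\theta J(\theta) \approx \sum_{t} \big(G_t - v_w(s_t)\big)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
L_{\text{critic}}(w) = \sum_{t} \big(G_t - v_w(s_t)\big)^2
$$

Subtracting $v_w(s_t)$ does not bias the gradient estimate, because the baseline does not depend on the action $a_t$; it only reduces its variance.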

**Your answer**: TODO.

In [None]:
# Script to train a policy with the REINFORCE with baseline algorithm
# ------------------------------------------

# Maximum environment length
HORIZON = 500

# ---> TODO: change the discount factor to solve the problem
DISCOUNT_FACTOR = 0.1

# ---> TODO: change the learning rate to solve the problem
LEARNING_RATE = 0.5

# Create environment, policy and critic
env = CartpoleEnvV0()
actor = ActorModelV0()
critic = CriticModel()
actor_path = "./actor_2.pt"
critic_path = "./critic_2.pt"

# Training mode
actor.train()
critic.train()
print(actor)
print(critic)

# Create optimizer with the policy parameters
actor_optimizer = optim.Adam(actor.parameters(), lr=LEARNING_RATE)

# Create optimizer with the critic parameters
critic_optimizer = optim.Adam(critic.parameters(), lr=LEARNING_RATE)

# ---> TODO: based on the REINFORCE script, create the REINFORCE with
# baseline script

NameError: name 'CriticModel' is not defined

In [None]:
# Script to test the policy after training

# Create policy
policy = ActorModelV0()
policy.eval()
print(policy)

# Load the trained policy
policy = torch.load("./actor_2.pt")

# Create environment
env = CartpoleEnvV0()

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Use the policy to generate the probabilities of each action
    probabilities = policy(state)

    # ---> TODO: how to select an action
    action = 0

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Render (or not) the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

### Actor-critic algorithm (moderately difficult)

I built a critic network with the same architecture as the actor network. The key difference is the output layer: the critic outputs a single value V(s), the estimated value of the state.

The critic is trained with a mean squared error loss between its predictions and the discounted returns of the episode, so that it learns to predict the expected return. This training runs alongside the actor training, and the critic's predictions are used to compute the advantages for the actor: advantage = return - value prediction.
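
The computation described above can be sketched on a tiny hand-made episode. This is only an illustration in plain Python: the rewards, the discount factor and the critic predictions in `values` are made-up numbers, and the helper names `discounted_returns` and `advantages_from` are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every time step."""
    returns = []
    running = 0.0
    # Walk backwards through the episode to accumulate the discounted sum
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages_from(returns, values):
    """Advantage = return - value prediction, for every time step."""
    return [g - v for g, v in zip(returns, values)]


# Made-up episode of three steps
rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 0.5]  # hypothetical critic predictions V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
advantages = advantages_from(returns, values)

# Critic loss: sum of squared errors between returns and predictions
critic_loss = sum((g - v) ** 2 for g, v in zip(returns, values))

print(returns)      # [1.75, 1.5, 1.0]
print(advantages)   # [-0.25, 0.0, 0.5]
print(critic_loss)  # 0.3125
```

A negative advantage (first step) lowers the probability of the action taken, a positive one (last step) raises it.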


In [None]:
# ---> TODO: based on the actor network, build a critic network

class CriticModel(nn.Module):
    """Critic network to estimate the value function V(s)."""

    # By default, use CPU
    DEVICE = torch.device("cpu")

    def __init__(self):
        """Initialize model."""
        super(CriticModel, self).__init__()
        # Same input size as actor (state space dimension)
        input_size = 4
        # Output size is 1 since we're estimating a single value V(s)
        output_size = 1

        # Build layer objects - using the same architecture as the actor
        self.fc0 = nn.Linear(input_size, 128)
        self.fc1 = nn.Linear(128, 128)
        self.value = nn.Linear(128, output_size)

    def _preprocessor(self, state):
        """Preprocessor function.

        Args:
            state (numpy.array): environment state.

        Returns:
            x (torch.tensor): preprocessed state.
        """
        # Add batch dimension if necessary
        x = np.expand_dims(state, 0)

        # Transform to torch.tensor
        x = torch.from_numpy(x).float().to(self.DEVICE)

        return x

    def forward(self, x):
        """Forward pass.

        Args:
            x (numpy.array): environment state.

        Returns:
            state_value (torch.tensor): estimated value V(s) for the input state.
        """
        # Preprocessor
        x = self._preprocessor(x)

        # Input layer
        x = F.relu(self.fc0(x))

        # Middle layers
        x = F.relu(self.fc1(x))

        # Value estimation (no activation function needed since we're predicting a scalar value)
        state_value = self.value(x)

        return state_value

In [None]:
# Script to train a policy with the actor-critic algorithm
# ------------------------------------------

# Maximum environment length
HORIZON = 500

# ---> TODO: change the discount factor to solve the problem
DISCOUNT_FACTOR = 0.99

# ---> TODO: change the learning rate to solve the problem
LEARNING_RATE = 0.001

# Create environment, policy and critic
env = CartpoleEnvV0()
actor = ActorModelV0()
critic = CriticModel()
actor_path = "./actor_3.pt"
critic_path = "./critic_3.pt"

# Training mode
actor.train()
critic.train()
print(actor)
print(critic)

# Create optimizer with the policy parameters
actor_optimizer = optim.Adam(actor.parameters(), lr=LEARNING_RATE)

# Create optimizer with the critic parameters
critic_optimizer = optim.Adam(critic.parameters(), lr=LEARNING_RATE)

# ---> TODO: based on the REINFORCE script, create the actor-critic script
training_iteration = 0
recent_rewards = []

while True:
  state, _ = env.reset()

  # During experience, we will save:
  # - the probability of the chosen action at each time step pi(at|st)
  # - the rewards received at each time step ri
  # - the state values V(st) predicted by the critic
  saved_probabilities = []
  saved_rewards = []
  saved_values = []

  for t in range(HORIZON + 1):
    probabilities = actor(state)
    state_value = critic(state)

    distribution = Categorical(probabilities)
    action = distribution.sample()

    next_state, reward, terminated, _, _ = env.step(action.item())

    saved_probabilities.append(probabilities[0][action])
    saved_rewards.append(reward)
    saved_values.append(state_value)

    # Update state
    state = next_state

    if terminated:
        break

  advantages = []
  returns = []
  discounted_return = 0

  #computing returns in reverse order
  for r, v in zip(reversed(saved_rewards), reversed(saved_values)):
    discounted_return = r + DISCOUNT_FACTOR * discounted_return
    advantage = discounted_return - v.item()
    returns.insert(0, discounted_return)
    advantages.insert(0, advantage)

  advantages = torch.tensor(advantages)
  returns = torch.tensor(returns)

  # Training was very unstable without normalization: normalizing the
  # advantages (a common stability trick) made it work
  advantages = (advantages - advantages.mean())/ (advantages.std() + 1e-8)

  #actor_loss computation
  actor_loss = []
  for prob, advantage in zip(saved_probabilities, advantages):
    time_step_actor_loss = -advantage * torch.log(prob)
    actor_loss.append(time_step_actor_loss)
  actor_loss = torch.stack(actor_loss).sum()

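  # critic loss: squared error between predicted values and returns
  critic_loss = []
  for v, r in zip(saved_values, returns):
    # v has shape (1, 1) and r is a scalar: squeeze v so that both shapes match
    time_step_critic_loss = F.mse_loss(v.squeeze(), r)
    critic_loss.append(time_step_critic_loss)
  critic_loss = torch.stack(critic_loss).sum()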

  #update actor and critic
  actor_optimizer.zero_grad()
  actor_loss.backward()
  actor_optimizer.step()

  critic_optimizer.zero_grad()
  critic_loss.backward()
  critic_optimizer.step()

  episode_total_reward = sum(saved_rewards)
  training_iteration += 1

  # Keep a sliding window with the rewards of the last 200 episodes
  recent_rewards.append(episode_total_reward)
  if len(recent_rewards) > 200:
    recent_rewards.pop(0)

  # Stopping criterion: the last 200 episodes all reached at least 400
  solved = (len(recent_rewards) == 200 and
            all(r >= 400 for r in recent_rewards))

  # Save networks and print results every log_frequency iterations
  log_frequency = 100
  if training_iteration % log_frequency == 0 or solved:
    torch.save(actor, actor_path)
    torch.save(critic, critic_path)
    print(f"Iteration {training_iteration} - Last reward: {episode_total_reward:.2f}")
    print(f"Actor loss: {actor_loss.item():.4f}, Critic loss: {critic_loss.item():.4f}")

  if solved:
    print("Achieved sustained high performance - stopping training")
    break

  # Safety limit on the number of training iterations
  if training_iteration >= 10000:
    break

env.close()


ActorModelV0(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)
CriticModel(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (value): Linear(in_features=128, out_features=1, bias=True)
)


Iteration 100 - Last reward: 24.00
Actor loss: -0.3491, Critic loss: 1167.0490
Iteration 200 - Last reward: 172.00
Actor loss: -3.9792, Critic loss: 96534.5312
Iteration 300 - Last reward: 500.00
Actor loss: -10.3636, Critic loss: 123325.5625
Iteration 400 - Last reward: 500.00
Actor loss: -10.7415, Critic loss: 245413.6562
Iteration 500 - Last reward: 500.00
Actor loss: 7.2687, Critic loss: 181930.6875


KeyboardInterrupt: 

In [None]:
# Script to test the policy after training

# Create policy
policy = ActorModelV0()
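print(policy)

# Load the trained policy (the whole module was saved with torch.save,
# so it must be unpickled with weights_only=False)
policy = torch.load("./actor_3.pt", weights_only=False)
policy.eval()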

# Create environment
env = CartpoleEnvV0()

# Reset it
total_reward = 0.0
state, _ = env.reset(seed=None)

# While the episode is not finished
terminated = False
while not terminated:

    # Use the policy to generate the probabilities of each action
    probabilities = policy(state)

    # ---> TODO: how to select an action
    distribution = Categorical(probabilities)
    action = distribution.sample().item()

    # One step forward
    state, reward, terminated, _, _ = env.step(action)

    # Render (or not) the environment
    total_reward += reward
    env.render()

# Print reward
env.close()
print("total_reward = {}".format(total_reward))

ActorModelV0(
  (fc0): Linear(in_features=4, out_features=128, bias=True)
  (fc1): Linear(in_features=128, out_features=128, bias=True)
  (policy): Linear(in_features=128, out_features=2, bias=True)
)


total_reward = 500.0


### Continuous actions (difficult)
In this project, we considered a discrete action space.
How could we use policy gradient methods for continuous action spaces?
You can give your ideas, you can code them, and ultimately, you can even find an environment with a continuous action space and try to solve it.

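The output of the network should parameterize a continuous random variable from which the action is sampled. For example, with a normal distribution, the network would have 2 output neurons representing the mean and the variance of the distribution. However, the normal distribution has a drawback here: its support is unbounded, whereas the actions of an environment are usually bounded, so the sampled actions have to be clipped or squashed, or a bounded distribution (for example a Beta distribution) has to be used instead.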
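
To make the idea concrete, here is a minimal sketch of a Gaussian policy for a single one-dimensional action, written in plain Python so that the formulas are visible. The values of `mean`, `log_std` and `G`, the action range [-1, 1] and the function names are assumptions made for the illustration; in a real implementation the mean and the log standard deviation would be the two outputs of the actor network, and the gradients would be obtained by backpropagation.

```python
import math
import random


def gaussian_log_prob(action, mean, log_std):
    """Log-density of `action` under the normal distribution N(mean, std^2)."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)


def score(action, mean, log_std):
    """Gradient of the log-density with respect to (mean, log_std)."""
    std = math.exp(log_std)
    z = (action - mean) / std
    return z / std, z * z - 1.0


# Hypothetical policy outputs for one state
mean, log_std = 0.2, math.log(0.5)

# Sample a continuous action from the policy
rng = random.Random(0)
raw_action = rng.gauss(mean, math.exp(log_std))

# The normal distribution is unbounded: clip before sending the action to an
# environment whose (assumed) action range is [-1, 1]
env_action = max(-1.0, min(1.0, raw_action))

# REINFORCE-style gradient of the loss -G * log pi(a|s), for a made-up return G
# (the log-probability is evaluated on the unclipped sample)
G = 1.5
d_mean, d_log_std = score(raw_action, mean, log_std)
grad_loss = (-G * d_mean, -G * d_log_std)
```

Predicting the log standard deviation rather than the variance keeps the standard deviation positive without any constraint on the network output.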

**Your answer**: TODO.