# A2C Tutorial Notebook

This notebook is here to guide you through the basics of the frameworks necessary for you to do well on your CS456 mini-project.

In [2]:
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2



## Gymnasium environments

One of the most widely used environment frameworks in RL research is [Gymnasium](https://gymnasium.farama.org/).
It provides standardized environments covering a wide range of difficulties and setups, which makes them well suited for benchmarking RL and deep RL algorithms.

The main structure is simple. First, we need to instantiate our environment. We will use an existing one, but you could also implement the same interface to design your own environment.

Let's directly work with the CartPole environment that will be used in the project. 

_PS: If you're more curious, feel free to browse the large list available on their website!_

In [3]:
env = gym.make('CartPole-v1')

The environment contains an action space and an observation (state) space. Let's see what these look like.

In [5]:
print(f"Action space: {env.action_space}")
print(f"Observation space: {env.observation_space}")

Action space: Discrete(2)
Observation space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)


In [6]:
print(f"Number of actions available: {env.action_space.n}")
print(f"Observation shape: {env.observation_space.shape}")

Number of actions available: 2
Observation shape: (4,)


As we can see, the action space of this environment is discrete and contains 2 possible actions: push the cart to the left (`0`) or to the right (`1`).

The observation space has a dimension of 4, and you can find what each part represents [here](https://gymnasium.farama.org/environments/classic_control/cart_pole/#observation-space).
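As a small illustration, the four components can be given names. The labels below follow the CartPole documentation linked above; `describe_observation` is a hypothetical helper written for this notebook, not part of Gymnasium, and the example values are made up.

```python
# Component names, in the order used by CartPole's observation vector.
OBSERVATION_LABELS = (
    "cart position",
    "cart velocity",
    "pole angle",
    "pole angular velocity",
)


def describe_observation(observation):
    """Map each entry of a 4-dimensional CartPole observation to its name."""
    if len(observation) != len(OBSERVATION_LABELS):
        raise ValueError("expected an observation with 4 entries")
    return {
        label: float(value)
        for label, value in zip(OBSERVATION_LABELS, observation)
    }


# Example with made-up values (any array-like of length 4 works).
for name, value in describe_observation([0.01, -0.02, 0.03, 0.04]).items():
    print(f"{name}: {value:+.3f}")
```

The same helper should also accept the NumPy arrays returned by `env.reset()` and `env.step()`.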

Before taking actions, the environment must be reset (i.e., initialized to a starting state).
 **Note: this should be done every time the environment has to be restarted, i.e., at the end of any episode.**

In [8]:
# the second return value is an info dictionary, but it doesn't contain anything in this environment
starting_state, _ = env.reset() 

print(f"Starting state: {starting_state}")

Starting state: [-0.02279779  0.00576854  0.0082887  -0.01319266]


Now that we know what the actions look like and that the environment is ready, we can take actions inside it. This is done using the `env.step` method, which takes an action as input and returns multiple values. More details on each of them can be found [here](https://gymnasium.farama.org/api/env/#gymnasium.Env.step).

In the project, you will have an agent that will choose an action (based on the policy learned) given the current state. However, for now, we can simply sample actions at random using `action_space.sample()`.

In [9]:
action = env.action_space.sample()
print(f"Sampled action: {action}")
next_state, reward, terminated, truncated, _ = env.step(action) # again, the last return value is an empty info object

print(f"Next state: {next_state}")
print(f"Reward: {reward}")
print(f"Terminated: {terminated}")
print(f"Truncated: {truncated}")

Sampled action: 1
Next state: [-0.02268242  0.20077065  0.00802485 -0.3032489 ]
Reward: 1.0
Terminated: False
Truncated: False


The `terminated` and `truncated` variables represent the two ways in which an episode can end.
`terminated` indicates that an MDP terminal state was reached: all subsequent rewards are 0, even if the horizon is longer or infinite.
`truncated` indicates an artificial cutoff of the trajectory (for example, a time limit), used so that episodes with a potentially infinite horizon do not run forever.
Therefore, when computing returns, you should bootstrap from the value function of the next state unless `terminated` is true; a step that is only `truncated` should still be bootstrapped.
However, you should use both flags to decide when to reset the environment:
```
done = terminated or truncated
```
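To make the bootstrapping rule concrete, here is a minimal sketch of a one-step target. The function name `td_target`, the discount factor `gamma`, and the argument `next_value` (standing in for a critic's estimate of the next state's value) are assumptions introduced for illustration; your project code may organize this differently.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step bootstrapped target: r + gamma * V(s'), or just r if s' is terminal.

    `next_value` is a hypothetical value estimate V(s') for the next state.
    `truncated` is deliberately not an argument: a truncated step still bootstraps.
    """
    # After a true terminal state no further reward can be collected,
    # so the bootstrap term is dropped.
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


# Terminal step: the target is the immediate reward only.
print(td_target(reward=1.0, next_value=10.0, terminated=True))

# Non-terminal (or merely truncated) step: the value estimate is included.
print(td_target(reward=1.0, next_value=10.0, terminated=False))
```

Passing only `terminated` keeps the two roles separate: `terminated` decides whether to bootstrap, while `done = terminated or truncated` decides when to reset.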

We now have all the pieces necessary to run a full episode!

In [14]:
done = False
state, _ = env.reset()
episode_reward = 0

while not done:
    action = env.action_space.sample()
    next_state, reward, terminated, truncated, _ = env.step(action)

    episode_reward += reward

    state = next_state
    done = terminated or truncated

print(f"Episode reward after taking random actions: {episode_reward}")

Episode reward after taking random actions: 48.0
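The reward of a single random episode varies a lot from run to run, so a more stable baseline is the average over several episodes. The sketch below wraps the loop above in two hypothetical helpers, `run_random_episode` and `average_random_return`; they only assume that `env` follows the Gymnasium interface used in this notebook (`reset`, `step`, `action_space.sample`).

```python
def run_random_episode(env):
    """Run one episode with uniformly random actions and return its total reward."""
    env.reset()
    episode_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()
        _, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        # Either flag ends the episode and calls for a reset.
        done = terminated or truncated
    return episode_reward


def average_random_return(env, n_episodes=20):
    """Average the total reward of `n_episodes` random episodes."""
    rewards = [run_random_episode(env) for _ in range(n_episodes)]
    return sum(rewards) / len(rewards)
```

With the environment created earlier, calling `average_random_return(env, n_episodes=20)` should give a less noisy estimate of random performance than a single episode.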


Your goal in the project will be to code an agent that beats this random baseline.