<a href="https://colab.research.google.com/github/FernandaHinze/DynamicProgramming/blob/main/Copy_of_Lab_1_1_FER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab session 1.1



In this session we will explore the OpenAI Gym framework.

We start by installing the necessary packages:


In [1]:
# Remove " > /dev/null 2>&1" from a line to see its full output
!pip install gym pyvirtualdisplay #> /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
# gym.wrappers is a submodule of gym, not a separate package, so it needs no install of its own
!pip install "gym[toy_text]" > /dev/null 2>&1 # https://www.gymlibrary.dev/environments/toy_text/

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0


Now we import some of these libraries:

In [2]:
# Libraries needed to create the virtual display and record the video
import gym
from gym import logger as gymlogger
from gym.wrappers import RecordVideo
gymlogger.set_level(40) #error only
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7f66384d42b0>

In [3]:
"""
Utility functions to record a video of a gym environment and display it.
To enable video recording, wrap the environment: env = wrap_env(env)
"""
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  # Record episodes of env as .mp4 files in ./video
  env = RecordVideo(env, './video')
  return env

# Getting started with OpenAI Gym

In supervised learning, different methods can be evaluated on static data sets. In RL, however, algorithms must be tested in interactive (dynamic) environments. This is where OpenAI Gym comes in.

[OpenAI Gym](https://www.gymlibrary.dev) is a toolkit for comparing RL algorithms. It contains a wide variety of environments that you can train your agents on, and it is often used for benchmarking new methods in the RL research literature. 
There are also [leaderboards](https://github.com/openai/gym/wiki/Leaderboard) for different Gym environments, showing which methods have been most successful so far.

In the assignments for this module we will make use of OpenAI Gym.

**NOTE:** OpenAI no longer maintains Gym or fixes its bugs, but most of the examples we will use remain stable and usable. For newer implementations, [Gymnasium](https://gymnasium.farama.org) is the currently maintained alternative. Unfortunately, some interesting applications (like the ones we will see in the second part of the lab) cannot be implemented there directly.

To test your installation of OpenAI Gym and learn about basic usage, we will look at the relatively simple *Taxi environment*.

# [Taxi environment](https://www.gymlibrary.dev/environments/toy_text/taxi/)

In this environment a taxi driver has to pick up a passenger from one of 4 different locations (marked as yellow, green, blue and red), and then drop them off at a different location where the hotel is located.




<img src= 'https://www.gymlibrary.dev/_images/taxi.gif'>



In [4]:
env_taxi = wrap_env(gym.make('Taxi-v3'))
#if you don't need the video, you can just use env_taxi = gym.make('Taxi-v3')
state = env_taxi.reset() # create random starting point
new_step_api=True # no effect: this local variable is never passed to gym, so the line can be deleted
print('Initial state:', state)


Initial state: 267


The methods used above are:
* `make()`: Creates a gym environment object. In this case we use the Taxi environment.
* `reset()`: Resets the environment to an initial state and returns that state. 
In the case of the Taxi environment, the initial state is chosen randomly, so it will generally differ between calls to `env_taxi.reset()`.


In the rendered environment, the filled square represents the taxi, the letters (R, G, Y and B) represent possible pickup and destination locations, and `|` represents a wall. The blue letter is the passenger's location, and the purple letter is the destination.

Next we take a look at the state space $\mathcal{S}$ (all possible states) and action space $\mathcal{A}$ (all possible actions). 

In [5]:
print("State space:", env_taxi.observation_space)
print("Action space:", env_taxi.action_space)

State space: Discrete(500)
Action space: Discrete(6)


__State space__: We see that the state space contains 500 discrete states. Each state corresponds to a combination of the taxi's position (25 possibilities), the passenger's position (5 possibilities, including already picked up) and the destination (4 possibilities). Hence, there are $25 \times 5 \times 4 = 500$ possible states.
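
To make the count concrete, the sketch below unpacks a state index into its four components. It assumes the encoding described in the Gym Taxi documentation, `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`; the helper names `encode_state` and `decode_state` are ours and are not part of gym.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one index in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_state(state):
    """Invert encode_state (assumed encoding, see the text above)."""
    state, destination = divmod(state, 4)          # 4 possible destinations
    state, passenger_location = divmod(state, 5)   # 4 locations + "in the taxi"
    taxi_row, taxi_col = divmod(state, 5)          # 5 x 5 grid, row 0 at the top
    return taxi_row, taxi_col, passenger_location, destination


print(decode_state(267))          # (2, 3, 1, 3)
print(encode_state(1, 3, 1, 3))   # 167: the same state with the taxi one row further north
```

Under this assumption, the initial state 267 printed earlier places the taxi in row 2, column 3, with the passenger waiting at location 1 and the destination at location 3.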

__Action space__: The six discrete actions correspond to: `0: south, 1: north, 2: east, 3: west, 4: pickup, 5: dropoff`.

***Remark***: You may have noticed that gym uses `observation_space` instead of state space. For the purpose of this lab, the state space is the same as the observation space (fully observable case). However, in some RL problems the full state cannot be observed, so the space of possible states may not be the same as the space of possible observations. For example, the complete state of an [inverted pendulum](https://www.gymlibrary.dev/environments/classic_control/pendulum/) consists of both the angle and velocity, but often only the angle is measured directly.

We next see how the agent can interact with the environment.

In [7]:
new_state, reward, done, info = env_taxi.step(1) # Take action 1 (north)
print("New state:", new_state)
print("Reward:", reward)
print("Done:", done)
print("Info:", info)

New state: 167
Reward: -1
Done: False
Info: {'prob': 1.0, 'action_mask': array([1, 1, 1, 1, 0, 0], dtype=int8)}


If the move was possible, the taxi should now have moved one step north (if the taxi started in the top row, it will not have moved). The `step` function returns the following information:
* __New state__: The state after the action is taken.
* __Reward__: The immediate reward. In the Taxi environment the reward for an illegal "pickup" or "dropoff" is -10, successfully delivering the passenger gives +20, and any other action gives -1.
* __Done__: Whether the episode has ended. In the Taxi environment this is `False` until the passenger is successfully dropped off at their destination or the number of actions taken reaches 200.
* __Info__: Additional information, mainly used for debugging.

The goal of the agent is thus to deliver the passenger to their destination in as few steps as possible. If the 200-action limit is reached without a successful drop-off, the agent has failed.
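
The four-value return described above is what the gym version used in this notebook produces. Gymnasium (and gym from version 0.26) instead return five values from `step`, splitting `done` into `terminated` and `truncated`. The sketch below normalizes both shapes so the rest of the loop can stay unchanged; `unpack_step` is a hypothetical helper of ours, not a gym function.

```python
def unpack_step(result):
    """Return (state, reward, done, info) from either step() return shape."""
    if len(result) == 5:
        # Newer API: (state, reward, terminated, truncated, info)
        state, reward, terminated, truncated, info = result
        return state, reward, terminated or truncated, info
    # Older API, as in this notebook: (state, reward, done, info)
    state, reward, done, info = result
    return state, reward, done, info


# Illustrative tuples standing in for env_taxi.step(action) results
print(unpack_step((167, -1, False, {})))         # (167, -1, False, {})
print(unpack_step((167, -1, False, True, {})))   # (167, -1, True, {})
```

In a notebook cell it would be used as `new_state, reward, done, info = unpack_step(env_taxi.step(1))`.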

One (quite bad) strategy for the taxi problem is to take a random action every time. Inside a gym-environment this can be done using `env.action_space.sample()`, which samples a random action from the action space. 

Look through the following loop and make sure that you understand what's going on. (Here we use `time.sleep()` to pause between actions.)

In [9]:
import time
env_taxi.reset() 
# reset gives a new random starting point. If you want to keep using the same starting point,
# use the command env_taxi.seed(<number here>)
time_step = 0
total_reward = 0
time_limit = 19 # stop the loop after 20 steps; the environment would end the episode at time_step=200 anyway
done = False

while not done:
    
    action = env_taxi.action_space.sample()
    ## FER: samples random action from the action space
    state, reward, done, info = env_taxi.step(action)
    ## FER: agent takes one step, information about that step is returned (new state, reward and done)
    total_reward += reward
    ## FER: update the total reward of the agent
    time_step += 1
    ## FER: update the time
    
    print("Time step:", time_step)
    print("Reward:", reward)
    print("Total reward:", total_reward)
    time.sleep(.01)
    if time_step > time_limit:
        done = True

Time step: 1
Reward: -1
Total reward: -1
Time step: 2
Reward: -1
Total reward: -2
Time step: 3
Reward: -10
Total reward: -12
Time step: 4
Reward: -10
Total reward: -22
Time step: 5
Reward: -1
Total reward: -23
Time step: 6
Reward: -1
Total reward: -24
Time step: 7
Reward: -10
Total reward: -34
Time step: 8
Reward: -1
Total reward: -35
Time step: 9
Reward: -10
Total reward: -45
Time step: 10
Reward: -10
Total reward: -55
Time step: 11
Reward: -10
Total reward: -65
Time step: 12
Reward: -10
Total reward: -75
Time step: 13
Reward: -1
Total reward: -76
Time step: 14
Reward: -1
Total reward: -77
Time step: 15
Reward: -1
Total reward: -78
Time step: 16
Reward: -1
Total reward: -79
Time step: 17
Reward: -10
Total reward: -89
Time step: 18
Reward: -10
Total reward: -99
Time step: 19
Reward: -1
Total reward: -100
Time step: 20
Reward: -1
Total reward: -101


In [10]:
observation = env_taxi.reset() # FER: create random starting point
iterations = 0
max_iter = 30
while True:
  
    env_taxi.render()
    iterations += 1
    #your agent goes here
    action = env_taxi.action_space.sample() 
    # action_space.sample() selects at random one action from the action space: {0,1,2,3,4,5}
    observation, reward, done, info = env_taxi.step(action) 
   
        
    if done or iterations > max_iter:
        break
            
env_taxi.close()
show_video()

As you can see, taking random actions is, unsurprisingly, not a good policy. However, if the agent has no prior information about the environment or the goal, what else could it do?
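
The reward arithmetic makes this concrete. The list below copies the 20 per-step rewards printed by the first random loop above. The comparison with a successful episode assumes, following the reward description, that every step costs -1 except the final drop-off, which gives +20; the function name `successful_return` is ours.

```python
# Per-step rewards from the 20-step random run shown earlier
rewards = [-1, -1, -10, -10, -1, -1, -10, -1, -10, -10,
           -10, -10, -1, -1, -1, -1, -10, -10, -1, -1]

illegal = rewards.count(-10)   # illegal pickup/dropoff attempts
print(illegal, sum(rewards))   # 9 -101


def successful_return(n_steps):
    """Total reward of an episode that delivers the passenger on step n_steps."""
    return -1 * (n_steps - 1) + 20


print(successful_return(15))   # 6
```

Nine of the twenty random actions were illegal pickups or drop-offs, and they account for 90 of the 101 points lost. Under the stated assumption, an agent that delivers the passenger within 15 steps would still finish with a positive total.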

If we know everything about the environment, we could create an array with 500 entries, where each entry tells us what the optimal action is in the corresponding state. Dynamic programming, which we will discuss in Lecture 3, gives a systematic way of computing such an array.
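
As a rough sketch of what such an array looks like in use, the table below holds one action index per state, and a lookup takes the place of `action_space.sample()`. The entries are arbitrary placeholders (every state maps to action 0, "south"), not an optimal policy; `policy` and `select_action` are illustrative names.

```python
N_STATES = 500   # size of the Taxi state space
N_ACTIONS = 6    # size of the Taxi action space

# One action per state. Placeholder values only: computing good entries is
# the job of dynamic programming, which this sketch does not attempt.
policy = [0] * N_STATES


def select_action(policy, state):
    """Look up the action the policy prescribes in the given state."""
    return policy[state]


print(select_action(policy, 267))   # 0
```

Inside the interaction loop, `action = select_action(policy, state)` would then replace the random sample.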

In Lectures 4 and 5 we will discuss how the agent can learn the array without prior knowledge, just by observing the reward received for taking different actions in different states.

## End of session 1.1

Go to session 1.2 [here](https://colab.research.google.com/drive/1m38vYstt6V0TYZaLFr_mdO-mTSgB3lA2?usp=sharing).