<a href="https://colab.research.google.com/github/ArjunRameshV/Practical_Reinforcement_Learning/blob/master/Gettign_started_with_Reinforcement_Learning_Environments_using_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A brief Introduction

Some of the main objectives:

* Maximise the reward (maybe instantaneous or future ones). In general the sum of rewards over a certain time frame (also known as the return) 
* Environment, the problem representation 
* Agent, the algorithm representation. 
* Policy, a mapping between the states and possible actions. We do not know the best policy (i.e which action at a particular state gives the best reward) when starting out. The aim is to keep interacting with the environment to find an optimal policy. 
* The observation made by the agent may only be a subset of the environment's state (though they are used interchangeably at times) 



## The Cartpole Environment 

The "Hello World" to reinforcement learning. 

Here, we have a pole fixed to a cart, that moves on a frictionless surface. The pole is initially inclined at 90<sup>o</sup> to the surface of the cart. The goal is to prevent the pole from falling over by controlling the cart. 

In the classical setup, the state of the environment, made available to the cart is a 4D vector consisting of the position, velocity of the cart and the angle of inclination and angular velocity of the pole.

Possible actions that can be taken by the agent are: +1 (move right) or -1 (move left)

A reward 1 is provided for every timestep the pole remains upright. We come to an end (an episode ends) when :

    * The pole tips over some angle limit 
    * The cart moved outside the world's boundaries 
    * The time step is 200

The goal is to learn a policy $\pi(a_{s}|s_{t})$, that finds the maximum return (sum of rewards) in an episode $\sum_{t=0}^{T}\gamma^{t}r_{t}$. Here $\gamma$ is called the discount factor and has a value between [0,1]. The main reason for using gamma is to make sure that we prioritize immediate rewards over future ones so that the best policy is found quickly. 



**Making an environment** 

In general, environments can be made in python or tensorflow. One possible things we will look into is to make an environment in python and use a wrapper to convert it into tensorflow

### Setup

In [1]:
!pip install -q tf-agents

[?25l[K     |▎                               | 10kB 10.4MB/s eta 0:00:01[K     |▌                               | 20kB 16.1MB/s eta 0:00:01[K     |▉                               | 30kB 17.6MB/s eta 0:00:01[K     |█                               | 40kB 15.4MB/s eta 0:00:01[K     |█▍                              | 51kB 10.3MB/s eta 0:00:01[K     |█▋                              | 61kB 8.6MB/s eta 0:00:01[K     |██                              | 71kB 9.5MB/s eta 0:00:01[K     |██▏                             | 81kB 9.7MB/s eta 0:00:01[K     |██▍                             | 92kB 10.0MB/s eta 0:00:01[K     |██▊                             | 102kB 9.3MB/s eta 0:00:01[K     |███                             | 112kB 9.3MB/s eta 0:00:01[K     |███▎                            | 122kB 9.3MB/s eta 0:00:01[K     |███▌                            | 133kB 9.3MB/s eta 0:00:01[K     |███▉                            | 143kB 9.3MB/s eta 0:00:01[K     |████                 

In [2]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [3]:
import abc 
import tensorflow as tf 
import numpy as np 

In [4]:
from tf_agents.environments import py_environment 
from tf_agents.environments import tf_environment 
from tf_agents.environments import tf_py_environment 
from tf_agents.environments import utils 
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers 
from tf_agents.environments import suite_gym 
from tf_agents.trajectories import time_step as ts 

In [5]:
# to make sure we use tensorflow 2 functions 
tf.compat.v1.enable_v2_behavior()

### Python Environment 

The environment should have a function, that takes an action from the agent for the current state and returns the next state along with a reward. In general we need a **step(action) --> next_time_step** that returns:
 
* **observations**: The part of the environment that can be seen by the agent. 
* **reward**: A return from the environment after an agent takes an action. 
* **step_type**: Interactions with an environment are usually part of an episode. This is used to indicate whether this time step is the first (`FIRST`), intermediary (`MID`) or last (`LAST`) in the episode. 
* **discount**: A float representing how much to weight the reward at the next timestep to the current reward. 

These are grouped into a tuple `TimeStep(step_type, reward, discount, observation).`


```
class PyEnvironment(object):
  def __init__(self):
    # return the initial time step 
    self._current_time_step = self._reset()
    return self._current_time_step 

  def step(self, action):
    # apply the action and return the next step 
    if self._current_time_step is None:
      return self.reset()
    self._current_time_step = self._step(action)
    return self._current_time_step
  
  def current_time_step(self):
    return self._current_time_step

  def time_step_spec(self):
    # returns the time_step_spec

  @abc.abstractmethod
  def observation_spec(self):
    # return observation_spec

  @abc.abstractmethod
  def action_spec(self):
    # return action_spec

  @abc.abstractmethod
  def _reset(self):
    # resets the inital_time_step

  @abc.abstractmethod
  def _step(self, action):
    # apply action and return the new time_step

```



### Understanding a standard environment 

Tensorlfow agents have built-in wrappers for many standard environments like Open-AI Gym. Lets see how these wrapped environments look like and the structure of the action and time_step_spec they follow 

In [7]:
environment = suite_gym.load("CartPole-v0")

In [8]:
print('action_spec: ', environment.action_spec())

action_spec:  BoundedArraySpec(shape=(), dtype=dtype('int64'), name='action', minimum=0, maximum=1)


In [11]:
# the observation available to the agent at each time_step
print('time_step_spec.observation: ', environment.time_step_spec().observation)
# the type of step at this time_step (the inital, intermediary or final)
print('time_step_spec.step_type: ', environment.time_step_spec().step_type)
# the gamma factor in return 
print('time_step_spec.discount: ', environment.time_step_spec().discount)
# the reward if action is made from a particular state 
print('time_step_spec.reward: ', environment.time_step_spec().reward)

time_step_spec.observation:  BoundedArraySpec(shape=(4,), dtype=dtype('float32'), name='observation', minimum=[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38])
time_step_spec.step_type:  ArraySpec(shape=(), dtype=dtype('int32'), name='step_type')
time_step_spec.discount:  BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0)
time_step_spec.reward:  ArraySpec(shape=(), dtype=dtype('float32'), name='reward')


In [12]:
# observaing what happens when we take a fixed action 
action = np.array(1, dtype=np.int32)
time_step = environment.reset()
print("The initial time_step: ", time_step)
while not time_step.is_last():
  # the agent keeps making the same action
  time_step = environment.step(action)
  print(time_step)

The initial time_step:  TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.01817294,  0.00269327,  0.0018532 , -0.00895948], dtype=float32))
TimeStep(step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.0182268 ,  0.1977886 ,  0.00167401, -0.30105713], dtype=float32))
TimeStep(step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.02218258,  0.39288664, -0.00434713, -0.59321165], dtype=float32))
TimeStep(step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.03004031,  0.5880692 , -0.01621137, -0.88726074], dtype=float32))
TimeStep(step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32), observation=array([ 0.04180169,  0.7834074 , -0.03395658, -1.1849954 ], dt

##A custom Black-Jack Environment

Let's construct a simple environment to play the game of black-jack. \\
In this game, we have an infinite deck of cards numbered from 1 to 10. \\
At every turn, the agent can either take a new random card or stop the current round.  \\
 The goal of the game is to get a sum as close to 21 as possible at the end of the round while not exceeding 21.

 \\

 A possible way to construct the environment is by using the following representations:

 * **action**: 0 for taking a new card and 1 to terminate the current round. 

 * **observation**: (the current state of the environment, as seen by the agent) sum of cards in the current round (since the deck is considered as an infinite one, we dont have any advantage if tracking the cards. If the number of cards where finite, then each individual cards could have been included in the observation) 

 * **reward**: At the end of each episode,
> The objective is to get as close to 21 as possible. 
>      If sum_of_cards <= 21: 
          then sum_of_cards-21
        else:
          then -21 



In [16]:
class BlackJack(py_environment.PyEnvironment):
  def __init__(self):
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(1,), dtype=np.int32, minimum=0, name='observation')
    self._state = 0
    self._episode_ended = False 

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    self._state = 0
    self._episode_ended = False
    return ts.restart(np.array([self._state], dtype=np.int32))

  def _step(self, action):
    # if the last action ended the episode
    if self._episode_ended: 
      return self.reset()

    # make sure the episodes terminate 
    if action == 1:
      self._episode_ended = True
    elif action == 0:
      new_card = np.random.randint(1, 11)
      self._state += new_card 
    else:
      raise ValueError(' action should be either 0 or 1')

    if self._episode_ended or self._state == 21:
      reward = self._state - 21 if self._state <= 21 else -21 
      return ts.termination(np.array([self._state], dtype=np.int32), reward)
    else: 
      return ts.transition(np.array([self._state], dtype=np.int32), reward=0.0, discount=1.0)    

Some things to keep in mind when creating an environment, the observations and the time_steps generated follow the correct shapes and types as defined the the specs.

In [17]:
# creating a mock run with random policy that iterates over 5 episodes 
environment = BlackJack()
utils.validate_py_environment(environment=environment, episodes=5)

**Test Run**: Use a fixed policy to generate 3 cards and then end the turn. 

In [18]:
# since action=0 generates a new card and action=1 terminates the current episode
get_new_card_action = np.array(0, dtype=np.int32)
end_round_action = np.array(1, dtype=np.int32)

In [22]:
environment = BlackJack()
# an initial reset
time_step = environment.reset()
print("The initial time step: " , time_step)
cumulative_reward = time_step.reward

for i in range(3):
  time_step = environment.step(get_new_card_action)
  print(f'The {i+1}th time_step: {time_step}')
  cumulative_reward += time_step.reward

time_step = environment.step(end_round_action)
print("The terminal time_step: ", time_step)
cumulative_reward += time_step.reward 
print(f"The final reward: {cumulative_reward}")

The initial time step:  TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([0], dtype=int32))
The 1th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([7], dtype=int32))
The 2th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([16], dtype=int32))
The 3th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([17], dtype=int32))
The terminal time_step:  TimeStep(step_type=array(2, dtype=int32), reward=array(-4., dtype=float32), discount=array(0., dtype=float32), observation=array([17], dtype=int32))
The final reward: -4.0


##Environment Wrapper 

An environment wrapper takes a python environment and returns a modified environment, that is also an instance of py_environment.PyEnvironment. 

Some common wrappers: 
* ActionDiscretizeWrapper: Converts a continuous action space into a discrete action space. 
* RunsStats: Captures the run time statistics of an environment such as number of steps taken, number of episodes completed. 
* TimeLimit: Terminates the episode after a fixed amount of steps.


An example of Action Discretize wrapper, used to discretize the continuous action space for invertedPendulum (a PyBullet environment, that accepts action from the continuous range [-2,2]). 

In [27]:
env = suite_gym.load("Pendulum-v0")
print("Action specification: ", env.action_spec())

discrete_action_env = wrappers.ActionDiscretizeWrapper(env,num_actions=5)
print("Discretized action specification: ", discrete_action_env.action_spec())

print(f"\nThe initial data-type: {env.action_spec().dtype} and the data-type after wrapping: {discrete_action_env.action_spec().dtype}")

Action specification:  BoundedArraySpec(shape=(1,), dtype=dtype('float32'), name='action', minimum=-2.0, maximum=2.0)
Discretized action specification:  BoundedArraySpec(shape=(), dtype=dtype('int32'), name='action', minimum=0, maximum=4)

The initial data-type: float32 and the data-type after wrapping: int32


## TensorFlow Environments

It is very much similar to the python environment, except for the facts that:

* Tensor objects are generated instead of arrays 
* TF environments adds batch dimension to the tensor generated when compared to the python env specs. (An additional dimension will be present when compared to the above examples) 

Similar to the python environment, we have abstract methods that can be overridden or used an abstract methods. The current_time_step() can also initialize the environment if needed. 

```
class TFEnvironment(object):

  def time_step_spec(self):
    # descirbes the time_step tensor returned by step()

  def observation_spec(self):
    # defines the tensor spec for observation provided by the environment 

  def action_spec(self):
    # describes the tensor spec of the action expected by the step(action) 

  def reset(self):
    # returns the current time_step after reseting the environment 
    return self._reset()

  def current_time_step(self):
    # returns the current time_step
    return self._current_time_step()

  def step(self, action):
    # applies the action and returns the new time_step 
    return self._step(action) 

  @abc.abstractMethod
  def _reset(self):
    # returns the current time_step after reseting the environment

  @abc,abstractMethod
  def _current_time_step(self):
    # returns the current time_step 

  @abc.abstractMethod
  def _step(self, action):
    # applies the action and returns the new time_step
```

## Wrapping a python environmetn in tensorflow 


In [29]:
env = BlackJack()
tf_env = tf_py_environment.TFPyEnvironment(env)

print(isinstance(tf_env, tf_environment.TFEnvironment))
print("Python env, time_step spec: ", env.time_step_spec())
print("TF env, time_step_spec: ", tf_env.time_step_spec())
print("\n")
print("Python env, action spec: ", env.action_spec())
print("TF env, action_spec: ", tf_env.action_spec())

True
Python env, time_step spec:  TimeStep(step_type=ArraySpec(shape=(), dtype=dtype('int32'), name='step_type'), reward=ArraySpec(shape=(), dtype=dtype('float32'), name='reward'), discount=BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0), observation=BoundedArraySpec(shape=(1,), dtype=dtype('int32'), name='observation', minimum=0, maximum=2147483647))
TF env, time_step_spec:  TimeStep(step_type=TensorSpec(shape=(), dtype=tf.int32, name='step_type'), reward=TensorSpec(shape=(), dtype=tf.float32, name='reward'), discount=BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)), observation=BoundedTensorSpec(shape=(1,), dtype=tf.int32, name='observation', minimum=array(0, dtype=int32), maximum=array(2147483647, dtype=int32)))


Python env, action spec:  BoundedArraySpec(shape=(), dtype=dtype('int32'), name='action', minimum=0, maximum=1)
TF env, action_spec:  BoundedTens

 A few things to observe from above:

 * The change from BoundedArraySpec to BoundedTensorSpec
 * The datatype changes from int and float to tf.int and tf.float
 * We can see the additional dimension in minimum and maximum values. 

In [32]:
# doing a quick run, using the same policy as before

get_new_card_action = tf.constant([0], dtype=tf.int32)
end_round_action = tf.constant([1], dtype=tf.int32)

time_step = tf_env.reset()
num_of_cards_drawn_in_one_episode = 3
cumulative_reward = 0

for i in range(num_of_cards_drawn_in_one_episode):
  time_step = env.step(get_new_card_action)
  print(f"{i+1}th time_step: {time_step}")
  cumulative_reward += time_step.reward 

time_step = tf_env.step(end_round_action)
print(f"Final time_step: {time_step}")
cumulative_reward += time_step.reward 
print("The final reward: ", cumulative_reward)

1th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([10], dtype=int32))
2th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([16], dtype=int32))
3th time_step: TimeStep(step_type=array(1, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([19], dtype=int32))
Final time_step: TimeStep(step_type=<tf.Tensor: shape=(1,), dtype=int32, numpy=array([2], dtype=int32)>, reward=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([-2.], dtype=float32)>, discount=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>, observation=<tf.Tensor: shape=(1, 1), dtype=int32, numpy=array([[19]], dtype=int32)>)
The final reward:  tf.Tensor([-2.], shape=(1,), dtype=float32)


## Running a wrapped environment of the cart-pole

We try to run an example of the cart-pole (explained after the inroduction section) 

In [34]:
# loading the environment 
env = suite_gym.load("CartPole-v0")

In [39]:
# wrapping it to a tensor environment 
tf_env = tf_py_environment.TFPyEnvironment(env)

In [38]:
# simulating one episode 

time_step = tf_env.reset() 
num_steps = 3 
transitions = []
reward = 0

for i in range(num_steps):
  action = tf.constant([i%2])   # decide an action, move right or left 
  next_time_step = tf_env.step(action)   # take the action 
  transitions.append([time_step, action, next_time_step])   # remember the transition information 
  reward += next_time_step.reward   # update the rewad 
  time_step = next_time_step    # update the time_step variable

np_transitions = tf.nest.map_structure(lambda x: x.numpy(), transitions)
print("\n".join(map(str, np_transitions)))
print(f"Total reward: {reward.numpy()}")

[TimeStep(step_type=array([0], dtype=int32), reward=array([0.], dtype=float32), discount=array([1.], dtype=float32), observation=array([[ 0.02394745, -0.04264867,  0.0013553 , -0.00630255]],
      dtype=float32)), array([0], dtype=int32), TimeStep(step_type=array([1], dtype=int32), reward=array([1.], dtype=float32), discount=array([1.], dtype=float32), observation=array([[ 0.02309448, -0.23779003,  0.00122925,  0.2868077 ]],
      dtype=float32))]
[TimeStep(step_type=array([1], dtype=int32), reward=array([1.], dtype=float32), discount=array([1.], dtype=float32), observation=array([[ 0.02309448, -0.23779003,  0.00122925,  0.2868077 ]],
      dtype=float32)), array([1], dtype=int32), TimeStep(step_type=array([1], dtype=int32), reward=array([1.], dtype=float32), discount=array([1.], dtype=float32), observation=array([[ 0.01833868, -0.04268564,  0.00696541, -0.0054873 ]],
      dtype=float32))]
[TimeStep(step_type=array([1], dtype=int32), reward=array([1.], dtype=float32), discount=array([

In [40]:
# running multiple episodes 

time_step = tf_env.reset()
rewards = []
steps = []
num_episodes = 5

for i in range(num_episodes):
  episode_reward = 0
  episode_steps = 0
  while not time_step.is_last():
    action = tf.random.uniform(shape=[1], dtype=tf.int32, minval=0, maxval=2)
    time_step = tf_env.step(action)
    episode_steps += 1
    episode_reward += time_step.reward.numpy() 
  print(f"Episode {i+1}, reward: {episode_reward} and #steps: {episode_steps}")
  rewards.append(episode_reward)
  steps.append(episode_steps)
  time_step = tf_env.reset()

num_steps = np.sum(steps)
avg_length = np.mean(steps)
avg_reward = np.mean(rewards)

print(f"total num of episodes: {num_episodes}, total number of steps: {num_steps}")
print(f"average length of steps: {avg_length}, average reward: {avg_reward}")

Episode 1, reward: [10.] and #steps: 10
Episode 2, reward: [38.] and #steps: 38
Episode 3, reward: [17.] and #steps: 17
Episode 4, reward: [19.] and #steps: 19
Episode 5, reward: [15.] and #steps: 15
total num of episodes: 5, total number of steps: 99
average length of steps: 19.8, average reward: 19.799999237060547
