Custom notebook I wrote for the course "Fundamentals of Deep Reinforcement Learning" by LVx, as the original one is no longer available on edX.

# PART 2: The Bellman equation

In this notebook, we will try to reproduce the environment described at 11:55 in the Bellman equation video, and we will try to find the state-value function for a random policy.

In [1]:
import numpy as np

![let's add a number to each state](../pictures/shemas.png)

In [2]:
class environement():
    """Implementation of the environment described in the picture above."""
    def __init__(self):
        self.states = list(range(8))
        # One row per non-terminal state (0 to 5); each column is the destination state reached by one of the two actions.
        self.possible_action = np.array([[1, 2], [3, 4], [4, 5], [6, 4], [3, 5], [4, 7]])
        # Reward of each transition, laid out like possible_action (one row per state, one column per action).
        self.rewards = np.array([[-2, 3], [3, 1], [15, 5], [-6, 3], [3, -4], [-4, -2]])
        self.current_state = 0

    def get_actions(self):
        return self.possible_action[self.current_state]
    
    def make_action(self, action):
        # An action is identified by its destination state, so moving only updates the current state.
        # The reward is read separately through get_reward_info().
        self.current_state = action
    
    def get_reward_info(self, action):
        # Find the column of `action` (a destination state) in the current state's row,
        # then read the reward stored at the same position in the reward table.
        action_index = np.where(self.possible_action[self.current_state] == action)[0][0]
        return self.rewards[self.current_state][action_index]
    
    def check_end(self):
        # States 6 and 7 are the two terminal states.
        return self.current_state in (6, 7)
    
    def get_current_state(self):
        return self.current_state
    
    def restart(self):
        self.current_state = 0    

OK, now that we have an environment to interact with, let's create an agent that follows a random policy.

In [3]:
class agent_class():
    def __init__(self, policy):
        self.policy = policy
    
    def act(self, possible_action):
        """Pick one of the two possible destination states according to the policy."""
        if self.policy == "random_policy":
            return np.random.choice(possible_action)
        elif self.policy == "75left":
            return np.random.choice(possible_action, p=[0.75, 0.25])
        elif self.policy == "75right":
            return np.random.choice(possible_action, p=[0.25, 0.75])

    def action_proba(self):
        """Return the probabilities of choosing the first and the second action."""
        if self.policy == "random_policy":
            return [0.5,0.5]
        elif self.policy == "75left":
            return [0.75, 0.25]
        elif self.policy == "75right":
            return [0.25, 0.75]

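We will initialise the state-value function to 0 for every state. A random initialisation of the six non-terminal states is kept as a commented-out alternative; in both cases the two terminal states stay at 0, since we know their value is 0.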

In [4]:
# v = np.random.randint(0, 10, 6).tolist() + [0, 0]  # alternative: 6 random values for the 6 non-terminal states, and two zeros for the terminal states
v = [0] * 8

Let's make our agent interact with the environment; we will update the state-value mapping at each step.
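
For reference, the update applied at each step appears to be the Bellman expectation equation specialised to a deterministic environment. The notation is an assumption of this note and not taken from the course: $\pi(a \mid s)$ is the probability of choosing action $a$ in state $s$, $r(s, a)$ the reward of that action, and $s'(s, a)$ the state it leads to.

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \, \bigl[ r(s, a) + \gamma \, v_\pi\bigl(s'(s, a)\bigr) \bigr]
$$

As a worked instance, the very first update of state 5 under the random policy, with every value still at 0 and $\gamma = 0.9$, would give:

$$
v(5) = 0.5 \, (-4 + 0.9 \cdot 0) + 0.5 \, (-2 + 0.9 \cdot 0) = -3
$$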

In [5]:
def play(num_episode, env, agent, v):
    for i in range(num_episode):
        env.restart()
        while not env.check_end():
            current_state = env.get_current_state()
            possible_actions = env.get_actions()
            r0 = env.get_reward_info(possible_actions[0]) # reward associated with choice 1 at the current state
            r1 = env.get_reward_info(possible_actions[1]) # reward associated with choice 2 at the current state
            action_probability = agent.action_proba()     # probabilities for the agent to pick choice 1 and choice 2

            # Because the environment is deterministic (a given state and action always yield the same reward and next state),
            # the Bellman equation simplifies to: the sum, over both actions, of the action probability
            # times (reward + discounted value of the next state).
            # Note: gamma is read from the global scope, so it must be defined before play() is called.
            v[current_state] =  (
                action_probability[0] * (gamma * v[possible_actions[0]] + r0) + 
                action_probability[1] * (gamma * v[possible_actions[1]] + r1)
            )

            action = agent.act(possible_actions)
            env.make_action(action)
    return v

In [6]:
gamma = 0.9 # discount factor
num_episode = 1000

env = environement()
agent = agent_class("random_policy")

v_final = play(num_episode, env, agent, v)
print(v_final)
# The results are coherent: for example, it is much more valuable to be in state 2 than in state 1, because state 2 offers two better rewards than state 1 and a lower chance of ending with the maximum penalty of -6.

[2.496638655462184, -1.444012605042017, 5.880987394957982, -3.409663865546219, -4.243697478991597, -4.909663865546219, 0, 0]
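
As a sanity check on the printed values, the same equations can be solved directly instead of iteratively, since for a fixed policy the Bellman equation is a linear system in the state values. The sketch below is an addition to the original notebook: the helper name `solve_state_values` and its parameters are hypothetical, and the two tables are copied from the environment class above.

```python
import numpy as np

# Transition and reward tables copied from the environment class above.
next_states = np.array([[1, 2], [3, 4], [4, 5], [6, 4], [3, 5], [4, 7]])
rewards = np.array([[-2, 3], [3, 1], [15, 5], [-6, 3], [3, -4], [-4, -2]])


def solve_state_values(action_probs, gamma=0.9, n_states=8):
    """Solve (I - gamma * P) v = r for a policy that uses the same two
    action probabilities in every non-terminal state (hypothetical helper)."""
    probs = np.asarray(action_probs, dtype=float)
    P = np.zeros((n_states, n_states))  # state-to-state transition probabilities under the policy
    r = np.zeros(n_states)              # expected immediate reward under the policy
    for s, (dests, rews) in enumerate(zip(next_states, rewards)):
        for p, s_next, rew in zip(probs, dests, rews):
            P[s, s_next] += p
            r[s] += p * rew
    # Rows 6 and 7 (terminal states) stay at zero, so their value is 0.
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)


v_exact = solve_state_values([0.5, 0.5])
print(v_exact)
```

Under these assumptions the result should agree with the values printed above up to floating-point error. Passing `[0.75, 0.25]` or `[0.25, 0.75]` instead would correspond to the "75left" and "75right" policies.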


You may have noticed that we use the rewards of both choices to update the value function. In a real case this is often not possible: at each step the agent explores only one choice and therefore only observes the associated reward.
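
To illustrate what an update based only on the observed transition could look like, here is a minimal sketch of a TD(0) update under the random policy. This is a different technique from the one used in this notebook, not the course's method: the function name `td0_random_policy`, the step size `alpha`, the seed and the number of episodes are all assumptions chosen for illustration, and the tables are copied from the environment class above.

```python
import numpy as np

# Transition and reward tables copied from the environment class above.
next_states = np.array([[1, 2], [3, 4], [4, 5], [6, 4], [3, 5], [4, 7]])
rewards = np.array([[-2, 3], [3, 1], [15, 5], [-6, 3], [3, -4], [-4, -2]])
terminal_states = (6, 7)


def td0_random_policy(num_episodes=20000, gamma=0.9, alpha=0.01, seed=0):
    """TD(0) sketch: each update uses only the reward and next state that
    were actually observed (hypothetical helper, alpha is an assumed step size)."""
    rng = np.random.default_rng(seed)
    v = np.zeros(8)
    for _ in range(num_episodes):
        state = 0
        while state not in terminal_states:
            choice = rng.integers(2)                 # random policy: pick one of the two actions
            next_state = next_states[state, choice]
            reward = rewards[state, choice]
            # Move v[state] a small step towards the sampled target.
            v[state] += alpha * (reward + gamma * v[next_state] - v[state])
            state = next_state
    return v


v_td = td0_random_policy()
print(v_td)
```

Because each update relies on a single sampled transition, the estimates are noisy and only approximate the values computed with full knowledge of both rewards; a smaller `alpha` reduces the noise at the cost of slower learning.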

Bonus: we can try a different policy, for example one that has a 75% chance of going left and a 25% chance of going right. This should make state 2 even more valuable than under the random policy, as there is a greater chance of getting the +15 reward on the left (but this state is harder to reach).

In [7]:
gamma = 0.9 # discount factor
num_episode = 1000

env = environement()
agent = agent_class("75left")

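# v still holds the values from the previous run and is updated in place, so this run starts from them
v_final = play(num_episode, env, agent, v)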
print("values when going more often left : " + str(v_final))

agent = agent_class("75right")
v_final = play(num_episode, env, agent, v)
print("values when going more often right : " + str(v_final))

# We can see that a greater chance of going left does indeed make state 2 more valuable.

values when going more often left : [0.5555579555655292, -1.1510491472172353, 9.255627244165169, -4.418536804308797, -2.9712746858168764, -5.5056104129263925, 0, 0]
values when going more often right : [3.6648504824955115, -2.807063509874328, 3.772503366247755, -2.903725314183125, -5.412926391382407, -3.7179084380610417, 0, 0]
