## Value iteration gridworld


In this project we will implement the Value Iteration algorithm to compute an optimal policy for three different (but related) Markov Decision Processes. The pseudo-code for the algorithm is reproduced below from the textbook (Reinforcement Learning, Sutton & Barto, 1998).

<img src="images/value_iteration.png" style="width: 800px;"/>

The set $\mathcal{S}$ contains all non-terminal states, whereas $\mathcal{S}^+$ is the set of all states (terminal and non-terminal). The reward $r = r(s, a, s')$ is the expected immediate reward on transition from state $s$ to the next state $s'$ under action $a$. 
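
For reference, the update at the heart of the pseudo-code can be written out as follows. The transition probability $p(s' \mid s, a)$ and the discount factor $\gamma$ follow the textbook's usual notation and are assumptions here, since this notebook does not define them explicitly:

$$
V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr] \qquad \text{for all } s \in \mathcal{S}
$$

The sweep over $\mathcal{S}$ is repeated until the largest change in any $V(s)$ during a sweep is smaller than $\theta$. A greedy policy is then read off as

$$
\pi(s) = \arg\max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr].
$$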

<img src="images/bombs and gold numbers.png" style="width: 300px;" align="left" caption="Figure 1"/>

The three problems you will solve use variants of the gridworld environment shown on the left. You should be familiar with this environment from the lectures and from your previous lab exercise. The grid squares in the figure are numbered as shown. In all three problems, the following is true: 

**Actions available:** The agent has four possible actions in each grid square. These are _west_, _north_, _south_, and _east_. If the direction of movement is blocked by a wall (for example, if the agent executes action south at grid square 1), the agent remains in the same grid square. 

**Collecting gold:** On its first arrival at a grid square that contains gold, the agent collects the gold. In order to collect the gold, the agent needs to transition into the grid square (containing the gold) from a different grid square. 

**Hitting the bomb:** On arrival at a grid square that contains the bomb, the agent activates the bomb. 

**Terminal states:** The game terminates when all gold is collected or when the bomb is activated. In Exercises 1 and 2, you can define terminal states to be grid squares 18 and 23. In Exercise 3, you will need to define terminal state(s) differently.
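
The movement rule described under **Actions available** can be sketched as a small function. This is a minimal illustration, not the implementation used later in the notebook. It assumes squares are numbered row by row starting from the bottom-left corner, which is consistent with the examples in the text (north from square 0 leads to square 5; south from square 1 is blocked). The name `next_square` is hypothetical.

```python
NUM_ROWS, NUM_COLS = 5, 5  # assumed grid size (5 x 5, squares 0..24)


def next_square(square, action):
    """Return the square reached from `square` under `action` ('n', 'e', 's', 'w').

    Assumes row-by-row numbering from the bottom-left corner, so moving
    north adds NUM_COLS. A move into a wall leaves the agent where it is.
    """
    row, col = divmod(square, NUM_COLS)
    if action == "n" and row < NUM_ROWS - 1:
        return square + NUM_COLS
    if action == "s" and row > 0:
        return square - NUM_COLS
    if action == "e" and col < NUM_COLS - 1:
        return square + 1
    if action == "w" and col > 0:
        return square - 1
    return square  # blocked by a wall: stay put


print(next_square(0, "n"))  # moves up one row
print(next_square(1, "s"))  # blocked by the bottom wall
```

Working with explicit `(row, col)` coordinates keeps the wall checks readable and avoids modulo arithmetic on candidate positions.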


### Instructions ###
Set parameter $\theta$ to $10^{-10}$ (that is, `1e-10`).

Set all initial state values $V(s)$ to zero.


We will use the reward function: $-1$ for each navigation action (including when the action results in hitting the wall), an additional $+10$ for collecting each piece of gold, and an additional $-10$ for activating the bomb. For example, the immediate reward for transitioning into a square with gold is $-1 + 10 = +9$. 
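
As a quick check of the arithmetic, the reward rule can be sketched as below. The gold and bomb squares (23 and 18) are taken from the `Gridworld` class in this notebook and apply to Exercises 1 and 2 only. The function name `immediate_reward` is hypothetical, and the sketch assumes the agent actually arrives at the square from a different square.

```python
GOLD_SQUARES = {23}  # assumption: layout of Exercises 1 and 2
BOMB_SQUARES = {18}


def immediate_reward(arrival_square):
    """Reward for one navigation action that ends in `arrival_square`."""
    reward = -1  # every navigation action costs 1, even when hitting a wall
    if arrival_square in GOLD_SQUARES:
        reward += 10  # bonus for collecting gold
    if arrival_square in BOMB_SQUARES:
        reward -= 10  # penalty for activating the bomb
    return reward


print(immediate_reward(23))  # gold: -1 + 10
print(immediate_reward(18))  # bomb: -1 - 10
print(immediate_reward(0))   # ordinary square
```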


## Exercise 1: Deterministic environment

In this first exercise, the agent is able to move in the intended direction with certainty. For example, if it executes action _north_ in grid square 0, it will transition to grid square 5 with probability 1. In other words, we have a deterministic environment. 
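
For comparison with the class-based solution that follows, here is a compact sketch of value iteration for the deterministic case that stops on the $\theta$ criterion instead of running a fixed number of sweeps. It is an illustration under stated assumptions: gold on square 23 and the bomb on square 18 (as in the `Gridworld` class), no discounting (`gamma=1.0`), `theta=1e-10` as an assumed reading of the instructions, and row-by-row numbering from the bottom-left. All names (`move`, `reward`, `value_iteration`) are hypothetical.

```python
import numpy as np

NUM_ROWS, NUM_COLS = 5, 5
GOLD, BOMB = 23, 18  # assumed terminal squares for Exercise 1
ACTIONS = ["n", "e", "s", "w"]
STEP = {"n": NUM_COLS, "e": 1, "s": -NUM_COLS, "w": -1}


def move(square, action):
    """Deterministic next square; walls leave the agent in place."""
    row, col = divmod(square, NUM_COLS)
    blocked = (
        (action == "n" and row == NUM_ROWS - 1)
        or (action == "s" and row == 0)
        or (action == "e" and col == NUM_COLS - 1)
        or (action == "w" and col == 0)
    )
    return square if blocked else square + STEP[action]


def reward(arrival_square):
    """-1 per action, +10 for gold, -10 for the bomb."""
    return -1 + 10 * (arrival_square == GOLD) - 10 * (arrival_square == BOMB)


def value_iteration(theta=1e-10, gamma=1.0):
    values = np.zeros(NUM_ROWS * NUM_COLS)

    def backup(s, a):
        # One-step lookahead for taking action `a` in state `s`.
        nxt = move(s, a)
        return reward(nxt) + gamma * values[nxt]

    while True:
        delta = 0.0
        for s in range(values.size):
            if s in (GOLD, BOMB):
                continue  # terminal values stay at zero
            best = max(backup(s, a) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update, as in the pseudo-code
        if delta < theta:
            break

    # Greedy policy; ties are broken in favour of the first action in ACTIONS.
    policy = np.array([max(ACTIONS, key=lambda a: backup(s, a))
                       for s in range(values.size)])
    return values, policy


v_sketch, policy_sketch = value_iteration()
print(np.flipud(v_sketch.reshape(NUM_ROWS, NUM_COLS)))
```

Under these assumptions the sketch should agree with the values printed by the notebook's own solution, for example 3 at square 0 and 9 at square 22.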

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import random


In [2]:
class Gridworld:
    def __init__(self):
        self.num_rows = 5
        self.num_cols = 5
        self.gold_reward = 10
        self.bomb_reward = -10
        self.gold_positions = np.array([23])
        self.bomb_positions = np.array([18])
        self.random_move_probability = 0.2

        self.actions = ["UP", "RIGHT", "DOWN", "LEFT"]
        self.num_actions = len(self.actions)
        self.num_fields = self.num_cols * self.num_rows
        
        self.rewards = np.zeros(shape=self.num_fields)
        self.rewards[self.bomb_positions] = self.bomb_reward
        self.rewards[self.gold_positions] = self.gold_reward

        self.step = 0
        self.cumulative_reward = 0
        self.agent_position = np.random.randint(0, 5)
        self.availableMoves = np.zeros(self.num_actions)
        self.stochasticMoves = np.zeros(shape=(self.num_fields,self.num_actions))
        
    def availablePositions(self, oldstate):
        # Determine available positions and check whether the agent hits a wall.
        #[up,right,down,left]
        
        #up
        candidate_position = oldstate + self.num_cols
        if candidate_position < self.num_fields:
            self.availableMoves[0] = candidate_position
        else:
            self.availableMoves[0] = oldstate
        
        #right
        candidate_position = oldstate + 1
        if candidate_position % self.num_cols > 0:
            self.availableMoves[1] = candidate_position
        else:
            self.availableMoves[1] = oldstate
        
        #down
        candidate_position = oldstate - self.num_cols
        if candidate_position >= 0:
            self.availableMoves[2] = candidate_position
        else:
            self.availableMoves[2] = oldstate
        
        #left
        candidate_position = oldstate - 1
        if candidate_position % self.num_cols < self.num_cols - 1:
            self.availableMoves[3] = candidate_position
        else:
            self.availableMoves[3] = oldstate
        
        
        return self.availableMoves

    def reset(self):
        self.agent_position = np.random.randint(0, 5)

    def is_terminal_state(self):
        # Despite its name, this does not return a boolean. It returns an array
        # of all terminal squares (bomb_positions followed by gold_positions).
        # Callers compare a state against it, e.g. (state == ...).any().
        return np.append(self.bomb_positions, self.gold_positions)

In [3]:
# Run value iteration: each "episode" is one full sweep over all states
def play(environment, episodes=500):
    deter = vcalc(environment)
    for episode in range(0, episodes):
        environment.reset()
        
        for i in range(environment.num_fields):
            state = i
            positions = environment.availablePositions(state)

            vtab, bestpol = deter.buildValTable(state, positions, environment)

    return vtab, bestpol

In [4]:
# class to calculate value iteration
class vcalc:
    def __init__(self, environment, epsilon=0.05, alpha=0.1, gamma=1):
        self.environment = environment

        self.valtable = np.zeros(shape=(self.environment.num_fields))
        self.transProb = np.zeros(shape=(self.environment.num_fields, self.environment.num_actions,self.environment.num_fields))
        self.bestPolicy = np.empty(shape=(self.environment.num_fields),dtype=str)
        
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
    def calcTransProbs(self):
        # Placeholder: transition probabilities are hard-coded in buildValTableStoch.
        pass
    
    def buildValTableStoch(self,state,positions, environment):
        #get rewards of neighbours
        
        positions = [int(x) for x in positions]
        reward = np.empty(len(positions))
        for i in range(len(positions)):

            reward[i] = environment.rewards[int(positions[i])]-1

        
        if (state==environment.is_terminal_state()).any():
            
            self.valtable[state] = 0
            
            #so passes assessment check
            self.bestPolicy[state] = 'n'
            
        else:
            availableVs = self.valtable[positions]

            reward_x_v = reward + availableVs
            
            #calculate stochastic v values for neighbour states
            stochV = np.zeros(4)
            
            stochV[0] = 0.85*reward_x_v[0] + 0.05*reward_x_v[1] + 0.05*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[1] = 0.05*reward_x_v[0] + 0.85*reward_x_v[1] + 0.05*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[2] = 0.05*reward_x_v[0] + 0.05*reward_x_v[1] + 0.85*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[3] = 0.05*reward_x_v[0] + 0.05*reward_x_v[1] + 0.05*reward_x_v[2] +0.85*reward_x_v[3]
            

            stochV = np.array(stochV)
            
            max_v_value = np.max(stochV)
             
            max_v_pos_list = list(np.nonzero(stochV == max_v_value))[0]
        
            max_v_pos = random.choice(max_v_pos_list)
            
            policies = ['n','e','s','w']
            
            self.valtable[state] = max_v_value
            self.bestPolicy[state] = policies[max_v_pos]

        
        return self.valtable,self.bestPolicy
        
    
    def buildValTable(self,state,positions, environment):
        #rewards for all positions
        #calculate rewards for all possible moves
        #builds the v table
        
        positions = [int(x) for x in positions]
        reward = np.empty(len(positions))
        
        #calc v for neighbouring states
        
        #get rewards for neighbouring states
        for i in range(len(positions)):

            reward[i] = environment.rewards[int(positions[i])]-1
            
        #check if terminal state and v is 0
        if (state==environment.is_terminal_state()).any():
            
            self.valtable[state] = 0
            
            #so passes assessment check
            self.bestPolicy[state] = 'n'
            
        #calculate v values by adding reward to vs and return max  
        else:
            availableVs = self.valtable[positions]

            reward_x_v = reward + availableVs

            max_v_value = np.max(reward_x_v)

            max_v_pos = int(np.where(reward_x_v == max_v_value)[0][0])
            
            policies = ['n','e','s','w']
            
            self.valtable[state] = max_v_value
            self.bestPolicy[state] = policies[max_v_pos]

        
        return self.valtable,self.bestPolicy



In [5]:
# Now we will use the code to compute the values of policy and v using the value iteration algorithm

environment = Gridworld()

v,policy = play(environment, episodes=500)

vdet = np.fliplr(v[::-1].reshape(5,5))
print(vdet)


#policy table
poldet = np.fliplr(policy[::-1].reshape(5,5))
print(poldet)

[[7. 8. 9. 0. 9.]
 [6. 7. 8. 0. 8.]
 [5. 6. 7. 6. 7.]
 [4. 5. 6. 5. 6.]
 [3. 4. 5. 4. 5.]]
[['e' 'e' 'e' 'n' 'w']
 ['n' 'n' 'n' 'n' 'n']
 ['n' 'n' 'n' 'e' 'n']
 ['n' 'n' 'n' 'n' 'n']
 ['n' 'n' 'n' 'n' 'n']]
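
The expression `np.fliplr(v[::-1].reshape(5,5))` used above prints the flat arrays so that square 0 appears in the bottom-left corner. A minimal sketch of an equivalent, arguably more readable, form is shown below. The helper name `as_grid` is hypothetical and the sketch assumes row-by-row numbering from the bottom-left.

```python
import numpy as np


def as_grid(values, num_rows=5, num_cols=5):
    """Arrange a flat per-square array so that square 0 is at the bottom-left."""
    # Reshape row by row, then flip vertically so the first row ends up last.
    return np.flipud(np.asarray(values).reshape(num_rows, num_cols))


squares = np.arange(25)
print(as_grid(squares))

# Same result as the expression used in the notebook.
print(np.array_equal(as_grid(squares), np.fliplr(squares[::-1].reshape(5, 5))))
```

Reversing the flat array flips both rows and columns, and `np.fliplr` then undoes the column flip, so the net effect is a single vertical flip.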


#### Check the data types of your solution! 
The following tests check whether the variables `policy` and `v` have the correct data types.

In [6]:
# Print the values you computed
print("This is your 'policy':")
print(policy)
print("These are your state values 'v':")
print(v)

# Check whether both policy and v are numpy arrays.
import numpy as np
assert(isinstance(policy, np.ndarray))
assert(isinstance(v, np.ndarray))

# Check correct shapes of numpy arrays.
assert(policy.shape == (25, ))
assert(v.shape == (25, ))

# Check whether the numpy arrays have the correct data types.
assert(np.issubdtype(policy.dtype, np.str_)) # policy.dtype should be '<U1'
assert(np.issubdtype(v.dtype, np.float64))

# Check whether all policy values are either "n", "w", "s", or "e".
assert(np.all(np.isin(policy, np.array(["n", "w", "s", "e"])))) 

# Arrays with CORRECT data types (but WRONG values!) would be, for example:
# policy = np.array(["n", "w", "s", "w", "e", "n", "w", "s", "w", "e", 
#                    "n", "w", "s", "w", "e", "n", "w", "s", "w", "e", 
#                    "n", "w", "s", "w", "e"])
# v = np.random.rand(25)
# DO NOT UNCOMMENT THE PREVIOUS lines... otherwise they will overwrite the arrays that you computed!

This is your 'policy':
['n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'n' 'e' 'n' 'n' 'n' 'n'
 'n' 'n' 'e' 'e' 'e' 'n' 'w']
These are your state values 'v':
[3. 4. 5. 4. 5. 4. 5. 6. 5. 6. 5. 6. 7. 6. 7. 6. 7. 8. 0. 8. 7. 8. 9. 0.
 9.]


In [7]:
# DO NOT DELETE THIS CELL. 
# Your code for Exercise 1 is tested here. 


## Exercise 2: Stochastic environment

In this second exercise, the agent is not always able to execute its actions as intended. With probability 0.8, it moves in the intended direction. With probability 0.2, it moves in a random direction. For example, from grid square 0, if the agent executes action _north_, with probability 0.8, the action will work as intended. But with probability 0.2, the agent's motor control system will move in a random direction (including north). For example, with probability 0.05, it will try to move west (where it will be blocked by the wall and hence remain in grid square 0). Notice that the total probability of moving to square 5 (as intended) is 0.8 + 0.05 = 0.85.
 
Compute the optimal policy using Value Iteration.
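
The transition probabilities described above can be made explicit with a short sketch. It is an illustration only: the names `move` and `transition_distribution` are hypothetical, and the grid numbering (row by row from the bottom-left) is an assumption consistent with the examples in the text.

```python
NUM_ROWS, NUM_COLS = 5, 5
ACTIONS = ["n", "e", "s", "w"]
RANDOM_MOVE_PROBABILITY = 0.2  # probability that the move is chosen at random


def move(square, action):
    """Deterministic next square; walls leave the agent in place."""
    row, col = divmod(square, NUM_COLS)
    if action == "n" and row < NUM_ROWS - 1:
        return square + NUM_COLS
    if action == "s" and row > 0:
        return square - NUM_COLS
    if action == "e" and col < NUM_COLS - 1:
        return square + 1
    if action == "w" and col > 0:
        return square - 1
    return square


def transition_distribution(square, action):
    """Map each reachable next square to its probability under `action`."""
    distribution = {}
    for actual in ACTIONS:
        # Each direction gets an equal share of the random-move probability ...
        probability = RANDOM_MOVE_PROBABILITY / len(ACTIONS)
        # ... and the intended direction also gets the remaining mass.
        if actual == action:
            probability += 1.0 - RANDOM_MOVE_PROBABILITY
        target = move(square, actual)
        distribution[target] = distribution.get(target, 0.0) + probability
    return distribution


for target, probability in sorted(transition_distribution(0, "n").items()):
    print(target, round(probability, 2))
```

For action _north_ in square 0, this assigns 0.85 to square 5, 0.05 to square 1, and 0.10 to square 0, because both _south_ and _west_ are blocked by walls and leave the agent in place.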

In [8]:
def playStoch(environment, episodes=500):
    
    
    stoch = vcalc(environment)
    for episode in range(0, episodes):
        environment.reset()
    
        #loop through all states
        for i in range(environment.num_fields):
            state = i
            
            #find next positions for state
            positions = environment.availablePositions(state)
            
            #get v and policy tables

            vtabstoch, bestpolstoch = stoch.buildValTableStoch(state, positions, environment)


    return vtabstoch, bestpolstoch

v,policy = playStoch(environment, episodes=500)

print(np.fliplr(v[::-1].reshape(5,5)))

print(np.fliplr(policy[::-1].reshape(5,5)))

[[6.04169329 7.28756636 8.61359951 0.         8.69262311]
 [4.86185111 5.99087587 6.37082431 0.         6.46721593]
 [3.67550938 4.69621388 4.99441863 3.2189158  5.10250988]
 [2.48699534 3.40945989 3.66922967 2.64122933 3.78610115]
 [1.35979208 2.19733672 2.42878751 1.57272161 2.55202451]]
[['e' 'e' 'e' 'n' 'w']
 ['n' 'n' 'n' 'n' 'n']
 ['n' 'n' 'n' 'e' 'n']
 ['n' 'n' 'n' 'e' 'n']
 ['n' 'n' 'n' 'n' 'n']]


In [9]:
# Do not delete this cell. Your code for Exercise 2 is tested here.

## Exercise 3: Stochastic environment with two pieces of gold (3 marks)

<img src="images/bomb and two gold.png" style="width: 300px;" align="left" caption="Figure 1"/> In this third exercise, the environment is identical to the environment in Exercise 2 with the following exception: there is an additional piece of gold on grid square 12. Recall from earlier instructions that the terminal state is reached only when _all_ gold is collected or when the bomb is activated.

Compute the optimal policy using Value Iteration. 
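
With two pieces of gold, the grid square alone no longer determines what can still be collected, so the state has to record which gold remains. One way to do this, and the idea behind the 75-state solution below, is to stack three copies ("layers") of the grid. The sketch that follows illustrates the index arithmetic only. The names `to_index` and `arrive` are hypothetical, and the layer ordering is an assumption chosen to match the indices 37, 73, 48, and 62 that appear in `Gridworld1`.

```python
SQUARES_PER_LAYER = 25
BOMB = 18

# Layer 0: both pieces of gold remain.
# Layer 1: the gold on square 12 has been collected (only 23 remains).
# Layer 2: the gold on square 23 has been collected (only 12 remains).
REMAINING_GOLD = {0: {12, 23}, 1: {23}, 2: {12}}
LAYER_AFTER_COLLECTING = {12: 1, 23: 2}


def to_index(layer, square):
    """Flat state index of `square` within `layer`."""
    return layer * SQUARES_PER_LAYER + square


def arrive(layer, square):
    """Return (state_index, reward, is_terminal) for arriving at `square`."""
    reward = -1  # cost of the navigation action
    if square == BOMB:
        return to_index(layer, square), reward - 10, True
    if square in REMAINING_GOLD[layer]:
        reward += 10
        if layer == 0:
            # First piece collected: continue in the layer where it is gone.
            new_layer = LAYER_AFTER_COLLECTING[square]
            return to_index(new_layer, square), reward, False
        return to_index(layer, square), reward, True  # last piece collected
    return to_index(layer, square), reward, False


print(arrive(0, 12))  # first gold: jump to layer 1
print(arrive(1, 23))  # second gold: terminal
```

The policy and values for the original problem are then the first 25 entries of the 75-state solution, since the agent starts with both pieces of gold uncollected.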

In [10]:
class Gridworld1:
    def __init__(self):
        self.num_rows = 5
        self.num_cols = 5
        self.num_grids = 3

        self.gold_reward = 10
        self.bomb_reward = -10
        
        self.gold_positions = np.array([12,23,37,73])
        
        self.gold_positions_terminal = np.array([37,73])
        

        self.bomb_positions = np.array([18, 43, 68])

        

        self.actions = ["UP", "RIGHT", "DOWN", "LEFT"]
        self.num_actions = len(self.actions)
        self.num_fields = self.num_cols * self.num_rows * self.num_grids

        self.rewardsW1 = np.array([12,23])
        self.rewardsw2 = np.array([48])
        self.rewardsw3 = np.array([62])
        
        
        self.terminalw2 = np.array([48])
        self.terminalw3 = np.array([62])
        
        
        self.step = 0
        self.cumulative_reward = 0
        self.agent_position = np.random.randint(0, 5)
        self.availableMoves = np.zeros(self.num_actions)
        self.stochasticMoves = np.zeros(shape=(self.num_fields,self.num_actions))
        self.gridNum = 1
        self.terminalStates = self.bomb_positions
        
    def gridSelect(self, state):
        # Collecting the gold on square 12 or 23 moves the agent to the same
        # square in the grid copy where that gold is gone (indices 37 and 73).
        gold1, gold2, gold3, gold4 = self.gold_positions
        if state == gold1:
            self.gridNum = 2
            self.terminalStates = np.append(self.terminalStates, gold4)
            return gold3

        elif state == gold2:
            self.gridNum = 3
            self.terminalStates = np.append(self.terminalStates, gold3)
            return gold4
        return state
    
    
    def getRewards (self, state):
        self.rewards = np.zeros(shape=self.num_fields)
        self.rewards[self.bomb_positions] = self.bomb_reward
        
        if state < 25:
            self.rewards[self.rewardsW1] = self.gold_reward
        elif state < 50:
            self.rewards[self.rewardsw2] = self.gold_reward
        else:
            self.rewards[self.rewardsw3] = self.gold_reward
        return self.rewards
    
    def terminal (self, state):
        
        self.terminalStates = np.append(self.bomb_positions,self.terminalw2)
        
        
        self.terminalStates = np.append(self.terminalStates, self.terminalw3)
    
        return self.terminalStates
        
    
    def availablePositions(self, oldstate):
        # Determine new available positions and check whether the agent hits a wall.
        #[up,right,down,left]
    
        #check world and make sure dont transition between worlds
        
        if oldstate < 25:
            world = 1
            upper = 25
            lower = 0
        elif oldstate < 50:
            world = 2
            upper = 50
            lower = 25
        else:
            world =3
            upper = 75
            lower = 50
        
        #up
        candidate_position = oldstate + self.num_cols
        if candidate_position < upper:
            self.availableMoves[0] = candidate_position
        else:
            self.availableMoves[0] = oldstate
        
        #right
        candidate_position = oldstate + 1
        if candidate_position % self.num_cols > 0:
            self.availableMoves[1] = candidate_position
        else:
            self.availableMoves[1] = oldstate
        
        #down
        candidate_position = oldstate - self.num_cols
        if candidate_position >= lower:
            self.availableMoves[2] = candidate_position
        else:
            self.availableMoves[2] = oldstate
        
        #left
        candidate_position = oldstate - 1
        if candidate_position % self.num_cols < self.num_cols - 1:
            self.availableMoves[3] = candidate_position
        else:
            self.availableMoves[3] = oldstate
        
        
        return self.availableMoves

    def reset(self):
        self.agent_position = np.random.randint(0, 5)

    def is_terminal_state(self):
        # The following statement returns a boolean. It is 'True' when the agent_position
        # coincides with any bomb_positions or gold_positions.
        #return np.append(self.bomb_positions, self.gold_positions)
        return np.append(self.bomb_positions, self.gold_positions_terminal)

In [11]:
class vcalc1:
    def __init__(self, environment, epsilon=0.05, alpha=0.1, gamma=1):
        self.environment = environment

        self.valtable = np.zeros(shape=(self.environment.num_fields))
        self.transProb = np.zeros(shape=(self.environment.num_fields, self.environment.num_actions,self.environment.num_fields))
        self.bestPolicy = np.empty(shape=(self.environment.num_fields),dtype=str)

        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
    
    def buildValTableStoch(self,state,positions, environment):      
        
        positions = [int(x) for x in positions]
        
        reward = np.empty(len(positions))
        for i in range(len(positions)):
            
            reward[i] = environment.getRewards(state)[int(positions[i])]-1
            
        terminal = environment.terminal(state)
        
        #check if terminal state
        if state in terminal:
            
            self.valtable[state] = 0
            
            #so passes assessment check
            self.bestPolicy[state] = 'n'
            
        else:
            availableVs = self.valtable[positions]

            reward_x_v = reward + availableVs
                        
            #calculate stochastic v values for each position
            stochV = np.zeros(4)
            
            stochV[0] = 0.85*reward_x_v[0] + 0.05*reward_x_v[1] + 0.05*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[1] = 0.05*reward_x_v[0] + 0.85*reward_x_v[1] + 0.05*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[2] = 0.05*reward_x_v[0] + 0.05*reward_x_v[1] + 0.85*reward_x_v[2] +0.05*reward_x_v[3]

            stochV[3] = 0.05*reward_x_v[0] + 0.05*reward_x_v[1] + 0.05*reward_x_v[2] +0.85*reward_x_v[3]
            
        
            stochV = np.array(stochV)
            
            max_v_value = np.max(stochV)
            
            
            max_v_pos_list = list(np.nonzero(stochV == max_v_value))[0]
            
            #calculate v table and pos table
            
            max_v_pos = random.choice(max_v_pos_list)

            
            policies = ['n','e','s','w']
            
            self.valtable[state] = max_v_value
            self.bestPolicy[state] = policies[max_v_pos]

        
        return self.valtable,self.bestPolicy

In [12]:
def play2gold(environment, episodes=500):
    

    stoch = vcalc1(environment)
    for episode in range(0, episodes):
        environment.reset()

        #while step < max_steps_per_episode and not game_over:
        for state in range(environment.num_fields):
            #if one gold selected move to next world
            if state == 12:
                positions = environment.availablePositions(37)
            elif state == 23: 
                positions = environment.availablePositions(73)
            else:
                positions = environment.availablePositions(state)
            

            vtabstoch, bestpolstoch = stoch.buildValTableStoch(state, positions, environment)


    return vtabstoch, bestpolstoch

In [13]:
# Please write your code for Exercise 3 here. Your code should compute the 
# values of policy and v from scratch when this cell is executed, using the value 
# iteration algorithm. We will mark your coursework by checking the values of 
# the variables policy and v in the following cell. 
environment = Gridworld1()


v,policy = play2gold(environment, episodes=500)


policy = policy[:25]

v = v[:25]


print(np.fliplr(policy[::-1].reshape(5,5)))

print(np.fliplr(v[::-1].reshape(5,5)))


[['e' 'e' 'e' 'w' 'w']
 ['e' 's' 's' 'n' 'n']
 ['e' 'e' 'n' 'w' 'w']
 ['e' 'n' 'n' 'w' 'w']
 ['n' 'n' 'n' 'n' 'w']]
[[10.65103994 11.79603433 13.00848756  4.28547547 12.97048714]
 [11.1861353  12.32932372 12.5121464   0.         10.61568556]
 [12.28702748 13.59365638  4.99441863 12.41930416 11.19974428]
 [11.17522837 12.35165962 13.59092292 12.28122217 11.051285  ]
 [10.06409806 11.17488271 12.28045984 11.10816476  9.99389366]]


In [14]:
# Do not delete this cell. Your code for Exercise 3 is tested here.
