# Planning-Lab Exam

## Exercise 1

Consider the following environment:

<img src="images/frozen_lake.gif"/>

The agent starts in cell $(0, 0)$ and has to cross a frozen lake to reach the gift in cell $(3,3)$ by walking over the Frozen (F) cells without falling into any Hole (H). Besides avoiding the holes (cells $(1,1)$, $(1,3)$, $(2,3)$ and $(3,0)$), the agent has to reach its destination as quickly as possible, since walking on the frozen surface hurts it.

However, the agent may not always move in the chosen direction, because the frozen lake is slippery! It moves in the chosen direction with probability 1/2; otherwise it slips into one of the two perpendicular directions, each with probability 1/4.
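
The slip model described above can be sketched in a few lines. This is only an illustration, not the environment's actual implementation: the helper `slip_distribution` and the tables `MOVES` and `PERPENDICULAR` are hypothetical names, and the rule that a move off the grid leaves the agent in place is an assumption.

```python
# Action encoding used by the environment: {0: 'L', 1: 'R', 2: 'U', 3: 'D'}
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # (row delta, col delta)
PERPENDICULAR = {0: (2, 3), 1: (2, 3), 2: (0, 1), 3: (0, 1)}


def slip_distribution(pos, action, rows=4, cols=4):
    """Return {next_pos: probability} for one slippery move.

    Assumption: a move that would leave the grid keeps the agent in place.
    """
    # Intended direction with probability 1/2, each perpendicular one with 1/4.
    outcomes = [(action, 0.5)] + [(a, 0.25) for a in PERPENDICULAR[action]]
    dist = {}
    for a, p in outcomes:
        dr, dc = MOVES[a]
        r, c = pos[0] + dr, pos[1] + dc
        nxt = (r, c) if 0 <= r < rows and 0 <= c < cols else pos
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist


print(slip_distribution((2, 1), 3))
# {(3, 1): 0.5, (2, 0): 0.25, (2, 2): 0.25}
```

Choosing "down" in cell $(2,1)$ therefore never leads into the hole at $(1,1)$, while choosing "left" or "right" does so with probability 1/4.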

In [2]:
import os, sys
# Examples of env.grid usage
module_path = os.path.abspath(os.path.join('tools'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym, envs
from utils.ai_lab_functions import *
from utils.exam_checker import *
import numpy as np
from timeit import default_timer as timer
from tqdm import tqdm as tqdm

env_name = 'SlipperyFrozenLakeEnv-v0'
env = gym.make(env_name)

env.render()

print("\nActions encoding: ", env.actions)

# Remember that you can know the type of a cell whenever you need by accessing the grid element of the environment:
print("Cell type of start state: ",env.grid[env.startstate])
print("Cell type of goal state: ",env.grid[env.goalstate])
state = 7 # a generic state of the environment,
print("Cell type of start state: ",env.grid[env.startstate])
print(f"Cell type of cell {state}: ",env.grid[state])

[['S' 'F' 'F' 'F']
 ['F' 'H' 'F' 'H']
 ['F' 'F' 'F' 'H']
 ['H' 'F' 'F' 'G']]

Actions encoding:  {0: 'L', 1: 'R', 2: 'U', 3: 'D'}
Cell type of start state:  S
Cell type of goal state:  G
Cell type of start state:  S
Cell type of cell 7:  H


### 1.1) Given the environment reported above, and assuming that the discount factor is set to 0.9, find the optimal way for the agent to reach the treasure and print the solution using the provided code: [20 Points]

In [3]:
def your_solution(environment, maxiters=300, discount=0.9, max_error=1e-3): #feel free to add all the arguments you need!
    
    # YOUR CODE HERE
    U_1 = [0 for _ in range(environment.observation_space.n)] # utilities computed in the current sweep
    U = U_1.copy() # utilities from the previous sweep

    # Value iteration: sweep over all states until the utilities stop changing (at most maxiters sweeps)
    for _ in range(maxiters):
        delta = 0 # maximum change in the utility of any state in this sweep
        U = U_1.copy()

        for state in range(environment.observation_space.n): 
            
            prob_vect = []

            for a in range(environment.action_space.n):
                p = 0
                for s in range(environment.observation_space.n):
                    p += environment.T[state, a, s] * U[s]
                prob_vect.append(p)
            
            if environment.grid[state] == 'H' or environment.grid[state] == 'G':
                U_1[state] = environment.RS[state]
            else:
                U_1[state] = environment.RS[state] + discount * max(prob_vect)    

            if abs(U_1[state]-U[state]) > delta:
                delta = abs(U_1[state]-U[state])
                
        if delta < (max_error * (1 - discount)/discount):
            break
        
    # Convert the utility vector U into a policy: a 1-d array of action identifiers,
    # where the i-th action refers to the i-th state.
    return values_to_policy(np.asarray(U), environment)
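
The update and the stopping rule used in the function above can be written compactly. This is the standard value-iteration formulation; the symbols $U_i$ (utilities after sweep $i$), $\delta$ and $\epsilon$ (the `max_error` argument) are notation introduced here for illustration.

$$
U_{i+1}(s) =
\begin{cases}
R(s) & \text{if } s \text{ is a hole or the goal} \\
R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U_i(s') & \text{otherwise}
\end{cases}
$$

$$
\delta = \max_{s} \left| U_{i+1}(s) - U_i(s) \right| < \epsilon \, \frac{1 - \gamma}{\gamma}
\quad \Longrightarrow \quad
\max_{s} \left| U_{i+1}(s) - U^{*}(s) \right| < \epsilon
$$

With $\epsilon = 10^{-3}$ and $\gamma = 0.9$ the loop stops once $\delta < 10^{-3} \cdot 0.1 / 0.9 \approx 1.1 \cdot 10^{-4}$.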

In [34]:
t = timer()

solution = your_solution(env)

#print(env.T[env.pos_to_state(0, 1), 0, env.pos_to_state(0, 0)])

print("\nEXECUTION TIME: \n{}\n".format(round(timer() - t, 4)))

# this function converts your solution to the corresponding 2-d action matrix:
solution_render = np.vectorize(env.actions.get)(solution.reshape(env.rows, env.cols)) 
check_1_1(env_name, solution_render)


EXECUTION TIME: 
0.0113

Environment: SlipperyFrozenLakeEnv-v0

Your solution:
[['D' 'R' 'D' 'L']
 ['D' 'D' 'D' 'L']
 ['R' 'D' 'D' 'D']
 ['R' 'R' 'R' 'L']]

Your solution is correct!


### 1.2) Now focus on state (2,1) and assume the value function is given (see code). Discuss which action is optimal for that specific state and why. Use code to motivate your answer: [15 Points]
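
As a hand check that does not need the environment object, the expected value of each action in cell $(2,1)$ can be computed directly from the slip probabilities stated in the problem (1/2 for the intended direction, 1/4 for each perpendicular one). The sketch below assumes row-major state numbering (state = 4 · row + column, so cell $(2,1)$ is state 9); the `neighbour` and `perpendicular` tables are hypothetical helpers written for this example.

```python
actions = {0: "L", 1: "R", 2: "U", 3: "D"}
value_function = [-0.33, -0.36, -0.38, -0.38, -0.3, -1, -0, -1,
                  -0.22, 0.02, -0.03, -1, -1, 0.23, 0.52, 1]

# Neighbours of state 9, i.e. cell (2,1), on the 4x4 grid (assumed row-major).
neighbour = {0: 8, 1: 10, 2: 5, 3: 13}            # reached by L, R, U, D
perpendicular = {0: (2, 3), 1: (2, 3), 2: (0, 1), 3: (0, 1)}

expected_value = {}
for a in actions:
    side_a, side_b = perpendicular[a]
    expected_value[a] = (0.5 * value_function[neighbour[a]]
                         + 0.25 * value_function[neighbour[side_a]]
                         + 0.25 * value_function[neighbour[side_b]])

for a, v in expected_value.items():
    print(f"{actions[a]}: {v:+.4f}")

best_action = max(expected_value, key=expected_value.get)
print("best action:", best_action, actions[best_action])
```

Under these assumptions "down" is the only action with a positive expected value (0.0525): its intended outcome is cell $(3,1)$ with value 0.23, and its two slips lead to $(2,0)$ and $(2,2)$, never into the hole at $(1,1)$. "Left" and "right" both carry a 1/4 chance of slipping up into that hole, and "up" moves into it half of the time.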

In [4]:
actions = {0: "L", 1: "R", 2: "U", 3: "D"}

value_function = [-0.33, -0.36, -0.38, -0.38, -0.3, -1, -0, -1, -0.22, 0.02, -0.03, -1, -1, 0.23, 0.52, 1]
id_start_state = 9
gamma = 0.9


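values_ex = [0, 0, 0, 0]

'''
YOUR CODE HERE
'''
# Expected utility of each action in state 9, i.e. cell (2,1): sum over s' of T(s, a, s') * U(s').
# R(s) and gamma are the same for every action, so they do not change which action is best.
for a in actions:
    values_ex[a] = sum(env.T[id_start_state, a, s] * value_function[s]
                       for s in range(len(value_function)))

correct_action = np.argmax(values_ex)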
    
print(f'The correct action to perform should be: {correct_action}')
check_2_1(correct_action)

The correct action to perform should be: 3
Your solution is correct!


### 1.3) Now suppose that we want the agent to avoid holes at all costs, as shown in the policy below:
<img src="images/policy.png"/>

### Change the reward array in order to force the agent to behave in this specific way. [15 Points]
(the start and goal positions are the same as in exercise 1.1)

In [5]:
env_name = 'ModifiableFrozenLakeEnv-v0'
env = gym.make(env_name)

# Change the reward specification below; the reward array will be updated by a dedicated function:
rewards = {"F": -0.04, "S": -0.04, "H": -5, "G": 4} 
env.RS = update_reward_array(rewards)

solution = your_solution(env) #Use your solution from exercise 1.1 to test your changes
solution_render = np.vectorize(env.actions.get)(solution.reshape(env.rows, env.cols)) 
check_1_3(env_name, solution_render)

Environment: ModifiableFrozenLakeEnv-v0

Your solution:
[['L' 'U' 'U' 'U']
 ['L' 'L' 'D' 'U']
 ['U' 'D' 'L' 'D']
 ['R' 'R' 'R' 'L']]

Your solution is correct!


## Exercise 2

Now consider the following environment:

<img src="images/frozen_lake_big.png" style="zoom: 30%;"/>

The lake is much bigger, but the actions are no longer stochastic. **S=(0,0)** and **G=(7,7)** are the starting and goal positions respectively. Consider the problem of finding a minimum-cost path from S to G, assuming the agent can move in the four directions (except when a wall is present) and that each movement has unit cost. Answer the following questions:

In [6]:
env_name = 'BigFrozenLakeEnv-v0'
env = gym.make(env_name)
env.render() 

[['S' 'W' 'F' 'F' 'F' 'F' 'W' 'F']
 ['F' 'W' 'F' 'W' 'F' 'F' 'F' 'F']
 ['F' 'F' 'W' 'W' 'F' 'F' 'F' 'F']
 ['W' 'F' 'F' 'F' 'F' 'W' 'F' 'F']
 ['F' 'F' 'F' 'W' 'F' 'F' 'F' 'W']
 ['F' 'W' 'W' 'F' 'W' 'F' 'W' 'F']
 ['F' 'W' 'F' 'F' 'W' 'F' 'W' 'F']
 ['F' 'F' 'F' 'W' 'F' 'F' 'F' 'G']]


### 2.1) First of all, verify whether the Euclidean distance (l2_norm) is an *admissible* heuristic in this environment. In particular, you should implement a function that checks whether the admissibility condition holds for every state. The function should return true if the heuristic is admissible and false otherwise. Keep in mind that every action has unit cost. [15 Points]

Hint: The function below can be used as an oracle to know the real cost of going from a node n to the goal state.

In [7]:
print("Cost from start to goal: ", real_cost_to_goal(env.startstate))
print("Cost from position (3,3) to goal: ", real_cost_to_goal(env.pos_to_state(3,3)))

Cost from start to goal:  14
Cost from position (3,3) to goal:  8
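
The oracle values above can be reproduced without the environment, which also gives a direct admissibility test: a heuristic is admissible if it never exceeds the true cost to the goal. The sketch below is a stand-in under stated assumptions, not the course's `real_cost_to_goal`: the grid is copied from the rendered output, and `true_costs` and `is_admissible` are hypothetical helper names. Since every move has unit cost, a breadth-first search from the goal yields the true cost of every reachable cell.

```python
from collections import deque
from math import hypot

GRID = [
    "SWFFFFWF",
    "FWFWFFFF",
    "FFWWFFFF",
    "WFFFFWFF",
    "FFFWFFFW",
    "FWWFWFWF",
    "FWFFWFWF",
    "FFFWFFFG",
]
GOAL = (7, 7)


def true_costs(grid, goal):
    """Breadth-first search from the goal: unit-cost distance of every reachable cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "W" and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                frontier.append((nr, nc))
    return dist


def is_admissible(grid, goal):
    """True if the Euclidean distance never exceeds the true cost to the goal."""
    return all(hypot(r - goal[0], c - goal[1]) <= cost
               for (r, c), cost in true_costs(grid, goal).items())


costs = true_costs(GRID, GOAL)
print(costs[(0, 0)], costs[(3, 3)])   # 14 8
print(is_admissible(GRID, GOAL))      # True
```

The result is expected: with four-directional unit moves the true cost is at least the Manhattan distance, which in turn is at least the Euclidean distance.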


#### From now on consider hole cells as inaccessible to the agent (they are in fact referred to as walls, "W", in the environment)

In [30]:
def check_euristic(environment):
    goalpos = environment.state_to_pos(environment.goalstate)
    for s in range(environment.observation_space.n): 
        if environment.grid[s] == "W":
            continue
        '''
        YOUR CODE HERE
        '''
        h_s = Heu.l2_norm(environment.state_to_pos(s), goalpos)
        for action in range(environment.action_space.n): # for every action, look at the successor state s'
            
            s1 = environment.sample(s, action) # s': the state reached by executing the action in state s
            
            if environment.grid[s1] == "W":
                continue
            
            # Consistency test with unit step cost: h(s) <= 1 + h(s').
            # Since h(goal) = 0, a consistent heuristic is also admissible.
            if h_s > 1 + Heu.l2_norm(environment.state_to_pos(s1), goalpos):
                return False
    return True # boolean result, as required

In [31]:
env = gym.make(env_name)
admissible = check_euristic(env)
if admissible is None:
    print("Provide a boolean value that states the admissibility of the heuristic!")
elif admissible:
    print("You proved that the heuristic is admissible.\n")
else:
    print("You proved that the heuristic is not admissible.\n")
check_2_1(check_euristic(env))

You proved that the heuristic is admissible.

Your solution is correct!



### 2.2) Given the results of exercise 2.1, choose the version of A* that finds a minimum cost path from S (0,0) to G (7,7) guaranteeing optimality: [20 Points]
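
For reference, tree-search A* is optimal whenever the heuristic is admissible, while graph-search A* additionally relies on consistency. The Euclidean distance is consistent under unit step costs, because one move changes the distance to the goal by at most 1, so a graph-search variant is also optimal here. The sketch below is a self-contained illustration with `heapq`, not the course's `PriorityQueue`/`Node` API: the grid is copied from the rendered output and `a_star_graph` is a hypothetical name.

```python
import heapq
from math import hypot

GRID = [
    "SWFFFFWF",
    "FWFWFFFF",
    "FFWWFFFF",
    "WFFFFWFF",
    "FFFWFFFW",
    "FWWFWFWF",
    "FWFFWFWF",
    "FFFWFFFG",
]
START, GOAL = (0, 0), (7, 7)


def a_star_graph(grid, start, goal):
    """A* graph search with the Euclidean heuristic and unit step costs."""
    rows, cols = len(grid), len(grid[0])

    def h(p):
        return hypot(p[0] - goal[0], p[1] - goal[1])

    frontier = [(h(start), 0, start)]   # heap entries: (f = g + h, g, position)
    best_g = {start: 0}
    parent = {start: None}
    closed = set()
    while frontier:
        _, g, pos = heapq.heappop(frontier)
        if pos in closed:
            continue                    # stale entry left behind by a cheaper path
        if pos == goal:
            path = []
            while pos is not None:
                path.append(pos)
                pos = parent[pos]
            return path[::-1], g
        closed.add(pos)
        for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            nr, nc = pos[0] + dr, pos[1] + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "W":
                continue
            nxt = (nr, nc)
            if nxt not in closed and g + 1 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g + 1
                parent[nxt] = pos
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None, float("inf")


path, cost = a_star_graph(GRID, START, GOAL)
print(cost)   # 14
```

The closed set keeps the number of expanded states bounded by the number of cells, at the price of storing every visited state.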

In [55]:
def present_with_higher_cost(queue, node):
    if node.state in queue:
        if queue[node.state].value > node.value: 
            return True
    return False

def a_star(environment): #feel free to add all the arguments you need!
    
    path = []
    time_cost = 1
    space_cost = 1
    
    '''
        YOUR CODE HERE
    '''
    goalpos = environment.state_to_pos(environment.goalstate)
    queue = PriorityQueue()
    
    queue.add(Node(environment.startstate))
    
    while True:
        
        if queue.is_empty(): 
            return None, time_cost, space_cost
        
        # Retrieve node from the queue
        node = queue.remove()
        if node.state == environment.goalstate: 
            return build_path(node), time_cost, space_cost
        
        # Look around
        for action in range(environment.action_space.n):
            
            p1 = environment.state_to_pos(environment.sample(node.state, action))
            
            # Child node: pathcost g is the parent's pathcost + 1; value is f = g + h, with h the L2 norm to the goal
            child = Node(environment.sample(node.state, action), node, node.pathcost + 1, Heu.l2_norm(p1, goalpos) + node.pathcost + 1)  
            time_cost += 1
            
            queue.add(child)
                
            if present_with_higher_cost(queue, child):
                queue.replace(child)
                
        space_cost = max(space_cost, len(queue))
    
    
    
    return path, time_cost, space_cost

In [56]:
t = timer()

# Uncomment the version you want to use!
# a_star_version = ""
a_star_version = "Tree search"  
# a_star_version = "Graph search"

path, time_cost, memory_cost = a_star(env)

print("\nEXECUTION TIME: \n{}\n".format(round(timer() - t, 4)))

print("Chosen A* version: ", a_star_version)
print("Solution: {}".format(solution_2_string(path, env)))
print("N° of nodes explored: {}".format(time_cost))
print("Max n° of nodes in memory: {}\n".format(memory_cost))

check_2_2(env_name, solution_2_string(path, env), time_cost, memory_cost)


EXECUTION TIME: 
0.5323

Chosen A* version:  Tree search
Solution: [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2), (3, 3), (3, 4), (4, 4), (4, 5), (5, 5), (6, 5), (7, 5), (7, 6), (7, 7)]
N° of nodes explored: 6473
Max n° of nodes in memory: 4855

Environment: BigFrozenLakeEnv-v0

Your solution is correct!


### 2.3) Now focus on the environment in figure:

   <img src="images/frozen_lake_left.png" style="zoom: 40%;"/>

### Assume that the actions are again fully deterministic, and that moving left (action id: 0) doubles the cost of that move for the agent. Choose the best way to compute the optimal solution assuming that no heuristic is given. Motivate your choice.  [15 Points]

In [59]:
env_name = 'LeftFrozenLakeEnv-v0'
env = gym.make(env_name)
env.render()

[['S' 'F' 'F' 'F' 'F' 'F']
 ['F' 'W' 'W' 'W' 'W' 'F']
 ['F' 'W' 'G' 'F' 'F' 'F']
 ['F' 'W' 'F' 'F' 'F' 'W']
 ['F' 'W' 'F' 'W' 'F' 'F']
 ['F' 'W' 'F' 'F' 'F' 'F']
 ['F' 'F' 'F' 'F' 'W' 'F']]


#### Pay attention: it is up to you to calculate the cost of each action! Recall that the encoding for actions is the following: {0: 'L', 1: 'R', 2: 'U', 3: 'D'}
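
With non-uniform step costs and no heuristic, uniform-cost search (UCS) is a natural choice: it expands nodes in order of path cost, so the first time the goal is removed from the frontier its cost is optimal. Breadth-first search would instead minimize the number of moves, which is not the same thing once moving left costs 2. The sketch below is a self-contained illustration with `heapq`, not the course's `PriorityQueue`/`Node` API: the grid is copied from the rendered output, and `uniform_cost_search`, `MOVES` and `STEP_COST` are hypothetical names.

```python
import heapq

GRID = [
    "SFFFFF",
    "FWWWWF",
    "FWGFFF",
    "FWFFFW",
    "FWFWFF",
    "FWFFFF",
    "FFFFWF",
]
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}   # L, R, U, D
STEP_COST = {0: 2, 1: 1, 2: 1, 3: 1}                      # moving left costs double


def uniform_cost_search(grid, start, goal):
    """Return (path, cost) of a cheapest path, expanding nodes by path cost."""
    rows, cols = len(grid), len(grid[0])
    frontier = [(0, start)]             # heap entries: (path cost, position)
    best = {start: 0}
    parent = {start: None}
    explored = set()
    while frontier:
        cost, pos = heapq.heappop(frontier)
        if pos in explored:
            continue                    # stale entry left behind by a cheaper path
        if pos == goal:
            path = []
            while pos is not None:
                path.append(pos)
                pos = parent[pos]
            return path[::-1], cost
        explored.add(pos)
        for action, (dr, dc) in MOVES.items():
            nr, nc = pos[0] + dr, pos[1] + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "W":
                continue
            new_cost = cost + STEP_COST[action]
            if new_cost < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_cost
                parent[(nr, nc)] = pos
                heapq.heappush(frontier, (new_cost, (nr, nc)))
    return None, float("inf")


path, cost = uniform_cost_search(GRID, (0, 0), (2, 2))
print(cost)   # 12
```

On this grid the route along the top row needs only 10 moves but three of them go left, for a cost of 13; the longer route along the bottom uses 12 moves without ever moving left and costs 12.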

In [62]:
def solution(environment): #feel free to add all the arguments you need!
    
    path = []
    time_cost = 1  # the initial cost is 1, not 0!
    space_cost = 1  # the initial cost is 1, not 0!
    
    '''
        YOUR CODE HERE
    '''
    queue = PriorityQueue()
    queue.add(Node(environment.startstate))
    
    explored = set()
    
    while True:
        if queue.is_empty(): 
            return None, time_cost, space_cost
        
        # Retrieve node from the queue
        node = queue.remove()  
        if node.state == environment.goalstate: 
            return build_path(node), time_cost, space_cost
        
        explored.add(node.state)
        
        # Look around
        for action in range(environment.action_space.n):
            
            if action == 0: # the action is 'left', so its cost is doubled -> +2
                child = Node(environment.sample(node.state, action), node, node.pathcost + 2 , node.pathcost + 2)
            else:
                # Child node where value and pathcost are both the pathcost of parent + 1
                child = Node(environment.sample(node.state, action), node, node.pathcost + 1, node.pathcost + 1)  
            time_cost += 1
            
            if child.state not in queue and child.state not in explored:
                queue.add(child)
                
            elif present_with_higher_cost(queue, child):
                queue.replace(child)
                
        space_cost = max(space_cost, len(queue) + len(explored))
    
    return path, time_cost, space_cost

In [63]:
t = timer()

path, time_cost, space_cost = solution(env)
strategy = 'UCS' #indicate the search strategy you chose to use

print("\nEXECUTION TIME: \n{}\n".format(round(timer() - t, 4)))

print("Search strategy: ", strategy)
print("Your solution: ", solution_2_string(path, env))
print("Time cost: ", time_cost)
print("Space cost: ", space_cost)

check_2_3(env_name, solution_2_string(path, env), time_cost, space_cost)


EXECUTION TIME: 
0.0048

Search strategy:  UCS
Your solution:  [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0), (6, 1), (6, 2), (5, 2), (4, 2), (3, 2), (2, 2)]
Time cost:  113
Space cost:  31
Environment: LeftFrozenLakeEnv-v0

Your solution is correct!
