## EXERCISE 1: What to do at the airport?

You are travelling and have some time to kill at the airport. There are three things you could spend your time doing:
  
1) You could have a coffee.

This has a probability of $0.8$ of giving you time to relax with a tasty beverage, and a utility of $10$. 
It also has a probability of $0.2$ of providing you with a nasty cup from over-roasted beans that annoys you,
an outcome with a utility of $-5$.

2) You could shop for clothes.

This has a probability of $0.1$ that you will find a great outfit at a good price, utility $20$. However, it 
has a probability of $0.9$ that you end up wasting money on over-priced junk, utility $-10$.

3) You could have a bite to eat.

This has a probability of $0.8$ that you find something rather mediocre that prevents you from being too hungry 
during your flight, utility $2$, and a probability of $0.2$ that you find something filling and tasty, utility $5$.

> __QUESTION 1(a):__ What should you do if you take the principle of maximum expected utility to be your decision criterion?

> __QUESTION 1(b):__ What should you do if you take maximax to be your decision criterion?

> __QUESTION 1(c):__ What should you do if you take maximin to be your decision criterion?
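
As a reference for the three criteria, write $P(o \mid a)$ for the probability of outcome $o$ under action $a$, $U(o)$ for its utility, and $O(a)$ for the set of outcomes of $a$ (this notation is assumed here; it is not part of the exercise text). The criteria can then be stated as:

$$
\begin{aligned}
\text{maximum expected utility:} \quad & \arg\max_a \sum_{o \in O(a)} P(o \mid a)\,U(o) \\
\text{maximax:} \quad & \arg\max_a \max_{o \in O(a)} U(o) \\
\text{maximin:} \quad & \arg\max_a \min_{o \in O(a)} U(o)
\end{aligned}
$$

For coffee, for example, the expected utility is $0.8 \cdot 10 + 0.2 \cdot (-5) = 7$, the best outcome has utility $10$, and the worst has utility $-5$.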
    

In [1]:
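import numpy as np

# Coffee: relax (utility 10, p=0.8) or annoying cup (utility -5, p=0.2)
coffee_outcomes=['relax','annoy']
u_coffee_outcomes=np.array([10,-5])
p_coffee_outcomes=np.array([0.8,0.2])

# Expected utility: sum over outcomes of probability * utility
eu_coffee=np.sum(p_coffee_outcomes*u_coffee_outcomes)

print("eu coffee: ", eu_coffee)

# Clothes: great outfit (utility 20, p=0.1) or over-priced junk (utility -10, p=0.9)
clothes_outcomes=['good','wasting']
u_clothes_outcomes=np.array([20,-10])
p_clothes_outcomes=np.array([0.1,0.9])

eu_clothes=np.sum(p_clothes_outcomes*u_clothes_outcomes)

print("eu clothes: ",eu_clothes)

# Food: tasty (utility 5, p=0.2) or mediocre (utility 2, p=0.8)
eat_outcomes=['tasty','mediocre']
u_eat_outcomes=np.array([5,2])
p_eat_outcomes=np.array([0.2,0.8])

eu_eat=np.sum(p_eat_outcomes*u_eat_outcomes)

print("eu eat: ",eu_eat)
print("-----------------")

# Per-action summaries, all listed in the same order as `actions`
actions=['coffee','clothes','eat']
eu=np.array([eu_coffee,eu_clothes,eu_eat])
best=np.array([np.max(u_coffee_outcomes),np.max(u_clothes_outcomes),np.max(u_eat_outcomes)])
worst=np.array([np.min(u_coffee_outcomes),np.min(u_clothes_outcomes),np.min(u_eat_outcomes)])

# MEU: highest expected utility; maximax: best best-case; maximin: best worst-case
print('choice meu: ',actions[np.argmax(eu)],np.max(eu))
print('choice maximax: ',actions[np.argmax(best)],np.max(best))
print('choice maximin: ',actions[np.argmax(worst)],np.max(worst))

eu coffee:  7.0
eu clothes:  -7.0
eu eat:  2.6
-----------------
choice meu:  coffee 7.0
choice maximax:  clothes 20
choice maximin:  eat 2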


## EXERCISE 2: Solving an MDP with MDP toolbox

We have four states and four actions.

The actions are: 0 is Right, 1 is Left, 2 is Up and 3 is Down.

The states are 0, 1, 2, 3, and they are arranged like this:
    
$$
\begin{array}{cc}
2 & 3\\
0 & 1\\
\end{array}
$$

The motion model provides:
*   0.8 probability of moving in the direction of the action,
*   0.1 probability of moving in each of the directions perpendicular to that of the action.

In the grid above, state 2 is Up from state 0, state 1 is Right of state 0, and so on. Every action (in any state) yields a reward of -0.04.

If a movement is "infeasible" (it would take the agent off the grid), the agent remains in its current state.

The reward for state 3 is 1, and the reward for state 1 is -1, and the agent does not leave those states.
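
The motion model above can also be turned into a transition array mechanically rather than by hand. The following is a minimal sketch under the assumptions stated in the exercise (2x2 grid, 0.8/0.1/0.1 motion model, infeasible moves leave the agent in place, states 1 and 3 absorbing). All names (`POS`, `MOVES`, `PERP`, `step`, `build_transitions`) are introduced here for illustration, and the `(actions, states, states)` layout is chosen to match the shape used later in this notebook.

```python
import numpy as np

# Grid layout from the exercise: state -> (row, col), with row 0 at the bottom.
POS = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}
STATE_AT = {pos: s for s, pos in POS.items()}

# Action index -> (d_row, d_col): 0 Right, 1 Left, 2 Up, 3 Down.
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}
# Directions perpendicular to each action.
PERP = {0: (2, 3), 1: (2, 3), 2: (0, 1), 3: (0, 1)}
# States the agent never leaves.
ABSORBING = {1, 3}


def step(state, direction):
    """Deterministic result of moving in `direction`; stay put if off-grid."""
    r, c = POS[state]
    dr, dc = MOVES[direction]
    return STATE_AT.get((r + dr, c + dc), state)


def build_transitions(p_main=0.8, p_side=0.1):
    """Return P with P[a, s, s'] = probability of reaching s' from s under a."""
    P = np.zeros((4, 4, 4))
    for a in range(4):
        for s in range(4):
            if s in ABSORBING:
                P[a, s, s] = 1.0
                continue
            P[a, s, step(s, a)] += p_main
            for side in PERP[a]:
                P[a, s, step(s, side)] += p_side
    return P


P = build_transitions()
print(P[0, 0])  # action Right taken in state 0
```

Every row of `P` should sum to 1, which is a quick sanity check before handing the array to a solver.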

Set the discount factor to 0.99.

> __QUESTION 2(a):__ What is the policy based on the Value iteration algorithm?

> __QUESTION 2(b):__ What is the policy based on the Policy iteration algorithm?

> __QUESTION 2(c):__ What is the policy based on the Q-Learning algorithm?

> __QUESTION 2(d):__ Using the **setVerbose**() function and the attributes of the MDP objects in MDPToolbox, compare the Value iteration and Policy iteration solutions in terms of the number of iterations (hint: see the iter attribute) and the CPU time used to reach a solution (hint: see the time attribute).
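
To make the idea behind Question 2(a) concrete, here is a minimal NumPy sketch of value iteration. It is a generic illustration, not the toolbox's implementation: the function name `value_iteration`, the stopping rule (`tol`), and the two-state toy problem are all assumptions made for this example, so its iteration counts should not be compared with the toolbox's `iter` attribute. The array layout (`P` as actions x states x states, `R` as states x actions) mirrors what this notebook passes to the toolbox.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """P: (A, S, S) transition probabilities; R: (S, A) rewards.

    Returns (state values, greedy policy, iterations used).
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for it in range(1, max_iter + 1):
        # Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V).T
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1), it


# Toy problem (hypothetical): state 1 is absorbing and pays 1 per step;
# in state 0, action 0 stays put and action 1 moves to state 1, both for 0 reward.
P_toy = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, 1.0], [0.0, 1.0]]])
R_toy = np.array([[0.0, 0.0],
                  [1.0, 1.0]])

V_toy, policy_toy, n_iter = value_iteration(P_toy, R_toy, gamma=0.9)
print(V_toy, policy_toy, n_iter)
```

In the toy problem the absorbing state is worth $1/(1-0.9) = 10$, so moving there from state 0 is worth $0.9 \cdot 10 = 9$ and the greedy policy in state 0 is action 1.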


In [2]:
!pip install pymdptoolbox
import mdptoolbox

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymdptoolbox
  Downloading pymdptoolbox-4.0-b3.zip (29 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pymdptoolbox
  Building wheel for pymdptoolbox (setup.py) ... done
  Created wheel for pymdptoolbox: filename=pymdptoolbox-4.0b3-py3-none-any.whl size=25656 sha256=46d32458902d5927e2e8f798c84d62c4747bc46bdef90a21153a0540a45abc34
  Stored in directory: /root/.cache/pip/wheels/2b/e7/c7/d7abf9e309f3573a934fed2750c70bd75d9e9d901f7f16e183
Successfully built pymdptoolbox
Installing collected packages: pymdptoolbox
Successfully installed pymdptoolbox-4.0b3


In [3]:
# P1[a][s][s']: probability of moving from state s to state s' under action a
# columns (next state): 0 1 2 3
P1=np.array([
            [[0.1,0.8,0.1,0], # action 0 (Right), state 0
            [0,1,0,0], # action 0, state 1
            [0.1,0,0.1,0.8], # action 0, state 2
            [0,0,0,1]], # action 0, state 3
             
            [[0.9,0,0.1,0], # action 1 (Left), state 0
             [0,1,0,0], # action 1, state 1
             [0.1,0,0.9,0], # action 1, state 2
             [0,0,0,1]], # action 1, state 3
             
            [[0.1,0.1,0.8,0], # action 2 (Up), state 0
             [0,1,0,0], # action 2, state 1
             [0,0,0.9,0.1], # action 2, state 2
             [0,0,0,1]], # action 2, state 3
             
            [[0.9,0.1,0,0], # action 3 (Down), state 0
             [0,1,0,0], # action 3, state 1
             [0.8,0,0.1,0.1], # action 3, state 2
             [0,0,0,1]], # action 3, state 3

             ])

# R1[s][a]: reward for taking action a in state s (rows: states, columns: actions)
R1=np.array([[-1, -0.04,-0.04,-0.04], [-1, -1, -1, -1], [1,-0.04,-0.04, -0.04], [1, 1, 1, 1]])

# Raises an error if P1 and R1 do not describe a valid MDP
mdptoolbox.util.check(P1, R1)

# Value iteration
vi=mdptoolbox.mdp.ValueIteration(P1,R1,0.99)
#vi.setVerbose()
print("--------------------------------")
vi.run()
print("policy value iteration: ",vi.policy)

# Policy iteration
pi=mdptoolbox.mdp.PolicyIteration(P1,R1,0.99)
#pi.setVerbose()
print("--------------------------------")
pi.run()
print("policy policy iteration: ",pi.policy)

# Q-learning samples transitions at random, so its policy can differ between runs
ql=mdptoolbox.mdp.QLearning(P1,R1,0.99)
#ql.setVerbose()
print("--------------------------------")
ql.run()
print("policy q-learning: ",ql.policy)

--------------------------------
policy value iteration:  (1, 0, 0, 0)
--------------------------------
policy policy iteration:  (1, 0, 0, 0)
--------------------------------
policy q-learning:  (2, 1, 0, 0)
