# CS486 - Artificial Intelligence
## Lesson 14 - Markov Decision Processes

*Expectimax* is a way to search a tree for the best action when outcomes are uncertain. In practice, however, we can rarely search all the way to the leaves of an expectimax tree.

`Markov Decision Processes` (MDPs) are a way of formulating problems such that we can use an *expectimax* approach to establish a **policy** for selecting optimal actions in states without having to perform a search every time.  

In [None]:
import helpers
from aima.mdp import *
from aima.notebook import psource

## A *Draw HiLo* Policy

Last time we used `expectimax` to decide which action to choose for a given draw. Our implementation limited us to 5 draws since, in theory, the game tree is infinite.

Instead of running `expectimax` every time a new card is drawn, let's see if we can use an MDP to create a **policy** which lists the best action to take in any given state. Here's what the transition graph looks like for the 5 draw:

<img src='images/hilo.svg'>

Let's use that information to build an MDP.
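
Before building the full table, it may help to check the numbers for a single card. The sketch below assumes 13 ranks that are each equally likely on the next draw, which matches the transition table built in the following cells; the `outcome_probabilities` helper is hypothetical and only illustrates the arithmetic, taking a card of 5 as an example.

```python
from fractions import Fraction

RANKS = 13  # assumption: ranks 1..13, each equally likely on the next draw

def outcome_probabilities(card, guess):
    """Return (P(win), P(lose)) for guessing 'higher' or 'lower' on `card`."""
    higher = Fraction(RANKS - card, RANKS)  # ranks strictly above `card`
    lower = Fraction(card - 1, RANKS)       # ranks strictly below `card`
    return (higher, lower) if guess == "higher" else (lower, higher)

p_win, p_lose = outcome_probabilities(5, "higher")
print(p_win, p_lose)       # 8/13 4/13
print(1 - p_win - p_lose)  # 1/13: the chance of drawing the same rank again
```

Note that the win and lose probabilities sum to 12/13 rather than 1, because drawing the same rank again is neither strictly higher nor strictly lower.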

In [None]:
psource(MDP)

In [None]:
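# Reward attached to each named state.
rewards = {"bet": -1, "win": 1, "lose": -1}
# Actions available in each named state.
actions = {"win": ["draw"], "lose": ["exit"], "bet": ["draw"]}
# transitions[state][action] is a list of [probability, next_state] pairs.
transitions = {"win": {"draw": []}, "lose": {"exit": []}, "bet": {"draw": []}}

for card in range(1, 14):
    # Card states carry no reward of their own; the player guesses from them.
    rewards[card] = 0
    actions[card] = ["higher", "lower"]

    # Drawing from "win" or "bet" reveals each of the 13 ranks with equal probability.
    transitions["win"]["draw"].append([1/13, card])
    transitions["bet"]["draw"].append([1/13, card])

    # (13 - card) ranks are higher and (card - 1) are lower; the two
    # probabilities sum to 12/13 because no outcome is listed for a repeated rank.
    transitions[card] = {
        "higher": [[(13-card)/13, "win"], [(card-1)/13, "lose"]],
        "lower":  [[(13-card)/13, "lose"], [(card-1)/13, "win"]]
    }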

class HiLo(MDP):
    def __init__(self):
        MDP.__init__(
            self,
            init="bet",
            actlist=actions,
            terminals=["lose"], 
            transitions=transitions, 
            reward=rewards, 
            states=None, 
            gamma=1)

    def actions(self,state):
        return self.actlist[state]
    
hilo = HiLo()

So now we have our MDP, and we want to find the policy: the optimal action to take from each state. To do that, we need to know the **expected utility** of each state. We can use **value iteration** on our MDP to compute it. Value iteration is a method for solving the **Bellman equations**, which characterize the optimal expected value of each state.

$$ V_{0}(s) \leftarrow 0 \quad \text{for all } s $$

$$ V_{k+1}(s) \leftarrow \max_{a} \sum_{s'}T(s,a,s')\left[R(s,a,s')+\gamma V_{k}(s')\right] $$
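
To make the update concrete, here is a minimal pure-Python sketch that applies it a fixed number of times. It is not the `aima` implementation: the `value_iteration_sketch` function, the `toy_transitions` table, and the `toy_reward` function are all hypothetical, and the reward is written as $R(s,a,s')$ to match the equation above.

```python
def value_iteration_sketch(transitions, reward, gamma, sweeps):
    """Apply the update above `sweeps` times, starting from V_0(s) = 0."""
    V = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        V_next = {}
        for s, actions in transitions.items():
            if not actions:  # terminal state: nothing left to collect
                V_next[s] = 0.0
                continue
            # Max over actions of the expected (reward + discounted future value).
            V_next[s] = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2]) for p, s2 in outcomes)
                for a, outcomes in actions.items()
            )
        V = V_next
    return V

# Hypothetical toy MDP: from "start", "safe" ends the game for a reward of 1;
# "risky" ends it for a reward of 3 half the time and otherwise returns to "start".
toy_transitions = {
    "start": {"safe": [(1.0, "end")],
              "risky": [(0.5, "end"), (0.5, "start")]},
    "end": {},
}

def toy_reward(s, a, s2):
    if a == "safe":
        return 1.0
    return 3.0 if s2 == "end" else 0.0

V = value_iteration_sketch(toy_transitions, toy_reward, gamma=0.5, sweeps=50)
print(round(V["start"], 6))  # 2.0
```

In this toy problem the fixed point can be checked by hand: taking "risky" gives $V = 1.5 + 0.25V$, so $V = 2$, which beats the value of 1 for "safe".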

We perform value iteration on our MDP using the `value_iteration` function, which iterates until the difference between the expected values in two successive iterations is less than a threshold set by $\epsilon$:

In [None]:
psource(value_iteration)

In [None]:
expected_utilities = value_iteration(hilo)
expected_utilities

So the expected value of a bet in our version of *Draw HiLo* is $0.71$. But if you must play, what set of actions, or policy, should you follow?

In [None]:
best_policy(hilo,expected_utilities)
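
Extracting a policy from a table of expected utilities amounts to a one-step lookahead: in each state, pick the action whose expected utility is highest. The sketch below illustrates that idea and is not the `aima` implementation; the `greedy_policy` name, the utilities, and the two-card transition table are hypothetical.

```python
def greedy_policy(utilities, transitions):
    """Map each state to the action with the highest expected utility."""
    policy = {}
    for state, actions in transitions.items():
        policy[state] = max(
            actions,
            key=lambda a: sum(p * utilities[s2] for p, s2 in actions[a]),
        )
    return policy

# Hypothetical utilities for the two outcome states, and two card states
# whose probabilities follow the (13 - card)/13 and (card - 1)/13 pattern.
utilities = {"win": 1.0, "lose": -1.0}
transitions = {
    2:  {"higher": [(11/13, "win"), (1/13, "lose")],
         "lower":  [(1/13, "win"), (11/13, "lose")]},
    12: {"higher": [(1/13, "win"), (11/13, "lose")],
         "lower":  [(11/13, "win"), (1/13, "lose")]},
}
print(greedy_policy(utilities, transitions))  # {2: 'higher', 12: 'lower'}
```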

## Discounting

Consider the following MDP for a grid world. You start in the middle of three grid cells and you can move left, move right, or stay where you are. If you move left or right you fall into a pit and die. There is a reward for every step that you stay alive.

Let's define it:

In [None]:
class GridLocked(MDP):
    def __init__(self, gamma=1):
        MDP.__init__(
            self,
            init=(1,0), 
            actlist={
                (0,0): ["exit"],
                (1,0): ["stay","left","right"],
                (2,0): ["exit"]
            },
            terminals=[(0,0),(2,0)],
            transitions={
                (0,0): { "exit": [] },
                (1,0): {
                    "stay":  [[1,(1,0)]],
                    "left":  [[1,(0,0)]],
                    "right": [[1,(2,0)]]
                },
                (2,0): { "exit": [] },
            }, 
            reward={
                (0,0): -100,
                (1,0): 1,
                (2,0): -100
            }, 
            states=None, 
            gamma=gamma)

    def actions(self,state):
        return self.actlist[state]

What happens when we use value iteration to compute the expected utility value vector for the MDP?

In [None]:
gridlocked = GridLocked()
best_policy(gridlocked,value_iteration(gridlocked))

There's no way to exit! We can either stay still and rack up points or move and die. This is the same problem we saw with `expectimax`: if there is a never-ending path, the algorithm will keep searching it. So what do we do? We discount!

In [None]:
gridlocked = GridLocked(gamma=0.9)
best_policy(gridlocked,value_iteration(gridlocked))

With discounting, each additional step contributes less than the one before, and we only continue to iterate so long as each iteration brings a sufficient change in value.
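
The effect of the stopping rule can be seen in isolation. The sketch below models only the "stay" action of the middle cell, assuming a reward of 1 per step, and repeats the update $V \leftarrow 1 + \gamma V$ until the change is small. The `sweeps_until_stable` helper, its `tolerance`, and its `max_sweeps` cap are hypothetical choices for illustration, not the criterion used by `value_iteration`.

```python
def sweeps_until_stable(gamma, step_reward=1.0, tolerance=1e-3, max_sweeps=10_000):
    """Repeat V <- step_reward + gamma * V until V changes by less than `tolerance`.

    Returns (sweeps, V), or (None, V) if `max_sweeps` is reached first.
    """
    V = 0.0
    for sweep in range(1, max_sweeps + 1):
        V_next = step_reward + gamma * V
        if abs(V_next - V) < tolerance:
            return sweep, V_next
        V = V_next
    return None, V

print(sweeps_until_stable(gamma=0.9))  # stops; V is close to 1 / (1 - 0.9) = 10
print(sweeps_until_stable(gamma=1.0))  # (None, 10000.0): every sweep adds 1
```

With $\gamma < 1$ the rewards form a geometric series that sums to $1/(1-\gamma)$, so the change per sweep shrinks toward zero. With $\gamma = 1$ every sweep adds a full reward and the change never falls below the tolerance.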

## Grid World

Let's answer the following questions:

* How much time does discounting save us?
* How much discounting can we apply and still get an optimal solution?

Use the following `GridMDP` to answer these questions. The grid defines the following maze, which has a reward in the upper right corner:

![title](aima/images/maze.png)

In [None]:
%%time 
from utils import print_table

maze = GridMDP(grid = [
    [None, None, None, None, None, None, None, None, None, None, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, +5.0, None], 
    [None, -0.1, None, None, None, None, None, None, None, -0.1, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], 
    [None, -0.1, None, None, None, None, None, None, None, None, None], 
    [None, -0.1, None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None], 
    [None, -0.1, None, None, None, None, None, -0.1, None, -0.1, None], 
    [None, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1, None, -0.1, None], 
    [None, None, None, None, None, -0.1, None, -0.1, None, -0.1, None], 
    [None, -5.0, -0.1, -0.1, -0.1, -0.1, None, -0.1, None, -0.1, None], 
    [None, None, None, None, None, None, None, None, None, None, None]
], terminals=[(9, 9)], gamma=1)

expected_values = value_iteration(maze)
pi = best_policy(maze, expected_values)
print_table(maze.to_arrows(pi))