# Module 11 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Reinforcement Learning with Value Iteration

These are the same maps from Module 1 but the "physics" of the world have changed. In Module 1, the world was deterministic. When the agent moved "south", it went "south". When it moved "east", it went "east". Now, the agent only succeeds in going where it wants to go *sometimes*. There is a probability distribution over the possible states so that when the agent moves "south", there is a small probability that it will go "east", "north", or "west" instead and have to move from there.

There are a variety of ways to handle this problem. For example, with A\* search, if the agent finds itself off the solution path, you can simply calculate a new solution from where the agent ended up. Although this sounds like a really bad idea, it has actually been shown to work really well in video games that use formal planning algorithms (which we will cover later). When these algorithms were first designed, this was unthinkable. Thank you, Moore's Law!

Another approach is to use Reinforcement Learning which covers problems where there is some kind of general uncertainty in the actions. We're going to model that uncertainty a bit unrealistically here but it'll show you how the algorithm works.

As far as RL is concerned, there are a variety of options there: model-based and model-free, Value Iteration, Q-Learning and SARSA. You are going to use Value Iteration.

## The World Representation

As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:

```
token   terrain    cost 
.       plains     1
*       forest     3
^       hills      5
~       swamp      7
x       mountains  impassable
```

When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges).
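
To make the graph view concrete, here is a minimal sketch of how the neighbors of a cell and the cost of entering each one could be enumerated. The names `terrain_costs` and `neighbors` are hypothetical and not part of the assignment; cells are addressed as `(x, y)` with `world[y][x]`, which is an assumption.

```python
# Positive entry costs taken from the terrain table above.
terrain_costs = {'.': 1, '*': 3, '^': 5, '~': 7}

def neighbors(world, x, y):
    """Yield ((nx, ny), cost) for each passable cell north, east, south, west of (x, y)."""
    for dx, dy in [(0, -1), (1, 0), (0, 1), (-1, 0)]:
        nx, ny = x + dx, y + dy
        # Cells off the grid have no edge.
        if 0 <= ny < len(world) and 0 <= nx < len(world[ny]):
            terrain = world[ny][nx]
            # Mountains ('x') have no cost entry, so they are skipped.
            if terrain in terrain_costs:
                yield (nx, ny), terrain_costs[terrain]

tiny_world = [['.', '*'],
              ['x', '~']]
print(list(neighbors(tiny_world, 0, 0)))
```

The cost depends only on the cell being entered, which is why moving from plains to forest costs 3 while the reverse costs 1.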

There are quite a few differences between A\* Search and Reinforcement Learning but one of the most salient is that A\* Search returns a plan of N steps that gets us from A to Z, for example, A->C->E->G.... Reinforcement Learning, on the other hand, returns a *policy* that tells us the best thing to do in **every state.**

For example, the policy might say that the best thing to do in A is go to C. However, we might find ourselves in D instead. The policy covers this possibility: it might say D->E. Trying this action might land us in C, and the policy will say C->E, etc. At least with offline learning, everything will be learned in advance (in online learning, you can only learn by doing and so you may act according to a known but suboptimal policy).

Nevertheless, if you were asked for a "best case" plan from (0, 0) to (n-1, n-1), you could (and will) be able to read it off the policy because there is a best action for every state. You will be asked to provide this in your assignment.
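
As a rough sketch of what "reading the plan off the policy" could look like, the function below follows the preferred action from a start state and assumes every move succeeds. It assumes the policy is a `Dict` of `(x, y)` to `(dx, dy)` offsets; the name `read_plan` is hypothetical.

```python
def read_plan(policy, start, goal):
    """Follow the policy from start to goal, assuming each move succeeds.

    Returns the list of visited states, or None if the policy has no
    action for a state on the way or leads back to a visited state.
    """
    path = [start]
    visited = {start}
    state = start
    while state != goal:
        action = policy.get(state)
        if action is None:
            return None  # no action recorded for this state
        state = (state[0] + action[0], state[1] + action[1])
        if state in visited:
            return None  # the policy loops, so no plan can be read off
        visited.add(state)
        path.append(state)
    return path

toy_policy = {(0, 0): (1, 0), (1, 0): (0, 1), (0, 1): (1, 0)}
print(read_plan(toy_policy, (0, 0), (1, 1)))
```

The loop check matters because a policy that has not converged can contain two cells that point at each other.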

We have the same costs as before. Note that we've negated them this time because RL requires negative costs and positive rewards:

In [1]:
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
costs

{'.': -1, '*': -3, '^': -5, '~': -7}

and a list of offsets for `cardinal_moves`. You'll need to work this into your **actions**, A, parameter.

In [2]:
cardinal_moves = [(0,-1), (1,0), (0,1), (-1,0)]

In [3]:
from copy import deepcopy

For Value Iteration, we require knowledge of the *transition* function, as a probability distribution.

The transition function, T, for this problem is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if the agent selects "north" then 70% of the time, it will go "north" but 10% of the time it will go "east", 10% of the time it will go "west", and 10% of the time it will go "south". If the agent is at the edge of the map and a move would take it off the map, it simply bounces back to the current state.
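
The distribution described above can be sketched as a function that returns the probability of each successor state. This is only an illustration: the name `transition` is hypothetical, states are assumed to be `(x, y)` with `world[y][x]`, and treating a move into a mountain as a bounce is an assumption that the text above does not state.

```python
def transition(world, state, action, actions, p_intended=0.70):
    """Return {next_state: probability} for taking `action` in `state`."""
    x, y = state
    # The remaining probability is split evenly over the other actions.
    p_other = (1.0 - p_intended) / (len(actions) - 1)
    dist = {}
    for dx, dy in actions:
        p = p_intended if (dx, dy) == action else p_other
        nx, ny = x + dx, y + dy
        on_map = 0 <= ny < len(world) and 0 <= nx < len(world[ny])
        # Edge of the map: bounce back. Mountains: bounce back (assumption).
        if not on_map or world[ny][nx] == 'x':
            nx, ny = x, y
        dist[(nx, ny)] = dist.get((nx, ny), 0.0) + p
    return dist

moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
flat = [['.', '.'],
        ['.', '.']]
print(transition(flat, (0, 0), (1, 0), moves))
```

In the corner `(0, 0)`, two of the unintended moves leave the map, so their probability mass accumulates on the current state.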

You need to implement `value_iteration()` with the following parameters:

+ world: a `List` of `List`s of terrain (this is S from S, A, T, gamma, R)
+ costs: a `Dict` of costs by terrain (this is part of R)
+ goal: A `Tuple` of (x, y) stating the goal state.
+ reward: The reward for achieving the goal state.
+ actions: a `List` of possible actions, A, as offsets.
+ gamma: the discount rate

you will return a policy: 

`{(x1, y1): action1, (x2, y2): action2, ...}`

Remember...a policy is what to do in any state for all the states. Notice how this is different than A\* search which only returns actions to take from the start to the goal. This also explains why reinforcement learning doesn't take a `start` state.
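
The implementation later in this notebook keeps its policy as a list of lists indexed `[row][col]`, with each action stored as a (row offset, column offset) pair. A minimal sketch of converting that layout into the `Dict` form requested above might look like the following; the name `policy_grid_to_dict` is hypothetical, and swapping the offsets into `(dx, dy)` order is an assumption about how the `Dict` should be read.

```python
def policy_grid_to_dict(policy_grid):
    """Convert policy_grid[row][col] = (d_row, d_col) into {(x, y): (dx, dy)}."""
    result = {}
    for y, row in enumerate(policy_grid):
        for x, action in enumerate(row):
            if action is None:
                continue  # impassable cells have no action
            d_row, d_col = action
            result[(x, y)] = (d_col, d_row)
    return result

grid = [[(0, 1), (1, 0)],
        [None,   (0, 1)]]
print(policy_grid_to_dict(grid))
```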

You should also define a function `pretty_print_policy( cols, rows, policy)` that takes a policy and prints it out as a grid using "^" for up, "<" for left, "v" for down and ">" for right. Use "x" for any mountain or other impassable square. Note that it doesn't need the `world` because the policy has a move for every state. However, you do need to know how big the grid is so you can pull the values out of the `Dict` that is returned.

```
vvvvvvv
vvvvvvv
vvvvvvv
>>>>>>v
^^^>>>v
^^^>>>v
^^^>>>G
```

(Note that that policy is completely made up and only illustrative of the desired output). Please print it out exactly as requested: **NO EXTRA SPACES OR LINES**.

* If everything is otherwise the same, do you think that the path from (0,0) to the goal would be the same for both A\* Search and Q-Learning?
* What do you think if you have a map that looks like:

```
><>>^
>>>>v
>>>>v
>>>>v
>>>>G
```

has this converged? Is this a "correct" policy? What are the problems with this policy as it is?


In [4]:
def read_world(filename):
    result = []
    with open(filename) as f:
        for line in f.readlines():
            if len(line) > 0:
                result.append(list(line.strip()))
    return result

---

In [5]:
read_world("small.txt")

[['.', '.', '.', '.', '.', '.'],
 ['.', '*', '*', '*', '*', '.'],
 ['.', '*', '*', '*', '*', '.'],
 ['.', '*', '*', 'x', '*', '.'],
 ['.', '*', '*', '*', '*', '.'],
 ['.', '.', '.', '.', '.', '.'],
 ['.', '.', '.', '.', '.', '.']]

### <a id="initialize_V"></a> initialize_V

Formal Parameters:

**world** The gridworld map -- a list of lists of characters

**costs** The costs dictionary

**goal** The end goal

**rewards** The reward at the goal

**return** The tuple `(V, policy)`: `V` holds the initial values and `policy` is the initial policy (a list of lists of `None`)

Initializes the starting values for [value_iteration](#value_iteration): 0 for every passable square, and the goal reward plus the goal square's terrain cost at the goal. If the world has an `'x'`, then the corresponding value will be `None`, since it doesn't make sense to ask what an agent should do if it starts on impassable terrain.

In [6]:
def initialize_V(world,costs,goal,rewards):
    V = []
    policy = []
    for i in world:
        V.append([0]*len(i))
        policy.append([None]*len(i))
        
    for i in range(len(world)):
        for j in range(len(world[i])):
            if world[i][j] == 'x':
                V[i][j] = None
    V[goal[0]][goal[1]] = rewards+costs[world[goal[0]][goal[1]]]
    return V,policy
    

In [7]:
initialize_V([['.','*','^','~'],['.','*','^']],costs,[1,1],5)

([[0, 0, 0, 0], [0, 2, 0]], [[None, None, None, None], [None, None, None]])

### <a id="initialize_Q"></a> initialize_Q

Formal Parameters:

**world** The gridworld map -- a list of lists of characters

**costs** The costs dictionary

**goal** The end goal

**rewards** The reward at the goal

**action** The intended action (an offset) that this Q list is computed for

**actions** The list of all possible actions

**return** `Q`, a list of lists containing the starting Q values for the particular action.


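The initial Q value for any action at any state is the probability-weighted cost of moving to the adjacent square: 0.7 for the intended move and 0.1 for each unintended move. Moves that leave the map or enter a mountain bounce back and are charged the cost of the current square.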

In [8]:
def initialize_Q(world,costs,goal,rewards,action,actions):
    Q = []
    for i in world:
        Q.append([0]*len(i))
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            if world[i][j] == 'x':
                Q[i][j] = None
                continue
            try:
                if i+action[0]< 0 or j+action[1]<0:
                    raise IndexError
                Q[i][j] += costs[world[i+action[0]][j+action[1]]]*.7
                if goal[0] == i+action[0] and goal[1] == j+action[1]:
                    Q[i][j] += rewards*.7
            except (IndexError, KeyError):
                # Off the map (IndexError) or into a mountain (KeyError): bounce back.
                Q[i][j] += costs[world[i][j]]*.7
            for a in actions:
                if a == action:
                    continue
                try:
                    if i+a[0]< 0 or j+a[1]<0:
                        raise IndexError
                    Q[i][j] += costs[world[i+a[0]][j+a[1]]]*.1
                    if goal[0] == i+a[0] and goal[1] == j+a[1]:
                        Q[i][j] += rewards*.1
                except (IndexError, KeyError):
                    # Off the map (IndexError) or into a mountain (KeyError): bounce back.
                    Q[i][j] += costs[world[i][j]]*.1
    return Q

In [9]:
initialize_Q(read_world("small.txt"),costs,[1,1],5,cardinal_moves[0],cardinal_moves)

[[-0.9999999999999999,
  -0.7,
  -1.2000000000000002,
  -1.2000000000000002,
  -1.2000000000000002,
  -0.9999999999999999],
 [-0.7000000000000001,
  -1.4000000000000001,
  0.7000000000000003,
  -2.7999999999999994,
  -2.5999999999999996,
  -2.4],
 [-1.2000000000000002,
  -1.1,
  -2.999999999999999,
  -2.999999999999999,
  -2.8,
  -2.4],
 [-1.2000000000000002, -1.6, -2.999999999999999, None, -2.8, -2.4],
 [-1.2000000000000002,
  -1.4000000000000001,
  -2.8,
  -2.8,
  -2.5999999999999996,
  -2.4],
 [-0.9999999999999999, -1.2, -1.2, -1.2, -1.2, -0.9999999999999999],
 [-0.9999999999999999,
  -0.9999999999999999,
  -0.9999999999999999,
  -0.9999999999999999,
  -0.9999999999999999,
  -0.9999999999999999]]

### <a id="max_diff"></a> max_diff

Formal Parameters:

**V** The current values at each square

**prev_V** The previous values at each square

**return** The maximum difference between V and prev_V at any particular square

This function is used to find the maximum difference of V values at each point.  It is used as a stopping condition in [value_iteration](#value_iteration) (stop when max diff < epsilon).

In [10]:
def max_diff(V,prev_V):
    diff = 0
    for i in range(len(V)):
        for j in range(len(V[i])):
            try:
                new_diff = abs(V[i][j] - prev_V[i][j])
                if new_diff>diff:
                    diff = new_diff
            except TypeError:
                # Impassable squares hold None and are skipped.
                pass
    return diff
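
For comparison, the same stopping check can be sketched without exception handling by filtering out the `None` entries explicitly. This is an alternative formulation under the same assumptions (both grids have the same shape, impassable squares hold `None`); the name `max_abs_change` is hypothetical.

```python
def max_abs_change(V, prev_V):
    """Largest absolute change between two value grids, ignoring None cells."""
    diffs = [abs(new - old)
             for row, prev_row in zip(V, prev_V)
             for new, old in zip(row, prev_row)
             if new is not None and old is not None]
    # default=0 covers grids with no passable cells.
    return max(diffs, default=0)

print(max_abs_change([[0, 2.5], [None, 1]], [[0, 2], [None, 3]]))
```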

### <a id="next_V"></a> next_V

Formal Parameters:

**Q_dict** A dictionary of Q values for each action

**V** The current V list

**policy** The current policy

**return** a tuple containing the updated V list and policy list

This function is part of the next step in [value_iteration](#value_iteration). At each point, it takes the max value for Q and the corresponding best action to build the return tuple

In [11]:
def next_V(Q_dict,V,policy):
    for i in range(len(V)):
        for j in range(len(V[i])):
            if V[i][j] is None:
                continue
            a = None
            best_val = float('-inf')
            for action in Q_dict.keys():
                try:
                    if Q_dict[action][i][j] > best_val:
                        best_val = Q_dict[action][i][j]
                        a = action
                except:
                    continue
            V[i][j] = best_val
            policy[i][j] = a
    return V,policy
                
    

### <a id="next_Q"></a> next_Q

Formal Parameters:

**V** The current V list

**action** The particular action associated with this Q list

**actions** The list of all possible actions

**gamma** The discount factor

**world** The gridworld

**return** the updated list of Q values for a particular action

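This function is part of the next step in [value_iteration](#value_iteration). At each point, it gives the discounted, probability-weighted value of the square reached by a particular action, combined with the weighted values of the squares reached by unintended actions. Terrain costs and the goal reward enter only through [initialize_V](#initialize_V) and [initialize_Q](#initialize_Q).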

In [12]:
def next_Q(V,action,actions,gamma,world):
    Q = []
    for i in world:
        Q.append([0]*len(i))
    for i in range(len(Q)):
        for j in range(len(Q[i])):
            if world[i][j] == 'x':
                Q[i][j] = None
                continue
            try:
                if i+action[0]< 0 or j+action[1]<0:
                    raise IndexError
                Q[i][j] += V[i+action[0]][j+action[1]]*.7*gamma
                
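            except (IndexError, TypeError):
                # Off the map (IndexError) or into a mountain whose V is None (TypeError): bounce back.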
                Q[i][j] += V[i][j]*.7*gamma
            for a in actions:
                if a == action:
                    continue
                try:
                    if i+a[0]< 0 or j+a[1]<0:
                        raise IndexError
                    Q[i][j] += V[i+a[0]][j+a[1]]*.1*gamma
                    
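                except (IndexError, TypeError):
                    # Off the map (IndexError) or into a mountain whose V is None (TypeError): bounce back.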
                    Q[i][j] += V[i][j]*.1*gamma
    return Q
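
For reference, the textbook Bellman backup adds the immediate reward on every sweep, Q(s, a) = sum over s' of T(s, a, s') * (R(s') + gamma * V(s')), whereas `next_Q` above applies only the discounted term. The following is a hedged sketch of that backup for a single square, not the method used in this notebook. The name `q_value` is hypothetical, `state` and `goal` are assumed to be `(row, col)` tuples, the 0.7/0.1 split is taken from the transition description, and terminal-state handling for the goal is left out.

```python
def q_value(world, costs, V, state, action, actions, goal, reward, gamma):
    """One Bellman backup for (state, action) with rewards applied every sweep."""
    row, col = state
    total = 0.0
    for a in actions:
        p = 0.7 if a == action else 0.1
        r, c = row + a[0], col + a[1]
        off_map = not (0 <= r < len(world) and 0 <= c < len(world[r]))
        # Bounce back when leaving the map or entering a mountain.
        if off_map or world[r][c] == 'x':
            r, c = row, col
        # Reward for the square actually entered, plus the goal bonus.
        step_reward = costs[world[r][c]] + (reward if (r, c) == goal else 0)
        total += p * (step_reward + gamma * V[r][c])
    return total

moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
neg_costs = {'.': -1, '*': -3, '^': -5, '~': -7}
print(q_value([['.', '.']], neg_costs, [[0, 0]], (0, 0), (0, 1), moves, (0, 1), 5, 0.9))
```

The caller is expected to skip mountain squares, since `costs` has no entry for `'x'`.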
    

### <a id="value_iteration"></a> value_iteration

Formal Parameters:


**world** The gridworld

**costs** The costs dictionary mapping terrain strings to costs

**goal** A tuple containing the location of the goal

**rewards** The reward for reaching the goal

**actions** The list of all possible actions

**gamma** The discount factor

**MAGIC CONSTANT** `EPSILON = .01`, part of the stopping condition for value iteration. When the [maximum difference between successive V values](#max_diff) is no longer greater than `EPSILON`, the value iteration loop ends. It is a magic constant because the professor didn't include any epsilon in the method signature.

**return** `policy` -- The list of lists containing the action that any agent should take at any particular point.

This is the driving stochastic value iteration algorithm. It first initializes V, policy, and Q, then iterates over best decisions using the discount factor.

In [13]:
def value_iteration(world, costs, goal, rewards, actions, gamma):
    # The caller passes goal as (x, y); the grids here are indexed [row][col], so swap.
    goal = (goal[1],goal[0])
    EPSILON = .01
    V,policy = initialize_V(world,costs,goal,rewards)
    prev_V = deepcopy(V)
    Q_dict = {action: initialize_Q(world,costs,goal,rewards,action,actions) for action in actions}
    first = True
    while first or max_diff(V,prev_V) > EPSILON:
        first = False
        prev_V = deepcopy(V)
        V,policy = next_V(Q_dict,V,policy)
        for action in Q_dict.keys():
            Q_dict[action] = next_Q(V,action,actions,gamma,world)
    return policy
    

### <a id="pretty_print_policy"></a> pretty_print_policy

Formal Parameters:


**cols** The number of columns in the gridworld (unused)

**rows** The number of rows in the gridworld (unused)

**policy** The list of lists containing the best action at any point

**goal** A tuple containing the position of the goal

**prints** Some text art representing the policy, with `'G'` at the goal and `'X'` at impassable terrain

**return** `None`

Pretty prints the policy. `cols` and `rows` are unused because I want the option to have non-rectangular gridworlds. I believe I made the rest of the code robust enough to handle this. Prints `'X'` at impassable terrain because it isn't clear whether starting on impassable terrain is a degenerate case; similarly, no policy is printed at the goal.

In [14]:
def pretty_print_policy( cols, rows, policy, goal):
    for i in range(len(policy)):
        for j in range(len(policy[i])):
            if i==goal[1] and j ==goal[0]:
                print("G",end="")
            elif policy[i][j] == (0,1):
                print(">",end="")
            elif policy[i][j] == (0,-1):
                print("<",end="")
            elif policy[i][j] == (1,0):
                print("v",end="")
            elif policy[i][j] == (-1,0):
                print("^",end="")
            else:
                print("X",end="")
        print()

## Value Iteration

### Small World

In [15]:
small_world = read_world( "small.txt")

In [16]:
goal = (len(small_world[0])-1, len(small_world)-1)
gamma = 0.9
reward = 5
small_policy = value_iteration(small_world, costs, goal, reward, cardinal_moves, gamma)

In [17]:
cols = len(small_world[0])
rows = len(small_world)
test_policy = small_policy
pretty_print_policy(cols, rows, test_policy, goal)

v>>>vv
vvv>vv
vvv>vv
vvvXvv
v>>>vv
>>>>>v
>>>>>G


### Large World

In [18]:
large_world = read_world( "large.txt")

In [19]:
goal = (len(large_world[0])-1, len(large_world)-1) # lower right corner
gamma = 0.9
reward = 100

large_policy = value_iteration(large_world, costs, goal, reward, cardinal_moves, gamma)

In [20]:
cols = len(large_world[0])
rows = len(large_world)

pretty_print_policy( cols, rows, large_policy, goal)

v>>>>>>>>>>>>>>vv>>>>>>>>vv
vv>>>>>>>>>>>>vvv<XXXXXXXvv
vvvvXX>>>>>>>>>vvXXXvvvXXvv
vvvvvXXX>>>>>>>>>>>vvv<XXvv
vvvv<XXv>>>>>>>>>>>vvvXXXvv
vvv<XXvvv>>>>>>>>v>>>>vXvvv
vvvXXvvvvv>^XXX>>v>>>>>vvvv
v>>>>vvvvvv^<<XXX>>>>>>vvvv
v>>>vvvvvvv<<<XX>>>>>>>vvvv
vv>>vvvvvv<XXXX>>>>>>>>vvvv
v>>>>vvvv<XXX>>>>>>vXXXvvvv
>>>>>vvvvXXv>>>>>>>>vXXvvvv
>>>>>vvvvXXv>>>>>>>>vX>vvvv
>>>>>vv>>>vvv>>>>>>>>>>vvvv
vv>^X>v>>vvv<>>>>>>>>^Xvvvv
vv<XXX>>>>vvXXX>>>>>^XXvvvv
vvXX>>>>>>>>>vXXX>^XXXvvvvv
vvvXX>>>>>>>>>>vXXXX>>>vvvv
vvvXXX>>>>>>>>>>>>>>>>>vvvv
vv>vXXX>>>>>>>>>>>>>>>vvvvv
v>>>>vXX>>>>>^X>>>>>>>>vvvv
v>>>>>vXXX>^XX>>>>>>>>vvvvv
>>>>>>>>vXXXX>>>>>>v>v>vvvv
>>>>>>>>>vv>>>>>^XX>>v>vvvv
vX>>>>>>>vvXXX>^XXvXX>>vvvv
vXXX>>>>>>>vXXXX>>>vXXXvvvv
>>>>>>>>>>>>>>>>>>>>>>>>>>G


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

My only concerns are whether we should consider an agent that starts its journey on impassable terrain, and why there is no epsilon parameter in the professor's given method signature for value iteration.