# Module 11 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure to use your JHED ID (it's made out of your name) and not your student id, which is just letters and numbers.
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Reinforcement Learning with Value Iteration

These are the same maps from Module 1 but the "physics" of the world have changed. In Module 1, the world was deterministic. When the agent moved "south", it went "south". When it moved "east", it went "east". Now, the agent only succeeds in going where it wants to go *sometimes*. There is a probability distribution over the possible states so that when the agent moves "south", there is a small probability that it will go "east", "north", or "west" instead and have to move from there.

There are a variety of ways to handle this problem. For example, with A\* search, if the agent finds itself off the solution path, you can simply calculate a new solution from where the agent ended up. Although this sounds like a really bad idea, it has actually been shown to work really well in video games that use formal planning algorithms (which we will cover later). When these algorithms were first designed, this was unthinkable. Thank you, Moore's Law!

Another approach is to use Reinforcement Learning which covers problems where there is some kind of general uncertainty in the actions. We're going to model that uncertainty a bit unrealistically here but it'll show you how the algorithm works.

As far as RL is concerned, there are a variety of options there: model-based and model-free, Value Iteration, Q-Learning and SARSA. You are going to use Value Iteration.

## The World Representation

As before, we're going to simplify the problem by working in a grid world. The symbols that form the grid have a special meaning as they specify the type of the terrain and the cost to enter a grid cell with that type of terrain:

```
token   terrain    cost 
.       plains     1
*       forest     3
^       hills      5
~       swamp      7
x       mountains  impassable
```

When you go from a plains node to a forest node it costs 3. When you go from a forest node to a plains node, it costs 1. You can think of the grid as a big graph. Each grid cell (terrain symbol) is a node and there are edges to the north, south, east and west (except at the edges).
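As a small worked example of the "cost to enter" rule, the sketch below sums the cost of a path on a one-row world. The names `TERRAIN_COST`, `path_cost`, and `tiny_world` are made up for illustration, and the sketch assumes a path is a list of `(x, y)` pairs.

```python
from typing import List, Tuple

# Positive entry costs copied from the terrain table (mountains are left out
# because they cannot be entered).
TERRAIN_COST = {'.': 1, '*': 3, '^': 5, '~': 7}

def path_cost(world: List[List[str]], path: List[Tuple[int, int]]) -> int:
    """Sum the cost of entering every cell after the first one on the path."""
    total = 0
    for x, y in path[1:]:
        total += TERRAIN_COST[world[y][x]]
    return total

tiny_world = [['.', '*', '.']]
# plains -> forest costs 3, forest -> plains costs 1
print(path_cost(tiny_world, [(0, 0), (1, 0), (2, 0)]))  # 4
```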

There are quite a few differences between A\* Search and Reinforcement Learning but one of the most salient is that A\* Search returns a plan of N steps that gets us from A to Z, for example, A->C->E->G.... Reinforcement Learning, on the other hand, returns a *policy* that tells us the best thing to do in **every state.**

For example, the policy might say that the best thing to do in A is go to C. However, we might find ourselves in D instead. But the policy covers this possibility: it might say D->E. Trying this action might land us in C and the policy will say C->E, etc. At least with offline learning, everything will be learned in advance (in online learning, you can only learn by doing and so you may act according to a known but suboptimal policy).

Nevertheless, if you were asked for a "best case" plan from (0, 0) to (n-1, n-1), you could (and will) be able to read it off the policy because there is a best action for every state. You will be asked to provide this in your assignment.
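One way to read a "best case" plan off a policy is to follow the preferred action from the start and pretend every move succeeds. The sketch below is only illustrative: `read_plan` is a hypothetical helper, and it assumes both states and actions are `(x, y)` pairs that can be added component-wise, which may not match the convention your own policy uses.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[int, int]

def read_plan(policy: Dict[State, Optional[State]], start: State, goal: State) -> List[State]:
    """Follow the policy's preferred action from start, assuming no slips."""
    path = [start]
    seen = {start}
    current = start
    while current != goal:
        action = policy.get(current)
        if action is None:
            raise ValueError(f"no action for state {current}")
        current = (current[0] + action[0], current[1] + action[1])
        if current in seen:
            # A repeated state means the policy cycles and never reaches the goal.
            raise ValueError(f"policy loops back to {current}")
        seen.add(current)
        path.append(current)
    return path

# Toy 2x2 policy (assumed format): go right from (0, 0), then down to the goal.
toy_policy = {(0, 0): (1, 0), (1, 0): (0, 1), (0, 1): (1, 0)}
print(read_plan(toy_policy, (0, 0), (1, 1)))  # [(0, 0), (1, 0), (1, 1)]
```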

We have the same costs as before. Note that we've negated them this time because RL requires negative costs and positive rewards:

In [1]:
costs = { '.': -1, '*': -3, '^': -5, '~': -7}
costs

{'.': -1, '*': -3, '^': -5, '~': -7}

and a list of offsets for `cardinal_moves`. You'll need to work this into your **actions**, A, parameter.

In [2]:
cardinal_moves = [(0,-1), (1,0), (0,1), (-1,0)]

For Value Iteration, we require knowledge of the *transition* function, as a probability distribution.

The transition function, T, for this problem is 0.70 for the desired direction, and 0.10 each for the other possible directions. That is, if the agent selects "north" then 70% of the time, it will go "north" but 10% of the time it will go "east", 10% of the time it will go "west", and 10% of the time it will go "south". If the agent is at the edge of the map and a move would take it off the map, it simply bounces back to the current state.
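The transition function can be sketched as a small lookup from the chosen action to a probability for each possible move, plus a bounce-back rule at the map edge. The helper names `transition_distribution` and `resulting_state` are assumptions for illustration, as is treating states and moves as `(x, y)` pairs.

```python
from typing import Dict, List, Tuple

Action = Tuple[int, int]
State = Tuple[int, int]

def transition_distribution(chosen: Action, actions: List[Action],
                            p_intended: float = 0.70) -> Dict[Action, float]:
    """Map each possible move to its probability, given the chosen action."""
    # The remaining probability is split evenly over the other moves (0.10 each
    # when there are four moves and p_intended is 0.70).
    p_other = (1.0 - p_intended) / (len(actions) - 1)
    return {a: (p_intended if a == chosen else p_other) for a in actions}

def resulting_state(state: State, move: Action, cols: int, rows: int) -> State:
    """Apply a move; if it would leave the map, bounce back to the same state."""
    x, y = state[0] + move[0], state[1] + move[1]
    return (x, y) if 0 <= x < cols and 0 <= y < rows else state

moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
dist = transition_distribution((0, -1), moves)
for move, probability in dist.items():
    print(move, round(probability, 2))
```

Whatever representation you choose, the probabilities for one chosen action should sum to 1.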

You need to implement `value_iteration()` with the following parameters:

+ world: a `List` of `List`s of terrain (this is S from S, A, T, gamma, R)
+ costs: a `Dict` of costs by terrain (this is part of R)
+ goal: A `Tuple` of (x, y) stating the goal state.
+ reward: The reward for achieving the goal state.
+ actions: a `List` of possible actions, A, as offsets.
+ gamma: the discount rate

you will return a policy: 

`{(x1, y1): action1, (x2, y2): action2, ...}`

Remember...a policy is what to do in any state for all the states. Notice how this is different than A\* search which only returns actions to take from the start to the goal. This also explains why reinforcement learning doesn't take a `start` state.

You should also define a function `pretty_print_policy( cols, rows, policy)` that takes a policy and prints it out as a grid using "^" for up, "<" for left, "v" for down and ">" for right. Use "x" for any mountain or other impassable square. Note that it doesn't need the `world` because the policy has a move for every state. However, you do need to know how big the grid is so you can pull the values out of the `Dict` that is returned.

```
vvvvvvv
vvvvvvv
vvvvvvv
>>>>>>v
^^^>>>v
^^^>>>v
^^^>>>G
```

(Note that that policy is completely made up and only illustrative of the desired output). Please print it out exactly as requested: **NO EXTRA SPACES OR LINES**.

* If everything is otherwise the same, do you think that the path from (0,0) to the goal would be the same for both A\* Search and Q-Learning?
* What do you think if you have a map that looks like:

```
><>>^
>>>>v
>>>>v
>>>>v
>>>>G
```

has this converged? Is this a "correct" policy? What are the problems with this policy as it is?
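One way to probe a printed policy like this is to follow the arrows from every cell and see which cells never arrive at `G`. The sketch below is a rough check under stated assumptions: `states_that_never_reach_goal` is a hypothetical helper, the grid is given as a list of strings, every move is assumed to succeed, and an arrow pointing off the map is treated as staying in place.

```python
from typing import List, Tuple

# Arrow symbol -> (dx, dy) offset, with y growing downward.
ARROWS = {'^': (0, -1), 'v': (0, 1), '<': (-1, 0), '>': (1, 0)}

def states_that_never_reach_goal(grid: List[str]) -> List[Tuple[int, int]]:
    """Return (x, y) cells whose arrows, followed without slips, never reach 'G'."""
    rows, cols = len(grid), len(grid[0])
    stuck = []
    for start_y in range(rows):
        for start_x in range(cols):
            if grid[start_y][start_x] not in ARROWS:
                continue  # skip the goal and impassable cells
            x, y = start_x, start_y
            seen = set()
            while grid[y][x] in ARROWS and (x, y) not in seen:
                seen.add((x, y))
                dx, dy = ARROWS[grid[y][x]]
                nx, ny = x + dx, y + dy
                if not (0 <= nx < cols and 0 <= ny < rows):
                    break  # arrow points off the map: the agent stays put
                x, y = nx, ny
            if grid[y][x] != 'G':
                stuck.append((start_x, start_y))
    return stuck

example = ["><>>^",
           ">>>>v",
           ">>>>v",
           ">>>>v",
           ">>>>G"]
print(states_that_never_reach_goal(example))
# [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
```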


In [3]:
def read_world(filename):
    result = []
    with open(filename) as f:
        for line in f:
            stripped = line.strip()
            if stripped:  # skip blank lines, which would otherwise become empty rows
                result.append(list(stripped))
    return result

---

In [4]:
from typing import List, Dict, Tuple

## `initialize_reward_gradient` <a id="initialize_reward_gradient"></a>

**Description:**  
This function initializes the reward gradient. The gradient seeds each state's starting value so that it shrinks with Manhattan distance from the goal, which gives value iteration an initial hint about where to go. It is not required for value iteration, but it is intended to help the values converge more easily.

**Parameters:**  
- `values` (`List[List[float]]`): A pre-sized grid (list of lists) of values for the entire world, overwritten in place
- `goal` (`Tuple[int, int]`): Where the goal is in the world
- `max_reward` (`int`): Reward for reaching the goal
- `decay` (`float`): How much the reward decays as we move further from the goal

**Returns:**  
- No explicit return; `values` is filled in place with the gradient

In [5]:
def initialize_reward_gradient(values: List[List[float]], goal: Tuple[int, int], max_reward: int, decay: float):
    gx, gy = goal
    for x in range(len(values)):
        for y in range(len(values[0])):
            distance = abs(gx - x) + abs(gy - y)
            values[x][y] = max_reward / (1 + decay * distance)

In [6]:
world = [['.','.'],['.','.']]
rows, cols = len(world), len(world[0])
values = [[0 for _ in range(cols)] for _ in range(rows)]

initialize_reward_gradient(values, (1,1), 0, 0) 
assert values == [[0.0, 0.0], [0.0, 0.0]] #As there is no reward or decay should be 0's

initialize_reward_gradient(values, (1,1), 40, 0)  
assert values == [[40.0, 40.0], [40.0, 40.0]] # No decay means all should be 40

initialize_reward_gradient(values, (1,1), 40, 10)  
assert values[0][0] < 40 and values[0][1] < 40 and values[1][0] < 40 and values[1][1] == 40 # Decay means everything but the goal should be less than 40
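To make the decay concrete, the formula `max_reward / (1 + decay * distance)` can be evaluated by hand for a few distances. The helper `gradient_value` below is a hypothetical stand-alone version of that one line, using the same `max_reward=40` and `decay=10` as the last assertion.

```python
def gradient_value(max_reward: float, decay: float, distance: int) -> float:
    """Initial value for a cell `distance` Manhattan steps from the goal."""
    return max_reward / (1 + decay * distance)

for distance in range(3):
    print(distance, round(gradient_value(40, 10, distance), 3))
# 0 40.0
# 1 3.636
# 2 1.905
```

With a decay of 10 the value drops by more than a factor of ten after a single step, so only cells right next to the goal start with a noticeable value.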

## `calculate_expected_value` <a id="calculate_expected_value"></a>

**Description:**  
This function calculates the expected reward for taking a specific action at a given state. It weights the intended direction by the planned probability (0.7) and each unintended direction by the surprise probability (0.1), and it uses the current state's value when a move would leave the map. Value iteration relies on this to evaluate the possible outcomes of each action in each state.

**Parameters:**  
- `x` (`int`): X of the location where we're attempting an action
- `y` (`int`): Y of the location where we're attempting an action
- `action` (`Tuple[int, int]`): The action being attempted
- `values` (`List[List[float]]`): The value of every spot in the world
- `costs` (`Dict[str, int]`): Terrain costs
- `world` (`List[List[str]]`): The representation of the world

**Returns:**  
- `value_sum` (`float`): The expected value of taking the specific action

In [7]:
def calculate_expected_value(x: int, y: int, action: Tuple[int, int], values: List[List[float]],
                             costs: Dict[str, int], world: List[List[str]]) -> float:
    rows, cols = len(world), len(world[0])
    actions = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # All four moves, fixed here rather than passed in
    value_sum = 0
    surprise = 0.1  # Probability of each unintended direction
    planned = 0.7   # Probability of the intended direction

    for alt_dx, alt_dy in actions:
        prob = planned if (action == (alt_dx, alt_dy)) else surprise
        nx, ny = x + alt_dx, y + alt_dy

        if 0 <= nx < rows and 0 <= ny < cols:
            value_sum += prob * (values[nx][ny] - costs.get(world[x][y], 0))
        else:
            # Off the map: the agent bounces back, so use the current state's value
            value_sum += prob * (values[x][y] - costs.get(world[x][y], 0))
    return value_sum

In [8]:
world = [['.','.'],['.','.']]
rows, cols = len(world), len(world[0])
values = [[0 for _ in range(cols)] for _ in range(rows)]

assert type(calculate_expected_value(0, 0, (0,0), values, costs, world)) == float #Make sure its a float

values1 = [[0 for _ in range(cols)] for _ in range(rows)]
initialize_reward_gradient(values, (1,1), 40, 10) 

assert calculate_expected_value(0, 0, (0,0), values, costs, world) != calculate_expected_value(0, 0, (0,0), values1, costs, world) # Different values should make different outcomes

costs1 = {'.': -10}
assert calculate_expected_value(0, 0, (0,0), values, costs, world) != calculate_expected_value(0, 0, (0,0), values, costs1, world) # Different costs should make different outcomes

## `calculate_policy` <a id="calculate_policy"></a>

**Description:**  
This function performs one sweep of value iteration: for each state it evaluates every action, records the best one in `policy`, and computes the state's updated value. The policy tells the agent which action to take in each state in order to maximize long-term reward. Note that terrain costs are read from the notebook-level `costs` dict rather than passed in as a parameter.

**Parameters:**  
- `rows` (`int`): Number of rows in the world
- `cols` (`int`): Number of columns in the world
- `goal` (`Tuple[int, int]`): Goal location in the world, as (x, y)
- `rewards` (`int`): Reward for hitting the end goal
- `world` (`List[List[str]]`): The representation of the world
- `policy` (`dict`): Dictionary filled in place with the best action for each state, keyed by (column, row); impassable cells map to `None`
- `actions` (`List[Tuple[int, int]]`): The list of possible actions
- `values` (`List[List[float]]`): The value of every spot in the world
- `gamma` (`float`): The discount rate

**Returns:**  
- `new_values` (`List[List[float]]`): New values to use as the basis for the next iteration

In [9]:
def calculate_policy(rows: int, cols: int, goal: Tuple[int, int], rewards: int, world: List[List[str]], policy: dict, actions: List[Tuple[int, int]], values: List[List[float]], gamma: float) -> List[List[float]]:
    new_values = [[0 for _ in range(cols)] for _ in range(rows)]

    for x in range(rows):
        for y in range(cols):
            if (x, y) == (goal[1], goal[0]):
                new_values[x][y] = rewards
            elif world[x][y] == 'x': 
                policy[(y, x)] = None
            else:
                R = costs.get(world[x][y], 0) # Initial Terrain Cost
                q_values = {}
                
                for action in actions:
                    q_values[action] = calculate_expected_value(x, y, action, values, costs, world)
                
                best_action = max(q_values, key=q_values.get)
                best_value = q_values[best_action]
                new_values[x][y] = R + gamma * best_value
                policy[(y, x)] = best_action
    return new_values

In [10]:
goal = (1, 1)
rewards = 100 
actions = [(0, -1), (1, 0), (0, 1), (-1, 0)]  
world = [['.','.'],['x','.']]
rows, cols = len(world), len(world[0])
values = [[0 for _ in range(cols)] for _ in range(rows)]
policy = {}
gamma = 0.8

new_values = calculate_policy(rows, cols, goal, rewards, world, policy, actions, values, gamma)
assert (new_values[goal[1]][goal[0]] == rewards) # Goal location should hold exactly the reward
assert (new_values[1][0] == 0) # 'x' should have a value of 0 because it is impassable
assert (policy != {}) # Creates a policy

## `value_iteration` <a id="value_iteration"></a>

**Description:**  
This is the function that actually performs value iteration. It starts by initializing the reward gradient, then runs 1000 sweeps of `calculate_policy`, and finally returns the policy.

**Parameters:**  
- `world` (`List[List[str]]`): The representation of the world
- `costs` (`Dict[str, int]`): Terrain costs
- `goal` (`Tuple[int, int]`): Goal location in world
- `rewards` (`int`): Reward for hitting the end goal 
- `actions` (`List[Tuple[int, int]]`): The list of possible actions
- `gamma` (`float`): The discount rate

**Returns:**  
- `policy` (`dict`): Policy for an agent to follow

In [11]:
def value_iteration(world: List[List[str]], costs: Dict[str, int], goal: Tuple[int, int], rewards: int, actions: List[Tuple[int, int]], gamma: float) -> dict:
    rows, cols = len(world), len(world[0])
    values = [[0 for _ in range(cols)] for _ in range(rows)]
    policy = {}

    initialize_reward_gradient(values, goal, max_reward=rewards, decay=10) 

    for iteration in range(1000):
        new_values = calculate_policy(rows, cols, goal, rewards, world, policy, actions, values, gamma)
        values = new_values
        
    return policy
    

In [12]:
world = [['.','.'],['x','.']]
goal = (1, 1)
gamma = 0.8
reward = 100

policy = value_iteration(world, costs, goal, reward, cardinal_moves, gamma)
assert (policy == {(0, 0): (0, 1), (1, 0): (1, 0), (0, 1): None}) # Should converge to goal and avoid the x

world1 = [['.','x'],['x','.']]
policy1 = value_iteration(world1, costs, goal, reward, cardinal_moves, gamma)
assert (policy1 == {(0, 0): (1, 0), (1, 0): None, (0, 1): None}) # (1, 0) and (0, 1) are impassable, so the goal cannot be reached from (0, 0)

world2 = [['.','.'],['~','.']]
policy2 = value_iteration(world2, costs, goal, reward, cardinal_moves, gamma)
assert (policy2 == {(0, 0): (0, 1), (1, 0): (1, 0), (0, 1): (0, 1)}) # All paths lead to the goal; the swamp even has a way out in case the agent lands on it
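A fixed count of 1000 sweeps works, but a common alternative is to stop once the values change by less than a small threshold between sweeps. The sketch below shows only that stopping rule, not this notebook's implementation: `iterate_until_stable`, `epsilon`, and the toy `halve_toward_ten` update are all assumptions for illustration.

```python
from typing import Callable, List

Grid = List[List[float]]

def iterate_until_stable(values: Grid, update: Callable[[Grid], Grid],
                         epsilon: float = 1e-6, max_iterations: int = 1000) -> Grid:
    """Apply `update` until the largest per-cell change drops below epsilon."""
    for _ in range(max_iterations):
        new_values = update(values)
        # Largest absolute change of any cell in this sweep.
        delta = max(abs(new - old)
                    for new_row, old_row in zip(new_values, values)
                    for new, old in zip(new_row, old_row))
        values = new_values
        if delta < epsilon:
            break
    return values

def halve_toward_ten(grid: Grid) -> Grid:
    """Toy update whose fixed point is 10.0 in every cell."""
    return [[0.5 * v + 5 for v in row] for row in grid]

stable = iterate_until_stable([[0.0, 4.0]], halve_toward_ten)
print([round(v, 3) for v in stable[0]])  # [10.0, 10.0]
```

In a real value iteration loop, `update` would be one full sweep over the states, and the policy would be read off after the loop ends.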

## `pretty_print_policy` <a id="pretty_print_policy"></a>

**Description:**  
This function prints out the policy that an agent would follow for the world specified. This is not necessary for anything except visualization of the policy.

**Parameters:**  
- `cols` (`int`): Number of columns
- `rows` (`int`): Number of rows
- `policy` (`dict`): Policy to print
- `goal` (`Tuple[int, int]`): Goal location in world

In [13]:
def pretty_print_policy(cols: int, rows: int, policy: dict, goal: Tuple[int, int]):
    action_symbols = {
        (0, -1): '<',   
        (1, 0): 'v',    
        (0, 1): '>',    
        (-1, 0): '^',
    }
    
    goal_x, goal_y = (goal[1], goal[0]) if goal else (None, None) # Goal is (x, y); swap it to match the (row, column) loop order
    
    for x in range(rows):
        row_symbols = []
        for y in range(cols):
            if goal and (x, y) == (goal_x, goal_y):
                row_symbols.append("G")  # Goal state
            elif policy.get((y, x)) is None:
                row_symbols.append("x")  # Impassable terrain
            else:
                action = policy.get((y, x), (0, 0))
                row_symbols.append(action_symbols.get(action, '?')) # Set action or ?, ? should only happen if things went wrong
        print("".join(row_symbols))

In [14]:
# I don't know how to assert on printed output, so I am checking these visually
world = [['.','.'],['x','.']]
goal = (1, 1)
gamma = 0.8
reward = 100
rows, cols = len(world), len(world[0])

policy = value_iteration(world, costs, goal, reward, cardinal_moves, gamma)
pretty_print_policy(cols, rows, policy, goal) # Should have way to Goal, goal, and an x
print('\n')

world1 = [['.','x'],['x','.']]
policy1 = value_iteration(world1, costs, goal, reward, cardinal_moves, gamma)
pretty_print_policy(cols, rows, policy1, goal) # Should have No path to goal and 2 x's
print('\n')

world2 = [['.','.'],['~','.']]
policy2 = value_iteration(world2, costs, goal, reward, cardinal_moves, gamma)
pretty_print_policy(cols, rows, policy2, goal) # All paths lead to goal despite swamp
print('\n')

>v
xG


vx
xG


>v
>G
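Printed output like this can also be checked with an assertion by capturing standard output. The sketch below uses `contextlib.redirect_stdout` from the standard library; `print_rows` is a hypothetical stand-in for any function that prints a grid, used here so the example runs on its own.

```python
import io
from contextlib import redirect_stdout

def print_rows(rows):
    """Hypothetical stand-in for a function that prints a grid line by line."""
    for row in rows:
        print("".join(row))

buffer = io.StringIO()
with redirect_stdout(buffer):
    # Everything printed inside this block goes into `buffer` instead of the screen.
    print_rows([['>', 'v'], ['x', 'G']])

captured = buffer.getvalue()
assert captured == ">v\nxG\n"
```

The same pattern should work around a call to a real printing function, comparing `captured` against the expected grid with one `\n` per row.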




## Value Iteration

### Small World

In [15]:
small_world = read_world( "small.txt")

In [16]:
goal = (len(small_world[0])-1, len(small_world)-1)
gamma = 0.8
reward = 100

small_policy = value_iteration(small_world, costs, goal, reward, cardinal_moves, gamma)

In [17]:
cols = len(small_world[0])
rows = len(small_world)

pretty_print_policy(cols, rows, small_policy, goal)

v>>>>v
vv>>vv
vvv>vv
vvvxvv
vvvvvv
>>>>vv
>>>>>G


### Large World

In [18]:
large_world = read_world( "large.txt")

In [19]:
goal = (len(large_world[0])-1, len(large_world)-1) # Lower right corner
gamma = 0.99
reward = 10000

large_policy = value_iteration(large_world, costs, goal, reward, cardinal_moves, gamma)

In [20]:
cols = len(large_world[0])
rows = len(large_world)

pretty_print_policy( cols, rows, large_policy, goal)

>>>>>>>>>>>>>vvv<<<<<>>>>>v
^^^^^^>>>>>>>vvv<<xxxxxxx>v
^^^^xx^>>>>>>>vvvxxxvv<xx>v
vv^<<xxx>>>>>>>vvvvvv<<xx>v
vv<<<xx>>>>>>>>>>vvvvvxxx>v
vv<<xx>>>>^^^^>>>>>vvvvxvvv
vvvxxv>>>^^^xxx^>>>>vvvvvvv
vvvvvvvv^^^<<<xxx>>>>>>>vvv
vvvvvvvv<^^<<<xx>>>>>>>>>vv
vvvvvvvv<<^xxxx>>>^^^^>>>vv
>>>>>vvv<<xxxv>>>^^^xxx>>vv
>>>>>>vv<xx>>>>>>^^<<xx>>vv
>>>>>>vvvxx>>>>>^^^<<x>>>vv
^^^^>>>vvvvv>^>^^^^<>>>>>vv
^^^^x>>>vvv<<^^^^^^^^^x>>vv
^^^xxx>>vvvvxxx^^^^^^xx>>vv
^^xx>>>>>vvvvvxxx^^xxxv>vvv
^<<xx>>>>>>vvvvvxxxxvvvvvvv
^<<xxx>>>>>>>>>vvvvvvvvvvvv
^<<<xxx>^>^^^^>>>vvvvvvvvvv
^<<<<vxx^^^^^^x>>>>>>>>>vvv
^^<<<<<xxx^^xx>>>>>>>>>>vvv
^^^<<<<<vxxxx>>>^^^>>>>>>vv
^^^^^<<<<<>>>>^^^xx^>>>>>vv
^x^^^^<<<<<xxx^^xxvxx^>>>vv
^xxx^^^<<<<<xxxxvvvvxxx>>>v
^<>>>^^<<<<>>>>>>>>>>>>>>>G


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.