# Miniproject 2

## Problems

### Warmup 1

We have provided a new RescueMDP class for you that models the robot's motion
as noisy. This class is in many ways similar to the ChaseMDP class in homework
7, in that it is a maze MDP with obstacles. However, there is only one
agent (the robot, no bunny).

Recall that an MDP is defined by the following tuple:
* States: The state space is a set of coordinates. The set is defined over a
grid, but grid cells that are obstacles aren't in the state space.
* Actions: The robot can take four actions, up, down, left, right.
* A transition function: The probability the action succeeds at moving the
robot in the given direction is given by the class field
`correct_transition_probability`. If this probability is less than 1, then the
rest of the probability mass is uniformly distributed among the other three
directions. Any probability mass for a motion that would move the robot into
an obstacle or out of the map is mapped to the robot staying in the same
place. If the robot is in the goal state and the MDP has the flag
`goal_is_terminal = True`, it cannot transition out of the goal state. If the flag
`goal_is_terminal = False`, the robot is free to leave the goal state and
re-enter it.
* Reward function: The robot gets a reward of `living_reward` for each action
it takes. It gets a reward of `goal_reward` every time it enters the goal
state.
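
To make the transition function concrete, here is a minimal sketch of the noisy-motion rule described above. It is not the RescueMDP implementation: the function name `transition_distribution`, the `ACTIONS` offsets, and the argument names are our own assumptions for illustration, and the terminal-goal case is left out for brevity.

```python
from collections import defaultdict

# Hypothetical action encoding as (row, col) offsets; the real class may differ.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def transition_distribution(state, action, shape, obstacles, p_correct):
    """Return {next_state: probability} for one noisy grid move."""
    rows, cols = shape
    dist = defaultdict(float)
    for name, (dr, dc) in ACTIONS.items():
        # The intended direction gets p_correct; the remaining mass is
        # split uniformly over the other three directions.
        prob = p_correct if name == action else (1.0 - p_correct) / 3.0
        nxt = (state[0] + dr, state[1] + dc)
        off_map = not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols)
        # A move into an obstacle or off the map leaves the robot in place.
        dist[state if (off_map or nxt in obstacles) else nxt] += prob
    return dict(dist)


# From the top-left corner of an obstacle-free 4x5 grid, trying to move right.
dist = transition_distribution(
    (0, 0), "right", shape=(4, 5), obstacles=set(), p_correct=0.97
)
print(dist)
```

In this corner case, the "up" and "left" slips are both blocked by the edge of the map, so their probability mass is added to staying at (0, 0).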

Let's begin by solving a small version of the RescueMDP. The maze is 4x5 (remember
that arrays are in row-major order). The person to be rescued (goal_location) is
at (0, 0).

In the first warmup problem, please solve for the optimal value function of
this MDP. We have provided a reference
implementation for value iteration. 
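
As a reminder of what value iteration does, here is a small self-contained sketch on a toy three-state corridor. It is not the provided reference implementation: the `value_iteration` signature, the `transition` and `reward` callables, and the tolerance are assumptions made for this example only.

```python
def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-6):
    """Generic value iteration; returns (values, number of iterations)."""
    values = {s: 0.0 for s in states}
    iterations = 0
    while True:
        iterations += 1
        new_values = {}
        for s in states:
            # Bellman backup: best expected reward plus discounted next value.
            new_values[s] = max(
                sum(
                    p * (reward(s, a, s2) + gamma * values[s2])
                    for s2, p in transition(s, a).items()
                )
                for a in actions
            )
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:
            return values, iterations


# Toy corridor 0 - 1 - 2, where state 2 is a terminal goal.
states = [0, 1, 2]
actions = ["left", "right"]


def transition(s, a):
    if s == 2:  # terminal state: the agent cannot leave
        return {2: 1.0}
    return {s + 1: 1.0} if a == "right" else {max(s - 1, 0): 1.0}


def reward(s, a, s2):
    # Reward of 1 for entering the goal, nothing once inside it.
    return 1.0 if (s != 2 and s2 == 2) else 0.0


values, iterations = value_iteration(states, actions, transition, reward)
print(values, iterations)
```

The state next to the goal ends up with value 1, and the state two steps away with the discounted value 0.9.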

For reference, our solution is **4** lines of code.

In [None]:
def warmup_1():
  """Creates a SmallRescueMDP() and returns the optimal value function.

  Args:
    None

  Returns:
    value function: a dict of states to values
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### Warmup 2

We also want you to be able to examine the policy that results from solving
for the optimal value function. Please use the helper function
`plot_value_function` to generate a plot of both the value function and the
corresponding policy.
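
If it helps to see how a policy is read off a value function, the sketch below picks the action with the best one-step lookahead in each state. The helper name `greedy_policy`, its arguments, and the toy corridor are our own assumptions; the provided plotting helper may compute the policy differently.

```python
def greedy_policy(values, actions, transition, reward, gamma=0.9):
    """For each state, choose the action with the best one-step lookahead."""
    policy = {}
    for s in values:
        policy[s] = max(
            actions,
            key=lambda a: sum(
                p * (reward(s, a, s2) + gamma * values[s2])
                for s2, p in transition(s, a).items()
            ),
        )
    return policy


# Toy corridor 0 - 1 - 2 with a terminal goal at state 2.
def transition(s, a):
    if s == 2:
        return {2: 1.0}
    return {s + 1: 1.0} if a == "right" else {max(s - 1, 0): 1.0}


def reward(s, a, s2):
    return 1.0 if (s != 2 and s2 == 2) else 0.0


# Optimal values for this corridor with a discount factor of 0.9.
values = {0: 0.9, 1: 1.0, 2: 0.0}
policy = greedy_policy(values, ["left", "right"], transition, reward)
print(policy)
```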

<!-- first_experiment() -->
<div class="question question-multiplechoice">
<b>Submission Material 1:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 1.
</div>


### Warmup 3
Please hand-code a list of actions that will get the robot from (3, 2) to the goal location (0, 0) under the optimal policy. This should be a Python list of actions, e.g., `[down, down, down, down]`.

For reference, our solution is **2** lines of code.

In [None]:
def warmup3():
  """Hand-code a list of actions that the robot will take under the optimal
  policy from a start state of (3, 2) to the goal state of (0, 0).

  Returns:
    actions: A list of str actions that get the robot to the goal state.
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 2

In the MDP class we have provided, the goal state is a terminal state. Once
the agent is in the terminal state, it can't leave and it doesn't receive any
more reward. Let's investigate what happens if we don't make the goal state a
terminal state.

Please create a SmallRescueMDP, set `goal_is_terminal` to `False`, and
compute the value function.

<div class="question question-multiplechoice">
<b>Submission Material 2:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 2.

In your pdf, please also answer the following questions:

* Compared to the previous question, you should have seen a change in the
number of iterations value iteration needed to converge. Why is that? What
extra work are we making value iteration do in this case?

<!-- Because the goal state is not terminal, the agent has the option to leave
the goal state and return, and accumulate the goal reward over and over
again. The future rewards are discounted, and so this accumulation of reward
will eventually converge, but the value iteration process needs to run to
convergence to add up that geometric series. -->

* At the same time, the optimal policy did not change anywhere except possibly
the goal. Why is the policy the same, even though it took more iterations to
converge?

<!-- The extra work is really adding up the extra rewards at the goal state, so
the policy shouldn't change anywhere else. -->

* Is there some property of the reward function in general for MDPs that makes
solving for the optimal policy with value iteration easier or harder?

<!-- If the distribution of states that the policy visits in the limit of time
has zero expected reward, then there's no geometric series of discounted
rewards to add up. A terminal state with zero reward makes that happen. -->

* Optional: In this and the following questions, when we ask you to compare
policies, we want you to ignore the goal state. The value at the goal
state will usually not change, but the action returned by the policy might. Why
is that?

<!-- At the goal state, all actions are equal in value, so the policy returns the first
action in the list of mdp.action_space. This is a set, and the order of items
returned by a set is not guaranteed in python, so the first item returned is
non-deterministic. -->

</div>


For reference, our solution is **5** lines of code.

In [None]:
def MDP_2():
  """Creates a SmallRescueMDP(), sets goal_is_terminal to False and returns the optimal value function.

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 3

Let's now turn our attention to a larger MDP. Please create a LargeRescueMDP
and compute the optimal policy.

<div class="question question-multiplechoice">
<b>Submission Material 3:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 3.
</div>


For reference, our solution is **3** lines of code.

In [None]:
def MDP_3():
  """Creates a LargeRescueMDP(), and returns the optimal value function and
  number of iterations required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 4

The default transition model for the LargeRescueMDP is pretty close to
deterministic. There is a .97 chance that each action succeeds in the intended
direction (unless there's an obstacle in the way or it's the edge of the map)
and only a .01 chance of ending up in each of the other 3 neighbouring grid
cells.

What happens if we make the transition model noisier? Please create a
LargeRescueMDP and set the correct_transition_probability to be .76. This is
not too noisy, but it has implications for our ability to solve for the
optimal policy.
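
For a sense of scale, the short sketch below computes how much probability each unintended direction receives for a few values of `correct_transition_probability`. The helper name `slip_probability` is hypothetical and only restates the uniform split described in the warmup.

```python
def slip_probability(correct_transition_probability):
    """Probability assigned to each of the three unintended directions."""
    return (1.0 - correct_transition_probability) / 3.0


for p in (0.97, 0.76, 0.5):
    print(f"p_correct={p}: each other direction gets {slip_probability(p):.4f}")
```

At .76 the robot slips into each of the other directions with probability .08, eight times more often than under the default.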

<div class="question question-multiplechoice">
<b>Submission Material 4:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 4.

In your pdf, please also answer the following questions:

* Compared to the previous question, you should have seen a change in the
number of iterations value iteration needed to converge. Why is that? What
extra work are we making value iteration do in this case?

<!-- Because the motion to the next state is noisy, it can take more steps to get
to the goal. Value iteration converges when the expected reward of the policy
from each state to the goal has been computed. Since it takes more steps to
get to the goal, it takes more iterations to compute the expected reward. -->

* At the same time, the optimal policy did not change. Why didn't the policy
change under the noisy dynamics?

<!-- There isn't a way for the agent to compensate for the noisy dynamics, so
there's no change to the policy that could improve the progress towards the
goal. -->

* You should have seen the value function decrease in nearly all cells. Why
did it go down? Where is the loss in value coming from?

<!-- The loss in value at each state comes from the fact that more actions are
likely to be required to get to the goal, and so the value of the goal state
is likely to be discounted more, and the discounted expected value
of a sequence of actions will be reduced. -->

* What would happen to the number of iterations, to the policy, and to the
value function if the dynamics were even noisier, say
correct_transition_probability was set to 0.5?

<!-- The number of iterations would go up, the value function would be reduced
further, but the policy would not change. -->

</div>


For reference, our solution is **4** lines of code.

In [None]:
def MDP_4():
  """Creates a LargeRescueMDP(), sets the correct_transition_probability to
  .76 and returns the optimal value function and
  number of iterations required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 5

The default temporal_discount_factor is .9. This is in general quite a low
discount factor. A reward received 20 steps in the future is weighted by only
$.9^{20} \approx .12$ in the value of the current state.
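
The arithmetic is easy to check directly. The snippet below is only an illustration (the helper `discount_weight` is our own name, not part of the provided code) comparing how strongly a reward 20 steps away is weighted under two discount factors.

```python
def discount_weight(gamma, steps):
    """Weight applied to a reward received `steps` actions in the future."""
    return gamma ** steps


for gamma in (0.9, 0.99):
    weight = discount_weight(gamma, 20)
    print(f"gamma={gamma}: weight after 20 steps = {weight:.3f}")
```

With a discount factor of .9 the weight is about .12, while with .99 it is still above .8.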

What happens if we increase the discount factor? Please create a
LargeRescueMDP, set the temporal_discount_factor to be .99, and also set
the correct_transition_probability to be .76. Then please solve for the
optimal value function.

<div class="question question-multiplechoice">
<b>Submission Material 5:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 5.

In your pdf, please also answer the following questions:

* Compared to the previous question, you should have seen value iteration
take more iterations to converge. Why is that? What
extra work are we making value iteration do in this case?

<!-- This is a subtle question. If the value of a state is the expected return
of a discounted sequence of rewards, the value function converges when we've
done enough backups that the effect of one more backup and one more discounted
reward is no longer measurable numerically. When we make the discount factor
close to one, it takes a longer sequence of backups for the discounted
reward to no longer be measurable numerically. -->

* At the same time, the optimal policy did not change. Why is the policy the
same, but it took more iterations to converge?

<!-- There's no reason to do anything different, we're just making the value
function compute the expected value of a longer sequence of possible outcomes
under the policy. -->

* You should have seen the value function increase in nearly all cells. Why
did it go up? Where did the increase in value come from?

<!-- The increase in value came from the fact that we're not down-weighting
the future as much. For the same sequence of actions, we expect to get a
higher reward because we value the future more. -->

* What would happen to the number of iterations, to the policy, and to the
value function if we left the temporal_discount_factor at .99 but made the
dynamics fairly deterministic again, say correct_transition_probability = .99?
(You should try this!)

<!-- The number of iterations will go down because the path is nearly
deterministic now --- from each state, the policy will get to the goal in
(mostly) a fixed number of steps. The fact that we can take expectations over
longer sequences of actions doesn't change the value function or make us
compute more iterations because once we get to the goal, we get no more
reward. The policy won't change, but the value function will stay high. -->

</div>


For reference, our solution is **5** lines of code.

In [None]:
def MDP_5():
  """Creates a LargeRescueMDP(), sets the correct_transition_probability to
  .76 and the temporal_discount_factor to .99 and returns the optimal value function and
  number of iterations required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 6

Let's make the problem a little harder. In addition to obstacles, our
RescueMDP class supports hazards, which are states that have a penalty for
entering.
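
One way to picture how a hazard enters the reward is the sketch below. It assumes the living reward, goal reward, and hazard cost simply add up when the robot lands in a cell; the RescueMDP class may combine them differently, and the function name `step_reward` and the numeric values are hypothetical.

```python
def step_reward(next_state, goal, hazards, living_reward, goal_reward, hazard_cost):
    """Reward for one action that lands the robot in next_state.

    Assumption: the three terms are additive; check the class for the
    actual rule.
    """
    reward = living_reward
    if next_state == goal:
        reward += goal_reward
    if next_state in hazards:
        reward += hazard_cost  # hazard_cost is a negative number
    return reward


# Hypothetical values, chosen only for illustration.
r = step_reward(
    (1, 4), goal=(0, 0), hazards={(1, 4)},
    living_reward=-1.0, goal_reward=10.0, hazard_cost=-100.0,
)
print(r)
```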

Let's start with relatively noise-free dynamics. Please create a
LargeRescueMDP and set a hazard at `{(1, 4)}`. Then please solve for the
optimal value function.

<div class="question question-multiplechoice">
<b>Submission Material 6:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 6.

In your pdf, please also answer the following questions:

* Compared to question three (LargeRescueMDP, discount factor = .9,
correct_transition_probability = .97, no hazards), the number of iterations to
solve for the optimal value function should be more or less the same, and the
policy is the same too. Why didn't the hazard change the policy or require
more iterations?

<!-- The location of the hazard wasn't on the most likely path to the goal for
most states, and with nearly-deterministic dynamics, there was rarely a need
to react to the hazard. -->

* You should have seen the value function decrease a little bit in some but not
all states. Where did the value function change and why?

<!-- The states where the expected trajectory took the agent past the hazard
saw a small reduction in value, due to the small chance that the stochastic
dynamics would take the agent into the hazard, e.g. (0, 0). The states where
the expected trajectory took the agent nowhere near the hazard saw no change
in value because for those states, the problem looks exactly the same as
before. -->

</div>


For reference, our solution is **4** lines of code.

In [None]:
def MDP_6():
  """Creates a LargeRescueMDP(), sets a hazard at `{(1, 4)}`, and returns
  the optimal value function and number of iterations required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 7

Given the hazards, let's make the dynamics noisier. Please create a
LargeRescueMDP, set a hazard at `{(1, 4)}` and set
correct_transition_probability to be .76. Then please solve for the optimal
value function.

<div class="question question-multiplechoice">
<b>Submission Material 7:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 7.

In your pdf, please also answer the following questions:

* Compared to the previous question, with these noisier dynamics, the policy
changed slightly. Please enumerate which states had a different policy
relative to the previous question. Why did the policy change?

<!-- The policy changed in states (4,1) and (3,0). For those two states, the
risk of encountering the hazard, and the associated penalty, increased enough
relative to the last problem that it made more sense to go down and around the
red obstacle, rather than up and over. Those two states were right on the
borderline of whether the higher value policy was to go up and over or down
and under. When the dynamics are deterministic,
it's cheaper to go up and over. The increased noise made it cheaper to go down
and under and avoid the hazard.

-->

</div>


For reference, our solution is **5** lines of code.

In [None]:
def MDP_7():
  """Creates a LargeRescueMDP(), sets a hazard at `{(1, 4)}`, sets the
  correct_transition_probability to .76, and returns the optimal value
  function and number of iterations required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 8

Now let's make the hazard really hazardous. Please create a LargeRescueMDP,
set a hazard at `{(1, 4)}`, set correct_transition_probability to be .76 and
set the hazard_cost to be -1000. Then please solve for the optimal value
function.

<div class="question question-multiplechoice">
<b>Submission Material 8:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 8.

In your pdf, please also answer the following questions:

* Compared to the previous question, with this much larger hazard cost, the policy
changed a lot. Where did the policy change (you don't need to enumerate
states, just give a general description) and why?

<!-- The policy changed in the top left. It's cheaper now for all the states
to the left of (0, 4) to go down and around, to avoid the large negative
hazard.
-->

* Compared to the previous question, value iteration should have taken a
little longer. What extra work are we making value iteration do in this
case?

<!-- Because the policy is taking the states from the top left of the state
space down and around the obstacles, more iterations are required to compute
the expected rewards for those states. -->

</div>


For reference, our solution is **5** lines of code.

In [None]:
def MDP_8():
  """Creates a LargeRescueMDP(), sets a hazard at `{(1, 4)}`, sets the
  correct_transition_probability to .76 and the hazard_cost to -1000, and
  returns the optimal value function and number of iterations required to
  converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')

### MDP Question 9

What if we make the hazard even more hazardous, but have nearly-deterministic dynamics?
Please create a LargeRescueMDP, set a hazard at `{(1, 4)}`, leave the
correct_transition_probability at its default value, but set the hazard_cost to be
-10000. Then please solve for the optimal value function.


<div class="question question-multiplechoice">
<b>Submission Material 9:</b> In your submitted pdf, please include a single
plot of the value function with the action under the optimal policy drawn for
each state. Name this figure as Figure 9.

In your pdf, please also answer the following questions:

* Compared to the previous question, with these cleaner dynamics, the policy
didn't change. Why is that?

<!-- Even with nearly deterministic dynamics, the cost of the hazard is so
high that the agent would still prefer to avoid the hazard at all costs.
-->

* Compared to the previous question, value iteration should have been
faster. Why is that?

<!-- Because the dynamics are nearly-deterministic, the path to the goal from
each state is going to be shorter. Value iteration needs to account for fewer
contingencies. -->


</div>


For reference, our solution is **5** lines of code.

In [None]:
def MDP_9():
  """Creates a LargeRescueMDP(), sets a hazard at `{(1, 4)}`, leaves the
  correct_transition_probability at its default value, sets the hazard_cost
  to -10000, and returns the optimal value function and number of iterations
  required to converge

  Args:
    None

  Returns:
    value function: a dict of states to values
    it: the number of iterations required to solve for the value function
  """
  raise NotImplementedError("Implement me!")

Tests

In [None]:

print('Tests passed.')